Scrape and Extract Web Data with Bright Data

Name: Scrape and Extract Web Data with Bright Data
Availability: OnlineOnly
Author: LlamaIndex

Integrate Bright Data's web scraping and data extraction tools with LlamaIndex for reliable web crawling and structured data retrieval.

Works with: brightdata, linkedin, amazon, instagram, facebook

Updated 3 months ago

Version 1.0.0

Models

gpt 4o, gpt 4, llama 3

Why it matters

Leverage Bright Data's robust web scraping capabilities to reliably extract structured data from websites, search engines, and social media platforms, bypassing CAPTCHA and bot detection.

Outcomes

What it gets done

Scrape web pages and convert content to Markdown, bypassing bot detection.

Take screenshots of web pages for visual capture.

Perform targeted web searches on Google, Bing, or Yandex with structured results.

Extract structured data from platforms like LinkedIn, Amazon, and social media.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/li-tool-tools-brightdata | bash

Capabilities

What this skill does

Scrape

Fetches and parses content from web pages.

Search the web

Searches the web and retrieves relevant sources.

Extract

Pulls structured data fields from unstructured text.

Summarize

Condenses long documents or threads into key takeaways.

Overview

LlamaIndex Tools Integration: Bright Data

What it does

LlamaIndex integration for Bright Data web scraping and data extraction

How it connects

Use it when you need to give your AI agent reliable access to web content, search results, or structured data from specific platforms like LinkedIn and Amazon.

Source README

LlamaIndex Tools Integration: Bright Data

This tool connects to Bright Data to enable your agent to crawl websites, search the web, and access structured data from platforms like LinkedIn, Amazon, and social media.

Bright Data's tools provide robust web scraping capabilities with built-in CAPTCHA solving and bot detection avoidance, allowing you to reliably extract data from the web.

Installation

pip install llama-index llama-index-core llama-index-tools-brightdata

Authentication

Sign up at Bright Data and retrieve your API key from your account settings. In the examples below, replace "your-api-key" with your actual API key.
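To keep the key out of source files, one option is to read it from an environment variable. The sketch below assumes a variable named `BRIGHTDATA_API_KEY`; both that name and the `load_api_key` helper are illustrative choices, not part of the Bright Data or LlamaIndex API.

```python
import os
from typing import Mapping, Optional


def load_api_key(
    env: Optional[Mapping[str, str]] = None,
    var_name: str = "BRIGHTDATA_API_KEY",  # assumed name, not mandated by Bright Data
) -> str:
    """Return the API key from the environment, or raise if it is missing."""
    source = os.environ if env is None else env
    key = source.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} before creating the tool spec")
    return key
```

The returned value can then be passed as `api_key=load_api_key()` when constructing `BrightDataToolSpec`.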

Usage

Here's an example of how to use the BrightDataToolSpec with LlamaIndex:

from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI
from llama_index.tools.brightdata import BrightDataToolSpec

brightdata_tool = BrightDataToolSpec(api_key="your-api-key", zone="unlocker")

tool_list = brightdata_tool.to_tool_list()

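# Keep each tool's full description aside and register a short one instead;
# the full descriptions are passed to the agent in the prompt further down.
for tool in tool_list:
    tool.original_description = tool.metadata.description
    tool.metadata.description = "Bright Data web scraping tool"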

agent = FunctionAgent(
    tools=tool_list,
    llm=OpenAI(model="gpt-4.1"),
)

query = (
    "Find and summarize the latest news about AI from major tech news sites"
)
tool_descriptions = "\n\n".join(
    [
        f"Tool Name: {tool.metadata.name}\nTool Description: {tool.original_description}"
        for tool in tool_list
    ]
)

query_with_descriptions = f"{tool_descriptions}\n\nQuery: {query}"

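import asyncio


async def main() -> None:
    # agent.run() returns an awaitable; outside a notebook it needs an event loop.
    response = await agent.run(query_with_descriptions)
    print(response)


asyncio.run(main())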

Features

The Bright Data tool provides the following capabilities:

Web Scraping

scrape_as_markdown: Scrape a webpage and convert the content to Markdown format. This tool can bypass CAPTCHA and bot detection.

result = brightdata_tool.scrape_as_markdown("https://example.com")
print(result.text)

Visual Capture

get_screenshot: Take a screenshot of a webpage and save it to a file.

screenshot_path = brightdata_tool.get_screenshot(
    "https://example.com", output_path="example_screenshot.png"
)

Search Engine Access

search_engine: Search Google, Bing, or Yandex and get structured search results as JSON or Markdown. Supports advanced parameters for more specific searches.

search_results = brightdata_tool.search_engine(
    query="climate change solutions",
    engine="google",
    language="en",
    country_code="us",
    num_results=20,
)
print(search_results.text)
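Because only Google, Bing, and Yandex are listed as engines, it can help to validate arguments before making a request. The following is a minimal sketch; `build_search_kwargs` and `SUPPORTED_ENGINES` are hypothetical helpers written for this example and are not provided by the library.

```python
from typing import Any, Dict

# Engines named in this README; anything else is rejected before a request is made.
SUPPORTED_ENGINES = ("google", "bing", "yandex")


def build_search_kwargs(query: str, engine: str = "google", **extra: Any) -> Dict[str, Any]:
    """Normalize and validate arguments intended for search_engine()."""
    normalized = engine.strip().lower()
    if normalized not in SUPPORTED_ENGINES:
        raise ValueError(
            f"Unsupported engine {engine!r}; expected one of {SUPPORTED_ENGINES}"
        )
    if not query.strip():
        raise ValueError("query must not be empty")
    # Drop options left as None so the tool's own defaults apply.
    options = {name: value for name, value in extra.items() if value is not None}
    return {"query": query.strip(), "engine": normalized, **options}


kwargs = build_search_kwargs(
    "climate change solutions", engine="Google", language="en", country_code=None
)
print(kwargs)
```

The resulting dictionary could then be unpacked into the call, as in `brightdata_tool.search_engine(**kwargs)`.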

Structured Web Data Extraction

web_data_feed: Retrieve structured data from various platforms including LinkedIn, Amazon, Instagram, Facebook, X (Twitter), Zillow, and more.

linkedin_profile = brightdata_tool.web_data_feed(
    source_type="linkedin_person_profile",
    url="https://www.linkedin.com/in/username/",
)
print(linkedin_profile)

amazon_product = brightdata_tool.web_data_feed(
    source_type="amazon_product", url="https://www.amazon.com/dp/B08N5KWB9H"
)
print(amazon_product)
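When URLs arrive from mixed sources, the `source_type` has to be chosen per URL. The sketch below maps a URL to one of the two `source_type` values shown above. The `guess_source_type` helper and its matching rules are assumptions made for illustration; other platforms would need their own identifiers, which are not listed here.

```python
from typing import Optional
from urllib.parse import urlparse

# Only the two source_type values shown in this README are mapped.
# Each rule is (domain, path marker, source_type).
_RULES = (
    ("linkedin.com", "/in/", "linkedin_person_profile"),
    ("amazon.com", "/dp/", "amazon_product"),
)


def guess_source_type(url: str) -> Optional[str]:
    """Return a source_type for the URL, or None if no rule matches."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    for domain, path_marker, source_type in _RULES:
        on_domain = host == domain or host.endswith("." + domain)
        if on_domain and path_marker in parsed.path:
            return source_type
    return None
```

A caller would skip or log URLs for which this returns `None` rather than sending them to `web_data_feed`.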

Advanced Configuration

The Bright Data tool offers various configuration options for specialized use cases:

Search Engine Parameters

The search_engine function supports advanced parameters like:

Language targeting (language parameter)
Country-specific search (country_code parameter)
Different search types (images, shopping, news, etc.)
Pagination controls
Mobile device emulation
Geolocation targeting
Hotel search parameters

results = brightdata_tool.search_engine(
    query="best hotels in paris",
    engine="google",
    language="fr",
    country_code="fr",
    search_type="shopping",
    device="mobile",
    hotel_dates="2025-06-01,2025-06-05",
    hotel_occupancy=2,
)
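The `hotel_dates` value in the example above is a pair of ISO dates joined by a comma. A small helper can build that string from `date` objects and reject reversed ranges. `format_hotel_dates` is a hypothetical name, and the format is inferred only from the example above.

```python
from datetime import date


def format_hotel_dates(check_in: date, check_out: date) -> str:
    """Build a 'YYYY-MM-DD,YYYY-MM-DD' string (format inferred from the README example)."""
    if check_out <= check_in:
        raise ValueError("check_out must be after check_in")
    return f"{check_in.isoformat()},{check_out.isoformat()}"


print(format_hotel_dates(date(2025, 6, 1), date(2025, 6, 5)))  # 2025-06-01,2025-06-05
```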

Supp
