Skill

Scrape and Extract Web Data with Bright Data

Integrate Bright Data's web scraping and data extraction tools with LlamaIndex for reliable web crawling and structured data retrieval.

Works with: Bright Data, LinkedIn, Amazon, Instagram, Facebook

Spark score: 91 out of 100
Updated 3 months ago
Version 1.0.0

Why it matters

Leverage Bright Data's robust web scraping capabilities to reliably extract structured data from websites, search engines, and social media platforms, bypassing CAPTCHA and bot detection.

Outcomes

What it gets done

01. Scrape web pages and convert content to Markdown, bypassing bot detection.
02. Take screenshots of web pages for visual capture.
03. Perform targeted web searches on Google, Bing, or Yandex with structured results.
04. Extract structured data from platforms like LinkedIn, Amazon, and social media.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/li-tool-tools-brightdata | bash

Capabilities

What this skill does

Scrape

Fetches and parses content from web pages.

Search the web

Searches the web and retrieves relevant sources.

Extract

Pulls structured data fields from unstructured text.

Summarize

Condenses long documents or threads into key takeaways.

Overview

LlamaIndex Tools Integration: Bright Data

What it does

LlamaIndex integration for Bright Data web scraping and data extraction

How it connects

Use it when you need to give your AI agent reliable access to web content, search results, or structured data from specific platforms like LinkedIn and Amazon.

Source README

LlamaIndex Tools Integration: Bright Data

This tool connects to Bright Data to enable your agent to crawl websites, search the web, and access structured data from platforms like LinkedIn, Amazon, and social media.

Bright Data's tools provide robust web scraping capabilities with built-in CAPTCHA solving and bot detection avoidance, allowing you to reliably extract data from the web.

Installation

pip install llama-index llama-index-core llama-index-tools-brightdata

Authentication

Sign up at Bright Data and retrieve your API key from your account settings. In the examples below, replace "your-api-key" with your actual API key.
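To avoid pasting the key into source files, one option is to read it from an environment variable. The following is a minimal sketch, assuming the key is stored in a variable named BRIGHTDATA_API_KEY; that name is a hypothetical choice for this example, not something the library is known to read on its own.

```python
import os


def load_api_key(env_var: str = "BRIGHTDATA_API_KEY") -> str:
    """Return the API key stored in env_var, or raise if it is missing.

    The default variable name is an assumption made for this example.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

The returned string can then be passed as the api_key argument when constructing BrightDataToolSpec.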

Usage

Here's an example of how to use the BrightDataToolSpec with LlamaIndex:

import asyncio

from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI
from llama_index.tools.brightdata import BrightDataToolSpec

brightdata_tool = BrightDataToolSpec(api_key="your-api-key", zone="unlocker")

tool_list = brightdata_tool.to_tool_list()

# Save each tool's full description, then give the agent a short generic one.
for tool in tool_list:
    tool.original_description = tool.metadata.description
    tool.metadata.description = "Bright Data web scraping tool"

agent = FunctionAgent(
    tools=tool_list,
    llm=OpenAI(model="gpt-4.1"),
)

query = (
    "Find and summarize the latest news about AI from major tech news sites"
)
# Supply the full descriptions through the prompt instead.
tool_descriptions = "\n\n".join(
    [
        f"Tool Name: {tool.metadata.name}\nTool Description: {tool.original_description}"
        for tool in tool_list
    ]
)

query_with_descriptions = f"{tool_descriptions}\n\nQuery: {query}"


async def main():
    # agent.run is a coroutine, so it has to be awaited inside an event loop.
    response = await agent.run(query_with_descriptions)
    print(response)


asyncio.run(main())

Features

The Bright Data tool provides the following capabilities:

Web Scraping

  • scrape_as_markdown: Scrape a webpage and convert the content to Markdown format. This tool can bypass CAPTCHA and bot detection.

result = brightdata_tool.scrape_as_markdown("https://example.com")
print(result.text)
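Once a page is available as Markdown, ordinary text processing applies to it. As an illustration, here is a small sketch that pulls inline links out of a Markdown string such as result.text. The helper name extract_links is hypothetical, and the pattern is deliberately simple: it does not handle reference-style links or URLs that contain parentheses.

```python
import re

# Matches inline Markdown links such as [label](https://example.com).
# The (?<!!) lookbehind skips image syntax like ![alt](src).
LINK_PATTERN = re.compile(r"(?<!!)\[([^\]]+)\]\((https?://[^\s)]+)\)")


def extract_links(markdown: str) -> list[tuple[str, str]]:
    """Return (label, url) pairs for inline links found in Markdown text."""
    return LINK_PATTERN.findall(markdown)
```

Calling extract_links(result.text) would give a list of pairs that an agent or a follow-up scrape could iterate over.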

Visual Capture

  • get_screenshot: Take a screenshot of a webpage and save it to a file.

screenshot_path = brightdata_tool.get_screenshot(
    "https://example.com", output_path="example_screenshot.png"
)
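Before handing a saved screenshot to another step, it can be worth checking that the file exists and really is an image. The sketch below assumes the screenshot is written as a PNG, which the .png extension in the example suggests but this README does not state outright. The helper name looks_like_png is hypothetical.

```python
from pathlib import Path

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(path: str) -> bool:
    """Return True if path is an existing file that begins with the PNG signature."""
    file = Path(path)
    if not file.is_file():
        return False
    with file.open("rb") as handle:
        return handle.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE
```

For example, looks_like_png(screenshot_path) could gate any later step that opens the image.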

Search Engine Access

  • search_engine: Search Google, Bing, or Yandex and get structured search results as JSON or Markdown. Supports advanced parameters for more specific searches.

search_results = brightdata_tool.search_engine(
    query="climate change solutions",
    engine="google",
    language="en",
    country_code="us",
    num_results=20,
)
print(search_results.text)
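Because each search is a remote request, it may help to validate arguments locally first. The following sketch builds the keyword arguments for search_engine and rejects unknown engines. It assumes the engine names are the lowercase strings "google", "bing", and "yandex"; only "google" appears verbatim in this README. The helper name build_search_kwargs is hypothetical.

```python
# Engine names are an assumption based on the engines listed above.
SUPPORTED_ENGINES = {"google", "bing", "yandex"}


def build_search_kwargs(query: str, engine: str = "google", **options) -> dict:
    """Validate the query and engine, and drop options that are set to None."""
    cleaned_query = query.strip()
    if not cleaned_query:
        raise ValueError("query must not be empty")
    engine = engine.lower()
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    kwargs = {"query": cleaned_query, "engine": engine}
    kwargs.update({name: value for name, value in options.items() if value is not None})
    return kwargs
```

The result can be unpacked into the call, for example brightdata_tool.search_engine(**build_search_kwargs("climate change solutions", language="en")).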

Structured Web Data Extraction

  • web_data_feed: Retrieve structured data from various platforms including LinkedIn, Amazon, Instagram, Facebook, X (Twitter), Zillow, and more.

linkedin_profile = brightdata_tool.web_data_feed(
    source_type="linkedin_person_profile",
    url="https://www.linkedin.com/in/username/",
)
print(linkedin_profile)

amazon_product = brightdata_tool.web_data_feed(
    source_type="amazon_product", url="https://www.amazon.com/dp/B08N5KWB9H"
)
print(amazon_product)
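When URLs arrive from user input, the source_type has to be chosen before calling web_data_feed. A minimal sketch of one way to do that is shown below. It covers only the two source types that appear in the examples above, and the URL rules (a /in/ path for LinkedIn profiles, a /dp/ segment for Amazon products) are illustrative assumptions rather than the library's own logic. The helper name guess_source_type is hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse


def _host_matches(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def guess_source_type(url: str) -> Optional[str]:
    """Map a URL to one of the source types shown above, or None if unknown."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if _host_matches(host, "linkedin.com") and parsed.path.startswith("/in/"):
        return "linkedin_person_profile"
    if _host_matches(host, "amazon.com") and "/dp/" in parsed.path:
        return "amazon_product"
    return None
```

Returning None for unrecognized URLs lets the caller fall back to scrape_as_markdown instead of guessing a feed.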

Advanced Configuration

The Bright Data tool offers various configuration options for specialized use cases:

Search Engine Parameters

The search_engine function supports advanced parameters like:

  • Language targeting (language parameter)
  • Country-specific search (country_code parameter)
  • Different search types (images, shopping, news, etc.)
  • Pagination controls
  • Mobile device emulation
  • Geolocation targeting
  • Hotel search parameters

results = brightdata_tool.search_engine(
    query="best hotels in paris",
    engine="google",
    language="fr",
    country_code="fr",
    search_type="shopping",
    device="mobile",
    hotel_dates="2025-06-01,2025-06-05",
    hotel_occupancy=2,
)
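The hotel_dates value in the example above is a single string holding two ISO dates separated by a comma. Building it from date objects avoids formatting slips. This is a sketch under the assumption that the format is exactly what the example shows (check-in, then check-out); the helper name format_hotel_dates is hypothetical.

```python
from datetime import date


def format_hotel_dates(check_in: date, check_out: date) -> str:
    """Join two dates as 'YYYY-MM-DD,YYYY-MM-DD', check-in first."""
    if check_out <= check_in:
        raise ValueError("check_out must be after check_in")
    return f"{check_in.isoformat()},{check_out.isoformat()}"
```

For the example above, format_hotel_dates(date(2025, 6, 1), date(2025, 6, 5)) produces the same string that is passed as hotel_dates.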

Supp
