Scrape and Extract Web Data with Bright Data
Integrate Bright Data's web scraping and data extraction tools with LlamaIndex for reliable web crawling and structured data retrieval.
Why it matters
Bright Data handles CAPTCHA solving and bot-detection avoidance, so an agent can reliably extract structured data from websites, search engines, and social media platforms.
Outcomes
What it gets done
Scrape web pages and convert content to Markdown, bypassing bot detection.
Take screenshots of web pages for visual capture.
Perform targeted web searches on Google, Bing, or Yandex with structured results.
Extract structured data from platforms like LinkedIn, Amazon, and social media.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/li-tool-tools-brightdata | bash
Capabilities
What this skill does
Fetches and parses content from web pages.
Searches the web and retrieves relevant sources.
Pulls structured data fields from unstructured text.
Condenses long documents or threads into key takeaways.
Overview
LlamaIndex Tools Integration: Bright Data
What it does
LlamaIndex integration for Bright Data web scraping and data extraction
How it connects
Use it when you need to give your AI agent reliable access to web content, search results, or structured data from specific platforms like LinkedIn and Amazon.
Source README
LlamaIndex Tools Integration: Bright Data
This tool connects to Bright Data to enable your agent to crawl websites, search the web, and access structured data from platforms like LinkedIn, Amazon, and social media.
Bright Data's tools provide robust web scraping capabilities with built-in CAPTCHA solving and bot detection avoidance, allowing you to reliably extract data from the web.
Installation
pip install llama-index llama-index-core llama-index-tools-brightdata
Authentication
Sign up at Bright Data and retrieve your API key from your account settings. Replace "your-api-key" with your actual API key in the examples below.
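To keep the key out of source code, one option is to read it from an environment variable. The sketch below assumes a variable named BRIGHTDATA_API_KEY; that name and the load_api_key helper are hypothetical, chosen for illustration, and are not something the library is known to read on its own.

```python
import os


def load_api_key(env_var: str = "BRIGHTDATA_API_KEY") -> str:
    """Return the API key stored in env_var (a hypothetical variable name)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Bright Data API key"
        )
    return key
```

The returned string could then be passed in place of the literal, for example BrightDataToolSpec(api_key=load_api_key(), zone="unlocker").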
Usage
Here's an example of how to use the BrightDataToolSpec with LlamaIndex:
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI
from llama_index.tools.brightdata import BrightDataToolSpec

# Create the tool spec and expose its functions as agent tools
brightdata_tool = BrightDataToolSpec(api_key="your-api-key", zone="unlocker")
tool_list = brightdata_tool.to_tool_list()

# Keep each tool's full description, then give the agent a short one
for tool in tool_list:
    tool.original_description = tool.metadata.description
    tool.metadata.description = "Bright Data web scraping tool"

agent = FunctionAgent(
    tools=tool_list,
    llm=OpenAI(model="gpt-4.1"),
)

query = (
    "Find and summarize the latest news about AI from major tech news sites"
)

# Put the full tool descriptions into the prompt itself
tool_descriptions = "\n\n".join(
    [
        f"Tool Name: {tool.metadata.name}\nTool Description: {tool.original_description}"
        for tool in tool_list
    ]
)
query_with_descriptions = f"{tool_descriptions}\n\nQuery: {query}"

# Top-level await works in a notebook; in a script, call this from an async function
response = await agent.run(query_with_descriptions)
print(response)
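The example above awaits agent.run at the top level, which works in a notebook but not in a plain Python script. A minimal sketch of one way to run it from synchronous code follows. It assumes only that the agent object has an awaitable run method, as FunctionAgent is used above; run_query and run_query_sync are hypothetical helper names.

```python
import asyncio


async def run_query(agent, query: str):
    """Await agent.run(query) and return the response."""
    return await agent.run(query)


def run_query_sync(agent, query: str):
    """Run the query from synchronous code by starting an event loop."""
    return asyncio.run(run_query(agent, query))
```

In a script, print(run_query_sync(agent, query_with_descriptions)) would then stand in for the last two lines of the example.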
Features
The Bright Data tool provides the following capabilities:
Web Scraping
scrape_as_markdown: Scrape a webpage and convert the content to Markdown format. This tool can bypass CAPTCHA and bot detection.
result = brightdata_tool.scrape_as_markdown("https://example.com")
print(result.text)
Visual Capture
get_screenshot: Take a screenshot of a webpage and save it to a file.
screenshot_path = brightdata_tool.get_screenshot(
    "https://example.com", output_path="example_screenshot.png"
)
Search Engine Access
search_engine: Search Google, Bing, or Yandex and get structured search results as JSON or Markdown. Supports advanced parameters for more specific searches.
search_results = brightdata_tool.search_engine(
    query="climate change solutions",
    engine="google",
    language="en",
    country_code="us",
    num_results=20,
)
print(search_results.text)
Structured Web Data Extraction
web_data_feed: Retrieve structured data from various platforms including LinkedIn, Amazon, Instagram, Facebook, X (Twitter), Zillow, and more.
linkedin_profile = brightdata_tool.web_data_feed(
    source_type="linkedin_person_profile",
    url="https://www.linkedin.com/in/username/",
)
print(linkedin_profile)
amazon_product = brightdata_tool.web_data_feed(
    source_type="amazon_product", url="https://www.amazon.com/dp/B08N5KWB9H"
)
print(amazon_product)
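When several pages need to be fetched, the two calls above can be generalized into a small loop. The sketch below is an assumption-laden illustration: fetch_feeds is a hypothetical helper, it takes the fetching function as an argument (for instance brightdata_tool.web_data_feed), and it relies only on the source_type and url keyword arguments shown above. The exceptions the tool may raise are not documented here, so the sketch catches broadly and stores the error per URL.

```python
from typing import Any, Callable, Iterable


def fetch_feeds(
    fetch: Callable[..., Any],
    requests: Iterable[tuple[str, str]],
) -> dict[str, Any]:
    """Call fetch(source_type=..., url=...) for each pair; key results by URL.

    A failure is stored as the exception object so that one bad URL does
    not abort the whole batch.
    """
    results: dict[str, Any] = {}
    for source_type, url in requests:
        try:
            results[url] = fetch(source_type=source_type, url=url)
        except Exception as exc:  # broad on purpose: error types are an unknown here
            results[url] = exc
    return results
```

A call might look like fetch_feeds(brightdata_tool.web_data_feed, [("amazon_product", "https://www.amazon.com/dp/B08N5KWB9H")]).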
Advanced Configuration
The Bright Data tool offers various configuration options for specialized use cases:
Search Engine Parameters
The search_engine function supports advanced parameters like:
- Language targeting (language parameter)
- Country-specific search (country_code parameter)
- Different search types (images, shopping, news, etc.)
- Pagination controls
- Mobile device emulation
- Geolocation targeting
- Hotel search parameters
results = brightdata_tool.search_engine(
    query="best hotels in paris",
    engine="google",
    language="fr",
    country_code="fr",
    search_type="shopping",
    device="mobile",
    hotel_dates="2025-06-01,2025-06-05",
    hotel_occupancy=2,
)
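The hotel_dates value in the example is a pair of ISO dates joined by a comma. A short sketch for building that string from date objects is shown here; format_hotel_dates is a hypothetical helper, and it assumes the first date is the check-in and the second the check-out, which the example suggests but does not state.

```python
from datetime import date


def format_hotel_dates(check_in: date, check_out: date) -> str:
    """Return 'YYYY-MM-DD,YYYY-MM-DD', the shape used for hotel_dates above."""
    if check_out <= check_in:
        raise ValueError("check_out must be after check_in")
    return f"{check_in.isoformat()},{check_out.isoformat()}"
```

For instance, format_hotel_dates(date(2025, 6, 1), date(2025, 6, 5)) produces the same string as the literal in the example.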
Supp