Extract Structured Data from Web Pages

Name: Extract Structured Data from Web Pages
Availability: OnlineOnly
Author: LlamaIndex

Extract structured data from web pages using AgentQL queries or natural language. Integrates with LlamaIndex and Playwright for web scraping and browser automation.

Works with playwright, agentql

Updated 3 months ago

Version 1.0.0

Models

llama 3

Why it matters

Automate the extraction of structured data from any web page, either via REST API or directly from a browser. This tool enables robust web scraping and data retrieval that remains resilient to website changes.

Outcomes

What it gets done

Extract structured data from a given URL using AgentQL queries or natural language prompts.

Extract structured data from the active browser page using AgentQL queries or natural language prompts.

Locate specific web elements within a browser page using natural language descriptions.

Integrate with Playwright for browser automation and interaction.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/li-tool-tools-agentql | bash

Capabilities

What this skill does

Extract

Pulls structured data fields from unstructured text.

Scrape

Fetches and parses content from web pages.

Search the web

Searches the web and retrieves relevant sources.

Drive a browser

Controls a real browser to automate web workflows.

Overview

Llama Index Tools Agentql

What it does

This toolset provides AI assistants with the capability to interact with web pages and extract structured data. It offers functions to fetch data from a specified URL using a REST API or to extract information from the currently active browser tab. Additionally, it can locate specific web elements within a browser page using natural language descriptions.

How it connects

Use this tool when your AI assistant needs to gather specific information from websites, such as article content, product details, or user reviews. It can do so either through direct API calls or by driving a browser, and it is intended for cases where extraction has to keep working as pages change.

Source README

llama-index-tools-agentql

AgentQL provides web interaction and structured data extraction from any web page using an AgentQL query or a natural language prompt. The same query can be reused across multiple languages and web pages, and is meant to keep working as pages change over time.

Warning
These tools support only async functions and Playwright browser APIs. See the following PR for details: https://github.com/run-llama/llama_index/pull/17808
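
Because the tools are async-only, the calls shown in this README need a running event loop. Top-level await works in a notebook; in a plain script, one option is to wrap the calls in a coroutine and run it with asyncio.run. The sketch below uses a hypothetical stand-in coroutine, fake_extract, in place of a real AgentQL tool call so that it runs without network access or an API key. It is not part of the library.

```python
import asyncio


async def fake_extract(url: str) -> dict:
    """Hypothetical stand-in for an async AgentQL tool call (not a library function)."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    return {"url": url, "data": {}}


async def main() -> dict:
    # In real code, await the AgentQL tool method here instead of fake_extract.
    return await fake_extract("https://www.agentql.com/blog")


if __name__ == "__main__":
    result = asyncio.run(main())
    print(result)
```

Note that asyncio.run cannot be called from inside an already running loop, which is why notebooks use top-level await instead.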

Installation

pip install llama-index-tools-agentql

You also need to set the AGENTQL_API_KEY environment variable. You can get an API key from the AgentQL Dev Portal.
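
A missing key is easier to diagnose if it is checked up front. The following is a minimal sketch of such a check; the helper name require_agentql_api_key is hypothetical and not part of the package, and only the AGENTQL_API_KEY variable name comes from this README.

```python
import os


def require_agentql_api_key() -> str:
    """Return the AGENTQL_API_KEY value, or raise a clear error if it is unset.

    Hypothetical helper for illustration; not provided by llama-index-tools-agentql.
    """
    key = os.environ.get("AGENTQL_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "AGENTQL_API_KEY is not set; export it before using the AgentQL tools."
        )
    return key
```

Calling this once at startup turns a confusing failure deep inside a request into an immediate, readable error.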

Overview

AgentQL provides the following three function tools:

extract_web_data_with_rest_api: Extracts structured data as JSON from a web page given a URL using either an AgentQL query or a Natural Language description of the data.
extract_web_data_from_browser: Extracts structured data as JSON from the active web page in a browser using either an AgentQL query or a Natural Language description. This tool must be used with a Playwright browser.
get_web_element_from_browser: Finds a web element on the active web page in a browser using a Natural Language description and returns its CSS selector for further interaction. This tool must be used with a Playwright browser.

You can learn more about how to use AgentQL tools in this Jupyter notebook.

Extract data using REST API

from llama_index.tools.agentql import AgentQLRestAPIToolSpec

agentql_rest_api_tool = AgentQLRestAPIToolSpec()
await agentql_rest_api_tool.extract_web_data_with_rest_api(
    url="https://www.agentql.com/blog",
    query="{ posts[] { title url author date }}",
)
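
The query above describes the shape of the data to return: a list named posts whose items carry title, url, author, and date. The sketch below shows one way to consume a result of that shape. The sample dictionary is invented for illustration, and both the helper name iter_posts and the idea that the payload may be wrapped under a "data" key are assumptions of this sketch rather than guarantees of the tool, so inspect the actual return value before relying on it.

```python
from typing import Any, Iterator


def iter_posts(result: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield post records from a result shaped like the query above.

    Assumption: the payload may sit under a "data" key; if it does not,
    the top-level dictionary is used instead.
    """
    payload = result.get("data") or result
    for post in payload.get("posts") or []:
        # Fields the page did not expose may be absent; report them as None.
        yield {key: post.get(key) for key in ("title", "url", "author", "date")}


# Invented sample, not real output from AgentQL.
sample = {
    "data": {
        "posts": [
            {
                "title": "Example post",
                "url": "https://example.com/a",
                "author": "A. Writer",
            },
        ]
    }
}

for post in iter_posts(sample):
    print(post["title"], post["url"])
```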

Work with data and web elements using browser

Setup

To use the extract_web_data_from_browser and get_web_element_from_browser tools, you need a Playwright browser instance. If you do not have an active instance, you can create one with the create_async_playwright_browser utility method from LlamaIndex's Playwright ToolSpec.

Note
AgentQL browser tools are best used along with LlamaIndex's Playwright tools.

from llama_index.tools.playwright.base import PlaywrightToolSpec

async_browser = await PlaywrightToolSpec.create_async_playwright_browser()

You can also use an existing browser instance via Chrome DevTools Protocol (CDP) connection URL:

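from playwright.async_api import async_playwright

# Replace "CDP_CONNECTION_URL" with the CDP endpoint of the running browser.
p = await async_playwright().start()
async_browser = await p.chromium.connect_over_cdp("CDP_CONNECTION_URL")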

Extract data from the active browser page

from llama_index.tools.agentql import AgentQLBrowserToolSpec

playwright_tool = PlaywrightToolSpec(async_browser=async_browser)
await playwright_tool.navigate_to("https://www.agentql.com/blog")

agentql_browser_tool = AgentQLBrowserToolSpec(async_browser=async_browser)
await agentql_browser_tool.extract_web_data_from_browser(
    prompt="the blog posts with title and url",
)

Find a web element on the active browser page

next_page_button = await agentql_browser_tool.get_web_element_from_browser(
    prompt="The next page navigation button",
)

await playwright_tool.click(next_page_button)
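
Combining the two browser tools suggests a simple pagination pattern: extract from the current page, look for a next-page control, click it, and repeat. The sketch below expresses that loop against three injected async callables so that it can run with stand-ins. The function collect_pages, the max_pages cap, and the convention that the finder returns None when there is no button are all assumptions of this sketch, not documented behavior of the AgentQL or Playwright tools.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


async def collect_pages(
    extract: Callable[[], Awaitable[Any]],
    find_next: Callable[[], Awaitable[Optional[str]]],
    click: Callable[[str], Awaitable[Any]],
    max_pages: int = 5,
) -> list[Any]:
    """Extract page by page until no next button is found or max_pages is reached."""
    results: list[Any] = []
    for page_number in range(max_pages):
        results.append(await extract())
        if page_number == max_pages - 1:
            break  # cap reached; do not navigate any further
        selector = await find_next()
        if not selector:
            break  # assumed convention: a falsy selector means "no next page"
        await click(selector)
    return results


# Stand-ins that simulate a three-page site (hypothetical, for demonstration only).
state = {"page": 1}


async def fake_extract() -> dict:
    return {"page": state["page"]}


async def fake_find_next() -> Optional[str]:
    return "a.next" if state["page"] < 3 else None


async def fake_click(selector: str) -> None:
    state["page"] += 1


pages = asyncio.run(collect_pages(fake_extract, fake_find_next, fake_click))
print(pages)
```

With the real tools, the three callables would wrap extract_web_data_from_browser, get_web_element_from_browser, and playwright_tool.click. The cap guards against sites whose next button never disappears.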

Agentic Usage

This tool has a more extensive example for agentic usage documented in this Jupyter notebook.

Run tests

In order to run integration tests, you need to configure L
