Load Web Data with Apify Actors
Load data from Apify Actors or Datasets into LlamaIndex.
Why it matters
Integrate with Apify's powerful web scraping platform to extract and load data from web pages or existing datasets directly into your LlamaIndex or LangChain applications.
Outcomes
What it gets done
Run pre-built Apify Actors to crawl and extract content from websites.
Load data from existing Apify datasets.
Prepare scraped data for use in LLM applications via LlamaIndex.
Utilize Apify data within LangChain agents.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/li-reader-readers-apify | bash
Capabilities
What this skill does
Fetches and parses content from web pages.
Pulls structured data fields from unstructured text.
Chunks, embeds, and indexes documents for semantic retrieval.
Searches the web and retrieves relevant sources.
Overview
Apify Loaders
What it does
This integration provides two loaders for interacting with Apify, a cloud platform for web scraping and data extraction. The Apify Actor Loader runs a specified Apify Actor and loads its results, while the Apify Dataset Loader retrieves data from an existing Apify dataset.
How it connects
Use these loaders when you need to ingest data scraped or extracted by Apify Actors into your LlamaIndex or LangChain applications. A prime example is using the Website Content Crawler Actor to gather text content from websites like documentation, knowledge bases, or blogs, which can then be used to answer questions with a language model.
Source README
Apify Loaders
pip install llama-index-readers-apify
Apify Actor Loader
Apify is a cloud platform for web scraping and data extraction,
which provides an ecosystem of more than a thousand
ready-made apps called Actors for various scraping, crawling, and extraction use cases.
This loader runs a specific Actor and loads its results.
Usage
In this example, we’ll use the Website Content Crawler Actor,
which can deeply crawl websites such as documentation, knowledge bases, help centers,
or blogs, and extract text content from the web pages.
The extracted text can then be fed to a vector index or a language model like GPT
to answer questions about it.
To use this loader, you need to have a (free) Apify account
and set your Apify API token in the code.
from llama_index.core import Document
from llama_index.readers.apify import ApifyActor

# Authenticate with your Apify API token.
reader = ApifyActor("<My Apify API token>")

# Run the Actor and map each item of its results to a Document.
documents = reader.load_data(
    actor_id="apify/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://docs.llamaindex.ai/en/latest/"}]
    },
    dataset_mapping_function=lambda item: Document(
        text=item.get("text"),
        metadata={
            "url": item.get("url"),
        },
    ),
)
This loader is designed to load data into LlamaIndex
and can subsequently be used as a Tool in a LangChain Agent.
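The usage example above hard-codes the API token and a single start URL. As a minimal sketch of one alternative, the helpers below read the token from an environment variable and build the run_input dictionary from a list of URLs. The helper names and the APIFY_API_TOKEN variable name are assumptions made for this illustration, not part of the loader's API; only the "startUrls" shape is taken from the example above.

```python
import os


def get_apify_token(env_var: str = "APIFY_API_TOKEN") -> str:
    """Read the Apify API token from the environment.

    The variable name is an assumption for this sketch; use whatever
    name your own deployment defines.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Apify API token"
        )
    return token


def build_run_input(urls: list[str]) -> dict:
    """Build Actor input in the shape used above: {"startUrls": [{"url": ...}]}."""
    return {"startUrls": [{"url": url} for url in urls]}
```

With these in place, the reader could be created as ApifyActor(get_apify_token()) and the crawl started with run_input=build_run_input(["https://docs.llamaindex.ai/en/latest/"]), which keeps the token out of source control.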
Apify Dataset Loader
Apify is a cloud platform for web scraping and data extraction,
which provides an ecosystem of more than a thousand
ready-made apps called Actors for various scraping, crawling, and extraction use cases.
This loader loads documents from an existing Apify dataset.
Usage
In this example, we’ll load a dataset generated by
the Website Content Crawler Actor,
which can deeply crawl websites such as documentation, knowledge bases, help centers,
or blogs, and extract text content from the web pages.
The extracted text can then be fed to a vector index or a language model like GPT
to answer questions about it.
To use this loader, you need to have a (free) Apify account
and set your Apify API token in the code.
from llama_index.core import Document
from llama_index.readers.apify import ApifyDataset

# Authenticate with your Apify API token.
reader = ApifyDataset("<Your Apify API token>")

# Load an existing dataset and map each item to a Document.
documents = reader.load_data(
    dataset_id="<Apify Dataset ID>",
    dataset_mapping_function=lambda item: Document(
        text=item.get("text"),
        metadata={
            "url": item.get("url"),
        },
    ),
)
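The lambda passed as dataset_mapping_function above forwards item.get("text") unchanged, which is None whenever an item has no "text" field. The sketch below shows one way to normalize an item first; item_to_fields is a hypothetical helper introduced for illustration, and it only assumes the "text" and "url" keys that appear in the example above.

```python
from typing import Any


def item_to_fields(
    item: dict[str, Any], metadata_keys: tuple[str, ...] = ("url",)
) -> dict[str, Any]:
    """Turn one dataset item into keyword arguments for Document.

    Hypothetical helper: a missing or None "text" becomes an empty string,
    and metadata keys whose value is missing are left out.
    """
    text = (item.get("text") or "").strip()
    metadata = {
        key: item[key] for key in metadata_keys if item.get(key) is not None
    }
    return {"text": text, "metadata": metadata}
```

It could then be plugged into either loader as dataset_mapping_function=lambda item: Document(**item_to_fields(item)), so every item still maps to exactly one Document.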