Skill

Load Web Data with Apify Actors

Load data from Apify Actors or Datasets into LlamaIndex.

Works with: apify, llamaindex, langchain

Spark score: 57 out of 100
Updated 4 days ago
Version 0.14.22


Why it matters

Run Apify Actors or read existing Apify datasets, and load the scraped data directly into your LlamaIndex or LangChain applications.

Outcomes

What it gets done

01 Run pre-built Apify Actors to crawl and extract content from websites.
02 Load data from existing Apify datasets.
03 Prepare scraped data for use in LLM applications via LlamaIndex.
04 Utilize Apify data within LangChain agents.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/li-reader-readers-apify | bash

Capabilities

What this skill does

Scrape

Fetches and parses content from web pages.

Extract

Pulls structured data fields from unstructured text.

RAG index

Chunks, embeds, and indexes documents for semantic retrieval.

Search the web

Searches the web and retrieves relevant sources.

Overview

Apify Loaders

What it does

This integration provides two loaders for interacting with Apify, a cloud platform for web scraping and data extraction. The Apify Actor Loader runs a specified Apify Actor and loads its results, while the Apify Dataset Loader retrieves data from an existing Apify dataset.

How it connects

Use these loaders when you need to ingest data scraped or extracted by Apify Actors into your LlamaIndex or LangChain applications. A prime example is using the Website Content Crawler Actor to gather text content from websites like documentation, knowledge bases, or blogs, which can then be used to answer questions with a language model.

Source README

Apify Loaders

pip install llama-index-readers-apify

Apify Actor Loader

Apify is a cloud platform for web scraping and data extraction,
which provides an ecosystem of more than a thousand
ready-made apps called Actors for various scraping, crawling, and extraction use cases.

This loader runs a specific Actor and loads its results.

Usage

In this example, we’ll use the Website Content Crawler Actor,
which can deeply crawl websites such as documentation, knowledge bases, help centers,
or blogs, and extract text content from the web pages.
The extracted text can then be fed to a vector index or a language model such as GPT
to answer questions about it.

To use this loader, you need to have a (free) Apify account
and set your Apify API token in the code.
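
If you would rather not paste the token into source files, one option is to read it from an environment variable before creating the reader. The sketch below is only an illustration: the helper name get_apify_token and the variable name APIFY_API_TOKEN are assumptions made here, not something the loader reads on its own.

```python
import os


def get_apify_token(env_var: str = "APIFY_API_TOKEN") -> str:
    """Return the Apify API token stored in an environment variable.

    The variable name is an assumption of this sketch; the loader itself
    only sees the string you pass to it.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early with a clear message instead of sending an empty token.
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Apify API token."
        )
    return token
```

The returned string can be passed wherever the examples use a token placeholder, for instance ApifyActor(get_apify_token()).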

from llama_index.core import Document
from llama_index.readers.apify import ApifyActor

# Authenticate the reader with your Apify API token.
reader = ApifyActor("<Your Apify API token>")

# Run the Actor, then map each item of its results to a Document.
documents = reader.load_data(
    actor_id="apify/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://docs.llamaindex.ai/en/latest/"}]
    },
    # Keep the page text as the document body and its URL as metadata.
    dataset_mapping_function=lambda item: Document(
        text=item.get("text"),
        metadata={
            "url": item.get("url"),
        },
    ),
)

This loader is designed to load data into LlamaIndex;
the loaded data can subsequently be
used as a Tool in a LangChain Agent.
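
The run_input in the Actor example above lists a single start URL. When crawling several sites, the same structure can be built from a plain list of URLs. This is a minimal sketch: build_run_input is a hypothetical helper, and it only reproduces the startUrls shape shown in the example; any other input fields the Actor accepts are out of scope here.

```python
from typing import Dict, Iterable, List


def build_run_input(urls: Iterable[str]) -> Dict[str, List[Dict[str, str]]]:
    """Build a run_input dict with one startUrls entry per unique, non-empty URL."""
    seen = set()
    start_urls = []
    for url in urls:
        url = url.strip()
        # Skip blanks and duplicates while keeping the original order.
        if url and url not in seen:
            seen.add(url)
            start_urls.append({"url": url})
    if not start_urls:
        raise ValueError("At least one start URL is required.")
    return {"startUrls": start_urls}
```

The result can be passed as run_input=build_run_input([...]) in the load_data call.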

Apify Dataset Loader

This loader loads documents from an existing Apify dataset.

Usage

In this example, we’ll load a dataset generated by
the Website Content Crawler Actor,
which can deeply crawl websites such as documentation, knowledge bases, help centers,
or blogs, and extract text content from the web pages.
The extracted text can then be fed to a vector index or a language model such as GPT
to answer questions about it.

To use this loader, you need to have a (free) Apify account
and set your Apify API token in the code.

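from llama_index.core import Document
from llama_index.readers.apify import ApifyDataset

# Authenticate the reader with your Apify API token.
reader = ApifyDataset("<Your Apify API token>")

# Read the existing dataset and map each item to a Document.
documents = reader.load_data(
    dataset_id="<Apify Dataset ID>",
    # Keep the page text as the document body and its URL as metadata.
    dataset_mapping_function=lambda item: Document(
        text=item.get("text"),
        metadata={
            "url": item.get("url"),
        },
    ),
)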
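
The mapping lambda in the examples uses item.get(...), which yields None when a dataset item has no "text" or "url" field. A named helper can substitute empty strings instead, so every item maps to plain string values. This is a sketch under assumptions: item_to_document_fields is a hypothetical name, and it returns keyword arguments rather than a Document so that it runs without LlamaIndex installed.

```python
from typing import Any, Dict


def item_to_document_fields(item: Dict[str, Any]) -> Dict[str, Any]:
    """Map one dataset item to keyword arguments for Document(...)."""
    return {
        # Fall back to an empty string when the field is missing or None.
        "text": item.get("text") or "",
        "metadata": {"url": item.get("url") or ""},
    }
```

With LlamaIndex installed, it could be wired in as dataset_mapping_function=lambda item: Document(**item_to_document_fields(item)). Documents with empty text can then be filtered out after loading if they are not wanted.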
