Retrieve and Synthesize Multi-Document Insights
A LlamaPack that implements structured hierarchical retrieval over multiple documents using Weaviate collections.
Why it matters
Leverage structured hierarchical retrieval across multiple Weaviate collections to efficiently query and synthesize information from diverse documents.
Outcomes
What it gets done
Index and structure data across multiple Weaviate collections.
Perform hierarchical retrieval over document metadata and content.
Query and synthesize information from a collection of documents.
Utilize Weaviate for efficient vector storage and retrieval.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/li-pack-packs-multidoc-autoretrieval | bash
Steps in the chain
Use llamaindex-cli to download the pack: llamaindex-cli download-llamapack MultiDocAutoRetrieverPack --download-dir ./multidoc_autoretrieval_pack. Inspect the files at ./multidoc_autoretrieval_pack and use them as a template for your own project.
Alternatively, download the pack from Python: import download_llama_pack from llama_index.core.llama_pack and download the pack to the ./multidoc_autoretrieval_pack directory: MultiDocAutoRetrieverPack = download_llama_pack('MultiDocAutoRetrieverPack', './multidoc_autoretrieval_pack')
Set up Weaviate cloud authentication using AuthApiKey with your API key and create a Weaviate Client instance pointing to your cluster: auth_config = weaviate.AuthApiKey(api_key='<api_key>'); client = weaviate.Client('https://<cluster>.weaviate.network', auth_client_secret=auth_config)
Create a VectorStoreInfo object with content_info describing your data (e.g., 'Github Issues') and metadata_info list containing MetadataInfo objects that describe each metadata field, including name, description, and type.
Create metadata_nodes as a list of TextNode objects, one per document, each carrying that document's metadata, and docs as the list of source Document objects. Both lists must be the same length.
Instantiate the pack with: pack = MultiDocAutoRetrieverPack(client, '<metadata_index_name>', '<doc_chunks_index_name>', metadata_nodes, docs, vector_store_info, auto_retriever_kwargs={...})
Use the pack's run() function as a wrapper around query_engine.query() to execute queries: response = pack.run('Tell me about a Music celebrity.')
Access the retriever directly via pack.retriever and call retrieve() with a query string to get nodes: nodes = retriever.retrieve('query_str')
Access the query engine directly via pack.query_engine and call query() with a query string: response = query_engine.query('query_str')
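The steps above show the cluster URL and API key as inline placeholders. As a minimal sketch, assuming you would rather keep credentials out of source code, the helper below reads them from environment variables. The variable names WEAVIATE_URL and WEAVIATE_API_KEY and the helper load_weaviate_settings are hypothetical; they are not part of the pack or of the Weaviate client.

```python
import os
from typing import Mapping, NamedTuple, Optional


class WeaviateSettings(NamedTuple):
    url: str
    api_key: str


def load_weaviate_settings(env: Optional[Mapping[str, str]] = None) -> WeaviateSettings:
    """Read the cluster URL and API key from environment variables.

    WEAVIATE_URL and WEAVIATE_API_KEY are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("WEAVIATE_URL", "WEAVIATE_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return WeaviateSettings(url=env["WEAVIATE_URL"], api_key=env["WEAVIATE_API_KEY"])


# The values would then feed the client setup from the steps above, e.g.:
#   settings = load_weaviate_settings()
#   auth_config = weaviate.AuthApiKey(api_key=settings.api_key)
#   client = weaviate.Client(settings.url, auth_client_secret=auth_config)
```

Failing early with a clear message when a variable is missing tends to be easier to debug than an authentication error raised later by the client.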
Overview
Multi-Document AutoRetrieval (with Weaviate) Pack
What it does
This pack implements structured hierarchical retrieval over multiple documents using Weaviate collections, providing both a retriever and query engine interface.
How it connects
Use this pack when you need structured hierarchical retrieval over multiple documents with metadata, stored in Weaviate collections.
Source README
Multi-Document AutoRetrieval (with Weaviate) Pack
This LlamaPack implements structured hierarchical retrieval over multiple documents, using multiple Weaviate collections.
CLI Usage
You can download LlamaPacks directly using llamaindex-cli, which comes installed with the llama-index Python package:
llamaindex-cli download-llamapack MultiDocAutoRetrieverPack --download-dir ./multidoc_autoretrieval_pack
You can then inspect the files at ./multidoc_autoretrieval_pack and use them as a template for your own project!
Code Usage
You can download the pack to the ./multidoc_autoretrieval_pack directory:
from llama_index.core.llama_pack import download_llama_pack
# download and install dependencies
MultiDocAutoRetrieverPack = download_llama_pack(
    "MultiDocAutoRetrieverPack", "./multidoc_autoretrieval_pack"
)
From here, you can use the pack. To initialize it, you first need to define a few arguments.
You can then set up the pack like so:
# setup pack arguments
from llama_index.core import Document
from llama_index.core.schema import TextNode
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo
import weaviate
# cloud
auth_config = weaviate.AuthApiKey(api_key="<api_key>")
client = weaviate.Client(
    "https://<cluster>.weaviate.network",
    auth_client_secret=auth_config,
)
vector_store_info = VectorStoreInfo(
    content_info="Github Issues",
    metadata_info=[
        MetadataInfo(
            name="state",
            description="Whether the issue is `open` or `closed`",
            type="string",
        ),
        ...,
    ],
)
# metadata_nodes is a list of nodes with metadata representing each document
# docs is the list of source documents
# metadata_nodes and docs must be the same length
metadata_nodes = [TextNode(..., metadata={...}), ...]
docs = [Document(...), ...]
pack = MultiDocAutoRetrieverPack(
    client,
    "<metadata_index_name>",
    "<doc_chunks_index_name>",
    metadata_nodes,
    docs,
    vector_store_info,
    auto_retriever_kwargs={
        # any kwargs for the auto-retriever
        ...
    },
)
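Because metadata_nodes and docs must line up one-to-one, it can help to derive both from the same source records and check the lengths before building the pack. The sketch below uses plain dictionaries and strings as stand-ins. The record fields (title, state, body) and the helper names split_records and check_same_length are hypothetical. In a real project each pair would presumably be wrapped as TextNode(text=..., metadata=...) and Document(text=...). Whether the pack performs a length check itself is not stated here.

```python
from typing import Any, Dict, List, Sequence, Tuple


def split_records(
    records: Sequence[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Return (per-document metadata dicts, document texts) in matching order."""
    metadata = [{"title": r["title"], "state": r["state"]} for r in records]
    texts = [r["body"] for r in records]
    return metadata, texts


def check_same_length(metadata_nodes: Sequence[Any], docs: Sequence[Any]) -> None:
    """Raise ValueError unless there is exactly one metadata node per document."""
    if len(metadata_nodes) != len(docs):
        raise ValueError(
            f"{len(metadata_nodes)} metadata nodes but {len(docs)} docs"
        )


# Hypothetical source records, one per document.
records = [
    {"title": "Crash on start", "state": "open", "body": "Steps to reproduce ..."},
    {"title": "Typo in docs", "state": "closed", "body": "The README says ..."},
]
metadata, texts = split_records(records)
check_same_length(metadata, texts)  # passes silently: two of each
```

Building both lists in a single pass over the same records keeps index i of metadata_nodes describing index i of docs.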
The run() function is a light wrapper around query_engine.query().
response = pack.run("Tell me about a Music celebrity.")
You can also use modules individually.
# use the retriever
retriever = pack.retriever
nodes = retriever.retrieve("query_str")
# use the query engine
query_engine = pack.query_engine
response = query_engine.query("query_str")
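To inspect what the retriever returned, a small formatting helper can be handy. This sketch assumes each retrieved item exposes a score attribute and a node with a metadata dictionary, as LlamaIndex's NodeWithScore objects do. The helper name summarize_nodes is hypothetical, and the sample data uses simple stand-in objects rather than a live Weaviate query.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def summarize_nodes(nodes: Iterable[Any], keys: Iterable[str] = ("state",)) -> List[str]:
    """Format each retrieved item as 'score | key=value, ...'."""
    keys = tuple(keys)
    lines = []
    for item in nodes:
        score = item.score
        metadata = item.node.metadata
        fields = ", ".join(f"{k}={metadata.get(k)!r}" for k in keys)
        # Scores can be absent, so fall back to a placeholder.
        score_text = "n/a" if score is None else f"{score:.3f}"
        lines.append(f"{score_text} | {fields}")
    return lines


# Stand-ins that mimic the assumed shape of retrieved items.
fake_nodes = [
    SimpleNamespace(score=0.91, node=SimpleNamespace(metadata={"state": "open"})),
    SimpleNamespace(score=None, node=SimpleNamespace(metadata={"state": "closed"})),
]
summary = summarize_nodes(fake_nodes)
```

With a live pack, passing the result of pack.retriever.retrieve("query_str") to summarize_nodes should give the same kind of listing, provided the metadata keys you request exist on your nodes.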