Query Deep Lake Datasets with LangChain and OpenAI
Build a question answering system using LangChain, Deep Lake, and OpenAI embeddings to query text datasets.
Why it matters
Build a question-answering system that leverages LangChain and OpenAI to query and retrieve information from a Deep Lake vector store.
Outcomes
What it gets done
Load and index text data into a Deep Lake vector store.
Initialize a LangChain vector store with Deep Lake.
Perform semantic searches on the indexed data.
Generate answers to user queries using LLMs.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/oai-deeplakelangchainqa | bash
Steps in the chain
Load a Deep Lake text dataset. We will use a 20,000-sample subset of the cohere-wikipedia-22 dataset for this example.
Initialize a Deep Lake vector store with LangChain. Define a dataset_path where your Deep Lake vector store will house the text embeddings. Set up OpenAI's text-embedding-3-small as the embedding function and initialize a Deep Lake vector store at dataset_path.
Populate the vector store with samples, one batch at a time, using the add_texts method.
Set up QA on the vector store with GPT-3.5-Turbo as the LLM. Run prompts and check the output. Internally, this API performs an embedding search to find the most relevant data to feed into the LLM context.
The question answering system with LangChain, Deep Lake, and OpenAI is complete.
Overview
Question Answering with LangChain, Deep Lake, & OpenAI
What it does
This prompt chain implements a question answering system using LangChain, Deep Lake as a vector store, and OpenAI embeddings. It loads a Deep Lake text dataset, initializes a Deep Lake vector store with LangChain, adds text to the vector store, and runs queries on the database.
How it connects
Use this prompt chain when you need to build a system that can answer questions based on a specific text dataset. Do not use this if you do not intend to use OpenAI for embeddings and LangChain for orchestration.
Source README
Question Answering with LangChain, Deep Lake, & OpenAI
This notebook shows how to implement a question answering system with LangChain, Deep Lake as a vector store and OpenAI embeddings. We will take the following steps to achieve this:
- Load a Deep Lake text dataset
- Initialize a Deep Lake vector store with LangChain
- Add text to the vector store
- Run queries on the database
- Done!
You can also follow other tutorials, such as question answering over any type of data (PDFs, JSON, CSV, text), chatting with any data stored in Deep Lake, code understanding, question answering over PDFs, or recommending songs.
Install requirements
Let's install the following packages.
!pip install deeplake langchain openai tiktoken
Authentication
Provide your OpenAI API key here:
import getpass
import os
os.environ['OPENAI_API_KEY'] = getpass.getpass()
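If the key may already be exported in your shell, a small guard avoids prompting for it again. The sketch below is one possible way to do that; resolve_api_key is a hypothetical helper written for this example, not part of any library, and the prompt function is passed in so the logic can be exercised without typing a key.

```python
import getpass
import os


def resolve_api_key(env, prompt=getpass.getpass):
    """Return the key already present in `env`, or ask for one and store it.

    `resolve_api_key` is a hypothetical helper for illustration only.
    """
    key = env.get("OPENAI_API_KEY")
    if not key:
        # Nothing set (or an empty string): fall back to an interactive prompt.
        key = prompt()
        env["OPENAI_API_KEY"] = key
    return key


# In the notebook you would call: resolve_api_key(os.environ)
```

Passing os.environ keeps the behavior of the snippet above while skipping the prompt when the variable is already set.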
Load a Deep Lake text dataset
We will use a 20,000-sample subset of the cohere-wikipedia-22 dataset for this example.
import deeplake
ds = deeplake.load("hub://activeloop/cohere-wikipedia-22-sample")
ds.summary()
Let's take a look at a few samples:
ds[:3].text.data()["value"]
LangChain's Deep Lake vector store
Let's define a dataset_path; this is where your Deep Lake vector store will house the text embeddings.
dataset_path = 'wikipedia-embeddings-deeplake'
We will set up OpenAI's text-embedding-3-small as our embedding function and initialize a Deep Lake vector store at dataset_path...
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
embedding = OpenAIEmbeddings(model="text-embedding-3-small")
db = DeepLake(dataset_path, embedding=embedding, overwrite=True)
... and populate it with samples, one batch at a time, using the add_texts method.
from tqdm.auto import tqdm
batch_size = 100
nsamples = 10 # for testing. Replace with len(ds) to append everything
for i in tqdm(range(0, nsamples, batch_size)):
    # find end of batch
    i_end = min(nsamples, i + batch_size)
    batch = ds[i:i_end]
    id_batch = batch.ids.data()["value"]
    text_batch = batch.text.data()["value"]
    meta_batch = batch.metadata.data()["value"]
    db.add_texts(text_batch, metadatas=meta_batch, ids=id_batch)
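To see how the loop slices the dataset, the index arithmetic can be isolated from Deep Lake entirely. The following is a minimal sketch; batch_bounds is a hypothetical helper name used only here, and it reproduces the start/end computation of the loop above without touching the dataset or the vector store.

```python
def batch_bounds(nsamples, batch_size):
    """Yield (start, end) index pairs that cover range(nsamples) in batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, nsamples, batch_size):
        # The final batch may be shorter than batch_size.
        yield start, min(nsamples, start + batch_size)


print(list(batch_bounds(10, 100)))   # [(0, 10)]
print(list(batch_bounds(250, 100)))  # [(0, 100), (100, 200), (200, 250)]
```

With nsamples = 10 and batch_size = 100, the loop runs once and adds all 10 samples in a single add_texts call.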
Run user queries on the database
The underlying Deep Lake dataset object is accessible through db.vectorstore.dataset, and the data structure can be summarized using db.vectorstore.summary(), which shows 4 tensors with 10 samples:
db.vectorstore.summary()
We will now set up QA on our vector store with GPT-3.5-Turbo as our LLM.
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
# Re-load the vector store in case it's no longer initialized
# db = DeepLake(dataset_path=dataset_path, embedding=embedding)
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(model='gpt-3.5-turbo'), chain_type="stuff", retriever=db.as_retriever())
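The "stuff" chain type places every retrieved passage into a single prompt for the LLM. The sketch below illustrates that idea only; build_stuffed_prompt, the template wording, and the sample passages are all made up for this example, and the prompt LangChain actually sends will differ.

```python
def build_stuffed_prompt(question, passages):
    """Concatenate every retrieved passage into one prompt for the LLM."""
    context = "\n\n".join(passages)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder passages, not real retrieval results.
prompt = build_stuffed_prompt(
    "Why does the military not say 24:00?",
    ["Passage one about time notation.", "Passage two about midnight."],
)
print(prompt)
```

Because all passages share one prompt, this approach works best when the retrieved text fits comfortably within the model's context window.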
Let's try running a prompt and check the output. Internally, this API performs an embedding search to find the most relevant data to feed into the LLM context.
query = 'Why does the military not say 24:00?'
qa.run(query)
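The embedding search mentioned above can be pictured as ranking stored vectors by their similarity to the query vector. Here is a toy sketch using exhaustive cosine similarity in plain Python; the helper names and the 3-dimensional vectors are invented for illustration, and the distance metric and indexing that Deep Lake actually uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy vectors made up for this example; real embeddings from
# text-embedding-3-small have far more dimensions.
stored = [
    ("doc about clocks", [0.9, 0.1, 0.0]),
    ("doc about cooking", [0.0, 0.2, 0.9]),
    ("doc about calendars", [0.7, 0.6, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], stored, k=2))
# ['doc about clocks', 'doc about calendars']
```

The texts returned by this ranking step are what a retriever would hand to the LLM as context.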
Et voila!