Prompt Chain

Query Deep Lake Datasets with LangChain and OpenAI

Build a question answering system using LangChain, Deep Lake, and OpenAI embeddings to query text datasets.

Works with openai, langchain, deeplake

Spark score: 59 out of 100
Updated today
Version 1.0.0
Models: gpt 4o, gpt 4

Why it matters

Build a question-answering system that leverages LangChain and OpenAI to query and retrieve information from a Deep Lake vector store.

Outcomes

What it gets done

01

Load and index text data into a Deep Lake vector store.

02

Initialize a LangChain vector store with Deep Lake.

03

Perform semantic searches on the indexed data.

04

Generate answers to user queries using LLMs.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-deeplakelangchainqa | bash

Steps

Steps in the chain

01
Load a Deep Lake text dataset

This example uses a 20,000-sample subset of the cohere-wikipedia-22 dataset.

02
Initialize a Deep Lake vector store with LangChain

Define a dataset_path where the Deep Lake vector store will house the text embeddings. Set up OpenAI's text-embedding-3-small as the embedding function, then initialize a Deep Lake vector store at dataset_path.

03
Add text to the vector store

Populate the vector store with samples, one batch at a time, using the add_texts method.

04
Run queries on the database

Set up QA on the vector store with GPT-3.5-Turbo as the LLM, then run prompts and check the output. Internally, the QA chain performs an embedding search to find the most relevant data to feed into the LLM context.

05
Done

The question answering system with LangChain, Deep Lake, and OpenAI is complete.

Overview

Question Answering with LangChain, Deep Lake, & OpenAI

What it does

This prompt chain implements a question answering system using LangChain, Deep Lake as a vector store, and OpenAI embeddings. It loads a Deep Lake text dataset, initializes a Deep Lake vector store with LangChain, adds text to the vector store, and runs queries on the database.

How it connects

Use this prompt chain when you need to build a system that can answer questions based on a specific text dataset. Do not use this if you do not intend to use OpenAI for embeddings and LangChain for orchestration.

Source README

Question Answering with LangChain, Deep Lake, & OpenAI

This notebook shows how to implement a question answering system with LangChain, Deep Lake as a vector store and OpenAI embeddings. We will take the following steps to achieve this:

  1. Load a Deep Lake text dataset
  2. Initialize a Deep Lake vector store with LangChain
  3. Add text to the vector store
  4. Run queries on the database
  5. Done!

You can also follow other Deep Lake tutorials, such as question answering over any type of data (PDFs, JSON, CSV, text), chatting with any data stored in Deep Lake, code understanding, question answering over PDFs, and recommending songs.

Install requirements

Let's install the following packages.

!pip install deeplake langchain openai tiktoken tqdm

Authentication

Provide your OpenAI API key here:

import getpass
import os

os.environ['OPENAI_API_KEY'] = getpass.getpass()
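
If the key is already exported in your shell, prompting again is unnecessary. The following is a minimal sketch, assuming you want to prompt only when the variable is unset; the helper name ensure_api_key is hypothetical and not part of any library.

```python
import getpass
import os


def ensure_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, prompting only when it is unset."""
    key = os.environ.get(var_name)
    if not key:
        # getpass hides the typed value so it does not end up in notebook output
        key = getpass.getpass(f"{var_name}: ")
        os.environ[var_name] = key
    return key

# Usage (hypothetical): call ensure_api_key() once before creating any OpenAI client.
```

This keeps the key out of the notebook source while still allowing non-interactive runs where the variable is set ahead of time.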

Load a Deep Lake text dataset

We will use a 20,000-sample subset of the cohere-wikipedia-22 dataset for this example.

import deeplake

ds = deeplake.load("hub://activeloop/cohere-wikipedia-22-sample")
ds.summary()

Let's take a look at a few samples:

ds[:3].text.data()["value"]

LangChain's Deep Lake vector store

Let's define a dataset_path. This is where your Deep Lake vector store will house the text embeddings.

dataset_path = 'wikipedia-embeddings-deeplake'

We will set up OpenAI's text-embedding-3-small as our embedding function and initialize a Deep Lake vector store at dataset_path...

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

embedding = OpenAIEmbeddings(model="text-embedding-3-small")
db = DeepLake(dataset_path, embedding=embedding, overwrite=True)

... and populate it with samples, one batch at a time, using the add_texts method.

from tqdm.auto import tqdm

batch_size = 100

nsamples = 10  # for testing. Replace with len(ds) to append everything
for i in tqdm(range(0, nsamples, batch_size)):
    # find end of batch
    i_end = min(nsamples, i + batch_size)

    batch = ds[i:i_end]
    id_batch = batch.ids.data()["value"]
    text_batch = batch.text.data()["value"]
    meta_batch = batch.metadata.data()["value"]

    db.add_texts(text_batch, metadatas=meta_batch, ids=id_batch)
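
The batching arithmetic in the loop above can be checked in isolation, without touching the dataset or the vector store. The following is a minimal sketch, assuming only the nsamples and batch_size values from the loop; the helper name batch_ranges is hypothetical.

```python
from typing import Iterator, Tuple


def batch_ranges(nsamples: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs that cover range(nsamples) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, nsamples, batch_size):
        # The last batch is clipped so it never runs past nsamples
        yield start, min(nsamples, start + batch_size)


print(list(batch_ranges(10, 100)))   # [(0, 10)]
print(list(batch_ranges(250, 100)))  # [(0, 100), (100, 200), (200, 250)]
```

With nsamples = 10 and batch_size = 100, the loop runs exactly once and adds all 10 samples in a single call.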

Run user queries on the database

The underlying Deep Lake dataset object is accessible through db.vectorstore.dataset, and the data structure can be summarized using db.vectorstore.summary(), which shows 4 tensors with 10 samples:

db.vectorstore.summary()

We will now set up QA on our vector store with GPT-3.5-Turbo as our LLM.

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Re-load the vector store in case it's no longer initialized
# db = DeepLake(dataset_path=dataset_path, embedding=embedding)

qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(model='gpt-3.5-turbo'), chain_type="stuff", retriever=db.as_retriever())
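
The chain_type="stuff" setting means the retrieved documents are all placed into a single prompt alongside the question. The following is a minimal sketch of that idea in plain Python; the function build_stuff_prompt and the template wording are hypothetical and are not the prompt LangChain actually uses.

```python
from typing import List


def build_stuff_prompt(question: str, documents: List[str]) -> str:
    """Concatenate every retrieved document into one context block."""
    context = "\n\n".join(documents)
    # Illustrative template only; the real chain ships its own wording
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


docs = [
    "In the 24-hour clock, the day starts at 00:00.",
    "Some conventions avoid ambiguous end-of-day notations.",
]
print(build_stuff_prompt("Why does the military not say 24:00?", docs))
```

Because every document goes into one prompt, this approach is simple but limited by the model's context window, so the number of retrieved documents has to stay small.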

Let's try running a prompt and check the output. Internally, this API performs an embedding search to find the most relevant data to feed into the LLM context.

query = 'Why does the military not say 24:00?'
qa.run(query)
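
The embedding search mentioned above can be illustrated with a toy example: embed the query, score each stored vector against it, and keep the best matches. This is a sketch under stated assumptions, not Deep Lake's implementation. The three-dimensional vectors are invented for illustration, and cosine similarity is chosen here only as an example metric; the metric the store actually uses may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    texts: Sequence[str],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, v), t) for v, t in zip(doc_vecs, texts)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up texts and vectors standing in for real embeddings
texts = ["24-hour clock", "Volcanoes", "Military time"]
doc_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
query_vec = [1.0, 0.2, 0.0]

for score, text in top_k(query_vec, doc_vecs, texts):
    print(f"{score:.3f}  {text}")
```

The texts returned by a search like this are what the QA chain passes to the LLM as context.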

Et voila!
