Prompt Chain

Query Deep Lake Datasets with LangChain and OpenAI

Name: Query Deep Lake Datasets with LangChain and OpenAI
Availability: OnlineOnly
Author: OpenAI Cookbook

Build a question answering system using LangChain, Deep Lake, and OpenAI embeddings to query text datasets.

Works with openai, langchain, deeplake

Version 1.0.0

Models

gpt-4o, gpt-4

Why it matters

Build a question-answering system that leverages LangChain and OpenAI to query and retrieve information from a Deep Lake vector store.

Outcomes

What it gets done

Load and index text data into a Deep Lake vector store.

Initialize a LangChain vector store with Deep Lake.

Perform semantic searches on the indexed data.

Generate answers to user queries using LLMs.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-deeplakelangchainqa | bash

Steps

Steps in the chain

Load a Deep Lake text dataset

Load a Deep Lake text dataset. This example uses a 20,000-sample subset of the cohere-wikipedia-22 dataset.

Initialize a Deep Lake vector store with LangChain

Initialize a Deep Lake vector store with LangChain. Define a dataset_path where your Deep Lake vector store will house the text embeddings. Set up OpenAI's text-embedding-3-small as the embedding function and initialize a Deep Lake vector store at dataset_path.

Add text to the vector store

Populate the vector store with samples, one batch at a time, using the add_texts method.

Run queries on the database

Set up QA on the vector store with GPT-3.5-Turbo as the LLM. Run prompts and check the output. Internally, this API performs an embedding search to find the most relevant data to feed into the LLM context.

Done

The question answering system with LangChain, Deep Lake, and OpenAI is complete.

Overview

Question Answering with LangChain, Deep Lake, & OpenAI

What it does

This prompt chain implements a question answering system using LangChain, Deep Lake as a vector store, and OpenAI embeddings. It loads a Deep Lake text dataset, initializes a Deep Lake vector store with LangChain, adds text to the vector store, and runs queries on the database.

How it connects

Use this prompt chain when you need to build a system that can answer questions based on a specific text dataset. Do not use this if you do not intend to use OpenAI for embeddings and LangChain for orchestration.

Source README

Question Answering with LangChain, Deep Lake, & OpenAI

This notebook shows how to implement a question answering system with LangChain, Deep Lake as a vector store and OpenAI embeddings. We will take the following steps to achieve this:

Load a Deep Lake text dataset
Initialize a Deep Lake vector store with LangChain
Add text to the vector store
Run queries on the database
Done!

You can also follow other tutorials, such as question answering over any type of data (PDFs, JSON, CSV, text): chatting with any data stored in Deep Lake, code understanding, question answering over PDFs, or recommending songs.

Install requirements

Let's install the following packages.

!pip install deeplake langchain openai tiktoken

Authentication

Provide your OpenAI API key here:

import getpass
import os

os.environ['OPENAI_API_KEY'] = getpass.getpass()
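When the notebook is re-run, it can be convenient to prompt only if the key is not already set. The following is a minimal sketch of that idea; `ensure_api_key` is a hypothetical helper introduced here for illustration and is not part of the original notebook or of any library.

```python
import getpass
import os


def ensure_api_key(env=os.environ, prompt=getpass.getpass):
    """Return the OpenAI API key, prompting only when it is not already set.

    `env` and `prompt` are injectable so the helper can be exercised
    without touching the real environment or the terminal.
    """
    if not env.get("OPENAI_API_KEY"):
        env["OPENAI_API_KEY"] = prompt("OpenAI API key: ")
    return env["OPENAI_API_KEY"]


# Demonstration with a stand-in mapping: the key is present, so no prompt appears.
demo_env = {"OPENAI_API_KEY": "sk-example"}
print(ensure_api_key(demo_env))  # sk-example
```

In the notebook itself, calling ensure_api_key() with no arguments would act on os.environ and fall back to getpass only when needed.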

Load a Deep Lake text dataset

We will use a 20,000-sample subset of the cohere-wikipedia-22 dataset for this example.

import deeplake

ds = deeplake.load("hub://activeloop/cohere-wikipedia-22-sample")
ds.summary()

Let's take a look at a few samples:

ds[:3].text.data()["value"]

LangChain's Deep Lake vector store

Let's define a dataset_path, which is where your Deep Lake vector store will house the text embeddings.

dataset_path = 'wikipedia-embeddings-deeplake'

We will set up OpenAI's text-embedding-3-small as our embedding function and initialize a Deep Lake vector store at dataset_path...

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

embedding = OpenAIEmbeddings(model="text-embedding-3-small")
db = DeepLake(dataset_path, embedding=embedding, overwrite=True)

... and populate it with samples, one batch at a time, using the add_texts method.

from tqdm.auto import tqdm

batch_size = 100

nsamples = 10  # for testing. Replace with len(ds) to append everything
for i in tqdm(range(0, nsamples, batch_size)):
    # find end of batch
    i_end = min(nsamples, i + batch_size)

    batch = ds[i:i_end]
    id_batch = batch.ids.data()["value"]
    text_batch = batch.text.data()["value"]
    meta_batch = batch.metadata.data()["value"]

    db.add_texts(text_batch, metadatas=meta_batch, ids=id_batch)
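The index arithmetic in the loop above can be checked in isolation. The sketch below applies the same start/end pattern to a plain Python list; `iter_batches` and the stand-in `texts` list are hypothetical names used only for illustration, and no Deep Lake or LangChain call is involved.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        # Clamp the end index so the final batch may be shorter than batch_size.
        end = min(len(items), start + batch_size)
        yield items[start:end]


# Stand-in data, not the real dataset.
texts = [f"passage {n}" for n in range(250)]
batch_sizes = [len(batch) for batch in iter_batches(texts, 100)]
print(batch_sizes)  # [100, 100, 50]
```

With nsamples = 10 and batch_size = 100, as in the notebook's test setting, the same logic yields a single batch of 10 items.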

Run user queries on the database

The underlying Deep Lake dataset object is accessible through db.vectorstore.dataset, and the data structure can be summarized using db.vectorstore.summary(), which shows 4 tensors with 10 samples:

db.vectorstore.summary()

We will now set up QA on our vector store with GPT-3.5-Turbo as our LLM.

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Re-load the vector store in case it's no longer initialized
# (uses the same `embedding` keyword as the constructor call above)
# db = DeepLake(dataset_path=dataset_path, embedding=embedding)

qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(model='gpt-3.5-turbo'), chain_type="stuff", retriever=db.as_retriever())
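The "stuff" chain type places all retrieved passages into a single prompt for the LLM. The sketch below illustrates that idea only; `build_stuff_prompt` and its template wording are assumptions made for this example and are not LangChain's actual prompt.

```python
def build_stuff_prompt(question, passages):
    """Concatenate every retrieved passage into one context block, then ask the question."""
    context = "\n\n".join(passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Stand-in passages; in the real chain these would come from the retriever.
example_prompt = build_stuff_prompt(
    "Why does the military not say 24:00?",
    ["first retrieved passage", "second retrieved passage"],
)
print(example_prompt)
```

Because every passage shares one prompt, this approach is simple but only works while the combined text fits within the model's context window.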

Let's try running a prompt and check the output. Internally, this API performs an embedding search to find the most relevant data to feed into the LLM context.

query = 'Why does the military not say 24:00?'
qa.run(query)
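The embedding search mentioned above can be pictured as ranking stored vectors by their similarity to the query vector. The following is a toy sketch with hand-written 3-dimensional vectors; real embeddings have far more dimensions, and cosine similarity is used here only as one common choice, since the notebook does not state which metric Deep Lake applies. All names in the block are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k entries in `store` most similar to `query_vec`.

    `store` is a list of (text, vector) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up vectors standing in for real embeddings.
toy_store = [
    ("passage about 24-hour clocks", [0.9, 0.1, 0.0]),
    ("passage about encyclopedias", [0.0, 0.2, 0.9]),
    ("passage about midnight notation", [0.8, 0.3, 0.1]),
]
toy_query = [1.0, 0.2, 0.0]
print(top_k(toy_query, toy_store, k=2))
# ['passage about 24-hour clocks', 'passage about midnight notation']
```

The passages returned by a search of this kind are what the chain feeds into the LLM context.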

Et voilà!
