Augment LLM Answers with External Knowledge
Query relevant contexts from Pinecone and pass them to a generative OpenAI model to generate an answer backed by real data sources.
Why it matters
Enhance Large Language Model (LLM) responses by retrieving relevant information from an external knowledge base, reducing hallucinations and improving factual accuracy.
Outcomes
What it gets done
Index external data into a vector database (Pinecone) using embeddings.
Query the vector database for semantically relevant information.
Pass retrieved context to an LLM for augmented question answering.
Overcome LLM limitations in answering specific or niche domain questions.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/oai-genqa | bash
Steps in the chain
Learn how GPT models can hallucinate or make up information, especially for specific topics outside their general knowledge. Understand that state-of-the-art LLMs may not always answer questions correctly, particularly in specialized domains.
Understand RAG as a technique that adds an information retrieval component to the generation process. This allows relevant information to be retrieved and fed into the generation model as a secondary source of information, which helps reduce hallucinations.
Use the text-embedding-3-small model to create dense vector embeddings, which are numerical representations of the meaning behind sentences. These embeddings will be used for semantic search and retrieval.
Download the jamescalam/youtube-transcriptions dataset from Hugging Face Datasets containing transcribed audio from ML and tech YouTube channels. Merge many small snippets from each video to create substantial chunks of text with more information.
Set up a Pinecone vector database connection using a free API key. Create a new index to store embeddings and enable efficient vector search through all embedded text chunks.
Embed the prepared dataset chunks using OpenAI's text-embedding-3-small model and populate the Pinecone index with these embeddings to build the knowledge base.
Create a query vector using the text-embedding-3-small model for your question. Search the Pinecone index to retrieve semantically relevant text chunks that will serve as context for the LLM.
Overview
Retrieval Augmented Generative Question Answering with Pinecone
What it does
This notebook demonstrates how to query relevant contexts from Pinecone and pass them to a generative OpenAI model to generate an answer backed by real data sources. This approach uses Pinecone as an external knowledge base, akin to long-term memory for the generative model, to address limitations of LLMs that may not have specific information or can sometimes generate incorrect responses.
How it connects
Use this workflow when you need to provide specific information to a generative model to help it answer questions. For example, when asking a specific question about training a sentence transformer model, a standard LLM might provide an incorrect answer. This workflow can be used to retrieve the correct answer from a knowledge base.
Source README
Retrieval Augmented Generative Question Answering with Pinecone
Fixing LLMs that Hallucinate
In this notebook we will learn how to retrieve contexts relevant to our queries from Pinecone and pass them to a generative OpenAI model to generate an answer backed by real data sources.
A common problem with using GPT-3 to factually answer questions is that GPT-3 can sometimes make things up. The GPT models have a broad range of general knowledge, but this does not necessarily apply to more specific information. For that we use the Pinecone vector database as our "external knowledge base" - like long-term memory for GPT-3.
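Required installs for this notebook are listed below. The code uses the pre-1.0 openai interface (openai.Completion, openai.Embedding) and the pinecone.init style client, so both packages are pinned to versions that still provide them:
!pip install -qU "openai<1.0" "pinecone-client<3" datasets tqdm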
import openai
# get API key from top-right dropdown on OpenAI website
openai.api_key = "OPENAI_API_KEY"
For many questions, state-of-the-art (SOTA) LLMs are more than capable of answering correctly.
query = "who was the 12th person on the moon and when did they land?"
# now query `gpt-3.5-turbo-instruct` WITHOUT context
res = openai.Completion.create(
    model='gpt-3.5-turbo-instruct',
    prompt=query,
    temperature=0,
    max_tokens=400,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None
)
res['choices'][0]['text'].strip()
However, that isn't always the case. First, let's rewrite the above into a simple function so we're not repeating it every time.
def complete(prompt):
    # query the completion model with a fixed set of generation parameters
    res = openai.Completion.create(
        model='gpt-3.5-turbo-instruct',
        prompt=prompt,
        temperature=0,
        max_tokens=400,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )
    return res['choices'][0]['text'].strip()
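The calls above use the pre-1.0 openai interface. As a rough sketch, assuming version 1.0 or later of the library, the same helper might look like the following. The client is passed in as an argument so the function can be exercised without network access; `complete_v1` is a hypothetical name, not something from the notebook.

```python
def complete_v1(client, prompt, model="gpt-3.5-turbo-instruct"):
    """Return the stripped completion text for `prompt`.

    `client` is assumed to expose `completions.create(...)` the way
    `openai.OpenAI()` does in openai>=1.0.
    """
    res = client.completions.create(
        model=model,
        prompt=prompt,
        temperature=0,
        max_tokens=400,
    )
    # responses are objects in the 1.x client, so use attribute access
    return res.choices[0].text.strip()
```

With the 1.x library installed, this would be called as `complete_v1(OpenAI(), query)` after `from openai import OpenAI`, with the API key read from the `OPENAI_API_KEY` environment variable.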
Now let's ask a more specific question about training a type of transformer model called a sentence transformer. The ideal answer we'd be looking for is "Multiple Negatives Ranking (MNR) loss".
Don't worry if this is a new term to you; you don't need to know it to follow what we're demonstrating here.
query = (
    "Which training method should I use for sentence transformers when " +
    "I only have pairs of related sentences?"
)
complete(query)
One of the common answers we get to this is:
The best training method to use for fine-tuning a pre-trained model with sentence transformers is the Masked Language Model (MLM) training. MLM training involves randomly masking some of the words in a sentence and then training the model to predict the masked words. This helps the model to learn the context of the sentence and better understand the relationships between words.
This answer seems pretty convincing, right? Yet it's wrong. MLM is typically used in the pretraining step of a transformer model, but it cannot be used to fine-tune a sentence transformer and has nothing to do with having "pairs of related sentences".
An alternative answer we receive (and the one we returned above) is that a supervised learning approach is the most suitable. This is completely true, but it's not specific and doesn't answer the question.
We have two options for enabling our LLM to understand and correctly answer this question:
1. We fine-tune the LLM on text data covering the topic mentioned, likely on articles and papers talking about sentence transformers, semantic search training methods, etc.
2. We use Retrieval Augmented Generation (RAG), a technique that adds an information retrieval component to the generation process. This allows us to retrieve relevant information and feed it into the generation model as a secondary source of information.
We will demonstrate option 2.
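The flow of option 2 can be sketched in a few lines before any real services are involved. In this hypothetical outline, `retrieve_fn` and `generate_fn` are placeholders for the Pinecone search and the OpenAI completion call built later in the notebook; the function names and the toy data are illustrative assumptions.

```python
def answer_with_rag(question, retrieve_fn, generate_fn, top_k=3):
    """Retrieve context for `question`, then ask the generator to answer from it."""
    # 1. retrieval: fetch the passages most relevant to the question
    contexts = retrieve_fn(question, top_k)
    # 2. augmentation: place the passages in the prompt ahead of the question
    prompt = (
        "Answer the question based on the context below.\n\n"
        "Context:\n" + "\n\n---\n\n".join(contexts) +
        f"\n\nQuestion: {question}\nAnswer:"
    )
    # 3. generation: the model answers with the retrieved text in view
    return generate_fn(prompt)


# toy stand-ins so the sketch runs without any API keys
def toy_retrieve(question, top_k):
    docs = ["MNR loss trains on pairs of related sentences."]
    return docs[:top_k]


def toy_generate(prompt):
    return f"(model sees {len(prompt)} characters of prompt)"


print(answer_with_rag("Which loss for sentence pairs?", toy_retrieve, toy_generate))
```

Keeping retrieval and generation behind plain function arguments makes it easy to swap either side, for example a different vector store or a different model, without touching the prompt assembly.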
Building a Knowledge Base
With option 2, the retrieval of relevant information requires an external "Knowledge Base", a place where we can store information and efficiently retrieve it. We can think of this as the external long-term memory of our LLM.
We will need to retrieve information that is semantically related to our queries. To do this we need to use "dense vector embeddings". These can be thought of as numerical representations of the meaning behind our sentences.
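As a small illustration of why such vectors support semantic search, the sketch below compares hand-made three-dimensional vectors with cosine similarity, the metric the Pinecone index is created with later in this notebook. The vectors are invented for the example; real embeddings from the model have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# made-up vectors standing in for sentence embeddings
query_vec = [0.9, 0.1, 0.0]
related_vec = [0.8, 0.2, 0.1]
unrelated_vec = [0.0, 0.1, 0.9]

# the related vector scores higher than the unrelated one
print(cosine_similarity(query_vec, related_vec) > cosine_similarity(query_vec, unrelated_vec))
# True
```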
To create these dense vectors we use the text-embedding-3-small model.
We have already authenticated our OpenAI connection, so to create an embedding we just do:
embed_model = "text-embedding-3-small"
res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], model=embed_model
)
In the response res we will find a JSON-like object containing our new embeddings within the 'data' field.
res.keys()
Inside 'data' we will find two records, one for each of the two sentences we just embedded. Each vector embedding contains 1536 dimensions (the output dimensionality of the text-embedding-3-small model).
len(res['data'])
len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])
We will apply this same embedding logic to a dataset containing information relevant to our query (and many other queries on the topics of ML and AI).
Data Preparation
The dataset we will be using is jamescalam/youtube-transcriptions from Hugging Face Datasets. It contains transcribed audio from several ML and tech YouTube channels. We download it with:
from datasets import load_dataset
data = load_dataset('jamescalam/youtube-transcriptions', split='train')
data
data[0]
The dataset contains many small snippets of text data. We will need to merge many snippets from each video to create more substantial chunks of text that contain more information.
from tqdm.auto import tqdm
new_data = []
window = 20  # number of sentences to combine
stride = 4  # number of sentences to 'stride' over, used to create overlap
for i in tqdm(range(0, len(data), stride)):
    i_end = min(len(data)-1, i+window)
    if data[i]['title'] != data[i_end]['title']:
        # in this case we skip this entry as we have start/end of two videos
        continue
    text = ' '.join(data[i:i_end]['text'])
    # create the new merged dataset
    new_data.append({
        'start': data[i]['start'],
        'end': data[i_end]['end'],
        'title': data[i]['title'],
        'text': text,
        'id': data[i]['id'],
        'url': data[i]['url'],
        'published': data[i]['published'],
        'channel_id': data[i]['channel_id']
    })
new_data[0]
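To see what `window` and `stride` do in isolation, here is a toy version of the same sliding-window idea on a plain list of strings. It is a simplified sketch: `merge_snippets` is a hypothetical helper, and it leaves out the video-boundary check and the metadata handled above.

```python
def merge_snippets(snippets, window=4, stride=2):
    """Join `window` consecutive snippets, advancing `stride` snippets each time."""
    chunks = []
    for i in range(0, len(snippets), stride):
        # slicing past the end of the list is safe; the last chunks are just shorter
        chunks.append(" ".join(snippets[i:i + window]))
    return chunks


snippets = ["s0", "s1", "s2", "s3", "s4", "s5"]
for chunk in merge_snippets(snippets):
    print(chunk)
# s0 s1 s2 s3
# s2 s3 s4 s5
# s4 s5
```

Because the stride is smaller than the window, neighbouring chunks share snippets, so text that falls near the edge of one chunk also appears further inside a neighbouring chunk.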
Now we need a place to store these embeddings and enable an efficient vector search through them all. To do that we use Pinecone. We can get a free API key and enter it below, where we will initialize our connection to Pinecone and create a new index.
import pinecone
index_name = 'openai-youtube-transcriptions'
# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone.init(
    api_key="PINECONE_API_KEY",
    environment="us-east1-gcp"  # may be different, check at app.pinecone.io
)
# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=len(res['data'][0]['embedding']),
        metric='cosine',
        metadata_config={'indexed': ['channel_id', 'published']}
    )
# connect to index
index = pinecone.Index(index_name)
# view index stats
index.describe_index_stats()
We can see the index is currently empty with a total_vector_count of 0. We can begin populating it with embeddings built by OpenAI's text-embedding-3-small model like so:
from tqdm.auto import tqdm
from time import sleep
batch_size = 100  # how many embeddings we create and insert at once
for i in tqdm(range(0, len(new_data), batch_size)):
    # find end of batch
    i_end = min(len(new_data), i+batch_size)
    meta_batch = new_data[i:i_end]
    # get ids
    ids_batch = [x['id'] for x in meta_batch]
    # get texts to encode
    texts = [x['text'] for x in meta_batch]
    # create embeddings (retry after a pause if we hit a RateLimitError)
    done = False
    while not done:
        try:
            res = openai.Embedding.create(input=texts, model=embed_model)
            done = True
        except openai.error.RateLimitError:
            sleep(5)
    embeds = [record['embedding'] for record in res['data']]
    # cleanup metadata
    meta_batch = [{
        'start': x['start'],
        'end': x['end'],
        'title': x['title'],
        'text': x['text'],
        'url': x['url'],
        'published': x['published'],
        'channel_id': x['channel_id']
    } for x in meta_batch]
    to_upsert = list(zip(ids_batch, embeds, meta_batch))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)
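The loop above sleeps a fixed five seconds and retries indefinitely. One possible refinement, sketched here as an assumption rather than the notebook's method, is a bounded retry with exponential backoff. `with_backoff` is a hypothetical helper; which exception types to retry on depends on the client library in use.

```python
import time


def with_backoff(fn, retry_on=(Exception,), max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a listed exception wait base_delay * 2**attempt and try again."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
```

In the batching loop this could wrap the embedding call, for example `res = with_backoff(lambda: openai.Embedding.create(input=texts, model=embed_model))`. The `sleep` argument exists only so the waiting can be replaced in tests.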
Now we search. For this we need to create a query vector xq:
res = openai.Embedding.create(
    input=[query],
    model=embed_model
)
# extract the query embedding
xq = res['data'][0]['embedding']
# get the most relevant contexts from Pinecone (metadata holds the text)
res = index.query(xq, top_k=2, include_metadata=True)
res
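Conceptually, a cosine-metric query returns the stored vectors that point in the direction closest to the query vector. The following in-memory sketch mimics that with a brute-force scan; it is an illustration only and says nothing about how Pinecone implements search. The `top_k_matches` helper and the toy records are invented for the example.

```python
import math


def top_k_matches(query_vec, records, top_k=2):
    """Return the top_k (score, id) pairs ranked by cosine similarity.

    `records` is a list of (id, vector) tuples.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    scored = [(cosine(query_vec, vec), rec_id) for rec_id, vec in records]
    # highest similarity first
    return sorted(scored, reverse=True)[:top_k]


records = [
    ("doc-a", [1.0, 0.0]),
    ("doc-b", [0.7, 0.7]),
    ("doc-c", [0.0, 1.0]),
]
print([rec_id for _, rec_id in top_k_matches([0.9, 0.1], records)])
# ['doc-a', 'doc-b']
```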
limit = 3750  # maximum number of characters of context to include
def retrieve(query):
    res = openai.Embedding.create(
        input=[query],
        model=embed_model
    )
    # extract the query embedding
    xq = res['data'][0]['embedding']
    # get relevant contexts from Pinecone
    res = index.query(xq, top_k=3, include_metadata=True)
    contexts = [
        x['metadata']['text'] for x in res['matches']
    ]
    # build our prompt with the retrieved contexts included
    prompt_start = (
        "Answer the question based on the context below.\n\n"+
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    # append contexts until adding another would reach the limit
    selected = []
    for context in contexts:
        if len("\n\n---\n\n".join(selected + [context])) >= limit:
            break
        selected.append(context)
    prompt = (
        prompt_start +
        "\n\n---\n\n".join(selected) +
        prompt_end
    )
    return prompt
# first we retrieve relevant items from Pinecone
query_with_contexts = retrieve(query)
query_with_contexts
# then we complete the context-infused query
complete(query_with_contexts)
And we get a pretty great answer straight away, specifying to use multiple-rankings loss (also called multiple negatives ranking loss).