Prompt Chain

Augment LLM Answers with External Knowledge

Name: Augment LLM Answers with External Knowledge
Availability: OnlineOnly
Author: OpenAI Cookbook

Query relevant contexts from Pinecone and pass them to a generative OpenAI model to generate an answer backed by real data sources.

Works with: OpenAI, Pinecone, Hugging Face

Updated yesterday

Version 1.0.0

Models

gpt 4o

Why it matters

Enhance Large Language Model (LLM) responses by retrieving relevant information from an external knowledge base, reducing hallucinations and improving factual accuracy.

Outcomes

What it gets done

Index external data into a vector database (Pinecone) using embeddings.

Query the vector database for semantically relevant information.

Pass retrieved context to an LLM for augmented question answering.

Overcome LLM limitations in answering specific or niche domain questions.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-genqa | bash

Steps

Steps in the chain

Understand the Problem: LLM Hallucination

Learn how GPT models can hallucinate, or make up information, especially on specific topics outside their general knowledge. Understand that state-of-the-art LLMs do not always give factually correct answers, particularly in specialized domains.

Introduce Retrieval Augmented Generation (RAG)

Understand RAG as a technique that adds an information retrieval component to the generation process. Relevant information is retrieved and fed into the generation model as a secondary source of information, which reduces hallucinations.

Create Dense Vector Embeddings

Use the text-embedding-3-small model to create dense vector embeddings, which are numerical representations of the meaning behind sentences. These embeddings will be used for semantic search and retrieval.

Prepare and Chunk Dataset

Download the jamescalam/youtube-transcriptions dataset from Hugging Face Datasets containing transcribed audio from ML and tech YouTube channels. Merge many small snippets from each video to create substantial chunks of text with more information.

Initialize Pinecone and Create Index

Set up a Pinecone vector database connection using a free API key. Create a new index to store embeddings and enable efficient vector search through all embedded text chunks.

Populate Pinecone Index with Embeddings

Embed the prepared dataset chunks using OpenAI's text-embedding-3-small model and populate the Pinecone index with these embeddings to build the knowledge base.

Query and Retrieve Relevant Context

Create a query vector using the text-embedding-3-small model for your question. Search the Pinecone index to retrieve semantically relevant text chunks that will serve as context for the LLM.

Overview

Retrieval Augmented Generative Question Answering with Pinecone

What it does

This notebook demonstrates how to query relevant contexts from Pinecone and pass them to a generative OpenAI model to generate an answer backed by real data sources. This approach uses Pinecone as an external knowledge base, akin to long-term memory for the generative model, to address limitations of LLMs that may not have specific information or can sometimes generate incorrect responses.

How it connects

Use this workflow when you need to provide specific information to a generative model to help it answer questions. For example, when asking a specific question about training a sentence transformer model, a standard LLM might provide an incorrect answer. This workflow can be used to retrieve the correct answer from a knowledge base.

Source README

Retrieval Augmented Generative Question Answering with Pinecone

Fixing LLMs that Hallucinate

In this notebook we will learn how to retrieve contexts relevant to our queries from Pinecone and pass them to a generative OpenAI model to generate an answer backed by real data sources.

A common problem with using GPT-3 to factually answer questions is that GPT-3 can sometimes make things up. The GPT models have a broad range of general knowledge, but this does not necessarily apply to more specific information. For that we use the Pinecone vector database as our "external knowledge base" - like long-term memory for GPT-3.

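The code in this notebook uses the legacy (pre-1.0) openai interface and the pinecone.init client, so we pin both packages. Required installs for this notebook are:

!pip install -qU "openai<1" "pinecone-client<3" datasets tqdm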

import openai

# get API key from top-right dropdown on OpenAI website
openai.api_key = "OPENAI_API_KEY"

For many questions, state-of-the-art (SOTA) LLMs are more than capable of answering correctly.

query = "who was the 12th person on the moon and when did they land?"

# now query `gpt-3.5-turbo-instruct` WITHOUT context
res = openai.Completion.create(
    engine='gpt-3.5-turbo-instruct',
    prompt=query,
    temperature=0,
    max_tokens=400,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None
)

res['choices'][0]['text'].strip()

However, that isn't always the case. First, let's wrap the call above in a simple function so we don't have to rewrite it every time.

def complete(prompt):
    res = openai.Completion.create(
        engine='gpt-3.5-turbo-instruct',
        prompt=prompt,
        temperature=0,
        max_tokens=400,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )
    return res['choices'][0]['text'].strip()

Now let's ask a more specific question about training a type of transformer model called a sentence transformer. The ideal answer we'd be looking for is "Multiple Negatives Ranking (MNR) loss".

Don't worry if this is a new term to you; you don't need to understand it to follow what we're demonstrating here.

query = (
    "Which training method should I use for sentence transformers when " +
    "I only have pairs of related sentences?"
)

complete(query)

One of the common answers we get to this is:

The best training method to use for fine-tuning a pre-trained model with sentence transformers is the Masked Language Model (MLM) training. MLM training involves randomly masking some of the words in a sentence and then training the model to predict the masked words. This helps the model to learn the context of the sentence and better understand the relationships between words.

This answer seems pretty convincing, right? Yet it's wrong. MLM is typically used in the pretraining step of a transformer model, but it cannot be used to fine-tune a sentence transformer and has nothing to do with having "pairs of related sentences".

Another answer we sometimes receive (and the one returned above) is that a supervised learning approach is the most suitable. This is completely true, but it's not specific and doesn't answer the question.

We have two options for enabling our LLM to understand and correctly answer this question:

1. We fine-tune the LLM on text data covering the topic, likely articles and papers about sentence transformers, semantic search training methods, etc.
2. We use Retrieval Augmented Generation (RAG), a technique that adds an information retrieval component to the generation process, allowing us to retrieve relevant information and feed it into the generation model as a secondary source of information.

We will demonstrate option 2.
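
The flow of option 2 can be sketched without any external services. This is a minimal illustration under stated assumptions, not the notebook's implementation: KNOWLEDGE_BASE, toy_retrieve, toy_generate, and answer are hypothetical names, word overlap stands in for the embedding search, and toy_generate stands in for the LLM call built later in this notebook.

```python
# Toy retrieval-augmented generation loop. Everything here is a hypothetical
# stand-in: word overlap replaces embedding similarity, and toy_generate
# replaces the LLM call.

KNOWLEDGE_BASE = [
    "Multiple negatives ranking loss works with pairs of related sentences.",
    "Masked language modeling is used when pretraining transformer models.",
    "Pinecone is a vector database used for semantic search.",
]


def toy_retrieve(query, documents, top_k=1):
    """Rank documents by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(doc.lower().rstrip(".").split())), doc)
        for doc in documents
    ]
    # highest overlap first; the sort is stable, so ties keep their order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def toy_generate(prompt):
    """Placeholder for the LLM call: it only reports the size of the prompt."""
    return f"(model output for a prompt of {len(prompt)} characters)"


def answer(query, documents):
    """Retrieve context first, then hand context plus question to the generator."""
    contexts = toy_retrieve(query, documents)
    prompt = "Context:\n" + "\n".join(contexts) + f"\n\nQuestion: {query}\nAnswer:"
    return prompt, toy_generate(prompt)


prompt, output = answer(
    "Which loss should I use with pairs of related sentences?", KNOWLEDGE_BASE
)
print(prompt)
print(output)
```

The point of the sketch is the ordering: the generator never sees the bare question, only the question together with whatever the retriever returned.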

Building a Knowledge Base

With option 2, retrieving relevant information requires an external "Knowledge Base", a place where we can store information and retrieve it efficiently. We can think of this as the external long-term memory of our LLM.

We will need to retrieve information that is semantically related to our queries. To do this we use "dense vector embeddings", which can be thought of as numerical representations of the meaning behind our sentences.
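
To make "semantically related" concrete, here is a minimal sketch of comparing vectors by cosine similarity, the metric used for the index created later in this notebook. The three-dimensional vectors are hand-made for illustration only and are not produced by any embedding model; cosine_similarity and toy_vectors are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-dimensional vectors, purely illustrative; real embeddings
# come from a model and have many more dimensions.
toy_vectors = {
    "how do I train a model?": [0.9, 0.1, 0.0],
    "model training tips": [0.8, 0.2, 0.1],
    "best pizza in town": [0.0, 0.1, 0.9],
}

query_vec = toy_vectors["how do I train a model?"]
for text, vec in toy_vectors.items():
    # vectors pointing in a similar direction score close to 1
    print(f"{cosine_similarity(query_vec, vec):.3f}  {text}")
```

Sentences with similar meaning should end up with vectors pointing in a similar direction, so the training-related sentence scores much higher against the query than the unrelated one.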

To create these dense vectors we use the text-embedding-3-small model.

We have already authenticated our OpenAI connection, so to create an embedding we just do:

embed_model = "text-embedding-3-small"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=embed_model
)

In the response res we will find a JSON-like object containing our new embeddings within the 'data' field.

res.keys()

Inside 'data' we will find two records, one for each of the two sentences we just embedded. Each vector embedding contains 1536 dimensions (the output dimensionality of the text-embedding-3-small model).

len(res['data'])

len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])

We will apply this same embedding logic to a dataset containing information relevant to our query (and many other queries on the topics of ML and AI).

Data Preparation

The dataset we will be using is jamescalam/youtube-transcriptions from Hugging Face Datasets. It contains transcribed audio from several ML and tech YouTube channels. We download it with:

from datasets import load_dataset

data = load_dataset('jamescalam/youtube-transcriptions', split='train')
data

data[0]

The dataset contains many small snippets of text data. We will need to merge many snippets from each video to create more substantial chunks of text that contain more information.

from tqdm.auto import tqdm

new_data = []

window = 20  # number of sentences to combine
stride = 4  # number of sentences to 'stride' over, used to create overlap

for i in tqdm(range(0, len(data), stride)):
    i_end = min(len(data)-1, i+window)
    if data[i]['title'] != data[i_end]['title']:
        # in this case we skip this entry as we have start/end of two videos
        continue
    text = ' '.join(data[i:i_end]['text'])
    # create the new merged dataset
    new_data.append({
        'start': data[i]['start'],
        'end': data[i_end]['end'],
        'title': data[i]['title'],
        'text': text,
        'id': data[i]['id'],
        'url': data[i]['url'],
        'published': data[i]['published'],
        'channel_id': data[i]['channel_id']
    })

new_data[0]
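
To see how window and stride interact in the loop above, here is a toy version on a plain Python list. It is a simplified sketch: sliding_windows is a hypothetical helper, the snippet names are placeholders, and unlike the loop above it does not check that a window stays within a single video.

```python
def sliding_windows(items, window, stride):
    """Return overlapping windows of `items`; windows near the end may be shorter."""
    return [items[i:i + window] for i in range(0, len(items), stride)]


snippets = [f"s{n}" for n in range(10)]  # stand-in for transcript snippets
for chunk in sliding_windows(snippets, window=4, stride=2):
    print(" ".join(chunk))
```

Consecutive windows share window - stride items, so with window = 20 and stride = 4 each merged chunk overlaps its neighbor by 16 snippets. The overlap keeps a sentence that falls on a chunk boundary available, with surrounding context, in at least one chunk.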

Now we need a place to store these embeddings and enable an efficient vector search through them all. To do that we use Pinecone. We can get a free API key and enter it below, where we initialize our connection to Pinecone and create a new index.

import pinecone

index_name = 'openai-youtube-transcriptions'

# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone.init(
    api_key="PINECONE_API_KEY",
    environment="us-east1-gcp"  # may be different, check at app.pinecone.io
)

# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=len(res['data'][0]['embedding']),
        metric='cosine',
        metadata_config={'indexed': ['channel_id', 'published']}
    )
# connect to index
index = pinecone.Index(index_name)
# view index stats
index.describe_index_stats()

We can see the index is currently empty, with a total_vector_count of 0. We can begin populating it with embeddings built by OpenAI's text-embedding-3-small model like so:

from tqdm.auto import tqdm
from time import sleep

batch_size = 100  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(new_data), batch_size)):
    # find end of batch
    i_end = min(len(new_data), i+batch_size)
    meta_batch = new_data[i:i_end]
    # get ids
    ids_batch = [x['id'] for x in meta_batch]
    # get texts to encode
    texts = [x['text'] for x in meta_batch]
    # create embeddings (retry after a pause whenever we hit a RateLimitError)
    done = False
    while not done:
        try:
            res = openai.Embedding.create(input=texts, engine=embed_model)
            done = True
        except openai.error.RateLimitError:
            sleep(5)
    embeds = [record['embedding'] for record in res['data']]
    # cleanup metadata
    meta_batch = [{
        'start': x['start'],
        'end': x['end'],
        'title': x['title'],
        'text': x['text'],
        'url': x['url'],
        'published': x['published'],
        'channel_id': x['channel_id']
    } for x in meta_batch]
    to_upsert = list(zip(ids_batch, embeds, meta_batch))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)
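
The retry loop above waits a fixed five seconds and retries indefinitely. A common alternative is exponential backoff with a cap on the number of attempts. The sketch below is one hedged way to write it: with_retries and flaky are hypothetical names, the failing call is simulated, and catching Exception broadly is a simplification (in practice you would likely catch only the client's rate limit error).

```python
import time


def with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)


# Hypothetical flaky call that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


delays = []
# record the delays instead of actually sleeping, so the demo runs instantly
result = with_retries(flaky, sleep=delays.append)
print(result, delays)
```

Passing the sleep function as a parameter keeps the helper easy to test, and the attempt cap means a persistent failure (for example an invalid API key) raises instead of looping forever.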

Now we search. For this we need to create a query vector xq:

res = openai.Embedding.create(
    input=[query],
    engine=embed_model
)

# retrieve from Pinecone
xq = res['data'][0]['embedding']

# get relevant contexts (including the questions)
res = index.query(xq, top_k=2, include_metadata=True)

res
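
The object returned by index.query holds a list of matches, and the retrieve function in this notebook reads the text stored in each match's metadata. Below is a minimal sketch of that access pattern on a hand-written dictionary. mock_res, extract_contexts, and the min_score cutoff are hypothetical, the values are made up, and the 'id' and 'score' fields are assumed to be present on each match.

```python
# Hypothetical response: only the 'matches' -> 'metadata' -> 'text' path is
# taken from this notebook; the ids, scores, titles, and texts are made up.
mock_res = {
    "matches": [
        {"id": "vid1-t0", "score": 0.83,
         "metadata": {"title": "Video A", "text": "first chunk of transcript"}},
        {"id": "vid2-t40", "score": 0.41,
         "metadata": {"title": "Video B", "text": "second chunk of transcript"}},
    ]
}


def extract_contexts(res, min_score=0.0):
    """Return (score, title, text) for every match scoring at least min_score."""
    return [
        (m["score"], m["metadata"]["title"], m["metadata"]["text"])
        for m in res["matches"]
        if m["score"] >= min_score
    ]


for score, title, text in extract_contexts(mock_res, min_score=0.5):
    print(f"{score:.2f} | {title} | {text}")
```

Dropping low-scoring matches is optional and not part of the original notebook; it is shown only as one way to avoid passing weakly related text to the model.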

limit = 3750  # maximum number of characters of retrieved context to put in the prompt

def retrieve(query):
    res = openai.Embedding.create(
        input=[query],
        engine=embed_model
    )

    # retrieve from Pinecone
    xq = res['data'][0]['embedding']

    # get relevant contexts
    res = index.query(xq, top_k=3, include_metadata=True)
    contexts = [
        x['metadata']['text'] for x in res['matches']
    ]

    # build our prompt with the retrieved contexts included
    prompt_start = (
        "Answer the question based on the context below.\n\n"+
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    # append contexts until hitting limit (also works with zero or one context)
    selected = []
    for context in contexts:
        if len("\n\n---\n\n".join(selected + [context])) >= limit:
            break
        selected.append(context)
    prompt = (
        prompt_start +
        "\n\n---\n\n".join(selected) +
        prompt_end
    )
    return prompt

# first we retrieve relevant items from Pinecone
query_with_contexts = retrieve(query)
query_with_contexts

# then we complete the context-infused query
complete(query_with_contexts)

And we get a pretty great answer straight away, one that recommends multiple-rankings loss (also called multiple negatives ranking loss).
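
The context-packing step inside retrieve can also be isolated as a standalone helper, which makes it testable without any API calls. This is a sketch under stated assumptions: build_prompt and its separator parameter are hypothetical names, and, as in the notebook, the limit counts characters rather than tokens.

```python
def build_prompt(query, contexts, limit=3750, separator="\n\n---\n\n"):
    """Pack contexts in order until the joined text would exceed `limit` characters."""
    selected = []
    for context in contexts:
        candidate = separator.join(selected + [context])
        if len(candidate) > limit:
            break  # adding this context would overflow the budget
        selected.append(context)
    return (
        "Answer the question based on the context below.\n\n"
        "Context:\n"
        + separator.join(selected)
        + f"\n\nQuestion: {query}\nAnswer:"
    )


# Three 10-character contexts with a 30-character budget: the separator is
# 7 characters long, so only the first two contexts fit.
demo = build_prompt("What is RAG?", ["a" * 10, "b" * 10, "c" * 10], limit=30)
print(demo)
```

Because the contexts arrive sorted by similarity, stopping at the first one that does not fit keeps the most relevant text and drops only the weaker matches.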
