Build Question Answering System with Langchain and AnalyticDB
End-to-end question answering workflow using Langchain, AnalyticDB vector database, and OpenAI embeddings to build a knowledge base that retrieves context and
Why it matters
Implement a question-answering system that uses Langchain, OpenAI embeddings, and AnalyticDB as the knowledge base. The chain retrieves the stored answers most similar to a question and passes them to an LLM as context, so the final answer draws on your own data.
Outcomes
What it gets done
Calculate document embeddings using OpenAI API.
Store embeddings in AnalyticDB to create a searchable knowledge base.
Vectorize user queries and perform nearest neighbor searches in AnalyticDB.
Utilize retrieved context to generate answers with an LLM.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/oai-qawithlangchainanalyticdbandopenai | bash
Steps in the chain
Install the following Python packages: openai, tiktoken, langchain and psycopg2cffi. openai provides convenient access to the OpenAI API. tiktoken is a fast BPE tokeniser for use with OpenAI's models. langchain makes it easier to build applications on top of LLMs. psycopg2cffi is used to interact with the vector database, but any other PostgreSQL client library is also acceptable.
The OpenAI API key is used for vectorization of the documents and queries. If you don't have an OpenAI API key, get one from https://platform.openai.com/account/api-keys. Once you get your key, add it to your environment variables as OPENAI_API_KEY.
To build the AnalyticDB connection string, you need the following parameters: PG_HOST, PG_PORT, PG_DATABASE, PG_USER, and PG_PASSWORD. Export them as environment variables first, then build the connection string from them.
Load the data, which contains natural questions and their answers. All the data will be used to create a Langchain application with AnalyticDB as the knowledge base.
Langchain is already integrated with AnalyticDB and performs all the indexing for a given list of documents. Store the set of answers you have. Once all the possible answers are stored in AnalyticDB, you can define the whole QA chain.
Once the data is in AnalyticDB you can start asking questions. A question is automatically vectorized by the OpenAI model, and the resulting vector is used to find possibly matching answers in AnalyticDB. Once retrieved, the most similar answers are incorporated into the prompt sent to the OpenAI large language model.
Provide your own prompt template to change the behaviour of the OpenAI LLM while still using the stuff chain type. Keep {context} and {question} as placeholders in your custom template.
Try using a different prompt template so the model: 1. Responds with a single-sentence answer if it knows it. 2. Suggests a random song title if it doesn't know the answer to your question.
Overview
Question Answering with Langchain, AnalyticDB and OpenAI
What it does
This prompt chain implements a complete question answering system that combines Langchain, AnalyticDB (Alibaba Cloud's PostgreSQL-compatible vector database), and OpenAI's embedding and language models. It calculates embeddings for documents using the OpenAI API, stores them in AnalyticDB to create a searchable knowledge base, converts user queries into embeddings, performs nearest neighbor searches to find relevant context, and uses an LLM to generate answers grounded in that context. The entire workflow is simplified through Langchain methods that handle indexing, retrieval, and answer generation.
How it connects
Use this workflow when you need to build a question answering system over a custom corpus of documents, especially when working within the Alibaba Cloud ecosystem or requiring AnalyticDB's PostgreSQL compatibility. The notebook demonstrates loading data containing natural questions and answers to create a Langchain application with AnalyticDB as the knowledge base. Do NOT use this if you need real-time streaming answers or if your use case requires databases other than AnalyticDB, since the implementation is specifically tied to AnalyticDB's vector search capabilities. Avoid this approach if you don'
Source README
Question Answering with Langchain, AnalyticDB and OpenAI
This notebook shows how to implement a Question Answering system with Langchain, AnalyticDB as a knowledge base, and OpenAI embeddings. If you are not familiar with AnalyticDB, it’s better to check out the Getting_started_with_AnalyticDB_and_OpenAI.ipynb notebook first.
This notebook presents an end-to-end process of:
- Calculating the embeddings with the OpenAI API.
- Storing the embeddings in an AnalyticDB instance to build a knowledge base.
- Converting a raw text query to an embedding with the OpenAI API.
- Using AnalyticDB to perform a nearest neighbour search in the created collection to find some context.
- Asking the LLM to find the answer in the given context.
All the steps will be simplified to calling some corresponding Langchain methods.
Prerequisites
For the purposes of this exercise we need to prepare a couple of things:
- An AnalyticDB cloud instance.
- Langchain as a framework.
- An OpenAI API key.
Install requirements
This notebook requires the following Python packages: openai, tiktoken, langchain and psycopg2cffi.
- openai provides convenient access to the OpenAI API.
- tiktoken is a fast BPE tokeniser for use with OpenAI's models.
- langchain helps us to build applications with LLMs more easily.
- psycopg2cffi is used to interact with the vector database, but any other PostgreSQL client library is also acceptable.
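Before going further, it can help to confirm that the four packages are actually importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical, and the check assumes each package's import name matches the name listed above.

```python
from importlib.util import find_spec

# Names taken from the requirements list above. Assumption: for these four
# packages the import name is the same as the name used to install them.
REQUIRED_PACKAGES = ["openai", "tiktoken", "langchain", "psycopg2cffi"]


def missing_packages(names):
    """Return the names that cannot be located on the current Python path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED_PACKAGES)
    if missing:
        # Print a ready-to-copy pip command for whatever is absent.
        print("Install first: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```

Running the script prints a pip command only for the packages that are absent, so it is safe to re-run after installing.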
Prepare your OpenAI API key
The OpenAI API key is used for vectorization of the documents and queries.
If you don't have an OpenAI API key, you can get one from https://platform.openai.com/account/api-keys.
Once you get your key, please add it to your environment variables as OPENAI_API_KEY.
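A quick way to verify that the key is visible to Python is to read it back from the environment. This is only a sketch: the function name get_openai_api_key is hypothetical, and it checks that the variable is set, not that the key is valid.

```python
import os


def get_openai_api_key(environ=os.environ):
    """Return the API key from the environment, or raise a clear error.

    `environ` is injectable so the check can be exercised without touching
    the real process environment.
    """
    key = environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the notebook."
        )
    return key
```

Calling get_openai_api_key() at the top of the notebook fails early with a readable message instead of a later authentication error.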
Prepare your AnalyticDB connection string
To build the AnalyticDB connection string, you need the following parameters: PG_HOST, PG_PORT, PG_DATABASE, PG_USER, and PG_PASSWORD. Export them as environment variables first, then build the connection string from them.
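The sketch below assembles the string from those five variables. Several things here are assumptions rather than part of the source: the SQLAlchemy-style URL form (postgresql+driver://user:password@host:port/database), the optional PG_DRIVER variable, the fallback defaults, and the helper name build_connection_string. Langchain's AnalyticDB integration appears to offer a similar helper (connection_string_from_db_params), so check your installed version before rolling your own.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(environ=os.environ):
    """Build a SQLAlchemy-style PostgreSQL URL from PG_* variables."""
    # PG_DRIVER and every default value below are assumptions for illustration.
    driver = environ.get("PG_DRIVER", "psycopg2cffi")
    host = environ.get("PG_HOST", "localhost")
    port = int(environ.get("PG_PORT", "5432"))
    database = environ.get("PG_DATABASE", "postgres")
    # Quote credentials so characters such as "@" or "/" do not break the URL.
    user = quote_plus(environ.get("PG_USER", "postgres"))
    password = quote_plus(environ.get("PG_PASSWORD", "postgres"))
    return f"postgresql+{driver}://{user}:{password}@{host}:{port}/{database}"
```

Quoting the credentials is a deliberate choice: a password containing "@" would otherwise be parsed as part of the host.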
Load data
In this section we are going to load the data containing some natural questions and answers to them. All the data will be used to create a Langchain application with AnalyticDB being the knowledge base.
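The source does not say how the data is stored, so the following sketch assumes the questions and answers live in two JSON files that each hold a list of strings. The file names used in the usage note and the helper names are hypothetical.

```python
import json


def load_json_list(path):
    """Load a JSON file that is expected to hold a list (e.g. of answers)."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError(
            f"{path} should contain a JSON list, got {type(data).__name__}"
        )
    return data


def preview(items, limit=3):
    """Return the first few items for a quick sanity check of the data."""
    return items[:limit]
```

With hypothetical files questions.json and answers.json in the working directory, loading would look like answers = load_json_list("answers.json") followed by preview(answers).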
Chain definition
Langchain is already integrated with AnalyticDB and performs all the indexing for a given list of documents. In our case we are going to store the set of answers we have.
At this stage all the possible answers are already stored in AnalyticDB, so we can define the whole QA chain.
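The two stages described above (index the answers, then define the chain) might be wired together roughly as follows. This is a sketch under assumptions, not the notebook's exact code: the import paths match legacy Langchain releases and have moved in newer ones, the helper name build_qa_chain is hypothetical, and running it needs a reachable AnalyticDB instance plus a valid OPENAI_API_KEY.

```python
def build_qa_chain(answers, connection_string):
    """Index `answers` in AnalyticDB and return a "stuff" QA chain over them."""
    # Imports are local so this module loads even without Langchain installed.
    # Assumption: legacy Langchain import paths; newer releases expose these
    # classes from separate packages such as langchain_community.
    from langchain.chains import RetrievalQA
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.llms import OpenAI
    from langchain.vectorstores import AnalyticDB

    # Embeds every answer with OpenAI and writes the vectors to AnalyticDB.
    embeddings = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment
    doc_store = AnalyticDB.from_texts(
        texts=answers,
        embedding=embeddings,
        connection_string=connection_string,
    )

    # "stuff" places all retrieved answers into a single prompt for the LLM.
    return RetrievalQA.from_chain_type(
        llm=OpenAI(),
        chain_type="stuff",
        retriever=doc_store.as_retriever(),
    )
```

Once built, the chain can be queried with something like qa = build_qa_chain(answers, connection_string) followed by qa.run("your question"); every call embeds the question and sends one completion request to OpenAI.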
Search data
Once the data is in AnalyticDB we can start asking questions. A question is automatically vectorized by the OpenAI model, and the resulting vector is used to find possibly matching answers in AnalyticDB. Once retrieved, the most similar answers are incorporated into the prompt sent to the OpenAI large language model.
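To make the retrieval step concrete, here is a toy nearest neighbour search in plain Python. It is only an illustration: the three-dimensional vectors are made up, real OpenAI embeddings have many more dimensions, and cosine similarity is an assumption, since the distance metric AnalyticDB uses internally may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_answers(query_vector, stored, k=2):
    """Return the k answer texts whose vectors are most similar to the query.

    `stored` is a list of (answer_text, vector) pairs, standing in for the
    rows AnalyticDB would hold.
    """
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up vectors for illustration only.
STORED = [
    ("Paris is the capital of France.", [1.0, 0.0, 0.0]),
    ("The Nile is in Africa.", [0.0, 1.0, 0.0]),
    ("France borders Spain.", [0.8, 0.0, 0.6]),
]
```

A query vector such as [0.9, 0.0, 0.1] ranks the two France-related answers ahead of the unrelated one, and those top answers are what the chain would pass to the LLM as context.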
Custom prompt templates
The stuff chain type in Langchain uses a specific prompt with question and context documents incorporated. This is what the default prompt looks like:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
Helpful Answer:
We can, however, provide our own prompt template and change the behaviour of the OpenAI LLM, while still using the stuff chain type. It is important to keep {context} and {question} as placeholders.
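Because a template without both placeholders will not receive the retrieved context or the question, a small check can catch mistakes before the chain is built. The sketch below uses only the standard library, and the helper names are hypothetical. In Langchain the checked string would typically be wrapped in a PromptTemplate with input_variables ["context", "question"] and passed to the chain through chain_type_kwargs; treat that wiring as an assumption to verify against your installed version.

```python
from string import Formatter

REQUIRED_PLACEHOLDERS = {"context", "question"}


def template_placeholders(template):
    """Return the set of {name} placeholders found in a format string."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def check_template(template):
    """Raise if {context} or {question} is missing; otherwise return it."""
    missing = REQUIRED_PLACEHOLDERS - template_placeholders(template)
    if missing:
        raise ValueError(f"template is missing placeholders: {sorted(missing)}")
    return template
```

For example, check_template("Question: {question}") raises a ValueError naming the missing context placeholder.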
Experimenting with custom prompts
We can try using a different prompt template, so the model:
- Responds with a single-sentence answer if it knows it.
- Suggests a random song title if it doesn't know the answer to our question.
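A template with those two behaviours could look like the sketch below. The wording is illustrative rather than the notebook's exact prompt, and render_prompt is a hypothetical helper that mimics the stuff chain type by joining the retrieved answers into {context}; the blank-line separator is an assumption.

```python
CUSTOM_TEMPLATE = """Use the following pieces of context to answer the question at the end.
Answer in a single sentence. If the context does not contain the answer,
do not make one up; suggest a random song title instead.

{context}

Question: {question}
Helpful Answer:"""


def render_prompt(template, documents, question):
    """Fill the template the way a "stuff" chain would: all documents at once."""
    # Assumption: retrieved documents are separated by a blank line.
    context = "\n\n".join(documents)
    return template.format(context=context, question=question)
```

Printing render_prompt(CUSTOM_TEMPLATE, retrieved_answers, question) before sending anything to the LLM is a cheap way to inspect exactly what the model will see.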