Prompt Chain

Evaluate RAG Pipelines with LlamaIndex

Name: Evaluate RAG Pipelines with LlamaIndex
Availability: OnlineOnly
Author: OpenAI Cookbook

A LlamaIndex notebook workflow that builds and evaluates Retrieval Augmented Generation (RAG) pipelines using hit rate, MRR, faithfulness, and relevancy

Works with openai, llamaindex

Updated 3 months ago

Version 1.0.0

Why it matters

Assess the performance and accuracy of your Retrieval Augmented Generation (RAG) systems. This asset helps you quantify the effectiveness of your RAG pipeline's retrieval and response generation stages.

Outcomes

What it gets done

Build and index data for RAG pipelines using LlamaIndex.

Evaluate retrieval accuracy using metrics like Hit Rate and MRR.

Assess response quality for faithfulness and relevancy.

Generate question-context pairs for comprehensive evaluation.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-evaluateragwithllamaindex | bash

Steps

Steps in the chain

Understanding Retrieval Augmented Generation (RAG)

Learn what RAG is and how it addresses a core limitation: LLMs are trained on vast datasets that do not include your specific data. Understand how RAG incorporates your data during generation, letting the model access it in real time to give more tailored and contextually relevant responses.

Building RAG with LlamaIndex

Build a simple RAG pipeline using LlamaIndex. Set your OpenAI API Key, download data (using Paul Graham Essay text), load data and build an index, build a QueryEngine and start querying, and check the responses retrieved by the system.

Evaluating RAG with LlamaIndex

Evaluate the performance of your RAG pipeline using LlamaIndex's core evaluation modules. Assess both retrieval accuracy and response quality to determine whether the pipeline produces accurate responses based on data sources and queries.

Question-Context Pair Generation

Generate question-context pairs using LlamaIndex's `generate_question_context_pairs` module. These pairs are essential for evaluating the RAG system for both Retrieval and Response Evaluation.

Retrieval Evaluation

Conduct retrieval evaluations using `RetrieverEvaluator`. Create the Retriever, define functions for `get_eval_results` and `display_results`, and use Hit Rate and MRR metrics to evaluate retriever performance.

Faithfulness Evaluator

Measure if the response from a query engine matches any source nodes to detect hallucinations. Create service contexts for gpt-3.5-turbo and gpt-4, create a QueryEngine with gpt-3.5-turbo, instantiate FaithfulnessEvaluator, generate responses, and evaluate faithfulness.

Relevancy Evaluator

Measure if the response and source nodes (retrieved context) match the query to verify if the response actually answers the query. Instantiate `RelevancyEvaluator` with gpt-4 and perform relevancy evaluation on queries.

Batch Evaluator

Use LlamaIndex's `BatchEvalRunner` to compute multiple evaluations in a batch-wise manner instead of evaluating Faithfulness and Relevancy independently.

Overview

Evaluate RAG with LlamaIndex

What it does

This prompt chain guides you through building a complete Retrieval Augmented Generation (RAG) pipeline with LlamaIndex and evaluating its performance using quantitative metrics. It covers the five key RAG stages (loading, indexing, storing, querying, and evaluation), then implements retrieval evaluation (hit rate and MRR) and response evaluation (faithfulness and relevancy) using OpenAI's GPT models. The workflow uses Paul Graham essay text as sample data and demonstrates both individual query assessment and batch evaluation.

How it connects

Use this workflow when you need to assess whether your RAG system retrieves relevant context and generates accurate responses before deploying to production. It's ideal when you're comparing embedding strategies, tuning similarity thresholds (like `similarity_top_k`), or establishing baseline performance metrics for a new RAG application. Use it when edge cases and failures are accumulating and manual inspection becomes impractical. Do NOT use this as your only evaluation method if your domain requires specialized accuracy measures beyond faithfulness and relevancy. Do NOT rely on this workfl

Source README

Evaluate RAG with LlamaIndex

In this notebook, we will build a RAG pipeline and evaluate it with LlamaIndex. It has the following three sections:

Understanding Retrieval Augmented Generation (RAG).
Building RAG with LlamaIndex.
Evaluating RAG with LlamaIndex.

Retrieval Augmented Generation (RAG)

LLMs are trained on vast datasets, but these will not include your specific data. Retrieval-Augmented Generation (RAG) addresses this by dynamically incorporating your data during the generation process. This is done not by altering the training data of LLMs, but by allowing the model to access and utilize your data in real-time to provide more tailored and contextually relevant responses.

In RAG, your data is loaded and prepared for queries or “indexed”. User queries act on the index, which filters your data down to the most relevant context. This context and your query then go to the LLM along with a prompt, and the LLM provides a response.

Even if what you’re building is a chatbot or an agent, you’ll want to know RAG techniques for getting data into your application.

Stages within RAG

There are five key stages within RAG, which in turn will be a part of any larger application you build. These are:

Loading: this refers to getting your data from where it lives - whether it’s text files, PDFs, another website, a database, or an API - into your pipeline. LlamaHub provides hundreds of connectors to choose from.

Indexing: this means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.

Storing: Once your data is indexed, you will want to store your index, along with any other metadata, to avoid the need to re-index it.

Querying: for any given indexing strategy there are many ways you can utilize LLMs and LlamaIndex data structures to query, including sub-queries, multi-step queries and hybrid strategies.

Evaluation: a critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures of how accurate, faithful and fast your responses to queries are.
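The stages above can be sketched in a few lines of plain Python. This is a toy illustration, not the notebook's LlamaIndex code: the sample documents, the bag-of-words `embed` function (standing in for a real embedding model), and the names `retrieve` and `build_prompt` are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (assumption, not a real model)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Loading: hypothetical in-memory documents instead of files or an API.
documents = {
    "doc1": "The author wrote short stories and programmed on an IBM 1401.",
    "doc2": "Vector embeddings are numerical representations of meaning.",
    "doc3": "Rerankers refine the order of retrieved documents.",
}

# Indexing: map each document id to its vector.
index = {doc_id: embed(text) for doc_id, text in documents.items()}

# Storing: omitted here; a real pipeline would persist `index` to avoid re-indexing.


# Querying: rank documents by similarity and keep the top_k.
def retrieve(query, top_k=2):
    query_vector = embed(query)
    ranked = sorted(index, key=lambda d: cosine(query_vector, index[d]), reverse=True)
    return ranked[:top_k]


def build_prompt(query, top_k=2):
    """Combine the retrieved context and the query into the text sent to an LLM."""
    context = "\n".join(documents[d] for d in retrieve(query, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


print(build_prompt("What did the author write?", top_k=1))
```

The evaluation stage is what the rest of this notebook covers: it checks whether `retrieve` returns the right documents and whether the answer built from them is sound.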

Build a RAG system.

Now that we understand the significance of a RAG system, let's build a simple RAG pipeline.

Set Your OpenAI API Key

Let's use the Paul Graham essay text to build the RAG pipeline.

Download Data

Load Data and Build Index.

Build a QueryEngine and start querying.

Check response.

By default, it retrieves the two most similar nodes (chunks). You can change that with `vector_index.as_query_engine(similarity_top_k=k)`.

Let's check the text in each of these retrieved nodes.

We have built a RAG pipeline and now need to evaluate its performance. We can assess our RAG system/query engine using LlamaIndex's core evaluation modules. Let's examine how to leverage these tools to quantify the quality of our retrieval-augmented generation system.

Evaluation

Evaluation should serve as the primary metric for assessing your RAG application. It determines whether the pipeline will produce accurate responses based on the data sources and a range of queries.

While it's beneficial to examine individual queries and responses at the start, this approach may become impractical as the volume of edge cases and failures increases. Instead, it may be more effective to establish a suite of summary metrics or automated evaluations. These tools can provide insights into overall system performance and indicate specific areas that may require closer scrutiny.

In a RAG system, evaluation focuses on two critical aspects:

Retrieval Evaluation: This assesses the accuracy and relevance of the information retrieved by the system.
Response Evaluation: This measures the quality and appropriateness of the responses generated by the system based on the retrieved information.

Question-Context Pair Generation:

To evaluate a RAG system, it's essential to have queries that can fetch the correct context and subsequently generate an appropriate response. LlamaIndex offers a `generate_question_context_pairs` module for crafting question and context pairs, which can be used to assess the RAG system in both Retrieval and Response Evaluation. For more details on Question Generation, please refer to the documentation.
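To make the idea concrete, here is a minimal sketch of what a question-context dataset can look like. It does not call LlamaIndex: `QADataset`, `build_qa_dataset`, and the field names are assumptions chosen for illustration, and `stub_question_generator` is a hypothetical stand-in for the LLM call that would write real questions.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class QADataset:
    """Hypothetical container; the field names are assumptions for this sketch."""
    queries: dict = field(default_factory=dict)        # query_id -> question text
    corpus: dict = field(default_factory=dict)         # node_id -> chunk text
    relevant_docs: dict = field(default_factory=dict)  # query_id -> [node_id, ...]


def stub_question_generator(chunk, num_questions):
    """Stand-in for an LLM: derives placeholder questions from the chunk's first words."""
    first_words = " ".join(chunk.split()[:5])
    return [f"Question {i + 1} about: {first_words}?" for i in range(num_questions)]


def build_qa_dataset(chunks, generate=stub_question_generator, num_questions_per_chunk=2):
    """Pair every generated question with the chunk it was generated from."""
    dataset = QADataset()
    for node_id, chunk in chunks.items():
        dataset.corpus[node_id] = chunk
        for question in generate(chunk, num_questions_per_chunk):
            query_id = str(uuid.uuid4())
            dataset.queries[query_id] = question
            dataset.relevant_docs[query_id] = [node_id]
    return dataset


# Hypothetical chunks standing in for indexed nodes.
chunks = {
    "node-1": "Before college the two main things I worked on were writing and programming.",
    "node-2": "Evaluation provides objective measures of how accurate your responses are.",
}
qa_dataset = build_qa_dataset(chunks)
print(len(qa_dataset.queries), "questions for", len(qa_dataset.corpus), "chunks")
```

Because each question records the chunk it came from, the same dataset provides ground truth for retrieval evaluation and queries for response evaluation.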

Retrieval Evaluation:

We are now prepared to conduct our retrieval evaluations. We will execute our RetrieverEvaluator using the evaluation dataset we have generated.

We first create the Retriever and then define two functions: get_eval_results, which operates our retriever on the dataset, and display_results, which presents the outcomes of the evaluation.

Let's create the retriever.

Define RetrieverEvaluator. We use Hit Rate and MRR metrics to evaluate our Retriever.

Hit Rate:

Hit rate calculates the fraction of queries where the correct answer is found within the top-k retrieved documents. In simpler terms, it’s about how often our system gets it right within the top few guesses.
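As a worked example, the definition above can be written directly in Python. This is a sketch of the metric itself, not LlamaIndex's implementation; the function name and the sample ids are made up for illustration.

```python
def hit_rate(retrieved_by_query, expected_by_query):
    """Fraction of queries whose top-k retrieved ids contain at least one expected id."""
    if not expected_by_query:
        return 0.0
    hits = 0
    for query_id, expected_ids in expected_by_query.items():
        retrieved_ids = retrieved_by_query.get(query_id, [])
        if any(doc_id in retrieved_ids for doc_id in expected_ids):
            hits += 1
    return hits / len(expected_by_query)


# Hypothetical top-2 results for four queries.
retrieved = {"q1": ["a", "b"], "q2": ["c", "d"], "q3": ["e", "f"], "q4": ["g", "h"]}
expected = {"q1": ["a"], "q2": ["d"], "q3": ["x"], "q4": ["h"]}

print(hit_rate(retrieved, expected))  # 3 of 4 queries hit: 0.75
```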

Mean Reciprocal Rank (MRR):

For each query, MRR evaluates the system’s accuracy by looking at the rank of the highest-placed relevant document. Specifically, it’s the average of the reciprocals of these ranks across all the queries. So, if the first relevant document is the top result, the reciprocal rank is 1; if it’s second, the reciprocal rank is 1/2, and so on.
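The same definition as a short Python sketch. The function names and sample ids are illustrative assumptions rather than LlamaIndex code; a query with no relevant document in its results contributes a reciprocal rank of 0.

```python
def reciprocal_rank(retrieved_ids, expected_ids):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(retrieved_by_query, expected_by_query):
    """Average the reciprocal ranks across all queries."""
    if not expected_by_query:
        return 0.0
    total = sum(
        reciprocal_rank(retrieved_by_query.get(query_id, []), expected_ids)
        for query_id, expected_ids in expected_by_query.items()
    )
    return total / len(expected_by_query)


# Hypothetical top-2 results: ranks are 1, 2, miss, 2.
retrieved = {"q1": ["a", "b"], "q2": ["c", "d"], "q3": ["e", "f"], "q4": ["g", "h"]}
expected = {"q1": ["a"], "q2": ["d"], "q3": ["x"], "q4": ["h"]}

print(mean_reciprocal_rank(retrieved, expected))  # (1 + 0.5 + 0 + 0.5) / 4 = 0.5
```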

Let's use these metrics to check the performance of our retriever.

Let's define a function to display the Retrieval evaluation results in table format.
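A minimal sketch of such a display helper, assuming each per-query result has already been reduced to a dict with `hit_rate` and `mrr` keys. The names `summarize_results` and `format_table` are hypothetical, and the sketch prints a plain-text table rather than relying on any dataframe library.

```python
def summarize_results(name, metric_dicts):
    """Average per-query metrics into one summary row for the named retriever."""
    if not metric_dicts:
        raise ValueError("metric_dicts must contain at least one result")
    count = len(metric_dicts)
    return {
        "Retriever Name": name,
        "Hit Rate": sum(m["hit_rate"] for m in metric_dicts) / count,
        "MRR": sum(m["mrr"] for m in metric_dicts) / count,
    }


def format_table(row):
    """Render a single summary row as an aligned plain-text table."""
    headers = list(row)
    cells = [f"{v:.4f}" if isinstance(v, float) else str(v) for v in row.values()]
    widths = [max(len(h), len(c)) for h, c in zip(headers, cells)]
    header_line = " | ".join(h.ljust(w) for h, w in zip(headers, widths))
    rule = "-+-".join("-" * w for w in widths)
    cell_line = " | ".join(c.ljust(w) for c, w in zip(cells, widths))
    return "\n".join([header_line, rule, cell_line])


# Hypothetical per-query metrics for four evaluation queries.
per_query = [
    {"hit_rate": 1.0, "mrr": 1.0},
    {"hit_rate": 1.0, "mrr": 0.5},
    {"hit_rate": 0.0, "mrr": 0.0},
    {"hit_rate": 1.0, "mrr": 0.5},
]
summary = summarize_results("toy-retriever", per_query)
print(format_table(summary))
```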

Observation:

The retriever with OpenAI embeddings achieves a hit rate of 0.7586, while the MRR of 0.6206 suggests there is room for improvement in placing the most relevant results at the top. Because the MRR is lower than the hit rate, the relevant document is often retrieved but not always ranked first. Enhancing MRR could involve rerankers, which refine the order of retrieved documents. For a deeper understanding of how rerankers can optimize retrieval metrics, refer to the detailed discussion in our blog post.
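The effect of reordering on the two metrics can be illustrated with a toy example. Everything here is assumed for illustration: the rankings, the `reranker_scores` values, and the `rerank` helper are invented, and a real reranker would score each query and document pair with a model.

```python
def hit_rate_and_mrr(rankings, relevant):
    """Compute (hit rate, MRR) when each query has exactly one relevant id."""
    hits, reciprocal_total = 0, 0.0
    for query_id, ranked_ids in rankings.items():
        if relevant[query_id] in ranked_ids:
            hits += 1
            reciprocal_total += 1.0 / (ranked_ids.index(relevant[query_id]) + 1)
    count = len(rankings)
    return hits / count, reciprocal_total / count


def rerank(ranked_ids, scores):
    """Reorder already-retrieved ids by a (hypothetical) reranker score, highest first."""
    return sorted(ranked_ids, key=lambda doc_id: scores.get(doc_id, 0.0), reverse=True)


relevant = {"q1": "a", "q2": "d"}
before = {"q1": ["b", "a"], "q2": ["c", "d"]}  # relevant ids retrieved, but ranked second
reranker_scores = {"a": 0.9, "b": 0.2, "c": 0.1, "d": 0.8}  # made-up scores
after = {query_id: rerank(ids, reranker_scores) for query_id, ids in before.items()}

print(hit_rate_and_mrr(before, relevant))  # (1.0, 0.5)
print(hit_rate_and_mrr(after, relevant))   # (1.0, 1.0)
```

Reranking only changes the order of documents that were already retrieved, so in this sketch the hit rate stays the same while the MRR rises.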

Response Evaluation:

FaithfulnessEvaluator: Measures whether the response from a query engine matches any source nodes, which is useful for detecting hallucinated responses.
RelevancyEvaluator: Measures whether the response and source nodes together match the query.
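The difference between the two checks can be sketched with a pluggable judge. This is not how LlamaIndex implements its evaluators; in the notebook the judge is an LLM (gpt-4). Here `overlap_judge` is a crude word-overlap stand-in, and `EvalResult`, `evaluate_faithfulness`, and `evaluate_relevancy` are hypothetical names used only to show which inputs each check compares.

```python
import re
from dataclasses import dataclass


@dataclass
class EvalResult:
    passing: bool
    score: float


def words(text):
    """Set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_judge(statement, evidence, threshold=0.5):
    """Stand-in for an LLM judge: share of the statement's words found in the evidence."""
    statement_words = words(statement)
    if not statement_words:
        return EvalResult(passing=False, score=0.0)
    score = len(statement_words & words(evidence)) / len(statement_words)
    return EvalResult(passing=score >= threshold, score=score)


def evaluate_faithfulness(response, contexts, judge=overlap_judge):
    """Is the response supported by the retrieved contexts?"""
    return judge(response, " ".join(contexts))


def evaluate_relevancy(query, response, contexts, judge=overlap_judge):
    """Do the response and retrieved contexts together address the query?"""
    return judge(query, " ".join([response, *contexts]))


# Hypothetical query, context, and responses.
query = "What did the author write before college?"
contexts = ["The author wrote short stories before college."]
grounded = "The author wrote short stories."
hallucinated = "The author built rockets in Florida."

print(evaluate_faithfulness(grounded, contexts).passing)      # True
print(evaluate_faithfulness(hallucinated, contexts).passing)  # False
print(evaluate_relevancy(query, grounded, contexts).passing)  # True
```

Swapping `overlap_judge` for a function that prompts an LLM keeps the same structure while giving a far more reliable verdict.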

Faithfulness Evaluator

Let's start with FaithfulnessEvaluator.

We will use gpt-3.5-turbo to generate the response for a given query and gpt-4 for evaluation.

Let's create a service_context separately for gpt-3.5-turbo and gpt-4.

Create a QueryEngine with the gpt-3.5-turbo service_context to generate the response for the query.

Create a FaithfulnessEvaluator.

Let's evaluate on one question.

Generate the response first, then run the faithfulness evaluator on it.

Relevancy Evaluator

RelevancyEvaluator measures whether the response and source nodes (retrieved context) match the query. It is useful for checking whether the response actually answers the query.

Instantiate RelevancyEvaluator for relevancy evaluation with gpt-4.

Let's run the relevancy evaluation on one of the queries.

Batch Evaluator:

We have now run the Faithfulness and Relevancy evaluations independently. LlamaIndex also provides BatchEvalRunner to compute multiple evaluations in a batch-wise manner.
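The batching idea can be sketched with `asyncio`: run every evaluator over every query concurrently, then average the pass flags per evaluator. This is an illustration under stated assumptions, not BatchEvalRunner itself; `run_batch`, its `workers` parameter, the stub evaluators, and the sample data are all hypothetical.

```python
import asyncio

STOPWORDS = {"what", "is", "the", "do"}


async def faithfulness_stub(query, response):
    """Stand-in check: the answer must appear verbatim in the retrieved context."""
    return response["answer"] in response["context"]


async def relevancy_stub(query, response):
    """Stand-in check: the answer must share a non-stopword with the query."""
    query_words = {w.strip("?").lower() for w in query.split()} - STOPWORDS
    return bool(query_words & set(response["answer"].lower().split()))


async def run_batch(evaluators, queries, responses, workers=4):
    """Run all evaluators over all (query, response) pairs, at most `workers` at a time."""
    semaphore = asyncio.Semaphore(workers)

    async def run_one(name, evaluator, query, response):
        async with semaphore:
            return name, await evaluator(query, response)

    tasks = [
        run_one(name, evaluator, query, response)
        for name, evaluator in evaluators.items()
        for query, response in zip(queries, responses)
    ]
    results = {name: [] for name in evaluators}
    for name, passing in await asyncio.gather(*tasks):
        results[name].append(passing)
    return results


# Hypothetical queries and already-generated responses.
queries = ["What color is the sky?", "What do bees make?"]
responses = [
    {"answer": "the sky is blue", "context": "On a clear day the sky is blue."},
    {"answer": "bees make honey", "context": "Worker bees make honey from nectar."},
]
evaluators = {"faithfulness": faithfulness_stub, "relevancy": relevancy_stub}

results = asyncio.run(run_batch(evaluators, queries, responses))
pass_rates = {name: sum(flags) / len(flags) for name, flags in results.items()}
print(pass_rates)
```

Each score is simply the fraction of queries whose evaluation passed, so a score of 1.0 means every query in the batch passed that check.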

Observation:

A faithfulness score of 1.0 signifies that the generated answers contain no hallucinations and are entirely based on the retrieved context.

A relevancy score of 1.0 suggests that the generated answers are consistently aligned with the retrieved context and the queries.

Conclusion

In this notebook, we have explored how to build and evaluate a RAG pipeline using LlamaIndex, with a specific focus on evaluating the retrieval system and generated responses within the pipeline.

LlamaIndex offers a variety of other evaluation modules as well, which you can explore further here
