Building an LLM-as-a-judge evaluation to detect hallucinations with Braintrust
Let's say you're working on a customer service bot and trying to evaluate the quality of its responses. Consider a question like "What is your return policy?" If the correct answer is "You can return items within 30 days of purchase," but your bot generates "You can return items within 30 days," how would you evaluate whether this is a good response?
A heuristic like the Levenshtein string distance would indicate that the response is incorrect. However, a better approach is to use an LLM-as-a-judge to assess the accuracy of the response. LLM-as-a-judge is a technique that leverages an LLM to score the quality of answers. LLMs can reason about language beyond surface-level string comparisons, enabling them to evaluate answers more accurately.
In this cookbook, we'll walk through how to build an LLM-as-a-judge scorer that can detect hallucinations using Braintrust, a third-party evaluation platform that is compatible with OpenAI's models.
Installing dependencies
Let's install a few basic dependencies. We'll use the CoQA dataset (via DuckDB), Braintrust for evals, and OpenAI's models. Please note that Braintrust is a third-party evaluation platform and you should review their terms of service and privacy policy before proceeding.
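For example, installing from PyPI might look like this (the package names are assumptions based on the libraries used below):

%pip install duckdb braintrust autoevals openai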
Next, let's initialize the OpenAI client. We'll use the AsyncOpenAI client so that we can parallelize our requests. The braintrust.wrap_openai function
wraps the OpenAI client to enable logging LLM calls to Braintrust. We'll use Braintrust to facilitate the evaluations below.
Before proceeding, you should sign up for a Braintrust account and set BRAINTRUST_API_KEY in your environment to a valid API key.
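A minimal sketch of that setup, assuming both OPENAI_API_KEY and BRAINTRUST_API_KEY are set in your environment:

import os

import braintrust
from openai import AsyncOpenAI

# wrap_openai logs LLM calls made through the client to Braintrust; otherwise
# the wrapped client behaves like a normal AsyncOpenAI client.
client = braintrust.wrap_openai(AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"]))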
Explore the dataset
We'll use the CoQA dataset which contains a diverse set of passages, questions, and answers. Because CoQA is quite large, we'll just look at the first several passages. As with any public dataset, there's a chance that the underlying LLMs have memorized aspects of the dataset, so when developing your own scorers, it's a good idea to test them using
your own private data.
The data contains a series of passages, each with a number of questions and answers. Let's flatten this into a list of (passage, question, answer) tuples.
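Here's a sketch of how you might load and flatten the data with DuckDB. The Hugging Face file path and the exact column layout are assumptions and may need to be adjusted to match the current dataset:

import duckdb

# Read the first few CoQA passages. DuckDB can query Hugging Face datasets
# directly over hf:// paths; the exact file path below is an assumption.
rows = duckdb.query(
    """
    SELECT story, questions, answers
    FROM 'hf://datasets/stanfordnlp/coqa/data/validation-00000-of-00001.parquet'
    LIMIT 10
    """
).fetchall()

# Flatten into (passage, question, answer) tuples. In this layout, `questions`
# is a list of strings and `answers` is a struct whose `input_text` field is a
# list aligned with the questions.
qa_pairs = [
    (story, question, answer)
    for story, questions, answers in rows
    for question, answer in zip(questions, answers["input_text"])
]
print(len(qa_pairs))
print(qa_pairs[0])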
Adding hallucinations
Since the scorer we're building is designed to detect hallucinations, we can use the QA pairs to generate known hallucinations to test it against. We'll create hallucinated answers by asking an LLM to confidently generate an answer to each question without using the passage.
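One way to do this is sketched below; the model choice and prompt wording are assumptions, and `client` is the wrapped AsyncOpenAI client from earlier:

async def generate_hallucination(question: str) -> str:
    # Deliberately withhold the passage so the model has to invent details,
    # and instruct it to answer confidently rather than hedge.
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the question in one confident sentence, even if you "
                    "have to make up details. Never say you are unsure or that "
                    "you need more context."
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content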
Creating the evaluators
We'll consider a few popular approaches for creating an LLM-as-a-judge. For each approach, we'll create a scorer and then "meta-evaluate" it to see how it performs.
Since we know that the hallucinated answers are incorrect, we'll assess the quality of an evaluator by testing how often it scores the hallucinated answers as 0.
LLM-as-a-judge #1: Numeric rater
A common initial intuition when creating an LLM-as-a-judge is asking the LLM to rate the answer on a scale of 1 to 5. The benefit of this approach is that
it's easy to convert the LLM's output into a numeric score.
We'll use a modified version of the Factuality template, but ask the LLM to
rate the answer on a scale of 1 to 10.
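A sketch of such a scorer is shown below; the prompt wording and model choice are assumptions, loosely adapted from the Factuality template:

NUMERIC_PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question.

[Question]: {question}
[Expert]: {expected}
[Submission]: {output}

Rate how consistent the factual content of the submitted answer is with the
expert answer on a scale of 1 to 10, where 1 means completely inconsistent and
10 means fully consistent. Respond with only the number.
"""


async def numeric_rater(question: str, output: str, expected: str) -> float:
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": NUMERIC_PROMPT.format(
                    question=question, output=output, expected=expected
                ),
            }
        ],
    )
    rating = int(response.choices[0].message.content.strip())
    # Normalize the 1-10 rating into a 0-1 score.
    return (rating - 1) / 9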
This looks promising! Now that we have sanity checked it on a single example, let's run a proper evaluation and see how it performs on a wider set of data. An evaluation consists of three components:
- Data: In this case, the input is the question, hallucinated answer, and ground truth answer. The scorer will convert this into a score between 0 and 1. The expected score is 0, since it's a hallucination.
- Task: The task is simply calling the numeric rater for each input.
- Scores: We'll assess the quality of the generated score by comparing it with the ground truth score. Since we know both numbers are between 0 and 1, we can use the normalized difference as the score. Putting these pieces together, the evaluation might look like the sketch below.
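This sketch assumes a hypothetical hallucinated_answers list produced by the generation step above (one generated answer per QA pair), the numeric_rater defined earlier, and that your Braintrust SDK version accepts an async task:

from braintrust import Eval

# Hypothetical dataset built from the earlier steps: each case pairs a question
# and its ground-truth answer with a generated hallucinated answer, and expects
# a score of 0.
hallucination_dataset = [
    {
        "input": {"question": question, "output": hallucinated, "expected": answer},
        "expected": 0,
    }
    for (passage, question, answer), hallucinated in zip(qa_pairs, hallucinated_answers)
]


def normalized_diff(input, output, expected):
    # Both scores are between 0 and 1, so one minus the absolute difference
    # gives a normalized similarity score.
    return 1 - abs(output - expected)


async def numeric_rater_task(input):
    return await numeric_rater(**input)


Eval(
    "LLM-as-a-judge hallucination detection",  # hypothetical project name
    data=hallucination_dataset,
    task=numeric_rater_task,
    scores=[normalized_diff],
)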
It looks like the numeric rater scored almost 94% in total. That's not bad, but if 6% of your evals are incorrectly judged, that could make it very hard to trust them. Let's dig into the Braintrust
UI to get some insight into what's going on.

It looks like a number of the incorrect answers were scored with numbers between 1 and 10. However, we do not currently have any insight into why the model gave these scores. Let's see if we can
fix that next.
LLM-as-a-judge #2: Adding reasoning
Let's tweak the prompt to get the LLM to also reason about its rating. This method is called Chain of Thought Reasoning. In addition
to potentially improving the score, it will give us some insight into why the model gave these scores.
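One way to do this, sketched below under the same assumptions as the earlier numeric rater, is to ask for a short explanation alongside the rating and parse both out of a JSON response:

import json

COT_PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question.

[Question]: {question}
[Expert]: {expected}
[Submission]: {output}

First reason step by step about how consistent the factual content of the
submitted answer is with the expert answer. Then rate the submission on a
scale of 1 to 10.

Respond with JSON of the form {{"reasoning": "...", "rating": <1-10>}}.
"""


async def numeric_rater_with_reasoning(question: str, output: str, expected: str) -> float:
    response = await client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "user",
                "content": COT_PROMPT.format(
                    question=question, output=output, expected=expected
                ),
            }
        ],
    )
    result = json.loads(response.choices[0].message.content)
    # Print the reasoning so we can inspect why the model chose its rating.
    print(result["reasoning"])
    return (int(result["rating"]) - 1) / 9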
It doesn't look like adding reasoning helped the score (in fact, it's 3% worse). However, if we look at one of the failures, we'll get some insight into
what the model was thinking. Here is an example of a hallucinated answer:

And the score along with its reasoning:

It looks like the model is applying its own judgement to compute partial credit. This is a common problem with numeric rating—both for models and for humans—and can often be solved
by using better prompting.
LLM-as-a-judge #3: Classifying instead of rating
Next, we'll spell out specific criteria and ask the model to classify the answer according to those criteria. This method allows us to more precisely guide the model
towards the hallucinations we're testing for. Intuitively, giving the model specific criteria to rate will result in a more accurate score.
The classifier scored 98%, which is a significant improvement!
Codifying this pattern
The classifier above can simply be rewritten as:
PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {{input}}
************
[Expert]: {{expected}}
************
[Submission]: {{output}}
************
[END DATA]
Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
Answer the question by calling `select_choice` with your reasoning in a step-by-step manner to be
sure that your conclusion is correct. Avoid simply stating the correct answer at the outset. Select a
single choice by setting the `choice` parameter to a single choice from A, B, C, D, or E.
"""
import autoevals

Classifier = autoevals.LLMClassifier(
    name="Hallucination detector",
    prompt_template=PROMPT,
    choice_scores={"A": 0.5, "B": 0, "C": 1, "D": 0, "E": 1},
    use_cot=True,
)
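To meta-evaluate this classifier the same way as the earlier scorers, you can plug it into the same evaluation harness. The sketch below reuses the hypothetical hallucination_dataset and normalized_diff from above and assumes autoevals' async eval_async interface:

async def classifier_task(input):
    # autoevals scorers return a Score object whose `score` field is the 0-1 value.
    result = await Classifier.eval_async(
        output=input["output"],
        expected=input["expected"],
        input=input["question"],
    )
    return result.score


Eval(
    "LLM-as-a-judge hallucination detection",
    data=hallucination_dataset,
    task=classifier_task,
    scores=[normalized_diff],
)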
Next steps
As a next step, you could dig into the individual improvements and regressions to assess them and consider future improvements to the prompt. You could also test it on your own data, and double check that the results hold for your use case.
You could also evaluate a reasoning model like o1, try fine-tuning a smaller model to see if the results are reproducible, or use few-shot prompting to align the model with more subjective criteria.
In all cases, you should strive to evaluate your results, so you can rigorously assess the impact of each change.
