Prompt Chain

Evaluate Model Answers with Custom Datasets

Name: Evaluate Model Answers with Custom Datasets
Availability: OnlineOnly
Author: OpenAI Cookbook

Jupyter notebook demonstrating OpenAI Evals framework with custom datasets to compare gpt-4.1 and o4-mini models answering tiktoken repository questions via

Copy chain

Works with openai

OpenAI Cookbook

Maintainer?

Spark score

out of 100

Updated 3 months ago

Version 1.0.0

Models

gpt 4ogpt 4universal

Add to Favorites

Why it matters

Assess the accuracy and relevance of AI model responses against a custom dataset, comparing different models and tool integrations for performance insights.

Outcomes

What it gets done

Set up and run evaluations using OpenAI Evals with custom datasets.

Compare model performance using repository-aware tools like MCP.

Define and implement LLM-based and programmatic grading logic.

Analyze and interpret evaluation outputs for model improvement.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-mcpevalnotebook | bash

Steps

Steps in the chain

Environment Setup

Import the required libraries and configure the OpenAI client. This step ensures we have access to the OpenAI API and all necessary utilities for evaluation.

Define the Custom Evaluation Dataset

Define a small, in-memory dataset of question-answer pairs about the tiktoken repository. Each item contains a query (the user's question) and an answer (the expected ground truth). You can modify or extend this dataset to suit your own use case or repository.

Define Grading Logic

Set up two graders: an LLM-based grader that checks if the model's answer matches the expected answer or conveys the same meaning, and a Python MCP Grader that checks whether the model actually used the MCP tool during its response. Using both provides a more robust and transparent evaluation.

Define the Evaluation Configuration

Configure the evaluation using the OpenAI Evals framework. Specify the evaluation name and dataset, the schema for each item, the grader(s) to use, and the passing criteria and labels. Clearly defining your evaluation schema and grading logic ensures reproducibility and transparency.

Run Evaluations for Each Model

Run the evaluation for each model (gpt-4.1 and o4-mini). Each run is configured to use the MCP tool for repository-aware answers, use the same dataset and evaluation configuration for fair comparison, and specify model-specific parameters. Keeping the evaluation setup consistent across models ensures results are comparable and reliable.

Poll for Completion and Retrieve Outputs

After launching the evaluation runs, poll the runs until they are complete. This ensures that you are analyzing results only after all model responses have been processed. Polling with a delay avoids excessive API calls and ensures efficient resource usage.

Display and Interpret Model Outputs

Display the outputs from each model for manual inspection and further analysis. Each model's answers are printed for each question in the dataset. Compare the outputs side-by-side to assess quality, relevance, and correctness using the OpenAI Evals Dashboard.

Overview

Evaluating MCP-Based Answers with a Custom Dataset

What it does

A hands-on Jupyter notebook that walks through setting up and running OpenAI Evals with a custom in-memory dataset, comparing gpt-4.1 and o4-mini models on their ability to answer questions about the tiktoken GitHub repository using MCP tools, with both LLM-based and programmatic graders for robust assessment.

How it connects

Use this notebook when you need to evaluate and compare LLM performance on technical Q&A tasks with custom datasets, when you want to audit tool usage alongside answer quality, or when establishing reproducible evaluation workflows for MCP-based applications.

Source README

Evaluating MCP-Based Answers with a Custom Dataset

This notebook evaluates a model's ability to answer questions about the tiktoken GitHub repository using the OpenAI Evals framework with a custom in-memory dataset.

We use a custom, in-memory dataset of Q&A pairs and compare two models: gpt-4.1 and o4-mini, that leverage the MCP tool for repository-aware, contextually accurate answers.

Goals:

Show how to set up and run an evaluation using OpenAI Evals with a custom dataset.
Compare the performance of different models leveraging MCP-based tools.
Provide best practices for professional, reproducible evaluation workflows.

Next: We will set up our environment and import the necessary libraries.

Environment Setup

We begin by importing the required libraries and configuring the OpenAI client.
This step ensures we have access to the OpenAI API and all necessary utilities for evaluation.

Define the Custom Evaluation Dataset

We define a small, in-memory dataset of question-answer pairs about the tiktoken repository.
This dataset will be used to test the models' ability to provide accurate and relevant answers with the help of the MCP tool.

Each item contains a query (the user’s question) and an answer (the expected ground truth).
You can modify or extend this dataset to suit your own use case or repository.

Define Grading Logic

To evaluate the model’s answers, we use two graders:

Pass/Fail Grader (LLM-based):
An LLM-based grader that checks if the model’s answer matches the expected answer (ground truth) or conveys the same meaning.
Python MCP Grader:
A Python function that checks whether the model actually used the MCP tool during its response (for auditing tool usage).

Best Practice:
Using both LLM-based and programmatic graders provides a more robust and transparent evaluation.

Define the Evaluation Configuration

We now configure the evaluation using the OpenAI Evals framework.

This step specifies:

The evaluation name and dataset.
The schema for each item (what fields are present in each Q&A pair).
The grader(s) to use (LLM-based and/or Python-based).
The passing criteria and labels.

Best Practice:
Clearly defining your evaluation schema and grading logic up front ensures reproducibility and transparency.

Run Evaluations for Each Model

We now run the evaluation for each model (gpt-4.1 and o4-mini).

Each run is configured to:

Use the MCP tool for repository-aware answers.
Use the same dataset and evaluation configuration for fair comparison.
Specify model-specific parameters (such as max completions tokens, and allowed tools).

Best Practice:
Keeping the evaluation setup consistent across models ensures results are comparable and reliable.

Poll for Completion and Retrieve Outputs

After launching the evaluation runs, we can poll the run until they are complete.

This step ensures that we are analyzing results only after all model responses have been processed.

Best Practice:
Polling with a delay avoids excessive API calls and ensures efficient resource usage.

Display and Interpret Model Outputs

Finally, we display the outputs from each model for manual inspection and further analysis.

Each model's answers are printed for each question in the dataset.
You can compare the outputs side-by-side to assess quality, relevance, and correctness.

Below are screenshots from the OpenAI Evals Dashboard illustrating the evaluation outputs for both models:

For a comprehensive breakdown of the evaluation metrics and results, navigate to the "Data" tab in the dashboard:

Note that the 4.1 model was constructed to never use its tools to answer the query thus it never called the MCP server. The o4-mini model wasn't explicitly instructed to use it's tools either but it wasn't forbidden, thus it called the MCP server 3 times. We can see that the 4.1 model performed worse than the o4 model. Also notable is the one example that the o4-mini model failed was one where the MCP tool was not used.

We can also check a detailed analysis of the outputs from each model for manual inspection and further analysis.

How can we improve?

If we add the phrase "Always use your tools since they are the way to get the right answer in this task." to the system message of the o4-mini model, what do you think will happen? (try it out)

If you guessed that the model would now call to MCP tool everytime and get every answer correct, you are right!

In this notebook, we demonstrated a sample workflow for evaluating the ability of LLMs to answer technical questions about the tiktoken repository using the OpenAI Evals framework leveraging MCP tooling.

Key points covered:

Defined a focused, custom dataset for evaluation.
Configured LLM-based and Python-based graders for robust assessment.
Compared two models (gpt-4.1 and o4-mini) in a reproducible and transparent manner.
Retrieved and displayed model outputs for automated/manual inspection.

Next steps:

Expand the dataset: Add more diverse and challenging questions to better assess model capabilities.
Analyze results: Summarize pass/fail rates, visualize performance, or perform error analysis to identify strengths and weaknesses.
Experiment with models/tools: Try additional models, adjust tool configurations, or test on other repositories.
Automate reporting: Generate summary tables or plots for easier sharing and decision-making.

For more information, check out the OpenAI Evals documentation.

Step 1: Environment Setup

Import the required libraries and configure the OpenAI client. This step ensures we have access to the OpenAI API and all necessary utilities for evaluation.

Step 2: Define the Custom Evaluation Dataset

Define a small, in-memory dataset of question-answer pairs about the tiktoken repository. Each item contains a query (the user's question) and an answer (the expected ground truth). You can modify or extend this dataset to suit your own use case or repository.

Step 3: Define Grading Logic

Set up two graders: an LLM-based grader that checks if the model's answer matches the expected answer or conveys the same meaning, and a Python MCP Grader that checks whether the model actually used the MCP tool during its response. Using both provides a more robust and transparent evaluation.

Step 4: Define the Evaluation Configuration

Configure the evaluation using the OpenAI Evals framework. Specify the evaluation name and dataset, the schema for each item, the grader(s) to use, and the passing criteria and labels. Clearly defining your evaluation schema and grading logic ensures reproducibility and transparency.

Step 5: Run Evaluations for Each Model

Run the evaluation for each model (gpt-4.1 and o4-mini). Each run is configured to use the MCP tool for repository-aware answers, use the same dataset and evaluation configuration for fair comparison, and specify model-specific parameters. Keeping the evaluation setup consistent across models ensures results are comparable and reliable.

Step 6: Poll for Completion and Retrieve Outputs

After launching the evaluation runs, poll the runs until they are complete. This ensures that you are analyzing results only after all model responses have been processed. Polling with a delay avoids excessive API calls and ensures efficient resource usage.

Step 7: Display and Interpret Model Outputs

Display the outputs from each model for manual inspection and further analysis. Each model's answers are printed for each question in the dataset. Compare the outputs side-by-side to assess quality, relevance, and correctness using the OpenAI Evals Dashboard.

Discussion

Evaluate Model Answers with Custom Datasets

What it gets done

Add it to your toolbox

Steps in the chain

Evaluating MCP-Based Answers with a Custom Dataset

What it does

How it connects

Evaluating MCP-Based Answers with a Custom Dataset

Environment Setup

Define the Custom Evaluation Dataset

Define Grading Logic

Define the Evaluation Configuration

Run Evaluations for Each Model

Poll for Completion and Retrieve Outputs

Display and Interpret Model Outputs

How can we improve?

Step 1: Environment Setup

Step 2: Define the Custom Evaluation Dataset

Step 3: Define Grading Logic

Step 4: Define the Evaluation Configuration

Step 5: Run Evaluations for Each Model

Step 6: Poll for Completion and Retrieve Outputs

Step 7: Display and Interpret Model Outputs

Questions & comments · 0