Evaluate Model Answers with Custom Datasets
Jupyter notebook demonstrating OpenAI Evals framework with custom datasets to compare gpt-4.1 and o4-mini models answering tiktoken repository questions via
Why it matters
Assess the accuracy and relevance of AI model responses against a custom dataset, comparing different models and tool integrations for performance insights.
Outcomes
What it gets done
Set up and run evaluations using OpenAI Evals with custom datasets.
Compare model performance using repository-aware tools like MCP.
Define and implement LLM-based and programmatic grading logic.
Analyze and interpret evaluation outputs for model improvement.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/oai-mcpevalnotebook | bash Steps
Steps in the chain
Import the required libraries and configure the OpenAI client. This step ensures we have access to the OpenAI API and all necessary utilities for evaluation.
Define a small, in-memory dataset of question-answer pairs about the tiktoken repository. Each item contains a query (the user's question) and an answer (the expected ground truth). You can modify or extend this dataset to suit your own use case or repository.
Set up two graders: an LLM-based grader that checks if the model's answer matches the expected answer or conveys the same meaning, and a Python MCP Grader that checks whether the model actually used the MCP tool during its response. Using both provides a more robust and transparent evaluation.
Configure the evaluation using the OpenAI Evals framework. Specify the evaluation name and dataset, the schema for each item, the grader(s) to use, and the passing criteria and labels. Clearly defining your evaluation schema and grading logic ensures reproducibility and transparency.
Run the evaluation for each model (gpt-4.1 and o4-mini). Each run is configured to use the MCP tool for repository-aware answers, use the same dataset and evaluation configuration for fair comparison, and specify model-specific parameters. Keeping the evaluation setup consistent across models ensures results are comparable and reliable.
After launching the evaluation runs, poll the runs until they are complete. This ensures that you are analyzing results only after all model responses have been processed. Polling with a delay avoids excessive API calls and ensures efficient resource usage.
Display the outputs from each model for manual inspection and further analysis. Each model's answers are printed for each question in the dataset. Compare the outputs side-by-side to assess quality, relevance, and correctness using the OpenAI Evals Dashboard.
Overview
Evaluating MCP-Based Answers with a Custom Dataset
What it does
A hands-on Jupyter notebook that walks through setting up and running OpenAI Evals with a custom in-memory dataset, comparing gpt-4.1 and o4-mini models on their ability to answer questions about the tiktoken GitHub repository using MCP tools, with both LLM-based and programmatic graders for robust assessment.
How it connects
Use this notebook when you need to evaluate and compare LLM performance on technical Q&A tasks with custom datasets, when you want to audit tool usage alongside answer quality, or when establishing reproducible evaluation workflows for MCP-based applications.
Source README
Evaluating MCP-Based Answers with a Custom Dataset
This notebook evaluates a model's ability to answer questions about the tiktoken GitHub repository using the OpenAI Evals framework with a custom in-memory dataset.
We use a custom, in-memory dataset of Q&A pairs and compare two models: gpt-4.1 and o4-mini, that leverage the MCP tool for repository-aware, contextually accurate answers.
Goals:
- Show how to set up and run an evaluation using OpenAI Evals with a custom dataset.
- Compare the performance of different models leveraging MCP-based tools.
- Provide best practices for professional, reproducible evaluation workflows.
Next: We will set up our environment and import the necessary libraries.
Environment Setup
We begin by importing the required libraries and configuring the OpenAI client.
This step ensures we have access to the OpenAI API and all necessary utilities for evaluation.
Define the Custom Evaluation Dataset
We define a small, in-memory dataset of question-answer pairs about the tiktoken repository.
This dataset will be used to test the models' ability to provide accurate and relevant answers with the help of the MCP tool.
- Each item contains a
query(the user’s question) and ananswer(the expected ground truth). - You can modify or extend this dataset to suit your own use case or repository.
Define Grading Logic
To evaluate the model’s answers, we use two graders:
Pass/Fail Grader (LLM-based):
An LLM-based grader that checks if the model’s answer matches the expected answer (ground truth) or conveys the same meaning.Python MCP Grader:
A Python function that checks whether the model actually used the MCP tool during its response (for auditing tool usage).Best Practice:
Using both LLM-based and programmatic graders provides a more robust and transparent evaluation.
Define the Evaluation Configuration
We now configure the evaluation using the OpenAI Evals framework.
This step specifies:
- The evaluation name and dataset.
- The schema for each item (what fields are present in each Q&A pair).
- The grader(s) to use (LLM-based and/or Python-based).
- The passing criteria and labels.
Best Practice:
Clearly defining your evaluation schema and grading logic up front ensures reproducibility and transparency.
Run Evaluations for Each Model
We now run the evaluation for each model (gpt-4.1 and o4-mini).
Each run is configured to:
- Use the MCP tool for repository-aware answers.
- Use the same dataset and evaluation configuration for fair comparison.
- Specify model-specific parameters (such as max completions tokens, and allowed tools).
Best Practice:
Keeping the evaluation setup consistent across models ensures results are comparable and reliable.
Poll for Completion and Retrieve Outputs
After launching the evaluation runs, we can poll the run until they are complete.
This step ensures that we are analyzing results only after all model responses have been processed.
Best Practice:
Polling with a delay avoids excessive API calls and ensures efficient resource usage.
Display and Interpret Model Outputs
Finally, we display the outputs from each model for manual inspection and further analysis.
- Each model's answers are printed for each question in the dataset.
- You can compare the outputs side-by-side to assess quality, relevance, and correctness.
Below are screenshots from the OpenAI Evals Dashboard illustrating the evaluation outputs for both models:
For a comprehensive breakdown of the evaluation metrics and results, navigate to the "Data" tab in the dashboard:
Note that the 4.1 model was constructed to never use its tools to answer the query thus it never called the MCP server. The o4-mini model wasn't explicitly instructed to use it's tools either but it wasn't forbidden, thus it called the MCP server 3 times. We can see that the 4.1 model performed worse than the o4 model. Also notable is the one example that the o4-mini model failed was one where the MCP tool was not used.
We can also check a detailed analysis of the outputs from each model for manual inspection and further analysis.
How can we improve?
If we add the phrase "Always use your tools since they are the way to get the right answer in this task." to the system message of the o4-mini model, what do you think will happen? (try it out)
If you guessed that the model would now call to MCP tool everytime and get every answer correct, you are right!
In this notebook, we demonstrated a sample workflow for evaluating the ability of LLMs to answer technical questions about the tiktoken repository using the OpenAI Evals framework leveraging MCP tooling.
Key points covered:
- Defined a focused, custom dataset for evaluation.
- Configured LLM-based and Python-based graders for robust assessment.
- Compared two models (
gpt-4.1ando4-mini) in a reproducible and transparent manner. - Retrieved and displayed model outputs for automated/manual inspection.
Next steps:
- Expand the dataset: Add more diverse and challenging questions to better assess model capabilities.
- Analyze results: Summarize pass/fail rates, visualize performance, or perform error analysis to identify strengths and weaknesses.
- Experiment with models/tools: Try additional models, adjust tool configurations, or test on other repositories.
- Automate reporting: Generate summary tables or plots for easier sharing and decision-making.
For more information, check out the OpenAI Evals documentation.
Step 1: Environment Setup
Import the required libraries and configure the OpenAI client. This step ensures we have access to the OpenAI API and all necessary utilities for evaluation.
Step 2: Define the Custom Evaluation Dataset
Define a small, in-memory dataset of question-answer pairs about the tiktoken repository. Each item contains a query (the user's question) and an answer (the expected ground truth). You can modify or extend this dataset to suit your own use case or repository.
Step 3: Define Grading Logic
Set up two graders: an LLM-based grader that checks if the model's answer matches the expected answer or conveys the same meaning, and a Python MCP Grader that checks whether the model actually used the MCP tool during its response. Using both provides a more robust and transparent evaluation.
Step 4: Define the Evaluation Configuration
Configure the evaluation using the OpenAI Evals framework. Specify the evaluation name and dataset, the schema for each item, the grader(s) to use, and the passing criteria and labels. Clearly defining your evaluation schema and grading logic ensures reproducibility and transparency.
Step 5: Run Evaluations for Each Model
Run the evaluation for each model (gpt-4.1 and o4-mini). Each run is configured to use the MCP tool for repository-aware answers, use the same dataset and evaluation configuration for fair comparison, and specify model-specific parameters. Keeping the evaluation setup consistent across models ensures results are comparable and reliable.
Step 6: Poll for Completion and Retrieve Outputs
After launching the evaluation runs, poll the runs until they are complete. This ensures that you are analyzing results only after all model responses have been processed. Polling with a delay avoids excessive API calls and ensures efficient resource usage.
Step 7: Display and Interpret Model Outputs
Display the outputs from each model for manual inspection and further analysis. Each model's answers are printed for each question in the dataset. Compare the outputs side-by-side to assess quality, relevance, and correctness using the OpenAI Evals Dashboard.
Discussion
Questions & comments · 0
Sign In Sign in to leave a comment.