Automate Image-Based AI Evaluation
OpenAI Evals API workflow for evaluating vision model responses to image prompts using sampling and LLM-as-a-Judge grading against reference answers.
Why it matters
Streamline the evaluation of AI models on image-based tasks. This asset automates the process of generating responses to image prompts and scoring them using an LLM judge.
Outcomes
What it gets done
Evaluate model responses to image prompts.
Utilize LLM-as-a-judge for scoring.
Integrate with datasets like VibeEval.
Automate testing for image-related AI capabilities.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/oai-evalsapiimageinputs | bash Steps
Steps in the chain
Install required dependencies and set up the environment for using OpenAI's Evals framework for image-based tasks.
Load the VibeEval dataset from Hugging Face. Extract relevant fields (media_url, reference, prompt) and format as JSON. Convert image data to web URLs or base64 encoded strings for use as data source in the Evals API.
Configure the data source based on the compiled dataset. Define the structure and format of the data that will be used for evaluation.
Set up the grader config with a model grader that takes an image, reference answer, and sampled model response. Configure it to output a score between 0 and 1 based on how closely the model response matches the reference answer and its suitability for the conversation.
Create the eval object defining the expected structure of the data and testing criteria (grader). The eval specifies how the model responses will be evaluated against the reference answers.
Create the run by passing in the eval object id, the data source, and the chat message input for sampling. This generates model responses that will be evaluated.
Wait for the eval run to finish and view the results. Check the OpenAI evals dashboard to see the progress and evaluation results.
Examine individual output items from the evaluation run to see detailed results and understand how each item was scored.
Overview
Evals API: Image Inputs
What it does
This prompt chain demonstrates how to evaluate vision-capable language models on image-based tasks using OpenAI's Evals API. It combines sampling (generating model responses to images and prompts) with model grading (LLM-as-a-Judge) to score how well responses align with reference answers. The workflow uses the VibeEval dataset from Hugging Face to test model performance on image understanding tasks.
How it connects
Use this workflow when you need to evaluate how well models respond to images paired with user prompts, and you have reference answers representing high-quality responses. The source mentions potential applications in OCR accuracy and image generation grading. Avoid this approach if you don't have reference answers or ground truth data to compare against, as the model grader requires these for evaluation. Also skip this if you need real-time evaluation-the Evals API is designed for batch assessment, not live inference scoring.
Source README
Evals API: Image Inputs
This cookbook demonstrates how to use OpenAI's Evals framework for image-based tasks. Leveraging the Evals API, we will grade model-generated responses to an image and prompt by using sampling to generate model responses and model grading (LLM as a Judge) to score the model responses against the image, prompt, and reference answer.
In this example, we will evaluate how well our model can:
- Generate appropriate responses to user prompts about images
- Align with reference answers that represent high-quality responses
Installing Dependencies + Setup
Dataset Preparation
We use the VibeEval dataset that's hosted on Hugging Face. It contains a collection of user prompt, accompanying image, and reference answer data. First, we load the dataset.
We extract the relevant fields and put it in a json-like format to pass in as a data source in the Evals API. Input image data can be in the form of a web URL or a base64 encoded string. Here, we use the provided web URLs.
If you print the data source list, each item should be of a similar form to:
{
"item": {
"media_url": "https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_food1_7e5c2cb9c8200d70.jpg"
"reference": "This appears to be a classic Margherita pizza, which has the following ingredients..."
"prompt": "What ingredients do I need to make this?"
}
}
Eval Configuration
Now that we have our data source and task, we will create our evals. For the OpenAI Evals API docs, visit API docs.
Evals have two parts, the "Eval" and the "Run". In the "Eval", we define the expected structure of the data and the testing criteria (grader).
Data Source Config
Based on the data that we have compiled, our data source config is as follows:
Testing Criteria
For our testing criteria, we set up our grader config. In this example, it is a model grader that takes in an image, reference answer, and sampled model response (in the sample namespace), and then outputs a score between 0 and 1 based on how closely the model response matches the reference answer and its general suitability for the conversation. For more info on model graders, visit API Grader docs.
Getting the both the data and the grader right are key for an effective evaluation. So, you will likely want to iteratively refine the prompts for your graders.
Note: The image url field / templating need to be placed in an input image object to be interpreted as an image. Otherwise, the image will be interpreted as a text string.
Now, we create the eval object.
Eval Run
To create the run, we pass in the eval object id, the data source (i.e., the data we compiled earlier), and the chat message input we will use for sampling to generate the model response. Note that EvalsAPI also supports stored completions and responses containing images as a data source. See the Additional Info: Logs Data Source section for more info.
Here's the sampling message input we'll use for this example.
We now kickoff an eval run.
Poll and Display the Results
When the run finishes, we can take a look at the result. You can also check in your org's OpenAI evals dashboard to see the progress and results.
Viewing Individual Output Items
To see a full output item, we can do the following. The structure of an output item is specified in the API docs here.
Additional Info: Logs Data Source
As mentioned earlier, EvalsAPI supports logs (i.e., stored completions or responses) containing images as a data source. To use this functionality, change your eval configurations as follows:
Eval Creation
- set
data_source_config = { "type": "logs" } - revise templating in
grader_configto use{{item.input}}and/or{{sample.output_text}}, denoting the input and output of the log
Eval Run Creation
- specify the filters in the
data_sourcefield that will be used to obtain the corresponding logs for the eval run (see the docs for more information)
Conclusion
In this cookbook, we covered a workflow for evaluating an image-based task using the OpenAI Evals API's. By using the image input functionality for both sampling and model grading, we were able to streamline our evals process for the task.
We're excited to see you extend this to your own image-based use cases, whether it's OCR accuracy, image generation grading, and more!
Step 1: Installing Dependencies + Setup
Install required dependencies and set up the environment for using OpenAI's Evals framework for image-based tasks.
Step 2: Dataset Preparation
Load the VibeEval dataset from Hugging Face. Extract relevant fields (media_url, reference, prompt) and format as JSON. Convert image data to web URLs or base64 encoded strings for use as data source in the Evals API.
Step 3: Data Source Config
Configure the data source based on the compiled dataset. Define the structure and format of the data that will be used for evaluation.
Step 4: Testing Criteria
Set up the grader config with a model grader that takes an image, reference answer, and sampled model response. Configure it to output a score between 0 and 1 based on how closely the model response matches the reference answer and its suitability for the conversation.
Step 5: Eval Configuration
Create the eval object defining the expected structure of the data and testing criteria (grader). The eval specifies how the model responses will be evaluated against the reference answers.
Step 6: Eval Run
Create the run by passing in the eval object id, the data source, and the chat message input for sampling. This generates model responses that will be evaluated.
Step 7: Poll and Display the Results
Wait for the eval run to finish and view the results. Check the OpenAI evals dashboard to see the progress and evaluation results.
Step 8: Viewing Individual Output Items
Examine individual output items from the evaluation run to see detailed results and understand how each item was scored.
Discussion
Questions & comments · 0
Sign In Sign in to leave a comment.