Prompt Chain

Automate Image-Based AI Evaluation

OpenAI Evals API workflow for evaluating vision model responses to image prompts using sampling and LLM-as-a-Judge grading against reference answers.


91
Spark score
out of 100
Updated 3 months ago
Version 1.0.0
Models

Add to Favorites

Why it matters

Streamline the evaluation of AI models on image-based tasks. This asset automates the process of generating responses to image prompts and scoring them using an LLM judge.

Outcomes

What it gets done

01

Evaluate model responses to image prompts.

02

Utilize LLM-as-a-judge for scoring.

03

Integrate with datasets like VibeEval.

04

Automate testing for image-related AI capabilities.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-evalsapiimageinputs | bash

Steps

Steps in the chain

01
Installing Dependencies + Setup

Install required dependencies and set up the environment for using OpenAI's Evals framework for image-based tasks.

02
Dataset Preparation

Load the VibeEval dataset from Hugging Face. Extract relevant fields (media_url, reference, prompt) and format as JSON. Convert image data to web URLs or base64 encoded strings for use as data source in the Evals API.

03
Data Source Config

Configure the data source based on the compiled dataset. Define the structure and format of the data that will be used for evaluation.

04
Testing Criteria

Set up the grader config with a model grader that takes an image, reference answer, and sampled model response. Configure it to output a score between 0 and 1 based on how closely the model response matches the reference answer and its suitability for the conversation.

05
Eval Configuration

Create the eval object defining the expected structure of the data and testing criteria (grader). The eval specifies how the model responses will be evaluated against the reference answers.

06
Eval Run

Create the run by passing in the eval object id, the data source, and the chat message input for sampling. This generates model responses that will be evaluated.

07
Poll and Display the Results

Wait for the eval run to finish and view the results. Check the OpenAI evals dashboard to see the progress and evaluation results.

08
Viewing Individual Output Items

Examine individual output items from the evaluation run to see detailed results and understand how each item was scored.

Overview

Evals API: Image Inputs

What it does

This prompt chain demonstrates how to evaluate vision-capable language models on image-based tasks using OpenAI's Evals API. It combines sampling (generating model responses to images and prompts) with model grading (LLM-as-a-Judge) to score how well responses align with reference answers. The workflow uses the VibeEval dataset from Hugging Face to test model performance on image understanding tasks.

How it connects

Use this workflow when you need to evaluate how well models respond to images paired with user prompts, and you have reference answers representing high-quality responses. The source mentions potential applications in OCR accuracy and image generation grading. Avoid this approach if you don't have reference answers or ground truth data to compare against, as the model grader requires these for evaluation. Also skip this if you need real-time evaluation-the Evals API is designed for batch assessment, not live inference scoring.

Source README

Evals API: Image Inputs

This cookbook demonstrates how to use OpenAI's Evals framework for image-based tasks. Leveraging the Evals API, we will grade model-generated responses to an image and prompt by using sampling to generate model responses and model grading (LLM as a Judge) to score the model responses against the image, prompt, and reference answer.

In this example, we will evaluate how well our model can:

  1. Generate appropriate responses to user prompts about images
  2. Align with reference answers that represent high-quality responses

Installing Dependencies + Setup

Dataset Preparation

We use the VibeEval dataset that's hosted on Hugging Face. It contains a collection of user prompt, accompanying image, and reference answer data. First, we load the dataset.

We extract the relevant fields and put it in a json-like format to pass in as a data source in the Evals API. Input image data can be in the form of a web URL or a base64 encoded string. Here, we use the provided web URLs.

If you print the data source list, each item should be of a similar form to:

{
  "item": {
    "media_url": "https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_food1_7e5c2cb9c8200d70.jpg"
    "reference": "This appears to be a classic Margherita pizza, which has the following ingredients..."
    "prompt": "What ingredients do I need to make this?"
  }
}

Eval Configuration

Now that we have our data source and task, we will create our evals. For the OpenAI Evals API docs, visit API docs.

Evals have two parts, the "Eval" and the "Run". In the "Eval", we define the expected structure of the data and the testing criteria (grader).

Data Source Config

Based on the data that we have compiled, our data source config is as follows:

Testing Criteria

For our testing criteria, we set up our grader config. In this example, it is a model grader that takes in an image, reference answer, and sampled model response (in the sample namespace), and then outputs a score between 0 and 1 based on how closely the model response matches the reference answer and its general suitability for the conversation. For more info on model graders, visit API Grader docs.

Getting the both the data and the grader right are key for an effective evaluation. So, you will likely want to iteratively refine the prompts for your graders.

Note: The image url field / templating need to be placed in an input image object to be interpreted as an image. Otherwise, the image will be interpreted as a text string.

Now, we create the eval object.

Eval Run

To create the run, we pass in the eval object id, the data source (i.e., the data we compiled earlier), and the chat message input we will use for sampling to generate the model response. Note that EvalsAPI also supports stored completions and responses containing images as a data source. See the Additional Info: Logs Data Source section for more info.

Here's the sampling message input we'll use for this example.

We now kickoff an eval run.

Poll and Display the Results

When the run finishes, we can take a look at the result. You can also check in your org's OpenAI evals dashboard to see the progress and results.

Viewing Individual Output Items

To see a full output item, we can do the following. The structure of an output item is specified in the API docs here.

Additional Info: Logs Data Source

As mentioned earlier, EvalsAPI supports logs (i.e., stored completions or responses) containing images as a data source. To use this functionality, change your eval configurations as follows:

Eval Creation

  • set data_source_config = { "type": "logs" }
  • revise templating in grader_config to use {{item.input}} and/or {{sample.output_text}}, denoting the input and output of the log

Eval Run Creation

  • specify the filters in the data_source field that will be used to obtain the corresponding logs for the eval run (see the docs for more information)

Conclusion

In this cookbook, we covered a workflow for evaluating an image-based task using the OpenAI Evals API's. By using the image input functionality for both sampling and model grading, we were able to streamline our evals process for the task.

We're excited to see you extend this to your own image-based use cases, whether it's OCR accuracy, image generation grading, and more!

Step 1: Installing Dependencies + Setup

Install required dependencies and set up the environment for using OpenAI's Evals framework for image-based tasks.

Step 2: Dataset Preparation

Load the VibeEval dataset from Hugging Face. Extract relevant fields (media_url, reference, prompt) and format as JSON. Convert image data to web URLs or base64 encoded strings for use as data source in the Evals API.

Step 3: Data Source Config

Configure the data source based on the compiled dataset. Define the structure and format of the data that will be used for evaluation.

Step 4: Testing Criteria

Set up the grader config with a model grader that takes an image, reference answer, and sampled model response. Configure it to output a score between 0 and 1 based on how closely the model response matches the reference answer and its suitability for the conversation.

Step 5: Eval Configuration

Create the eval object defining the expected structure of the data and testing criteria (grader). The eval specifies how the model responses will be evaluated against the reference answers.

Step 6: Eval Run

Create the run by passing in the eval object id, the data source, and the chat message input for sampling. This generates model responses that will be evaluated.

Step 7: Poll and Display the Results

Wait for the eval run to finish and view the results. Check the OpenAI evals dashboard to see the progress and evaluation results.

Step 8: Viewing Individual Output Items

Examine individual output items from the evaluation run to see detailed results and understand how each item was scored.

Discussion

Questions & comments · 0

Sign In Sign in to leave a comment.