Prompt Chain

Evaluate Audio Responses with Evals API

Cookbook demonstrating OpenAI Evals API for audio-based model evaluation using native audio inputs, model sampling, and audio grading without transcription.


91
Spark score
out of 100
Updated 3 months ago
Version 1.0.0
Models
claude 3 opusgpt 4o

Add to Favorites

Why it matters

Automate the evaluation of AI model responses to audio inputs, ensuring accuracy and alignment with reference answers for tasks like customer support.

Outcomes

What it gets done

01

Process and prepare audio data for evaluation.

02

Configure and run evaluations using the OpenAI Evals API.

03

Grade audio responses using model-based or text-based graders.

04

Analyze evaluation results to assess model performance.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-evalsapiaudioinputs | bash

Steps

Steps in the chain

01
Installing Dependencies + Setup

Install required dependencies and set up the environment for using OpenAI's Evals framework with audio inputs.

02
Dataset Preparation

Extract relevant fields from the big_bench_audio dataset and convert them to JSON format with base64-encoded audio strings. Process audio files (WAV, MP3, FLAC, Opus, or PCM16 formats) and structure data with id, category, official_answer, and audio_base64 fields.

03
Eval Configuration

Create evals by defining the expected data structure and testing criteria. Save examples to a file and upload to the API. Evals consist of two parts: the Eval (defining data structure and grader) and the Run (executing the evaluation).

04
Data Source Configuration

Configure the data source based on the compiled dataset with proper formatting for the Evals API.

05
Testing Criteria

Set up grader configuration using a score_model grader that compares the official answer with sampled model response and outputs a score of 0 or 1. Alternatively, use a string_check grader to compare text transcripts with a score between 0 and 1.

06
Eval Run

Create and execute the run by passing the eval object id, data source, and chat message input for sampling to generate model responses.

07
Poll and Display the Results

Wait for the run to finish and review results. Check the OpenAI Evals dashboard to monitor progress and view evaluation results.

08
Viewing Individual Output Items

Examine individual output items to see detailed results. Review the structure of output items as specified in the API documentation.

Overview

Evals API: Audio Inputs

What it does

A technical cookbook showing how to configure OpenAI's Evals API to evaluate models on audio inputs without prior transcription, using the big_bench_audio dataset for reasoning tasks.

How it connects

Use when you need to evaluate audio-capable models on native audio workflows, configure data sources with base64-encoded audio files, or set up graders that score audio responses or their text transcripts against reference answers.

Source README

Evals API: Audio Inputs

This cookbook demonstrates how to use OpenAI's Evals framework for audio-based tasks. Leveraging the Evals API, we will grade model-generated responses to an audio message and prompt by using sampling to generate model responses and model grading to score the model responses against the output audio and reference answer. Note that grading will be on audio outputs from the sampled response.

Before audio support was added, to evaluate audio conversations, they first needed to be transcribed to text. Now you can use the original audio and get samples from the model in audio as well. This more accurately represents workflows such as a customer support scenario where both the user and the agent are using audio. For grading, we will use an audio model to grade the audio response with a model grader. We could alternatively, or in combination, use the text transcript from the sampled audio and leverage the existing suite of text graders.

In this example, we will evaluate how well our model can:

  1. Generate appropriate responses to user prompts about an audio message
  2. Align with reference answers that represent high-quality responses

Installing Dependencies + Setup

Dataset Preparation

We use the big_bench_audio dataset that is hosted on Hugging Face. Big Bench Audio is an audio version of a subset of Big Bench Hard questions. The dataset can be used for evaluating the reasoning capabilities of models that support audio input. It contains an audio clip describing a logic problem, a category, and an official answer.

We extract the relevant fields and put them in a JSON-like format to pass in as a data source in the Evals API. Input audio data must be in the form of a base64-encoded string. We process the data in the audio file and convert it to base64.

Note: Audio models currently support WAV, MP3, FLAC, Opus, or PCM16 formats. See audio inputs for details.

If you print the data source list, each item should be of a similar form to:

{
  "item": {
    "id": 0
    "category": "formal_fallacies"
    "official_answer": "invalid"
    "audio_base64": "UklGRjrODwBXQVZFZm10IBAAAAABAAEAIlYAAESsA..."
  }
}

Eval Configuration

Now that we have our data source and task, we will create our evals. For the OpenAI Evals API docs, visit API docs.

Since audio inputs are large, we need to save the examples to a file and upload it to the API.

Evals have two parts: the "Eval" and the "Run". In the "Eval" we define the expected structure of the data and the testing criteria (grader).

Data Source Configuration

Based on the data that we have compiled, our data source configuration is as follows:

Testing Criteria

For our testing criteria, we set up our grader configuration. In this example, we use a score_model grader that takes in the official answer and sampled model response (in the sample namespace), and then outputs a score of 0 or 1 based on whether the model response matches the official answer. The response contains both audio and the text transcript of the audio. We will use the audio in the grader. For more information on graders, visit API Grader docs.

Getting both the data and the grader right is key for an effective evaluation. You will likely want to iteratively refine the prompts for your graders.

Alternatively we could use a string_check grader that takes in the official answer and sampled model response (in the sample namespace), and then outputs a score between 0 and 1 based on if the model response contains the reference answer. The response contains both audio and the text transcript of the audio. We will use the text transcript in the grader.

grader_config = {
  "type": "string_check",
  "name": "String check grader",
  "input": "{{sample.output_text}}",
  "reference": "{{item.official_answer}}",
  "operation": "ilike"
}

Now, we create the eval object.

Eval Run

To create the run, we pass in the eval object id, the data source (i.e., the data we compiled earlier), and the chat message input we will use for sampling to generate the model response.

Here's the sampling message input we'll use for this example.

We now kick off an eval run.

Poll and Display the Results

When the run finishes, we can take a look at the result. You can also check your organization's OpenAI Evals dashboard to see the progress and results.

Viewing Individual Output Items

To see a full output item, we can do the following. The structure of an output item is specified in the API docs here.

Conclusion

In this cookbook, we covered a workflow for evaluating native audio inputs to a model using the OpenAI Evals API. We demonstrated using a score model grader to grade the audio response.

Next steps

  • Convert this example to your own use case.
  • If you have large audio clips, try using the uploads API for support up to 8 GB.
  • Navigate to the Evals dashboard to visualize the outputs and get insights into the performance of the eval.

Step 1: Installing Dependencies + Setup

Install required dependencies and set up the environment for using OpenAI's Evals framework with audio inputs.

Step 2: Dataset Preparation

Extract relevant fields from the big_bench_audio dataset and convert them to JSON format with base64-encoded audio strings. Process audio files (WAV, MP3, FLAC, Opus, or PCM16 formats) and structure data with id, category, official_answer, and audio_base64 fields.

Step 3: Eval Configuration

Create evals by defining the expected data structure and testing criteria. Save examples to a file and upload to the API. Evals consist of two parts: the Eval (defining data structure and grader) and the Run (executing the evaluation).

Step 4: Data Source Configuration

Configure the data source based on the compiled dataset with proper formatting for the Evals API.

Step 5: Testing Criteria

Set up grader configuration using a score_model grader that compares the official answer with sampled model response and outputs a score of 0 or 1. Alternatively, use a string_check grader to compare text transcripts with a score between 0 and 1.

Step 6: Eval Run

Create and execute the run by passing the eval object id, data source, and chat message input for sampling to generate model responses.

Step 7: Poll and Display the Results

Wait for the run to finish and review results. Check the OpenAI Evals dashboard to monitor progress and view evaluation results.

Step 8: Viewing Individual Output Items

Examine individual output items to see detailed results. Review the structure of output items as specified in the API documentation.

Discussion

Questions & comments · 0

Sign In Sign in to leave a comment.