Evaluations Example: Push Notifications Summarizer Prompt Regression,

Evals are task oriented and iterative, they're the best way to check how your LLM integration is doing and improve it.

In the following eval, we are going to focus on the task of detecting if my prompt change is a regression.

Our use-case is:

I have an llm integration that takes a list of push notifications and summarizes them into a single condensed statement.
I want to detect if a prompt change regresses the behavior

Evals structure

Evals have two parts, the "Eval" and the "Run". An "Eval" holds the configuration for your testing criteria and the structure of the data for your "Runs". An Eval can have many runs that are evaluated by your testing criteria.

Use-case

We're testing the following integration, a push notifications summary, which takes in multiple push notifications and collapses them into a single one, this is a chat completions call.

Setting up your eval

An Eval holds the configuration that is shared across multiple Runs, it has two components:

Data source configuration data_source_config - the schema (columns) that your future Runs conform to.
- The data_source_config uses JSON Schema to define what variables are available in the Eval.
Testing Criteria testing_criteria - How you'll determine if your integration is working for each row of your data source.

For this use-case, we want to test if the push notification summary completion is good, so we'll set-up our eval with this in mind.

This data_source_config defines what variables are available throughout the eval.

This item schema:

{
  "properties": {
    "notifications": {
      "title": "Notifications",
      "type": "string"
    }
  },
  "required": ["notifications"],
  "title": "PushNotifications",
  "type": "object"
}

Means that we'll have the variable {{item.notifications}} available in our eval.

"include_sample_schema": True
Mean's that we'll have the variable {{sample.output_text}} available in our eval.

Now, we'll use those variables to set up our test criteria.

The push_notification_grader is a model grader (llm-as-a-judge), which looks at the input {{item.notifications}} and the generated summary {{sample.output_text}} and labels it as "correct" or "incorrect".
We then instruct via. the "passing_labels", what constitutes a passing answer.

Note: under the hood, this uses structured outputs so that labels are always valid.

Now we'll create our eval!, and start adding data to it

Creating runs

Now that we have our eval set-up with our test_criteria, we can start to add a bunch of runs!
We'll start with some push notification data.

Our first run will be our default grader from the completions function above summarize_push_notification
We'll loop through our dataset, make completions calls, and then submit them as a run to be graded.

Now let's simulate a regression, here's our original prompt, let's simulate a developer breaking the prompt.

DEVELOPER_PROMPT = """
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
Output only the final summary, nothing else.
"""

If you view that report, you'll see that it has a score that's much lower than the baseline-run.

Congratulations, you just prevented a bug from shipping to users

Quick note:
Evals doesn't yet support the responses api natively, however, you can transform it to the completions format with the following code.

Evaluations Example: Push Notifications Summarizer Prompt Regression,

Get this prompt chain