Prompt Chain

Analyze Agentic System Performance Across Many Runs

Name: Analyze Agentic System Performance Across Many Runs
Availability: OnlineOnly
Author: OpenAI Cookbook

Multi-step workflow for discovering recurring behavior patterns across thousands of agentic system traces using lower-level eval labels and population-level

Works with openai

Updated 17 days ago

Version 1.0.0

Models

claude

Why it matters

Improve multi-agent systems by identifying recurring failure patterns across thousands of agent runs. This asset helps you move from raw trace data to actionable insights for both technical and business stakeholders.

Outcomes

What it gets done

Generate or collect traced agent runs.

Run lower-level evaluations on individual agent runs.

Aggregate findings into compact documents for pattern discovery.

Identify and drill into high-impact recurring behavior patterns.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-macroevalsforagenticsystems | bash

Steps

Steps in the chain

Generate or collect many traced agent runs

Generate or collect many traced agent runs from your agentic system. In this example, a synthetic EV order workflow produces 1,000 simulated customer-order interactions, 992 of which have complete trace bundles containing order setup, active world events, specialist handoffs, tool/function activity, review artifacts, and terminal state.

Run lower-level evals on each completed run

Run lower-level evals on each completed run using a tool like Promptfoo. Grade individual agents, handoffs, tools, and completed runs against rubrics that check: final decision quality, policy correctness, specialist routing, market drift awareness, and review appropriateness. Produce eval_finding labels for each trace.

Turn each trace into a compact document

Convert each trace into a compact document representation. Extract key information from the trace bundles including case_type (business situation), run_outcome (how the run ended), eval_finding (local symptoms), and other structured data to enable downstream analysis and clustering.

Discover recurring behavior patterns across the population

Analyze the population of compact trace documents to discover recurring behavior_pattern labels. Look for which kinds of problems repeat, where they concentrate, and which patterns emerge across many traces. Move from individual trace inspection to population-level pattern identification.

Drill into one high-impact pattern to find inspection points

Select one high-impact behavior pattern and drill into specific examples to understand where a human should inspect the system next. Use the pattern analysis to identify which part of the agent workflow needs attention and what concrete improvements could address the recurring issue.

Overview

Macro Evals for Agentic Systems

What it does

This prompt chain implements a macro-evaluation workflow for multi-agent systems. It separates lower-level evals (grading individual agents, handoffs, tools, and runs) from macro evals (discovering recurring patterns across many traces). The notebook walks through a synthetic EV order workflow with specialist agents handling pricing, compliance, supply, factory routing, scheduling, and release decisions, using precomputed traces and saved Promptfoo labels to demonstrate population-level pattern discovery.

How it connects

Use this when your agentic system fails in ways that span multiple agents or workflow steps: handoffs happening too late, specialist agents missing signals across many runs, or review processes triggering for the wrong cases. It's designed for AI engineering teams who need to move from thousands of agent events to a small number of actionable patterns that both technical and business stakeholders can understand and address.

Source README

Macro Evals for Agentic Systems

When an agentic system fails, the problem is often larger than a single bad response. A handoff may happen too late, a specialist agent may miss the same signal across many runs, or a review process may trigger for the wrong class of cases. To improve the system, teams need to see recurring behavior across the whole population of traces.

This cookbook walks through a macro-eval workflow for a multi-agent system. We use a synthetic EV order workflow where specialist agents handle pricing, compliance, supply, factory routing, scheduling, and release decisions while market and operational conditions change.

The notebook uses precomputed synthetic traces and saved lower-level eval labels, so you can run the full workflow without an OpenAI API key.

You will learn how to:

Generate or collect many traced agent runs;
Run lower-level evals on each completed run;
Turn each trace into a compact document;
Discover recurring behavior patterns across the population; and
Drill into one high-impact pattern to find where a human should inspect the system next.

The goal is not to build a perfect taxonomy of every trace. The goal is to show how an AI engineering team can move from thousands of agent events to a small number of patterns that are understandable by both technical and business stakeholders.

End-to-End Agentic System Map

The key idea is that the notebook evaluates a saved agentic system, not a generic chat transcript. Scenario inputs drive an orchestrated specialist swarm, the runtime emits trace bundles, saved Promptfoo labels are joined to normalized traces, and the macro-eval layer turns that evidence into pattern and diagnosis views.

1. Why Macro Evals?

Evals are how AI teams measure whether a system is working. For a simple model call, an eval might compare one output against a rubric or reference answer. For an agentic system, we also need to evaluate whether the system used the right tools, delegated to the right specialist, paused for review when risk was high, and stayed grounded in the business context.

Multi-agent systems make this harder because a final answer is only the last event in a longer workflow. A release recommendation can look plausible while the trace reveals that the pricing agent ignored an incentive, the supply agent missed a stockout, or the orchestrator routed around a required review step.

This notebook separates the problem into two levels:

Lower-level evals grade individual agents, handoffs, tools, and completed runs. In this example, Promptfoo stands in for that agent-level eval layer by grading whether a run handled final decision quality, policy correctness, specialist routing, market drift, and review appropriateness.
Macro evals look across many lower-level findings. They ask: which kinds of problems repeat, where do they concentrate, and which part of the agent workflow should we inspect first?

We will use four reader-facing labels throughout the cookbook:

case_type: the generated business situation, such as a clean order, a validation block, a supplier substitution, or a pricing exception.
run_outcome: how the run ended, such as completed, awaiting review, blocked, or failed.
eval_finding: the lower-level signal that says what seemed wrong or risky.
behavior_pattern: the recurring pattern discovered across many traces.

A useful mental model is: case_type is the setup, run_outcome is the ending, eval_finding is the local symptom, and behavior_pattern is the population-level pattern.
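As a rough illustration of that mental model, the four labels can be carried as one small record per trace. This is a sketch only: the `TraceLabels` class and the `trace_id` field are hypothetical names, not taken from the notebook.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TraceLabels:
    """The four reader-facing labels attached to one trace (hypothetical record)."""
    trace_id: str                           # hypothetical identifier field
    case_type: str                          # the setup, known from the generator
    run_outcome: str                        # the ending, known from the runtime
    eval_finding: Optional[str]             # the local symptom; None if nothing was flagged
    behavior_pattern: Optional[str] = None  # the population-level pattern, unknown until discovery


# Before discovery, only the first three labels are known.
before = TraceLabels(
    "trace-001", "pricing_exception_compound", "awaiting_review", "final_decision_quality"
)

# After discovery, the same trace also carries a behavior_pattern.
after = replace(before, behavior_pattern="pricing drift")
print(after)
```

Keeping `behavior_pattern` optional makes the ordering explicit: it is the only label that does not exist until the population has been analyzed.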

Setup and Data Materials

Install the dependencies, then load the offline dataset bundled with this example. The saved Promptfoo labels are part of the local data folder, so this notebook does not require a separate Promptfoo config, Promptfoo run artifact, or OpenAI API key.

Expected files:

data/trace_results.jsonl
data/run_summary.json
data/trace_bundles.zip
data/eval_labels.jsonl

trace_bundles.zip is expanded automatically into a local cache the first time the notebook runs. A full SQLite trace snapshot can be placed at data/trace_snapshot.sqlite for optional enrichment, but it is not required for the end-to-end workflow.

If your data lives outside the example folder, set MACRO_EVALS_DATA_ROOT to that directory. If labels live separately, set MACRO_EVALS_LABELS_PATH.
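One way such environment overrides could be resolved is sketched below. The environment variable names and file names come from the text above; the functions `resolve_data_paths` and `missing_files`, and the default root of `data`, are assumptions for illustration and may not match how the notebook does it.

```python
import os
from pathlib import Path

# File names listed under "Expected files" above.
EXPECTED_FILES = (
    "trace_results.jsonl",
    "run_summary.json",
    "trace_bundles.zip",
    "eval_labels.jsonl",
)


def resolve_data_paths(default_root="data", env=None):
    """Return a mapping of expected file name to resolved path (hypothetical helper)."""
    env = os.environ if env is None else env
    root = Path(env.get("MACRO_EVALS_DATA_ROOT", default_root))
    paths = {name: root / name for name in EXPECTED_FILES}
    # Labels may live outside the data root.
    labels_override = env.get("MACRO_EVALS_LABELS_PATH")
    if labels_override:
        paths["eval_labels.jsonl"] = Path(labels_override)
    return paths


def missing_files(paths):
    """List the expected files that do not exist on disk."""
    return [name for name, path in paths.items() if not path.exists()]


paths = resolve_data_paths()
print("missing:", missing_files(paths))
```

Passing `env` explicitly makes the resolution easy to check without touching the real environment.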

2. The Simulation: Automotive Orders in a Changing World

The simulated business is an EV order and post-configuration workflow. A customer has chosen a vehicle configuration, and the company needs to decide whether the order can proceed as-is, needs adjustment, should be rerouted, requires substitution, or should pause for review.

The simulation includes the kinds of constraints that make real automotive fulfillment hard:

component availability and supplier substitution;
factory capacity and production scheduling;
pricing exceptions, promotions, and incentives;
tariffs and dated market signals;
regional compliance constraints;
customer clarification and escalation paths;
release review thresholds for risky or ambiguous cases.

The agent swarm is organized around those business responsibilities. An orchestrator receives the order and current environment, then delegates to specialists such as validation, supply risk, procurement planning, capacity balancing, factory routing, market intelligence, pricing, compliance, customer communications, and release review.

This maps naturally to the OpenAI Agents SDK. In the SDK, an agent is the core unit of a workflow: it packages a model, instructions, and optional runtime behavior such as tools, handoffs, guardrails, and structured outputs. The simulation follows that pattern:

specialized agents package the instructions and tools for one part of the decision;
handoffs let the orchestrator delegate to another specialist agent instead of stuffing every responsibility into one prompt;
function tools expose order data, environment signals, and approval markers through structured inputs and outputs;
guardrails and review thresholds represent validation, blocking, and human-review flows for risky or ambiguous cases;
structured outputs make downstream grading and aggregation possible;
traces preserve structured records of model calls, tool calls, handoffs, guardrails, and custom spans for debugging and macro-level analysis.

The lower-level evals later in the notebook are grounded in this simulation story. If the case type says there is a supplier substitution under tariff pressure, the trace should show awareness of supply, policy, market, and review risk. If the case type is clean, unnecessary escalation is itself a finding.

What One Bundle Represents

In this notebook, a bundle is the evidence packet for one simulated customer-order interaction.

Imagine one customer has configured an EV and the business needs to decide what to do next. The swarm receives that order plus the current operating world: supply constraints, factory capacity, promotions, incentives, tariffs, competitor pressure, and review thresholds. The agents then route work through specialists and produce a final state. The bundle is everything we need to audit that interaction afterward.

A bundle matters because macro evals need the workflow evidence behind the final answer. They need to know which agents were consulted, which tools were called, which environment signals were active, whether review was required, and where the workflow changed direction. With that evidence, we can move from "what happened in this one run?" to "which workflow patterns repeat across many runs?"

How to Read the Dataset Profile

The dataset profile tells us the scale and texture of the simulated business process we are about to evaluate. Each analyzable row is one customer-order interaction with enough trace evidence to reconstruct what the agent swarm saw, which specialists it consulted, and how the workflow ended.

The generated batch asked the swarm to handle 1,000 synthetic order interactions. For 992 of them, we have a bundle: a complete evidence packet for grading the run, building a trace document, clustering it with similar runs, and inspecting the agent path afterward. That gives us a large enough population to look for repeated behavior while still retaining the trace detail needed to explain individual examples.

The typical bundle is a structured record of a simulated business process: the order setup, active world events, specialist handoffs, tool/function activity, review artifacts, and terminal state. That is why this dataset can support macro evals. We can evaluate individual decisions, and we can also ask whether repeated workflow patterns emerge across hundreds of rich interaction records.

What `case_type` Means

A case_type is a scenario label from the generator. It describes the kind of business situation the swarm was asked to handle before any eval or clustering has happened.

Examples from this dataset include:

clean_simple: a relatively straightforward order where the correct behavior is usually to complete without unnecessary review.
validation_block_simple: a configuration has a validation issue, so the swarm should avoid overconfident release.
supplier_substitution_compound: component availability creates a substitution decision, often with downstream routing and scheduling implications.
pricing_exception_compound: pricing, incentives, or margin policy need specialist review.
regional_compliance_compound: the order needs regional policy or compliance handling.

The bar chart above is a coverage view. It shows whether the simulation produced enough variety to evaluate the swarm under different business pressures. A strong macro-eval dataset needs both ordinary cases and pressure cases, because recurring patterns only become meaningful when we can compare behavior across different setups.

The table above turns the generator labels into business language. This is important because the same later pattern can mean different things depending on the setup. A fulfillment reroute in a supplier substitution case may be desirable. The same reroute in a clean case might be unnecessary complexity.

3. Lower-Level Agent Evals with Promptfoo

A mature multi-agent system should not rely on final-answer inspection alone. Each launched agent usually needs its own evals: did this specialist use the right evidence, call the right tools, respect policy, hand off at the right time, and produce an output that the rest of the system can trust?

Promptfoo plays that role in this notebook. It represents the lower-level eval layer that would normally live beside the agents in a production workflow. In a live system, some of these checks might run online, some might run asynchronously, and some might be sampled for human review. The implementation detail matters less than the contract: every run should carry eval signals that say what looked correct, risky, or wrong at the agent and workflow level.

In this dataset, Promptfoo grades completed traces with questions that mirror the kinds of agent-level evals teams build for real systems:

Did the final decision follow from the active issue?
Did the system respect pricing, tariff, incentive, regional, and policy constraints?
Did the orchestrator activate the specialists implied by the case?
Did the run respond to dated market signals rather than acting as if the world were static?
Was review or escalation proportionate to the risk?

These checks produce eval_finding. A failing lower-level eval is a local signal: one trace, one rubric, one symptom. The macro-eval sections later ask what those local signals become at population scale. Do they scatter randomly, or do they reveal repeated behavior patterns that point to a specific agent, handoff, tool, or business policy?
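To make the step from rubric results to `eval_finding` concrete, here is a minimal sketch. It assumes each trace has a mapping of rubric name to pass/fail; that shape is an assumption, not the Promptfoo output format. Only `final_decision_quality` appears as an identifier in this cookbook; `policy_correctness` is an assumed spelling.

```python
from collections import Counter


def eval_findings(rubric_results):
    """Return the names of the rubrics a trace failed, in sorted order."""
    return sorted(name for name, passed in rubric_results.items() if not passed)


# Hypothetical graded traces: rubric name -> True (passed) or False (failed).
graded = {
    "trace-001": {"final_decision_quality": False, "policy_correctness": True},
    "trace-002": {"final_decision_quality": True, "policy_correctness": True},
    "trace-003": {"final_decision_quality": False, "policy_correctness": False},
}

findings = {trace_id: eval_findings(results) for trace_id, results in graded.items()}

# Scorecard: traces that passed every rubric versus traces with at least one failure.
passed_all = sum(1 for failed in findings.values() if not failed)
failed_any = len(findings) - passed_all

# How often each rubric fails across the population.
failed_counts = Counter(name for failed in findings.values() for name in failed)

print(passed_all, failed_any, failed_counts.most_common())
```

Each entry in `findings` is still a local signal about one trace; the counts are the first, simplest population-level view of those signals.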

Interpreting the Promptfoo Outputs

The pie chart is the simplest lower-level scorecard: it separates traces that passed all rubric checks from traces with at least one failed check. In a live multi-agent system, this is the kind of layer that tells us which runs deserve attention before we do any macro analysis.

The failed-rubric bar chart answers a more useful question: which kinds of agent or workflow concerns appear most often? For this dataset, final decision quality is the dominant lower-level finding, while policy correctness, review appropriateness, and market-drift awareness also appear. That suggests the macro layer should focus less on isolated syntax errors and more on repeated decision-making patterns.

This is the bridge to macro evals. Promptfoo gives each trace local eval labels. The rest of the notebook asks how those labels organize across the whole population. In other words: agent-level evals create the raw signal, and macro evals turn many such signals into a map of recurring system behavior.

4. Build the Analysis Dataset

Now we normalize the run bundles into two analysis tables:

traces_df: one row per run, with metadata, outcome, findings, and document fields.
events_df: one row per normalized trace event, including handoffs, tool calls, status events, model responses, and review/finding markers.
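The normalization into those two tables can be sketched as follows. The bundle layout and field names (`events`, `type`, `agent`, `event_count`) are hypothetical, and the sketch returns plain dictionaries rather than data frames; the resulting rows could be handed to a data frame constructor to obtain tables shaped like traces_df and events_df.

```python
def normalize_bundle(bundle):
    """Split one bundle into a trace-level row and a list of event-level rows."""
    trace_row = {
        "trace_id": bundle["trace_id"],
        "case_type": bundle["case_type"],
        "run_outcome": bundle["run_outcome"],
        "eval_findings": list(bundle.get("eval_findings", [])),
        "event_count": len(bundle["events"]),
    }
    event_rows = [
        {
            "trace_id": bundle["trace_id"],
            "seq": seq,                # position of the event within the run
            "event_type": event["type"],
            "agent": event.get("agent"),  # None when no agent is attached to the event
        }
        for seq, event in enumerate(bundle["events"])
    ]
    return trace_row, event_rows


# A tiny hypothetical bundle.
bundle = {
    "trace_id": "trace-001",
    "case_type": "supplier_substitution_compound",
    "run_outcome": "awaiting_review",
    "eval_findings": ["final_decision_quality"],
    "events": [
        {"type": "handoff", "agent": "orchestrator"},
        {"type": "tool_call", "agent": "supply_risk"},
        {"type": "review_finding"},
    ],
}

trace_row, event_rows = normalize_bundle(bundle)
print(trace_row)
print(event_rows)
```

Both tables share `trace_id`, so event-level evidence can always be joined back to the run it came from.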

We also build trace documents. The document is the modeling object that the BERTopic-style section will cluster. The notebook uses doc_structured_summary because it is compact but still preserves scenario, routing, state transitions, handoffs, findings, and terminal state.

The public analysis path is:

case_type -> run_outcome -> eval_finding -> behavior_pattern

The first three labels are known before clustering. The fourth appears after discovery.

Interpreting the Analysis Profile

The profile above confirms that the lower-level eval layer has joined onto the normalized trace population. The important numbers are:

normalized traces: the bundle-backed population we can inspect;
normalized events: the event-level evidence behind those traces;
case types: the scenario coverage produced by the generator; and
Promptfoo-failed or review/failure-bearing traces: the lower-level signal population most relevant for macro discovery.

The exact counts depend on whether you run the full notebook or set MACRO_EVALS_TRACE_LIMIT for a smoke test. The sample rows show how the notebook simplifies the raw data into readable labels. For example, a pricing_exception_compound case that ends in review with a final_decision_quality finding is now easy to follow through the rest of the notebook.

What the First Sankey Plot Teaches

The first Sankey plot is a pre-clustering view. It shows how generated case types flow into run outcomes and lower-level findings.

Read it from left to right:

wide bands from a case_type mean that scenario appears often;
splits into run_outcome show whether that scenario tends to complete, pause, block, or fail;
final bands into eval_finding show which lower-level rubric or runtime signal is attached.

This is already useful for a team. A business reader can ask whether the simulation produces the right kinds of pressure. An AI engineer can ask whether certain scenarios overproduce the same low-level finding. What it cannot yet answer is whether those findings represent the same underlying behavior pattern. That is why we cluster next.
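The band widths in a Sankey view like this are just counts of adjacent label pairs. The sketch below shows that counting step under the assumption that each trace is a dictionary with one value per label; the function name `sankey_links` is hypothetical, and `policy_correctness` and `none` are assumed finding values.

```python
from collections import Counter


def sankey_links(rows, stages):
    """Count how many traces flow between each pair of adjacent stages."""
    links = Counter()
    for row in rows:
        for left, right in zip(stages, stages[1:]):
            source = f"{left}={row[left]}"
            target = f"{right}={row[right]}"
            links[(source, target)] += 1
    return links


rows = [
    {"case_type": "clean_simple", "run_outcome": "completed", "eval_finding": "none"},
    {"case_type": "pricing_exception_compound", "run_outcome": "awaiting_review",
     "eval_finding": "final_decision_quality"},
    {"case_type": "pricing_exception_compound", "run_outcome": "awaiting_review",
     "eval_finding": "policy_correctness"},
]

links = sankey_links(rows, ["case_type", "run_outcome", "eval_finding"])
for (source, target), width in links.most_common():
    print(f"{source} -> {target}: {width}")
```

Prefixing each node with its stage name keeps values from different stages distinct even if they happen to share a spelling.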

Trace Documents: Turning Runs into Comparable Text

A raw agent trace is too detailed to cluster directly. It may contain hundreds of events, long model responses, tool payloads, and repeated status updates. The document construction step compresses each run into a comparable view while preserving the information that matters for macro evals.

A good trace document includes:

the business setup (case_type, selected route, active environment signals);
the run outcome and severity;
the important handoffs and specialist activations;
review/finding markers;
a short state-transition digest.

The document view defines what the clustering algorithm is allowed to notice. Including agent handoffs helps the macro eval discover routing patterns. Including environment signals helps it discover market-drift failures. The quality of the trace document is therefore part of the evaluation design, not a mechanical cleanup step.
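A minimal sketch of document construction follows. The input field names, the separator, and the function `build_trace_document` are assumptions for illustration; the actual format of the notebook's doc_structured_summary is not shown here.

```python
def build_trace_document(trace, max_transitions=4):
    """Compress one trace into a single comparable line of text."""
    signals = ", ".join(trace.get("environment_signals", [])) or "none"
    handoffs = " > ".join(trace.get("handoffs", [])) or "none"
    findings = ", ".join(trace.get("eval_findings", [])) or "none"
    # Keep only the last few transitions as a short digest.
    transitions = " > ".join(trace.get("state_transitions", [])[-max_transitions:]) or "none"
    parts = [
        f"case_type: {trace['case_type']}",
        f"signals: {signals}",
        f"outcome: {trace['run_outcome']} (severity {trace.get('severity', 'unknown')})",
        f"handoffs: {handoffs}",
        f"findings: {findings}",
        f"transitions: {transitions}",
    ]
    return " | ".join(parts)


# Hypothetical trace summary.
trace = {
    "case_type": "supplier_substitution_compound",
    "environment_signals": ["tariff_change", "component_stockout"],
    "run_outcome": "awaiting_review",
    "severity": "high",
    "handoffs": ["orchestrator", "supply_risk", "release_review"],
    "eval_findings": ["final_decision_quality"],
    "state_transitions": ["received", "validated", "substitution_proposed",
                          "rerouted", "awaiting_review"],
}

doc = build_trace_document(trace)
print(doc)
```

Every field included or left out here changes what a clustering step can notice, which is why the field list deserves as much attention as the clustering settings.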

Failure and Focus-Event Glossary

The raw traces contain many event-level labels. To keep the notebook readable, we do not ask readers to learn all of them. The AgentTrace-style section mainly cares about focus events: visible moments in the trace where the system appears to require attention.

In this simulation, common focus-event signals include:

review finding: a review or validation surface recorded an issue.
review required or awaiting_review: the run paused because the simulated business process required review.
failed or blocked: the run reached a degraded terminal state.
triage route or reroute signals: the workflow changed direction because another owner needed to act.
tool warnings or policy markers: a structured tool output indicated risk, ambiguity, or a policy constraint.

These are observability signals, not proof of root cause. They tell the diagnosis pass where to anchor its backward search.

The example document above is a single trace rendered as a compact narrative. It is intentionally denser than prose but easier to compare than a raw event log. When you adapt this workflow, spend real time on document construction. Better documents usually produce more useful behavior patterns than more complicated clustering settings.

5. BERTopic-Style Discovery

The discovery pass is inspired by the BERTopic family of methods. The high-level idea is modular:

Represent each trace document as a vector. If the document for trace $i$ is $d_i$, the embedding model produces a vector $e_i = f(d_i)$.
Reduce the vector geometry. A reducer such as UMAP maps $e_i$ to a lower-dimensional point $z_i$ that preserves useful local neighborhoods.
Cluster dense regions. A density clusterer such as HDBSCAN groups nearby points and can mark outliers as noise.
Represent each topic. For each cluster, compute terms that distinguish that cluster from the rest of the corpus.

This notebook uses a helper module to keep the implementation compact, but the major mathematical ideas are visible:

A trace belongs to a cluster $k$ when its document vector is near other trace vectors in the reduced space.
A term is useful for labeling cluster $k$ when it appears often inside $k$ and less often elsewhere.
A simple class-aware term score is:

$$
score(t, k) = tf(t, k) \times \log\left(\frac{1 + N}{1 + df(t)}\right)
$$

where $tf(t, k)$ is the term frequency for term $t$ inside cluster $k$, $df(t)$ is the number of clusters/documents where the term appears, and $N$ is the comparison population size. The exact implementation can vary, but the intuition is stable: labels should describe what makes a cluster distinctive.
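The term score can be computed with a few lines of standard-library Python. This sketch makes two assumptions that the text leaves open: $df(t)$ is counted at the cluster level, and $N$ is the number of clusters. The helper names are hypothetical, and the notebook's helper module may differ.

```python
import math
from collections import Counter


def cluster_term_scores(cluster_docs):
    """cluster_docs maps a cluster id to a list of tokenized documents."""
    # tf(t, k): how often term t appears inside cluster k.
    tf = {
        k: Counter(token for doc in docs for token in doc)
        for k, docs in cluster_docs.items()
    }
    # df(t): number of clusters in which term t appears (assumption).
    df = Counter(token for counts in tf.values() for token in counts)
    n = len(cluster_docs)  # N: number of clusters (assumption)
    return {
        k: {t: count * math.log((1 + n) / (1 + df[t])) for t, count in counts.items()}
        for k, counts in tf.items()
    }


def top_terms(scores, k, limit=3):
    """Return the highest-scoring terms for cluster k."""
    ranked = sorted(scores[k].items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:limit]]


cluster_docs = {
    0: [["reroute", "supplier", "order"], ["reroute", "order"]],
    1: [["pricing", "incentive", "order"]],
}

scores = cluster_term_scores(cluster_docs)
print(top_terms(scores, 0))
```

Under these assumptions a term that appears in every cluster, such as `order` here, scores zero, which matches the intuition that labels should describe what makes a cluster distinctive.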

Finally, we rank patterns by a triage metric:

$$
\text{impact\_score}(k) = \text{prevalence\_share}(k) \times \text{severity\_weighted\_prevalence}(k)
$$

This is not a universal risk formula. It is a practical prioritization score: a pattern matters more when it is both common and severe.
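A small worked example of the triage metric is sketched below. The text does not define how severity is weighted, so two things here are assumptions: the numeric `SEVERITY_WEIGHT` values, and the reading of severity-weighted prevalence as a pattern's share of the total severity weight.

```python
from collections import defaultdict

# Assumed severity weights; the real weighting scheme is not specified in the text.
SEVERITY_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 3.0}


def impact_scores(traces):
    """Rank behavior patterns by prevalence share times severity-weighted prevalence."""
    total_count = len(traces)
    total_weight = sum(SEVERITY_WEIGHT[t["severity"]] for t in traces)
    counts = defaultdict(int)
    weights = defaultdict(float)
    for t in traces:
        counts[t["behavior_pattern"]] += 1
        weights[t["behavior_pattern"]] += SEVERITY_WEIGHT[t["severity"]]
    scores = {}
    for pattern in counts:
        prevalence_share = counts[pattern] / total_count
        severity_weighted_prevalence = weights[pattern] / total_weight  # assumed definition
        scores[pattern] = prevalence_share * severity_weighted_prevalence
    # Highest-impact pattern first.
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


traces = [
    {"behavior_pattern": "fulfillment reroute", "severity": "low"},
    {"behavior_pattern": "fulfillment reroute", "severity": "low"},
    {"behavior_pattern": "fulfillment reroute", "severity": "high"},
    {"behavior_pattern": "compliance gate", "severity": "high"},
]

scores = impact_scores(traces)
print(scores)
```

In this toy population the reroute pattern covers 3 of 4 traces and 5 of 8 severity units, so it scores 0.75 × 0.625, while the single high-severity compliance trace scores 0.25 × 0.375.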

Interpreting the Discovery Output

The discovery summary tells us how many traces were clustered and how many non-noise behavior patterns were recovered. We run discovery on the traces that already have failure, review, runtime, or Promptfoo signals because this cookbook is focused on where the system needs attention.

The topic table should be read as a triage board:

trace_count and prevalence tell us how often the pattern appears.
severity_weighted_prevalence tells us how severe the traces in the pattern tend to be.
impact_score combines prevalence and severity into a ranking.
dominant_owner is a heuristic owner label, not an assignment.
keywords_text gives the terms that made the pattern distinctive.

A high-impact behavior pattern is not automatically a defect. It is where a reviewer should look first because the pattern is frequent, consequential, or both.

The table above makes the impact score concrete. A pattern can rank highly because it appears in many traces, because it concentrates higher-severity traces, or both. In the automotive configurator setting, that helps separate a rare edge case from a recurring operational behavior that may affect many orders.

Interpreting the Leaderboard and Trace Map

The leaderboard is the portfolio view: it ranks behavior patterns by weighted impact. Use it to decide which pattern deserves human attention first.

The trace map is a geometry view: each point is one trace document, placed near traces with similar text. Nearby points often share routing paths, findings, or environment signals. The colors show discovered behavior patterns. Treat the map as diagnostic, not exact geography. Its job is to reveal clusters and outliers that might be hard to see in tables.

In this dataset, patterns such as fulfillment reroutes, pricing drift, compliance gates, and wheel/trim mismatches correspond to recognizable business problems. This is the first moment where lower-level evals become a macro-level story: repeated agent behaviors are visible across many cases.

Interpreting the Case-Type Heatmap

The heatmap asks: which generated scenarios concentrate which behavior patterns?

Read each row as a behavior pattern and each column as a case type. Darker or larger values mean that a pattern is more common within that scenario slice. This helps distinguish expected behavior from surprising behavior. For example, a fulfillment reroute pattern may be expected in supplier substitution or capacity cases, but more suspicious in clean cases.

The table beneath the chart connects patterns back to lower-level findings. If one behavior pattern repeatedly carries final_decision_quality findings, an AI engineer may inspect prompts, tool schemas, or handoff policies. If the pattern maps to a business-specific case type, a product or operations stakeholder can ask whether the simulated policy itself is realistic.

Comparing Patterns Across Slices

This step appears here because BERTopic-style discovery has just given every risky trace a behavior_pattern. Before clustering, we could compare generated cases, outcomes, and lower-level eval findings. After clustering, we can ask a more useful macro-eval question: where does each discovered behavior pattern concentrate?

This comparison is not a core equation from the BERTopic paper. It is a simple cohort-analysis layer we apply after topic assignment. The idea is to compare two shares:

overall pattern share: among all clustered traces, what share belongs to this behavior pattern?
slice pattern share: within one slice, such as case_type = supplier_substitution_compound, what share belongs to this behavior pattern?

Then we compute:

$$
\text{lift} = \frac{\text{slice pattern share}}{\text{overall pattern share}}
$$

A lift of 1.0 means the pattern appears in that slice about as often as it appears overall. A lift above 1.0 means the pattern is concentrated in that slice. A lift below 1.0 means it is less common there.

In macro evals, this is the bridge from discovery to action. A behavior pattern is easier to investigate when we can say where it shows up: a generated scenario, an agent version, an orchestration mode, a market regime, or a review state.

The table above should be read as an investigation queue. It highlights behavior patterns that are unusually concentrated in a given case_type, while requiring at least a small number of supporting traces. For example, if a routing pattern is much more common inside supplier-substitution cases than it is overall, that suggests the team should inspect supplier tools, procurement handoffs, and fulfillment policy before treating the pattern as a generic system issue.
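The lift calculation, including a minimum-support filter, can be sketched as follows. The function name `pattern_lift`, the `min_support` parameter, and the row layout are hypothetical; the notebook's own threshold is not stated.

```python
def pattern_lift(rows, pattern, slice_key, slice_value, min_support=1):
    """Lift of a behavior pattern inside one slice, or None if it cannot be computed."""
    overall_share = sum(r["behavior_pattern"] == pattern for r in rows) / len(rows)
    in_slice = [r for r in rows if r[slice_key] == slice_value]
    hits = sum(r["behavior_pattern"] == pattern for r in in_slice)
    # Skip empty slices, absent patterns, and slices with too few supporting traces.
    if not in_slice or overall_share == 0 or hits < min_support:
        return None
    slice_share = hits / len(in_slice)
    return slice_share / overall_share


rows = [
    {"case_type": "supplier_substitution_compound", "behavior_pattern": "fulfillment reroute"},
    {"case_type": "supplier_substitution_compound", "behavior_pattern": "fulfillment reroute"},
    {"case_type": "clean_simple", "behavior_pattern": "pricing drift"},
    {"case_type": "clean_simple", "behavior_pattern": "pricing drift"},
]

lift = pattern_lift(rows, "fulfillment reroute", "case_type", "supplier_substitution_compound")
print(lift)
```

Here the reroute pattern makes up half of all traces but every trace in the supplier-substitution slice, giving a lift of 2.0. Returning `None` below the support threshold keeps tiny slices from producing large but meaningless lifts.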

What the Second Sankey Plot Adds

The second Sankey plot adds the discovered behavior_pattern as the final step:

case_type -> run_outcome -> eval_finding -> behavior_pattern

This is the key macro-eval move. The first three labels describe the generated setup, the ending, and the local symptom. The final label shows whether those local symptoms collapse into a smaller number of repeated operating patterns.

A business stakeholder can use this to ask, "Which order scenarios are creating the most repeated operational issues?" An AI engineer can use it to ask, "Which lower-level findings are actually the same routing or decision pattern?" Both views are useful, and the Sankey gives them a shared map.

6. AgentTrace-Style Diagnosis

Discovery tells us what repeats. Diagnosis asks where to inspect first.

For a selected behavior pattern, we reconstruct a lightweight execution graph:

$$
G = (V, E)
$$

where each node $v \in V$ is a normalized trace event and each edge $e \in E$ links events through temporal order, handoffs, tool calls, and nearby execution context. We then choose a focus event, also called an anchor. In this simulation, a focus event is usually a review/finding marker, failure-related status, or late-stage decision event.
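A lightweight version of that graph and anchor selection might look like the sketch below. The event fields (`id`, `type`, `parent`) and the `FOCUS_TYPES` names are assumptions for illustration, not the notebook's schema.

```python
# Assumed event type names that count as focus events.
FOCUS_TYPES = {"review_finding", "awaiting_review", "failed", "blocked"}


def build_predecessors(events):
    """Map each event id to the ids of events with an edge leading into it."""
    preds = {event["id"]: set() for event in events}
    # Temporal edges: each event is linked to the one immediately before it.
    for previous, current in zip(events, events[1:]):
        preds[current["id"]].add(previous["id"])
    # Explicit edges: a handoff or tool call that produced this event.
    for event in events:
        parent = event.get("parent")
        if parent in preds:
            preds[event["id"]].add(parent)
    return preds


def pick_anchor(events):
    """Choose the latest focus event as the anchor, or None if there is none."""
    for event in reversed(events):
        if event["type"] in FOCUS_TYPES:
            return event["id"]
    return None


# A hypothetical four-event run, already in temporal order.
events = [
    {"id": "e1", "type": "handoff"},
    {"id": "e2", "type": "tool_call", "parent": "e1"},
    {"id": "e3", "type": "model_response"},
    {"id": "e4", "type": "review_finding", "parent": "e2"},
]

preds = build_predecessors(events)
anchor = pick_anchor(events)
print(anchor, sorted(preds[anchor]))
```

Storing predecessors rather than successors is a deliberate choice for this use: diagnosis starts at the anchor and looks upstream.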

From that anchor, the diagnosis pass walks backward through the graph and scores upstream suspects. The score is intentionally explainable:

$$
\text{suspect\_score} =
0.4 \cdot \text{proximity} +
0.3 \cdot \text{frequency} +
0.2 \cdot \text{bridge} +
0.1 \cdot \text{role}
$$

Proximity rewards events close to the focus event.
Frequency rewards events that recur across sampled traces in the same behavior pattern.
Bridge rewards events that connect parts of the execution graph.
Role rewards events whose agent/tool role is plausibly related to the finding.

This is not proof of causality. It is a way to turn "this pattern is important" into "inspect these agents, tools, handoffs, or review policies first."
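The backward walk and weighted score can be sketched together. The weights come from the formula above; everything else is an assumption: proximity is taken as 1 / (1 + hops from the anchor), and the frequency, bridge, and role values are hypothetical inputs already scaled to the range 0 to 1.

```python
from collections import deque

WEIGHTS = {"proximity": 0.4, "frequency": 0.3, "bridge": 0.2, "role": 0.1}


def hops_upstream(preds, anchor):
    """Breadth-first walk backward from the anchor; returns hop counts per event."""
    hops = {anchor: 0}
    queue = deque([anchor])
    while queue:
        node = queue.popleft()
        for parent in preds[node]:
            if parent not in hops:
                hops[parent] = hops[node] + 1
                queue.append(parent)
    return hops


def suspect_score(features):
    """Weighted sum of the four explainable components."""
    return sum(weight * features[name] for name, weight in WEIGHTS.items())


# Hypothetical predecessor map: event id -> ids of events leading into it.
preds = {"e4": {"e3", "e2"}, "e3": {"e2"}, "e2": {"e1"}, "e1": set()}
hops = hops_upstream(preds, "e4")

# Hypothetical per-event components other than proximity.
observed = {
    "e1": {"frequency": 0.9, "bridge": 0.0, "role": 0.5},
    "e2": {"frequency": 0.9, "bridge": 1.0, "role": 1.0},
    "e3": {"frequency": 0.2, "bridge": 0.0, "role": 0.5},
}

ranked = sorted(
    (
        (event_id, suspect_score({"proximity": 1 / (1 + hops[event_id]), **features}))
        for event_id, features in observed.items()
    ),
    key=lambda pair: pair[1],
    reverse=True,
)
for event_id, score in ranked:
    print(event_id, round(score, 3))
```

In this toy run `e2` ranks first because it is one hop from the anchor, recurs often, and bridges two parts of the graph. `e1` outranks `e3` despite being farther away, which shows how frequency can outweigh proximity under these weights.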

Interpreting the Suspect Leaderboard

The focus behavior pattern is selected by impact score. Depending on whether you run the full dataset or a smaller smoke-test sample, the selected pattern may differ, but the reading process is the same: start from the highest-impact pattern, then inspect which review signals, handoffs, tools, or specialist responses repeatedly appear near the focus event.

A row such as eval/review signal: review finding is not meant to be mysterious. In the simulation, a review finding is a structured marker produced when a specialist or review surface observes an issue that should affect the order decision. It is the endpoint we trace backward from: the moment when the workflow has accumulated enough evidence to say, "this order needs attention."

The more actionable rows are the operational events around that marker: handoffs involving the orchestrator, tool/function calls by monitor or orchestration agents, procurement-planning handoffs, and related specialist responses. Those are the places a human should inspect after the macro eval points to this pattern.

From a technical perspective, this output tells an AI engineer where to inspect:

agent instructions and tool contracts for the named agents;
handoff rules around the repeated transition;
whether the system is recording review markers too early or too late;
whether a tool output is being ignored or over-weighted.

From a business perspective, the same output tells an operations or product stakeholder which business function appears to own the pattern. A fulfillment, pricing, compliance, or clarification pattern should bring the corresponding business owners into the next review, not only the prompt engineer.

Interpreting the Story Strip and Swimlane

The story strip is a path into the focus event. In this run, the focus event is a review/eval checkpoint inside the selected behavior pattern. It is the simulated business process saying that this order has an issue worth reviewing.

The swimlane view keeps more temporal structure. It shows the surrounding window of events by lane or agent, with the focus event highlighted. Read it left to right as the order moves through the swarm:

Which specialist handled the order before the review finding?
Did the orchestrator route through the right business owners at the right time?
Did a tool/function call surface information that should have changed the order decision?
Did review happen before the workflow committed to a release, reroute, pricing, compliance, or customer-communication recommendation?

For a business reader, the diagram turns an abstract pattern into an operational story: this set of orders repeatedly reaches a similar review point. For an AI engineer, it narrows the next debugging step: inspect the orchestration and handoff path around the review marker, especially the first non-review suspect highlighted in the diagnosis summary.
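For readers who want to rebuild a swimlane-like view from raw events, the windowing idea can be sketched as below. It assumes each event is a dict with hypothetical `agent` and `name` fields and that the focus event's position is already known. The text rendering is only a stand-in for the plotted swimlane.

```python
from collections import defaultdict


def swimlane_window(events, focus_index, radius=3):
    """Group events within `radius` steps of the focus event by agent lane."""
    start = max(0, focus_index - radius)
    end = min(len(events), focus_index + radius + 1)
    lanes = defaultdict(list)
    for step in range(start, end):
        event = events[step]
        # Keep the step number so temporal order survives the grouping.
        lanes[event["agent"]].append((step, event["name"], step == focus_index))
    return dict(lanes)


def render(lanes):
    """Render one text row per lane; the focus event is marked with '*'."""
    lines = []
    for agent in sorted(lanes):
        cells = [
            f"[{step}] {name}{' *' if is_focus else ''}"
            for step, name, is_focus in lanes[agent]
        ]
        lines.append(f"{agent:<22}| " + "  ".join(cells))
    return "\n".join(lines)


# Illustrative events only; names are assumptions, not taken from the dataset.
events = [
    {"agent": "orchestrator", "name": "route_order"},
    {"agent": "procurement_planning", "name": "plan_supply"},
    {"agent": "monitor", "name": "check_market"},
    {"agent": "reviewer", "name": "review finding"},
    {"agent": "orchestrator", "name": "recommend_reroute"},
]

lanes = swimlane_window(events, focus_index=3, radius=1)
print(render(lanes))
```

Reading the step numbers left to right across lanes answers the ordering questions above, for example whether the review step came before or after a recommendation was committed.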

7. What We Learned and What to Do Next

The cookbook has moved through four levels of evidence:

Simulation setup: the business generated EV order cases under changing supply, pricing, capacity, compliance, and market conditions.
Lower-level evals: Promptfoo supplied the agent/workflow-level eval signals: decision quality, policy correctness, routing, market awareness, and review appropriateness.
Macro discovery: BERTopic-style clustering grouped lower-level findings into recurring behavior patterns and ranked them by impact.
Trace diagnosis: AgentTrace-style graph analysis inspected one high-impact pattern and identified repeated upstream suspects.

This approach scales by directing human attention toward the patterns that are both frequent and consequential. Instead of reading hundreds of traces from top to bottom, a reviewer can start from a behavior pattern, inspect representative examples, and decide which agent, tool, handoff, or business rule deserves follow-up.
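One simple way to express "frequent and consequential" is to multiply a pattern's share of runs by an average severity. The sketch below does that under stated assumptions: the severity weights, the `run_outcome` values, and the pattern names are hypothetical, and the cookbook's own impact ranking may be computed differently.

```python
from collections import defaultdict

# Hypothetical severity weights per run outcome (an assumption for illustration).
SEVERITY = {"completed": 0.0, "needs_review": 0.5, "failed": 1.0}


def rank_patterns(rows):
    """Rank behavior patterns by (share of runs) x (mean severity)."""
    total = len(rows)
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["behavior_pattern"]].append(SEVERITY[row["run_outcome"]])
    ranked = []
    for pattern, severities in grouped.items():
        frequency = len(severities) / total
        mean_severity = sum(severities) / len(severities)
        ranked.append((pattern, round(frequency * mean_severity, 3)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Illustrative rows only.
rows = [
    {"behavior_pattern": "late_review", "run_outcome": "failed"},
    {"behavior_pattern": "late_review", "run_outcome": "failed"},
    {"behavior_pattern": "late_review", "run_outcome": "needs_review"},
    {"behavior_pattern": "routing_loop", "run_outcome": "needs_review"},
    {"behavior_pattern": "routing_loop", "run_outcome": "completed"},
    {"behavior_pattern": "pricing_drift", "run_outcome": "failed"},
]

ranked = rank_patterns(rows)
```

In this toy data the rarer `pricing_drift` pattern outranks the more frequent `routing_loop` because its outcomes are more severe, which is the kind of trade-off a reviewer would want surfaced.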

Practical next steps for an AI engineering team:

promote the clearest lower-level eval failures into a regression suite;
review a small sample of automated grades to calibrate rubric strictness;
track behavior patterns by model version, prompt version, and orchestration mode;
assign business owners to the highest-impact patterns;
inspect the top suspect agents, tools, and handoffs before changing the system.
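Tracking patterns by model version, prompt version, and orchestration mode amounts to computing a pattern rate per slice. A minimal sketch, assuming each analysis row carries hypothetical `model_version`, `prompt_version`, `orchestration_mode`, and `behavior_pattern` fields:

```python
from collections import Counter

SLICE_KEYS = ("model_version", "prompt_version", "orchestration_mode")


def pattern_rates_by_slice(rows, slice_keys=SLICE_KEYS):
    """Return {(slice_id, pattern): share of that slice's runs showing the pattern}."""
    totals = Counter()
    hits = Counter()
    for row in rows:
        slice_id = tuple(row[key] for key in slice_keys)
        totals[slice_id] += 1
        if row.get("behavior_pattern"):  # None means no pattern was assigned
            hits[(slice_id, row["behavior_pattern"])] += 1
    return {key: hits[key] / totals[key[0]] for key in hits}


# Illustrative rows only; version labels are made up.
rows_by_version = [
    {"model_version": "m1", "prompt_version": "p1", "orchestration_mode": "swarm",
     "behavior_pattern": "late_review"},
    {"model_version": "m1", "prompt_version": "p1", "orchestration_mode": "swarm",
     "behavior_pattern": None},
    {"model_version": "m1", "prompt_version": "p2", "orchestration_mode": "swarm",
     "behavior_pattern": "late_review"},
    {"model_version": "m1", "prompt_version": "p2", "orchestration_mode": "swarm",
     "behavior_pattern": "late_review"},
]

rates = pattern_rates_by_slice(rows_by_version)
```

Comparing rates rather than raw counts keeps slices with different run volumes comparable when a prompt or model change is rolled out.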

Practical next steps for a business stakeholder:

decide whether the generated case types match the real operating risks;
check whether high-impact patterns correspond to important customer or operational outcomes;
validate whether review thresholds are producing the intended business behavior;
use the Sankey and heatmap views to prioritize which scenarios need better policy or process design.

The core lesson is simple: agent-level evals tell us which local behaviors look risky, while macro evals tell us what those risks become at system scale.

Further reading

OpenAI Agents SDK documentation: Agents SDK use cases
BERTopic documentation: overview and algorithm walkthrough
Promptfoo documentation: OpenAI Agents provider and chat conversations
AgentTrace paper: Causal Graph Tracing for Root Cause Analysis in Deployed Multi-Agent Systems

Step 1: Generate or collect many traced agent runs

Generate or collect many traced agent runs from your agentic system. In this example, a synthetic EV order workflow produces 1,000 simulated customer-order interactions with complete trace bundles containing order setup, active world events, specialist handoffs, tool/function activity, review artifacts, and terminal state.

Step 2: Run lower-level evals on each completed run

Run lower-level evals on each completed run using a tool like Promptfoo. Grade individual agents, handoffs, tools, and completed runs against rubrics that check: final decision quality, policy correctness, specialist routing, market drift awareness, and review appropriateness. Produce eval_finding labels for each trace.

Step 3: Turn each trace into a compact document

Convert each trace into a compact document representation. Extract key information from the trace bundles including case_type (business situation), run_outcome (how the run ended), eval_finding (local symptoms), and other structured data to enable downstream analysis and clustering.
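As a rough illustration of this step, the sketch below flattens one trace bundle into a short text document. The field names `case_type`, `run_outcome`, and `eval_finding` come from the description above. The bundle layout (`eval_findings` and `handoffs` as lists) and the sample values are assumptions made for the example.

```python
def to_trace_document(bundle):
    """Flatten a trace bundle (hypothetical dict layout) into a compact text document."""
    # Sort findings so identical runs produce identical documents.
    findings = ", ".join(sorted(bundle.get("eval_findings", []))) or "none"
    handoffs = " -> ".join(bundle.get("handoffs", [])) or "none"
    return (
        f"case_type: {bundle['case_type']}\n"
        f"run_outcome: {bundle['run_outcome']}\n"
        f"eval_finding: {findings}\n"
        f"handoffs: {handoffs}"
    )


# Illustrative bundle only.
bundle = {
    "case_type": "supply_shock",
    "run_outcome": "needs_review",
    "eval_findings": ["review_too_late", "missed_market_drift"],
    "handoffs": ["orchestrator", "procurement_planning", "reviewer"],
}

doc = to_trace_document(bundle)
print(doc)
```

Keeping the document to a fixed set of labeled lines makes runs comparable as text, which is what the later clustering step relies on.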

Step 4: Discover recurring behavior patterns across the population

Analyze the population of compact trace documents to discover recurring behavior_pattern labels. Look for which kinds of problems repeat, where they concentrate, and which patterns emerge across many traces. Move from individual trace inspection to population-level pattern identification.
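To make the idea of population-level grouping concrete, here is a deliberately simple stand-in that clusters compact documents by token overlap. It is not BERTopic and not the cookbook's method; it swaps in greedy Jaccard-similarity grouping so the example stays dependency-free. The threshold and the sample documents are assumptions.

```python
def tokens(doc):
    """Split a compact document into a set of lowercase tokens."""
    return set(doc.lower().replace(":", " ").replace(",", " ").split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_clusters(docs, threshold=0.5):
    """Assign each doc to the first cluster whose seed doc is similar enough."""
    clusters = []  # each cluster: {"seed": token set, "members": [doc indices]}
    for index, doc in enumerate(docs):
        doc_tokens = tokens(doc)
        for cluster in clusters:
            if jaccard(doc_tokens, cluster["seed"]) >= threshold:
                cluster["members"].append(index)
                break
        else:
            # No existing cluster matched, so this doc seeds a new one.
            clusters.append({"seed": doc_tokens, "members": [index]})
    # Largest clusters first: these are the candidate recurring patterns.
    return sorted((c["members"] for c in clusters), key=len, reverse=True)


# Illustrative documents only.
docs = [
    "case_type: supply_shock eval_finding: late_review run_outcome: failed",
    "case_type: supply_shock eval_finding: late_review run_outcome: needs_review",
    "case_type: price_change eval_finding: wrong_routing run_outcome: completed",
    "case_type: supply_shock eval_finding: late_review run_outcome: failed",
]

clusters = greedy_clusters(docs)
```

Each resulting group would then receive a `behavior_pattern` label. An embedding-based topic model would typically replace the token-overlap step at real scale.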

Step 5: Drill into one high-impact pattern to find inspection points

Select one high-impact behavior pattern and drill into specific examples to understand where a human should inspect the system next. Use the pattern analysis to identify which part of the agent workflow needs attention and what concrete improvements could address the recurring issue.
