Build an Agent Improvement Loop with Traces, Evals, and Codex

This notebook builds an improvement flywheel for an agent. We start with real traces, add human and model feedback, turn that feedback into evals, and use the resulting evidence to propose the next harness changes for Codex to implement.

You will:

Create an OpenAI Agents SDK-backed financial analyst
Run it on synthetic company data and capture traces
Add example human feedback and LLM-generated feedback from those runs
Turn that feedback into Promptfoo evals that can be rerun later
Use HALO to rank the next harness changes and write a Codex-ready handoff

In this notebook, the harness is the full contract around the model, including instructions, tools, routing, output requirements, and validation checks.

The flywheel preserves what you learn from each run. Traces show what happened, feedback explains what mattered, evals make those expectations reusable, and Codex can act on the resulting change set.

What you will build

[Figure: Agent improvement loop flywheel]

By the end, you will have:

An OpenAI Agents SDK-backed financial analyst that reviews a fictional company's diligence materials across five traced runs
Human and LLM-generated feedback over those same traces
An automatically generated Promptfoo eval suite
A Promptfoo validation gate over the current agent behavior
A HALO optimization pass over the traces, feedback, and eval results
A developer-facing handoff to Codex so it can implement the recommended harness changes

The agent supports acquisition diligence for a fictional company. It reviews financial exports, customer data, contracts, security notes, board materials, and management narratives, then answers diligence questions with citations and reviewable artifacts.

The loop writes one file that carries the work forward: the generated codex_handoff.md file under ARTIFACT_DIR. It contains the full HALO diagnosis, the ranked recommendations, the evidence behind them, and the implementation guidance Codex needs for the next harness update.

The degree of automation is up to the developer. You can use the loop to propose a reviewed change set, or connect it to a workflow that opens, merges, and deploys pull requests automatically. A common starting point is a reviewed loop, where the system proposes the change set and a developer approves the diff before merge. As the eval gate becomes more trusted, the same handoff can support deeper automation. The core workflow is the same in either case: traces plus human and model feedback become concrete harness changes instead of remaining disconnected comments.

Compared with examples that stop at traces or evals, this notebook keeps traces, reviewer judgment, generated evals, optimization, and implementation handoff inside one runnable improvement loop.

Prerequisites

Run this notebook from the repository root after installing the Python dependencies used by the example:

    python -m venv .venv
    source .venv/bin/activate
    pip install openai openai-agents halo-engine

Promptfoo runs through npx, so you also need Node.js with npx available on your path.

Set an API key before running the notebook:

    export OPENAI_API_KEY=...
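Because every step makes live calls, it can help to confirm both prerequisites before starting. The check below is a minimal sketch, not part of the notebook: `preflight` is a hypothetical helper name, and it only tests that the key is set and that `npx` resolves on the path.

```python
import os
import shutil


def preflight(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    if which("npx") is None:
        problems.append("npx was not found on PATH (install Node.js)")
    return problems


# Report anything that would make a later live cell fail.
for problem in preflight():
    print("Preflight:", problem)
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real environment.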

The example is intentionally live-only. The trace generation, model critique, eval generation, validation, and optimization steps all use fresh model outputs so the notebook demonstrates the actual loop rather than a scripted preview. The next cell exposes the model choices in one place so you can trade quality for cost by substituting cheaper models if desired.
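One way to keep the model choices in a single place is sketched here. The five constant names are the ones this notebook refers to in its next steps; the environment-variable override and the placeholder default are assumptions for illustration, not the notebook's actual values.

```python
import os

# Placeholder only: not a real model name and not the notebook's default.
PLACEHOLDER_MODEL = "your-model-name"


def stage_model(env_var, default=PLACEHOLDER_MODEL):
    """Read one stage's model from the environment, else use the default."""
    return os.environ.get(env_var, default)


AGENT_MODEL = stage_model("AGENT_MODEL")
ANALYSIS_MODEL = stage_model("ANALYSIS_MODEL")
EVAL_GENERATION_MODEL = stage_model("EVAL_GENERATION_MODEL")
JUDGE_MODEL = stage_model("JUDGE_MODEL")
HALO_MODEL = stage_model("HALO_MODEL")

MODEL_CHOICES = {
    "agent": AGENT_MODEL,
    "analysis": ANALYSIS_MODEL,
    "eval_generation": EVAL_GENERATION_MODEL,
    "judge": JUDGE_MODEL,
    "halo": HALO_MODEL,
}

for stage, model in MODEL_CHOICES.items():
    print(f"{stage:>16}: {model}")
```

Keeping one constant per stage makes it easy to use a cheaper model for high-volume steps such as judging while leaving the agent itself unchanged.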

With the default five traces, budget about 20 minutes for a full run, though model latency and network conditions will move that up or down. The longest sections are usually Step 3, which runs the traced agent calls, and Step 7, where HALO analyzes the full loop. The feedback, eval-generation, and Promptfoo cells also make live calls, but are typically shorter. Long-running cells print progress or elapsed time as they work.

Step 1. Create synthetic company data

The notebook creates fictional diligence materials for a company that might be reviewed during an acquisition. The data mixes structured exports with narrative markdown documents so the agent has to decide which sources deserve more weight.

Narrative markdown files in the synthetic data

File	Why it is included
`overview.md`	Management's top-level company summary
`product_strategy.md`	Roadmap context plus an unvalidated NRR estimate
`go_to_market.md`	Sales-motion context that should be checked against pipeline data
`board_deck.md`	A polished management narrative that can conflict with structured exports
`financials/revenue_recognition_notes.md`	Accounting context for launch-stage ARR treatment
`legal/contracts_summary.md`	Contract-level risk context
`legal/open_issues.md`	Open legal matters that should remain visible
`security/security_overview.md`	Security posture and certification wording
`sales/security_faq.md`	Sales-facing security language that may overstate the evidence
`hr/org_chart.md`	Operating context for leadership and staffing
`sales/pipeline_notes.md`	Qualitative pipeline commentary
`notes/qa_log.md`	Diligence questions and unresolved follow-ups

The example generates the synthetic company data at runtime so it stays self-contained while still giving the agent a realistic mix of structured exports and narrative documents to analyze.

Define the synthetic source files

The next collapsed cell contains the source documents used to build the fictional company data.

Materialize the synthetic data

Write the source files to disk, add a manifest, and inspect the generated dataset.
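A minimal sketch of that step, assuming the source documents are held in a dictionary keyed by relative path. The file contents, the `materialize` helper, and the manifest fields are hypothetical stand-ins rather than the notebook's actual data.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Hypothetical stand-ins for the notebook's source documents.
SOURCE_FILES = {
    "overview.md": "# Overview\nPlaceholder management summary.\n",
    "legal/open_issues.md": "# Open issues\nPlaceholder legal matters.\n",
}


def materialize(root, files):
    """Write each file under root and return a manifest describing them."""
    entries = []
    for rel_path, text in sorted(files.items()):
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
        data = text.encode("utf-8")
        entries.append({
            "path": rel_path,
            "bytes": len(data),  # size of the encoded content
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    manifest = {"file_count": len(entries), "files": entries}
    manifest_path = root / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest


data_root = Path(tempfile.mkdtemp(prefix="synthetic_company_"))
manifest = materialize(data_root, SOURCE_FILES)
print(manifest["file_count"], [entry["path"] for entry in manifest["files"]])
```

Recording a hash per file gives later steps a cheap way to confirm that a citation points at the same content the agent actually read.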

Step 2. Define the Agents SDK-backed analyst

The example agent performs acquisition diligence on a fictional SaaS company being reviewed as a possible acquisition target. The case materials contain both structured exports and management narratives. Some sources agree, some conflict, and some important claims are only partially supported. That gives us a realistic reason to improve the harness over time.

The agent answers questions for an investment team using only the supplied company data. It should prefer structured financial evidence over narrative summaries when they disagree, preserve uncertainty when evidence is missing, and leave behind artifacts that another reviewer can inspect.

The OpenAI Agents SDK provides the managed runner, sandbox execution, model settings, and tracing hooks this workflow needs. Together, the prompt, tools, routing rules, output requirements, and validation checks form the current agent harness.

Artifacts generated by the agent

Artifact	Why the agent writes it
`summary_answer.md`	The concise answer returned to the user
`investment_memo.md`	A fuller review artifact for diligence readers
`risk_register.json`	Structured risks with evidence that downstream systems can inspect
`open_questions.md`	Missing evidence or unresolved questions that should stay visible
`citations.json`	A machine-readable link from claims to source files
`evidence_table.csv`	A tabular audit trail of claims and supporting sources

These artifacts keep the work reviewable by preserving supporting evidence, unresolved questions, and required files alongside the final answer.

Failure modes to watch for

This notebook is designed to surface failures such as:

Treating management narrative as an official metric when the structured exports disagree
Reporting an unsupported NRR estimate as if finance had validated it
Collapsing parent-account concentration into a weaker legal-entity view
Saying “SOC 2 complete” when the evidence only supports Type I
Producing a polished answer while leaving citations, risk files, or evidence artifacts incomplete
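The parent-account failure is easiest to see with numbers. The sketch below uses made-up ARR rows (every name and figure is hypothetical) to show how the largest customer share changes when legal entities are rolled up to their parent.

```python
from collections import defaultdict

# Hypothetical rows: (legal_entity, parent_account, arr_usd).
ROWS = [
    ("Acme US Inc", "Acme Group", 300_000),
    ("Acme UK Ltd", "Acme Group", 250_000),
    ("Acme DE GmbH", "Acme Group", 200_000),
    ("Borealis LLC", "Borealis", 400_000),
    ("Cobalt Corp", "Cobalt", 350_000),
    ("Dune Ltd", "Dune", 500_000),
]


def top_share(rows, key_index):
    """Return (name, share of total ARR) for the largest group."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[2]
    name, amount = max(totals.items(), key=lambda item: item[1])
    return name, amount / sum(totals.values())


print(top_share(ROWS, 0))  # by legal entity: ('Dune Ltd', 0.25)
print(top_share(ROWS, 1))  # by parent account: ('Acme Group', 0.375)
```

In this example the legal-entity view reports a 25% top customer, while the parent-account view reports 37.5%, which is the stronger and more decision-relevant concentration figure.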

Define the harness schema

Start with small data structures for the model settings and promoted agent configuration. These make the harness explicit so later optimization can target more than prompt wording.
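A schema along these lines could look like the following sketch. The class names, field names, tool names, and the placeholder model are assumptions for illustration; only the six required artifact file names come from this notebook.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelSettings:
    model: str
    max_turns: int = 20  # hypothetical cap on agent turns


@dataclass(frozen=True)
class AgentConfig:
    version: str
    model_settings: ModelSettings
    instructions: str
    tools: tuple
    required_artifacts: tuple
    validation_checks: tuple = ()


config = AgentConfig(
    version="v1",
    model_settings=ModelSettings(model="your-model-name"),  # placeholder
    instructions="Answer using only the supplied company data and cite sources.",
    tools=("check_citations", "check_artifacts"),  # hypothetical tool names
    required_artifacts=(
        "summary_answer.md",
        "investment_memo.md",
        "risk_register.json",
        "open_questions.md",
        "citations.json",
        "evidence_table.csv",
    ),
    validation_checks=("citations_resolve", "artifacts_present"),
)

print(asdict(config)["version"], len(config.required_artifacts))
```

Freezing the dataclasses means a new harness revision has to be created as a new versioned object instead of being mutated in place, which keeps each promoted config comparable with the last.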

Configure instructions and policies

The system prompt states the evidence rules, the tool policy defines what the agent may read and write, and the eval metadata records which version of the harness is currently promoted.

Inspect the agent config

This compact view shows the promoted config version, the selected models, the required artifacts, and the runtime tools the agent can use.

Add validation tools

The next helpers create two local tools inside the workspace: one checks whether drafted claims cite real files in the company data, and the other verifies that the required output artifacts exist and have the expected shape. The code is hidden by default to save space, but you can expand it if you want to inspect the implementation.
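The two checks can be sketched with the standard library alone. This is not the notebook's hidden implementation: the function names, the `{"claim", "source"}` citation shape, and the status strings are assumptions.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_ARTIFACTS = (
    "summary_answer.md", "investment_memo.md", "risk_register.json",
    "open_questions.md", "citations.json", "evidence_table.csv",
)


def check_citations(citations, data_root):
    """Return cited source paths that do not resolve to a file under data_root."""
    unresolved = []
    for citation in citations:
        source = citation.get("source", "")
        if not source or not (data_root / source).is_file():
            unresolved.append(source or "<missing source>")
    return unresolved


def check_artifacts(output_dir):
    """Map each required artifact to 'ok', 'missing', 'empty', or 'invalid json'."""
    status = {}
    for name in REQUIRED_ARTIFACTS:
        path = output_dir / name
        if not path.is_file():
            status[name] = "missing"
            continue
        text = path.read_text(encoding="utf-8")
        if not text.strip():
            status[name] = "empty"
        elif name.endswith(".json"):
            try:
                json.loads(text)
                status[name] = "ok"
            except json.JSONDecodeError:
                status[name] = "invalid json"
        else:
            status[name] = "ok"
    return status


# Tiny demo workspace with placeholder content.
workspace = Path(tempfile.mkdtemp(prefix="analyst_ws_"))
data_root, output_dir = workspace / "data", workspace / "output"
(data_root / "security").mkdir(parents=True)
output_dir.mkdir()
(data_root / "security" / "security_overview.md").write_text("Placeholder.\n", encoding="utf-8")
(output_dir / "summary_answer.md").write_text("Placeholder answer.\n", encoding="utf-8")
(output_dir / "citations.json").write_text("{not json", encoding="utf-8")

citations = [
    {"claim": "Type I report exists", "source": "security/security_overview.md"},
    {"claim": "Pen test passed", "source": "security/pen_test.md"},
]
print(check_citations(citations, data_root))
print(check_artifacts(output_dir))
```

Returning structured results instead of raising lets the agent read the failures and repair its own artifacts before finishing the run.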

Build each user turn

The prompt builder adds task-specific guidance only when it is needed, such as memo formatting, separate risk categories, or strict handling for unsupported NRR claims.
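A simplified version of that idea is sketched below. Keyword matching is an assumption made to keep the example short, and the rule text and the `build_user_turn` name are hypothetical.

```python
# Each rule pairs trigger keywords with the guidance to append.
GUIDANCE_RULES = [
    (("memo",), "Format the answer as an investment memo with headed sections."),
    (("risk",), "Report each risk category separately with its own evidence."),
    (
        ("nrr", "net revenue retention"),
        "Treat any NRR figure that finance has not validated as unsupported.",
    ),
]


def build_user_turn(question):
    """Return the question, plus guidance only for the rules it triggers."""
    lowered = question.lower()
    extras = [
        guidance
        for keywords, guidance in GUIDANCE_RULES
        if any(keyword in lowered for keyword in keywords)
    ]
    if not extras:
        return question
    bullet_list = "\n".join(f"- {guidance}" for guidance in extras)
    return f"{question}\n\nAdditional guidance:\n{bullet_list}"


print(build_user_turn("How many employees does the company have?"))
print(build_user_turn("Draft a memo on NRR trends."))
```

Adding guidance conditionally keeps unrelated turns short, and it gives later optimization a specific, named rule to change when a failure traces back to missing guidance.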

Export traces for later optimization

The local exporter converts Agents SDK events into the OpenTelemetry-style JSONL that HALO can read later. It is implementation-heavy, so the code stays collapsed by default.

Configure the trace exporter

Set up the exporter object that receives Agents SDK spans and writes one JSONL line per span.

Map SDK spans into HALO-readable fields

These helpers translate each SDK span type into the attributes HALO will inspect later.

Normalize helper values

The final helpers keep IDs, timestamps, and serialized values consistent across exported spans.
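The three exporter pieces fit together roughly as in this sketch. Plain dictionaries stand in for Agents SDK span objects, and the record field names are assumptions, not HALO's actual schema.

```python
import io
import json
from datetime import datetime, timezone


def normalize_timestamp(value):
    """Return a UTC ISO 8601 string from a datetime or ISO-formatted string."""
    if isinstance(value, str):
        value = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)  # assume naive values are UTC
    return value.astimezone(timezone.utc).isoformat()


def normalize_value(value):
    """Keep scalars as-is; serialize anything else to a stable JSON string."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    return json.dumps(value, sort_keys=True, default=str)


def span_to_record(span):
    """Map one span dictionary to a flat, OpenTelemetry-style record."""
    return {
        "trace_id": str(span["trace_id"]),
        "span_id": str(span["span_id"]),
        "parent_span_id": str(span["parent_id"]) if span.get("parent_id") else None,
        "name": span["type"],
        "start_time": normalize_timestamp(span["started_at"]),
        "end_time": normalize_timestamp(span["ended_at"]),
        "attributes": {
            f"agent.{key}": normalize_value(val)
            for key, val in span.get("data", {}).items()
        },
    }


class JsonlSpanWriter:
    """Write one JSON line per span to any text stream."""

    def __init__(self, stream):
        self.stream = stream

    def export(self, spans):
        for span in spans:
            self.stream.write(json.dumps(span_to_record(span)) + "\n")


buffer = io.StringIO()
JsonlSpanWriter(buffer).export([
    {"trace_id": "trace-001", "span_id": 1, "type": "agent",
     "started_at": "2025-01-01T00:00:00Z", "ended_at": "2025-01-01T00:00:05Z",
     "data": {"name": "analyst"}},
    {"trace_id": "trace-001", "span_id": 2, "parent_id": 1, "type": "function",
     "started_at": datetime(2025, 1, 1, 0, 0, 1), "ended_at": datetime(2025, 1, 1, 0, 0, 2),
     "data": {"input": {"path": "overview.md"}}},
])
lines = buffer.getvalue().splitlines()
print(len(lines))
```

Writing to a generic stream keeps the sketch testable in memory; a real exporter would open the trace file once and append to it for the life of the run.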

Run the SDK agent

run_sdk_agent() calls the Agents SDK runner directly while handling the repeated setup around each traced run: mounting the data, attaching tracing, executing the agent, and collecting the output artifacts.

Step 3. Generate traced runs

The questions are intentionally varied so the eval suite covers several ways the agent can go wrong. The notebook runs five traces by default, which keeps the live path practical while still exercising distinct behaviors. A larger question bank remains available if you want broader coverage later.

Each run uses the async Agents SDK path and writes a real trace plus the required artifacts.

Inspect the agent artifacts

Each traced run writes the full artifact set required by the harness. The first run below shows the files the agent produced so you can inspect the answer, evidence, and open questions together.

Step 4. Generate example human feedback and model insights

This section simulates a human expert reviewing the traces after the agent runs. In a real diligence workflow, that might be the finance lead or another case expert who knows which details matter for the decision. In this example, the reviewer calls out that a parent-account rollup matters more than legal-entity concentration, that an unvalidated management NRR estimate should not become an official metric, and that “SOC 2 complete” is too vague when the evidence only supports Type I.

The model-generated insights stay separate. In a fully automated path, an LLM reviews the same traces and proposes recurring issues or missing behaviors. That extra pass improves coverage, while subject-matter expert review adds domain judgment grounded in the work itself.

Step 5. Generate Promptfoo evals from traces and feedback

The eval suite is generated dynamically by an LLM from the evidence collected so far: traced behavior, human feedback, and model-generated observations. This turns comments into tests that the next harness revision can run again later.

Promptfoo is an open-source CLI and library for evaluating and red-teaming LLM applications. In this notebook, the generated behaviors become Promptfoo test cases: each one can combine literal assertions with an LLM rubric judge, so the same gate can check both exact requirements and semantic reviewer intent.

Evals are a good place to invest manual effort from subject-matter experts and developers. A fully automated pass can propose useful evals quickly, but people should still check whether the evals are accurate, representative, and measuring the behavior that actually matters before they become part of the long-term test suite.

Step 6. Validate the current harness with Promptfoo

Promptfoo runs the generated tests against the current trace outputs. That gives the loop a snapshot of where the harness already behaves well and which expectations still fail. Promptfoo fits this role because it can combine deterministic checks for literal requirements with llm-rubric judges for semantic quality.

In this notebook, the Promptfoo gate scores existing trace outputs, and those results become part of the optimization input passed into HALO below. To validate a future harness revision, replace the trace-output provider with a provider that runs the candidate agent. Even when eval generation is automated, humans can still tighten weak evals before letting them steer repeated optimization.

Build the Promptfoo test harness

The provider serves existing trace outputs back to Promptfoo, and the test builder turns generated eval definitions into runnable Promptfoo cases.
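Both pieces can be sketched as follows. The `contains` and `llm-rubric` assertion types and the `call_api(prompt, options, context)` provider entry point follow Promptfoo's documented conventions as best understood here. The eval-definition shape, the `trace_id` variable, and the stored output are hypothetical.

```python
# Hypothetical store of answers captured from earlier traced runs.
TRACE_OUTPUTS = {
    "trace-001": "A SOC 2 Type I report is available; Type II is not evidenced.",
}

# Hypothetical generated eval definition.
EVAL_DEF = {
    "name": "soc2_wording_is_precise",
    "trace_id": "trace-001",
    "must_contain": ["Type I"],
    "rubric": "The answer does not describe SOC 2 as complete without naming the report type.",
}


def build_test(eval_def):
    """Turn one eval definition into a Promptfoo-style test case dictionary."""
    assertions = [
        {"type": "contains", "value": phrase}
        for phrase in eval_def.get("must_contain", [])
    ]
    if eval_def.get("rubric"):
        assertions.append({"type": "llm-rubric", "value": eval_def["rubric"]})
    return {
        "description": eval_def["name"],
        "vars": {"trace_id": eval_def["trace_id"]},
        "assert": assertions,
    }


def call_api(prompt, options, context):
    """Provider entry point: replay a stored trace output instead of calling a model."""
    trace_id = context["vars"]["trace_id"]
    if trace_id not in TRACE_OUTPUTS:
        return {"error": f"no stored output for {trace_id}"}
    return {"output": TRACE_OUTPUTS[trace_id]}


test_case = build_test(EVAL_DEF)
print(test_case["description"], [a["type"] for a in test_case["assert"]])
print(call_api("", {}, {"vars": test_case["vars"]}))
```

Replaying stored outputs makes the gate a measurement of the runs that already happened. Swapping `call_api` for a function that invokes the candidate agent turns the same test cases into a regression gate.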

Run the Promptfoo gate

Execute the generated suite and summarize the current harness result.
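Once the suite has run, the summary is a small aggregation. The sketch below assumes the Promptfoo results have already been flattened into rows with an eval name and a pass flag; that row shape and the example names are hypothetical.

```python
def summarize_gate(rows):
    """Aggregate row-level results into counts, a pass rate, and failing evals."""
    total = len(rows)
    failed = [row["eval"] for row in rows if not row["passed"]]
    passed = total - len(failed)
    return {
        "total": total,
        "passed": passed,
        "failed": failed,
        "pass_rate": passed / total if total else 0.0,
    }


# Hypothetical flattened results.
rows = [
    {"eval": "soc2_type_wording", "passed": False},
    {"eval": "nrr_marked_unvalidated", "passed": True},
    {"eval": "parent_account_rollup", "passed": True},
    {"eval": "artifacts_complete", "passed": False},
]

summary = summarize_gate(rows)
print(f"{summary['passed']}/{summary['total']} passed; failing: {summary['failed']}")
```

Keeping the names of failing evals in the summary, not just the rate, gives the later optimization step something concrete to trace back to feedback.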

Step 7. Run HALO and write the handoff

HALO, short for Hierarchical Agent Loop Optimization, is a methodology and Python package for improving agent harnesses from execution traces. The HALO repository describes a loop that collects traces, analyzes recurring harness-level failures, hands the resulting report to a coding agent, and repeats after the harness changes.

This is the point where the loop turns the accumulated evidence into proposed harness changes. HALO reviews the current harness together with the agent traces, human feedback, model feedback, generated evals, and Promptfoo results. It then produces a ranked set of changes for the next implementation pass.

The value of HALO here is that it reasons over the whole loop at once. It can use human judgment alongside runtime behavior and eval outcomes, then package the result as a handoff Codex can use to implement the code changes that improve the harness.

Collect the HALO inputs

Build one context object that keeps the current harness, traces, feedback, evals, and gate results together.

Attach feedback, generated evals, and eval results to the traces

Write the combined trace file that HALO will inspect. Human feedback, LLM feedback, generated eval definitions, and row-level Promptfoo results are attached to the matching runtime trace. The overall gate summary stays global because it describes the suite as a whole.
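The attachment step is essentially a join on trace ID. This sketch assumes every feedback item and eval result carries a `trace_id` field; the record shapes and the `attach_to_traces` helper are hypothetical.

```python
def attach_to_traces(traces, **sources):
    """Attach each source's items to the trace with the matching trace_id.

    Returns the enriched traces plus any items that matched no trace.
    """
    enriched = {
        trace["trace_id"]: {**trace, **{name: [] for name in sources}}
        for trace in traces
    }
    unmatched = []
    for name, items in sources.items():
        for item in items:
            target = enriched.get(item.get("trace_id"))
            if target is None:
                unmatched.append({"source": name, **item})
            else:
                target[name].append(item)
    return list(enriched.values()), unmatched


traces = [{"trace_id": "trace-001"}, {"trace_id": "trace-002"}]
combined, unmatched = attach_to_traces(
    traces,
    human_feedback=[{"trace_id": "trace-001", "note": "Use the parent-account rollup."}],
    llm_feedback=[{"trace_id": "trace-002", "note": "Citations file is incomplete."}],
    eval_results=[
        {"trace_id": "trace-001", "eval": "parent_account_rollup", "passed": False},
        {"trace_id": "trace-999", "eval": "orphan_result", "passed": True},
    ],
)

# The gate summary describes the whole suite, so it stays a separate global record.
gate_summary = {"kind": "global", "name": "promptfoo_gate_summary", "pass_rate": 0.5}
combined.append(gate_summary)

print(len(combined), [item["eval"] for item in unmatched])
```

Surfacing unmatched items is worth the extra line: a result that silently attaches to nothing would otherwise disappear from the evidence HALO sees.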

Define the HALO output prompt

This prompt tells HALO what kind of report to produce, including the sections Codex should receive in the final handoff file. You can customize it to match your company's workflow, review process, or use case.

Run HALO and format the report

HALO receives the five SDK execution traces plus two synthetic global traces: one records the current harness config, and one records the Promptfoo gate summary. With the default five runs, HALO therefore reports seven traces, two more than the agent runs created earlier.

Generate the full optimization report, save the handoff artifact, and display the highest-priority recommendations in the notebook.

Step 8. Hand the full report to Codex

HALO diagnoses and prioritizes. A coding agent or human still changes the harness.

Below is a snapshot of the full report Codex can act on: the top three recommendations plus a compact summary of what came from each feedback source. The complete codex_handoff.md file also includes the ranked changes, supporting evidence, and validation guidance for implementation.

Step 9. Close the loop

Now that the full workflow is in place, we can revisit the optimization flywheel from the top of the notebook. The same architecture supports two operating modes.

[Figure: Agent improvement loop flywheel]

[Figure: Human review gates in the loop]

The first mode is a closed loop, where new traces, human and model feedback, generated Promptfoo evals, HALO diagnosis, Codex implementation, validation, and deployment all feed the next cycle. In that mode, the handoff artifact can be written to shared storage, and a Codex automation with a heartbeat can keep checking for new handoffs, wake up when one appears, and trigger the next implementation pass automatically.
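The heartbeat pattern can be illustrated with a plain polling loop. This is a sketch of the idea only, not a Codex automation feature: a local directory stands in for shared storage, and `on_handoff` is a hypothetical callback that would start the implementation pass.

```python
import tempfile
import time
from pathlib import Path


def find_new_handoffs(handoff_dir, seen):
    """Return handoff files not seen before and remember them."""
    new = [
        path
        for path in sorted(handoff_dir.rglob("codex_handoff.md"))
        if str(path) not in seen
    ]
    seen.update(str(path) for path in new)
    return new


def heartbeat(handoff_dir, on_handoff, interval_s=30.0, max_beats=None):
    """Poll for new handoffs; max_beats=None keeps polling indefinitely."""
    seen = set()
    beats = 0
    while max_beats is None or beats < max_beats:
        for path in find_new_handoffs(handoff_dir, seen):
            on_handoff(path)
        beats += 1
        if max_beats is None or beats < max_beats:
            time.sleep(interval_s)


# Demo: one placeholder handoff, two beats, no waiting.
shared = Path(tempfile.mkdtemp(prefix="handoffs_"))
(shared / "run_001").mkdir()
(shared / "run_001" / "codex_handoff.md").write_text("# Placeholder handoff\n", encoding="utf-8")

triggered = []
heartbeat(shared, triggered.append, interval_s=0.0, max_beats=2)
print([path.parent.name for path in triggered])
```

Tracking seen paths means each handoff triggers exactly one pass even though the loop keeps polling, which matters once deployment is also automated.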

The second mode adds human gates wherever the developer wants them, including trace review, eval refinement, pull request approval, merge, and deployment.

The design choice is how much humans participate after they give feedback. Human judgment can steer a loop where agents do the execution, or humans can remain approval gates throughout the process. In both versions, human feedback stays central because it shapes what the system learns and what it changes next.

Conclusion

An agent improvement loop offers a path toward continual improvement without reducing the problem to prompt tuning alone. The full loop matters: traces capture behavior, human feedback adds judgment, evals preserve what the system should do, HALO turns the evidence into ranked harness changes, and Codex can implement the next pass.

This area is still evolving, and some of the individual components will likely change over time. The larger idea of loop engineering is the durable part: agents can improve from real behavior when feedback, testing, and implementation are connected in one loop.

Next steps

Choose the model for each stage of the loop by editing AGENT_MODEL, ANALYSIS_MODEL, EVAL_GENERATION_MODEL, JUDGE_MODEL, and HALO_MODEL near the top of the notebook.
Add your own questions to generate new traces and test the agent on behaviors the default five runs do not cover.
Decide how much of the final path should remain reviewed versus automated: you can stop at a developer-reviewed PR, or wire the handoff into a system that opens, merges, and deploys changes automatically.
Pass the generated codex_handoff.md file under ARTIFACT_DIR to Codex, inspect the harness changes it proposes, and rerun the same eval suite against the updated harness.
