Build iterative repair loops with Codex

This cookbook is about closed-loop agent workflows: agents that produce an output, validate it, and use the feedback to improve the next pass.

We'll explore a documentation reliability workflow that detects, repairs, and validates stale or broken API and SDK examples. The worked example uses intentionally stale notebooks adapted from this Cookbook repository.

We'll build this agent loop with Codex. Codex reviews the current state, applies focused changes, runs validation, and repeats when the feedback shows remaining issues.

The notebook task is only the example. The pattern applies wherever agent output can be measured with trustworthy feedback.

The workflow has three phases:

Review: inspect the current artifact and return structured findings without editing files.
Repair: apply focused edits to a copied artifact using the findings and the latest validation feedback.
Validate: run the relevant checks and report what still needs work.

Validation closes the loop. The repaired notebook has to satisfy the checks that matter, and any remaining issues become the next repair input.
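The three phases can be sketched as one pass in plain Python. This is a minimal sketch, not the notebook's implementation: `review`, `repair`, `validate`, and `run_pass` are hypothetical stand-ins for the Codex-backed steps, and the string checks exist only so the example runs without the CLI.

```python
from typing import Optional


def review(artifact: str) -> list[str]:
    """Hypothetical reviewer: return findings without editing the artifact."""
    return ["uses old_api()"] if "old_api()" in artifact else []


def repair(artifact: str, findings: list[str], feedback: list[str]) -> str:
    """Hypothetical repairer: return an edited copy; the input is left untouched."""
    return artifact.replace("old_api()", "new_api()")


def validate(artifact: str) -> list[str]:
    """Hypothetical validator: return the remaining delta (an empty list means pass)."""
    return ["old_api() is still called"] if "old_api()" in artifact else []


def run_pass(artifact: str, feedback: Optional[list[str]] = None) -> tuple[str, list[str]]:
    """Run one review -> repair -> validate pass and return (repaired, remaining_delta)."""
    findings = review(artifact)
    repaired = repair(artifact, findings, feedback or [])
    return repaired, validate(repaired)
```

Whatever `validate` returns becomes the `feedback` argument of the next call, which is what closes the loop.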

Figure: Codex iterative repair loop for technical documentation

Setup

This notebook uses Codex CLI in headless mode, so the repair steps can run from Python cells instead of a chat UI. The first code cell installs the CLI; if you already have it, you can skip that cell.

Before you run the live repair loop, set OPENAI_API_KEY in your environment.

The notebook defaults to a fast repair model so the full example can finish in a reasonable amount of time. To experiment with a different model, set REPAIR_MODEL before you start. The install cell pins a known Codex CLI version for reproducibility; update that version intentionally when you want newer CLI behavior.
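A small pre-flight check can catch a missing key before any live call is made. The sketch below assumes only the two environment variable names mentioned above; `load_settings` is a hypothetical helper, and the default model is passed in by the caller because this text does not name the notebook's actual default.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str], default_model: str) -> dict:
    """Read the settings the repair loop needs from an environment mapping."""
    if not env.get("OPENAI_API_KEY"):
        raise RuntimeError("Set OPENAI_API_KEY before running the live repair loop.")
    # REPAIR_MODEL is optional; fall back to the default the caller supplies.
    return {"model": env.get("REPAIR_MODEL") or default_model}


# Usage (the model name here is a placeholder, not the notebook's default):
# settings = load_settings(os.environ, default_model="<your-fast-model>")
```

Passing the environment in as a mapping keeps the check easy to test without touching the real process environment.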

Load the sample artifacts

The cells below load the three companion notebooks and summarize the metadata that drives the repair loop.

The samples are small on purpose. They run quickly, but they still exercise the architecture: review finds substantive issues, repair makes focused edits, and validation produces feedback for the next pass.

If you download this notebook by itself, also download the companion data/docs/ folder and place it next to the notebook before running the cells below. The code expects those sample notebooks to be available locally.
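As a rough illustration of the loading step, the sketch below reads every `.ipynb` file in a folder and counts its cells. The function name and the summary fields are assumptions made for this example; the notebook's own loader may track different metadata.

```python
import json
from pathlib import Path


def summarize_notebooks(docs_dir: Path) -> list[dict]:
    """Summarize each .ipynb file in docs_dir by name and cell counts."""
    if not docs_dir.is_dir():
        raise FileNotFoundError(f"Expected sample notebooks in {docs_dir}")
    summaries = []
    for path in sorted(docs_dir.glob("*.ipynb")):
        # A notebook file is JSON with a top-level "cells" list.
        cells = json.loads(path.read_text(encoding="utf-8")).get("cells", [])
        summaries.append({
            "name": path.stem,
            "code_cells": sum(c.get("cell_type") == "code" for c in cells),
            "markdown_cells": sum(c.get("cell_type") == "markdown" for c in cells),
        })
    return summaries


# Usage, assuming the companion folder sits next to the notebook:
# for row in summarize_notebooks(Path("data/docs")):
#     print(row)
```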

In this example, validation executes each repaired notebook end to end. In another domain, validation might be a unit test, policy check, schema validator, simulation, or human approval step. The important part is that failures become structured feedback instead of a dead end.

Define business rules and issue taxonomy

Before asking Codex to review or repair an artifact, give it a small shared contract. That keeps the loop focused on the issues that matter, instead of asking the model to infer every product and style rule from scratch.

The rules below define what "good" means for these example notebooks: current API patterns, clear setup, runnable local samples, and preservation of the original teaching goal. In another workflow, this contract would describe that domain's source of truth.
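One way to hold such a contract is a short rule list plus a closed set of issue types, rendered into text that both the review and repair prompts can include. In this sketch, only `deprecated_api` comes from the sample record shown later in this guide; the other issue types, the rule wording, and `render_contract` are illustrative assumptions.

```python
from enum import Enum


class IssueType(str, Enum):
    DEPRECATED_API = "deprecated_api"            # appears in the sample record.json
    UNCLEAR_SETUP = "unclear_setup"              # hypothetical
    BROKEN_LOCAL_SAMPLE = "broken_local_sample"  # hypothetical
    LOST_TEACHING_GOAL = "lost_teaching_goal"    # hypothetical


# Paraphrased from the four goals named above; the exact wording is an assumption.
BUSINESS_RULES = [
    "Use current API patterns.",
    "Keep setup instructions clear for a fresh reader.",
    "Keep local samples runnable.",
    "Preserve the original teaching goal.",
]


def render_contract() -> str:
    """Render the shared contract as text for a review or repair prompt."""
    rules = "\n".join(f"- {rule}" for rule in BUSINESS_RULES)
    issues = ", ".join(issue.value for issue in IssueType)
    return f"Business rules:\n{rules}\nAllowed issue types: {issues}"
```

A closed taxonomy also makes later aggregation simpler, because findings from different passes use the same labels.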

Define structured outputs

Each phase returns structured data so the next phase has something concrete to use.

Review returns findings. Repair returns a change summary and the path to the updated artifact. Validation returns the remaining delta for the next pass. With structured handoffs, the loop is easier to debug, rerun, and adapt to other artifact types.
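These handoffs could be modeled with dataclasses. The field names below mirror the sample record.json shown later in this guide; the class names are assumptions, and the notebook itself may use JSON schemas or another typing approach.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Finding:
    issue_type: str
    severity: str


@dataclass
class RepairResult:
    changes_made: list[str]
    updated_artifact_path: str


@dataclass
class ValidationResult:
    passed: bool
    remaining_delta: list[str] = field(default_factory=list)


@dataclass
class IterationRecord:
    """Everything one loop attempt hands to the next."""
    review: list[Finding]
    repair: RepairResult
    validation: ValidationResult


# asdict() converts nested dataclasses into plain dicts that json.dumps can write.
# record_dict = asdict(IterationRecord(review=[...], repair=..., validation=...))
```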

Review phase

The review phase reads the artifact and returns structured findings. It does not run validation and it does not edit files. That separation keeps the first step focused: identify likely problems before changing anything.

We send the review prompt to codex exec with a JSON schema. The schema keeps the result machine-readable, so later cells can pass findings directly into the repair prompt instead of scraping prose from a previous answer.
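A headless review call might look like the sketch below. The schema shape, file names, and helper functions are assumptions made for illustration. The `--sandbox`, `--output-schema`, and `--output-last-message` flags are written from memory of the Codex CLI, so confirm them with `codex exec --help` for the version you have pinned.

```python
import json
import subprocess
from pathlib import Path

# Hypothetical schema: a list of findings with the two fields used in record.json.
REVIEW_SCHEMA = {
    "type": "object",
    "properties": {
        "findings": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "issue_type": {"type": "string"},
                    "severity": {"type": "string"},
                },
                "required": ["issue_type", "severity"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["findings"],
    "additionalProperties": False,
}


def build_review_command(prompt: str, schema_path: Path, output_path: Path) -> list[str]:
    """Build the argv for a read-only review; flag names are assumptions to verify."""
    return [
        "codex", "exec",
        "--sandbox", "read-only",                    # review must not edit files
        "--output-schema", str(schema_path),         # constrain the final message
        "--output-last-message", str(output_path),   # write the final message to a file
        prompt,
    ]


def run_review(prompt: str, work_dir: Path) -> dict:
    """Write the schema, run the CLI, and parse the structured findings."""
    schema_path = work_dir / "review_schema.json"
    output_path = work_dir / "review_output.json"
    schema_path.write_text(json.dumps(REVIEW_SCHEMA), encoding="utf-8")
    subprocess.run(build_review_command(prompt, schema_path, output_path), check=True)
    return json.loads(output_path.read_text(encoding="utf-8"))
```

Building the command separately from running it makes the invocation easy to log and to test without a live call.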

Repair phase

The repair phase gets the current artifact, review findings, business rules, and any validation feedback from the previous pass. The prompt gets more specific as the loop learns.

Codex edits a copy inside the iteration directory and returns a short summary of what changed. The loop does not assume the edit worked; validation decides that in the next step.
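The copy-then-edit step can be sketched as follows. The directory layout follows the `iteration_N/<sample_name>/` convention and the `updated.ipynb` file name from the sample record later in this guide; the helper names and the prompt wording are assumptions.

```python
import shutil
from pathlib import Path


def prepare_repair_copy(source: Path, runs_dir: Path, iteration: int, sample_name: str) -> Path:
    """Copy the artifact into its iteration directory so the original is never edited."""
    target_dir = runs_dir / f"iteration_{iteration}" / sample_name
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source, target_dir / "updated.ipynb"))


def build_repair_prompt(artifact: Path, findings: list[dict],
                        rules: list[str], feedback: list[str]) -> str:
    """Assemble the repair prompt; feedback is empty on the first pass."""
    lines = [f"Edit only this file: {artifact}", "Business rules:"]
    lines += [f"- {rule}" for rule in rules]
    lines.append("Review findings:")
    lines += [f"- [{finding['severity']}] {finding['issue_type']}" for finding in findings]
    if feedback:
        # Later passes add observed failures, which makes the prompt more specific.
        lines.append("Validation feedback from the previous pass:")
        lines += [f"- {item}" for item in feedback]
    lines.append("Return a short summary of what changed.")
    return "\n".join(lines)
```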

Validation phase

Validation works like a small eval. We define the behavior we want as a rubric, run the relevant check, and ask a judge to score the result against that rubric.

For the documentation example, execution comes first. Many notebook problems only appear at runtime: a missing import, a stale file path, a cell that depends on an old API response, or setup guidance that was clear to the author but not to a fresh reader.

If validation fails, the failure becomes evidence for the next repair pass. This keeps the next repair grounded in observed behavior, not just what looked right in the diff.
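To show how an execution failure can become structured feedback, the sketch below runs a notebook's code cells in a fresh Python process and reports the last error line. This is a simplified stand-in, not the notebook's validator: it does not start a Jupyter kernel, so IPython magics and shell escapes would fail, and it skips the judge step entirely. `validate_notebook` is a hypothetical name.

```python
import json
import subprocess
import sys
from pathlib import Path


def validate_notebook(path: Path, timeout: int = 120) -> dict:
    """Run the notebook's code cells top to bottom and report failures as data."""
    cells = json.loads(path.read_text(encoding="utf-8")).get("cells", [])
    sources = []
    for cell in cells:
        if cell.get("cell_type") == "code":
            src = cell.get("source", "")
            # "source" may be a single string or a list of line strings.
            sources.append(src if isinstance(src, str) else "".join(src))
    script = "\n\n".join(sources)
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True, text=True, timeout=timeout, cwd=path.parent,
        )
    except subprocess.TimeoutExpired:
        return {"passed": False, "remaining_delta": [f"Execution exceeded {timeout}s."]}
    if proc.returncode == 0:
        return {"passed": True, "remaining_delta": []}
    # Keep only the final traceback line so the feedback stays short.
    stderr = proc.stderr.strip()
    last_line = stderr.splitlines()[-1] if stderr else "Unknown error"
    return {"passed": False, "remaining_delta": [f"Execution failed: {last_line}"]}
```

The returned dictionary uses the same `passed` and `remaining_delta` fields as the validation section of record.json, so it can feed the next repair prompt directly.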

Save per-iteration outputs

Each iteration writes a record.json file and, for this example, a repaired notebook under CODEX_REPAIR_RUNS_DIR/iteration_N/<sample_name>/. If you do not set CODEX_REPAIR_RUNS_DIR, the notebook writes to your system temp directory so a normal repo checkout stays clean.

Those files are the audit trail. You can see what the review found, what Codex changed, whether execution passed, and what feedback carried into the next iteration.

A record.json file is the receipt for one loop attempt. It keeps the handoff between phases in one place:

{
  "review": [{"issue_type": "deprecated_api", "severity": "high"}],
  "repair": {
    "changes_made": ["Updated the notebook to use the current API pattern."],
    "updated_artifact_path": "/tmp/codex_iterative_repair_loop_outputs/iteration_1/sample/updated.ipynb"
  },
  "validation": {
    "passed": false,
    "remaining_delta": ["One setup instruction is still unclear."]
  }
}

That compact record is what lets a maintainer review the loop without reconstructing the whole run from notebook diffs and terminal logs.
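Writing the record could be as simple as the sketch below. It honors `CODEX_REPAIR_RUNS_DIR` and otherwise falls back to the system temp directory. The fallback folder name is taken from the example path above and should be treated as an assumption, as should the helper names.

```python
import json
import os
import tempfile
from pathlib import Path


def resolve_runs_dir() -> Path:
    """Use CODEX_REPAIR_RUNS_DIR when set; otherwise write under the temp directory."""
    configured = os.environ.get("CODEX_REPAIR_RUNS_DIR")
    if configured:
        return Path(configured)
    # Folder name inferred from the example path; the notebook's default may differ.
    return Path(tempfile.gettempdir()) / "codex_iterative_repair_loop_outputs"


def write_record(runs_dir: Path, iteration: int, sample_name: str, record: dict) -> Path:
    """Write one loop attempt to iteration_N/<sample_name>/record.json."""
    record_dir = runs_dir / f"iteration_{iteration}" / sample_name
    record_dir.mkdir(parents=True, exist_ok=True)
    record_path = record_dir / "record.json"
    record_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record_path
```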

Run iteration 1

Each notebook case is independent, so we process the cases concurrently. This keeps the demo fast while preserving the same review, repair, and validation flow for every sample.

Iteration 1 reuses the initial review findings from the earlier review cell. After this pass, inspect the pass/fail boolean returned for each case: passing cases can stop, and failing cases carry their validation feedback into the next pass.
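One way to run independent cases concurrently is a thread pool, which fits here because each case would mostly wait on a subprocess. This is a sketch under assumptions: `run_case` is a hypothetical stand-in that simulates a result instead of calling Codex, and the notebook may use a different concurrency mechanism.

```python
from concurrent.futures import ThreadPoolExecutor


def run_case(case: dict) -> dict:
    """Hypothetical stand-in for review -> repair -> validate on one sample."""
    delta = case.get("simulated_delta", [])
    return {"name": case["name"], "passed": not delta, "remaining_delta": delta}


def run_iteration(cases: list[dict], max_workers: int = 3) -> tuple[list[dict], list[dict]]:
    """Run independent cases concurrently and split the results into passed and failed."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_case, cases))  # map preserves input order
    passed = [result for result in results if result["passed"]]
    failed = [result for result in results if not result["passed"]]
    return passed, failed
```

Only the `failed` list needs to go into the next iteration, each entry carrying its own `remaining_delta`.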

Run iteration 2

Iteration 2 is where the loop starts to pay off. Codex is no longer working only from the original review; it also sees what happened during validation.

That changes the task. Instead of asking for a broad rewrite, we ask for the next useful repair based on evidence from the last run: what executed, what passed, and what still needs attention.

For the included staged fixtures, this pass is designed to clear the medium-depth Evals case while the deeper Knowledge Retrieval case continues with a smaller, more specific delta.

Run iteration 3

Iteration 3 focuses on the deepest documentation case.

The Knowledge Retrieval fixture has to modernize the API shape, stay runnable with local data, and preserve the retrieval teaching flow. Those requirements can pull against each other: a repair that makes the notebook modern might accidentally make it less runnable, while a repair that keeps it local might remove too much of the original lesson.

The third pass gives Codex the latest notebook plus the final validation delta. This is the part of the demo that shows why iteration matters: the agent responds to the specific issue that remained, rather than trying to anticipate everything up front.

Summarize improvement

Now we can look at the whole run instead of opening every intermediate artifact by hand. The summary below shows the signal that matters most: which artifacts passed, how many validation findings remained, and whether any delta carried forward.

For the included fixtures, the intended shape is simple: one notebook clears in iteration 1, another clears in iteration 2, and the deepest one clears in iteration 3. In a real maintenance workflow, this table tells you whether the loop is converging, needs a clearer constraint, or should go to human review.

This summary is also useful for human review. A maintainer can start with the pass/fail pattern, open records for anything that still has a delta, and inspect only the repaired artifacts that are ready for review.
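A summary like this can be computed directly from the saved records. The sketch below assumes each record has been tagged with an `iteration` number when it was loaded, which the sample record.json does not include; `summarize_run` and the output column names are illustrative.

```python
def summarize_run(records: list[dict]) -> list[dict]:
    """Aggregate per-iteration pass counts and remaining validation findings."""
    by_iteration: dict[int, dict] = {}
    for record in records:
        row = by_iteration.setdefault(
            record["iteration"],
            {"iteration": record["iteration"], "passed": 0, "failed": 0, "remaining_findings": 0},
        )
        validation = record["validation"]
        row["passed" if validation["passed"] else "failed"] += 1
        row["remaining_findings"] += len(validation["remaining_delta"])
    # One row per iteration, in order, so a shrinking delta is easy to spot.
    return [by_iteration[key] for key in sorted(by_iteration)]
```

If `remaining_findings` does not fall from one row to the next, that is a cue to tighten the constraints or hand the case to a person.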

What the summary tells us

The important signal is not that Codex made edits. The important signal is that the remaining validation delta gets smaller as the loop runs.

Pass	Signal to look for	Why it matters
Iteration 1	The simplest fixture passes; deeper fixtures keep a small delta.	The loop can make an initial repair while carrying forward the cases that still need evidence.
Iteration 2	The medium-depth fixture clears after seeing validation feedback.	Runtime and judge feedback become useful repair instructions.
Iteration 3	The deepest fixture clears or leaves a focused final delta.	The loop converges, or it produces a clear handoff for a human reviewer.

The record.json files are where this becomes auditable. A useful record answers four questions: what did the review find, what did Codex change, did the notebook execute, and what remains? That is the difference between an impressive-looking edit and a repair workflow a maintainer can trust.

Generalize to a continuous loop

The fixed three-pass run above is useful for teaching the pattern. A production loop should decide when to stop on its own.

A good loop usually stops for one of four reasons: validation passes, the loop reaches a maximum number of attempts, the remaining delta stops changing, or the next decision needs human review. Those stop conditions are just as important as the repair prompt.

The other production detail is the audit trail. Keep the review findings, repaired artifact, validation result, validation judgment, and remaining delta for every pass. That record lets a maintainer understand why the loop continued, why it stopped, and which artifact is ready for review.
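The four stop conditions can be sketched as a small driver loop. Everything here is illustrative: `run_pass` is whatever callable performs one review, repair, and validation pass and returns the repaired artifact plus its remaining delta, and `needs_human` is a hypothetical hook for escalation rules.

```python
from typing import Callable


def run_until_done(
    artifact: str,
    run_pass: Callable[[str, list[str]], tuple[str, list[str]]],
    max_attempts: int = 3,
    needs_human: Callable[[list[str]], bool] = lambda delta: False,
) -> dict:
    """Repeat passes until one of the four stop conditions is met."""
    feedback: list[str] = []
    history: list[dict] = []

    def stop(reason: str) -> dict:
        return {"stop_reason": reason, "artifact": artifact, "history": history}

    for attempt in range(1, max_attempts + 1):
        artifact, delta = run_pass(artifact, feedback)
        history.append({"attempt": attempt, "remaining_delta": delta})  # audit trail
        if not delta:
            return stop("validation_passed")
        if needs_human(delta):
            return stop("needs_human_review")
        if delta == feedback:
            return stop("delta_unchanged")  # the last repair changed nothing measurable
        feedback = delta
    return stop("max_attempts")
```

Returning the stop reason together with the history lets a maintainer see why the loop ended, not only where it ended.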

Where else this applies

The notebook walkthrough is just one way to teach the architecture. The same pattern helps whenever an agent changes a file or process that needs more than subjective review before it is accepted.

A few high-value examples:

Protocol optimization: Draft an update for expert review, then validate it against dosing rules, timing constraints, or required safety checks.
Regulatory remediation: Draft updates to regulated content, then check that required language, citations, approvals, and jurisdiction-specific terms remain intact.
Support knowledge refresh: Update an article, test it against current product behavior or known resolutions, and carry mismatches into the next pass.
Code modernization: Replace deprecated APIs, run tests or static checks, and use remaining failures to guide the next repair.

The common thread is that the change matters, and each pass needs evidence. Whether the target is a notebook, a policy, a protocol, a support article, a pipeline, or a codebase, the loop gives the agent a way to improve it with evidence a maintainer can review.

Conclusion

Iterative repair loops make agentic maintenance easier to review and operate because they separate judgment from proof.

Review finds candidate issues. Repair makes focused edits. Validation executes the artifact and produces the next delta. When those phases exchange structured outputs, the workflow becomes easier to inspect, repeat, and adapt.

The main idea is simple: instead of relying on a single pass, give the workflow a way to learn from the artifact, make a bounded repair, and react to real validation feedback. That small change makes agentic maintenance much more practical.
