What are the three main types of prompt defects this tool detects and fixes?

The tool detects and repairs contradictions inside the developer prompt's own instructions, missing or unclear structured-output format specifications, and inconsistencies between the prompt's rules and its few-shot example conversations.

How many specialized agents does this workflow use, and what do they do?

The workflow uses five specialized agents: Dev-Contradiction-Checker finds self-contradictions, Format-Checker validates output format specifications, Few-Shot-Consistency-Checker ensures examples comply with the prompt's rules, Dev-Rewriter fixes contradictions and adds format sections, and Few-Shot-Rewriter regenerates non-compliant assistant examples.

What model and SDK does this tool use?

The tool is built on the OpenAI Python package and the openai-agents SDK, using gpt-4.1 as the model for each checker and rewriter agent, authenticated via an OPENAI_API_KEY environment variable.

What does this tool NOT optimize for?

This tool does not optimize for creative quality, tone, or persuasiveness—it explicitly treats subjective qualities as out of scope unless the developer prompt states them as hard requirements.

Prompt Chain

Refine Prompts for Reliable AI Outputs

Name: Optimize Prompts
Availability: OnlineOnly
Author: OpenAI Cookbook

Automatically detect and fix prompt contradictions, format gaps, and few-shot mismatches with a multi-agent system.

Works with openai

OpenAI Cookbook

Updated 14 days ago

Version 1.0.0

Models

gpt 4o

Why it matters

Enhance your AI prompts by automatically identifying and resolving contradictions, unclear formatting, and inconsistencies with few-shot examples. Ensure your prompts are precise and effective for better AI performance.

Outcomes

What it gets done

Detect and correct logical contradictions within prompt instructions.

Clarify and enforce specific output format requirements.

Ensure consistency between prompt rules and provided examples.

Rewrite prompts to maintain original intent while improving clarity.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-optimizeprompts | bash

Steps

Steps in the chain

Step 1: System Overview

Step 2: Data Models

Step 3: Defining the Agents

Step 4: Using Evaluations to Arrive at These Agents

Step 5: Run Optimization Workflow

Step 6: Examples

Overview

Optimize Prompts

A multi-agent prompt optimization system built with the OpenAI Agents SDK and Evals, detecting and fixing contradictions, format gaps, and few-shot inconsistencies in a developer prompt. Use it on a developer prompt that may have subtle contradictions, missing format specs, or examples that drift from stated rules. Also available directly in OpenAI's playground.

What it does

This cookbook builds an early version of OpenAI's prompt optimization system using the Agents SDK and Evals: a multi-agent pipeline that detects and fixes contradictions, unclear format specifications, and inconsistencies between a prompt and its few-shot examples. Five specialized agents split the work - Dev-Contradiction-Checker finds logical contradictions (like requiring both positive-only and negative examples), Format-Checker flags structured-output expectations that lack explicit format rules, Few-Shot-Consistency-Checker verifies example conversations actually follow the prompt's stated rules, Dev-Rewriter fixes the prompt while preserving intent, and Few-Shot-Rewriter updates examples to match the corrected rules. Pydantic models (Role, ChatMessage, Issues, FewShotIssues, MessagesOutput, DevRewriteOutput) enforce structured input/output at every step.

When to use - and when NOT to

Use this system when you have a developer prompt (with or without few-shot examples) that may contain subtle contradictions, missing format specs, or examples that drift from the stated rules - the kind of issues experienced prompt writers still introduce accidentally. You can run the code directly or use the same optimizer built into OpenAI's playground rather than the notebook version.

Inputs and outputs

The optimize_prompt_parallel workflow first runs all three checker agents simultaneously, then runs each rewriter only if its corresponding issues were found. Dev-Rewriter fires only on contradiction or format issues; Few-Shot-Rewriter fires only on example inconsistencies, and only after the prompt-rewriting step has finished. Agent instructions were tuned against 20 hand-labeled golden examples covering contradiction, few-shot, and format issues (plus clean cases), scored with a Python string-check grader for both issue detection and correct issue matching. This evaluation selected the 4.1 model as the best balance of accuracy, cost, and speed, and the tuned agents produced the correct labels on all 20 golden cases.

Integrations

The agent-instruction design follows five practices worth reusing elsewhere: narrow, explicit scope per agent (the contradiction checker explicitly states that overlaps and redundancies do not count as contradictions); a step-by-step methodology per agent; precisely defined key terms (a documented compliance rubric for the few-shot checker); explicit "out-of-scope" boundaries to prevent over-flagging; and a strictly defined structured output format with examples for every agent.

Who it's for

Developers and prompt engineers who want an automated, eval-validated way to catch and fix prompt contradictions, format ambiguity, and few-shot drift before shipping a prompt to production. The cookbook walks through three concrete before-and-after cases - resolving a direct contradiction, correcting few-shot examples that quietly violate the stated rules (which matters because models tend to imitate examples over instructions when the two conflict), and tightening an underspecified Markdown output format in a longer prompt.

Source README

Optimize Prompts

Crafting effective prompts is a critical skill when working with AI models. Even experienced users can inadvertently introduce contradictions, ambiguities, or inconsistencies that lead to suboptimal results. The system demonstrated here helps identify and fix common issues, resulting in more reliable and effective prompts.

The optimization process uses a multi-agent approach with specialized AI agents collaborating to analyze and rewrite prompts. The system automatically identifies and addresses several types of common issues:

Contradictions in the prompt instructions
Missing or unclear format specifications
Inconsistencies between the prompt and few-shot examples

Objective: This cookbook demonstrates best practices for using the Agents SDK together with Evals to build an early version of OpenAI's prompt optimization system. You can optimize your prompt with this code or use the optimizer in our playground.

Cookbook Structure
This notebook follows this structure:

Step 1. System Overview - Learn how the prompt optimization system works
Step 2. Data Models - Understand the data structures used by the system
Step 3. Defining the Agents - Look at agents that analyze and improve prompts
Step 4. Evaluations - Use Evals to verify our agent model choice and instructions
Step 5. Run Optimization Workflow - See how the workflow hands off the prompts
Step 6. Examples - Explore real-world examples of prompt optimization

Prerequisites

The openai Python package
The openai-agents package
An OpenAI API key set as OPENAI_API_KEY in your environment variables
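
Before running anything, it can help to fail fast when the key is missing. The following is a minimal sketch, not part of the cookbook; the helper name require_api_key is hypothetical, and only the OPENAI_API_KEY variable name comes from the prerequisites above.

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from the given environment mapping, or raise a clear error.

    Hypothetical helper: the cookbook only states that OPENAI_API_KEY must be set.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the workflow."
        )
    return key
```

Passing the environment as a parameter keeps the check easy to exercise with a plain dict instead of the real process environment.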

1. System Overview

The prompt optimization system uses a collaborative multi-agent approach to analyze and improve prompts. Each agent specializes in either detecting or rewriting a specific type of issue:

Dev-Contradiction-Checker: Scans the prompt for logical contradictions or impossible instructions, like "only use positive numbers" and "include negative examples" in the same prompt.
Format-Checker: Identifies when a prompt expects structured output (like JSON, CSV, or Markdown) but fails to clearly specify the exact format requirements. This agent ensures that all necessary fields, data types, and formatting rules are explicitly defined.
Few-Shot-Consistency-Checker: Examines example conversations to ensure that the assistant's responses actually follow the rules specified in the prompt. This catches mismatches between what the prompt requires and what the examples demonstrate.
Dev-Rewriter: After issues are identified, this agent rewrites the prompt to resolve contradictions and clarify format specifications while preserving the original intent.
Few-Shot-Rewriter: Updates inconsistent example responses to align with the rules in the prompt, ensuring all examples properly comply with the new developer prompt.

By working together, these agents can systematically identify and fix issues in prompts.

2. Data Models

To facilitate structured communication between agents, the system uses Pydantic models to define the expected format for inputs and outputs. These Pydantic models help validate data and ensure consistency throughout the workflow.

The data models include:

Role - An enumeration for message roles (user/assistant)
ChatMessage - Represents a single message in a conversation
Issues - Base model for reporting detected issues
FewShotIssues - Extended model that adds rewrite suggestions for example messages
MessagesOutput - Contains optimized conversation messages
DevRewriteOutput - Contains the improved developer prompt

Using Pydantic allows the system to validate that all data conforms to the expected format at each step of the process.
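
The shape of these six models can be sketched as follows. This is an illustration only: it uses standard-library dataclasses as a stand-in so it runs without third-party packages, whereas the cookbook defines them as Pydantic models. The class names come from the list above; every field name (has_issues, issues, rewrite_suggestions, messages, new_developer_message) is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Role(str, Enum):
    """Message roles allowed in few-shot conversations."""
    user = "user"
    assistant = "assistant"


@dataclass
class ChatMessage:
    """A single message in a conversation."""
    role: Role
    content: str

    def __post_init__(self) -> None:
        # Coerce plain strings such as "user" into the enum.
        # An unknown role raises ValueError, mimicking model validation.
        self.role = Role(self.role)


@dataclass
class Issues:
    """Base report returned by a checker (field names are assumed)."""
    has_issues: bool
    issues: List[str] = field(default_factory=list)


@dataclass
class FewShotIssues(Issues):
    """Adds rewrite suggestions for example messages (field name assumed)."""
    rewrite_suggestions: List[str] = field(default_factory=list)


@dataclass
class MessagesOutput:
    """Optimized conversation messages."""
    messages: List[ChatMessage]


@dataclass
class DevRewriteOutput:
    """The improved developer prompt."""
    new_developer_message: str
```

For example, ChatMessage(role="assistant", content="yes") is accepted, while ChatMessage(role="system", content="hi") raises ValueError because the enumeration only covers user and assistant.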

3. Defining the Agents

In this section, we create specialized AI agents using the Agent class from the openai-agents package. Looking at these agent definitions reveals several best practices for creating effective AI instructions:

Best Practices in Agent Instructions

Clear Scope Definition: Each agent has a narrowly defined purpose with explicit boundaries. For example, the contradiction checker focuses only on "genuine self-contradictions" and explicitly states that "overlaps or redundancies are not contradictions."
Step-by-Step Process: Instructions provide a clear methodology, like how the format checker first categorizes the task before analyzing format requirements.
Explicit Definitions: Key terms are defined precisely to avoid ambiguity. The few-shot consistency checker includes a detailed "Compliance Rubric" explaining exactly what constitutes compliance.
Boundary Setting: Instructions specify what the agent should NOT do. The few-shot checker explicitly lists what's "Out-of-scope" to prevent over-flagging issues.
Structured Output Requirements: Each agent has a strictly defined output format with examples, ensuring consistency in the optimization pipeline.

These principles create reliable, focused agents that work effectively together in the optimization system. Below we see the complete agent definitions with their detailed instructions.
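
To make the five practices concrete, here is a small sketch that assembles an instruction string with one section per practice. The build_instructions helper, the section labels, and the sample steps and JSON output are hypothetical and are not the cookbook's actual agent prompts; only the two quoted phrases in the scope text come from the description above.

```python
from typing import Dict, Sequence


def build_instructions(
    scope: str,
    steps: Sequence[str],
    definitions: Dict[str, str],
    out_of_scope: Sequence[str],
    output_example: str,
) -> str:
    """Assemble agent instructions with one section per practice (hypothetical layout)."""
    lines = ["Scope:", scope, "", "Process:"]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    lines += ["", "Definitions:"]
    lines += [f"- {term}: {meaning}" for term, meaning in definitions.items()]
    lines += ["", "Out-of-scope (do NOT flag):"]
    lines += [f"- {item}" for item in out_of_scope]
    lines += ["", "Output format (JSON only):", output_example]
    return "\n".join(lines)


# Illustrative instructions for a contradiction checker (wording is assumed).
contradiction_instructions = build_instructions(
    scope='Detect only "genuine self-contradictions" in the developer prompt.',
    steps=[
        "Read every instruction in the prompt.",
        "Compare instructions pairwise for logical conflict.",
        "Report each conflict in one short sentence.",
    ],
    definitions={
        "contradiction": "two instructions that cannot both be satisfied",
    },
    out_of_scope=["overlaps or redundancies are not contradictions"],
    output_example='{"has_issues": true, "issues": ["<one sentence per conflict>"]}',
)
```

Keeping the sections in a fixed order makes it easier to compare instructions across agents and to tune one section at a time during evaluation.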

4. Using Evaluations to Arrive at These Agents

Let's see how we used OpenAI Evals to tune the agent instructions and pick the model. To do so, we constructed a set of golden examples: each one contains the original messages (a developer message plus user/assistant messages) and the changes our optimization workflow should make. Here are two examples of the golden pairs we used:

[
  {
    "focus": "contradiction_issues",
    "input_payload": {
      "developer_message": "Always answer in **English**.\nNunca respondas en inglés.",
      "messages": [
        {
          "role": "user",
          "content": "¿Qué hora es?"
        }
      ]
    },
    "golden_output": {
      "changes": true,
      "new_developer_message": "Always answer **in English**.",
      "new_messages": [
        {
          "role": "user",
          "content": "¿Qué hora es?"
        }
      ],
      "contradiction_issues": "Developer message simultaneously insists on English and forbids it.",
      "few_shot_contradiction_issues": "",
      "format_issues": "",
      "general_improvements": ""
    }
  },
  {
    "focus": "few_shot_contradiction_issues",
    "input_payload": {
      "developer_message": "Respond with **only 'yes' or 'no'** – no explanations.",
      "messages": [
        {
          "role": "user",
          "content": "Is the sky blue?"
        },
        {
          "role": "assistant",
          "content": "Yes, because wavelengths …"
        },
        {
          "role": "user",
          "content": "Is water wet?"
        },
        {
          "role": "assistant",
          "content": "Yes."
        }
      ]
    },
    "golden_output": {
      "changes": true,
      "new_developer_message": "Respond with **only** the single word \"yes\" or \"no\".",
      "new_messages": [
        {
          "role": "user",
          "content": "Is the sky blue?"
        },
        {
          "role": "assistant",
          "content": "yes"
        },
        {
          "role": "user",
          "content": "Is water wet?"
        },
        {
          "role": "assistant",
          "content": "yes"
        }
      ],
      "contradiction_issues": "",
      "few_shot_contradiction_issues": "Assistant examples include explanations despite instruction not to.",
      "format_issues": "",
      "general_improvements": ""
    }
  }
]

From these 20 hand-labeled golden outputs, which cover contradiction issues, few-shot issues, format issues, no issues, or a combination of issues, we built a Python string-check grader to verify two things: whether an issue was detected for each golden pair, and whether the detected issue matched the expected one. Using this signal, we tuned the agent instructions and the model choice to maximize accuracy on this evaluation. We landed on the 4.1 model as a balance between accuracy, cost, and speed. The specific prompts we used also follow the 4.1 prompting guide. With this setup, we achieve the correct labels on all 20 golden outputs: identifying the right issues and avoiding false positives.
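
The two checks can be sketched as a small string-based grader. This is an assumption-laden illustration, not the cookbook's grader: it treats a non-empty string in any of the three issue fields of the golden JSON as "issue flagged" and compares issue categories rather than issue wording. The function names are hypothetical.

```python
from typing import Dict, Set

# Issue fields as they appear in the golden examples.
ISSUE_FIELDS = (
    "contradiction_issues",
    "few_shot_contradiction_issues",
    "format_issues",
)


def flagged_categories(output: Dict[str, object]) -> Set[str]:
    """Return the issue fields that contain a non-empty string."""
    return {f for f in ISSUE_FIELDS if str(output.get(f, "")).strip()}


def grade(golden_output: Dict[str, object], predicted: Dict[str, object]) -> Dict[str, bool]:
    """Score one golden pair on detection and on category match."""
    expected = flagged_categories(golden_output)
    actual = flagged_categories(predicted)
    return {
        # Did the workflow agree on whether any issue exists at all?
        "detection_correct": bool(expected) == bool(actual),
        # Did it flag exactly the expected categories (no misses, no false positives)?
        "issue_match": expected == actual,
    }


golden = {
    "contradiction_issues": "Developer message simultaneously insists on English and forbids it.",
    "few_shot_contradiction_issues": "",
    "format_issues": "",
}
good_prediction = {"contradiction_issues": "English is both required and forbidden."}
wrong_category = {"format_issues": "No output format specified."}
```

With these inputs, grade(golden, good_prediction) passes both checks, while grade(golden, wrong_category) passes detection but fails the category match, which is the kind of false positive the second check is meant to catch.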

5. Run Optimization Workflow

Let's dive into how the optimization system works end to end. The core workflow runs the checker agents in parallel, then runs the rewriter agents only where issues were found.

Understanding the Optimization Workflow

The optimize_prompt_parallel function implements a workflow to maximize efficiency through parallelization:

Parallel Issue Detection: The first phase runs all checker agents simultaneously:
- dev_contradiction_checker searches for logical contradictions in the prompt
- format_checker looks for unclear format specifications
- fewshot_consistency_checker (if examples exist) checks for mismatches between the prompt and examples

After the parallel checking phase, the workflow handles dependencies carefully:

Prompt Rewriting (Conditional): The dev_rewriter agent only runs if contradiction or format issues were detected. This agent depends on the outputs from:
- dev_contradiction_checker (the cd_issues variable)
- format_checker (the fi_issues variable)
Example Rewriting (Conditional): The fewshot_rewriter agent only runs if example inconsistencies were detected. This agent depends on:
- The rewritten prompt (must be done after prompt rewriting)
- The original messages
- The few-shot issues (the fs_issues variable)

6. Examples

Let's see the optimization system in action with some practical examples.

Example 1: Fixing Contradictions

This demonstrates how the system can detect and resolve critical contradictions that could lead to inconsistent outputs or confusion for the model.

Example 2: Fixing Inconsistencies Between Prompt and Few-Shot Examples

This is particularly important because few-shot examples have a strong influence on how models respond. If examples don't follow the stated rules, the model may learn to ignore those rules in favor of mimicking the examples. By ensuring consistency between the prompt instructions and the examples, the optimization system creates a more reliable prompt.
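
As a rough illustration of this kind of mismatch, the sketch below checks the yes/no rule from the second golden example with a deterministic function. It is only a stand-in: the cookbook's Few-Shot-Consistency-Checker is an LLM agent guided by a compliance rubric, not a string match. The function name, and the choice to tolerate case and a trailing period, are assumptions.

```python
from typing import Dict, List, Sequence


def non_compliant_examples(
    messages: List[Dict[str, str]],
    allowed: Sequence[str] = ("yes", "no"),
) -> List[int]:
    """Return indices of assistant messages that are not exactly one allowed word.

    Assumption: case and trailing '.' or '!' are ignored, so "Yes." counts as compliant.
    """
    bad = []
    for index, message in enumerate(messages):
        if message["role"] != "assistant":
            continue
        normalized = message["content"].strip().rstrip(".!").lower()
        if normalized not in allowed:
            bad.append(index)
    return bad


examples = [
    {"role": "user", "content": "Is the sky blue?"},
    {"role": "assistant", "content": "Yes, because wavelengths …"},
    {"role": "user", "content": "Is water wet?"},
    {"role": "assistant", "content": "Yes."},
]

violations = non_compliant_examples(examples)  # [1]: only the explained answer is flagged
```

A rule this rigid is easy to check mechanically; most prompt rules are not, which is why the workflow delegates the judgment to an agent with an explicit rubric and out-of-scope list.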

Example 3: Clarifying Formats in a Longer Prompt

This example highlights how the format checker identifies and resolves ambiguous format specifications. The prompt requested a Markdown output and the optimization flow significantly improved these format specifications.
