Optimize Prompts
Crafting effective prompts is a critical skill when working with AI models. Even experienced users can inadvertently introduce contradictions, ambiguities, or inconsistencies that lead to suboptimal results. The system demonstrated here helps identify and fix common issues, resulting in more reliable and effective prompts.
The optimization process uses a multi-agent approach with specialized AI agents collaborating to analyze and rewrite prompts. The system automatically identifies and addresses several types of common issues:
- Contradictions in the prompt instructions
- Missing or unclear format specifications
- Inconsistencies between the prompt and few-shot examples
Objective: This cookbook demonstrates best practices for using the Agents SDK together with Evals to build an early version of OpenAI's prompt optimization system. You can optimize your prompt using this code or use the optimizer in our playground!
Cookbook Structure
This notebook follows this structure:
- Step 1. System Overview - Learn how the prompt optimization system works
- Step 2. Data Models - Understand the data structures used by the system
- Step 3. Defining the Agents - Look at agents that analyze and improve prompts
- Step 4. Evaluations - Use Evals to verify our agent model choice and instructions
- Step 5. Run Optimization Workflow - See how the workflow hands off the prompts
- Step 6. Examples - Explore real-world examples of prompt optimization
Prerequisites
- The openai Python package
- The openai-agents package
- An OpenAI API key set as OPENAI_API_KEY in your environment variables
1. System Overview
The prompt optimization system uses a collaborative multi-agent approach to analyze and improve prompts. Each agent specializes in either detecting or rewriting a specific type of issue:
Dev-Contradiction-Checker: Scans the prompt for logical contradictions or impossible instructions, like "only use positive numbers" and "include negative examples" in the same prompt.
Format-Checker: Identifies when a prompt expects structured output (like JSON, CSV, or Markdown) but fails to clearly specify the exact format requirements. This agent ensures that all necessary fields, data types, and formatting rules are explicitly defined.
Few-Shot-Consistency-Checker: Examines example conversations to ensure that the assistant's responses actually follow the rules specified in the prompt. This catches mismatches between what the prompt requires and what the examples demonstrate.
Dev-Rewriter: After issues are identified, this agent rewrites the prompt to resolve contradictions and clarify format specifications while preserving the original intent.
Few-Shot-Rewriter: Updates inconsistent example responses to align with the rules in the prompt, ensuring all examples properly comply with the new developer prompt.
By working together, these agents can systematically identify and fix issues in prompts.
2. Data Models
To facilitate structured communication between agents, the system uses Pydantic models to define the expected format for inputs and outputs. These Pydantic models help validate data and ensure consistency throughout the workflow.
The data models include:
- Role - An enumeration for message roles (user/assistant)
- ChatMessage - Represents a single message in a conversation
- Issues - Base model for reporting detected issues
- FewShotIssues - Extended model that adds rewrite suggestions for example messages
- MessagesOutput - Contains optimized conversation messages
- DevRewriteOutput - Contains the improved developer prompt
Using Pydantic allows the system to validate that all data conforms to the expected format at each step of the process.
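The models listed above could be declared roughly as follows. This is a minimal sketch: the class names come from the list, but the field names (has_issues, issues, rewrite_suggestions, messages) are assumptions made for illustration, and new_developer_message is borrowed from the golden examples later in this notebook.

```python
from enum import Enum
from typing import List

from pydantic import BaseModel, Field, ValidationError


class Role(str, Enum):
    """Message roles allowed in a few-shot conversation."""
    user = "user"
    assistant = "assistant"


class ChatMessage(BaseModel):
    """A single message in a conversation."""
    role: Role
    content: str


class Issues(BaseModel):
    """Base report returned by a checker (field names are assumed)."""
    has_issues: bool
    issues: List[str] = Field(default_factory=list)


class FewShotIssues(Issues):
    """Adds rewrite suggestions for the example messages."""
    rewrite_suggestions: List[str] = Field(default_factory=list)


class MessagesOutput(BaseModel):
    """Optimized conversation messages."""
    messages: List[ChatMessage]


class DevRewriteOutput(BaseModel):
    """The improved developer prompt."""
    new_developer_message: str


# Validation happens at construction time: an unknown role is rejected.
try:
    ChatMessage(role="system", content="hello")
except ValidationError:
    print("rejected: role must be 'user' or 'assistant'")
```

Because FewShotIssues inherits from Issues, code that only needs has_issues and issues can treat both report types the same way.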
3. Defining the Agents
In this section, we create specialized AI agents using the Agent class from the openai-agents package. Looking at these agent definitions reveals several best practices for creating effective AI instructions:
Best Practices in Agent Instructions
Clear Scope Definition: Each agent has a narrowly defined purpose with explicit boundaries. For example, the contradiction checker focuses only on "genuine self-contradictions" and explicitly states that "overlaps or redundancies are not contradictions."
Step-by-Step Process: Instructions provide a clear methodology, like how the format checker first categorizes the task before analyzing format requirements.
Explicit Definitions: Key terms are defined precisely to avoid ambiguity. The few-shot consistency checker includes a detailed "Compliance Rubric" explaining exactly what constitutes compliance.
Boundary Setting: Instructions specify what the agent should NOT do. The few-shot checker explicitly lists what's "Out-of-scope" to prevent over-flagging issues.
Structured Output Requirements: Each agent has a strictly defined output format with examples, ensuring consistency in the optimization pipeline.
These principles create reliable, focused agents that work effectively together in the optimization system. Below we see the complete agent definitions with their detailed instructions.
4. Using Evaluations to Arrive at These Agents
Let's see how we used OpenAI Evals to tune agent instructions and pick the correct model to use. To do so, we constructed a set of golden examples: each one contains the original messages (developer message + user/assistant messages) and the changes our optimization workflow should make. Here are two examples of golden pairs that we used:
[
  {
    "focus": "contradiction_issues",
    "input_payload": {
      "developer_message": "Always answer in **English**.\nNunca respondas en inglés.",
      "messages": [
        {
          "role": "user",
          "content": "¿Qué hora es?"
        }
      ]
    },
    "golden_output": {
      "changes": true,
      "new_developer_message": "Always answer **in English**.",
      "new_messages": [
        {
          "role": "user",
          "content": "¿Qué hora es?"
        }
      ],
      "contradiction_issues": "Developer message simultaneously insists on English and forbids it.",
      "few_shot_contradiction_issues": "",
      "format_issues": "",
      "general_improvements": ""
    }
  },
  {
    "focus": "few_shot_contradiction_issues",
    "input_payload": {
      "developer_message": "Respond with **only 'yes' or 'no'** – no explanations.",
      "messages": [
        {
          "role": "user",
          "content": "Is the sky blue?"
        },
        {
          "role": "assistant",
          "content": "Yes, because wavelengths …"
        },
        {
          "role": "user",
          "content": "Is water wet?"
        },
        {
          "role": "assistant",
          "content": "Yes."
        }
      ]
    },
    "golden_output": {
      "changes": true,
      "new_developer_message": "Respond with **only** the single word \"yes\" or \"no\".",
      "new_messages": [
        {
          "role": "user",
          "content": "Is the sky blue?"
        },
        {
          "role": "assistant",
          "content": "yes"
        },
        {
          "role": "user",
          "content": "Is water wet?"
        },
        {
          "role": "assistant",
          "content": "yes"
        }
      ],
      "contradiction_issues": "",
      "few_shot_contradiction_issues": "Assistant examples include explanations despite instruction not to.",
      "format_issues": "",
      "general_improvements": ""
    }
  }
]
From these 20 hand-labeled golden outputs, which cover contradiction issues, few-shot issues, format issues, no issues, or a combination of issues, we built a Python string-check grader to verify two things: whether an issue was detected for each golden pair and whether the detected issue matched the expected one. From this signal, we tuned the agent instructions and the choice of model to maximize accuracy across this evaluation. We landed on the 4.1 model as a balance between accuracy, cost, and speed. The specific prompts we used also follow the 4.1 prompting guide. With this setup, we achieve the correct labels on all 20 golden outputs: identifying the right issues and avoiding false positives.
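The two checks the grader performs can be approximated in plain Python. The sketch below is an assumption about how such a check might look, not the Evals grader itself: it treats any non-empty issue field as "flagged", ignores general_improvements, and uses the hypothetical helper names flagged_categories and grade.

```python
from typing import Dict, Set

# Issue fields taken from the golden examples above.
ISSUE_FIELDS = (
    "contradiction_issues",
    "few_shot_contradiction_issues",
    "format_issues",
)


def flagged_categories(output: Dict[str, object]) -> Set[str]:
    """Return the issue fields that hold a non-empty string."""
    return {name for name in ISSUE_FIELDS if str(output.get(name, "")).strip()}


def grade(golden: Dict[str, object], predicted: Dict[str, object]) -> Dict[str, bool]:
    """Compare a workflow output against its golden output."""
    expected = flagged_categories(golden)
    actual = flagged_categories(predicted)
    return {
        # Check 1: did the workflow agree on whether any issue exists?
        "detection_correct": bool(expected) == bool(actual),
        # Check 2: did it flag exactly the expected issue categories?
        "category_correct": expected == actual,
    }


golden = {
    "contradiction_issues": "Developer message simultaneously insists on English and forbids it.",
    "few_shot_contradiction_issues": "",
    "format_issues": "",
}
predicted = {
    "contradiction_issues": "Prompt demands English and also forbids it.",
    "few_shot_contradiction_issues": "",
    "format_issues": "",
}
print(grade(golden, predicted))
```

Comparing categories rather than exact wording keeps the check robust to paraphrased issue descriptions, at the cost of not verifying the description text itself.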


5. Run Optimization Workflow
Let's dive into how the optimization system actually works end to end. The core workflow consists of multiple runs of the agents in parallel to efficiently process and optimize prompts.

Understanding the Optimization Workflow
The optimize_prompt_parallel function implements a workflow to maximize efficiency through parallelization:
- Parallel Issue Detection: The first phase runs all checker agents simultaneously:
  - dev_contradiction_checker searches for logical contradictions in the prompt
  - format_checker looks for unclear format specifications
  - fewshot_consistency_checker (if examples exist) checks for mismatches between the prompt and examples
After the parallel checking phase, the workflow handles dependencies carefully:
- Prompt Rewriting (Conditional): The dev_rewriter agent only runs if contradiction or format issues were detected. This agent depends on the outputs from:
  - dev_contradiction_checker (the cd_issues variable)
  - format_checker (the fi_issues variable)
- Example Rewriting (Conditional): The fewshot_rewriter agent only runs if example inconsistencies were detected. This agent depends on:
  - The rewritten prompt (must be done after prompt rewriting)
  - The original messages
  - The few-shot issues (the fs_issues variable)
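The control flow described above can be sketched with asyncio. To keep the example runnable without API access, the checker and rewriter functions here are hypothetical stand-ins that apply toy rules instead of running agents. Only optimize_prompt_parallel and the cd_issues, fi_issues, and fs_issues names come from the text.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Message = Dict[str, str]


@dataclass
class Issues:
    has_issues: bool = False
    issues: List[str] = field(default_factory=list)


# Hypothetical stand-ins: a real version would await an agent run here.
async def check_contradictions(prompt: str) -> Issues:
    text = prompt.lower()
    found = "always" in text and "never" in text
    return Issues(found, ["'always' and 'never' both appear"] if found else [])


async def check_format(prompt: str) -> Issues:
    text = prompt.lower()
    found = "json" in text and "schema" not in text
    return Issues(found, ["JSON requested without a schema"] if found else [])


async def check_fewshot(prompt: str, messages: List[Message]) -> Issues:
    return Issues()  # placeholder: never reports an inconsistency


async def rewrite_prompt(prompt: str, cd_issues: Issues, fi_issues: Issues) -> str:
    notes = "; ".join(cd_issues.issues + fi_issues.issues)
    return f"{prompt}\n[rewrite needed: {notes}]"


async def rewrite_examples(
    prompt: str, messages: List[Message], fs_issues: Issues
) -> List[Message]:
    return messages  # placeholder: returns the examples unchanged


async def optimize_prompt_parallel(
    prompt: str, messages: Optional[List[Message]] = None
) -> dict:
    messages = messages or []

    # Phase 1: run every applicable checker concurrently.
    checks = [check_contradictions(prompt), check_format(prompt)]
    if messages:
        checks.append(check_fewshot(prompt, messages))
    results = await asyncio.gather(*checks)
    cd_issues, fi_issues = results[0], results[1]
    fs_issues = results[2] if messages else Issues()

    # Phase 2: rewrite the prompt only if a checker flagged it.
    new_prompt = prompt
    if cd_issues.has_issues or fi_issues.has_issues:
        new_prompt = await rewrite_prompt(prompt, cd_issues, fi_issues)

    # Phase 3: rewrite examples only if flagged, against the rewritten prompt.
    new_messages = messages
    if fs_issues.has_issues:
        new_messages = await rewrite_examples(new_prompt, messages, fs_issues)

    return {
        "changes": new_prompt != prompt or new_messages != messages,
        "new_developer_message": new_prompt,
        "new_messages": new_messages,
    }


result = asyncio.run(optimize_prompt_parallel("Always answer in JSON. Never answer."))
print(result["changes"])
```

asyncio.gather returns results in the order the checks were submitted, which is why the sketch can unpack cd_issues and fi_issues by position. The example rewriter runs last because it needs the rewritten prompt.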
6. Examples
Let's see the optimization system in action with some practical examples.
Example 1: Fixing Contradictions
This demonstrates how the system can detect and resolve critical contradictions that could lead to inconsistent outputs or confusion for the model.
Example 2: Fixing Inconsistencies Between Prompt and Few-Shot Examples
This is particularly important because few-shot examples have a strong influence on how models respond. If examples don't follow the stated rules, the model may learn to ignore those rules in favor of mimicking the examples. By ensuring consistency between the prompt instructions and examples, the optimization system creates a more reliable prompt.
Example 3: Clarifying Formats in a Longer Prompt
This example highlights how the format checker identifies and resolves ambiguous format specifications. The prompt requested a Markdown output and the optimization flow significantly improved these format specifications.