Optimize Prompts

Crafting effective prompts is a critical skill when working with AI models. Even experienced users can inadvertently introduce contradictions, ambiguities, or inconsistencies that lead to suboptimal results. The system demonstrated here helps identify and fix common issues, resulting in more reliable and effective prompts.

The optimization process uses a multi-agent approach with specialized AI agents collaborating to analyze and rewrite prompts. The system automatically identifies and addresses several types of common issues:

  • Contradictions in the prompt instructions
  • Missing or unclear format specifications
  • Inconsistencies between the prompt and few-shot examples

Objective: This cookbook demonstrates best practices for using the Agents SDK together with Evals to build an early version of OpenAI's prompt optimization system. You can optimize your prompts with this code or use the optimizer in our playground!

Cookbook Structure
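This notebook follows this structure:

  • System Overview
  • Data Models
  • Defining the Agents
  • Using Evaluations to Arrive at These Agents
  • Run Optimization Workflow
  • Examples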

Prerequisites

  • The openai Python package
  • The openai-agents package
  • An OpenAI API key set as OPENAI_API_KEY in your environment variables
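
Before running anything, it can help to confirm that the key is visible to the Python process. The following is a minimal sketch; the helper name `api_key_is_set` is hypothetical, and only the variable name OPENAI_API_KEY comes from the prerequisites above.

```python
import os
from typing import Mapping


def api_key_is_set(env: Mapping[str, str] = os.environ) -> bool:
    """Return True if OPENAI_API_KEY is present and non-empty (hypothetical helper)."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


if __name__ == "__main__":
    if not api_key_is_set():
        print("OPENAI_API_KEY is not set; export it before running the notebook.")
```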

1. System Overview

The prompt optimization system uses a collaborative multi-agent approach to analyze and improve prompts. Each agent specializes in either detecting or rewriting a specific type of issue:

  1. Dev-Contradiction-Checker: Scans the prompt for logical contradictions or impossible instructions, like "only use positive numbers" and "include negative examples" in the same prompt.

  2. Format-Checker: Identifies when a prompt expects structured output (like JSON, CSV, or Markdown) but fails to clearly specify the exact format requirements. This agent ensures that all necessary fields, data types, and formatting rules are explicitly defined.

  3. Few-Shot-Consistency-Checker: Examines example conversations to ensure that the assistant's responses actually follow the rules specified in the prompt. This catches mismatches between what the prompt requires and what the examples demonstrate.

  4. Dev-Rewriter: After issues are identified, this agent rewrites the prompt to resolve contradictions and clarify format specifications while preserving the original intent.

  5. Few-Shot-Rewriter: Updates inconsistent example responses to align with the rules in the prompt, ensuring all examples properly comply with the new developer prompt.

By working together, these agents can systematically identify and fix issues in prompts.

2. Data Models

To facilitate structured communication between agents, the system uses Pydantic models to define the expected format for inputs and outputs. These Pydantic models help validate data and ensure consistency throughout the workflow.

The data models include:

  1. Role - An enumeration for message roles (user/assistant)
  2. ChatMessage - Represents a single message in a conversation
  3. Issues - Base model for reporting detected issues
  4. FewShotIssues - Extended model that adds rewrite suggestions for example messages
  5. MessagesOutput - Contains optimized conversation messages
  6. DevRewriteOutput - Contains the improved developer prompt

Using Pydantic allows the system to validate that all data conforms to the expected format at each step of the process.
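
The six models listed above might be shaped roughly as follows. This is a sketch under assumptions, not the cookbook's definitions: to stay dependency-free it uses standard-library dataclasses with a manual check in place of Pydantic's BaseModel, and the field names `has_issues`, `issues`, and `rewrite_suggestions` are guesses (`new_developer_message` matches the golden examples later in this cookbook).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Role(str, Enum):
    """Message roles used in few-shot conversations."""
    user = "user"
    assistant = "assistant"


@dataclass
class ChatMessage:
    """A single message in a conversation."""
    role: Role
    content: str

    def __post_init__(self) -> None:
        # Coerce plain strings such as "user"; unknown roles raise ValueError.
        self.role = Role(self.role)
        if not isinstance(self.content, str):
            raise TypeError("content must be a string")


@dataclass
class Issues:
    """Base report returned by a checker (field names are assumptions)."""
    has_issues: bool
    issues: List[str] = field(default_factory=list)


@dataclass
class FewShotIssues(Issues):
    """Extends Issues with rewrite suggestions for example messages."""
    rewrite_suggestions: List[str] = field(default_factory=list)


@dataclass
class MessagesOutput:
    """Optimized conversation messages."""
    messages: List[ChatMessage]


@dataclass
class DevRewriteOutput:
    """The improved developer prompt."""
    new_developer_message: str
```

With Pydantic, the `__post_init__` check would be unnecessary, because type coercion and validation happen when the model is constructed.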

3. Defining the Agents

In this section, we create specialized AI agents using the Agent class from the openai-agents package. Looking at these agent definitions reveals several best practices for creating effective AI instructions:

Best Practices in Agent Instructions

  1. Clear Scope Definition: Each agent has a narrowly defined purpose with explicit boundaries. For example, the contradiction checker focuses only on "genuine self-contradictions" and explicitly states that "overlaps or redundancies are not contradictions."

  2. Step-by-Step Process: Instructions provide a clear methodology, like how the format checker first categorizes the task before analyzing format requirements.

  3. Explicit Definitions: Key terms are defined precisely to avoid ambiguity. The few-shot consistency checker includes a detailed "Compliance Rubric" explaining exactly what constitutes compliance.

  4. Boundary Setting: Instructions specify what the agent should NOT do. The few-shot checker explicitly lists what's "Out-of-scope" to prevent over-flagging issues.

  5. Structured Output Requirements: Each agent has a strictly defined output format with examples, ensuring consistency in the optimization pipeline.

These principles create reliable, focused agents that work effectively together in the optimization system. Below we see the complete agent definitions with their detailed instructions.
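
As a rough illustration of the five practices, the sketch below assembles an instruction string from labelled sections. The helper `build_instructions` and the section wording are hypothetical and are not the cookbook's actual agent instructions; only the phrases about genuine self-contradictions and redundancies come from the text above. A string like this would presumably be supplied as the instructions of an Agent.

```python
from typing import Dict, List


def build_instructions(
    scope: str,
    steps: List[str],
    definitions: Dict[str, str],
    out_of_scope: List[str],
    output_format: str,
) -> str:
    """Assemble checker instructions from five labelled sections."""
    lines = ["Scope:", scope, "", "Process:"]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    lines += ["", "Definitions:"]
    lines += [f"- {term}: {meaning}" for term, meaning in definitions.items()]
    lines += ["", "Out of scope (do NOT flag):"]
    lines += [f"- {item}" for item in out_of_scope]
    lines += ["", "Output format:", output_format]
    return "\n".join(lines)


# Illustrative wording only; not the instructions used in the cookbook.
contradiction_checker_instructions = build_instructions(
    scope="Detect only genuine self-contradictions in the developer prompt.",
    steps=[
        "Read the whole prompt before judging any single instruction.",
        "List pairs of instructions that cannot both be satisfied.",
    ],
    definitions={
        "contradiction": "two instructions that cannot both be followed",
    },
    out_of_scope=["Overlaps or redundancies are not contradictions."],
    output_format='JSON object: {"has_issues": bool, "issues": [str]}',
)
```

Keeping each section separate makes it easier to change one of them, such as the out-of-scope list, and re-run the evaluation without touching the rest.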

4. Using Evaluations to Arrive at These Agents

Let's see how we used OpenAI Evals to tune the agent instructions and choose a model. To do so, we constructed a set of golden examples: each one contains the original messages (a developer message plus user/assistant messages) and the changes our optimization workflow should make. Here are two examples of the golden pairs we used:

[
  {
    "focus": "contradiction_issues",
    "input_payload": {
      "developer_message": "Always answer in **English**.\nNunca respondas en inglés.",
      "messages": [
        {
          "role": "user",
          "content": "¿Qué hora es?"
        }
      ]
    },
    "golden_output": {
      "changes": true,
      "new_developer_message": "Always answer **in English**.",
      "new_messages": [
        {
          "role": "user",
          "content": "¿Qué hora es?"
        }
      ],
      "contradiction_issues": "Developer message simultaneously insists on English and forbids it.",
      "few_shot_contradiction_issues": "",
      "format_issues": "",
      "general_improvements": ""
    }
  },
  {
    "focus": "few_shot_contradiction_issues",
    "input_payload": {
      "developer_message": "Respond with **only 'yes' or 'no'** – no explanations.",
      "messages": [
        {
          "role": "user",
          "content": "Is the sky blue?"
        },
        {
          "role": "assistant",
          "content": "Yes, because wavelengths …"
        },
        {
          "role": "user",
          "content": "Is water wet?"
        },
        {
          "role": "assistant",
          "content": "Yes."
        }
      ]
    },
    "golden_output": {
      "changes": true,
      "new_developer_message": "Respond with **only** the single word \"yes\" or \"no\".",
      "new_messages": [
        {
          "role": "user",
          "content": "Is the sky blue?"
        },
        {
          "role": "assistant",
          "content": "yes"
        },
        {
          "role": "user",
          "content": "Is water wet?"
        },
        {
          "role": "assistant",
          "content": "yes"
        }
      ],
      "contradiction_issues": "",
      "few_shot_contradiction_issues": "Assistant examples include explanations despite instruction not to.",
      "format_issues": "",
      "general_improvements": ""
    }
  }
]

These 20 hand-labelled golden outputs cover contradiction issues, few-shot issues, format issues, no issues, and combinations of issues. From them we built a Python string-check grader that verifies two things: whether an issue was detected for each golden pair, and whether the detected issue matched the expected one. Using this signal, we tuned the agent instructions and the choice of model to maximize accuracy on the evaluation. We landed on the GPT-4.1 model as a balance between accuracy, cost, and speed, and the prompts we used follow the GPT-4.1 prompting guide. The workflow produces the correct labels on all 20 golden outputs: it identifies the right issues and avoids false positives.
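
The two checks could look roughly like the following. This is a sketch under assumptions, not the grader used in the evaluation: the function names `flagged` and `grade` are hypothetical, as is the rule that an issue field counts as detected when it holds a non-empty string. The field names come from the golden examples above.

```python
from typing import Dict, Set

# Issue fields taken from the golden examples.
ISSUE_FIELDS = (
    "contradiction_issues",
    "few_shot_contradiction_issues",
    "format_issues",
)


def flagged(output: Dict[str, object]) -> Set[str]:
    """Return the names of issue fields that hold a non-empty string."""
    return {
        name for name in ISSUE_FIELDS
        if str(output.get(name) or "").strip()
    }


def grade(predicted: Dict[str, object], golden: Dict[str, object]) -> Dict[str, bool]:
    """Compare a workflow output against its golden output."""
    pred, gold = flagged(predicted), flagged(golden)
    return {
        # Check 1: did the workflow report an issue exactly when one was expected?
        "issue_detected": bool(pred) == bool(gold),
        # Check 2: did it report the expected kind(s) of issue?
        "issue_type_matches": pred == gold,
    }


golden = {
    "contradiction_issues": "Developer message simultaneously insists on English and forbids it.",
    "few_shot_contradiction_issues": "",
    "format_issues": "",
}
predicted = {
    "contradiction_issues": "The prompt both requires and forbids English.",
    "few_shot_contradiction_issues": "",
    "format_issues": "",
}
result = grade(predicted, golden)
```

The check compares which fields are populated rather than the free-text wording, since two correct descriptions of the same contradiction will rarely match character for character.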

Figure: Accuracy for the golden set

Figure: Evaluation for the golden set

5. Run Optimization Workflow

Let's dive into how the optimization system actually works end to end. The core workflow consists of multiple runs of the agents in parallel to efficiently process and optimize prompts.

Figure: Trace for the workflow

Understanding the Optimization Workflow

The optimize_prompt_parallel function implements a workflow to maximize efficiency through parallelization:

  1. Parallel Issue Detection: The first phase runs all checker agents simultaneously:
    • dev_contradiction_checker searches for logical contradictions in the prompt
    • format_checker looks for unclear format specifications
    • fewshot_consistency_checker (if examples exist) checks for mismatches between the prompt and examples

After the parallel checking phase, the workflow handles dependencies carefully:

  2. Prompt Rewriting (Conditional): The dev_rewriter agent only runs if contradiction or format issues were detected. This agent depends on the outputs from:
    • dev_contradiction_checker (the cd_issues variable)
    • format_checker (the fi_issues variable)

  3. Example Rewriting (Conditional): The fewshot_rewriter agent only runs if example inconsistencies were detected. This agent depends on:
    • The rewritten prompt (must be done after prompt rewriting)
    • The original messages
    • The few-shot issues (the fs_issues variable)
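
The control flow described above can be sketched with asyncio. The function and variable names (optimize_prompt_parallel, cd_issues, fi_issues, fs_issues, and the agent names) come from the text, but every function body here is a hypothetical stub: real agents would call a model, whereas these stubs apply toy string rules so that the ordering and the conditions can be run locally.

```python
import asyncio
from typing import Dict, List, Tuple

Message = Dict[str, str]


async def dev_contradiction_checker(prompt: str) -> List[str]:
    """Stub: flags the English/Spanish clash from the golden example."""
    await asyncio.sleep(0)  # stands in for a model call
    return ["conflicting language rules"] if "Nunca" in prompt else []


async def format_checker(prompt: str) -> List[str]:
    """Stub: reports no format issues."""
    await asyncio.sleep(0)
    return []


async def fewshot_consistency_checker(prompt: str, messages: List[Message]) -> List[str]:
    """Stub: flags assistant examples that are longer than one word."""
    await asyncio.sleep(0)
    return [m["content"] for m in messages
            if m["role"] == "assistant" and len(m["content"].split()) > 1]


async def dev_rewriter(prompt: str, cd_issues: List[str], fi_issues: List[str]) -> str:
    """Stub: keeps only the first line of the prompt."""
    return prompt.splitlines()[0]


async def fewshot_rewriter(new_prompt: str, messages: List[Message],
                           fs_issues: List[str]) -> List[Message]:
    """Stub: reduces each flagged assistant reply to its first word."""
    return [
        {**m, "content": m["content"].split()[0].strip(",.").lower()}
        if m["content"] in fs_issues else m
        for m in messages
    ]


async def optimize_prompt_parallel(prompt: str,
                                   messages: List[Message]) -> Tuple[str, List[Message]]:
    # Phase 1: run all checkers concurrently; skip the few-shot check without examples.
    checks = [dev_contradiction_checker(prompt), format_checker(prompt)]
    if messages:
        checks.append(fewshot_consistency_checker(prompt, messages))
    results = await asyncio.gather(*checks)
    cd_issues, fi_issues = results[0], results[1]
    fs_issues = results[2] if messages else []

    # Phase 2: rewrite the prompt only if contradiction or format issues exist.
    new_prompt = prompt
    if cd_issues or fi_issues:
        new_prompt = await dev_rewriter(prompt, cd_issues, fi_issues)

    # Phase 3: rewrite examples only if inconsistent, and only after phase 2,
    # because the rewriter needs the final prompt.
    new_messages = messages
    if fs_issues:
        new_messages = await fewshot_rewriter(new_prompt, messages, fs_issues)
    return new_prompt, new_messages
```

Calling `asyncio.run(optimize_prompt_parallel(prompt, messages))` returns the (possibly rewritten) prompt and messages. The checkers are independent of one another, so asyncio.gather can overlap their latency, while the two rewriters run sequentially because the second depends on the first.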

6. Examples

Let's see the optimization system in action with some practical examples.

Example 1: Fixing Contradictions

This demonstrates how the system can detect and resolve critical contradictions that could lead to inconsistent outputs or confusion for the model.

Example 2: Fixing Inconsistencies Between Prompt and Few-Shot Examples

This is particularly important because few-shot examples have a strong influence on how models respond. If the examples don't follow the stated rules, the model may learn to ignore those rules in favor of mimicking the examples. By ensuring consistency between the prompt instructions and the examples, the optimization system creates a more reliable prompt.

Example 3: Clarifying Formats in a Longer Prompt

This example highlights how the format checker identifies and resolves ambiguous format specifications. The prompt requested a Markdown output and the optimization flow significantly improved these format specifications.
