Prompt Migration Guide

Newer models, such as GPT-4.1, are best in class in performance and instruction following. As model gets smarter, there is a consistent need to adapt prompts that were originally tailored to earlier models' limitations, ensuring they remain effective and clear for newer generations.

Get this prompt chain

Prompt Migration Guide

Newer models, such as GPT-4.1, are best in class in performance and instruction following. As model gets smarter, there is a consistent need to adapt prompts that were originally tailored to earlier models' limitations, ensuring they remain effective and clear for newer generations.

Models such as GPT‑4.1 excel at closely following instructions, but this precision means it can interpret unclear or poorly phrased instructions literally, leading to unexpected or incorrect results. To leverage GPT‑4.1's full potential, it's essential to refine prompts, ensuring each instruction is explicit, unambiguous, and aligned with your intended outcomes.


Example of Unclear Instructions:

  • Ambiguous:

""Do not include irrelevant information.""

Issue: GPT-4.1 might struggle to determine what is "irrelevant" if not explicitly defined. This could cause it to omit essential details due to overly cautious interpretation or include too much detail inadvertently..

  • Improved:

"Only include facts directly related to the main topic (X). Exclude personal anecdotes, unrelated historical context, or side discussions."


Objective: This interactive notebook helps you improve an existing prompt (written for another model) into one that is clear, unambiguous and optimised for GPT‑4.1 following best practices.

Workflow Overview
This notebook uses the following approach:

Prerequisites

  • The openai Python package and OPENAI_API_KEY

Below are a few helper functions to enable us to easily review the analysis and modifications on our prompt.

Step 1. Input Your Original Prompt

Begin by providing your existing prompt clearly between triple quotes ("""). This prompt will serve as the baseline for improvement.

For this example, we will be using the system prompt for LLM-as-a-Judge provided in the following paper.

Step 2. Identify All Instructions in your Prompt

In this section, we will extract every INSTRUCTION that the LLM identifies within the system prompt. This allows you to review the list, spot any statements that should not be instructions, and clarify any that are ambiguous.

Carefully review and confirm that each listed instruction is both accurate and essential to retain.

It's helpful to examine which parts of your prompt the model recognizes as instructions. Instructions are how we "program" models using natural language, so it's crucial to ensure they're clear, precise, and correct.

Step 3. Ask GPT-4.1 to critique the prompt

Next, GPT‑4.1 itself will critique the original prompt, specifically identifying areas that may cause confusion or errors:

  • Ambiguity: Phrases open to multiple interpretations.

  • Lacking Definitions: Labels or terms that are not clearly defined, which may cause the model to infer or guess their intended meaning.

  • Conflicting Instructions: Rules or conditions that contradict or overlap.

  • Missing Context or Assumptions: Necessary information or context not explicitly provided.

The critique output will be clearly organized, highlighting specific issues along with actionable suggestions for improvement.

Models are really good at identifying parts of a prompt that they find ambiguous or confusing. By addressing these issues, we can engineer the instructions to make them clearer and more effective for the model.

Review the list of issues:

  • If you are satisfied with them, proceed to next step #4.
  • If you believe some issues are not relevant, copy the above text into the next cell and remove those issues. In this case, all three issues make reasonable sense, so we skip this step.

Step 4. Auto‑generate a revised system prompt

We now feed the critique back to GPT‑4.1 and ask it to produce an improved version of the original prompt, ready to drop into a system role message.

Let's review the changes side-by-side comparison highlighting changes between the improved and refined prompts:

Step 5. Evaluate and iterate

Finally, evaluate your refined prompt by:

  • Testing it with representative evaluation examples or data.

  • Analyzing the responses to ensure desired outcomes.

  • Iterating through previous steps if further improvements are required.

Consistent testing and refinement ensure your prompts consistently achieve their intended results.

Current Example

Let’s evaluate whether our current prompt migration has actually improved for the task of this judge. The original prompt, drawn from this paper, is designed to serve as a judge between two assistants’ answers. Conveniently, the paper provides a set of human-annotated ground truths, so we can measure how often the LLM judge agrees with the humans judgments.

Thus, our metric of success will be measuring how closely the judgments generated by our migrated prompt align with human evaluations compared to the judgments generated with our baseline prompt. For context, the benchmark we’re using is a subset of MT-Bench, which features multi-turn conversations. In this example, we’re evaluating 200 conversation rows, each comparing the performance of different model pairs.

On our evaluation subset, a useful reference anchor is human-human agreement, since each conversation is rated by multiple annotators. Humans do not always agree with each other on which assistant answer is better, so we wouldn't expect our judge to achieve perfect agreement either. For turn 1 (without ties), humans agree with each other in 81% of cases, and for turn 2, in 76% of cases.

Graph 3 for Model Agreement

Comparing this to our models before migration, GPT-4 (as used in the paper) achieves an agreement with human judgments of 74% on turn 1 and 71% on turn 2, which is not bad, but still below the human-human ceiling. Switching to GPT-4.1 (using the same prompt) improves the agreement: 77% on turn 1 and 72% on turn 2. Finally, after migrating and tuning our prompt specifically for GPT-4.1, the agreement climbs further, reaching 80% on turn 1 and 72% on turn 2, very close to matching the level of agreement seen between human annotators.

Viewed all together, we can see that prompt migration and upgrading to more powerful models improve agreement on our sample task. Go ahead and try it on your prompt now!

Step 6. (OPTIONAL) Automatically Apply GPT‑4.1 Best Practices

In this step, GPT-4.1 best practices will be applied automatically to enhance your original prompt. We strongly suggest to manually review the edits made and decide if you want to keep or not.

See the 4.1 Prompting Guide for reference.

Comments (0)

Sign In Sign in to leave a comment.