What does this prompt chain compare?

It compares the same gpt-5.4-mini model with two different reasoning settings: 'none' for low-latency tasks and 'medium' for tasks where extra deliberation may improve reliability, keeping all other settings identical.

What metrics does the eval table show?

The eval table compares output quality, latency, and cost between the two reasoning effort settings, so you can decide whether a task benefits from reasoning tokens or can run faster and cheaper with reasoning turned off.

Who should use this example?

It is for teams deciding whether a specific task needs gpt-5.4-mini's reasoning effort enabled and who want a controlled, single-variable comparison before committing to the extra latency and cost.

Prompt Chain

Compare LLM Reasoning Effort

Name: Compare GPT Reasoning Effort
Availability: Online only
Author: Promptfoo

Promptfoo example that compares OpenAI gpt-5.4-mini with two reasoning effort levels (none vs. medium) to evaluate output quality, latency, and cost.


Works with GitHub

Promptfoo


Updated 6 days ago

Version 0.121.19

Models

gpt-5.4-mini


Why it matters

Measure how the reasoning effort setting of a single model affects output quality, latency, and cost. This example helps you judge whether a task benefits from reasoning tokens or can run faster and cheaper without them.

Outcomes

What it gets done

Run the same prompts against two reasoning effort configurations of one model.

Compare outputs side by side for quality, latency, and cost.

Generate code examples for prompt execution.

Facilitate code review of prompt execution logic.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/pfoo-compare-gpt-reasoning-effort | bash

Steps

Steps in the chain

Initialize example project

Navigate to project directory

Set OpenAI API key

Run evaluation

Overview

Compare GPT Reasoning Effort


What it does

This prompt chain compares the same gpt-5.4-mini Responses API model with two different reasoning effort settings: none for low-latency tasks that do not need reasoning tokens, and medium for tasks where extra deliberation may improve reliability. The evaluation keeps the provider, prompt, verbosity, and output limit identical across both configurations, making it easy to isolate and compare output quality, latency, and cost for each reasoning effort level.
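The two configurations described above might look roughly like the sketch below. This is not the example's actual config file: the scratch file name, the provider labels, the prompt, the test question, and the output limit of 1000 are all hypothetical, and the `reasoning.effort` and `max_output_tokens` keys are written as assumptions about how promptfoo's OpenAI Responses provider is configured. The verbosity setting is left out of the sketch. The point is only that the two provider entries differ in a single value.

```shell
# Write a hypothetical config sketch to a scratch file.
# This does NOT overwrite the example's real promptfooconfig.yaml.
cat > reasoning-effort-sketch.yaml <<'EOF'
# Assumed shape of a promptfoo config with one variable changed.
description: Compare reasoning settings (sketch)
prompts:
  - 'Answer concisely: {{question}}'    # hypothetical prompt
providers:
  - id: openai:responses:gpt-5.4-mini
    label: reasoning-none               # hypothetical label
    config:
      reasoning:
        effort: 'none'                  # the only value that differs
      max_output_tokens: 1000           # hypothetical limit, same in both
  - id: openai:responses:gpt-5.4-mini
    label: reasoning-medium             # hypothetical label
    config:
      reasoning:
        effort: 'medium'                # the only value that differs
      max_output_tokens: 1000
tests:
  - vars:
      question: 'A trip starts at 3:40 pm and takes 95 minutes. When does it end?'
EOF

# Print the lines that differ between the two provider entries.
grep -n 'effort:' reasoning-effort-sketch.yaml
```

If the assumed keys match your promptfoo version, the sketch could be tried with `promptfoo eval -c reasoning-effort-sketch.yaml --no-cache`; otherwise, treat the example's own config file as the reference.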

When to use - and when NOT to

Use this example when you need to decide whether enabling reasoning tokens justifies the added latency and cost for your specific use case. It is ideal for evaluating whether your tasks benefit from extra deliberation (medium reasoning) or can rely on faster, direct responses (no reasoning). Do NOT use it if you are working with models other than gpt-5.4-mini or need to compare across different model families: the example is designed to isolate the reasoning effort variable within a single model.

Inputs and outputs

You provide an OpenAI API key via the OPENAI_API_KEY environment variable. The example runs a set of evaluation prompts against both reasoning configurations. You receive a side-by-side evaluation table that displays output quality, latency metrics, and cost data for each reasoning effort setting, allowing direct comparison of the trade-offs.

Integrations

This example integrates with:

Promptfoo: The evaluation framework that orchestrates the comparison, runs the tests, and generates the evaluation table
OpenAI API: Specifically the gpt-5.4-mini model via the Responses API, which supports configurable reasoning effort levels

Who it's for

This example is for AI engineers and product teams who are optimizing their OpenAI API usage and need empirical data to choose the right reasoning effort setting. It is particularly valuable for developers who want to understand the performance and cost implications of reasoning tokens before deploying gpt-5.4-mini in production. If you are building latency-sensitive applications or working within tight budget constraints, this comparison helps you make an informed decision about whether the quality improvements from medium reasoning justify the additional overhead.

Getting started

To run this example, initialize it with:

npx promptfoo@latest init --example compare-gpt-reasoning-effort
cd compare-gpt-reasoning-effort

Set your OPENAI_API_KEY environment variable, then execute the evaluation:

promptfoo eval --no-cache

The --no-cache flag ensures fresh API calls for accurate latency and cost measurements.
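A small guard around the run step can save a confusing failure when the key is missing. The sketch below assumes a POSIX-style shell; the `check_api_key` helper is a hypothetical name introduced here, not part of promptfoo or the example.

```shell
# Hypothetical helper: fail with a clear message if the key is missing.
check_api_key() {
  if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY is not set" >&2
    return 1
  fi
}

# Run the eval only when the key is present and the promptfoo CLI is on PATH.
if check_api_key && command -v promptfoo >/dev/null 2>&1; then
  promptfoo eval --no-cache
fi
```

After a successful run, `promptfoo view` opens the results in a browser, which is usually easier to read than the terminal table when comparing the two configurations.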

Source README

compare-gpt-reasoning-effort (GPT Reasoning Effort Comparison)

You can run this example with:

npx promptfoo@latest init --example compare-gpt-reasoning-effort
cd compare-gpt-reasoning-effort

Usage

This example compares the same gpt-5.4-mini Responses API model with two reasoning settings:

none for low-latency tasks that do not need reasoning tokens
medium for tasks where extra deliberation may improve reliability

Set OPENAI_API_KEY, then run:

promptfoo eval --no-cache

The provider, prompt, verbosity, and output limit are otherwise identical, so the eval table makes it easier to compare output quality, latency, and cost for each reasoning effort.
