Prompt Chain

Compare Claude Model Reasoning

See Claude's step-by-step reasoning before the final answer. Compares Claude Sonnet 4 and Haiku 4.5 thinking outputs.

Works with anthropicaws

54
Spark score
out of 100
Updated yesterday
Version code-scan-action-0.1
Models
claude 3 5 sonnet

Add to Favorites

Why it matters

Understand and compare the step-by-step reasoning processes of different Claude AI models. Gain insights into how models like Sonnet and Haiku arrive at their conclusions before presenting a final answer.

Outcomes

What it gets done

01

Analyze Claude Sonnet 4's thinking process.

02

Analyze Claude Haiku 4.5's thinking process.

03

Compare reasoning outputs between models.

04

Extract key steps from model thought processes.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/pfoo-claude-thinking | bash

Capabilities

What this chain does

Summarize

Condenses long documents or threads into key takeaways.

Extract

Pulls structured data fields from unstructured text.

Classify

Labels or categorizes text, files, or data points.

Overview

Claude Thinking

What it does

This prompt chain demonstrates Claude's 'thinking' capability, allowing users to observe the model's step-by-step reasoning process before it generates a final answer. It compares thinking outputs from Claude Sonnet 4 (Anthropic API) and Claude Haiku 4.5 (AWS Bedrock).

How it connects

Use this prompt chain when you need to understand the reasoning behind Claude's responses, especially for complex tasks where transparency is crucial. It's ideal for educational purposes or when comparing different model behaviors. Do not use this if you only need a quick, final answer without insight into the reasoning process, or if the task is too simple to benefit from a detailed thought process.

Source README

claude-thinking (Claude Thinking)

This example demonstrates Claude's "thinking" capability, which allows you to see the model's step-by-step reasoning process before it provides a final answer. The example compares thinking outputs from Claude Sonnet 4 (Anthropic API) and Claude Haiku 4.5 (AWS Bedrock).

You can run this example with:

npx promptfoo@latest init --example claude-thinking
cd claude-thinking

What This Example Demonstrates

  • Using Claude's thinking feature to reveal step-by-step reasoning
  • Comparing thinking output quality between different Claude models
  • Comparing Anthropic API vs AWS Bedrock providers
  • Configuring the thinking token budget
  • Using LLM-based evaluation rubrics to assess reasoning quality

Environment Variables

This example requires:

For Anthropic API

For AWS Bedrock

  • AWS_ACCESS_KEY_ID - Your AWS access key
  • AWS_SECRET_ACCESS_KEY - Your AWS secret key
  • Or configure credentials via the AWS CLI: aws configure

Running the Example

After setting up environment variables:

# From the example directory
promptfoo eval
promptfoo view

Test Cases

This example includes several test cases of increasing complexity:

  1. 8 Balls Problem - A classic logic puzzle requiring careful reasoning
  2. Train Meeting Problem - A traditional algebra word problem

These test cases are specifically designed to showcase Claude's ability to break down complex problems and show detailed thinking steps.

How Claude Thinking Works

The thinking feature is enabled by setting special parameters in the provider configuration:

thinking:
  type: 'enabled'
  budget_tokens: 4096 # Controls how many tokens are allocated for thinking
max_tokens: 8192 # Must be greater than budget_tokens

When enabled, Claude's response will include a "Thinking:" section that shows its reasoning process before the final answer:

Thinking: Let me solve this step by step...
1. First, I'll divide the 8 balls into three groups...
2. In the first weighing, I'll compare groups A and B...
3. Based on the result, I can determine...

Final answer: We need exactly 2 weighings to find the heavier ball.

Additional Resources

Discussion

Questions & comments · 0

Sign In Sign in to leave a comment.