Evaluate Strands Agents with Promptfoo

Name: Evaluate Strands Agents with Promptfoo
Availability: OnlineOnly
Author: Promptfoo

A prompt chain example that demonstrates how to evaluate Strands Agents SDK workflows using promptfoo's testing framework.

Version code-scan-action-0.1

Why it matters

This asset allows developers to evaluate the performance and reliability of Strands Agents SDK integrations using the promptfoo evaluation framework. It streamlines the process of testing agent behavior and code generation capabilities.

Outcomes

What it gets done

Integrate Strands Agents SDK with promptfoo for automated testing.

Evaluate agent performance on code generation tasks.

Debug and refine agent logic based on evaluation results.

Benchmark different agent configurations.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/pfoo-integration-strands-agents | bash

Capabilities

What this chain does

Generate code

Writes source code or scripts from a description.

Review code

Analyzes code for bugs, style issues, and improvements.

Debug

Traces errors to their root cause and suggests fixes.

Overview

Integration Strands Agents

What it does

This is a working example that demonstrates the integration between Strands Agents SDK and promptfoo. It provides a reference implementation showing how to configure promptfoo to evaluate agent workflows built with the Strands Agents SDK, enabling systematic testing of agent behaviors and outputs.

How it connects

Use this example when you are building agents with Strands Agents SDK and need to implement systematic evaluation and testing. It is particularly useful when you want to establish quality assurance processes for agent-based applications or need a starting point for integrating promptfoo into your Strands Agents development workflow.

Source README

integration-strands-agents (Strands Agents SDK example)

This example demonstrates how to evaluate Strands Agents SDK with promptfoo.

Strands Agents is an open-source AI agent framework developed by AWS that provides a model-driven approach to building AI agents.

You can run this example with:

npx promptfoo@latest init --example integration-strands-agents
cd integration-strands-agents

Overview

This example showcases:

Creating a Strands agent with custom tools
Using the @tool decorator to define agent capabilities
Evaluating agent responses with various promptfoo assertions
Testing tool usage with mock weather and temperature conversion tools

Prerequisites

Python 3.9+
OpenAI API key (the default), or credentials for another supported provider

Setup

1. Install Python dependencies

pip install -r requirements.txt

This installs:

strands-agents[openai] - The Strands Agents SDK with OpenAI support
pydantic - Data validation library required by Strands

2. Set environment variables

export OPENAI_API_KEY=your-api-key-here

Alternative: use Anthropic or Bedrock

Strands supports multiple model providers. To use Anthropic:

pip install 'strands-agents[anthropic]'
export ANTHROPIC_API_KEY=your-key

Then modify agent.py to use AnthropicModel instead of OpenAIModel.

To use Amazon Bedrock:

pip install 'strands-agents[bedrock]'

Running the example

# Run evaluation
npx promptfoo eval

# View results in the web UI
npx promptfoo view

How it works

Agent structure

The agent is defined in agent.py using the Strands Agent class with two tools:

get_weather: Returns mock weather data for cities (New York, London, Tokyo, Paris, Seattle, San Francisco)
convert_temperature: Converts temperatures between Fahrenheit and Celsius

Tools are defined with the @tool decorator, which exposes each function to the LLM and uses its docstring as the tool description.
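As an illustration of the two tools described above, here is a minimal sketch. The mock values, return strings, and parameter names are assumptions made for this sketch and are not taken from the example's agent.py; the fallback decorator exists only so the snippet runs without the SDK installed.

```python
try:
    from strands import tool  # real decorator when the SDK is installed
except ImportError:
    def tool(func):
        """No-op stand-in so this sketch runs without the SDK."""
        return func

# Hypothetical mock data; the example's actual values may differ.
MOCK_WEATHER = {
    "new york": (72, "sunny"),
    "london": (59, "cloudy"),
    "tokyo": (68, "clear"),
}


@tool
def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    entry = MOCK_WEATHER.get(city.strip().lower())
    if entry is None:
        # Unknown cities get a plain fallback message instead of an exception.
        return f"No weather data available for {city}."
    temp_f, conditions = entry
    return f"{city}: {temp_f}°F, {conditions}"


@tool
def convert_temperature(value: float, from_unit: str) -> str:
    """Convert a temperature between Fahrenheit and Celsius.

    Args:
        value: Temperature to convert.
        from_unit: Unit of the input value, "F" or "C".
    """
    unit = from_unit.strip().upper()
    if unit == "F":
        return f"{(value - 32) * 5 / 9:.1f}°C"
    if unit == "C":
        return f"{value * 9 / 5 + 32:.1f}°F"
    return f"Unknown unit: {from_unit}"
```

With the SDK installed, functions like these would typically be passed to the agent as a list, along the lines of Agent(tools=[get_weather, convert_temperature]).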

Provider integration

agent_provider.py exposes a call_api function that promptfoo's Python provider calls to interact with the Strands agent.
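The shape of such a provider can be sketched as follows. promptfoo's Python provider calls call_api(prompt, options, context) and reads the output (or error) key of the returned dictionary. The make_call_api helper and the echo_agent stand-in are hypothetical names introduced for this sketch; in the real example the callable would be the Strands agent from agent.py.

```python
from typing import Any, Callable, Dict


def make_call_api(run_agent: Callable[[str], Any]) -> Callable[..., Dict[str, Any]]:
    """Wrap any callable that answers a prompt in a promptfoo-style call_api."""

    def call_api(prompt: str, options: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        try:
            result = run_agent(prompt)
        except Exception as exc:
            # Report failures to promptfoo instead of crashing the eval run.
            return {"error": f"Agent call failed: {exc}"}
        # Assumption: the agent's result converts to its final text via str().
        return {"output": str(result)}

    return call_api


def echo_agent(prompt: str) -> str:
    """Hypothetical stand-in for the Strands agent, used only in this sketch."""
    return f"(stub) You asked: {prompt}"


# promptfoo looks up a module-level function named call_api.
call_api = make_call_api(echo_agent)
```

Keeping the agent behind a plain callable makes the provider easy to exercise without an API key, which can help when debugging the promptfoo configuration separately from the model.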

Test cases and assertion types

The promptfoo config includes five test cases that demonstrate different assertion types:

Test	Description	Assertion types used
Weather query for New York	Basic tool usage	`contains-any`, `llm-rubric`, `latency`
Weather query for London	Verify temperature format	`contains-any`, `javascript`, `latency`
Weather query for Tokyo	Case-insensitive matching	`icontains`, `javascript`, `latency`
Weather with temperature conversion	Multi-tool chaining	`llm-rubric`, `javascript`, `latency`
Weather for unknown city	Graceful fallback handling	`icontains`, `not-contains`, `latency`

Assertion types explained

latency - Ensures responses complete within 30 seconds (applied to all tests via defaultTest)
contains-any - Verifies the agent returns expected city names and weather data from the mock tool
icontains - Case-insensitive matching to verify city names appear regardless of formatting
not-contains - Ensures the response for an unknown city does not contain error messages, so the fallback is handled gracefully
javascript - Validates temperature format (°F/°C symbols) and response length requirements
llm-rubric - Semantically evaluates whether the agent correctly chains weather lookup with temperature conversion
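To make the format check concrete, here is a sketch of the same kind of validation written as a promptfoo python assertion. The example itself uses javascript assertions. The regular expression, the MIN_LENGTH threshold, and the reason strings are assumptions for illustration, not the example's actual checks.

```python
import re
from typing import Any, Dict

# Matches values such as "72°F", "22.2 °C" or "-5°C".
TEMPERATURE_PATTERN = re.compile(r"-?\d+(?:\.\d+)?\s*°\s*[FC]")

# Assumed minimum response length; the example's real threshold is not shown.
MIN_LENGTH = 20


def get_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Pass when the response has a temperature with a unit and is long enough."""
    has_temperature = bool(TEMPERATURE_PATTERN.search(output))
    long_enough = len(output.strip()) >= MIN_LENGTH
    passed = has_temperature and long_enough
    if passed:
        reason = "Response contains a formatted temperature"
    elif not has_temperature:
        reason = "No temperature with a °F/°C symbol found"
    else:
        reason = f"Response shorter than {MIN_LENGTH} characters"
    return {"pass": passed, "score": 1.0 if passed else 0.0, "reason": reason}
```

A file like this would be referenced from a test's assert list with type python and a file:// path. Returning a dictionary with pass, score, and reason lets the failure reason show up in the results view.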
