Prompt Chain

Evaluate Strands Agents with Promptfoo

A prompt chain example that demonstrates how to evaluate Strands Agents SDK workflows using promptfoo's testing framework.

Works with: promptfoo, Strands Agents

Version code-scan-action-0.1


Why it matters

This asset allows developers to evaluate the performance and reliability of Strands Agents SDK integrations using the promptfoo evaluation framework. It streamlines the process of testing agent behavior and code generation capabilities.

Outcomes

What it gets done

1. Integrate Strands Agents SDK with promptfoo for automated testing.
2. Evaluate agent performance on code generation tasks.
3. Debug and refine agent logic based on evaluation results.
4. Benchmark different agent configurations.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/pfoo-integration-strands-agents | bash

Capabilities

What this chain does

Generate code

Writes source code or scripts from a description.

Review code

Analyzes code for bugs, style issues, and improvements.

Debug

Traces errors to their root cause and suggests fixes.

Overview

Integration Strands Agents

What it does

This is a working example that demonstrates the integration between Strands Agents SDK and promptfoo. It provides a reference implementation showing how to configure promptfoo to evaluate agent workflows built with the Strands Agents SDK, enabling systematic testing of agent behaviors and outputs.

How it connects

Use this example when you are building agents with Strands Agents SDK and need to implement systematic evaluation and testing. It is particularly useful when you want to establish quality assurance processes for agent-based applications or need a starting point for integrating promptfoo into your Strands Agents development workflow.

Source README

integration-strands-agents (Strands Agents SDK example)

This example demonstrates how to evaluate agents built with the Strands Agents SDK using promptfoo.

Strands Agents is an open-source AI agent framework developed by AWS that provides a model-driven approach to building AI agents.

You can run this example with:

npx promptfoo@latest init --example integration-strands-agents
cd integration-strands-agents

Overview

This example showcases:

Prerequisites

Setup

1. Install Python dependencies

pip install -r requirements.txt

This installs:

2. Set environment variables

export OPENAI_API_KEY=your-api-key-here

Alternative: use Anthropic or Bedrock

Strands supports multiple model providers. To use Anthropic:

pip install 'strands-agents[anthropic]'
export ANTHROPIC_API_KEY=your-key

Then modify agent.py to use AnthropicModel instead of OpenAIModel.

To use Amazon Bedrock:

pip install 'strands-agents[bedrock]'

Running the example

# Run evaluation
npx promptfoo eval

# View results in the web UI
npx promptfoo view

How it works

Agent structure

The agent is defined in agent.py using the Strands Agent class with two tools:

  • get_weather: Returns mock weather data for cities (New York, London, Tokyo, Paris, Seattle, San Francisco)
  • convert_temperature: Converts temperatures between Fahrenheit and Celsius

Tools are defined using the @tool decorator, which automatically exposes them to the LLM based on their docstrings.
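A minimal sketch of what the two tools might look like follows. The city list comes from the description above; the temperature readings, the `MOCK_WEATHER` name, and the function signatures are assumptions for illustration, not the contents of the example's agent.py. In the real example each function would carry the `@tool` decorator; it is left as a comment here so the sketch runs without the SDK installed.

```python
# Hypothetical mock data: the cities are from the README, the readings are invented.
MOCK_WEATHER = {
    "new york": (72, "sunny"),
    "london": (59, "cloudy"),
    "tokyo": (68, "clear"),
    "paris": (64, "partly cloudy"),
    "seattle": (55, "rainy"),
    "san francisco": (61, "foggy"),
}


# @tool  (assumed: applied in the real agent.py via `from strands import tool`)
def get_weather(city: str) -> str:
    """Return mock weather for a city.

    Args:
        city: Name of the city to look up.
    """
    entry = MOCK_WEATHER.get(city.strip().lower())
    if entry is None:
        # Unknown cities get a plain fallback message rather than an exception.
        return f"No weather data available for {city}."
    temp_f, conditions = entry
    return f"{city.strip().title()}: {temp_f}°F, {conditions}"


# @tool  (assumed, as above)
def convert_temperature(value: float, from_unit: str, to_unit: str) -> float:
    """Convert a temperature between Fahrenheit ("F") and Celsius ("C").

    Args:
        value: Temperature to convert.
        from_unit: Unit of the input value, "F" or "C".
        to_unit: Unit to convert to, "F" or "C".
    """
    src = from_unit.strip().upper()[:1]
    dst = to_unit.strip().upper()[:1]
    if src not in {"F", "C"} or dst not in {"F", "C"}:
        raise ValueError(f"Unsupported units: {from_unit!r} -> {to_unit!r}")
    if src == dst:
        return float(value)
    if src == "F":
        return round((value - 32) * 5 / 9, 1)
    return round(value * 9 / 5 + 32, 1)
```

The docstrings matter in this design: they are the description the model reads when deciding which tool to call, so they state the arguments and their allowed values explicitly.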

Provider integration

agent_provider.py exposes a call_api function that promptfoo's Python provider calls to interact with the Strands agent.
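The shape of such a provider can be sketched as follows. promptfoo's Python provider calls `call_api(prompt, options, context)` and reads the `output` key of the returned dictionary. The `run_agent` function below is a hypothetical stand-in for invoking the Strands agent, included only so the sketch runs on its own; the example's actual agent_provider.py may differ.

```python
from typing import Any, Dict


def run_agent(prompt: str) -> str:
    """Hypothetical stand-in for invoking the Strands agent defined in agent.py."""
    return f"(agent reply to: {prompt})"


def call_api(prompt: str, options: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
    """Entry point that promptfoo's Python provider looks up by name."""
    try:
        # Convert the agent result to a string so assertions operate on plain text.
        return {"output": str(run_agent(prompt))}
    except Exception as exc:
        # An "error" key reports the failure to promptfoo instead of raising.
        return {"error": f"Agent call failed: {exc}"}
```

Keeping the provider this thin means the agent logic stays in agent.py and can be exercised outside promptfoo as well.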

Test cases and assertion types

The promptfoo config includes 5 test cases that demonstrate different assertion types:

Test                                | Description                | Assertion types used
Weather query for New York          | Basic tool usage           | contains-any, llm-rubric, latency
Weather query for London            | Verify temperature format  | contains-any, javascript, latency
Weather query for Tokyo             | Case-insensitive matching  | icontains, javascript, latency
Weather with temperature conversion | Multi-tool chaining        | llm-rubric, javascript, latency
Weather for unknown city            | Graceful fallback handling | icontains, not-contains, latency

Assertion types explained

  • latency - Ensures responses complete within 30 seconds (applied to all tests via defaultTest)
  • contains-any - Verifies the agent returns expected city names and weather data from the mock tool
  • icontains - Case-insensitive matching to verify city names appear regardless of formatting
  • not-contains - Ensures the agent handles unknown cities gracefully without error messages
  • javascript - Validates temperature format (°F/°C symbols) and response length requirements
  • llm-rubric - Semantically evaluates whether the agent correctly chains weather lookup with temperature conversion
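The example uses `javascript` assertions for the format check. As an illustration of the kind of logic involved, the sketch below expresses a comparable check in Python, following the `get_assert(output, context)` convention promptfoo uses for file-based `python` assertions. The regular expression and the 20-character minimum are assumptions, not values taken from the example's config.

```python
import re

# Assumed format: a number followed by a degree sign and F or C, e.g. "59°F" or "15.2 °C".
TEMPERATURE_PATTERN = re.compile(r"-?\d+(?:\.\d+)?\s*°\s*[FC]\b")

# Assumed minimum length for a response to count as a real answer.
MIN_LENGTH = 20


def get_assert(output: str, context: dict) -> bool:
    """Pass when the output contains a formatted temperature and is not trivially short."""
    has_temperature = TEMPERATURE_PATTERN.search(output) is not None
    return has_temperature and len(output.strip()) >= MIN_LENGTH
```

A deterministic check like this is cheap to run on every test, which leaves `llm-rubric` for the cases that need semantic judgment, such as multi-tool chaining.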
