Evaluate Strands Agents with Promptfoo
A prompt chain example that demonstrates how to evaluate Strands Agents SDK workflows using promptfoo's testing framework.
Why it matters
This asset allows developers to evaluate the performance and reliability of Strands Agents SDK integrations using the promptfoo evaluation framework. It streamlines the process of testing agent behavior and code generation capabilities.
Outcomes
What it gets done
Integrate Strands Agents SDK with promptfoo for automated testing.
Evaluate agent performance on code generation tasks.
Debug and refine agent logic based on evaluation results.
Benchmark different agent configurations.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/pfoo-integration-strands-agents | bash
Capabilities
What this chain does
Writes source code or scripts from a description.
Analyzes code for bugs, style issues, and improvements.
Traces errors to their root cause and suggests fixes.
Overview
Integration Strands Agents
What it does
This is a working example that demonstrates the integration between Strands Agents SDK and promptfoo. It provides a reference implementation showing how to configure promptfoo to evaluate agent workflows built with the Strands Agents SDK, enabling systematic testing of agent behaviors and outputs.
How it connects
Use this example when you are building agents with Strands Agents SDK and need to implement systematic evaluation and testing. It is particularly useful when you want to establish quality assurance processes for agent-based applications or need a starting point for integrating promptfoo into your Strands Agents development workflow.
Source README
integration-strands-agents (Strands Agents SDK example)
This example demonstrates how to evaluate Strands Agents SDK with promptfoo.
Strands Agents is an open-source AI agent framework developed by AWS that provides a model-driven approach to building AI agents.
You can run this example with:
npx promptfoo@latest init --example integration-strands-agents
cd integration-strands-agents
Overview
This example showcases:
- Creating a Strands agent with custom tools
- Using the @tool decorator to define agent capabilities
- Evaluating agent responses with various promptfoo assertions
- Testing tool usage with mock weather and temperature conversion tools
Prerequisites
- Python 3.9+
- OpenAI API key (default) or other supported provider
Setup
1. Install Python dependencies
pip install -r requirements.txt
This installs:
- strands-agents[openai] - The Strands Agents SDK with OpenAI support
- pydantic - Data validation library required by Strands
2. Set environment variables
export OPENAI_API_KEY=your-api-key-here
Alternative: use Anthropic or Bedrock
Strands supports multiple model providers. To use Anthropic:
pip install 'strands-agents[anthropic]'
export ANTHROPIC_API_KEY=your-key
Then modify agent.py to use AnthropicModel instead of OpenAIModel.
To use Amazon Bedrock:
pip install 'strands-agents[bedrock]'
Running the example
# Run evaluation
npx promptfoo eval
# View results in the web UI
npx promptfoo view
How it works
Agent structure
The agent is defined in agent.py using the Strands Agent class with two tools:
- get_weather: Returns mock weather data for cities (New York, London, Tokyo, Paris, Seattle, San Francisco)
- convert_temperature: Converts temperatures between Fahrenheit and Celsius
Tools are defined using the @tool decorator, which automatically exposes them to the LLM based on their docstrings.
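As a rough illustration of what the two tools might look like, here is a minimal sketch in plain Python. The function names come from the description above; the weather values, the fallback message, the parameter names, and the output format are assumptions made for illustration and are not taken from agent.py. In the actual example these functions would be wrapped with the Strands @tool decorator and passed to the Agent; that step is left out so the sketch runs without the SDK or an API key.

```python
# Hypothetical mock data; the real example's values may differ.
MOCK_WEATHER = {
    "new york": {"condition": "sunny", "temp_f": 72},
    "london": {"condition": "cloudy", "temp_f": 59},
    "tokyo": {"condition": "rainy", "temp_f": 68},
}


def get_weather(city: str) -> str:
    """Return mock weather for a city.

    Args:
        city: City name, matched case-insensitively.
    """
    record = MOCK_WEATHER.get(city.strip().lower())
    if record is None:
        # Fall back to a plain message for cities outside the mock data.
        return f"No weather data available for {city}."
    return f"{city}: {record['condition']}, {record['temp_f']}°F"


def convert_temperature(value: float, from_unit: str) -> str:
    """Convert a temperature between Fahrenheit and Celsius.

    Args:
        value: Temperature to convert.
        from_unit: "F" to convert to Celsius, "C" to convert to Fahrenheit.
    """
    unit = from_unit.strip().upper()
    if unit == "F":
        return f"{(value - 32) * 5 / 9:.1f}°C"
    if unit == "C":
        return f"{value * 9 / 5 + 32:.1f}°F"
    raise ValueError(f"Unsupported unit: {from_unit!r}")
```

The docstrings matter here because, as noted above, the decorator relies on them to describe each tool to the LLM.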
Provider integration
agent_provider.py exposes a call_api function that promptfoo's Python provider calls to interact with the Strands agent.
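The shape of such a provider can be sketched as follows. promptfoo's Python provider calls call_api(prompt, options, context) and expects a dictionary with an output key (or an error key on failure). Everything else in this sketch is an assumption: fake_agent and make_call_api are hypothetical helpers that stand in for the real Strands agent so the code runs without the SDK, and the actual agent_provider.py may be organized differently.

```python
from typing import Any, Callable, Dict


def fake_agent(prompt: str) -> str:
    """Hypothetical stand-in for the Strands agent (no SDK or API key needed)."""
    return f"(mock reply) You asked: {prompt}"


def make_call_api(run_agent: Callable[[str], Any]):
    """Build a promptfoo-style call_api around any callable that answers a prompt."""

    def call_api(prompt: str, options: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        try:
            # promptfoo reads the model response from the "output" key.
            return {"output": str(run_agent(prompt))}
        except Exception as exc:
            # Returning "error" lets promptfoo record the failure for this test case.
            return {"error": f"Agent call failed: {exc}"}

    return call_api


# promptfoo looks for a module-level function named call_api.
call_api = make_call_api(fake_agent)
```

Wrapping the agent call in a try/except keeps one failing request from aborting the whole evaluation run; the failure is reported per test case instead.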
Test cases and assertion types
The promptfoo config includes 5 test cases that demonstrate different assertion types:
| Test | Description | Assertion types used |
|---|---|---|
| Weather query for New York | Basic tool usage | contains-any, llm-rubric, latency |
| Weather query for London | Verify temperature format | contains-any, javascript, latency |
| Weather query for Tokyo | Case-insensitive matching | icontains, javascript, latency |
| Weather with temperature conversion | Multi-tool chaining | llm-rubric, javascript, latency |
| Weather for unknown city | Graceful fallback handling | icontains, not-contains, latency |
Assertion types explained
- latency - Ensures responses complete within 30 seconds (applied to all tests via defaultTest)
- contains-any - Verifies the agent returns expected city names and weather data from the mock tool
- icontains - Case-insensitive matching to verify city names appear regardless of formatting
- not-contains - Ensures the agent handles unknown cities gracefully without error messages
- javascript - Validates temperature format (°F/°C symbols) and response length requirements
- llm-rubric - Semantically evaluates whether the agent correctly chains weather lookup with temperature conversion
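To make the deterministic checks above more concrete, the sketch below restates their logic in Python. It is not the example's configuration: the real assertions run inside promptfoo, and the javascript ones are written in JavaScript. The function names, the regular expression, and the 20-character minimum are all assumptions chosen for illustration.

```python
import re
from typing import Iterable

# Assumed pattern: a number followed by a degree sign and F or C, e.g. "72°F".
TEMPERATURE_PATTERN = re.compile(r"-?\d+(?:\.\d+)?\s*°\s*[FC]")


def contains_any(output: str, candidates: Iterable[str]) -> bool:
    """Pass if at least one candidate substring appears (case-sensitive)."""
    return any(candidate in output for candidate in candidates)


def icontains(output: str, expected: str) -> bool:
    """Pass if the expected substring appears, ignoring case."""
    return expected.lower() in output.lower()


def not_contains(output: str, forbidden: str) -> bool:
    """Pass if the forbidden substring is absent."""
    return forbidden not in output


def has_temperature_format(output: str) -> bool:
    """Pass if the output includes a temperature with a °F or °C symbol."""
    return TEMPERATURE_PATTERN.search(output) is not None


def meets_min_length(output: str, min_chars: int = 20) -> bool:
    """Pass if the output is at least min_chars long (threshold is an assumption)."""
    return len(output) >= min_chars
```

The latency and llm-rubric assertions have no equivalent here: the first depends on timing measured by promptfoo, and the second relies on a grading model rather than string matching.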