Prompt Chain

Test LLM Tool Use

Evaluation framework for testing LLM function and tool calling capabilities using promptfoo, enabling systematic assessment of model tool-use performance.

Works with GitHub

Spark score: 54 out of 100
Updated yesterday
Version: code-scan-action-0.1
Models: Claude 3.5 Sonnet

Why it matters

Verify that your large language models can reliably call external tools and functions. This asset tests how well they invoke and integrate with your existing software.

Outcomes

What it gets done

01 Automate the testing of LLM function/tool calling.

02 Verify LLM adherence to tool specifications.

03 Generate test cases for tool interactions.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/pfoo-eval-tool-use | bash

Capabilities

What this chain does

Extract

Pulls structured data fields from unstructured text.

Review code

Analyzes code for bugs, style issues, and improvements.

Write tests

Creates unit, integration, or end-to-end test cases.

Overview

Eval Tool Use

What it does

This evaluation workflow uses the promptfoo framework to test how well language models invoke functions and tools. It gives you a systematic way to assess function calling, measuring whether a model selects the right tool and calls it in response to a prompt.

How it connects

Use this when you need to validate LLM function calling behavior before deployment, benchmark different models' tool-use accuracy, or establish regression tests for tool integration. It fits scenarios where reliable function invocation is critical to your application's success.

Source README

eval-tool-use (Function and Tool Calling)

This example demonstrates how to evaluate LLM function/tool calling capabilities using promptfoo.

You can run this example with:

npx promptfoo@latest init --example eval-tool-use
cd eval-tool-use

Overview

This example shows how to configure and test function/tool calling capabilities across multiple LLM providers:

  • OpenAI (with native function calling)
  • Anthropic (with Claude's tool use)
  • AWS Bedrock models
  • Groq (with function calling)

Each provider has slightly different syntax and requirements for implementing function/tool calling.

Environment Variables

This example requires the following environment variables:

  • OPENAI_API_KEY - Your OpenAI API key
  • ANTHROPIC_API_KEY - Your Anthropic API key
  • AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY - For AWS Bedrock (if using the Bedrock example)
  • GROQ_API_KEY - Your Groq API key (if using Groq's Llama models)

You can set these in a .env file or directly in your environment.
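
Before running an evaluation, it can help to confirm that the keys for the providers you plan to use are actually set. The following is a minimal Python sketch of such a preflight check, not part of promptfoo. The grouping in REQUIRED_VARS and the helper name missing_vars are hypothetical; the variable names come from the list above. It reads only the process environment and does not parse a .env file.

```python
import os

# Variables named in the list above, grouped by the provider that needs them.
# The grouping keys ("openai", "bedrock", ...) are labels chosen for this sketch.
REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "bedrock": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "groq": ["GROQ_API_KEY"],
}


def missing_vars(providers, env=None):
    """Return the names of unset or empty variables for the given providers."""
    env = os.environ if env is None else env
    return [
        name
        for provider in providers
        for name in REQUIRED_VARS[provider]
        if not env.get(name)
    ]


if __name__ == "__main__":
    # The main config uses OpenAI, Anthropic, and Groq; Bedrock has its own config.
    missing = missing_vars(["openai", "anthropic", "groq"])
    print("missing:", ", ".join(missing) if missing else "none")
```

Passing env explicitly makes the helper easy to exercise without touching real credentials.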

Provider Documentation

Each provider implements tool use with different syntax; consult each provider's own documentation for the exact format.

Running the Example

The configuration for this example is in:

  • promptfooconfig.yaml - Main example with OpenAI, Anthropic, and Groq
  • promptfooconfig.bedrock.yaml - Example specifically for AWS Bedrock models

To run the main example:

promptfoo eval

To run the Bedrock example:

promptfoo eval -c promptfooconfig.bedrock.yaml

After running the evaluation, view the results with:

promptfoo view

Example Tool: Weather Function

This example uses a simple weather lookup function that takes a location and optionally a temperature unit. The example illustrates how different providers handle the same function definition with different syntaxes.

External tools can also be loaded from separate files, as demonstrated with external_tools.yaml.
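
To illustrate what "the same function definition with different syntaxes" can look like, here is a rough Python sketch of a weather tool in the OpenAI chat-completions layout, converted to the Anthropic layout. The tool name get_weather, the descriptions, and the unit enum values are assumptions made for illustration; the example's actual definitions live in its config files and may differ. The helper to_anthropic is hypothetical and not part of promptfoo.

```python
# JSON Schema for the arguments: a required location and an optional unit.
# Field descriptions and enum values are assumed for this sketch.
WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "location": {"type": "string", "description": "City name, e.g. Boston"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

# OpenAI chat-completions layout: the function is nested under "function"
# and the argument schema lives under "parameters".
openai_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": WEATHER_SCHEMA,
    },
}


def to_anthropic(tool):
    """Convert an OpenAI-style tool dict to the flat Anthropic layout."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Anthropic names the argument schema "input_schema".
        "input_schema": fn["parameters"],
    }


anthropic_tool = to_anthropic(openai_tool)
```

The argument schema itself is shared; only the wrapper around it changes between the two layouts.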

Anthropic Strict Mode

The Anthropic provider includes an example with strict: true enabled, which uses Anthropic's structured outputs feature to guarantee that tool parameters exactly match your schema. This is useful for:

  • Building reliable agentic workflows
  • Ensuring type-safe function calls
  • Production systems that require guaranteed schema conformance

When strict: true is enabled, Claude will always return tool inputs that strictly follow your input_schema, with no type mismatches or missing required fields. See the Anthropic structured outputs example for more details.
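
To make "no type mismatches or missing required fields" concrete, the sketch below shows the kind of conformance check that guarantee implies. It is an illustration in Python, not Anthropic's implementation and not part of promptfoo. The names type_ok, schema_violations, and WEATHER_INPUT_SCHEMA are hypothetical, and the checker handles only flat object schemas with primitive types and enum.

```python
def type_ok(value, type_name):
    """Check a value against a JSON Schema primitive type name."""
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    if type_name == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return True  # other types are outside this sketch and are not checked


def schema_violations(tool_input, schema):
    """Return a list of problems; an empty list means the input conforms."""
    problems = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in tool_input:
            problems.append(f"missing required field: {name}")
    for name, value in tool_input.items():
        spec = properties.get(name)
        if spec is None:
            problems.append(f"unexpected field: {name}")
        elif not type_ok(value, spec.get("type")):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"value not allowed for {name}: {value!r}")
    return problems


# Assumed schema for the weather tool: required location, optional unit.
WEATHER_INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

print(schema_violations({"unit": 5}, WEATHER_INPUT_SCHEMA))
```

Without strict mode, a check along these lines is something you might run yourself on each tool call; with strict mode, the source states that the model's output already satisfies it.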

Finish Reason Assertions

This example also demonstrates the use of finish-reason assertions to validate why a model stopped generating:

  • tool_calls: Verifies the model stopped to make a function/tool call (e.g., weather lookup for cities)

The example shows that when models are asked about weather in real cities (Boston, New York, Paris), they correctly stop generation to make tool calls, resulting in a tool_calls finish reason. This helps ensure your models are using tools appropriately when they should be.
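
Providers report the stop reason under different names (for instance, OpenAI's chat API uses a finish_reason of tool_calls, while Anthropic's Messages API uses a stop_reason of tool_use), so a cross-provider check needs some normalization. The Python sketch below shows one way that could look. The mapping table and the helper names are assumptions for illustration and are not promptfoo's actual code.

```python
# Raw stop values mapped to a common vocabulary. The left-hand values are the
# ones the provider APIs document; the mapping itself is this sketch's choice.
RAW_TO_NORMALIZED = {
    "tool_calls": "tool_calls",  # OpenAI-style finish_reason
    "tool_use": "tool_calls",    # Anthropic stop_reason
    "stop": "stop",
    "end_turn": "stop",
    "stop_sequence": "stop",
    "length": "length",
    "max_tokens": "length",
}


def stopped_for_tool(raw_reason):
    """True when the raw stop value means the model paused to call a tool."""
    return RAW_TO_NORMALIZED.get(raw_reason) == "tool_calls"


def cities_without_tool_call(results):
    """results maps city -> raw stop value; return cities that did not stop for a tool."""
    return [city for city, reason in results.items() if not stopped_for_tool(reason)]


# Hypothetical raw results for the three cities mentioned above.
sample = {"Boston": "tool_calls", "New York": "tool_use", "Paris": "end_turn"}
print(cities_without_tool_call(sample))
```

In this made-up sample, Paris would be flagged because the model answered directly instead of requesting the weather lookup.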
