Prompt Chains

Generate Office Layouts from Floorplans

Multi-step prompt workflow that uses GPT-5.5 to generate structured office layouts from empty floorplans, furniture catalogs, and spatial constraints.

Compare GPT Model Performance

A prompt workflow that benchmarks and compares different OpenAI GPT model tiers side by side to evaluate performance, cost, and output quality.

Compare LLM Reasoning Effort

Prompt chain workflow that benchmarks OpenAI GPT model performance across different reasoning effort levels to identify optimal cost-performance trade-offs.

Test OSWorld Multimodal Agent Integration

Runs OSWorld multimodal computer-use benchmark tasks through promptfoo, testing agents that observe Ubuntu desktop screenshots and act with mouse and keyboard.

Test Policy Plugin Code Generation

Test suite that evaluates the PolicyPlugin test generator to ensure it produces valid policy violation tests for AI red-teaming workflows.

Analyze and Summarize Model Performance

A prompt workflow example demonstrating how to run and test multiple Amazon Bedrock AI models using promptfoo's evaluation framework.

Compare GPT-4o and GPT-4o Mini Models

A prompt workflow example that runs side-by-side performance comparisons between OpenAI's GPT-4o and GPT-4o-mini models to evaluate output quality and cost.

Test LLM Temperature Settings

A prompt workflow example that compares GPT model outputs across different temperature settings to evaluate response variability and consistency.

Compare LLM Performance on Riddles

Prompt workflow comparing OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, and Google Gemini 3.1 Pro Preview on riddle-solving, covering cost, latency, and quality.

Compare LLM Performance

A prompt workflow example that benchmarks and compares Mistral and Llama language models side-by-side using promptfoo's evaluation framework.

Compare Open Source LLMs on Key Tasks

Prompt workflow that benchmarks DeepSeek, Mistral, Llama, and Qwen models on factual assertion tasks using OpenRouter to compare open-source LLM performance.

Generate JavaScript Test Cases

JavaScript-based test case configuration example for promptfoo, demonstrating how to define and run prompt evaluation tests programmatically in code.
