Prompt Chains

Build Reliable Agents with Memory and Compaction

OpenAI Agents SDK cookbook showing how to build evidence-review agents with compaction for long conversations and memory for reusable workflow lessons.

gpt-4o

Retrieve Knowledge with Function Calling

A fast-executing test fixture demonstrating function calling for knowledge retrieval using two local paper records and legacy tool-calling patterns from the...

gpt-4

Compare LLM Reasoning Effort

Prompt chain workflow that benchmarks OpenAI GPT model performance across different reasoning effort levels to identify optimal cost-performance trade-offs.

gpt-4o, github

Optimize GPT-5.4 for Document Understanding

Master GPT-5.4's multimodal features for document understanding by optimizing image detail, verbosity, reasoning, and tool usage for state-of-the-art results.

gpt-4o

Analyze and Summarize Model Performance

A prompt workflow example demonstrating how to run and test multiple Amazon Bedrock AI models using promptfoo's evaluation framework.

github, amazon bedrock

claude, gpt-4o

Compare LLM Performance

A prompt workflow example that benchmarks Claude and GPT models side-by-side using promptfoo's evaluation framework to compare outputs on identical prompts.

Compare GPT-4o and GPT-4o Mini Models

A prompt workflow example that runs side-by-side performance comparisons between OpenAI's GPT-4o and GPT-4o-mini models to evaluate output quality and cost...

gpt-4o, github

Compare LLM Performance on MMLU Benchmark

Prompt workflow that benchmarks GPT-5 against GPT-5 Mini using the MMLU (Massive Multitask Language Understanding) test suite to compare model performance across...

gpt-4o

gpt-4o, claude 3 5 sonnet, github

Compare LLM Performance on Riddles

Prompt workflow comparing OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, and Google Gemini 3.1 Pro Preview on riddle-solving with cost, latency, and quality...

gpt-4o, llama-3

Compare LLM Performance

A prompt workflow example that compares responses from Llama and GPT language models side-by-side using promptfoo's evaluation framework.

mistral large, llama-3, github

Compare LLM Performance

A prompt workflow example that benchmarks and compares Mistral and Llama language models side-by-side using promptfoo's evaluation framework.

deepseek v3, mistral large, github, openrouter

Compare Open Source LLMs on Key Tasks

Prompt workflow that benchmarks DeepSeek, Mistral, Llama, and Qwen models on factual assertion tasks using OpenRouter to compare open-source LLM performance.
