Multi-step prompt sequences for complex AI workflows.
Evaluates language model factuality using the TruthfulQA dataset to test whether models avoid generating false answers based on common misconceptions.
Enhance LLM output quality using meta prompting. Refine prompts for improved news summaries, categorization, and sentiment analysis.
Evaluate OpenAI agents with Langfuse for reliable production deployment. Monitor traces, debug, and improve performance with online/offline metrics.
Cookbook demonstrating the OpenAI Evals API for audio-based model evaluation using native audio inputs, model sampling, and audio grading without transcription.
OpenAI Evals API workflow for evaluating vision model responses to image prompts using sampling and LLM-as-a-Judge grading against reference answers.
Fine-tune GPT-4o with images and text for visual question answering, enhancing image understanding for tailored solutions.
Multi-tool orchestration workflow using the OpenAI Responses API to route queries between web search, external vector databases like Pinecone, and RAG retrieval
A prompt workflow that generates OpenAI embeddings for movie descriptions and uses the Milvus vector database with metadata filtering to find relevant films from
Query relevant contexts from Pinecone and pass them to a generative OpenAI model to generate an answer backed by real data sources.
Search movies using OpenAI embeddings and the Zilliz vector database with metadata filtering.
Demonstrates evaluating model factuality using the TruthfulQA dataset from HuggingFace. Questions are crafted to elicit common misconceptions.
Safeguard LLM inputs/outputs with Llama Guard. Classify content based on safety taxonomies.