Test OSWorld Multimodal Agent Integration
Runs OSWorld multimodal computer-use benchmark tasks through promptfoo, testing agents that observe Ubuntu desktop screenshots and act with mouse and keyboard.
Why it matters
Integrates the OSWorld multimodal benchmark with promptfoo, so you can evaluate AI agents that interact with a real Ubuntu desktop environment.
Outcomes
What it gets done
Wrap OSWorld's Inspect-native implementation for promptfoo.
Run OSWorld tasks through promptfoo's evaluation framework.
Assess agent performance against VM state and task-specific checks.
Benchmark multimodal agents in open-ended computer tasks.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/pfoo-integration-inspect-osworld | bash
Capabilities
What this chain does
Runs system commands and automates desktop tasks.
Creates unit, integration, or end-to-end test cases.
Traces errors to their root cause and suggests fixes.
Overview
Integration Inspect OSWorld
What it does
This prompt chain integrates the OSWorld multimodal computer-use benchmark into promptfoo by wrapping the Inspect-native implementation from inspect_evals/osworld. It runs benchmark tasks where AI agents observe an Ubuntu desktop through screenshots, perform actions using mouse and keyboard tools, and are evaluated against VM state using task-specific checks. The benchmark tests open-ended tasks in real computer environments as defined in the OSWorld research paper.
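One way to picture the wrapping step is a promptfoo Python provider that shells out to the Inspect CLI. The sketch below is illustrative only and is not taken from the example's source. It assumes the `inspect` CLI is installed and that the task is addressable as `inspect_evals/osworld`. The `model` and `limit` config keys and the `build_inspect_command` helper are hypothetical names chosen for this sketch.

```python
import subprocess


def build_inspect_command(model, limit=1, task="inspect_evals/osworld"):
    """Build the argv for a single `inspect eval` run (hypothetical helper)."""
    return ["inspect", "eval", task, "--model", model, "--limit", str(limit)]


def call_api(prompt, options, context):
    """promptfoo Python-provider entry point.

    The prompt is ignored here on the assumption that each OSWorld task
    carries its own instruction. `model` and `limit` are hypothetical
    config keys, not options defined by promptfoo or Inspect.
    """
    config = (options or {}).get("config") or {}
    model = config.get("model")
    if not model:
        return {"error": "Set config.model to an Inspect model name."}

    command = build_inspect_command(model, limit=config.get("limit", 1))
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return {"error": "The `inspect` CLI was not found on PATH."}

    if completed.returncode != 0:
        return {"error": completed.stderr.strip() or "inspect eval failed"}
    # Return the raw CLI output; a real wrapper would likely parse the eval log.
    return {"output": completed.stdout}
```

A file like this would typically be referenced from the promptfoo config as a `file://` provider. Returning the raw CLI output keeps the sketch small, at the cost of leaving score extraction to a later step.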
How it connects
Use this when you need to evaluate multimodal AI agents on their ability to perform real computer tasks in Ubuntu desktop environments. It's ideal for testing agents that must interpret visual desktop state and execute complex sequences of mouse and keyboard actions, providing standardized benchmark scores for computer-use capabilities.
Source README
This example runs a real OSWorld task through promptfoo by wrapping the Inspect-native implementation in inspect_evals/osworld. OSWorld is a multimodal computer-use benchmark where an agent observes an Ubuntu desktop via screenshots, acts with mouse and keyboard tools, and is graded by task-specific checks against VM state. The benchmark is described in OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments.
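Because grading happens inside the benchmark, the promptfoo side mostly needs to turn the reported score into a pass or fail. The following is a minimal sketch of a promptfoo Python assertion that does so. It assumes the wrapped run's output is a JSON string with a numeric `score` field, which is a hypothetical format and not something the example is known to produce. `PASS_THRESHOLD` is likewise an illustrative value.

```python
import json

# Hypothetical threshold: treat only a full score as a pass.
PASS_THRESHOLD = 1.0


def get_assert(output, context):
    """promptfoo Python assertion returning a grading-result dict."""
    try:
        payload = json.loads(output)
        score = float(payload["score"])
    except (TypeError, ValueError, KeyError):
        return {
            "pass": False,
            "score": 0.0,
            "reason": "Output did not contain a numeric score.",
        }

    passed = score >= PASS_THRESHOLD
    return {
        "pass": passed,
        "score": score,
        "reason": f"Reported score {score} vs threshold {PASS_THRESHOLD}.",
    }
```

Lowering the threshold would let partially credited tasks count as passes, if the benchmark reports fractional scores for them.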