Prompt Chain

Test OSWorld Multimodal Agent Integration

Runs OSWorld multimodal computer-use benchmark tasks through promptfoo to test agents that observe an Ubuntu desktop through screenshots and act with mouse and keyboard.

Works with GitHub

Spark score: 75 out of 100
Updated 17 days ago
Version 1.0.0

Why it matters

Integrates the OSWorld multimodal benchmark with promptfoo so you can evaluate AI agents that interact with a real Ubuntu desktop environment.

Outcomes

What it gets done

01. Wrap OSWorld's Inspect-native implementation for promptfoo.
02. Run OSWorld tasks through promptfoo's evaluation framework.
03. Assess agent performance against VM state and task-specific checks.
04. Benchmark multimodal agents in open-ended computer tasks.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/pfoo-integration-inspect-osworld | bash

Capabilities

What this chain does

Automate the OS

Runs system commands and automates desktop tasks.

Write tests

Creates unit, integration, or end-to-end test cases.

Debug

Traces errors to their root cause and suggests fixes.

Overview

Integration Inspect OSWorld

What it does

This prompt chain integrates the OSWorld multimodal computer-use benchmark into promptfoo by wrapping the Inspect-native implementation from inspect_evals/osworld. It runs benchmark tasks where AI agents observe an Ubuntu desktop through screenshots, perform actions using mouse and keyboard tools, and are evaluated against VM state using task-specific checks. The benchmark tests open-ended tasks in real computer environments as defined in the OSWorld research paper.
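As a rough illustration of what "wrapping" can look like, the sketch below is a minimal promptfoo Python provider that shells out to the Inspect CLI. It is not the code shipped with this asset. The helper name build_inspect_command, the config keys model and limit, and the placeholder model string are assumptions made for the example, and it assumes the inspect-ai and inspect-evals packages (plus whatever sandbox setup OSWorld needs) are already installed.

```python
import subprocess
from typing import Any


def build_inspect_command(
    model: str, task: str = "inspect_evals/osworld", limit: int = 1
) -> list[str]:
    """Build an `inspect eval` command line for a small OSWorld run.

    `limit` caps the number of samples so a smoke test stays cheap.
    """
    return ["inspect", "eval", task, "--model", model, "--limit", str(limit)]


def call_api(prompt: str, options: dict[str, Any], context: dict[str, Any]) -> dict[str, Any]:
    """promptfoo Python-provider entry point (hypothetical wrapper).

    The prompt is ignored here because the OSWorld task text comes from
    the benchmark dataset, not from promptfoo.
    """
    config = options.get("config", {}) or {}
    cmd = build_inspect_command(
        # Placeholder model name; replace with the model you want to test.
        model=config.get("model", "openai/gpt-4o"),
        limit=int(config.get("limit", 1)),
    )
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return {"error": "inspect CLI not found; install inspect-ai and inspect-evals"}
    if result.returncode != 0:
        return {"error": result.stderr.strip() or f"inspect exited with {result.returncode}"}
    # Hand the raw CLI output back to promptfoo for assertions to inspect.
    return {"output": result.stdout}
```

A promptfoo config would point at a file like this with a file:// provider path. Returning the raw CLI output keeps the sketch short; a fuller wrapper would more likely read the Inspect log and return the task score.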

How it connects

Use this when you need to evaluate multimodal AI agents on real computer tasks in an Ubuntu desktop environment. It suits agents that must interpret visual desktop state and execute multi-step sequences of mouse and keyboard actions, and it yields standardized benchmark scores for computer-use capability.
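To make the idea of a benchmark score concrete, here is a small sketch of how per-task results could be rolled up into one number. It assumes each task produces a score between 0 and 1 from its VM-state check. The TaskResult type, the mean_score function, and the task IDs are hypothetical names for this example, not part of promptfoo or inspect_evals.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one OSWorld task (hypothetical record type)."""

    task_id: str
    score: float  # assumed convention: 1.0 = checks passed, 0.0 = failed


def mean_score(results: list[TaskResult]) -> float:
    """Average the per-task scores; an empty run scores 0.0."""
    if not results:
        return 0.0
    return sum(r.score for r in results) / len(results)


if __name__ == "__main__":
    run = [
        TaskResult("task-a", 1.0),
        TaskResult("task-b", 0.0),
        TaskResult("task-c", 1.0),
    ]
    print(f"mean score: {mean_score(run):.2f}")
```

Averaging keeps any partial-credit scores intact; counting only fully passed tasks would be a stricter alternative.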

Source README

This example runs a real OSWorld task through promptfoo by wrapping the Inspect-native implementation in inspect_evals/osworld. OSWorld is a multimodal computer-use benchmark where an agent observes an Ubuntu desktop via screenshots, acts with mouse and keyboard tools, and is graded by task-specific checks against VM state. The benchmark is described in the paper "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments".
