Prompt Chain

Test OSWorld Multimodal Agent Integration

Name: Test OSWorld Multimodal Agent Integration
Availability: OnlineOnly
Author: Promptfoo

Runs OSWorld multimodal computer-use benchmark tasks through promptfoo, testing agents that observe an Ubuntu desktop through screenshots and act with mouse and keyboard.

Works with GitHub

Promptfoo

Updated 17 days ago

Version 1.0.0

Models

GPT-4o

Why it matters

Run the OSWorld multimodal benchmark through promptfoo to evaluate AI agents that interact with a real Ubuntu desktop environment.

Outcomes

What it gets done

Wrap OSWorld's Inspect-native implementation for promptfoo.

Run OSWorld tasks through promptfoo's evaluation framework.

Assess agent performance against VM state and task-specific checks.

Benchmark multimodal agents in open-ended computer tasks.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/pfoo-integration-inspect-osworld | bash

Capabilities

What this chain does

Automate the OS

Runs system commands and automates desktop tasks.

Write tests

Creates unit, integration, or end-to-end test cases.

Debug

Traces errors to their root cause and suggests fixes.

Overview

Integration Inspect OSWorld

What it does

This prompt chain integrates the OSWorld multimodal computer-use benchmark into promptfoo by wrapping the Inspect-native implementation from inspect_evals/osworld. It runs benchmark tasks where AI agents observe an Ubuntu desktop through screenshots, perform actions using mouse and keyboard tools, and are evaluated against VM state using task-specific checks. The benchmark tests open-ended tasks in real computer environments as defined in the OSWorld research paper.
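As a rough illustration of what "wrapping the Inspect-native implementation" can look like, here is a minimal sketch of a promptfoo Python provider that shells out to the Inspect CLI. It is not the asset's actual wrapper. It assumes the `inspect` command from the inspect_ai package is installed and that `inspect_evals/osworld` is available; the helper name `build_inspect_command`, the `model` and `limit` config keys, and the default values are hypothetical choices made for this example.

```python
import subprocess
from typing import Any


def build_inspect_command(model: str, limit: int = 1) -> list[str]:
    """Build the Inspect CLI invocation for the OSWorld task (hypothetical helper)."""
    return [
        "inspect", "eval", "inspect_evals/osworld",
        "--model", model,
        "--limit", str(limit),  # cap the number of samples; OSWorld runs are slow
    ]


def call_api(prompt: str, options: dict[str, Any], context: dict[str, Any]) -> dict[str, Any]:
    """promptfoo Python provider entry point: run Inspect and return its output."""
    # The config keys below are assumptions for this sketch, not documented options.
    config = (options or {}).get("config", {})
    model = config.get("model", "openai/gpt-4o")
    limit = int(config.get("limit", 1))

    try:
        proc = subprocess.run(
            build_inspect_command(model, limit),
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return {"error": "The 'inspect' CLI was not found on PATH."}

    if proc.returncode != 0:
        # Return only the tail of stderr so the promptfoo report stays readable.
        return {"error": proc.stderr[-2000:]}
    return {"output": proc.stdout}
```

Keeping the command construction in its own function makes the wrapper easy to check without starting a VM: the provider only runs Inspect when promptfoo calls `call_api`.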

How it connects

Use this when you need to evaluate multimodal AI agents on real computer tasks in an Ubuntu desktop environment. It suits agents that must interpret the visual state of the desktop and carry out multi-step sequences of mouse and keyboard actions, and it reports standardized benchmark scores for computer-use capability.

Source README

This example runs a real OSWorld task through promptfoo by wrapping the Inspect-native implementation in inspect_evals/osworld. OSWorld is a multimodal computer-use benchmark where an agent observes an Ubuntu desktop via screenshots, acts with mouse and keyboard tools, and is graded by task-specific checks against VM state. The benchmark is described in OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments.
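The grading step can be surfaced in promptfoo with a Python assertion. The sketch below is one possible shape, not the example's actual grader: it assumes the provider returns a JSON string containing a numeric `score` field between 0 and 1 (a hypothetical output format) and treats only a full score as a pass, which is also an assumption.

```python
import json
from typing import Any

# Assumption for this sketch: a task counts as passed only at full score.
PASS_THRESHOLD = 1.0


def get_assert(output: str, context: dict[str, Any]) -> dict[str, Any]:
    """promptfoo Python assertion: map an OSWorld task score to a grading result."""
    try:
        # Hypothetical provider output format: {"score": <float>}
        score = float(json.loads(output)["score"])
    except (ValueError, KeyError, TypeError):
        return {
            "pass": False,
            "score": 0.0,
            "reason": "Provider output did not contain a numeric 'score' field.",
        }

    return {
        "pass": score >= PASS_THRESHOLD,
        "score": score,
        "reason": f"OSWorld task score was {score:.2f}.",
    }
```

Returning the raw score alongside the pass flag lets promptfoo show partial credit in its report even when the task is marked as failed.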