Open-World Evaluations: A New Approach to Measuring Frontier AI

ai-technology · 2026-05-22

A recent paper on arXiv argues that open-world evaluations should complement conventional benchmarks when gauging frontier AI capabilities. The authors contend that benchmarks can overstate or understate deployed capability because they favor tasks that are easy to specify, automatically scored, simple to optimize for, and inexpensive. Open-world evaluations instead focus on long-horizon, messy, real-world tasks and are assessed through small-sample qualitative analysis. The paper surveys recent examples, discusses their strengths and limitations, and introduces CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. In the first CRUX evaluation, an AI agent developed and published a basic iOS app on the Apple App Store with only one avoidable manual intervention.

Key facts

arXiv paper 2605.20520 proposes open-world evaluations for frontier AI.
Benchmarks can overstate or understate deployed capability.
Open-world evaluations are long-horizon, messy, real-world tasks.
Assessment relies on small-sample qualitative analysis rather than automated scoring.
CRUX project will conduct open-world evaluations regularly.
First CRUX instance: AI agent develops and publishes an iOS app.
Agent completed task with only one avoidable manual intervention.
Paper surveys recent open-world evaluations and their strengths/limitations.
