SciCrafter Benchmark Reveals AI Agents Plateau at 26% in Minecraft Discovery-to-Application Tasks

ai-technology · 2026-04-29

A new benchmark called SciCrafter, built in Minecraft, evaluates AI agents on the discovery-to-application loop by having them build redstone circuits that activate lamps in specified patterns. The study tests frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5 within a general-purpose code agent framework and finds that all of them plateau at a success rate of about 26%. The benchmark targets the gap between scientific discovery and practical engineering through parameterized tasks: scaling a task's target parameters increases construction difficulty and the knowledge required, so agents must genuinely discover how the circuits behave rather than rely on memorized answers. To analyze failures, the study decomposes the loop into four capacities, with a focus on knowledge gap identification. The paper is available on arXiv under identifier 2604.24697.

Key facts

SciCrafter is a Minecraft-based benchmark that operationalizes the discovery-to-application loop
Agents must ignite lamps in specified patterns using redstone circuits
Scaling target parameters increases construction complexity and required knowledge
Frontier models evaluated include GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5
All models plateau at approximately 26% success rate
The research decomposes the loop into four capacities
Published on arXiv under identifier 2604.24697
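
To make the idea of a parameterized task and a success rate concrete, here is a minimal Python sketch. It is not SciCrafter's actual code or API: the `LampTask` class, its `num_lamps` and `delay_ticks` parameters, and the exact-match check are all hypothetical, chosen only to illustrate how scaling a parameter changes the target pattern an agent has to reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LampTask:
    """Hypothetical task: light `num_lamps` lamps, each `delay_ticks` after the previous one."""
    num_lamps: int
    delay_ticks: int

    def target_pattern(self) -> list[int]:
        # Tick at which each lamp should turn on (lamp 0 at tick 0).
        return [i * self.delay_ticks for i in range(self.num_lamps)]


def is_success(task: LampTask, observed_on_ticks: list[int]) -> bool:
    # Assumed criterion: pass only if every lamp lit at exactly its target tick.
    return observed_on_ticks == task.target_pattern()


def success_rate(results: list[bool]) -> float:
    # Fraction of attempts that passed; 0.0 for an empty run.
    return sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    easy = LampTask(num_lamps=2, delay_ticks=1)
    hard = LampTask(num_lamps=6, delay_ticks=4)
    print(easy.target_pattern())
    print(hard.target_pattern())
    attempts = [
        is_success(easy, [0, 1]),                 # matches the target
        is_success(hard, [0, 4, 8, 12, 16, 21]),  # last lamp is one tick late
    ]
    print(success_rate(attempts))
```

In this toy setup, raising `num_lamps` or `delay_ticks` lengthens the timing pattern a circuit must produce, which loosely mirrors how the benchmark is described as scaling difficulty with its target parameters.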
