SciCrafter Benchmark Reveals AI Agents Plateau at 26% in Minecraft Discovery-to-Application Tasks
A new benchmark called SciCrafter, built in Minecraft, assesses AI agents on the discovery-to-application loop by having them build redstone circuits that light lamps in specified patterns. The study evaluates frontier models such as GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5 within a general-purpose code agent framework and finds that all of them plateau at a success rate of about 26%. The benchmark highlights the divide between scientific discovery and practical engineering through parameterized tasks: scaling the target parameters increases construction complexity and the knowledge required, so agents must make genuine discoveries instead of relying on memorized answers. The study decomposes the loop into four capacities, with a focus on knowledge gap identification, to analyze where agents fail. The paper is available on arXiv under the identifier 2604.24697.
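To make the idea of a parameterized task and a success rate concrete, here is a minimal sketch in Python. It is an illustration only, not the benchmark's actual task format, API, or scoring code: the `LampTask` class, the alternating target pattern, and the helper names are all hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LampTask:
    """Hypothetical task spec: a row of lamps with a target on/off pattern."""
    target: Tuple[bool, ...]


def make_task(n_lamps: int) -> LampTask:
    """Build an alternating on/off target (an assumed pattern).

    A larger n_lamps stands in for a scaled-up target parameter:
    more lamps means a larger circuit to construct.
    """
    if n_lamps < 1:
        raise ValueError("n_lamps must be at least 1")
    return LampTask(target=tuple(i % 2 == 0 for i in range(n_lamps)))


def is_success(task: LampTask, observed: List[bool]) -> bool:
    """A run counts as a success only if every lamp matches its target state."""
    return tuple(observed) == task.target


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of successful runs; 0.0 when there are no runs."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Under these assumptions, `make_task(3)` yields the target `(True, False, True)`, and a set of runs where one in four succeeds gives `success_rate([True, False, False, False]) == 0.25`, close to the roughly 26% plateau reported for the evaluated models.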
Key facts
- SciCrafter is a Minecraft-based benchmark for the discovery-to-application loop
- Agents must light lamps in specified patterns using redstone circuits
- Scaling target parameters increases construction complexity and required knowledge
- Frontier models evaluated include GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5
- All models plateau at approximately 26% success rate
- The benchmark operationalizes the discovery-to-application loop
- The research decomposes the loop into four capacities
- Published on arXiv under identifier 2604.24697
Entities
Institutions
- arXiv