ARC-AGI-3 Benchmark Challenges AI with Abstract Interactive Tasks
The newly launched ARC-AGI-3 benchmark aims to assess agentic intelligence in advanced AI systems. In contrast to its predecessors, ARC-AGI-1 and ARC-AGI-2, it emphasizes fluid, adaptive efficiency through novel, abstract, turn-based environments. Agents must explore, infer objectives, build internal models of environment dynamics, and plan effective action sequences without any explicit instructions; the benchmark avoids language and external knowledge, relying instead on Core Knowledge priors. Environments are difficulty-calibrated through extensive testing with human participants. As of March 2026, humans can solve all environments, whereas frontier AI systems score below 1%. The accompanying paper describes the benchmark's design, an efficiency-focused scoring framework grounded in human action counts, and the procedures for constructing, validating, and calibrating the environments. The goal is to advance the study of agentic intelligence by providing a challenging, human-calibrated test that exposes the limits of current AI in adaptive problem-solving.
Key facts
- ARC-AGI-3 is an interactive benchmark for studying agentic intelligence
- It uses novel, abstract, turn-based environments without explicit instructions
- Agents must explore, infer goals, build internal models, and plan action sequences
- The benchmark avoids language and external knowledge, focusing on Core Knowledge priors
- Environments are difficulty-calibrated through extensive human testing
- As of March 2026, humans solve 100% of environments, while frontier AI systems score below 1%
- The paper presents the benchmark design and an efficiency-based scoring framework
- Methodology includes construction, validation, and calibration of environments
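The source does not give the exact scoring formula, only that scoring is efficiency-based and anchored to human actions. As a rough, hypothetical illustration of one way such a scheme could work, the sketch below scores each solved environment by comparing an agent's action count to a human baseline, capping the score at 1.0, and averages across environments; all function names and the formula itself are assumptions, not the paper's actual method.

```python
def efficiency_score(human_actions: int, agent_actions: int, solved: bool) -> float:
    """Hypothetical per-environment score: ratio of the human baseline
    action count to the agent's action count, capped at 1.0.
    Unsolved environments score 0. This formula is an illustrative
    assumption, not taken from the ARC-AGI-3 paper."""
    if not solved or agent_actions <= 0:
        return 0.0
    return min(1.0, human_actions / agent_actions)

def benchmark_score(results: list[tuple[int, int, bool]]) -> float:
    """Average the per-environment scores.
    Each result is (human_actions, agent_actions, solved)."""
    if not results:
        return 0.0
    return sum(efficiency_score(h, a, s) for h, a, s in results) / len(results)
```

Under this toy scheme, an agent that solves an environment using twice as many actions as the human baseline scores 0.5 there, and matching or beating the baseline scores 1.0; an agent that solves nothing scores 0, consistent with frontier systems landing below 1% overall.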
Entities
Institutions
- arXiv