RoadmapBench: New Benchmark for Long-Horizon Coding Agents
RoadmapBench is a benchmark for assessing AI coding agents on complex, multi-target software development carried out over long horizons. In contrast to existing benchmarks, which concentrate on fixing individual bugs in Python repositories, RoadmapBench comprises 115 tasks derived from real open-source version upgrades across 17 repositories and five programming languages. Each task requires an agent to implement the features of a specified version, with a median modification of 3,700 lines across 51 files. Thirteen frontier models were evaluated on the benchmark, with Claude-Opus-4.7 achieving the strongest performance. The benchmark addresses a gap in evaluating agents on extensive, real-world development work.
Key facts
- RoadmapBench includes 115 long-horizon coding tasks.
- Tasks are based on real open-source version upgrades.
- Covers 17 repositories and 5 programming languages.
- Median modification is 3,700 lines across 51 files.
- Evaluated on thirteen frontier models.
- Claude-Opus-4.7 achieved the strongest performance.
- Existing benchmarks focus on single-issue bug fixes from Python repositories.
- RoadmapBench addresses the gap in evaluating long-horizon development.