EvoCode-Bench: New Benchmark Tests Coding Agents on Multi-Turn Tasks
Researchers have released EvoCode-Bench, a benchmark that evaluates coding agents over multi-turn, iterative interactions. Conventional benchmarks give an agent a single specification and assess only the final result; EvoCode-Bench instead tests whether an agent can keep a codebase working as requirements change. It comprises 26 stateful coding tasks and 227 evaluated rounds, and each task preserves the agent's workspace across 5-15 rounds. Requirements are stated in terms of observable behavior, and cumulative executable tests check both new and previously introduced requirements.
The study evaluated 13 coding agents on two metrics: MT@4, a score based on four attempts, and SR, a score measured from a previously completed reference state. SR exceeds MT@4 by 22-40 points for most agents, and the two metrics rank the agents differently. The agent with the highest SR (78.9) ranks only third in persistent execution (44.0 MT@4), a gap of 34.9 points, and even the top agents reach only about 50% success on multi-turn tasks.
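To make the idea of cumulative tests over a persistent workspace concrete, here is a minimal Python sketch. It is not EvoCode-Bench's actual harness: the names `Round`, `evaluate_task`, and `toy_agent` are hypothetical, the workspace is reduced to a dictionary of feature flags, and the pass/fail rule per round is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for an agent's persistent codebase: feature name -> present?
Workspace = Dict[str, bool]
Test = Callable[[Workspace], bool]


@dataclass
class Round:
    name: str
    new_tests: List[Test]  # tests introduced by this round's requirement


def evaluate_task(rounds: List[Round],
                  agent_step: Callable[[Workspace, Round], None]) -> List[bool]:
    """Run every round against one shared workspace.

    Assumption: a round passes only if all tests introduced so far pass,
    so a regression on an earlier requirement fails the current round.
    """
    workspace: Workspace = {}      # persists across rounds, never reset
    cumulative: List[Test] = []    # grows by each round's new tests
    results: List[bool] = []
    for rnd in rounds:
        agent_step(workspace, rnd)             # agent edits the persisted state
        cumulative.extend(rnd.new_tests)
        results.append(all(test(workspace) for test in cumulative))
    return results


def toy_agent(workspace: Workspace, rnd: Round) -> None:
    # Hypothetical agent: adds the requested feature, but in round "r2"
    # it accidentally removes the feature it added in "r1".
    workspace[rnd.name] = True
    if rnd.name == "r2":
        workspace.pop("r1", None)


# Each round requires that its own feature be present in the workspace.
rounds = [
    Round(name, [lambda ws, n=name: ws.get(n, False)])
    for name in ("r1", "r2", "r3")
]
results = evaluate_task(rounds, toy_agent)
print(results)
```

In this toy run the regression introduced in the second round also fails the third round, because the damaged workspace carries forward. A score measured from a clean reference state would not see that carried-over damage, which is one plausible reading of why the two metrics in the study diverge.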
Key facts
- EvoCode-Bench is a benchmark for coding agents in multi-turn iterative interactions.
- It includes 26 stateful coding tasks and 227 evaluated rounds.
- Each task preserves the agent's workspace for 5-15 rounds.
- Requirements are stated through observable behavior.
- Cumulative executable tests check new and prior requirements.
- 13 coding agents were evaluated using MT@4 and SR metrics.
- SR exceeds MT@4 by 22-40 points for most agents.
- The highest-SR agent (78.9) ranks third in persistent execution (44.0 MT@4).
Entities
Institutions
- arXiv