ProgramBench: Benchmarking AI for Holistic Software Development
A new benchmark called ProgramBench has been launched by researchers to assess how well language model-based software engineering agents can build complete programs from the ground up. Unlike traditional benchmarks that target narrow tasks such as fixing a single bug or implementing an individual feature, ProgramBench evaluates the full process of architecting and implementing a program from scratch. Agents receive only a reference executable and its documentation and must design and build a codebase whose behavior matches that of the executable. End-to-end behavioral tests are produced through agent-driven fuzzing, allowing implementations to be evaluated without prescribing their internal structure. The benchmark comprises 200 tasks, ranging from compact CLI tools to widely used software such as FFmpeg, SQLite, and the Linux kernel. The initiative responds to the growing reliance on language models to develop and maintain codebases over the long term with minimal human intervention.
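The fuzzing-based evaluation can be pictured as a differential test harness: the same inputs are fed to the reference executable and the agent's build, and any divergence in observable behavior counts as a failure. The sketch below is a minimal illustration of that idea under stated assumptions, not ProgramBench's actual harness; the binary paths, the blind random input generator, and the choice to compare exit code, stdout, and stderr are all placeholders.

```python
import random
import string
import subprocess

def random_cli_input(rng: random.Random) -> str:
    """Generate a random stdin payload. A real agent-driven fuzzer would
    craft inputs that probe documented behavior rather than sample blindly."""
    length = rng.randint(0, 64)
    return "".join(rng.choice(string.printable) for _ in range(length))

def behavior(binary: str, stdin_data: str) -> tuple[int, bytes, bytes]:
    """Run a binary on the given stdin and capture its observable behavior:
    exit code, stdout, and stderr."""
    proc = subprocess.run(
        [binary],
        input=stdin_data.encode(),
        capture_output=True,
        timeout=10,  # guard against hangs; a real harness would handle timeouts
    )
    return proc.returncode, proc.stdout, proc.stderr

def differential_fuzz(reference: str, candidate: str, trials: int = 1000) -> list[str]:
    """Compare a candidate build against the reference executable on fuzzed
    inputs; each diverging input becomes a failing end-to-end behavioral test."""
    rng = random.Random(0)  # fixed seed so failures are reproducible
    failures = []
    for _ in range(trials):
        stdin_data = random_cli_input(rng)
        if behavior(reference, stdin_data) != behavior(candidate, stdin_data):
            failures.append(stdin_data)
    return failures

if __name__ == "__main__":
    # "./reference" and "./candidate" are hypothetical paths for illustration.
    diverging = differential_fuzz("./reference", "./candidate")
    print(f"{len(diverging)} diverging inputs found")
```

Because the harness observes only external behavior, the agent is free to choose any architecture, file layout, or language feature set, which is what lets the benchmark evaluate design decisions rather than conformance to a fixed skeleton.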
Key facts
- ProgramBench measures the ability of software engineering agents to develop software holistically.
- Agents must architect and implement a codebase matching a reference executable's behavior.
- End-to-end behavioral tests are generated via agent-driven fuzzing.
- The benchmark includes 200 tasks, from compact CLI tools to FFmpeg, SQLite, and the Linux kernel.
- Existing benchmarks focus on narrower tasks, such as fixing a single bug.
- Language models are increasingly used to seed and maintain codebases autonomously.