FormulaCode Benchmark Tests LLM Agents on Real-World Code Optimization
Researchers introduced FormulaCode, a benchmark for evaluating large language model (LLM) coding agents on repository-scale optimization tasks. The benchmark includes 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and an average of 264.6 community-maintained performance workloads per task. FormulaCode uses fine-grained, multi-objective metrics to assess holistic optimization behavior under realistic constraints, addressing limitations of existing benchmarks that rely on synthetic tasks or binary correctness signals. Initial evaluations indicate that repository-scale, multi-objective optimization remains challenging for current LLM agents.
Key facts
- FormulaCode is a benchmark for evaluating LLM coding agents on large, real-world codebases.
- It comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub.
- Each bottleneck is paired with expert-authored patches.
- There are on average 264.6 community-maintained performance workloads per task.
- The benchmark uses fine-grained, multi-objective performance metrics.
- It addresses limitations of existing code benchmarks that rely on synthetic tasks or binary correctness signals.
- FormulaCode evaluates holistic optimization behavior under realistic constraints.
- Initial evaluations reveal challenges in repository-scale, multi-objective optimization.
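To illustrate why fine-grained, multi-workload measurement can say more than a single pass/fail signal, here is a minimal Python sketch. It is not FormulaCode's actual metric: the function name `summarize_speedups`, the 5% tolerance, the workload names, and the timings are all hypothetical, chosen only to show how one aggregate number can hide a regression on an individual workload.

```python
import math


def summarize_speedups(baseline, candidate, tolerance=0.05):
    """Compare per-workload runtimes (seconds) before and after a patch.

    Hypothetical helper for illustration only; not part of FormulaCode.
    `baseline` and `candidate` map workload names to positive runtimes.
    """
    if not baseline or baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same non-empty set of workloads")

    # Speedup > 1 means the patched code is faster on that workload.
    ratios = {name: baseline[name] / candidate[name] for name in baseline}

    # The geometric mean is a common way to average ratios.
    geo_mean = math.exp(sum(math.log(r) for r in ratios.values()) / len(ratios))

    improved = sum(1 for r in ratios.values() if r > 1 + tolerance)
    regressed = sum(1 for r in ratios.values() if r < 1 - tolerance)
    return {
        "geo_mean_speedup": geo_mean,
        "improved": improved,
        "regressed": regressed,
        "unchanged": len(ratios) - improved - regressed,
    }


# Made-up timings: one workload gets 2x faster, one is flat, one gets 2x slower.
baseline = {"workload_a": 2.0, "workload_b": 1.0, "workload_c": 4.0}
candidate = {"workload_a": 1.0, "workload_b": 1.0, "workload_c": 8.0}
summary = summarize_speedups(baseline, candidate)
print(summary)
```

In this made-up case the geometric-mean speedup is about 1.0, which looks neutral, while the per-workload counts show one improvement and one regression. With hundreds of workloads per task, reporting several such quantities together gives a fuller picture than a single binary outcome.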
Entities
Institutions
- arXiv
- GitHub