CodeGolf Bench Tests LLMs on Concise Code in 60 Languages
Researchers have introduced CodeGolf Bench, a benchmark that evaluates how well large language models generate concise code across 60 programming languages. It is based on the code golf competition format, which rewards the solution with the fewest characters or bytes, so it measures how compactly a model can solve a problem, not only whether it solves it. Unlike fixed-set benchmarks, CodeGolf Bench uses the code.golf platform for dynamic problems and live human baselines. The researchers tested nine LLMs on Python and C++ tasks; reasoning models outperformed non-reasoning ones, with a best average percentile of 70.97%. The gap was larger in C++, which suggests that reasoning matters more in syntax-strict languages. Non-reasoning models struggled with optimization in both languages.
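To make the scoring idea concrete, here is a minimal sketch of how a solution's length might be ranked against human submissions. It is an illustration under stated assumptions, not the benchmark's actual formula: the function names `golf_length` and `percentile`, the sample lengths, and the choice to count only strictly longer human solutions are all hypothetical.

```python
from bisect import bisect_right


def golf_length(source: str, unit: str = "bytes") -> int:
    """Return the solution length in UTF-8 bytes or in characters."""
    if unit == "bytes":
        return len(source.encode("utf-8"))
    return len(source)


def percentile(model_length: int, human_lengths: list[int]) -> float:
    """Percent of human solutions strictly longer than the model's.

    Assumption: this is one plausible percentile definition; the
    benchmark may handle ties or failed solutions differently.
    """
    if not human_lengths:
        raise ValueError("need at least one human baseline")
    ordered = sorted(human_lengths)
    # bisect_right gives the count of human lengths <= model_length,
    # so the remainder are strictly longer.
    longer = len(ordered) - bisect_right(ordered, model_length)
    return 100.0 * longer / len(ordered)


# Hypothetical data: four human solution lengths for one problem.
human_baseline = [20, 25, 30, 40]
model_solution = "print(sum(range(101)))"
score = percentile(golf_length(model_solution), human_baseline)
```

Under this definition a shorter solution never scores lower, and averaging the per-problem scores would give a figure comparable in spirit to the average percentile reported above.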
Key facts
- CodeGolf Bench is a new benchmark for LLM concise code generation.
- It covers 60 programming languages.
- It is based on code golf, a competition format that rewards the shortest solution in characters or bytes.
- Uses code.golf platform for dynamic problems and human baselines.
- Evaluated nine LLMs on Python and C++.
- Reasoning models achieved the best average percentile, 70.97%.
- Performance gap more pronounced in C++.
- Non-reasoning models struggled with efficiency optimization.