LLMs' Math Reasoning Robustness Tested with Code Execution
A new study systematically evaluates how Large Language Models (LLMs) handle variations in math problems, comparing pure reasoning, single-shot code execution, and iterative code execution. Using 1,000 problems from the GSM-Symbolic dataset, researchers tested Claude Haiku 4.5 on paired original and modified problems. Chain-of-thought (CoT) prompting proved most robust, with an accuracy drop of only 1.3 percentage points and 1.8% of problems breaking. The study, published on arXiv (2605.26414), challenges the assumption that code execution methods improve robustness against simple changes like different names or numbers.
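The two robustness figures above can be computed from paired results roughly as follows. This is a minimal sketch, not the study's code: `PairedResult` and `robustness_metrics` are hypothetical names. It assumes a problem counts as "broken" when the model answers the original correctly but the modified version incorrectly, with the rate taken over all pairs. The source does not spell out the study's exact definition.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    """Outcome for one original/modified problem pair (hypothetical layout)."""
    original_correct: bool
    modified_correct: bool


def robustness_metrics(results: list[PairedResult]) -> dict[str, float]:
    """Return the accuracy drop (percentage points) and break rate (percent)."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    correct_original = sum(r.original_correct for r in results)
    correct_modified = sum(r.modified_correct for r in results)
    # Assumption: "broken" = solved on the original, failed on the variant.
    broken = sum(r.original_correct and not r.modified_correct for r in results)
    return {
        "accuracy_drop_pp": 100 * (correct_original - correct_modified) / n,
        "break_rate_pct": 100 * broken / n,
    }


# Toy data: 8 pairs solved both times, 1 broken, 1 solved only after modification.
toy = (
    [PairedResult(True, True)] * 8
    + [PairedResult(True, False)]
    + [PairedResult(False, True)]
)
metrics = robustness_metrics(toy)
print(metrics)
```

In this toy data one problem breaks and another is solved only in its modified form, so the break rate is 10% while the net accuracy drop is 0 points. Under the assumed definition, this kind of offsetting is one way a break rate (1.8%) can exceed a net accuracy drop (1.3 points).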
Key facts
- Study evaluates three approaches on 1,000 problems from GSM-Symbolic dataset
- Approaches: chain-of-thought (CoT, pure reasoning), Program-Aided Language models (PAL, single-shot code execution), Step-by-Step Coding (SBSC, iterative code execution)
- All three approaches were tested with Claude Haiku 4.5
- CoT was most robust with accuracy drop of 1.3 percentage points
- 1.8% of problems broke under CoT
- Contrary to expectations, code execution methods did not improve robustness
- Published on arXiv with ID 2605.26414
- Problems modified with simple changes like different names or numbers
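Modifications of the kind listed above, swapping names or numbers, can be illustrated with a small template. The template text, name list, and number ranges below are invented for illustration and are not taken from GSM-Symbolic. The sketch only shows how a variant can keep the same reasoning structure while its surface details change.

```python
import random

# Hypothetical template: the placeholders are the only parts that vary.
TEMPLATE = (
    "{name} has {a} apples and buys {b} more. "
    "How many apples does {name} have now?"
)
NAMES = ["Ava", "Liam", "Noor"]  # illustrative names, not from the dataset


def make_variant(rng: random.Random) -> dict:
    """Fill the template with a random name and numbers; keep the ground truth."""
    name = rng.choice(NAMES)
    a = rng.randint(2, 20)
    b = rng.randint(2, 20)
    return {
        "question": TEMPLATE.format(name=name, a=a, b=b),
        "answer": a + b,  # same underlying computation for every variant
        "name": name,
        "a": a,
        "b": b,
    }


rng = random.Random(0)  # fixed seed so the variants are reproducible
original = make_variant(rng)
modified = make_variant(rng)
print(original["question"], "->", original["answer"])
print(modified["question"], "->", modified["answer"])
```

A method that is robust to such changes should solve both versions, since only the name and the operands differ.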
Entities
Institutions
- arXiv