LLMs' Math Reasoning Robustness Tested with Code Execution

ai-technology · 2026-05-27

A new study systematically evaluates how large language models (LLMs) handle variations in math problems, comparing three approaches: pure reasoning via chain-of-thought (CoT) prompting, single-shot code execution, and iterative code execution. Using 1,000 problems from the GSM-Symbolic dataset, researchers tested Claude Haiku 4.5 on paired original and modified versions of each problem. CoT proved the most robust: its accuracy dropped by only 1.3 percentage points between original and modified problems, and 1.8% of problems broke. The study, published on arXiv (2605.26414), challenges the assumption that code execution makes models more robust to simple changes such as different names or numbers.
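The two figures quoted above measure different things, and a short sketch may make the distinction concrete. The code below assumes that a problem "breaks" when the model answers the original correctly but the modified version incorrectly, and that the break rate is taken over all pairs; the study may define either differently. The `PairedResult` type, the `robustness_metrics` function, and the toy data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedResult:
    """Outcome for one problem pair (hypothetical record type)."""
    original_correct: bool
    modified_correct: bool


def robustness_metrics(results):
    """Return the accuracy drop (percentage points) and break rate (percent).

    Assumption: a pair counts as "broken" when the original is answered
    correctly and the modified version is not.
    """
    n = len(results)
    if n == 0:
        raise ValueError("need at least one paired result")
    original_hits = sum(r.original_correct for r in results)
    modified_hits = sum(r.modified_correct for r in results)
    broken = sum(r.original_correct and not r.modified_correct for r in results)
    return {
        "accuracy_drop_pp": (original_hits - modified_hits) * 100 / n,
        "break_rate_pct": broken * 100 / n,
    }


# Toy data, not study results: one pair breaks, one pair is "fixed".
toy = (
    [PairedResult(True, True)] * 8
    + [PairedResult(True, False), PairedResult(False, True)]
)
print(robustness_metrics(toy))
```

Under these assumptions the break rate can exceed the accuracy drop, because a problem that is wrong in its original form but right after modification offsets a broken one in the aggregate accuracy. That would be consistent with a 1.3-point drop alongside a 1.8% break rate, though the paper's own definitions take precedence.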

Key facts

Study evaluates three approaches on 1,000 problems from GSM-Symbolic dataset
Approaches: chain-of-thought (CoT), Program-Aided Language models (PAL), Step-by-Step Coding (SBSC)
All approaches tested with a single model, Claude Haiku 4.5
CoT was most robust with accuracy drop of 1.3 percentage points
1.8% of problems broke under CoT
Contrary to expectations, code execution methods did not improve robustness
Published on arXiv with ID 2605.26414
Problems modified with simple changes like different names or numbers
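To illustrate what a names-and-numbers modification and a single-shot code-execution answer might look like, here is a minimal sketch. The template, the name list, and the "generated" program are all invented for illustration; they are not drawn from GSM-Symbolic or from the study, and the program string is a hard-coded stand-in for what a model could emit in a PAL-style setup, where the final answer comes from running the code rather than from the model's own arithmetic.

```python
import random

# Hypothetical problem template; real GSM-Symbolic templates differ.
TEMPLATE = (
    "{name} has {a} apples and buys {b} bags holding {c} apples each. "
    "How many apples does {name} have now?"
)
NAMES = ("Sophie", "Liam", "Priya")


def make_variant(rng):
    """Fill the template with a random name and numbers; return the ground truth too."""
    values = {
        "name": rng.choice(NAMES),
        "a": rng.randint(2, 20),
        "b": rng.randint(2, 9),
        "c": rng.randint(2, 12),
    }
    question = TEMPLATE.format(**values)
    answer = values["a"] + values["b"] * values["c"]
    return question, values, answer


def mock_generated_program(values):
    """Stand-in for model output: a program that stores its result in `answer`."""
    return (
        f"initial_apples = {values['a']}\n"
        f"bags = {values['b']}\n"
        f"apples_per_bag = {values['c']}\n"
        "answer = initial_apples + bags * apples_per_bag\n"
    )


def run_program(source):
    """Execute the program text and read back `answer` (trusted input only)."""
    namespace = {}
    exec(source, namespace)
    return namespace["answer"]


rng = random.Random(0)
for _ in range(2):
    question, values, expected = make_variant(rng)
    predicted = run_program(mock_generated_program(values))
    print(question, "|", predicted, "| correct:", predicted == expected)
```

In this toy setup the executed program is always right because it is written from the ground-truth values. In a real evaluation the model must read the numbers out of the question text itself, which is where a changed name or number could still cause a failure.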
