ARTFEED — Contemporary Art Intelligence

LLMs' Math Reasoning Robustness Tested with Code Execution

ai-technology · 2026-05-27

A new study systematically evaluates how robust Large Language Models (LLMs) are to variations in math problems, comparing three approaches: pure reasoning, single-shot code execution, and iterative code execution. Using 1,000 problems from the GSM-Symbolic dataset, the researchers tested Claude Haiku 4.5 on paired original and modified versions of each problem. Chain-of-thought (CoT) prompting proved the most robust: accuracy dropped by only 1.3 percentage points, and 1.8% of problems broke under modification. The study, published on arXiv (2605.26414), challenges the assumption that code execution makes models more robust to simple changes such as different names or numbers.
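The two robustness figures quoted above measure different things, and a small sketch may help keep them apart. The Python below is illustrative only and is not the study's code: the `PairedResult` type, the `robustness_metrics` function, and the toy data are hypothetical, and it assumes that a problem "breaks" when the original version is answered correctly but the modified version is not (the summary does not define the term).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedResult:
    """Outcome for one problem pair (hypothetical structure)."""
    original_correct: bool
    modified_correct: bool


def robustness_metrics(results):
    """Compute accuracy drop (percentage points) and break rate (percent).

    Assumption: a pair "breaks" when the original is answered correctly
    and the modified version is not.
    """
    n = len(results)
    if n == 0:
        raise ValueError("need at least one paired result")

    original_hits = sum(r.original_correct for r in results)
    modified_hits = sum(r.modified_correct for r in results)
    broken = sum(r.original_correct and not r.modified_correct for r in results)

    return {
        "accuracy_original_pct": 100 * original_hits / n,
        "accuracy_modified_pct": 100 * modified_hits / n,
        # Net change in accuracy, expressed in percentage points.
        "accuracy_drop_pp": 100 * (original_hits - modified_hits) / n,
        # Share of pairs that flipped from correct to incorrect.
        "break_rate_pct": 100 * broken / n,
    }


# Toy data, invented for illustration: 10 problem pairs.
toy_results = (
    [PairedResult(True, True)] * 7      # solved in both versions
    + [PairedResult(True, False)] * 2   # broke after modification
    + [PairedResult(False, True)] * 1   # solved only after modification
)
metrics = robustness_metrics(toy_results)
```

In the toy data the break rate is higher than the accuracy drop because one pair flips the other way, from incorrect to correct. The same mechanism would allow a break rate of 1.8% to sit alongside a net drop of 1.3 percentage points, although the summary does not say this is what happened in the study.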

Key facts

  • Study evaluates three approaches on 1,000 problems from the GSM-Symbolic dataset
  • Approaches: chain-of-thought (CoT, pure reasoning), Program-Aided Language models (PAL, single-shot code execution), Step-by-Step Coding (SBSC, iterative code execution)
  • All approaches tested on a single model, Claude Haiku 4.5
  • CoT was most robust, with an accuracy drop of 1.3 percentage points
  • 1.8% of problems broke under CoT
  • Contrary to expectation, code execution methods did not improve robustness
  • Published on arXiv with ID 2605.26414
  • Problems modified with simple changes like different names or numbers
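To make "simple changes like different names or numbers" concrete, here is a minimal sketch of how a templated problem can yield paired variants with a known answer. It is an illustration in the spirit of symbolic templates, not the dataset's actual generator: the template text, the name list, the value ranges, and the `make_variant` function are all hypothetical.

```python
import random

# Hypothetical template; real dataset templates are not reproduced here.
TEMPLATE = (
    "{name} has {a} apples and buys {b} more. "
    "How many apples does {name} have now?"
)

NAMES = ("Ava", "Noah", "Mia", "Liam")  # illustrative name pool


def make_variant(rng):
    """Fill the template with a random name and numbers.

    Returns the question text and its ground-truth answer, which is
    computed from the same sampled values so the two always agree.
    """
    name = rng.choice(NAMES)
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    question = TEMPLATE.format(name=name, a=a, b=b)
    return question, a + b


rng = random.Random(0)  # fixed seed so the sketch is reproducible
original_question, original_answer = make_variant(rng)
modified_question, modified_answer = make_variant(rng)
```

Because the underlying reasoning is identical across variants, any difference in correctness between the original and the modified question reflects sensitivity to surface details rather than to problem difficulty.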

Entities

Institutions

  • arXiv
