GSM-Symbolic Benchmark's Statistical Flaws Questioned

other · 2026-05-28

A new critique challenges the GSM-Symbolic benchmark's conclusion that LLMs lack reasoning. Re-evaluating 20 open-weight models with Generalized Linear Mixed Models, the researchers found that only half showed statistically significant performance changes. The main GSM-Symbolic dataset is also shifted toward larger integers, contradicting the original claims.
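As a rough illustration of what a mixed model with per-question random effects can look like, one common form is a logistic model with a random intercept per question. The symbols below are assumptions made for this sketch and are not taken from the critique, whose exact specification is not given here:

```latex
\[
\operatorname{logit} \Pr(y_{qv} = 1) = \beta_0 + \beta_1 s_v + u_q,
\qquad u_q \sim \mathcal{N}(0, \sigma_u^2)
\]
```

Here $y_{qv}$ is 1 when a model answers variant $v$ of question $q$ correctly, $s_v$ is 1 for a GSM-Symbolic variant and 0 for the original item, and $u_q$ absorbs per-question difficulty. Under this assumed form, a performance drop corresponds to $\beta_1 < 0$, and a control for number size would enter as an additional fixed-effect term.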

Key facts

GSM-Symbolic benchmark reported performance drops across 25 LLMs
Re-evaluation used Generalized Linear Mixed Models with per-question random effects
Only half of 20 open-weight models showed statistically significant changes
Main GSM-Symbolic dataset's integer distribution is shifted toward larger values (K-S statistic = 0.12, p < 0.001)
Controlling for the large-number effect accounts for significance in roughly half of the remaining cases
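
The K-S statistic cited above is the largest gap between two empirical cumulative distribution functions. A minimal sketch of that computation follows, using made-up integers rather than data from either paper; the function name `ks_statistic` and both sample lists are hypothetical:

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute
    gap between the empirical CDFs of the two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # Both ECDFs are step functions that change only at observed values,
    # so evaluating them at those values is enough to find the maximum.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Illustrative values only (assumption): integers from an original
# question set and from a variant set that skews larger.
original_numbers = [2, 3, 5, 8, 12, 15, 20, 24]
variant_numbers = [5, 8, 12, 20, 24, 36, 48, 60]

d = ks_statistic(original_numbers, variant_numbers)
print(f"K-S statistic: {d:.3f}")
```

This sketch returns only the statistic. Turning it into a p-value requires the K-S null distribution, which a library routine such as `scipy.stats.ks_2samp` provides.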

Entities

—

Sources

arXiv cs.AI — 2026-05-28