GSM-Symbolic Benchmark's Statistical Claims Questioned
A new critique challenges the GSM-Symbolic benchmark's conclusion that LLMs lack reasoning ability. Re-evaluating 20 open-weight models with Generalized Linear Mixed Models, the researchers found that only half showed statistically significant performance drops. The main GSM-Symbolic dataset also has a distribution shifted toward larger integers, contradicting the original claims.
Key facts
- GSM-Symbolic benchmark reported performance drops across 25 LLMs
- Re-evaluation used Generalized Linear Mixed Models with per-question random effects
- Only half of 20 open-weight models showed statistically significant changes
- Main GSM-Symbolic dataset has a distribution shifted toward larger integers (K-S statistic = 0.12, p < 0.001)
- Controlling for the large-number effect accounts for the significance in roughly half of the remaining cases
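
For readers unfamiliar with the K-S statistic cited above, here is a minimal sketch of a two-sample Kolmogorov-Smirnov test in plain Python. It is not the critique's code. The function name `ks_two_sample` and the toy integer lists are hypothetical, and the p-value uses the standard asymptotic Kolmogorov series, which is only a rough approximation for small samples.

```python
import math
from bisect import bisect_right


def ks_two_sample(a, b):
    """Return (D, approximate p-value) for a two-sample K-S test.

    D is the largest vertical gap between the two empirical CDFs.
    The p-value comes from the asymptotic Kolmogorov series, so it is
    an approximation (an assumption of this sketch, not an exact test).
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)

    # Evaluate both empirical CDFs at every observed value and keep the
    # largest absolute difference.
    d = 0.0
    for x in set(a_sorted) | set(b_sorted):
        cdf_a = bisect_right(a_sorted, x) / n
        cdf_b = bisect_right(b_sorted, x) / m
        d = max(d, abs(cdf_a - cdf_b))

    # Scale D by the effective sample size.
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.1:
        # The series converges slowly here and the true value is ~1.
        return d, 1.0

    # Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * lam^2)
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
        for k in range(1, 101)
    )
    return d, min(1.0, max(0.0, p))


# Hypothetical toy data: integers appearing in two sets of questions.
baseline_numbers = [2, 3, 5, 8, 10, 12, 15, 20]
variant_numbers = [5, 9, 14, 22, 30, 41, 55, 70]
d_stat, p_value = ks_two_sample(baseline_numbers, variant_numbers)
print(f"D = {d_stat:.3f}, approximate p = {p_value:.3f}")
```

A larger D means the two integer distributions differ more. With the large samples typical of a benchmark, even a modest D such as the reported 0.12 can yield a very small p-value.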