ARTFEED — Contemporary Art Intelligence

GSM-Symbolic Benchmark's Statistical Flaws Questioned

other · 2026-05-28

A new critique challenges the GSM-Symbolic benchmark's conclusion that LLMs lack reasoning. Re-evaluating 20 open-weight models with Generalized Linear Mixed Models, the researchers found that only half showed statistically significant performance drops. The main dataset's integer distribution is also shifted toward larger values, contradicting the original claims.
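
The summary names the model family only as a Generalized Linear Mixed Model with per-question random effects. One common form for binary correct/incorrect outcomes, given here as an illustrative assumption and not as the researchers' exact specification, is a logistic model with a random intercept per question:

```latex
\operatorname{logit} \Pr(y_{ij} = 1) = \beta_0 + \beta_1 x_j + u_i,
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2)
```

Here $y_{ij}$ is 1 when a model answers question $i$ correctly under dataset condition $j$, $x_j$ is 0 for the original questions and 1 for the GSM-Symbolic variants, and $u_i$ absorbs question-level difficulty. All symbols are hypothetical. Under this sketch, a "statistically significant performance drop" would correspond to rejecting $\beta_1 = 0$ in favor of $\beta_1 < 0$.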

Key facts

  • GSM-Symbolic benchmark reported performance drops across 25 LLMs
  • Re-evaluation used Generalized Linear Mixed Models with per-question random effects
  • Only half of 20 open-weight models showed statistically significant changes
  • Main GSM-Symbolic dataset's integer distribution is shifted toward larger values (K-S statistic = 0.12, p < 0.001)
  • Controlling for the large-number effect accounts for the significance in roughly half of the remaining cases
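
The K-S statistic cited above is the largest gap between two empirical cumulative distributions. The following is a minimal sketch of a two-sample Kolmogorov-Smirnov test in pure Python, assuming the comparison is between two lists of integers. The function name `ks_two_sample` and the synthetic samples are hypothetical, the p-value uses the standard asymptotic series, and none of this reproduces the researchers' data or code.

```python
import bisect
import math
import random


def ks_two_sample(a, b):
    """Return (D, p) for a two-sample Kolmogorov-Smirnov test.

    D is the largest absolute gap between the two empirical CDFs.
    p comes from the asymptotic Kolmogorov series, so it is only
    approximate for small samples.
    """
    xs, ys = sorted(a), sorted(b)
    n, m = len(xs), len(ys)
    # Evaluate both empirical CDFs at every observed value.
    d = max(
        abs(bisect.bisect_right(xs, v) / n - bisect.bisect_right(ys, v) / m)
        for v in set(xs) | set(ys)
    )
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The alternating series converges poorly here; p is effectively 1.
        return d, 1.0
    # Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * lam^2)
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, 101)
    )
    return d, min(1.0, max(0.0, p))


if __name__ == "__main__":
    # Synthetic stand-ins: the second sample draws from a wider integer range.
    rng = random.Random(0)
    baseline = [rng.randint(1, 100) for _ in range(2000)]
    shifted = [rng.randint(1, 130) for _ in range(2000)]
    d, p = ks_two_sample(baseline, shifted)
    print(f"K-S statistic = {d:.3f}, p = {p:.3g}")
```

A statistic of 0.12 means the two cumulative distributions differ by at most 12 percentage points at any value. With large samples, even a gap of that size can yield a very small p-value.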
