GSM-SEM: A New Benchmark for Mathematical Reasoning Robustness

other · 2026-05-11

Researchers have introduced GSM-SEM, a stochastic framework for generating semantically diverse variants of mathematical reasoning benchmarks such as GSM8K. Because traditional benchmarks use fixed test sets, models can memorize them, and existing robustness variants apply only surface-level perturbations such as paraphrases or number swaps. GSM-SEM instead modifies entities, attributes, and relationships in problem statements, frequently altering the underlying facts so that models must recompute solutions under new conditions. Generation is constrained to preserve the original calculations and approximate difficulty, and each run produces fresh variants without re-annotation. The framework aims to give a more reliable measure of reasoning capability by making it harder to overfit to a static dataset.

Key facts

GSM-SEM is a reusable and stochastic framework for generating semantically diverse benchmark variants.
It perturbs problem statements by modifying entities, attributes, and/or relationships.
The framework frequently alters underlying facts, requiring models to recompute solutions.
Generation is constrained to preserve the original calculations and answer, and to keep problem difficulty approximately unchanged.
GSM-SEM generates fresh variants on each run without requiring re-annotation.
It addresses memorization issues in popular benchmarks like GSM8K.
Existing robustness variants apply only surface-level perturbations such as paraphrases or number swaps.
The framework introduces substantially higher semantic variance than prior approaches.
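
To make the idea of semantic perturbation concrete, here is a toy Python sketch. It is not the GSM-SEM implementation, whose generation procedure this summary does not describe. The word pools, the `Variant` type, and the `make_variant` function are all hypothetical, and the sketch assumes a single addition problem. It only illustrates swapping entities and relationships at random while the calculation and answer stay fixed.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    question: str
    answer: int


# Hypothetical pools of entities and attributes (not from GSM-SEM).
NAMES = ["Ava", "Liam", "Noor", "Kenji"]
ITEMS = ["apples", "marbles", "stickers", "pencils"]

# Each relationship changes how the second quantity is acquired,
# but the underlying operation remains addition.
GAIN_RELATIONS = [
    "buys {b} more",
    "receives {b} more from a friend",
    "finds {b} more in a drawer",
]


def make_variant(a: int, b: int, rng: random.Random) -> Variant:
    """Build one reworded variant of an 'a + b' word problem."""
    name = rng.choice(NAMES)
    item = rng.choice(ITEMS)
    relation = rng.choice(GAIN_RELATIONS).format(b=b)
    question = (
        f"{name} has {a} {item} and {relation}. "
        f"How many {item} does {name} have now?"
    )
    # The calculation (and therefore the answer) is preserved by construction.
    return Variant(question=question, answer=a + b)


if __name__ == "__main__":
    # An unseeded generator yields different wording on each run.
    rng = random.Random()
    for _ in range(3):
        print(make_variant(3, 5, rng))
```

Passing a seeded `random.Random` makes a run reproducible, while an unseeded one gives fresh wording each time. Because the answer is computed from the same operands in every variant, no re-annotation is needed in this toy setting.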
