MGSM-Pro: New Benchmark for Multilingual Math Reasoning in LLMs
Researchers introduced MGSM-Pro, a multilingual mathematical reasoning benchmark that extends the MGSM dataset with GSM-Symbolic's template-instantiation approach. The benchmark provides five variations of each question by altering names, digits, and irrelevant context. Evaluations across nine languages reveal significant accuracy drops on digit variations for low-resource languages; robustness observed in high-resource languages does not transfer to low-resource ones. Proprietary models, including Gemini 2.5 Flash and GPT-4.1, were tested.
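The instantiation idea can be sketched as follows. This is a hypothetical illustration in the spirit of GSM-Symbolic, not the actual MGSM-Pro pipeline: the template, the name list, and the `instantiate` helper are all assumptions made for the example. A question template with placeholders for a name and two numbers is sampled repeatedly, and the gold answer is recomputed from the sampled digits so each variation stays verifiable.

```python
import random

# Hypothetical GSM-style template (not from the paper): placeholders for a
# name and two numeric values; the gold answer is derived from the digits.
TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have now?"
NAMES = ["Sofia", "Amara", "Kenji", "Leila"]  # illustrative name pool

def instantiate(template: str, rng: random.Random):
    """Produce one variation by swapping the name and the digits."""
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    question = template.format(name=rng.choice(NAMES), a=a, b=b)
    return question, a + b  # gold answer tracks the sampled digits

rng = random.Random(0)
# Five instantiations per question, mirroring the benchmark's setup.
variants = [instantiate(TEMPLATE, rng) for _ in range(5)]
for question, answer in variants:
    print(question, "->", answer)
```

A model that merely memorized the surface form of the original question will fail when the digits change, which is exactly the failure mode the digit-variation results probe.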
Key facts
- MGSM-Pro extends MGSM with the GSM-Symbolic instantiation approach
- Five instantiations per question by varying names, digits, and irrelevant context
- Evaluated across nine languages
- Low-resource languages suffer large performance drops on digit variations
- Robustness in high-resource languages does not transfer to low-resource languages
- Proprietary models tested include Gemini 2.5 Flash and GPT-4.1
- Published on arXiv with ID 2601.21225
- Announce type: replace-cross
Entities
Institutions
- arXiv