LLM Math Reasoning: Accuracy High, Strategy Diversity Low
A recent study posted to arXiv (2605.09292) presents a framework for evaluating the mathematical reasoning of large language models beyond answer accuracy. Using 80 AMC 10/12 and AIME problems together with 217 reference strategy families derived from AoPS solutions, the researchers found a clear separation between answer accuracy and strategy diversity. Under single-solution prompts, models reached 95-100% accuracy; under multiple-strategy prompts, however, they recovered far fewer strategies than the human reference set. Gemini produced 184 distinct valid strategies, followed by DeepSeek (152), GPT (151), and Claude (110), with the largest gaps in Geometry and Number Theory. The models also produced 50 valid strategies not present in the benchmark.
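The strategy-recovery comparison above can be made concrete with a small sketch. The problem IDs, strategy names, and metric below are illustrative assumptions, not the paper's actual data or scoring code; the idea is simply to count how many reference strategy families a model's solutions cover.

```python
# Toy reference set: strategy families per problem (hypothetical names,
# loosely in the spirit of AoPS-style solution taxonomies).
reference_families = {
    "AMC12-P5": {"coordinate_geometry", "power_of_a_point", "similar_triangles"},
    "AIME-P9": {"generating_functions", "recursion", "casework"},
}

# Strategies a model's multiple-strategy responses were coded to (hypothetical).
model_strategies = {
    "AMC12-P5": {"coordinate_geometry", "similar_triangles"},
    "AIME-P9": {"recursion"},
}

def recovery_rate(reference, produced):
    """Fraction of reference strategy families recovered across all problems."""
    total = sum(len(fams) for fams in reference.values())
    recovered = sum(
        len(fams & produced.get(pid, set())) for pid, fams in reference.items()
    )
    return recovered / total

print(recovery_rate(reference_families, model_strategies))  # 0.5
```

A per-topic breakdown (e.g. Geometry vs Number Theory) would follow the same pattern, grouping problems by topic before aggregating.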
Key facts
- Study evaluates LLM mathematical reasoning strategy diversity beyond accuracy.
- Framework uses 80 AMC 10/12 and AIME problems with 217 AoPS reference strategy families.
- Dual-AI coding with human adjudication annotates strategy identity, validity, and correctness.
- Under single-solution prompts, models achieve 95-100% accuracy.
- Under multiple-strategy prompts, models recover fewer strategies than the human reference set.
- Gemini leads with 184 distinct valid strategies, followed by DeepSeek (152), GPT (151), Claude (110).
- Largest strategy gaps in Geometry and Number Theory.
- Models collectively produce 50 benchmark-novel valid strategies.
Entities
Institutions
- arXiv