GeoLaux Dataset Evaluates MLLMs' Geometry Problem-Solving with Long-Step Reasoning and Auxiliary Lines
A new benchmark named GeoLaux has been introduced to measure how well Multimodal Large Language Models (MLLMs) handle geometry problems that require long-step reasoning and the construction of auxiliary lines. The dataset comprises 2,186 calculation and proof problems, with solutions averaging 6.51 steps and reaching a maximum of 24 steps. Notably, 41.8% of the problems require auxiliary lines, a significant hurdle for existing models. The researchers evaluated 23 leading MLLMs across five criteria and uncovered substantial performance disparities: models fared considerably worse on long-step problems, with 18 models showing declines greater than 50%. Published as arXiv:2508.06226v2, the study highlights the limitations of current MLLMs in diagram interpretation and knowledge application for complex geometric reasoning, underscoring the need for improved model designs.
Key facts
- GeoLaux is a benchmark dataset for evaluating MLLMs on geometry problems
- Contains 2,186 calculation and proof problems
- Average solution length is 6.51 steps, with a maximum of 24 steps
- 41.8% of problems require auxiliary line construction
- Evaluated 23 leading Multimodal Large Language Models
- 18 models showed performance drops over 50% on long-step problems
- Published as arXiv:2508.06226v2
- Addresses lack of fine-grained evaluation for long-step geometry problems
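The reported declines on long-step problems can be read as a relative accuracy drop between short-step and long-step subsets. A minimal sketch of that arithmetic, assuming a simple relative-drop metric (the function name and the accuracy figures below are illustrative, not taken from the paper):

```python
# Hypothetical sketch: computing a relative performance drop between
# short-step and long-step problem subsets. The accuracy values used
# here are illustrative only, not figures from the GeoLaux paper.

def relative_drop(short_step_acc: float, long_step_acc: float) -> float:
    """Fractional decline in accuracy from short-step to long-step problems."""
    return (short_step_acc - long_step_acc) / short_step_acc

# Example: a model scoring 0.60 on short-step but 0.24 on long-step problems
drop = relative_drop(0.60, 0.24)
print(f"{drop:.0%}")  # a 60% decline, exceeding the paper's 50% threshold
```

Under this reading, a model would count among the 18 flagged models whenever its relative drop exceeds 0.5.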