GeoLaux Dataset Evaluates MLLMs' Geometry Problem-Solving with Long-Step Reasoning and Auxiliary Lines
A new benchmark named GeoLaux has been introduced to measure how well Multimodal Large Language Models (MLLMs) handle geometry problems that require long-step reasoning and the construction of auxiliary lines. The dataset comprises 2,186 calculation and proof problems, with solutions averaging 6.51 steps and reaching a maximum of 24 steps. Notably, 41.8% of the problems require auxiliary lines, a significant hurdle for existing models. The researchers evaluated 23 leading MLLMs across five criteria and uncovered substantial performance disparities: models fared considerably worse on long-step problems, with 18 models showing declines greater than 50%. Published as arXiv:2508.06226v2, the study highlights the limitations of current MLLMs in diagram interpretation and knowledge application for complex geometric reasoning, underscoring the need for improved model designs.
Key facts
- GeoLaux is a benchmark dataset for evaluating MLLMs on geometry problems
- Contains 2,186 calculation and proof problems
- Average solution length is 6.51 steps, with a maximum of 24 steps
- 41.8% of problems require auxiliary line construction
- Evaluated 23 leading Multimodal Large Language Models
- 18 models showed performance drops over 50% on long-step problems
- Published as arXiv:2508.06226v2
- Addresses lack of fine-grained evaluation for long-step geometry problems
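The reported declines on long-step problems can be read as a relative accuracy drop between short-step and long-step subsets. A minimal sketch of that arithmetic, assuming a simple relative-drop metric (the function name and the accuracy figures below are illustrative, not taken from the paper):

```python
# Hypothetical sketch: computing a relative performance drop between
# short-step and long-step problem subsets. The accuracy values used
# here are illustrative only, not figures from the GeoLaux paper.

def relative_drop(short_step_acc: float, long_step_acc: float) -> float:
    """Fractional decline in accuracy from short-step to long-step problems."""
    return (short_step_acc - long_step_acc) / short_step_acc

# Example: a model scoring 0.60 on short-step but 0.24 on long-step problems
drop = relative_drop(0.60, 0.24)
print(f"{drop:.0%}")  # a 60% decline, exceeding the paper's 50% threshold
```

Under this reading, a model would count among the 18 flagged models whenever its relative drop exceeds 0.5.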