ARTFEED — Contemporary Art Intelligence

GeoRepEval Framework Assesses LLM Robustness Across Geometry Problem Representations

ai-technology · 2026-04-22

Researchers have introduced GeoRepEval, an evaluation framework that assesses how large language models (LLMs) perform on geometry problems posed in different mathematical representations. They observe that existing benchmarks evaluate LLMs on fixed problem formats, implicitly assuming representation invariance and thus overlooking failures that arise when the same problem is restated. GeoRepEval tests eleven language models on 158 curated geometry problems, each expressed in parallel Euclidean, coordinate, and vector forms. The framework measures three problem-level criteria — correctness, invariance, and consistency — supported by statistical methods including bootstrap confidence intervals and McNemar tests. Its Invariance@3 metric decomposes accuracy into robust and fragile components. The results show that LLM performance varies substantially with how a problem is expressed. The work, reported in arXiv:2604.16421v1, addresses a gap in understanding model robustness in mathematical contexts.
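The Invariance@3 decomposition can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the function name, data layout, and the exact definitions of "robust" (correct under all three representations) and "fragile" (correct under some but not all) are assumptions inferred from the summary.

```python
from typing import Dict, List

def invariance_at_3(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Decompose accuracy over three representations into robust and
    fragile components (illustrative sketch, not the paper's code).

    `results` maps each problem id to three booleans: correctness under
    the Euclidean, coordinate, and vector forms of that problem.
    """
    n = len(results)
    robust = sum(all(r) for r in results.values())      # solved in all 3 forms
    solved_any = sum(any(r) for r in results.values())  # solved in at least 1 form
    fragile = solved_any - robust                       # solved in some, not all
    mean_acc = sum(sum(r) for r in results.values()) / (3 * n)
    return {
        "accuracy": mean_acc,       # pooled per-representation accuracy
        "robust": robust / n,       # representation-invariant successes
        "fragile": fragile / n,     # representation-sensitive successes
    }

# Toy example: two problems, three representations each.
demo = {"p1": [True, True, True], "p2": [True, False, True]}
print(invariance_at_3(demo))
```

Under this decomposition, robust and fragile fractions sum to the share of problems solved in at least one representation, so a model with high accuracy but a large fragile share is succeeding for representation-dependent reasons.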

Key facts

  • GeoRepEval evaluates LLM robustness across geometry problem representations
  • Framework measures correctness, invariance, and consistency at problem level
  • Eleven large language models were assessed on 158 curated geometry problems
  • Existing benchmarks assume representation invariance with fixed formats
  • Problems can be expressed in Euclidean, coordinate, or vector forms
  • Invariance@3 metric decomposes accuracy into robust and fragile components
  • Statistical methods include bootstrap confidence intervals and McNemar tests
  • Paper identifier is arXiv:2604.16421v1, announced as a cross-disciplinary listing
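The two statistical methods named above can be sketched in a few lines; this is a generic stdlib illustration under assumed inputs (per-problem correctness flags), not the framework's actual code.

```python
import random
from typing import List, Tuple

def bootstrap_ci(correct: List[bool], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for accuracy (sketch)."""
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(
        sum(rng.choice(correct) for _ in range(n)) / n for _ in range(n_boot)
    )
    return accs[int((alpha / 2) * n_boot)], accs[int((1 - alpha / 2) * n_boot) - 1]

def mcnemar_chi2(a: List[bool], b: List[bool]) -> float:
    """Continuity-corrected McNemar chi-square statistic, comparing one
    model's paired outcomes under two representations of the same problems."""
    b01 = sum(1 for x, y in zip(a, b) if x and not y)  # right in a, wrong in b
    b10 = sum(1 for x, y in zip(a, b) if not x and y)  # wrong in a, right in b
    if b01 + b10 == 0:
        return 0.0  # no discordant pairs: no evidence of a difference
    return (abs(b01 - b10) - 1) ** 2 / (b01 + b10)
```

McNemar's test is the natural choice here because the same problems are scored under each representation, so only the discordant pairs carry information about representation sensitivity.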
