ARTFEED — Contemporary Art Intelligence

GeoRepEval Framework Assesses LLM Robustness Across Geometry Problem Representations

ai-technology · 2026-04-22

Researchers have introduced GeoRepEval, an evaluation framework that assesses how large language models (LLMs) perform on geometry problems posed in different mathematical representations. They observe that existing benchmarks evaluate LLMs on fixed problem formats, implicitly assuming representation invariance and thus overlooking failures that arise when the same problem is restated. GeoRepEval tests eleven language models on 158 curated geometry problems, each expressed in parallel Euclidean, coordinate, and vector forms. The framework measures three problem-level criteria — correctness, invariance, and consistency — supported by statistical methods including bootstrap confidence intervals and McNemar tests. Its Invariance@3 metric decomposes accuracy into robust and fragile components. The results show that LLM performance varies substantially with how a problem is expressed. The work, reported in arXiv:2604.16421v1, addresses a gap in understanding model robustness in mathematical contexts.
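The Invariance@3 decomposition can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the function name, data layout, and the exact definitions of "robust" (correct under all three representations) and "fragile" (correct under some but not all) are assumptions inferred from the summary.

```python
from typing import Dict, List

def invariance_at_3(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Decompose accuracy over three representations into robust and
    fragile components (illustrative sketch, not the paper's code).

    `results` maps each problem id to three booleans: correctness under
    the Euclidean, coordinate, and vector forms of that problem.
    """
    n = len(results)
    robust = sum(all(r) for r in results.values())      # solved in all 3 forms
    solved_any = sum(any(r) for r in results.values())  # solved in at least 1 form
    fragile = solved_any - robust                       # solved in some, not all
    mean_acc = sum(sum(r) for r in results.values()) / (3 * n)
    return {
        "accuracy": mean_acc,       # pooled per-representation accuracy
        "robust": robust / n,       # representation-invariant successes
        "fragile": fragile / n,     # representation-sensitive successes
    }

# Toy example: two problems, three representations each.
demo = {"p1": [True, True, True], "p2": [True, False, True]}
print(invariance_at_3(demo))
```

Under this decomposition, robust and fragile fractions sum to the share of problems solved in at least one representation, so a model with high accuracy but a large fragile share is succeeding for representation-dependent reasons.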

Key facts

  • GeoRepEval evaluates LLM robustness across geometry problem representations
  • Framework measures correctness, invariance, and consistency at problem level
  • Eleven large language models were assessed on 158 curated geometry problems
  • Existing benchmarks assume representation invariance with fixed formats
  • Problems can be expressed in Euclidean, coordinate, or vector forms
  • Invariance@3 metric decomposes accuracy into robust and fragile components
  • Statistical methods include bootstrap confidence intervals and McNemar tests
  • Paper identifier is arXiv:2604.16421v1, announced as a cross-disciplinary listing
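The two statistical methods named above can be sketched in a few lines; this is a generic stdlib illustration under assumed inputs (per-problem correctness flags), not the framework's actual code.

```python
import random
from typing import List, Tuple

def bootstrap_ci(correct: List[bool], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for accuracy (sketch)."""
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(
        sum(rng.choice(correct) for _ in range(n)) / n for _ in range(n_boot)
    )
    return accs[int((alpha / 2) * n_boot)], accs[int((1 - alpha / 2) * n_boot) - 1]

def mcnemar_chi2(a: List[bool], b: List[bool]) -> float:
    """Continuity-corrected McNemar chi-square statistic, comparing one
    model's paired outcomes under two representations of the same problems."""
    b01 = sum(1 for x, y in zip(a, b) if x and not y)  # right in a, wrong in b
    b10 = sum(1 for x, y in zip(a, b) if not x and y)  # wrong in a, right in b
    if b01 + b10 == 0:
        return 0.0  # no discordant pairs: no evidence of a difference
    return (abs(b01 - b10) - 1) ** 2 / (b01 + b10)
```

McNemar's test is the natural choice here because the same problems are scored under each representation, so only the discordant pairs carry information about representation sensitivity.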
