ARTFEED — Contemporary Art Intelligence

MIRROR Benchmark Reveals LLMs Fail at Self-Prediction

ai-technology · 2026-04-24

Researchers have introduced MIRROR, a benchmark that evaluates metacognitive calibration in large language models across eight experiments spanning four metacognitive levels. Testing 16 models from 8 labs on roughly 250,000 instances through five independent behavioral channels, they report two key findings. First, compositional self-prediction fails universally: Compositional Calibration Error (CCE) ranges from 0.500 to 0.943 on the original 15-model set and from 0.434 to 0.758 on a balanced 16-model expansion, indicating that models cannot predict their own performance on multi-domain tasks. Second, models show above-chance but imperfect domain-specific self-knowledge, with systematic failures persisting nonetheless. The findings have direct implications for agentic deployment.
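The article does not give the exact formula behind the Compositional Calibration Error, so the sketch below is only an illustration of the general idea of a calibration-error score: the gap between a model's self-predicted success probabilities and its observed outcomes on composed tasks. The function name and the mean-absolute-gap form are assumptions, not the paper's definition.

```python
# Hypothetical sketch of a calibration-error score (NOT the paper's CCE formula):
# mean absolute gap between a model's self-predicted success probabilities
# and its actual pass/fail outcomes on multi-domain (composed) tasks.

def calibration_error(predicted: list[float], actual: list[float]) -> float:
    """Mean absolute difference between self-predictions and observed outcomes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Illustrative only: a model that is confident on composed tasks but mostly fails
# them scores a high error, matching the kind of miscalibration described above.
preds = [0.9, 0.8, 0.7, 0.6]      # self-predicted success probabilities
outcomes = [0.0, 1.0, 0.0, 0.0]   # observed pass/fail on the composed tasks
print(round(calibration_error(preds, outcomes), 3))  # ≈ 0.6
```

A score of 0 would mean perfectly calibrated self-prediction; values approaching 1, like the upper end of the reported CCE range, mean the model's self-assessment carries almost no information about its actual performance.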

Key facts

  • MIRROR benchmark evaluates metacognitive calibration in LLMs
  • Eight experiments across four metacognitive levels
  • 16 models from 8 labs tested
  • Approximately 250,000 evaluation instances
  • Five independent behavioral measurement channels
  • Compositional Calibration Error ranges from 0.500 to 0.943 on original 15-model set
  • Balanced 16-model expansion shows CCE from 0.434 to 0.758
  • Models exhibit above-chance but imperfect domain-specific self-knowledge

Entities

Institutions

  • arXiv

Sources