MIRROR Benchmark Reveals LLMs Fail at Self-Prediction
Researchers introduced MIRROR, a benchmark that evaluates metacognitive calibration in large language models across eight experiments spanning four metacognitive levels. Testing 16 models from 8 labs on approximately 250,000 evaluation instances through five independent behavioral measurement channels, they report two key phenomena. First, compositional self-prediction fails universally: Compositional Calibration Error (CCE) ranges from 0.500 to 0.943 on the original 15-model set and from 0.434 to 0.758 on a balanced 16-model expansion, indicating that models cannot predict their own performance on multi-domain tasks. Second, models show above-chance but imperfect domain-specific self-knowledge, and systematic failures persist even there. The findings have direct implications for agentic deployment, where models must judge their own competence before acting.
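The summary does not spell out how Compositional Calibration Error is computed. As a rough illustration only, the sketch below assumes CCE is the mean absolute gap between a model's self-predicted probability of solving a multi-domain task and its observed outcome; the class and function names, data layout, and numbers are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of a compositional calibration error (CCE) metric.
# Assumption: CCE is the mean absolute gap between a model's self-predicted
# probability of solving a multi-domain task and the observed outcome.
# All names, the data layout, and the numbers are illustrative only.

from dataclasses import dataclass


@dataclass
class CompositeTask:
    domains: tuple[str, ...]   # domains the task combines, e.g. ("math", "coding")
    predicted_success: float   # model's own estimate that it will solve the task
    actual_success: float      # 1.0 if the model actually solved it, else 0.0


def compositional_calibration_error(tasks: list[CompositeTask]) -> float:
    """Mean |self-prediction - outcome| over multi-domain tasks."""
    if not tasks:
        raise ValueError("no tasks provided")
    return sum(abs(t.predicted_success - t.actual_success) for t in tasks) / len(tasks)


if __name__ == "__main__":
    tasks = [
        CompositeTask(("math", "coding"), predicted_success=0.9, actual_success=0.0),
        CompositeTask(("law", "biology"), predicted_success=0.7, actual_success=1.0),
        CompositeTask(("math", "law"), predicted_success=0.8, actual_success=0.0),
    ]
    print(f"CCE = {compositional_calibration_error(tasks):.3f}")  # 0.667
```

Under that reading, a CCE near 0 would mean a model's self-predictions track its actual multi-domain performance, while values like the reported 0.500 to 0.943 would mean its self-predictions carry little information about outcomes.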
Key facts
- MIRROR benchmark evaluates metacognitive calibration in LLMs
- Eight experiments across four metacognitive levels
- 16 models from 8 labs tested
- Approximately 250,000 evaluation instances
- Five independent behavioral measurement channels
- Compositional Calibration Error (CCE) ranges from 0.500 to 0.943 on the original 15-model set
- The balanced 16-model expansion shows CCE from 0.434 to 0.758
- Models exhibit above-chance but imperfect domain-specific self-knowledge (see the sketch after this list)
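Domain-specific self-knowledge can be pictured the same way, one domain at a time. A minimal sketch, again with hypothetical names and made-up numbers, assuming it is measured as the gap between self-predicted and observed accuracy within each domain:

```python
# Hypothetical sketch of per-domain self-knowledge measurement.
# Assumption: domain-specific calibration is the gap between a model's
# self-predicted accuracy and its measured accuracy within one domain.
# The domain names and numbers below are made up for illustration.

def domain_calibration_gaps(predicted_acc: dict[str, float],
                            actual_acc: dict[str, float]) -> dict[str, float]:
    """Absolute self-predicted vs. observed accuracy gap per domain."""
    return {d: abs(predicted_acc[d] - actual_acc[d]) for d in actual_acc}


if __name__ == "__main__":
    predicted = {"math": 0.85, "coding": 0.90, "law": 0.60}
    actual = {"math": 0.70, "coding": 0.75, "law": 0.55}
    for domain, gap in domain_calibration_gaps(predicted, actual).items():
        # Small gaps would indicate good self-knowledge in that domain;
        # "above-chance but imperfect" corresponds to moderate gaps.
        print(f"{domain:>6}: |predicted - actual| = {gap:.2f}")
```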
Entities
Institutions
- arXiv