MIRROR Benchmark Reveals LLMs Fail at Self-Prediction
Researchers introduced MIRROR, a benchmark that evaluates metacognitive calibration in large language models across eight experiments spanning four metacognitive levels. Testing 16 models from 8 labs on approximately 250,000 evaluation instances through five independent behavioral measurement channels, they report two key phenomena. First, compositional self-prediction fails universally: Compositional Calibration Error (CCE) ranges from 0.500 to 0.943 on the original 15-model set and from 0.434 to 0.758 on a balanced 16-model expansion, indicating that models cannot predict their own performance on multi-domain tasks. Second, models show above-chance but imperfect domain-specific self-knowledge, and systematic failures persist even there. The findings have direct implications for agentic deployment, where models must judge their own competence before acting.
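The summary does not spell out how Compositional Calibration Error is computed. As a rough illustration only, the sketch below assumes CCE is the mean absolute gap between a model's self-predicted probability of solving a multi-domain task and its observed outcome; the class and function names, data layout, and numbers are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of a compositional calibration error (CCE) metric.
# Assumption: CCE is the mean absolute gap between a model's self-predicted
# probability of solving a multi-domain task and the observed outcome.
# All names, the data layout, and the numbers are illustrative only.

from dataclasses import dataclass


@dataclass
class CompositeTask:
    domains: tuple[str, ...]   # domains the task combines, e.g. ("math", "coding")
    predicted_success: float   # model's own estimate that it will solve the task
    actual_success: float      # 1.0 if the model actually solved it, else 0.0


def compositional_calibration_error(tasks: list[CompositeTask]) -> float:
    """Mean |self-prediction - outcome| over multi-domain tasks."""
    if not tasks:
        raise ValueError("no tasks provided")
    return sum(abs(t.predicted_success - t.actual_success) for t in tasks) / len(tasks)


if __name__ == "__main__":
    tasks = [
        CompositeTask(("math", "coding"), predicted_success=0.9, actual_success=0.0),
        CompositeTask(("law", "biology"), predicted_success=0.7, actual_success=1.0),
        CompositeTask(("math", "law"), predicted_success=0.8, actual_success=0.0),
    ]
    print(f"CCE = {compositional_calibration_error(tasks):.3f}")  # 0.667
```

Under that reading, a CCE near 0 would mean a model's self-predictions track its actual multi-domain performance, while values like the reported 0.500 to 0.943 would mean its self-predictions carry little information about outcomes.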
Key facts
- MIRROR benchmark evaluates metacognitive calibration in LLMs
- Eight experiments across four metacognitive levels
- 16 models from 8 labs tested
- Approximately 250,000 evaluation instances
- Five independent behavioral measurement channels
- Compositional Calibration Error (CCE) ranges from 0.500 to 0.943 on the original 15-model set
- The balanced 16-model expansion shows CCE from 0.434 to 0.758
- Models exhibit above-chance but imperfect domain-specific self-knowledge (see the sketch after this list)
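Domain-specific self-knowledge can be pictured the same way, one domain at a time. A minimal sketch, again with hypothetical names and made-up numbers, assuming it is measured as the gap between self-predicted and observed accuracy within each domain:

```python
# Hypothetical sketch of per-domain self-knowledge measurement.
# Assumption: domain-specific calibration is the gap between a model's
# self-predicted accuracy and its measured accuracy within one domain.
# The domain names and numbers below are made up for illustration.

def domain_calibration_gaps(predicted_acc: dict[str, float],
                            actual_acc: dict[str, float]) -> dict[str, float]:
    """Absolute self-predicted vs. observed accuracy gap per domain."""
    return {d: abs(predicted_acc[d] - actual_acc[d]) for d in actual_acc}


if __name__ == "__main__":
    predicted = {"math": 0.85, "coding": 0.90, "law": 0.60}
    actual = {"math": 0.70, "coding": 0.75, "law": 0.55}
    for domain, gap in domain_calibration_gaps(predicted, actual).items():
        # Small gaps would indicate good self-knowledge in that domain;
        # "above-chance but imperfect" corresponds to moderate gaps.
        print(f"{domain:>6}: |predicted - actual| = {gap:.2f}")
```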
Entities
Institutions
- arXiv