MEDLEY-BENCH AI Benchmark Reveals Metacognition Evaluation-Control Dissociation in 35 Models
MEDLEY-BENCH is a new benchmark of behavioral metacognition in AI systems: how models monitor and regulate their own reasoning. It evaluates 35 models from 12 families on 130 ambiguous instances spanning five domains, and it separates independent reasoning, private self-revision, and socially influenced revision when models encounter genuine disagreement. Two complementary scores are reported: the Medley Metacognition Score (MMS), a tier-based aggregate of reflective updating, social robustness, and epistemic articulation, and the Medley Ability Score (MAS), derived from four metacognitive sub-abilities. The results show a dissociation between evaluation and control: within model families, evaluation ability improves consistently with model size, while control does not scale in the same way. A follow-up progressive adversarial analysis of 11 models identified two behavioral profiles: models that revise primarily in response to social influence, and models that rely more on private self-revision. The work highlights that metacognition remains under-evaluated in current AI benchmarking practice, despite its importance for advanced reasoning systems. It was announced on arXiv under the identifier 2604.16009v1.
Key facts
- MEDLEY-BENCH is a new benchmark for evaluating behavioral metacognition in AI
- It assesses 35 models from 12 families across 130 ambiguous instances in five domains
- The benchmark separates independent reasoning, private self-revision, and socially influenced revision
- Two scores are reported: Medley Metacognition Score (MMS) and Medley Ability Score (MAS)
- Evaluation ability increases with model size within families, but control does not
- A follow-up analysis of 11 models revealed two distinct behavioral revision profiles
- Metacognition remains under-evaluated in current AI benchmarking practices
- The research was announced on arXiv with identifier 2604.16009v1
Entities
Institutions
- arXiv