Domain-level metacognitive monitoring varies widely across 33 frontier LLMs
A recent study posted to arXiv (2605.06673) examined metacognitive monitoring in 33 frontier LLMs from eight model families. The researchers used 1,500 MMLU items, 250 in each of six domains, and computed a Type-2 AUROC for every model-domain pair from verbalized confidence ratings (0 to 100), drawing on 47,151 observations in total. Every model with above-chance aggregate monitoring showed significant variation across domains. Applied/Professional knowledge was the easiest domain to monitor (mean AUROC = .742; ranked in the top-2 for 21 of 33 models), while Formal Reasoning and Natural Science were the hardest (one of the two ranked in the bottom-2 for 27 of 33 models). The three middle domains were not statistically distinguishable (Kendall's W = .164). A subject-level coherence analysis (within-domain similarity ratio = 0.95) supported treating the six-domain grouping as a useful benchmark taxonomy rather than an established latent construct.
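To make the metric concrete, here is a minimal sketch of a Type-2 AUROC for one model-domain pair. It treats the score as the probability that a correctly answered item received higher verbalized confidence than an incorrectly answered one, with ties counted as half. The function name and the toy numbers are illustrative assumptions, not the study's code or data.

```python
def type2_auroc(confidences, correct):
    """Type-2 AUROC: how well confidence separates correct from incorrect answers.

    confidences: verbalized confidence per item (e.g. 0 to 100)
    correct:     truthy if the model answered that item correctly
    Returns a value in [0, 1]; 0.5 is chance. Returns NaN if one class is empty.
    """
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return float("nan")

    # Pairwise (Mann-Whitney) formulation: count correct/incorrect pairs where
    # the correct item has the higher confidence; ties contribute one half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): six items from one hypothetical domain.
toy_confidence = [90, 80, 60, 70, 30, 50]
toy_correct = [1, 1, 1, 0, 0, 0]
toy_auroc = type2_auroc(toy_confidence, toy_correct)
```

The pairwise form is quadratic in the number of items, which is fine at 250 items per domain; a rank-based implementation would be the usual choice for larger sets.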
Key facts
- 33 frontier LLMs from eight model families were tested
- 1,500 MMLU items were administered (250 per domain)
- Six domains: Applied/Professional, Formal Reasoning, Natural Science, and three middle domains
- Total observations: 47,151
- Applied/Professional knowledge had mean AUROC = .742
- Applied/Professional ranked top-2 in 21 of 33 models
- Formal Reasoning or Natural Science ranked bottom-2 in 27 of 33 models
- Within-domain similarity ratio = 0.95
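The rank-based figures above (top-2 and bottom-2 counts, and the Kendall's W reported in the summary) can be illustrated with a short sketch. It ranks domains within each model by AUROC, counts how often a domain lands in the top k, and computes Kendall's W without tie correction. The helper names and the AUROC values are hypothetical, chosen only to show the arithmetic; the study's exact procedure may differ.

```python
def rank_desc(values):
    """Rank values so that 1 = highest. Assumes no ties among the values."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def count_top_k(rank_rows, domain_index, k=2):
    """Number of models (rows) in which the given domain ranks in the top k."""
    return sum(1 for row in rank_rows if row[domain_index] <= k)


def kendalls_w(rank_rows):
    """Kendall's coefficient of concordance (no tie correction).

    rank_rows: one list of ranks per model (rater), one rank per domain (item).
    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared deviations
    of the per-domain rank sums from their mean.
    """
    m = len(rank_rows)        # number of models
    n = len(rank_rows[0])     # number of domains
    rank_sums = [sum(row[j] for row in rank_rows) for j in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Made-up AUROCs: 3 hypothetical models (rows) x 3 hypothetical domains (columns).
toy_aurocs = [
    [0.74, 0.60, 0.62],
    [0.70, 0.58, 0.65],
    [0.72, 0.61, 0.59],
]
toy_ranks = [rank_desc(row) for row in toy_aurocs]
toy_top1_count = count_top_k(toy_ranks, domain_index=0, k=1)
toy_w = kendalls_w(toy_ranks)
```

W runs from 0 (no agreement among models on domain order) to 1 (identical order in every model), so a low value such as .164 indicates that models do not rank the domains in question consistently.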
Entities
Institutions
- arXiv