New Framework Measures Reasoning Quality in LLMs Beyond Accuracy
Researchers have developed a multi-dimensional behavioral framework for assessing the quality of reasoning in large language models (LLMs), rather than relying on accuracy alone. The framework covers six dimensions: Correctness, Consistency, Robustness, Logical Coherence, Efficiency, and Stability. In tests on seven LLMs using 975 items drawn from four benchmarks, logical coherence showed no significant correlation with correctness (r = -0.172, ns), which suggests that a model can reach a correct answer through incoherent reasoning. Claude-Haiku-4.5 achieved the highest score on the balanced multi-dimensional assessment (Q_bal = 0.778). The paper, available on arXiv (2605.24661), argues that reasoning should be evaluated on more than final-answer accuracy.
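To make the "r = -0.172, ns" result concrete, the sketch below computes a Pearson correlation and the t-statistic used to test it. It assumes the correlation was taken across the seven models, one point per model (n = 7); the summary does not say how it was computed, so that is an illustrative assumption. The function names are hypothetical and not taken from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two samples of equal length, n >= 3")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sample has no defined correlation")
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t-statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Two-tailed critical value of Student's t at alpha = 0.05 with 5 degrees
# of freedom (n = 7 is an ASSUMPTION: one data point per model).
T_CRIT_DF5 = 2.571

t = t_statistic(-0.172, 7)
significant = abs(t) > T_CRIT_DF5
print(f"t = {t:.3f}, significant at 0.05: {significant}")
```

Under that assumption, |t| falls well below the critical value, which is consistent with the reported "ns". A non-significant correlation on so few points is weak evidence of independence, not proof of it.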
Key facts
- Framework includes six dimensions: Correctness, Consistency, Robustness, Logical Coherence, Efficiency, Stability.
- Tested on seven LLMs across 975 items from four benchmarks.
- Logical coherence showed no significant correlation with correctness (r = -0.172, ns).
- Claude-Haiku-4.5 achieved highest Q_bal score of 0.778.
- Published on arXiv with ID 2605.24661.
- Proposes behavioral perspective for measuring reasoning quality.
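The summary does not give the formula for Q_bal. As a rough illustration of how six dimension scores might be combined into one balanced score, the sketch below assumes an unweighted mean over scores normalized to [0, 1]. That aggregation rule, the function name, and the example values are all hypothetical and should not be read as the paper's method.

```python
DIMENSIONS = (
    "correctness", "consistency", "robustness",
    "logical_coherence", "efficiency", "stability",
)

def balanced_score(scores):
    """Unweighted mean of the six dimension scores.

    ASSUMPTION: equal weights and scores in [0, 1]; the actual
    definition of Q_bal is not stated in the summary.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0.0 <= scores[d] <= 1.0:
            raise ValueError(f"{d} must lie in [0, 1]")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

# Hypothetical model: accurate answers, weakly coherent reasoning.
example = {
    "correctness": 0.9, "consistency": 0.8, "robustness": 0.7,
    "logical_coherence": 0.4, "efficiency": 0.6, "stability": 0.8,
}
print(round(balanced_score(example), 3))
```

With any aggregate of this kind, a model that scores well on correctness can still rank lower overall if its reasoning is incoherent, which is the behavior a multi-dimensional assessment is meant to expose.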
Entities
Institutions
- arXiv