ARTFEED — Contemporary Art Intelligence

LLM Self-Consistency and Reasoning Effort in Automated Scoring

ai-technology · 2026-05-01

A recent study of automated scoring with large language models (LLMs) found that choosing models strategically and tuning reasoning effort mattered more than ensembling. Researchers scored 900 high school math discussions against human-scored benchmarks using models from OpenAI and Google. Temperature sampling improved accuracy over deterministic calls, but growing the ensemble from 1 to 7 samples produced no significant further gains. Higher reasoning effort showed a positive correlation with scoring accuracy, though the benefit varied by model family. An efficiency frontier analysis identified Gemini 3.1 Pro Preview at low reasoning as the most accurate but most expensive option, while GPT-5.4 Nano and Mini with no reasoning offered the best cost-performance balance.
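The self-consistency setup the study describes — sampling the same scorer several times at nonzero temperature and taking the modal score — can be sketched as follows. This is a minimal illustration, not the researchers' code; `sample_fn` stands in for a single hypothetical LLM scoring call.

```python
from collections import Counter


def self_consistent_score(sample_fn, j=7):
    """Self-consistency scoring: draw j temperature-sampled scores
    from the model and return the majority (modal) score.

    sample_fn: hypothetical zero-argument callable wrapping one LLM
    call that returns an integer score for the item being graded.
    """
    samples = [sample_fn() for _ in range(j)]
    score, _count = Counter(samples).most_common(1)[0]
    return score


# Deterministic stand-in for a stochastic LLM scorer, for illustration.
samples = iter([2, 3, 2, 1, 2])
majority = self_consistent_score(lambda: next(samples), j=5)
```

With j=1 this reduces to a single sampled call; the study's finding was that moving from j=1 to j=7 did not significantly improve agreement with human raters.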

Key facts

  • Self-consistency and reasoning effort were examined for scoring conversation-based assessment items in high school mathematics.
  • 900 student conversations were evaluated against human-scored ground truths.
  • Models from OpenAI and Google were used.
  • Temperature sampling significantly improved accuracy over deterministic calls.
  • Increasing ensemble size from j=1 to 7 produced no significant gains.
  • Higher reasoning effort showed a significant positive linear trend with scoring accuracy.
  • Benefit of reasoning effort varied by model family.
  • Gemini 3.1 Pro Preview at low reasoning was most accurate but costly; GPT-5.4 Nano and Mini with no reasoning offered best cost-performance balance.
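The efficiency frontier analysis mentioned above amounts to finding Pareto-optimal model configurations on the cost/accuracy plane. A minimal sketch, with entirely hypothetical cost and accuracy figures used only to show the mechanics:

```python
def pareto_frontier(configs):
    """Keep configurations not dominated on (cost, accuracy).

    A config is dominated if some other config is at least as cheap
    and at least as accurate, and strictly better on one axis.
    configs: list of (name, cost_per_1k_items, accuracy) tuples.
    """
    frontier = []
    for name, cost, acc in configs:
        dominated = any(
            c2 <= cost and a2 >= acc and (c2 < cost or a2 > acc)
            for _, c2, a2 in configs
        )
        if not dominated:
            frontier.append((name, cost, acc))
    return frontier


# Hypothetical numbers for illustration; not values from the study.
configs = [
    ("gemini-pro-low-reasoning", 9.0, 0.92),
    ("gpt-nano-no-reasoning", 1.0, 0.86),
    ("gpt-mini-no-reasoning", 2.0, 0.88),
    ("gpt-mini-high-reasoning", 6.0, 0.87),
]
frontier = pareto_frontier(configs)
```

Under these toy numbers the high-reasoning mini config drops out (a cheaper config matches or beats its accuracy), mirroring the study's pattern of an accurate-but-costly option at one end of the frontier and cheap no-reasoning options at the other.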

Entities

Institutions

  • OpenAI
  • Google

Sources