ARTFEED — Contemporary Art Intelligence

Confidence-Based Cascade Scoring for Educational Assessment

ai-technology · 2026-04-24

A recent study published on arXiv (2604.19781) investigates whether verbalized confidence from small language models (LMs) can serve as a routing signal in cascade systems for the automated evaluation of student assignments. In this cascade strategy, small LMs handle simpler scoring tasks and route more challenging ones to larger LMs. Researchers analyzed 2,100 expert-scored decisions from student-AI math dialogues, using model pairs from GPT-5.4, Claude 4.5+, and Gemini 3.1. Results show wide variation in confidence discrimination among small LMs: the best achieved an AUROC of 0.857, while the worst produced nearly uniform confidence distributions. The best cascade approached large-LM accuracy (kappa 0.802 vs. 0.819), suggesting cascades can balance accuracy, cost, and latency in automated scoring.
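The routing mechanism described above can be sketched as a simple threshold rule: accept the small LM's score when its verbalized confidence is high, otherwise fall back to the large LM. The scorer stubs, field names, and the 0.75 threshold below are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class ScoreResult:
    score: int         # rubric score assigned to the dialogue
    confidence: float  # verbalized confidence in [0, 1]
    model: str         # which model produced the score

def small_lm_score(dialogue: str) -> ScoreResult:
    # Stub: a real system would prompt a small LM to emit a score
    # plus a verbalized confidence (e.g. "Score: 1, Confidence: 0.9").
    conf = 0.9 if "correct" in dialogue else 0.4
    return ScoreResult(score=1, confidence=conf, model="small")

def large_lm_score(dialogue: str) -> ScoreResult:
    # Stub for the slower, more expensive fallback model.
    return ScoreResult(score=1, confidence=0.95, model="large")

def cascade_score(dialogue: str, threshold: float = 0.75) -> ScoreResult:
    """Route to the large LM only when the small LM is unconfident."""
    first = small_lm_score(dialogue)
    if first.confidence >= threshold:
        return first
    return large_lm_score(dialogue)
```

With this rule, only low-confidence items pay the cost of the large LM, which is how the cascade trades a small accuracy gap for lower cost and latency.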

Key facts

  • arXiv paper 2604.19781 explores verbalized confidence as routing signal in cascade scoring systems.
  • Study uses 2,100 expert-scored decisions from student-AI math conversations.
  • Models evaluated: GPT-5.4, Claude 4.5+, Gemini 3.1.
  • Best small LM achieved AUROC 0.857 for confidence discrimination.
  • Worst small LM produced near-degenerate confidence distribution.
  • Lower LM confidence correlated with annotator disagreement and longer scoring times.
  • Best cascade achieved kappa 0.802 vs. 0.819 for large LM alone.
  • Goal: balance accuracy, cost, and latency in automated scoring.
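The two headline metrics above have compact definitions. AUROC here measures confidence discrimination: the probability that a correctly scored item receives higher confidence than an incorrectly scored one (ties count half). Cohen's kappa measures agreement with expert scores, corrected for chance. The functions below are minimal illustrative implementations, not code from the study.

```python
def auroc(confidences, correct):
    """Pairwise AUROC of confidence as a predictor of correctness."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return float("nan")
    # Count concordant pairs; ties between pos and neg count 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cohen_kappa(a, b):
    """Agreement between two raters' labels, corrected for chance."""
    n = len(a)
    labels = set(a) | set(b)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # chance
    return (po - pe) / (1 - pe)
```

An AUROC of 0.857 thus means the best small LM ranked a correct decision above an incorrect one about 86% of the time, while the near-uniform case sits close to the 0.5 chance level.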

Entities

Institutions

  • arXiv
