LLMs Show Mid-Range Degradation in Automated Short Answer Scoring
A recent investigation published on arXiv (2605.07647) examines how task-specific adaptation relates to quality-conditioned scoring agreement in automated short answer scoring (ASAS). The study evaluates three large language models (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot settings against a fine-tuned BERT-based encoder and a human expert, analyzing several hundred student responses to two open-ended biology questions with ground truth scores provided by a biology education expert. Human-to-human agreement is the highest and remains consistent across all quality levels, whereas the AI models show a decline in agreement on partially correct answers, which demand nuanced understanding. The study underscores the challenges LLMs face in few-shot settings on intricate scoring tasks.
Key facts
- Study compares GPT-5.2, GPT-4o, Claude Opus 4.5, fine-tuned BERT, and human expert
- Uses two open-ended biology items with several hundred student responses
- Ground truth scores provided by a biology education expert
- Human-human agreement is highest and stable across all quality levels
- All AI models show mid-range degradation on partially correct responses
- Task-specific adaptation reduces alignment on complex scoring tasks
- ASAS paradigm shifting from discriminative models to LLMs in few-shot settings
- Paper published on arXiv with ID 2605.07647
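The agreement pattern described above can be illustrated with a small sketch. The paper's exact agreement metric is not stated here, so this example assumes quadratic weighted kappa (QWK), a common choice in automated scoring; the score scale and data are toy values, not the study's.

```python
# Minimal sketch: quadratic weighted kappa (QWK) between two raters.
# Assumption: integer scores on a fixed scale; QWK penalizes disagreements
# by the squared distance between scores, so mid-range confusion hurts less
# per error than extreme confusion, but frequent mid-range errors still
# drag agreement down.
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """QWK = 1 - (weighted observed disagreement / weighted expected disagreement)."""
    labels = list(range(min_score, max_score + 1))
    n = len(rater_a)
    k = len(labels)
    # Observed joint distribution of (rater_a, rater_b) scores.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1 / n
    # Marginal histograms for the chance-agreement baseline.
    hist_a = Counter(rater_a)
    hist_b = Counter(rater_b)
    num = 0.0
    den = 0.0
    for i, li in enumerate(labels):
        for j, lj in enumerate(labels):
            w = (li - lj) ** 2 / (k - 1) ** 2   # quadratic weight
            expected = (hist_a[li] / n) * (hist_b[lj] / n)
            num += w * observed[i][j]
            den += w * expected
    return 1 - num / den

# Hypothetical 0-2 scale: errors concentrated on the middle score
# (partially correct answers) lower kappa relative to the expert.
expert = [0, 1, 1, 2, 2, 0, 1, 2]
model  = [0, 2, 0, 2, 2, 0, 2, 2]   # misses every score-1 response
print(round(quadratic_weighted_kappa(expert, model, 0, 2), 3))  # -> 0.76
```

A model that matched the expert on the same items would score 1.0, so comparing QWK per quality band (as the study does across quality levels) isolates where agreement breaks down.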