LLMs Show Mid-Range Degradation in Automated Short Answer Scoring
A recent investigation published on arXiv (2605.07647) examines how task-specific adaptation relates to quality-conditioned scoring agreement in automated short answer scoring (ASAS). The study evaluates three large language models (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot settings against a fine-tuned BERT-based encoder and a human expert, analyzing several hundred student responses to two open-ended biology questions with ground truth scores provided by a biology education expert. Human-to-human agreement is the highest and remains consistent across all quality levels, whereas the AI models show a decline in agreement on partially correct answers, which demand nuanced understanding. The study underscores the challenges LLMs face in few-shot settings on intricate scoring tasks.
Key facts
- Study compares GPT-5.2, GPT-4o, Claude Opus 4.5, fine-tuned BERT, and human expert
- Uses two open-ended biology items with several hundred student responses
- Ground truth scores provided by a biology education expert
- Human-human agreement is highest and stable across all quality levels
- All AI models show mid-range degradation on partially correct responses
- Task-specific adaptation reduces alignment on complex scoring tasks
- ASAS paradigm shifting from discriminative models to LLMs in few-shot settings
- Paper published on arXiv with ID 2605.07647
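The agreement pattern described above can be illustrated with a small sketch. The paper's exact agreement metric is not stated here, so this example assumes quadratic weighted kappa (QWK), a common choice in automated scoring; the score scale and data are toy values, not the study's.

```python
# Minimal sketch: quadratic weighted kappa (QWK) between two raters.
# Assumption: integer scores on a fixed scale; QWK penalizes disagreements
# by the squared distance between scores, so mid-range confusion hurts less
# per error than extreme confusion, but frequent mid-range errors still
# drag agreement down.
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """QWK = 1 - (weighted observed disagreement / weighted expected disagreement)."""
    labels = list(range(min_score, max_score + 1))
    n = len(rater_a)
    k = len(labels)
    # Observed joint distribution of (rater_a, rater_b) scores.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1 / n
    # Marginal histograms for the chance-agreement baseline.
    hist_a = Counter(rater_a)
    hist_b = Counter(rater_b)
    num = 0.0
    den = 0.0
    for i, li in enumerate(labels):
        for j, lj in enumerate(labels):
            w = (li - lj) ** 2 / (k - 1) ** 2   # quadratic weight
            expected = (hist_a[li] / n) * (hist_b[lj] / n)
            num += w * observed[i][j]
            den += w * expected
    return 1 - num / den

# Hypothetical 0-2 scale: errors concentrated on the middle score
# (partially correct answers) lower kappa relative to the expert.
expert = [0, 1, 1, 2, 2, 0, 1, 2]
model  = [0, 2, 0, 2, 2, 0, 2, 2]   # misses every score-1 response
print(round(quadratic_weighted_kappa(expert, model, 0, 2), 3))  # -> 0.76
```

A model that matched the expert on the same items would score 1.0, so comparing QWK per quality band (as the study does across quality levels) isolates where agreement breaks down.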