ARTFEED — Contemporary Art Intelligence

LLMs Show Mid-Range Degradation in Automated Short Answer Scoring

other · 2026-05-11

A recent investigation published on arXiv (2605.07647) explores how task-specific adaptation correlates with quality-conditioned scoring agreement in automated short answer scoring (ASAS). The study evaluates three large language models (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot settings, alongside a fine-tuned BERT-based encoder and a human expert, on several hundred student responses to two open-ended biology questions with ground-truth scores provided by a biology education expert. Human-to-human agreement remains the highest and stays consistent across all quality levels, whereas the AI models show a decline in agreement, particularly on partially correct answers that require nuanced understanding. The study underscores the challenges LLMs face in few-shot settings on intricate scoring tasks.
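The quality-conditioned agreement analysis described above can be sketched in a few lines. This is a minimal illustration, not the paper's method: the 0–2 score scale, the exact-match agreement rate (used here as a stand-in for whatever agreement statistic the authors report), and the toy data are all assumptions.

```python
# Sketch of quality-conditioned agreement, assuming a hypothetical
# 0-2 scale (0 = incorrect, 1 = partially correct, 2 = correct).
from collections import defaultdict

def agreement_by_quality(truth, predicted):
    """Exact-agreement rate between two raters, grouped by the
    ground-truth quality band (hypothetical metric, for illustration)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(truth, predicted):
        totals[t] += 1
        hits[t] += int(t == p)
    return {band: hits[band] / totals[band] for band in totals}

# Toy data: the rater agrees on clear-cut answers but drifts on the
# partially correct (score 1) band -- the "mid-range degradation" pattern.
truth = [0, 0, 1, 1, 1, 1, 2, 2]
model = [0, 0, 1, 2, 0, 2, 2, 2]
print(agreement_by_quality(truth, model))  # {0: 1.0, 1: 0.25, 2: 1.0}
```

Conditioning agreement on the ground-truth band, rather than reporting one overall figure, is what exposes the mid-range dip that a pooled statistic would average away.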

Key facts

  • Study compares GPT-5.2, GPT-4o, Claude Opus 4.5, fine-tuned BERT, and human expert
  • Uses two open-ended biology items with several hundred student responses
  • Ground truth scores provided by a biology education expert
  • Human-human agreement is highest and stable across all quality levels
  • All AI models show mid-range degradation on partially correct responses
  • Degree of task-specific adaptation correlates with scoring agreement on complex items
  • ASAS paradigm is shifting from discriminative models to LLMs in few-shot settings
  • Paper published on arXiv with ID 2605.07647

Entities

Institutions

  • arXiv
