Research Proposes New Metrics for Automated Essay Scoring Accuracy

ai-technology · 2026-04-22

A new research paper introduces two dataset-specific ceilings on quadratic weighted kappa (QWK), the metric commonly used to evaluate automated essay scoring (AES) systems. The work addresses a limitation of current evaluation methods: benchmark labels inevitably contain human scoring errors. The theoretical ceiling is the maximum QWK achievable by an ideal model that predicts latent true scores, given that label noise. The human-like ceiling provides a practical target for AES systems intended to replace a single human rater. Both ceilings are derived from classical test theory reliability concepts and can be estimated from standard two-rater benchmarks without additional annotation. The study also demonstrates that human-human QWK, which is often used as a ceiling reference, can be misleading. The paper is available on arXiv under identifier 2604.19131v1. The work aims to improve how AES accuracy is assessed ahead of potential deployment.

Key facts

Automated essay scoring is commonly evaluated using quadratic weighted kappa
Benchmark labels contain inevitable human scoring errors
Researchers derived two dataset-specific QWK ceilings from classical test theory
The theoretical ceiling represents the maximum QWK for ideal AES models
The human-like ceiling provides a practical target for AES systems replacing single raters
Ceilings can be estimated from standard two-rater benchmarks without extra annotation
Human-human QWK can be misleading as a ceiling reference
Paper published on arXiv with identifier 2604.19131v1
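
For readers unfamiliar with the metric, the following is a minimal sketch of the standard quadratic weighted kappa computed between two lists of integer scores, for example the two human raters in a two-rater benchmark. It is not the paper's ceiling estimator, which this summary does not spell out. The function name, its parameters, and the score range in the usage note are illustrative assumptions.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Standard QWK between two equal-length lists of integer scores.

    Hypothetical helper for illustration; scores must lie in
    [min_score, max_score].
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    num_categories = max_score - min_score + 1
    if num_categories < 2:
        raise ValueError("need at least two score categories")

    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: 0 on the diagonal, 1 at maximum disagreement.
            weight = (i - j) ** 2 / (num_categories - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            # Expected proportion if the two raters were independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical score")
    return 1.0 - weighted_observed / weighted_expected
```

Calling `quadratic_weighted_kappa(a, b, 1, 4)` with two identical rating lists returns 1.0, and values fall as disagreements grow larger. Human-human QWK of this kind is the quantity the paper cautions against treating as a ceiling.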
