Research Proposes New Metrics for Automated Essay Scoring Accuracy
A new research paper introduces two dataset-specific ceilings on quadratic weighted kappa (QWK), the metric commonly used to evaluate automated essay scoring (AES) systems. The theoretical ceiling is the maximum QWK that an ideal model predicting latent true scores could achieve against noisy benchmark labels. The human-like ceiling is a practical target for AES systems intended to replace a single human rater. Both ceilings are derived from reliability concepts in classical test theory and can be estimated from standard two-rater benchmarks without additional annotation. The work addresses a limitation of current evaluation methods: benchmark labels inevitably contain human scoring errors. The study also shows that human-human QWK, often used as a ceiling reference, can be misleading. The paper was published on arXiv under the identifier 2604.19131v1 and aims to improve how AES accuracy is assessed ahead of potential deployment.
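For readers unfamiliar with the metric, the following is a minimal sketch of the standard QWK computation between two sets of integer scores. It is not the paper's ceiling estimator, which this summary does not specify. The function name `quadratic_weighted_kappa`, its parameters, and the toy score lists are hypothetical and chosen only for illustration.

```python
def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Standard QWK between two equal-length lists of integer scores.

    Illustrative helper (hypothetical name); not taken from the paper.
    """
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    if max_score <= min_score:
        raise ValueError("the score scale needs at least two categories")
    k = max_score - min_score + 1
    n = len(a)

    # Observed joint counts: rows index scores in a, columns scores in b.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1

    # Marginal histograms for each set of scores.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic penalty: larger disagreements cost more.
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # chance-level count
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        raise ValueError("QWK is undefined when both lists are one constant score")
    return 1.0 - weighted_observed / weighted_expected


# Toy data (made up): hypothetical true scores and one rater's noisy labels.
true_scores = [1, 2, 3, 4, 2, 3]
noisy_labels = [1, 2, 3, 4, 3, 2]

# Even a predictor that outputs the true scores falls short of QWK = 1
# when it is evaluated against labels that contain scoring errors.
ideal_vs_labels = quadratic_weighted_kappa(true_scores, noisy_labels, 1, 4)
print(round(ideal_vs_labels, 3))
```

In this toy case, a predictor that returns the true scores exactly still scores below 1 against the noisy labels, which is the intuition behind a dataset-specific ceiling. The same function applied to two human raters' scores gives the human-human QWK that the study cautions against treating as a ceiling.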
Key facts
- Automated essay scoring (AES) is commonly evaluated using quadratic weighted kappa (QWK)
- Benchmark labels contain inevitable human scoring errors
- Researchers derived two dataset-specific QWK ceilings from classical test theory
- The theoretical ceiling represents the maximum QWK achievable by an ideal AES model
- The human-like ceiling provides a practical target for AES systems replacing a single human rater
- Ceilings can be estimated from standard two-rater benchmarks without extra annotation
- Human-human QWK can be misleading as a ceiling reference
- The paper was published on arXiv with the identifier 2604.19131v1
Entities
Institutions
- arXiv