LLM-Based Grading of Handwritten Math Shows High Accuracy

other · 2026-05-20

A recent study, available on arXiv (arXiv:2605.19043v1), examines how well vision-capable large language models (LLMs) can automate the grading of handwritten math assignments. It extends an earlier pipeline designed for typed answers by combining transcription and rubric-based assessment of photographed submissions into a single LLM call. The evaluation used student submissions from two university STEM courses and compared AI grading against human-assigned ground truth at the rubric-item level. The study reports high accuracy; for the best-performing model, 87% of errors traced back to transcription failures rather than incorrect rubric application. It also categorizes common error types and points to the potential of LLMs for scalable assessment in real educational settings.

Key facts

arXiv:2605.19043v1
Automated grading of handwritten mathematics using vision-capable LLMs
Extends prior pipeline for typed responses
Integrates transcription and rubric-based evaluation in single LLM call
Evaluated on student work from two university STEM courses
Compared AI grading against human-assigned ground truth at rubric-item level
87% of errors in best model due to transcription failures
Study categorizes common error types
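
The rubric-item-level comparison described above can be illustrated with a minimal sketch. It is not the study's code: the record layout, the field names, the cause labels ("transcription", "rubric_application") and the toy data are all assumptions made for illustration. It only shows how agreement with human ground truth and the share of errors per cause could be computed once each disagreement has been labelled by a reviewer.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RubricItemResult:
    """One rubric item on one submission (all field names are hypothetical)."""
    submission_id: str
    rubric_item: str
    human_awarded: bool                 # human-assigned ground truth
    ai_awarded: bool                    # decision produced by the LLM grader
    error_cause: Optional[str] = None   # reviewer's label for a disagreement


def item_level_accuracy(results):
    """Fraction of rubric items where the AI decision matches the human one."""
    if not results:
        raise ValueError("no results to score")
    matches = sum(r.human_awarded == r.ai_awarded for r in results)
    return matches / len(results)


def error_cause_shares(results):
    """Share of each labelled cause among the AI/human disagreements."""
    errors = [r for r in results if r.human_awarded != r.ai_awarded]
    if not errors:
        return {}
    counts = Counter(r.error_cause or "unlabelled" for r in errors)
    return {cause: n / len(errors) for cause, n in counts.items()}


# Made-up toy data, not taken from the study.
toy = [
    RubricItemResult("s1", "sets up integral", True, True),
    RubricItemResult("s1", "correct bounds", True, False, "transcription"),
    RubricItemResult("s2", "sets up integral", False, False),
    RubricItemResult("s2", "correct bounds", True, False, "rubric_application"),
]

print(item_level_accuracy(toy))
print(error_cause_shares(toy))
```

Scoring per rubric item rather than per total grade keeps each disagreement attributable to a single criterion, which is what makes a breakdown by error cause possible.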
