VLMs Generate Plausible but Visually Unsupported OCR Text in Ancient Greek Editions
A study on arXiv (2605.27750) reports that Vision-Language Models (VLMs) used for OCR on low-resource Ancient Greek critical editions produce fluent but visually unsupported text, unlike traditional OCR, which generates local recognition noise. The researchers introduced controlled image perturbations and token-level grounding measures to analyze how visual evidence is used during decoding. Under character-level perturbations, VLMs diverged sharply from the perturbed ground truth, while traditional OCR remained faithful to it. Token-level analysis showed that reliance on language priors is model-specific; an OCR-specialist model produced fluent lexical errors with little reliance on visual input.
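The comparison against a perturbed ground truth can be illustrated with a small sketch. It is not the paper's metric or code: the function names (`cer`, `prior_pull`) and the example strings are hypothetical, and it assumes plain character error rate based on Levenshtein distance. The idea is that when the page image has been altered at the character level, a transcription that matches the original text better than the altered text has been "corrected" from language priors rather than read from the image.

```python
import unicodedata


def normalize(text: str) -> str:
    """Use NFC so each polytonic Greek letter counts as one character."""
    return unicodedata.normalize("NFC", text)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    hyp, ref = normalize(hypothesis), normalize(reference)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(hyp, ref) / len(ref)


def prior_pull(hypothesis: str, original: str, perturbed: str) -> float:
    """Hypothetical score, not taken from the paper.

    Positive: the output is closer to the original text than to what the
    perturbed image shows. Negative: the output follows the image.
    """
    return cer(hypothesis, perturbed) - cer(hypothesis, original)


# Illustrative strings only (assumed, not data from the study).
original = "μῆνιν ἄειδε θεά"
perturbed = "μῆνιν ἄειδε θεξ"   # last character altered in the image

fluent_output = original         # restores the real word
faithful_output = perturbed      # transcribes what is on the page

print(prior_pull(fluent_output, original, perturbed) > 0)
print(prior_pull(faithful_output, original, perturbed) < 0)
```

Under these assumptions, a fluent restoration scores positive and a faithful transcription scores negative. A string-level score like this only detects the divergence after the fact; it does not measure how visual evidence is used during decoding, which is what the study's token-level grounding measures address.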
Key facts
- arXiv paper 2605.27750 examines VLM failures in OCR for Ancient Greek critical editions.
- VLMs generate plausible but visually unsupported text, relying on language priors.
- Traditional OCR produces local recognition noise rather than fluent errors.
- Controlled image perturbations and token-level grounding measures were introduced.
- Under character-level perturbations, VLMs diverged sharply from the perturbed ground truth; traditional OCR remained faithful to it.
- Reliance on language priors is model-specific; an OCR-specialist model produced fluent lexical errors with little reliance on visual input.
- Study compares open-weight VLMs with traditional OCR baselines.
- Ancient Greek is a low-resource language for OCR.
Entities
Institutions
- arXiv