VLMs Generate Plausible but Visually Unsupported OCR Text in Ancient Greek Editions

ai-technology · 2026-05-28

A study on arXiv (2605.27750) reports that Vision-Language Models (VLMs) used for OCR on Ancient Greek critical editions, a low-resource setting, produce fluent text that the page image does not support, whereas traditional OCR produces local recognition noise. The researchers introduced controlled image perturbations and token-level grounding measures to analyze how visual evidence is used during decoding. Under character-level perturbations, VLM output diverged sharply from the perturbed ground truth (the text actually shown in the altered image), while traditional OCR remained faithful to it. Token-level analysis showed that reliance on language priors is model-specific; an OCR-specialist model produced fluent lexical errors with little reliance on visual input.
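The perturbation comparison can be illustrated with a minimal sketch. It assumes character error rate (CER) as the distance measure, which may not be the metric the paper uses. The helper `prior_pull` and the sample strings are hypothetical, chosen only for illustration. The idea: if an OCR output is closer to the original text than to the text actually shown in the perturbed image, the system has "repaired" the perturbation from its language prior instead of reading the page.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate after NFC normalization.

    Normalization matters for polytonic Greek, where the same glyph can be
    encoded as one precomposed code point or as a base letter plus marks.
    """
    hyp = unicodedata.normalize("NFC", hypothesis)
    ref = unicodedata.normalize("NFC", reference)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(hyp, ref) / len(ref)


def prior_pull(output: str, original: str, perturbed: str) -> float:
    """Hypothetical score: positive when the output is closer to the
    original text than to the perturbed text shown in the image."""
    return cer(output, perturbed) - cer(output, original)


# Illustrative strings only (not data from the study).
original = "μῆνιν ἄειδε θεά"
perturbed = "μῆνιν ἄειξε θεά"   # one character swapped in the rendered image

restored_output = original       # output that silently restores the original
faithful_output = perturbed      # output that reads what is on the page

print(prior_pull(restored_output, original, perturbed))  # positive
print(prior_pull(faithful_output, original, perturbed))  # negative
```

A positive score flags output that follows the language prior; a negative score flags output that follows the image, even when the image contains an implausible string.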

Key facts

arXiv paper 2605.27750 examines VLM failures in OCR for Ancient Greek critical editions.
VLMs generate plausible but visually unsupported text, relying on language priors.
Traditional OCR produces local recognition noise rather than fluent errors.
The researchers introduced controlled image perturbations and token-level grounding measures.
Under character-level perturbations, VLMs diverged from the perturbed ground truth; traditional OCR remained faithful to it.
Prior reliance is model-specific; an OCR-specialist model produced fluent lexical errors with little reliance on visual input.
The study compares open-weight VLMs with traditional OCR baselines.
Ancient Greek is a low-resource language for OCR.
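
The summary does not describe how the token-level grounding measures are computed. As a rough illustration of the general idea, and not the paper's method, one can compare each output token's log-probability with the page image present against its log-probability with the image withheld. Tokens that are almost equally likely either way are being driven by the language prior. The class `TokenScore`, the function `prior_driven_tokens`, the threshold, and all numbers below are assumptions made up for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenScore:
    token: str
    logp_with_image: float     # log-probability given text prefix and image
    logp_without_image: float  # log-probability given text prefix only

    @property
    def visual_gain(self) -> float:
        """How much the image raises the token's log-probability."""
        return self.logp_with_image - self.logp_without_image


def prior_driven_tokens(scores: List[TokenScore], threshold: float = 0.5) -> List[str]:
    """Return tokens whose visual gain falls below the (assumed) threshold."""
    return [s.token for s in scores if s.visual_gain < threshold]


# Toy values for illustration only; these are not results from the study.
toy_scores = [
    TokenScore("μῆνιν", -0.2, -6.0),  # image helps a lot
    TokenScore("ἄειδε", -0.3, -0.4),  # image barely matters
    TokenScore("θεά", -0.1, -4.5),    # image helps a lot
]

print(prior_driven_tokens(toy_scores))
```

In practice the two log-probabilities would come from two forward passes of the same model, which requires access to token-level scores and is therefore easiest with open-weight models.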
