ARTFEED — Contemporary Art Intelligence

VLMs Generate Plausible but Visually Unsupported OCR Text in Ancient Greek Editions

ai-technology · 2026-05-28

A study on arXiv (2605.27750) reports that Vision-Language Models (VLMs) used for OCR on Ancient Greek critical editions, a low-resource setting, produce fluent but visually unsupported text, whereas traditional OCR produces local recognition noise. The researchers introduced controlled image perturbations and token-level grounding measures to analyze how much visual evidence the models use during decoding. Under character-level perturbations, VLM output diverged sharply from the perturbed ground truth, while traditional OCR remained faithful to it. Token-level analysis showed that reliance on language priors is model-specific; an OCR-specialist model produced fluent lexical errors with little reliance on visual input.
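The perturbation comparison can be illustrated with a minimal Python sketch. It is not the paper's code or its metric: the function names (`edit_distance`, `cer`, `prior_pull`) and the example strings are hypothetical, chosen only to show how scoring an OCR output against both the perturbed and the original text separates "reads what is on the page" from "writes what the language model expects".

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of hypothesis against reference.

    Both strings are NFC-normalized first so that precomposed and
    decomposed Greek diacritics compare as equal.
    """
    hyp = unicodedata.normalize("NFC", hypothesis)
    ref = unicodedata.normalize("NFC", reference)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(hyp, ref) / len(ref)


def prior_pull(output: str, original: str, perturbed: str) -> float:
    """Hypothetical score: positive when the output is closer to the original
    text than to the perturbed text actually shown in the image."""
    return cer(output, perturbed) - cer(output, original)


# Illustrative strings only (not data from the study): the image shows a
# perturbed word with one character changed.
original = "λόγος"
perturbed = "λόγας"

vlm_like_output = "λόγος"  # restores the expected word, ignoring the image
ocr_like_output = "λόγας"  # transcribes what is actually on the page

print(prior_pull(vlm_like_output, original, perturbed))
print(prior_pull(ocr_like_output, original, perturbed))
```

In this toy case the first score is positive (the output was pulled toward the expected word) and the second is negative (the output stayed with the perturbed image). A real evaluation would aggregate over many lines and would need the study's own perturbation protocol.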

Key facts

  • arXiv paper 2605.27750 examines VLM failures in OCR for Ancient Greek critical editions.
  • VLMs generate plausible but visually unsupported text, relying on language priors.
  • Traditional OCR produces local recognition noise rather than fluent errors.
  • Controlled image perturbations and token-level grounding measures were introduced.
  • Under character-level perturbations, VLMs diverged sharply from the perturbed ground truth; traditional OCR remained faithful to it.
  • Reliance on language priors is model-specific; an OCR-specialist model produced fluent lexical errors with little reliance on visual input.
  • The study compares open-weight VLMs with traditional OCR baselines.
  • Ancient Greek is a low-resource language for OCR.

Entities

Institutions

  • arXiv

Sources