DocVAL: Spatial Reasoning Distillation for Document VQA

ai-technology · 2026-05-25

Researchers propose DocVAL, a validated chain-of-thought (CoT) distillation framework for document visual question answering (VQA). DocVAL transfers explicit spatial reasoning from large teacher vision-language models (VLMs) to compact student VLMs, addressing localization degradation in smaller models. The framework combines teacher-generated spatial CoT supervision, a rule-based dual-mode validator that filters low-quality signals and provides pixel-level corrective feedback, and a validation-driven two-stage training procedure with iterative refinement. The work targets efficient deployment of VLMs with strong spatial grounding in complex document layouts.
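The summary does not spell out the validator's rules, so the following is only a minimal sketch of what a rule-based, dual-mode check on a predicted answer region could look like. It assumes boxes are given as pixel coordinates, that acceptance is decided by an intersection-over-union (IoU) threshold, and that corrective feedback is expressed as per-edge pixel offsets. The names `validate`, `Verdict`, the `"filter"` and `"feedback"` modes, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels, assumed well-formed.
Box = Tuple[int, int, int, int]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned pixel boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Verdict:
    accepted: bool
    iou: float
    feedback: str  # empty when no correction is produced


def validate(pred: Box, ref: Box, mode: str = "filter",
             threshold: float = 0.5) -> Verdict:
    """Illustrative dual-mode check (an assumption, not DocVAL's actual rules).

    "filter"   : only decide whether the sample is kept.
    "feedback" : additionally report per-edge pixel corrections on rejection.
    """
    if mode not in ("filter", "feedback"):
        raise ValueError(f"unknown mode: {mode!r}")
    score = iou(pred, ref)
    accepted = score >= threshold
    if mode == "filter" or accepted:
        return Verdict(accepted, score, "")
    # Feedback mode: signed offset that would move each predicted edge onto the reference.
    names = ("x_min", "y_min", "x_max", "y_max")
    parts = [f"{n} {r - p:+d}px" for n, p, r in zip(names, pred, ref) if r != p]
    return Verdict(False, score, "shift " + ", ".join(parts))
```

Under these assumptions, `validate((0, 0, 10, 10), (100, 100, 120, 120), mode="feedback")` rejects the prediction and returns the feedback string `shift x_min +100px, y_min +100px, x_max +110px, y_max +110px`, while the same call in `"filter"` mode only reports the rejection.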

Key facts

DocVAL is a validated chain-of-thought distillation framework for document VQA.
It transfers spatial reasoning from large teacher VLMs to compact student VLMs.
It includes a rule-based dual-mode validator that filters low-quality signals and provides pixel-level corrective feedback.
It uses a validation-driven two-stage training procedure with iterative refinement.
It aims to reduce inference cost and latency while maintaining spatial grounding.
It addresses the localization degradation that compact VLMs show under standard fine-tuning.
The paper is available on arXiv as arXiv:2511.22521.
Its arXiv announce type is replace-cross.
