DocVAL: Spatial Reasoning Distillation for Document VQA
Researchers propose DocVAL, a validated chain-of-thought (CoT) distillation framework for document visual question answering (VQA). DocVAL transfers explicit spatial reasoning from large teacher vision-language models (VLMs) to compact student VLMs, addressing the localization degradation that compact models show under standard fine-tuning. The framework combines three components: teacher-generated spatial CoT supervision, a rule-based dual-mode validator that filters low-quality signals and provides pixel-level corrective feedback, and a validation-driven two-stage training procedure with iterative refinement. The work targets efficient deployment of VLMs that retain strong spatial grounding in complex document layouts.
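To make the idea of a dual-mode validator concrete, here is a minimal sketch in Python. It is not the DocVAL implementation: the summary does not state the actual rules, so the IoU threshold, the exact-match answer check, the per-edge pixel deltas, and all names (`iou`, `Verdict`, `validate`, the `"filter"` and `"feedback"` mode strings) are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Verdict:
    accepted: bool
    answer_ok: bool
    # Per-edge pixel corrections (reference minus prediction); feedback mode only.
    box_deltas: Optional[Box] = None


def validate(pred_box: Box, ref_box: Box, pred_answer: str, ref_answer: str,
             mode: str, iou_threshold: float = 0.5) -> Verdict:
    """Assumed rules: exact-match answer plus an IoU threshold on the box."""
    if mode not in ("filter", "feedback"):
        raise ValueError(f"unknown mode: {mode!r}")
    answer_ok = pred_answer.strip().lower() == ref_answer.strip().lower()
    box_ok = iou(pred_box, ref_box) >= iou_threshold
    accepted = answer_ok and box_ok
    if mode == "filter" or accepted:
        # Filter mode only says keep or drop; accepted samples need no correction.
        return Verdict(accepted, answer_ok)
    # Feedback mode: report how far each box edge is from the reference, in pixels.
    deltas = tuple(r - p for p, r in zip(pred_box, ref_box))
    return Verdict(False, answer_ok, deltas)


pred, ref = (10, 10, 60, 60), (40, 10, 120, 60)
print(validate(pred, ref, "42", "42", mode="filter"))
print(validate(pred, ref, "42", "42", mode="feedback"))
```

In this sketch, filter mode would be applied to teacher outputs to decide which samples to keep, while feedback mode would return a correction signal for a student prediction. How DocVAL actually wires the two modes into its training stages is not specified in this summary.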
Key facts
- DocVAL is a validated chain-of-thought distillation framework for document VQA.
- It transfers spatial reasoning from large teacher VLMs to compact student VLMs.
- It includes a rule-based dual-mode validator that filters low-quality signals and provides pixel-level corrective feedback.
- It uses a validation-driven two-stage training procedure with iterative refinement.
- It aims to reduce inference cost and latency while maintaining spatial grounding.
- It addresses the localization degradation of compact VLMs under standard fine-tuning.
- The paper is published on arXiv with ID 2511.22521.
- The arXiv announce type is replace-cross (a revised version of a cross-listed submission).
Entities
Institutions
- arXiv