Doc-CoB: Visual Chain-of-Boxes Reasoning for Document Understanding

other · 2026-05-27

Doc-CoB (Chain-of-Boxes) is a framework that improves document understanding by adding coarse-to-fine, layout-aware visual reasoning to multimodal large language models. It targets two shortcomings of existing approaches: one-pass strategies that treat every layout region uniformly, and methods that focus so narrowly on small regions that they lose the surrounding context. Doc-CoB works in two steps while keeping the whole document in view: it first selects the layout boxes most relevant to the query, then focuses on them through visual prompting. The framework is aimed at question answering and information extraction over document images, where visual elements are dense and each query is tied to specific layout regions. The paper is available on arXiv under identifier 2505.18603.
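The coarse-to-fine flow described above can be illustrated with a minimal sketch. It is not the paper's implementation: in Doc-CoB the multimodal model itself chooses the boxes, whereas here a simple word-overlap score stands in for that choice. The names `LayoutBox`, `select_boxes`, and `build_focus_prompt` are hypothetical and introduced only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutBox:
    """A layout region: an id, pixel coordinates (x0, y0, x1, y1), and its text."""
    box_id: int
    xyxy: tuple
    text: str


def _words(text):
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"\w+", text.lower()))


def select_boxes(query, boxes, top_k=2):
    """Coarse step: keep up to top_k boxes that share words with the query.

    Assumption: word overlap is a stand-in for the model's own box selection.
    """
    query_words = _words(query)

    def overlap(box):
        return len(query_words & _words(box.text))

    ranked = sorted(boxes, key=overlap, reverse=True)
    return [box for box in ranked[:top_k] if overlap(box) > 0]


def build_focus_prompt(query, selected):
    """Fine step: name the selected boxes as a chain while the full page stays in view."""
    chain = " -> ".join(f"box {b.box_id} {b.xyxy}" for b in selected)
    return f"Question: {query}\nFocus on: {chain}\nThe full page remains visible."
```

Given a page with an invoice-number line, a total-amount line, and a closing line, the query "What is the total amount due?" would select only the total-amount box, and the prompt would then point the model at that region without discarding the rest of the page.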

Key facts

The CoB in Doc-CoB stands for Chain-of-Boxes
It integrates coarse-to-fine layout-aware visual reasoning into multimodal large language models
The framework selects key layout boxes then focuses on them with visual prompting
It preserves global document information while focusing on query-relevant layouts
The method addresses limitations of one-pass strategies and overly narrow focus
It is designed for question answering and information extraction over document images
The research is published on arXiv with identifier 2505.18603
The arXiv announcement type is "replace", meaning the listing updates an earlier submission
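One way to picture how visual prompting can focus on a box while preserving global document information is to mark the box outline on the full image instead of cropping it. The sketch below does this on a plain nested-list pixel grid. It is an assumption for illustration only: the function name `draw_box_outline` is hypothetical, and the paper's actual prompting scheme may differ.

```python
def draw_box_outline(image, box, value=1):
    """Return a copy of `image` with the outline of `box` set to `value`.

    `image` is a list of rows of pixel values; `box` is (x0, y0, x1, y1)
    with inclusive coordinates. Pixels inside and outside the outline are
    left untouched, so the rest of the page is still available to the model.
    """
    x0, y0, x1, y1 = box
    marked = [row[:] for row in image]  # copy so the original page is not modified
    for x in range(x0, x1 + 1):
        marked[y0][x] = value  # top edge
        marked[y1][x] = value  # bottom edge
    for y in range(y0, y1 + 1):
        marked[y][x0] = value  # left edge
        marked[y][x1] = value  # right edge
    return marked
```

Cropping to the box would instead discard everything outside it, which corresponds to the overly narrow focus the method is meant to avoid.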
