Doc-CoB: Visual Chain-of-Boxes Reasoning for Document Understanding
Doc-CoB (Chain-of-Boxes) is a framework that improves document understanding by integrating coarse-to-fine, layout-aware visual reasoning into multimodal large language models. It addresses two shortcomings of existing approaches: one-pass strategies that treat every layout region uniformly, and methods that focus too narrowly on small regions. Doc-CoB first selects the layout boxes most relevant to the query, then uses visual prompting to focus on them, while preserving information about the document as a whole. The framework targets question answering and information extraction over document images, where visual content is dense and each query typically concerns specific layout regions. The paper is available on arXiv under identifier 2505.18603.
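The two-step flow described above (pick the query-relevant layout boxes, then mark them on the full page) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `LayoutBox` type, the `select_boxes` and `build_visual_prompt` helpers, the sample invoice boxes, and the word-overlap scoring are hypothetical and are not taken from the paper, whose actual selection and prompting mechanisms are not detailed in this summary.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutBox:
    """Hypothetical layout region; not a type defined by the paper."""
    label: str                        # e.g. "title", "table", "paragraph"
    text: str                         # text inside the box (assumed to be available)
    bbox: tuple[int, int, int, int]   # (x0, y0, x1, y1) in page pixels


def tokenize(s: str) -> set[str]:
    """Lowercase the string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def select_boxes(query: str, boxes: list[LayoutBox], top_k: int = 2) -> list[LayoutBox]:
    """Coarse step: keep up to top_k boxes that share words with the query.

    Word overlap is a toy stand-in used only for this sketch.
    """
    q = tokenize(query)
    scored = [(len(q & tokenize(b.text)), i, b) for i, b in enumerate(boxes)]
    scored = [s for s in scored if s[0] > 0]      # drop boxes with no overlap
    scored.sort(key=lambda s: (-s[0], s[1]))      # best score first, then page order
    return [b for _, _, b in scored[:top_k]]


def build_visual_prompt(page_size: tuple[int, int], selected: list[LayoutBox]) -> dict:
    """Fine step: keep the whole page and list the regions to outline on it."""
    return {
        "page_size": page_size,                   # full page is kept, nothing is cropped
        "highlights": [b.bbox for b in selected],
    }


# Made-up example page: three layout boxes on a 600x800 invoice image.
boxes = [
    LayoutBox("title", "Invoice 2024-117", (40, 30, 560, 70)),
    LayoutBox("paragraph", "Thank you for your business", (40, 90, 560, 130)),
    LayoutBox("table", "Item Qty Price Total amount due 1,250.00", (40, 150, 560, 400)),
]
query = "What is the total amount due?"

selected = select_boxes(query, boxes)
prompt = build_visual_prompt((600, 800), selected)
print([b.label for b in selected], prompt["highlights"])
```

Returning the full `page_size` together with the highlighted regions mirrors the idea of focusing on selected boxes without discarding global context; a crop-only design would lose the rest of the document.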
Key facts
- The "CoB" in Doc-CoB stands for Chain-of-Boxes
- It integrates coarse-to-fine layout-aware visual reasoning into multimodal large language models
- The framework selects key layout boxes then focuses on them with visual prompting
- It preserves global document information while focusing on query-relevant layouts
- It addresses the limitations of one-pass strategies that treat all layouts uniformly and of approaches that focus too narrowly on small regions
- It is designed for question answering and information extraction over document images
- The research is published on arXiv with identifier 2505.18603
- The arXiv announcement type is "replace", indicating a revised version of an existing submission
Entities
Institutions
- arXiv