LFRAG: Block-Level Retrieval for Multimodal Document Understanding
LFRAG (Layout-oriented Fine-grained Retrieval-Augmented Generation) is a new framework that improves multimodal document understanding by shifting retrieval from the page level to the block level. Existing multimodal RAG systems retrieve whole pages, which overlooks the fine-grained semantic and layout structure of visually rich documents; this compromises accuracy and adds redundant context. LFRAG first segments each page by layout to produce semantically coherent, fine-grained retrieval units. A semantic-layout fusion encoder then uses cross-attention to integrate each unit's local semantics with the global context. Finally, block-level late interaction retrieval aligns the query precisely with the relevant content. The paper is available on arXiv as arXiv:2605.22829.
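To make the idea of block-level late interaction concrete, here is a minimal sketch. It assumes a ColBERT-style MaxSim score (each query token is matched to its most similar block token, and the maxima are summed); the summary above does not state LFRAG's exact scoring function, so treat this as an illustration of the general technique rather than the paper's method. The function names, block ids, and 2-d embeddings are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], block_tokens: List[Vector]) -> float:
    """Late-interaction score (assumed ColBERT-style MaxSim): for each query
    token take its best-matching block token, then sum over the query."""
    return sum(max(dot(q, b) for b in block_tokens) for q in query_tokens)


def rank_blocks(query_tokens: List[Vector],
                blocks: Dict[str, List[Vector]],
                top_k: int = 2) -> List[str]:
    """Return the ids of the top_k blocks by MaxSim score, best first."""
    ranked = sorted(blocks,
                    key=lambda block_id: maxsim_score(query_tokens, blocks[block_id]),
                    reverse=True)
    return ranked[:top_k]


# Toy token embeddings for three layout blocks (hypothetical values).
blocks = {
    "p1_title": [[1.0, 0.0], [0.9, 0.1]],
    "p1_table": [[0.0, 1.0], [0.1, 0.9]],
    "p2_figure": [[0.7, 0.7]],
}
query = [[0.0, 1.0], [0.2, 0.8]]

print(rank_blocks(query, blocks))  # -> ['p1_table', 'p2_figure']
```

Because every block keeps its own token embeddings, the retriever can return a single table or figure instead of the whole page that contains it, which is the source of the reduced redundancy described above.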
Key facts
- LFRAG stands for Layout-oriented Fine-grained Retrieval-Augmented Generation.
- It advances multimodal RAG from page-level to block-level retrieval.
- Layout segmentation constructs fine-grained retrieval units.
- A semantic-layout fusion encoder uses cross-attention.
- Block-level late interaction retrieval improves query-content alignment.
- The paper is available on arXiv as arXiv:2605.22829.
- Existing multimodal RAG systems use coarse-grained page-level retrieval.
- Retrieving blocks instead of whole pages reduces redundant context in downstream tasks.
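The cross-attention fusion mentioned in the key facts can be sketched as follows. This is a simplified illustration under stated assumptions: block tokens act as queries over page-level context tokens, there is a single attention head with no learned projections, and the attended context is added back as a residual. None of these details is confirmed by the summary, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(block_tokens: List[Vector],
                 context_tokens: List[Vector]) -> List[Vector]:
    """Fuse local block tokens with global context (assumed design).

    Each block token is a query; the context tokens serve as both keys and
    values. The attended context vector is added to the block token.
    """
    dim = len(context_tokens[0])
    fused = []
    for q in block_tokens:
        # Scaled dot-product attention scores against every context token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in context_tokens]
        weights = softmax(scores)
        # Weighted average of the context tokens.
        attended = [sum(w * v[i] for w, v in zip(weights, context_tokens))
                    for i in range(dim)]
        # Residual connection keeps the block's local semantics.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


# With a single context token, every block token receives it unchanged.
print(cross_attend([[1.0, 0.0]], [[0.0, 2.0]]))  # -> [[1.0, 2.0]]
```

A real encoder would add learned query, key, and value projections, multiple heads, and layout features such as bounding-box coordinates; the sketch only shows how a block representation can absorb information from the rest of the page.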
Entities
Institutions
- arXiv