LFRAG: Block-Level Retrieval for Multimodal Document Understanding

ai-technology · 2026-05-25

A new framework called LFRAG (Layout-oriented Fine-grained Retrieval-Augmented Generation) moves multimodal document understanding from page-level to block-level retrieval. Existing multimodal RAG systems retrieve whole pages, a coarse granularity that misses the fine-grained semantic and layout structure of visually rich documents, which compromises accuracy and adds redundant context. LFRAG first segments each page by layout into semantically coherent, fine-grained retrieval units. A semantic-layout fusion encoder then uses cross-attention to integrate each unit's local semantics with the global context. Finally, block-level late interaction retrieval aligns the query precisely with the content of individual blocks. The paper is available on arXiv as arXiv:2605.22829.
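To make block-level late interaction retrieval concrete, here is a minimal sketch. It assumes a ColBERT-style MaxSim score (each query token is matched to its most similar block token, and the matches are summed); the summary above does not state LFRAG's exact scoring function, so this is an illustration of the general technique rather than the paper's implementation. The function names, block identifiers, and toy two-dimensional embeddings are all hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim_score(query_tokens, block_tokens):
    """Late-interaction score (assumed MaxSim form, not confirmed by the source).

    For each query token embedding, take the highest cosine similarity to any
    token embedding in the block, then sum over query tokens.
    Both arguments must be non-empty lists of equal-length vectors.
    """
    q = [normalize(v) for v in query_tokens]
    b = [normalize(v) for v in block_tokens]
    return sum(
        max(sum(x * y for x, y in zip(qv, bv)) for bv in b)
        for qv in q
    )


def rank_blocks(query_tokens, blocks):
    """Score every layout block against the query and sort best-first."""
    scored = [
        (block_id, maxsim_score(query_tokens, tokens))
        for block_id, tokens in blocks.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: each block is one layout region (hypothetical IDs and embeddings).
query = [[1.0, 0.0], [0.0, 1.0]]
blocks = {
    "page1/table": [[1.0, 0.0], [0.0, 1.0]],
    "page1/caption": [[1.0, 0.0], [1.0, 0.0]],
    "page2/paragraph": [[-1.0, 0.0]],
}

ranking = rank_blocks(query, blocks)
for block_id, score in ranking:
    print(f"{block_id}: {score:.2f}")
```

Because each block is scored on its own, the retriever can return only the matching table or caption instead of the whole page, which is how a block-level unit reduces redundant context relative to page-level retrieval.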

Key facts

LFRAG stands for Layout-oriented Fine-grained Retrieval-Augmented Generation.
It advances multimodal RAG from page-level to block-level retrieval.
Layout segmentation constructs fine-grained retrieval units.
A semantic-layout fusion encoder uses cross-attention.
Block-level late interaction retrieval improves query-content alignment.
The paper is on arXiv:2605.22829.
Existing multimodal RAG systems use coarse-grained page-level retrieval.
The approach reduces redundant context in downstream tasks.
