ARTFEED — Contemporary Art Intelligence

LFRAG: Block-Level Retrieval for Multimodal Document Understanding

ai-technology · 2026-05-25

A new framework called LFRAG (Layout-oriented Fine-grained Retrieval-Augmented Generation) improves multimodal document understanding by moving retrieval from the page level to the block level. Existing multimodal RAG systems retrieve whole pages, which overlooks the fine-grained semantic and layout structure of visually rich documents; this compromises accuracy and adds redundant context for downstream tasks. LFRAG first applies layout segmentation to split documents into semantically coherent, fine-grained retrieval units. A semantic-layout fusion encoder then uses cross-attention to integrate each unit's local semantics with global context, and block-level late-interaction retrieval aligns the query with block content more precisely. The paper is available on arXiv under ID 2605.22829.
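To make the block-level late-interaction step concrete, here is a minimal sketch in plain Python. It assumes a ColBERT-style MaxSim score (each query token is matched to its most similar block token, and the maxima are summed); the summary above does not say which scoring function LFRAG uses, so treat this as one common form of late interaction rather than the paper's method. The block IDs, the toy 2-d embeddings, and the helper names `maxsim_score` and `rank_blocks` are all hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def maxsim_score(query_tokens, block_tokens):
    """Assumed MaxSim scoring: for each query token, keep the best cosine
    similarity against any token in the block, then sum over query tokens."""
    q = [normalize(t) for t in query_tokens]
    b = [normalize(t) for t in block_tokens]
    return sum(
        max(sum(x * y for x, y in zip(qt, bt)) for bt in b)
        for qt in q
    )

def rank_blocks(query_tokens, blocks, top_k=2):
    """Score every layout block and return the top_k (block_id, score) pairs."""
    scored = [(bid, maxsim_score(query_tokens, toks)) for bid, toks in blocks.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Hypothetical blocks from one segmented page, each with toy token embeddings.
blocks = {
    "page1/table": [[1.0, 0.0], [0.9, 0.1]],
    "page1/caption": [[0.0, 1.0]],
    "page1/body": [[0.7, 0.7]],
}
query = [[1.0, 0.0], [0.8, 0.2]]

ranking = rank_blocks(query, blocks)
for block_id, score in ranking:
    print(f"{block_id}: {score:.3f}")
```

Because each block is scored on its own, only the best-matching blocks need to be passed on as context, instead of the whole page they came from.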

Key facts

  • LFRAG stands for Layout-oriented Fine-grained Retrieval-Augmented Generation.
  • It advances multimodal RAG from page-level to block-level retrieval.
  • Layout segmentation constructs fine-grained retrieval units.
  • A semantic-layout fusion encoder uses cross-attention.
  • Block-level late interaction retrieval improves query-content alignment.
  • The paper is available on arXiv (arXiv:2605.22829).
  • Existing multimodal RAG systems use coarse-grained page-level retrieval.
  • The approach reduces redundant context in downstream tasks.
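The cross-attention used by the fusion encoder can be pictured with a small sketch. It assumes the simplest possible form: a block's token vectors act as queries, a set of global-context vectors act as keys and values, and the attended result is added back to the block tokens as a residual. There are no learned projections, multiple heads, or layout features here, and the function name `cross_attend` is hypothetical; the actual encoder architecture is not described in this summary.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(block_tokens, context_tokens):
    """Assumed fusion step: each block token attends over the context tokens
    (scaled dot-product attention) and adds the attended vector to itself."""
    dim = len(block_tokens[0])
    fused = []
    for q in block_tokens:
        # Similarity of this block token to every context token.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in context_tokens
        ]
        weights = softmax(scores)
        # Weighted average of the context vectors.
        attended = [
            sum(w * v[i] for w, v in zip(weights, context_tokens))
            for i in range(dim)
        ]
        # Residual connection keeps the block's local semantics.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused

# Toy example: two block tokens fused with three context tokens.
block = [[1.0, 0.0], [0.0, 1.0]]
context = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused_block = cross_attend(block, context)
```

The output has one fused vector per block token, so the result can still be scored token by token in a late-interaction retriever.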

Entities

Institutions

  • arXiv
