ARTFEED — Contemporary Art Intelligence

M3DocDep: LVLM-Based Multi-Page Document Chunking for RAG

other · 2026-05-20

Researchers propose M3DocDep, a pipeline that uses large vision-language models (LVLMs) to improve document chunking for retrieval-augmented generation (RAG) over multi-page industrial documents. The method recovers block-level dependencies and constructs chunks along the resulting document tree, addressing issues such as cross-page parent-child relations and figure-caption bindings. It uses SharedDet as a common DP+OCR preprocessing layer, boundary-aware SoftROI pooling for multimodal block embeddings, a biaffine head for scoring candidate parent-child edges, and MST constraints for decoding a globally valid dependency tree. The approach is described in arXiv paper 2605.18774.
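To make the edge-scoring and tree-decoding steps concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. The function names, the toy 2-d embeddings, and the weights `U`, `w`, `b` are all hypothetical. It also swaps in a simplification: it assumes a parent always precedes its child in reading order, so picking the best-scoring earlier block for each block already yields a valid tree. A general MST decoder (for example Chu-Liu/Edmonds) would be needed without that assumption.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def biaffine_score(child: Vector, parent: Vector,
                   U: Matrix, w: Vector, b: float) -> float:
    """Score the candidate edge parent -> child.

    Computes child^T U parent + w . [child; parent] + b.
    """
    bilinear = sum(
        child[i] * U[i][j] * parent[j]
        for i in range(len(child))
        for j in range(len(parent))
    )
    linear = sum(wk * xk for wk, xk in zip(w, child + parent))
    return bilinear + linear + b


def decode_tree(embeddings: List[Vector],
                U: Matrix, w: Vector, b: float) -> List[int]:
    """Return one parent index per block; block 0 is a virtual ROOT (parent -1).

    Assumption (not from the source): parents precede children in reading
    order, so edges only point forward and no cycle can form. Under that
    restriction, the per-block argmax is the maximum spanning tree.
    """
    parents = [-1]
    for child in range(1, len(embeddings)):
        best = max(
            range(child),
            key=lambda p: biaffine_score(embeddings[child], embeddings[p], U, w, b),
        )
        parents.append(best)
    return parents


# Toy block embeddings (hypothetical values), in reading order.
embeddings = [
    [1.0, 0.0],  # 0: virtual ROOT
    [2.0, 0.0],  # 1: section heading
    [1.0, 0.5],  # 2: paragraph
    [1.0, 1.0],  # 3: figure
    [0.0, 2.0],  # 4: figure caption
]
U = [[1.0, 0.0], [0.0, 1.0]]  # identity, so the bilinear term is a dot product
w = [0.0, 0.0, 0.0, 0.0]
b = 0.0

parents = decode_tree(embeddings, U, w, b)
print(parents)
```

With these toy values the paragraph and the figure attach to the section heading, and the caption attaches to the figure, which mirrors the figure-caption binding the summary mentions.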

Key facts

  • M3DocDep is an LVLM-based pipeline for multi-modal, multi-page, multi-document dependency chunking.
  • It recovers block-level dependencies and constructs chunks along the recovered document tree.
  • The pipeline uses SharedDet as a common DP+OCR preprocessing layer.
  • It extracts multimodal block embeddings with boundary-aware SoftROI pooling.
  • Candidate parent-child edges are scored with a biaffine head.
  • A globally valid dependency tree is decoded with MST constraints.
  • Tree-guided chunks are annotated with section paths and page ranges.
  • The method aims to improve RAG by keeping chunks aligned with document structure, including cross-page parent-child relations and figure-caption bindings.
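The tree-guided chunking step in the list above can be sketched as follows. This is an illustrative example only: the `Block` fields, the `"heading"`/`"body"` labels, and the grouping rule (one chunk per chain of heading ancestors) are assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Block:
    block_id: int
    parent: Optional[int]  # None for top-level blocks
    kind: str              # "heading" or "body" (hypothetical labels)
    page: int
    text: str


def heading_ancestors(block: Block, by_id: Dict[int, Block]) -> List[Block]:
    """Return the heading blocks above `block`, outermost first."""
    headings = []
    node = by_id[block.parent] if block.parent is not None else None
    while node is not None:
        if node.kind == "heading":
            headings.append(node)
        node = by_id[node.parent] if node.parent is not None else None
    return list(reversed(headings))


def build_chunks(blocks: List[Block]) -> List[dict]:
    """Group body blocks by their heading chain; add section path and page range."""
    by_id = {b.block_id: b for b in blocks}
    chunks: Dict[Tuple[int, ...], dict] = {}
    for b in blocks:
        if b.kind == "heading":
            continue
        headings = heading_ancestors(b, by_id)
        key = tuple(h.block_id for h in headings)
        chunk = chunks.setdefault(key, {
            "section_path": [h.text for h in headings],
            "pages": [b.page, b.page],  # [first page, last page]
            "text": [],
        })
        chunk["pages"][0] = min(chunk["pages"][0], b.page)
        chunk["pages"][1] = max(chunk["pages"][1], b.page)
        chunk["text"].append(b.text)
    return list(chunks.values())


# Toy document: block 4 sits on page 2 but belongs to a section opened on page 1.
blocks = [
    Block(1, None, "heading", 1, "1 Safety"),
    Block(2, 1, "heading", 1, "1.1 Lockout"),
    Block(3, 2, "body", 1, "Disconnect power."),
    Block(4, 2, "body", 2, "Verify zero energy."),
    Block(5, None, "heading", 2, "2 Maintenance"),
    Block(6, 5, "body", 3, "Inspect monthly."),
]

chunks = build_chunks(blocks)
for c in chunks:
    print(c["section_path"], c["pages"], " ".join(c["text"]))
```

Keying chunks by heading block IDs rather than heading text keeps two sections with identical titles from being merged, and the page range makes the cross-page case explicit in the chunk metadata.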

Entities

Institutions

  • arXiv

Sources