M3DocDep: LVLM-Based Multi-Page Document Chunking for RAG

other · 2026-05-20

Researchers propose M3DocDep, a pipeline that uses large vision-language models (LVLMs) to improve document chunking for retrieval-augmented generation (RAG) over multi-page industrial documents. The method recovers block-level dependencies and builds chunks along the resulting document tree, targeting problems such as cross-page parent-child relations and figure-caption bindings. It uses SharedDet as a common DP+OCR preprocessing layer, boundary-aware SoftROI pooling for multimodal block embeddings, a biaffine head for scoring candidate parent-child edges, and MST constraints for decoding a globally valid tree. The approach is detailed in arXiv paper 2605.18774.
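
The edge-scoring and tree-decoding steps can be illustrated with a minimal sketch. It is not the paper's implementation: the embeddings, the weights, and the names biaffine_score and decode_tree are hypothetical. It also assumes that a parent must precede its child in reading order. Under that assumption every parent assignment is already a tree, so picking the best-scoring earlier block for each child gives the maximum-scoring tree without a general MST algorithm. The paper's actual MST constraints may differ.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def biaffine_score(child: Vector, parent: Vector, U: Matrix,
                   w_child: Vector, w_parent: Vector, bias: float) -> float:
    """Biaffine edge score: child^T U parent + w_child.child + w_parent.parent + bias."""
    bilinear = sum(
        child[i] * U[i][j] * parent[j]
        for i in range(len(child))
        for j in range(len(parent))
    )
    linear = (sum(w * x for w, x in zip(w_child, child))
              + sum(w * x for w, x in zip(w_parent, parent)))
    return bilinear + linear + bias


def decode_tree(embeddings: List[Vector], root: Vector, U: Matrix,
                w_child: Vector, w_parent: Vector, bias: float) -> List[int]:
    """Return one parent index per block; -1 denotes a virtual document root.

    Assumption (not from the paper): candidate parents are the virtual root
    and blocks that come earlier in reading order, so no cycle can form.
    """
    parents = []
    for c, child in enumerate(embeddings):
        best_parent = -1
        best_score = biaffine_score(child, root, U, w_child, w_parent, bias)
        for p in range(c):  # only earlier blocks are candidates
            score = biaffine_score(child, embeddings[p], U, w_child, w_parent, bias)
            if score > best_score:
                best_parent, best_score = p, score
        parents.append(best_parent)
    return parents


# Toy 2-D embeddings (made up): heading, paragraph, figure, caption.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_root = [0.5, 0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]  # with U = I the score reduces to a dot product
zeros = [0.0, 0.0]

parents = decode_tree(toy_embeddings, toy_root, identity, zeros, zeros, 0.0)
print(parents)  # [-1, 0, -1, 2]
```

In this toy run the paragraph attaches to the heading and the caption attaches to the figure, which is the kind of figure-caption binding the pipeline is meant to preserve.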

Key facts

M3DocDep is an LVLM-based pipeline for multi-modal, multi-page, multi-document dependency chunking.
It recovers block-level dependencies and constructs chunks along the recovered document tree.
The pipeline uses SharedDet as a common DP+OCR preprocessing layer.
It extracts multimodal block embeddings with boundary-aware SoftROI pooling.
Candidate parent-child edges are scored with a biaffine head.
A globally valid dependency tree is decoded with MST constraints.
Tree-guided chunks are annotated with section paths and page ranges.
The method aims to improve RAG by preserving document structure.
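
As a rough illustration of the last two points, the sketch below turns an already-recovered dependency tree into chunks annotated with a section path and a page range. The Block and Chunk types, the build_chunks function, and the sample blocks are all hypothetical, and the grouping rule (one chunk per heading, containing its non-heading descendants) is an assumption rather than the paper's chunking policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Block:
    block_id: int
    kind: str               # "heading", "paragraph", "figure", "caption"
    text: str
    page: int
    parent: Optional[int]   # None means the block hangs off the document root


@dataclass
class Chunk:
    section_path: List[str]
    page_range: Tuple[int, int]
    text: str


def build_chunks(blocks: List[Block]) -> List[Chunk]:
    """Emit one chunk per heading, covering its non-heading descendants."""
    by_id = {b.block_id: b for b in blocks}
    children: Dict[Optional[int], List[Block]] = {}
    for b in blocks:
        children.setdefault(b.parent, []).append(b)

    def section_path(heading: Block) -> List[str]:
        # Walk up the tree and collect heading titles, outermost first.
        path = []
        node: Optional[Block] = heading
        while node is not None:
            if node.kind == "heading":
                path.append(node.text)
            node = by_id[node.parent] if node.parent is not None else None
        return list(reversed(path))

    def collect_body(node: Block) -> List[Block]:
        # Depth-first over descendants, stopping at nested headings,
        # which get their own chunk.
        body = []
        for child in children.get(node.block_id, []):
            if child.kind == "heading":
                continue
            body.append(child)
            body.extend(collect_body(child))
        return body

    chunks = []
    for b in blocks:
        if b.kind != "heading":
            continue
        members = [b] + collect_body(b)
        pages = [m.page for m in members]
        chunks.append(Chunk(
            section_path=section_path(b),
            page_range=(min(pages), max(pages)),
            text="\n".join(m.text for m in members),
        ))
    return chunks


# Made-up sample: a subsection whose content spans pages 1 to 3.
blocks = [
    Block(0, "heading", "1 Installation", 1, None),
    Block(1, "paragraph", "Mount the unit on the rail.", 1, 0),
    Block(2, "heading", "1.1 Wiring", 1, 0),
    Block(3, "figure", "[wiring diagram image]", 2, 2),
    Block(4, "caption", "Figure 2. Terminal layout.", 2, 3),
    Block(5, "paragraph", "Connect L and N as shown.", 3, 2),
]

chunks = build_chunks(blocks)
for chunk in chunks:
    print(" > ".join(chunk.section_path), chunk.page_range)
```

Because the caption is a child of the figure and the figure is a child of the "1.1 Wiring" heading, all three land in the same chunk even though they sit on different pages. Blocks with no heading ancestor are ignored in this sketch; a fuller version would need a rule for them.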
