M3DocDep: LVLM-Based Multi-Page Document Chunking for RAG
Researchers propose M3DocDep, a pipeline that uses large vision-language models (LVLMs) to improve document chunking for retrieval-augmented generation (RAG) over multi-page industrial documents. The method recovers block-level dependencies and builds chunks along the recovered document tree, addressing problems such as cross-page parent-child relations and figure-caption bindings. It uses SharedDet as a shared DP+OCR preprocessing layer, boundary-aware SoftROI pooling for multimodal block embeddings, a biaffine head for scoring candidate parent-child edges, and MST constraints for decoding a globally valid dependency tree. The approach is detailed in arXiv paper 2605.18774.
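The edge-scoring and tree-decoding steps can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the exact biaffine form (bilinear term plus linear term plus bias), and the toy score matrix are all assumptions, and the brute-force search stands in for a real maximum-spanning-arborescence algorithm such as Chu-Liu/Edmonds. It only shows why a global tree constraint matters: picking the best parent for each block independently can produce a cycle.

```python
from itertools import product

def biaffine_score(child, parent, U, w, b):
    """Assumed biaffine form: child^T U parent + w . [child; parent] + b."""
    bilinear = sum(child[i] * U[i][j] * parent[j]
                   for i in range(len(child)) for j in range(len(parent)))
    linear = sum(wk * xk for wk, xk in zip(w, child + parent))
    return bilinear + linear + b

def is_tree(parents):
    """parents[c] is the parent of block c: 0 means ROOT, k + 1 means block k.
    The assignment is a tree if every block reaches ROOT without a cycle."""
    for start in range(len(parents)):
        seen, node = set(), start + 1
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = parents[node - 1]
    return True

def decode_tree(scores):
    """Exhaustively pick the highest-scoring valid tree (tiny inputs only).
    scores[c][p] is the score of candidate parent p for block c."""
    n = len(scores)
    best, best_total = None, float("-inf")
    for parents in product(range(n + 1), repeat=n):
        if not is_tree(parents):
            continue
        total = sum(scores[c][parents[c]] for c in range(n))
        if total > best_total:
            best, best_total = list(parents), total
    return best, best_total

# Hypothetical scores for two blocks. Columns: ROOT, block 0, block 1.
# Taken independently, block 0 prefers block 1 and block 1 prefers block 0,
# which is a cycle; the tree constraint forces one of them to attach to ROOT.
toy_scores = [[1.0, 0.0, 5.0],
              [2.0, 4.0, 0.0]]
parents, total = decode_tree(toy_scores)
```

With these toy scores the decoder attaches block 1 to ROOT and block 0 to block 1, the best assignment that is still a tree. The exhaustive search grows exponentially with the number of blocks, so it is useful only as an illustration.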
Key facts
- M3DocDep is an LVLM-based pipeline for multi-modal, multi-page, multi-document dependency chunking.
- It recovers block-level dependencies and constructs chunks along the recovered document tree.
- The pipeline uses SharedDet as a common DP+OCR preprocessing layer.
- It extracts multimodal block embeddings with boundary-aware SoftROI pooling.
- Candidate parent-child edges are scored with a biaffine head.
- A globally valid dependency tree is decoded with MST constraints.
- Tree-guided chunks are annotated with section paths and page ranges.
- The method aims to improve RAG by preserving document structure in the chunks that are retrieved.
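The last steps of the list, building chunks along the tree and annotating them with section paths and page ranges, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's method: the `Block` fields, the "heading"/"body" labels, the rule of grouping each heading with its non-heading descendants, and the sample blocks are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Block:
    text: str
    page: int
    kind: str    # "heading" or "body" (hypothetical labels)
    parent: int  # index of the parent block, -1 for the document root

def section_path(blocks, idx):
    """Heading titles from the top-level section down to block idx."""
    path, node = [], idx
    while node != -1:
        if blocks[node].kind == "heading":
            path.append(blocks[node].text)
        node = blocks[node].parent
    return list(reversed(path))

def build_chunks(blocks):
    """Group every block under its nearest enclosing heading."""
    groups = {}
    for i, block in enumerate(blocks):
        owner = i
        while owner != -1 and blocks[owner].kind != "heading":
            owner = blocks[owner].parent
        group = groups.setdefault(owner, {"pages": [], "text": []})
        group["pages"].append(block.page)
        group["text"].append(block.text)
    return [
        {
            "section_path": " > ".join(section_path(blocks, owner)),
            "page_range": (min(g["pages"]), max(g["pages"])),
            "text": "\n".join(g["text"]),
        }
        for owner, g in groups.items()
    ]

# Hypothetical recovered tree: a subsection that continues onto the next page,
# with a caption attached to its figure.
blocks = [
    Block("2 Maintenance", 3, "heading", -1),
    Block("2.1 Pump", 3, "heading", 0),
    Block("Inspect seals monthly.", 3, "body", 1),
    Block("[Figure 4]", 4, "body", 1),
    Block("Figure 4: Seal assembly.", 4, "body", 3),
    Block("Replace worn seals.", 4, "body", 1),
]
chunks = build_chunks(blocks)
```

Here the "2.1 Pump" chunk spans pages 3 to 4 and keeps the caption together with its figure, because both resolve to the same enclosing heading through the tree rather than through page order.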
Entities
Institutions
- arXiv