MARA Framework Introduces Query-Adaptive Mechanisms for Multimodal Document Question Answering
The Multimodal Adaptive Retrieval-Augmented (MARA) framework addresses shortcomings in retrieval-based multimodal document question answering. Existing methods rely on query-agnostic document representations that overlook salient content, and on static top-k evidence selection, which fails to adapt to the uncertain distribution of relevant information. MARA introduces query-adaptive mechanisms for both retrieval and generation: a Query-Aligned Region Encoder builds multi-level document representations and reweights them by their relevance to the query, improving retrieval precision. Additionally, the framework incorporates a Self-Re... (truncated in source). The study was published on arXiv under identifier 2604.16313v1 as a cross-listing. Retrieval-based multimodal document QA seeks to extract and combine pertinent information from complex, visually rich documents. Although retrieval-augmented generation (RAG) has excelled in text-based QA, its extension to multimodal documents remains largely unexplored.
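The contrast between static top-k selection and an adaptive cutoff can be sketched as follows. This is a minimal illustration, not MARA's actual selection rule (the paper's details are truncated in the source); the function name, the cumulative-mass threshold `tau`, and the cap `k_max` are all illustrative assumptions.

```python
import math

def select_evidence(scores, tau=0.9, k_max=8):
    """Adaptive evidence selection: keep the highest-scoring items until their
    cumulative softmax mass reaches tau (capped at k_max), instead of a fixed
    top-k cutoff. Peaked score distributions yield few items; flat ones, more."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= tau or len(chosen) == k_max:
            break
    return chosen

print(select_evidence([9.0, 1.0, 0.5, 0.2, 0.1]))  # [0] — one region dominates
print(select_evidence([1.0, 0.9, 0.8, 0.7, 0.6]))  # [0, 1, 2, 3, 4] — flat scores
```

A fixed top-k would return the same number of items in both cases; the adaptive rule returns one item when the relevance distribution is peaked and five when it is flat.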
Key facts
- The Multimodal Adaptive Retrieval-Augmented (MARA) framework is proposed for multimodal document question answering.
- Current approaches rely on query-agnostic document representations that overlook salient content.
- Static top-k evidence selection fails to adapt to the uncertain distribution of relevant information.
- MARA introduces query-adaptive mechanisms to both retrieval and generation.
- The framework includes a Query-Aligned Region Encoder that builds multi-level document representations.
- Representations are reweighted based on query relevance to improve retrieval precision.
- The research was announced on arXiv with identifier 2604.16313v1.
- Retrieval-augmented generation (RAG) has shown strong performance in text-based QA but extensions to multimodal documents are underexplored.
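The query-relevance reweighting described above can be sketched in a simple form: score each region embedding against the query embedding, convert the scores to softmax weights, and let those weights scale the regions before pooling. This is a hypothetical sketch under stated assumptions, not the Query-Aligned Region Encoder itself; the function names, cosine scoring, and `temperature` parameter are illustrative.

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def reweight_regions(query_vec, region_vecs, temperature=0.5):
    """Score each region embedding against the query, turn the scores into
    softmax weights, and scale the regions by those weights so query-relevant
    regions dominate the pooled document representation."""
    sims = [cosine(query_vec, r) / temperature for r in region_vecs]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    weighted = [[w * x for x in r] for w, r in zip(weights, region_vecs)]
    # Pool the weighted regions into a single document vector (sum over regions).
    dim = len(region_vecs[0])
    pooled = [sum(r[d] for r in weighted) for d in range(dim)]
    return weights, pooled
```

In this sketch, a region aligned with the query receives a larger weight and thus contributes more to the pooled representation, which is the intuition behind query-aware (rather than query-agnostic) document encoding.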