MARA Framework Introduces Query-Adaptive Mechanisms for Multimodal Document Question Answering
The Multimodal Adaptive Retrieval-Augmented (MARA) framework addresses shortcomings in retrieval-based multimodal document question answering. Existing methods rely on query-agnostic document representations that overlook salient content, and on static top-k evidence selection, which fails to adapt to the uncertain distribution of relevant information. MARA introduces query-adaptive mechanisms for both retrieval and generation: a Query-Aligned Region Encoder builds multi-level document representations and reweights them by their relevance to the query, improving retrieval precision. Additionally, the framework incorporates a Self-Re... (truncated in source). The study was published on arXiv under identifier 2604.16313v1 as a cross-listing. Retrieval-based multimodal document QA seeks to extract and combine pertinent information from complex, visually rich documents. Although retrieval-augmented generation (RAG) has excelled in text-based QA, its extension to multimodal documents remains largely unexplored.
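The contrast between static top-k selection and an adaptive cutoff can be sketched as follows. This is a minimal illustration, not MARA's actual selection rule (the paper's details are truncated in the source); the function name, the cumulative-mass threshold `tau`, and the cap `k_max` are all illustrative assumptions.

```python
import math

def select_evidence(scores, tau=0.9, k_max=8):
    """Adaptive evidence selection: keep the highest-scoring items until their
    cumulative softmax mass reaches tau (capped at k_max), instead of a fixed
    top-k cutoff. Peaked score distributions yield few items; flat ones, more."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= tau or len(chosen) == k_max:
            break
    return chosen

print(select_evidence([9.0, 1.0, 0.5, 0.2, 0.1]))  # [0] — one region dominates
print(select_evidence([1.0, 0.9, 0.8, 0.7, 0.6]))  # [0, 1, 2, 3, 4] — flat scores
```

A fixed top-k would return the same number of items in both cases; the adaptive rule returns one item when the relevance distribution is peaked and five when it is flat.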
Key facts
- The Multimodal Adaptive Retrieval-Augmented (MARA) framework is proposed for multimodal document question answering.
- Current approaches rely on query-agnostic document representations that overlook salient content.
- Static top-k evidence selection fails to adapt to the uncertain distribution of relevant information.
- MARA introduces query-adaptive mechanisms to both retrieval and generation.
- The framework includes a Query-Aligned Region Encoder that builds multi-level document representations.
- Representations are reweighted based on query relevance to improve retrieval precision.
- The research was announced on arXiv with identifier 2604.16313v1.
- Retrieval-augmented generation (RAG) has shown strong performance in text-based QA but extensions to multimodal documents are underexplored.
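The query-relevance reweighting described above can be sketched in a simple form: score each region embedding against the query embedding, convert the scores to softmax weights, and let those weights scale the regions before pooling. This is a hypothetical sketch under stated assumptions, not the Query-Aligned Region Encoder itself; the function names, cosine scoring, and `temperature` parameter are illustrative.

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def reweight_regions(query_vec, region_vecs, temperature=0.5):
    """Score each region embedding against the query, turn the scores into
    softmax weights, and scale the regions by those weights so query-relevant
    regions dominate the pooled document representation."""
    sims = [cosine(query_vec, r) / temperature for r in region_vecs]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    weighted = [[w * x for x in r] for w, r in zip(weights, region_vecs)]
    # Pool the weighted regions into a single document vector (sum over regions).
    dim = len(region_vecs[0])
    pooled = [sum(r[d] for r in weighted) for d in range(dim)]
    return weights, pooled
```

In this sketch, a region aligned with the query receives a larger weight and thus contributes more to the pooled representation, which is the intuition behind query-aware (rather than query-agnostic) document encoding.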