Visual Evidence Selection Reformulated for Multimodal RAG
A new framework redefines visual evidence selection in multimodal retrieval-augmented generation (RAG) by scoring evidence on utility rather than semantic similarity. The approach, described in arXiv:2605.13277, defines the utility of a piece of evidence as the information gain it induces on a model's output distribution. Because optimizing this objective directly in answer space is intractable, the authors introduce a latent helpfulness variable and prove that, under mild assumptions, ranking evidence by information gain on this latent variable is equivalent to ranking by answer-space utility. The resulting method is training-free and surrogate-accelerated: lightweight multimodal models serve as surrogates to estimate utility efficiently. Experiments on MRAG-Bench and Visual-RAG across multiple model families demonstrate the approach's effectiveness.
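One natural way to formalize this utility definition is as entropy reduction over the model's answer distribution. The following is a hedged sketch, not the paper's exact notation: q is the query, y the answer, e a candidate piece of visual evidence, and h the latent helpfulness variable.

```latex
% A plausible formalization (the paper's exact definition may differ):
% the utility of evidence e for query q is the entropy reduction of the
% answer distribution p(y | .), i.e., information gain on the output.
\[
  U(e \mid q) \;=\; H\bigl(p(y \mid q)\bigr) - H\bigl(p(y \mid q, e)\bigr)
\]
% The latent helpfulness variable h stands in for the intractable answer
% space; per the paper, ranking by information gain on h is equivalent to
% ranking by U under mild assumptions:
\[
  \tilde{U}(e \mid q) \;=\; H\bigl(p(h \mid q)\bigr) - H\bigl(p(h \mid q, e)\bigr)
\]
```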
Key facts
- arXiv:2605.13277 proposes a utility-oriented approach to visual evidence selection.
- Existing methods rely on semantic relevance or surface-level similarity.
- Evidence utility is defined as information gain on the model's output distribution.
- A latent helpfulness variable is introduced to overcome the intractability of optimizing directly in answer space.
- Ranking evidence by information gain on the latent variable is provably equivalent to ranking by answer-space utility, under mild assumptions.
- The framework is training-free and surrogate-accelerated.
- Lightweight multimodal models act as surrogates to estimate evidence utility (a minimal sketch follows this list).
- Evaluated on MRAG-Bench and Visual-RAG across multiple model families.
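Below is a minimal sketch of how surrogate-accelerated, training-free utility scoring could look in practice. Everything here is illustrative: `answer_dist`, its interface, and the candidate pool are hypothetical placeholders rather than the paper's implementation; the scoring follows the entropy-reduction reading of information gain sketched above.

```python
import math
from typing import Callable, Optional, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def rank_evidence_by_utility(
    query: str,
    candidates: Sequence[object],  # candidate visual evidence (e.g., images)
    answer_dist: Callable[[str, Optional[object]], Sequence[float]],
) -> list[tuple[object, float]]:
    """Rank candidate evidence by estimated information gain.

    `answer_dist(query, evidence)` is assumed to return a lightweight
    surrogate model's distribution over a fixed answer (or latent
    helpfulness) space; passing evidence=None yields the evidence-free
    prior. This interface is a hypothetical stand-in, not the paper's API.
    """
    prior_entropy = entropy(answer_dist(query, None))
    scored = []
    for ev in candidates:
        posterior_entropy = entropy(answer_dist(query, ev))
        # Utility = information gain = entropy reduction from adding evidence.
        scored.append((ev, prior_entropy - posterior_entropy))
    # Highest information gain first; the top-k feed the full generator.
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

In the paper's setup the distribution would come from a lightweight multimodal model rather than the full generator, which is what makes the estimate cheap and keeps the pipeline training-free.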
Entities
- MRAG-Bench (evaluation benchmark)
- Visual-RAG (evaluation benchmark)