ROVER: Lightweight Plugin for Grounded Multi-Image Reasoning in MLLMs
Researchers have introduced ROVER (Routing Object-centric Visual Evidence for grounded multi-image Reasoning), a lightweight, learnable plugin for multimodal large language models (MLLMs). Existing methods rely on cropped image patches or region-of-interest (RoI) features, which weaken holistic scene understanding and increase decoding costs. ROVER instead injects a step-specific token triplet for each object grounding prediction. The triplet aggregates the reasoning context and distills intra-image cues, routing global visual evidence without fine-grained supervision or complex heuristics. The paper is available on arXiv under ID 2605.27959 and addresses adaptive visual feature selection in MLLMs.
Key facts
- ROVER is a lightweight, learnable plugin for MLLMs.
- It routes global visual evidence for grounded multi-image reasoning.
- Existing methods use cropped image patches or RoI features, which weaken scene understanding.
- ROVER injects a step-specific token triplet per object grounding prediction.
- The triplet aggregates reasoning context and distills intra-image cues.
- No fine-grained supervision or complex heuristics are required.
- Published on arXiv with ID 2605.27959.
- Aims to improve efficiency and holistic understanding in MLLMs.
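To make the idea of a context-conditioned token triplet more concrete, here is a minimal sketch in NumPy. It is not the paper's architecture, which this summary does not describe. It assumes, purely for illustration, that three learnable query vectors are shifted by a projection of the current reasoning context and then attend over the full (uncropped) patch features of all images. The function `route_triplet` and the parameters `queries` and `w_ctx` are hypothetical names.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def route_triplet(context, image_feats, queries, w_ctx):
    """Hypothetical routing step (an assumption, not ROVER's actual design).

    context:     (d,) reasoning hidden state at the current step
    image_feats: list of (n_i, d) arrays, global patch features per image
    queries:     (3, d) learnable queries, one per triplet token
    w_ctx:       (d, d) learnable projection of the context
    Returns (3, d) triplet tokens and (3, N) attention weights.
    """
    # Keep every patch of every image: no cropping, no RoI pooling.
    feats = np.concatenate(image_feats, axis=0)      # (N, d)
    # Condition each query on the reasoning context (broadcast over 3 rows).
    q = queries + context @ w_ctx                    # (3, d)
    # Scaled dot-product attention over all patches.
    scores = q @ feats.T / np.sqrt(feats.shape[1])   # (3, N)
    weights = softmax(scores, axis=-1)
    tokens = weights @ feats                         # (3, d)
    return tokens, weights


rng = np.random.default_rng(0)
d = 8
image_feats = [rng.normal(size=(4, d)), rng.normal(size=(6, d))]  # two images
context = rng.normal(size=d)
queries = rng.normal(size=(3, d))
w_ctx = rng.normal(size=(d, d)) * 0.1

tokens, weights = route_triplet(context, image_feats, queries, w_ctx)
print(tokens.shape, weights.shape)  # (3, 8) (3, 10)
```

In this sketch the three output tokens would be inserted into the model's input sequence at the step where an object grounding prediction is made. How ROVER actually builds and trains its triplet should be taken from the paper itself.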
Entities
Institutions
- arXiv