IVR-R1: Iterative Visual-Grounded Reasoning for RL-Based Multimodal LLMs
IVR-R1 (Iterative Visual-grounded Reasoning) is a reinforcement learning training framework that targets visual hallucination and logical errors in multimodal large language models during long-horizon reasoning. Its core mechanism is dynamic visual re-alignment, which proactively rectifies reasoning trajectories during policy optimization. A reward-driven screening mechanism identifies flawed rollouts, which then receive fine-grained corrections. The approach is motivated by the information asymmetry between text and visual context, which weakens visual grounding as the reasoning sequence grows longer. The paper is available on arXiv under the identifier 2605.23997.
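The reward-based screening idea can be illustrated with a minimal sketch. This is not the paper's implementation: the `Rollout` type, the `screen_rollouts` function, and the fixed threshold of 0.5 are all hypothetical, chosen only to show how rollouts might be split into those kept as-is and those flagged for correction based on a scalar reward.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical container for one sampled reasoning trajectory."""
    text: str
    reward: float


def screen_rollouts(rollouts, threshold=0.5):
    """Split rollouts into (kept, flagged) using an assumed reward threshold.

    Rollouts scoring below the threshold are treated as flawed and returned
    separately so that a later step could correct them.
    """
    kept = [r for r in rollouts if r.reward >= threshold]
    flagged = [r for r in rollouts if r.reward < threshold]
    return kept, flagged


# Illustrative data only; the rewards are made up for this example.
rollouts = [
    Rollout("answer A", 0.9),
    Rollout("answer B", 0.2),
    Rollout("answer C", 0.5),
]
kept, flagged = screen_rollouts(rollouts)
```

In this sketch the flagged rollouts are only separated out. How IVR-R1 actually corrects them, and how its screening criterion is defined, is not specified in this summary.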
Key facts
- IVR-R1 is a novel RL training framework for multimodal LLMs.
- It addresses visual hallucination and logical errors in long-horizon reasoning.
- The method uses dynamic visual re-alignment to rectify reasoning trajectories.
- A reward-driven screening mechanism identifies flawed rollouts.
- The paper is available on arXiv as 2605.23997.
- The framework aims to overcome information asymmetry between text and visuals.
- It performs fine-grained corrections during policy optimization.
- The approach targets complex visual reasoning tasks.
Entities
Institutions
- arXiv