MoVT: Adaptive Visual Reasoning via Mixture-of-Visual-Thoughts
Researchers have introduced Mixture-of-Visual-Thoughts (MoVT), an adaptive reasoning framework that unifies multiple reasoning modes within a single model and selects the appropriate mode based on context. MoVT is trained with AdaVaR, a two-stage Adaptive Visual Reasoning learning framework. The first stage, a supervised cold start, teaches the model the different reasoning modes and unifies them. The second stage uses reinforcement learning with the AdaGRPO algorithm to induce the ability to select among modes. Experiments indicate that AdaVaR guides the model to learn and distinguish multiple modes, yielding context-adaptive mode selection and consistent improvements across various scenarios. The paper is available on arXiv (2509.22746).
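The summary does not spell out how AdaGRPO works, so the following is only a minimal sketch of the generic group-relative advantage used by GRPO-style methods, not the paper's algorithm. It may help show how a reward signal could favor one reasoning mode over another for a given prompt. The mode labels (`mode_a`, `mode_b`), the reward values, and both helper functions are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Generic GRPO-style normalization within one group of rollouts:
    advantage_i = (r_i - mean(r)) / (std(r) + eps).
    This is NOT the AdaGRPO rule, whose details the summary does not give."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def mean_advantage_by_mode(rollouts):
    """Average the group-relative advantage over rollouts sharing a mode.
    `rollouts` is a list of dicts with hypothetical keys 'mode' and 'reward'."""
    advantages = group_relative_advantages([r["reward"] for r in rollouts])
    buckets = defaultdict(list)
    for rollout, adv in zip(rollouts, advantages):
        buckets[rollout["mode"]].append(adv)
    return {mode: mean(values) for mode, values in buckets.items()}


# Hypothetical group of four rollouts sampled for a single prompt.
rollouts = [
    {"mode": "mode_a", "reward": 1.0},
    {"mode": "mode_a", "reward": 0.0},
    {"mode": "mode_b", "reward": 1.0},
    {"mode": "mode_b", "reward": 1.0},
]

per_mode = mean_advantage_by_mode(rollouts)
for mode, adv in sorted(per_mode.items()):
    print(f"{mode}: mean advantage {adv:+.3f}")
```

In this toy group, the mode whose rollouts earn higher rewards ends up with a positive average advantage, so a policy-gradient update would raise the probability of choosing it for similar prompts. How AdaGRPO shapes that signal is left to the paper.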
Key facts
- MoVT unifies different reasoning modes within a single model.
- AdaVaR is a two-stage Adaptive Visual Reasoning learning framework.
- A supervised cold-start stage teaches the model the different reasoning modes and unifies them.
- Mode selection capability is induced via reinforcement learning with the AdaGRPO algorithm.
- Experiments show consistent improvement across various scenarios.
- Paper available on arXiv: 2509.22746.
- Focus is on general visual reasoning capabilities.
- Method is context-adaptive.
Entities
Institutions
- arXiv