VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
Researchers have introduced VideoRouter, a query-adaptive dual-router framework built on InternVL for efficient long-video understanding. The system targets a key scalability bottleneck in large multimodal video models: lengthy visual-token sequences that drive up memory usage and latency. VideoRouter pairs a Semantic Router, which predicts the dominant allocation policy for a query (broad temporal coverage or adaptive high-resolution retention), with an Image Router, which scores frame relevance using early LLM layers. Together they allow aggressive compression of less relevant frames while preserving detail on critical evidence frames. To train the routers, the team built Video-QTR-10K, a dataset for learning allocation policies. Unlike fixed compression strategies, this query-adaptive approach aims to allocate the token budget effectively when visual evidence is unevenly distributed over time.
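The dual-routing idea described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the class and function names, the keyword heuristic standing in for the Semantic Router, and the token-budget arithmetic are hypothetical, not the paper's actual API (which would derive relevance scores from early LLM layers rather than take them as given).

```python
# Hypothetical sketch of query-adaptive dual routing; names, heuristics,
# and budget numbers are illustrative assumptions, not VideoRouter's API.
from dataclasses import dataclass

@dataclass
class Frame:
    index: int
    relevance: float  # a real system would score this via early LLM layers

def semantic_route(query: str) -> str:
    """Toy stand-in for the Semantic Router: pick an allocation policy."""
    detail_cues = ("read", "text", "number", "color", "written")
    if any(cue in query.lower() for cue in detail_cues):
        return "high_res_retention"  # keep detail on a few evidence frames
    return "broad_coverage"          # spread tokens across the timeline

def image_route(frames: list[Frame], policy: str, budget: int) -> dict[int, int]:
    """Toy stand-in for the Image Router: assign per-frame token budgets."""
    if policy == "broad_coverage":
        per_frame = budget // len(frames)
        return {f.index: per_frame for f in frames}
    # high_res_retention: compress everything aggressively, then give the
    # leftover budget to the top quarter of frames by relevance.
    ranked = sorted(frames, key=lambda f: f.relevance, reverse=True)
    top = ranked[: max(1, len(frames) // 4)]
    alloc = {f.index: 4 for f in frames}          # aggressive compression
    extra = (budget - 4 * len(frames)) // len(top)
    for f in top:
        alloc[f.index] += extra                   # retain detail on evidence
    return alloc
```

Under a detail-seeking query, this sketch concentrates most of the token budget on the highest-relevance frames while flattening the rest to a few tokens each; a coverage-seeking query instead spreads the budget uniformly across the timeline.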
Key facts
- VideoRouter is a query-adaptive dual-router framework for long-video understanding.
- It is built on InternVL.
- The Semantic Router predicts the dominant allocation policy.
- The Image Router uses early LLM layers to score frame relevance.
- The system enables aggressive compression on less relevant frames.
- It preserves detail on critical evidence frames.
- The training dataset is Video-QTR-10K.
- The approach addresses scalability bottlenecks in large multimodal video models.