ST-SimDiff: New Framework Balances Similarity and Difference for Efficient Video Understanding
Researchers have introduced ST-SimDiff, a training-free framework for reducing the computational overhead that Multimodal Large Language Models (MLLMs) incur when processing long videos. Existing methods prune or merge visual tokens based on importance or similarity, but they overlook changes and turning points in the video content. ST-SimDiff addresses this by constructing a spatio-temporal graph over the visual tokens to model their associations, then applying a parallel dual-selection strategy: similarity-based selection uses community detection to retain representative tokens, while difference-based selection captures key events. In this way the framework balances spatio-temporal similarity, which exposes redundancy, against difference, which marks key events.
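The dual-selection idea can be illustrated with a toy sketch. This is not the paper's implementation: the function name `select_tokens`, the thresholds `sim_thresh` and `diff_thresh`, the one-dimensional patch layout, and the rule for picking representatives are all hypothetical choices made for illustration. In particular, connected components computed with union-find stand in for the community detection step, whose actual algorithm is not specified here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_tokens(frames, sim_thresh=0.9, diff_thresh=0.5):
    """Toy dual selection. frames[t][p] is the feature vector of patch p in frame t.

    Returns the sorted list of (t, p) token positions that are kept.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    T, P = len(frames), len(frames[0])
    nodes = [(t, p) for t in range(T) for p in range(P)]
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Spatio-temporal graph: spatial edges link neighbouring patches in a frame,
    # temporal edges link the same patch position in consecutive frames.
    edges = []
    for t in range(T):
        for p in range(P):
            if p + 1 < P:
                edges.append(((t, p), (t, p + 1), False))
            if t + 1 < T:
                edges.append(((t, p), (t + 1, p), True))

    changed = set()
    for u, v, temporal in edges:
        s = cosine(frames[u[0]][u[1]], frames[v[0]][v[1]])
        if s >= sim_thresh:
            # Similarity branch: merge highly similar tokens into one group.
            parent[find(u)] = find(v)
        elif temporal and s <= diff_thresh:
            # Difference branch: a sharp temporal change marks a candidate key event.
            changed.add(v)

    # Group tokens by component (a simple stand-in for community detection).
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)

    # Keep one representative per group: the token closest to the group mean.
    dim = len(frames[0][0])
    representatives = set()
    for members in groups.values():
        mean = [sum(frames[t][p][d] for t, p in members) / len(members)
                for d in range(dim)]
        representatives.add(
            max(members, key=lambda n: cosine(frames[n[0]][n[1]], mean)))

    return sorted(representatives | changed)


# Three frames of two patches each; patch 0 changes abruptly in the last frame.
frames = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
kept = select_tokens(frames)
print(kept)
```

In this toy input, the five near-identical tokens collapse to a single representative, while the token that changes in the last frame survives. That is the behavior the similarity and difference branches are meant to combine. A graph library's community detection routine could replace the union-find step without changing the overall structure.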
Key facts
- ST-SimDiff is a training-free framework for efficient video understanding with MLLMs.
- It addresses computational overhead from massive visual tokens in long videos.
- Existing methods prune or merge tokens based on importance or similarity.
- ST-SimDiff considers both similarity (for redundancy) and difference (for key events).
- It constructs a spatio-temporal graph from visual tokens.
- A parallel dual-selection strategy is used: similarity-based and difference-based.
- Similarity-based selection uses community detection to retain representative tokens.
- The framework is proposed in arXiv paper 2605.22158.
Entities
Institutions
- arXiv