OTT-Vid: Optimal Transport Token Compression for Video LLMs
Researchers propose OTT-Vid, a training-free token compression method for Video Large Language Models (Video-LLMs) that uses optimal transport to reduce visual tokens across frames. The method has two stages. Spatial pruning identifies representative content within each frame. Optimal transport between neighboring frames then estimates temporal compressibility, using non-uniform token mass so that semantically important tokens are protected. The goal is to curb the inference cost of Video-LLMs, which grows with the volume of visual tokens as models scale to longer videos. The approach is presented as an improvement over existing methods that rely on cross-frame similarity or segmentation heuristics.
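To make the temporal stage more concrete, the sketch below computes an entropic optimal transport cost between the tokens of two neighboring frames, with per-token importance weights acting as non-uniform mass. It is only an illustration under stated assumptions, not the paper's implementation: the cosine cost, the Sinkhorn solver, the `eps` and `iters` values, and every function name here are hypothetical choices. The intuition is that a low transport cost suggests the two frames are largely redundant, while placing more mass on important tokens makes changes to those tokens weigh more heavily in the cost.

```python
import math


def cosine_cost(a, b):
    """Assumed ground cost: 1 - cosine similarity between two token vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-12)


def normalize(weights):
    """Scale non-negative importance weights so they sum to 1 (token mass)."""
    total = sum(weights)
    return [w / total for w in weights]


def sinkhorn(cost, mass_a, mass_b, eps=0.1, iters=200):
    """Entropic-regularized transport plan via Sinkhorn iterations.

    Row sums approach mass_a and column sums approach mass_b.
    """
    n, m = len(mass_a), len(mass_b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        u = [mass_a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [mass_b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(tokens_a, tokens_b, importance_a, importance_b):
    """Total cost of moving frame A's token mass onto frame B's tokens."""
    cost = [[cosine_cost(a, b) for b in tokens_b] for a in tokens_a]
    plan = sinkhorn(cost, normalize(importance_a), normalize(importance_b))
    return sum(
        plan[i][j] * cost[i][j]
        for i in range(len(tokens_a))
        for j in range(len(tokens_b))
    )


# Toy frames with two 2-D tokens each (hypothetical values).
prev_frame = [[1.0, 0.0], [0.0, 1.0]]
same_frame = [[1.0, 0.0], [0.0, 1.0]]      # unchanged content
changed_frame = [[1.0, 0.0], [-1.0, 0.0]]  # second token changed

redundant = transport_cost(prev_frame, same_frame, [1, 1], [1, 1])
novel = transport_cost(prev_frame, changed_frame, [1, 1], [1, 1])
# Same frames, but most of the mass sits on the token that changed.
novel_important = transport_cost(prev_frame, changed_frame, [1, 9], [1, 9])
# Most of the mass sits on the token that stayed the same.
novel_unimportant = transport_cost(prev_frame, changed_frame, [9, 1], [9, 1])
```

In this toy setup the unchanged pair yields a cost near zero, and the changed pair costs more when the changed token carries most of the mass. A compression scheme could use such a cost to decide how aggressively to merge or drop tokens between neighboring frames, though how OTT-Vid maps transport results to a token budget is not described in this summary.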
Key facts
- OTT-Vid is a training-free token compression method for Video-LLMs.
- It uses optimal transport between neighboring frames for temporal compression.
- Spatial pruning identifies representative content within each frame.
- Non-uniform token mass protects semantically important tokens.
- Existing methods rely on cross-frame similarity or segmentation heuristics.
- Video-LLMs are scaling to longer and more complex videos.
- Inference cost grows with the large volume of visual tokens these videos produce.
- The method is proposed in arXiv paper 2605.11803.
Entities
Institutions
- arXiv