OTT-Vid: Optimal Transport Token Compression for Video LLMs

ai-technology · 2026-05-13

Researchers propose OTT-Vid, a training-free token compression method for Video Large Language Models (Video-LLMs) that uses optimal transport to reduce the number of visual tokens across frames. The method has two stages. First, spatial pruning identifies representative content within each frame. Second, optimal transport between neighboring frames estimates temporal compressibility, assigning non-uniform token mass so that semantically important tokens are protected. The goal is to curb the inference cost of Video-LLMs, which grows with the volume of visual tokens as the models scale to longer and more complex videos. The authors position the approach as an improvement over existing methods, which rely on cross-frame similarity or segmentation heuristics.
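As a rough illustration of the temporal stage, the sketch below computes an entropy-regularized optimal transport plan between the tokens of two neighboring frames using Sinkhorn iterations, with per-token mass taken from an importance score. It is not the paper's implementation: the Sinkhorn solver, the cosine cost, the `importance` scores, the random stand-in features, and every function name are assumptions made for illustration. It only shows the general idea that a low transport cost suggests a frame is largely redundant with its neighbor, while non-uniform mass lets important tokens weigh more in that estimate.

```python
import numpy as np

def cosine_cost(x, y):
    """Pairwise cosine distance between the rows of x and the rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T

def sinkhorn_plan(cost, mass_a, mass_b, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan with row mass mass_a and column mass mass_b."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(mass_a)
    v = np.ones_like(mass_b)
    for _ in range(n_iters):
        u = mass_a / (kernel @ v)      # rescale rows toward mass_a
        v = mass_b / (kernel.T @ u)    # rescale columns to match mass_b
    return u[:, None] * kernel * v[None, :]

def frame_transport_cost(curr_tokens, prev_tokens, mass_curr, mass_prev):
    """Return (plan, total cost) for moving current-frame mass onto the previous frame."""
    cost = cosine_cost(curr_tokens, prev_tokens)
    plan = sinkhorn_plan(cost, mass_curr, mass_prev)
    return plan, float((plan * cost).sum())

rng = np.random.default_rng(0)
n_tokens, dim = 6, 32

# Hypothetical token features for one frame (random stand-ins for real embeddings).
prev_tokens = rng.normal(size=(n_tokens, dim))
# A nearly static next frame versus a frame with unrelated content.
static_tokens = prev_tokens + 0.05 * rng.normal(size=(n_tokens, dim))
changed_tokens = rng.normal(size=(n_tokens, dim))

# Hypothetical importance scores: token 0 is treated as semantically important,
# so it carries more mass than the others (non-uniform token mass).
importance = np.array([4.0, 1.0, 1.0, 1.0, 1.0, 1.0])
mass = importance / importance.sum()

plan_static, static_cost = frame_transport_cost(static_tokens, prev_tokens, mass, mass)
plan_changed, changed_cost = frame_transport_cost(changed_tokens, prev_tokens, mass, mass)

print(f"transport cost, near-static frame: {static_cost:.4f}")
print(f"transport cost, changed frame:     {changed_cost:.4f}")
```

In this toy setup the near-static frame yields a much lower transport cost than the changed frame, which is the kind of signal a compressor could use to decide how aggressively to merge or drop tokens over time. How OTT-Vid turns the transport result into an actual token budget is not described in this summary.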

Key facts

OTT-Vid is a training-free token compression method for Video-LLMs.
It uses optimal transport between neighboring frames for temporal compression.
Spatial pruning identifies representative content within each frame.
Non-uniform token mass protects semantically important tokens.
Existing methods rely on cross-frame similarity or segmentation heuristics.
Video-LLMs are scaling to longer and more complex videos.
Inference cost grows with the large volume of visual tokens.
The method is proposed in arXiv paper 2605.11803.
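
The spatial pruning step listed above can be pictured with a simple stand-in. The summary does not say how OTT-Vid picks representative content, so the sketch below uses greedy farthest-point selection on cosine distance, a generic technique chosen here purely for illustration. The function name `select_representatives`, the `keep` budget, and the synthetic frame are all hypothetical.

```python
import numpy as np

def select_representatives(tokens, keep):
    """Greedy farthest-point selection; returns sorted indices of `keep` tokens."""
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    # Start from the token most aligned with the frame's mean direction.
    mean_dir = unit.mean(axis=0)
    chosen = [int(np.argmax(unit @ mean_dir))]
    # Cosine distance from every token to its nearest already-chosen token.
    nearest = 1.0 - unit @ unit[chosen[0]]
    while len(chosen) < keep:
        # Pick the token that is least well covered by the current selection.
        nxt = int(np.argmax(nearest))
        chosen.append(nxt)
        nearest = np.minimum(nearest, 1.0 - unit @ unit[nxt])
    return sorted(chosen)

rng = np.random.default_rng(0)

# Synthetic frame: 3 distinct regions, each represented by 4 near-duplicate tokens.
bases = np.eye(3, 8)
frame_tokens = np.repeat(bases, 4, axis=0) + 0.01 * rng.normal(size=(12, 8))

kept = select_representatives(frame_tokens, keep=3)
print("kept token indices:", kept)
```

With twelve tokens drawn from three near-duplicate groups, a budget of three keeps one token per group, which is the intuition behind pruning spatial redundancy before any temporal comparison.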
