ARTFEED — Contemporary Art Intelligence

OTT-Vid: Optimal Transport Token Compression for Video LLMs

ai-technology · 2026-05-13

Researchers propose OTT-Vid, a training-free token compression method for Video Large Language Models (Video-LLMs) that uses optimal transport to reduce the number of visual tokens across frames. The method has two stages. First, spatial pruning identifies the representative content within each frame. Second, optimal transport between neighboring frames estimates temporal compressibility, assigning non-uniform mass to tokens so that semantically important ones are protected. The method targets the growing inference cost of Video-LLMs as they scale to longer videos, and it is presented as an improvement over existing methods that rely on cross-frame similarity or segmentation heuristics.
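To make the second stage more concrete, here is a minimal Python sketch of entropy-regularized optimal transport (Sinkhorn iterations) between the tokens of two neighboring frames. It is an illustration under stated assumptions, not the paper's implementation: the cosine cost, the importance weights used as token mass, the regularization value, and every function name are hypothetical choices made for this example. The intuition it shows is that a low transport cost suggests two frames carry similar content and may be more compressible over time.

```python
import math


def cosine_cost(a, b):
    """Cost between two non-zero feature vectors: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def normalise(weights):
    """Scale positive weights so they sum to 1 (a probability mass)."""
    total = sum(weights)
    return [w / total for w in weights]


def sinkhorn(cost, mass_a, mass_b, eps=0.1, iters=200):
    """Entropy-regularized transport plan via Sinkhorn scaling iterations."""
    n, m = len(mass_a), len(mass_b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        u = [mass_a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [mass_b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(frame_a, frame_b, importance_a, importance_b):
    """Total cost of moving frame_a's token mass onto frame_b's tokens.

    Assumption: importance scores act as non-uniform token mass, so heavier
    (more important) tokens contribute more to the cost.
    """
    cost = [[cosine_cost(a, b) for b in frame_b] for a in frame_a]
    plan = sinkhorn(cost, normalise(importance_a), normalise(importance_b))
    return sum(
        plan[i][j] * cost[i][j]
        for i in range(len(frame_a))
        for j in range(len(frame_b))
    )


# Toy 2-D token features for three frames (hypothetical values).
frame_1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frame_2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # unchanged scene
frame_3 = [[-1.0, 0.0], [0.0, -1.0], [-1.0, -1.0]]    # very different scene
importance = [0.6, 0.2, 0.2]                          # first token matters most

static_cost = transport_cost(frame_1, frame_2, importance, importance)
changed_cost = transport_cost(frame_1, frame_3, importance, importance)
print(static_cost, changed_cost)  # static_cost comes out lower than changed_cost
```

In this sketch, a compression policy could merge or drop more tokens between frame pairs with a low cost and keep more where the cost is high. How OTT-Vid actually turns transport results into a compression decision is not described in this summary.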

Key facts

  • OTT-Vid is a training-free token compression method for Video-LLMs.
  • It uses optimal transport between neighboring frames for temporal compression.
  • Spatial pruning identifies representative content within each frame.
  • Non-uniform token mass protects semantically important tokens.
  • Existing methods rely on cross-frame similarity or segmentation heuristics.
  • Video-LLMs are scaling to longer and more complex videos.
  • Inference cost grows with the large volume of visual tokens these videos produce.
  • The method is proposed in arXiv paper 2605.11803.
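The cost concern in the key facts above can be made concrete with a short calculation. The sketch below counts visual tokens for a video and the number of token pairs that self-attention compares, which grows with the square of the sequence length. The frame count, tokens per frame, and keep ratio are hypothetical numbers chosen for illustration, not figures from the paper.

```python
def visual_token_count(num_frames, tokens_per_frame, keep_ratio=1.0):
    """Number of visual tokens after keeping a fraction of them."""
    return round(num_frames * tokens_per_frame * keep_ratio)


def attention_pairs(num_tokens):
    """Token pairs compared by full self-attention (quadratic in length)."""
    return num_tokens * num_tokens


# Hypothetical video: 64 sampled frames, 196 visual tokens per frame.
full = visual_token_count(64, 196)
kept = visual_token_count(64, 196, keep_ratio=0.25)

print(full, kept)
print(attention_pairs(full) // attention_pairs(kept))
```

Because the attention cost is quadratic, keeping a quarter of the tokens cuts the pairwise comparisons by far more than a factor of four, which is why token compression is attractive for long videos.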

Entities

Institutions

  • arXiv

Sources