ARTFEED — Contemporary Art Intelligence

Temporal Token Fusion: Training-Free Compression for Video-Language Models

ai-technology · 2026-05-11

A new method called Temporal Token Fusion (TTF) addresses the high inference cost of video-language models (VLMs), which stems from the large number of visual tokens. For instance, 32 frames at 448x448 resolution produce over 8,000 visual tokens in Qwen3-VL, making LLM prefill a bottleneck. Existing compression techniques rely on global similarity or attention guidance, which add overhead. TTF is a training-free, plug-and-play framework that compresses tokens before they enter the LLM by exploiting temporal redundancy. It selects an anchor frame and performs local window similarity searches (e.g., 3x3) on subsequent frames, fusing tokens whose similarity to an anchor token exceeds a threshold. The compressed sequence maintains positional consistency through coordinate realignment, integrating seamlessly with existing VLM pipelines. The paper reports results on Qwen3-VL-8B with a threshold of 0.70.
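The compression step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the choice of the first frame as anchor, the use of cosine similarity, and the drop-on-fuse rule (a redundant token is simply represented by its matching anchor token) are all assumptions, and the retained (t, h, w) coordinates stand in for the coordinate-realignment step.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D token embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def ttf_compress(frames, threshold=0.70, window=1):
    """Hypothetical TTF-style compression sketch.

    frames: array of shape (T, H, W, D) — per-frame visual token grids.
    window=1 gives the 3x3 local search mentioned in the article.
    Returns kept tokens as (t, h, w, embedding) tuples, so positional
    information survives for downstream coordinate realignment.
    """
    T, H, W, D = frames.shape
    anchor = frames[0]  # assumption: first frame serves as the anchor
    kept = [(0, h, w, anchor[h, w]) for h in range(H) for w in range(W)]
    for t in range(1, T):
        for h in range(H):
            for w in range(W):
                tok = frames[t, h, w]
                fused = False
                # local window search around the same spatial position
                for dh in range(-window, window + 1):
                    for dw in range(-window, window + 1):
                        hh, ww = h + dh, w + dw
                        if 0 <= hh < H and 0 <= ww < W and \
                           cosine_sim(tok, anchor[hh, ww]) >= threshold:
                            fused = True  # redundant: covered by anchor
                            break
                    if fused:
                        break
                if not fused:
                    kept.append((t, h, w, tok))
    return kept
```

On a static clip, where later frames closely match the anchor, nearly everything after frame 0 is fused away; on rapidly changing content, most tokens survive, so the compression rate adapts to temporal redundancy.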

Key facts

  • TTF is a training-free token compression method for video-language models.
  • It reduces visual token counts by exploiting temporal redundancy across frames.
  • The method selects an anchor frame and fuses similar tokens via local window search.
  • TTF maintains positional consistency through coordinate realignment.
  • It is designed as a plug-and-play module for existing VLM pipelines.
  • The paper uses Qwen3-VL-8B with a threshold of 0.70 for experiments.
  • 32 frames at 448x448 resolution yield over 8,000 visual tokens in Qwen3-VL.
  • TTF addresses the LLM prefill bottleneck in video processing.
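The "over 8,000 tokens" figure is consistent with the following back-of-the-envelope arithmetic, assuming a ViT patch size of 14 and 2x2 spatial token merging as in earlier Qwen-VL models; Qwen3-VL's exact tokenizer details may differ.

```python
# Hypothetical token-count arithmetic (patch size and merge factor are
# assumptions based on the Qwen-VL family, not stated in the article).
patch, merge = 14, 2
side = 448 // patch                          # 32 patches per side
per_frame = side * side // (merge * merge)   # 256 tokens per frame
total = 32 * per_frame                       # 8192 tokens for 32 frames
print(total)
```

This lands at 8,192 tokens, matching the "over 8,000" figure quoted for a 32-frame clip.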
