Fre-Res: A New Video Token Compression Framework for MLLMs

ai-technology · 2026-05-20

A new research paper presents Fre-Res, a budget-adaptive dual-track video-token compression framework that targets the trade-off between spatial fidelity and temporal coverage in Video Multimodal Large Language Models (MLLMs). The framework decouples the two: it keeps a sparse set of high-fidelity spatial anchors and represents dense temporal change with compact residual-frequency tokens. To produce those tokens, it applies a temporal 1D discrete cosine transform (1D-DCT) to inter-frame residual trajectories in vision-latent space, exploiting the strong low-frequency concentration observed there. To align the frequency-domain dynamics with the native visual embeddings, Fre-Res adds a Spatial-Guided Absorber that injects temporal residual information into the corresponding spatial anchor tokens. The paper reports favorable results on both fine-grained short-video and long-video reasoning benchmarks. It is available on arXiv as 2605.16366.
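The temporal 1D-DCT idea described above can be illustrated with a small NumPy sketch. It is not the paper's implementation. It assumes that residuals are differences between consecutive frame latents, that the first frame serves as the only spatial anchor, and that compression simply keeps the first `keep` DCT coefficients along the time axis. The function names (`dct_matrix`, `compress_residuals`, `reconstruct`) and the tensor shapes are hypothetical.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis of shape (n, n); row k is the k-th frequency."""
    k = np.arange(n)[:, None]  # frequency index
    t = np.arange(n)[None, :]  # time index
    basis = np.cos(np.pi * (2 * t + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)   # rescale the DC row so the basis is orthonormal
    return basis


def compress_residuals(latents: np.ndarray, keep: int):
    """Split (T, N, D) frame latents into an anchor and low-frequency residual tokens.

    Assumption (not from the paper): frame 0 is the anchor and residuals are
    consecutive-frame differences.
    """
    anchor = latents[0]
    residuals = latents[1:] - latents[:-1]             # (T-1, N, D)
    basis = dct_matrix(residuals.shape[0])
    # 1D-DCT along the time axis, applied independently per token and channel.
    coeffs = np.tensordot(basis, residuals, axes=(1, 0))
    return anchor, coeffs[:keep]                       # keep low frequencies only


def reconstruct(anchor: np.ndarray, low_freq: np.ndarray, num_frames: int) -> np.ndarray:
    """Approximate the original latents from the anchor and truncated coefficients."""
    basis = dct_matrix(num_frames - 1)
    keep = low_freq.shape[0]
    # Inverse of an orthonormal DCT is its transpose; dropped coefficients count as zero.
    residuals = np.tensordot(basis[:keep].T, low_freq, axes=(1, 0))
    frames = anchor[None] + np.cumsum(residuals, axis=0)
    return np.concatenate([anchor[None], frames], axis=0)


# Toy input: 9 frames, 4 tokens, 8 channels, with a perfectly linear drift.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 8))
drift = rng.normal(size=(4, 8))
latents = np.stack([base + t * drift for t in range(9)])

anchor, low_freq = compress_residuals(latents, keep=1)
approx = reconstruct(anchor, low_freq, num_frames=9)
```

In this toy case the residuals are constant over time, so all of their energy sits in the zero-frequency coefficient and a single retained coefficient per token recovers every frame. Real video latents are not this regular; the sketch only shows why low-frequency concentration makes truncation cheap.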

Key facts

Fre-Res is a budget-adaptive dual-track video-token compression framework.
Separates spatial fidelity from temporal coverage in Video MLLMs.
Preserves sparse high-fidelity spatial anchors.
Represents temporal evolution via compact residual-frequency tokens.
Applies temporal 1D-DCT to inter-frame residual trajectories.
Uses a Spatial-Guided Absorber to inject temporal information.
Achieves favorable results on short-video and long-video benchmarks.
Paper available on arXiv with ID 2605.16366.
