MLLMs Know When Before Speaking: Temporal Grounding via Attention Cues

ai-technology · 2026-05-23

A new arXiv preprint reports that multimodal large language models (MLLMs) often identify the correct temporal interval for a video event during the prefill stage but lose that signal while generating the answer. The researchers trace the prefill-stage signal to a sparse set of attention heads, termed Temporal Grounding Heads (TG-Heads), that concentrate query-to-video attention on the ground-truth interval. This perception-generation gap explains why MLLMs can describe video content fluently yet produce unreliable timestamp predictions. Existing remedies rely on costly post-training or coarse heuristics. The work instead proposes recovering temporal grounding from the attention cues of TG-Heads, a training-free way to improve video temporal grounding (VTG) performance.
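To make the idea of heads that "concentrate query-to-video attention on ground-truth intervals" concrete, here is a minimal Python sketch that scores each head by the share of its attention falling inside an annotated interval. It is an illustration only, not the paper's selection procedure: the function names, the head labels such as "L12H3", and the toy attention values are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def interval_mass(attn: Sequence[float], start: int, end: int) -> float:
    """Fraction of one head's query-to-video attention on frames start..end (inclusive)."""
    total = sum(attn)
    if total <= 0:
        return 0.0
    return sum(attn[start:end + 1]) / total


def rank_heads(
    head_attn: Dict[str, Sequence[float]],
    gt_interval: Tuple[int, int],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k heads sorted by attention mass inside the ground-truth interval."""
    start, end = gt_interval
    scores = {name: interval_mass(attn, start, end) for name, attn in head_attn.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy per-frame attention for three hypothetical heads over a 6-frame video.
toy_heads = {
    "L12H3": [0.0, 0.1, 0.4, 0.4, 0.1, 0.0],      # peaked on frames 2-3
    "L5H0": [1 / 6] * 6,                          # uniform, no temporal preference
    "L20H7": [0.5, 0.3, 0.05, 0.05, 0.05, 0.05],  # peaked on the wrong frames
}

# Assumed annotation: the event occupies frames 2 through 3.
ranked = rank_heads(toy_heads, gt_interval=(2, 3))
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

In this toy setting the head peaked on frames 2-3 ranks first. A real analysis would average such scores over many annotated examples before calling a head a TG-Head.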

Key facts

Study published on arXiv with ID 2605.21954
Focuses on video temporal grounding (VTG) in MLLMs
Identifies a perception-generation gap in MLLMs
Discovers Temporal Grounding Heads (TG-Heads) in prefill stage
TG-Heads concentrate attention on ground-truth intervals
Answer tokens shift attention away during autoregressive decoding
Existing remedies are costly or coarse
Proposes training-free method using attention cues
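
One way a training-free readout from attention cues could look is sketched below in Python: average the per-frame attention of a few selected heads, find the peak, and grow an interval around it while attention stays above a fraction of the peak. The thresholding rule, the function name predict_interval, and the parameters fps and rel_threshold are assumptions made for illustration; the summary above does not describe the paper's actual procedure.

```python
from typing import Sequence, Tuple


def predict_interval(
    head_attn: Sequence[Sequence[float]],
    fps: float = 1.0,
    rel_threshold: float = 0.5,
) -> Tuple[float, float]:
    """Estimate (start_sec, end_sec) from the averaged attention of selected heads."""
    n_frames = len(head_attn[0])
    # Average the per-frame attention across heads to get one temporal profile.
    profile = [sum(h[i] for h in head_attn) / len(head_attn) for i in range(n_frames)]
    peak = max(range(n_frames), key=profile.__getitem__)
    cutoff = rel_threshold * profile[peak]
    # Expand left and right from the peak while frames stay above the cutoff.
    lo = hi = peak
    while lo > 0 and profile[lo - 1] >= cutoff:
        lo -= 1
    while hi < n_frames - 1 and profile[hi + 1] >= cutoff:
        hi += 1
    # Frame i covers [i / fps, (i + 1) / fps), so the end bound is exclusive.
    return lo / fps, (hi + 1) / fps


# Toy attention from two hypothetical selected heads over 6 frames sampled at 2 fps.
selected = [
    [0.0, 0.1, 0.4, 0.4, 0.1, 0.0],
    [0.05, 0.05, 0.3, 0.5, 0.05, 0.05],
]
start_sec, end_sec = predict_interval(selected, fps=2.0)
print(start_sec, end_sec)
```

The relative threshold trades interval tightness against recall: a higher value keeps only frames near the peak, while a lower value admits more context.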
