MLLMs Know When Before Speaking: Temporal Grounding via Attention Cues
A new arXiv preprint reports that multimodal large language models (MLLMs) often identify the correct temporal interval for a video event during the prefill stage, but lose this signal during answer generation. The researchers find a sparse set of attention heads, termed Temporal Grounding Heads (TG-Heads), that concentrate query-to-video attention on ground-truth intervals. As answer tokens are decoded autoregressively, attention shifts away from those intervals. This perception-generation gap explains why MLLMs describe video content fluently yet produce unreliable timestamp predictions. Existing remedies either require costly post-training or rely on coarse heuristics. The work proposes recovering temporal grounding from the attention cues of TG-Heads, a training-free way to improve video temporal grounding (VTG) performance.
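To make the idea of reading a time interval from attention concrete, here is a minimal sketch. It is not the paper's procedure: the function name `interval_from_attention`, the relative threshold, and the assumption that attention has already been pooled to one value per frame for each selected head are all illustrative choices.

```python
def interval_from_attention(head_frame_attn, fps=1.0, rel_threshold=0.5):
    """Turn per-frame attention from a few heads into a (start, end) interval in seconds.

    head_frame_attn: list of rows, one per selected head; each row holds the
        query-to-video attention mass per sampled frame (hypothetical input,
        assumed to be pooled over query tokens and over the tokens of each frame).
    fps: sampling rate of the frames fed to the model (assumption).
    rel_threshold: fraction of the peak score a frame needs to stay in the interval.
    """
    num_heads = len(head_frame_attn)
    num_frames = len(head_frame_attn[0])

    # Average the selected heads to get one score per frame.
    scores = [
        sum(row[f] for row in head_frame_attn) / num_heads
        for f in range(num_frames)
    ]

    # Start from the most-attended frame and grow outwards while
    # neighbouring frames stay above the relative cutoff.
    peak = max(range(num_frames), key=scores.__getitem__)
    cutoff = rel_threshold * scores[peak]

    start = peak
    while start > 0 and scores[start - 1] >= cutoff:
        start -= 1
    end = peak
    while end < num_frames - 1 and scores[end + 1] >= cutoff:
        end += 1

    # The end timestamp is exclusive: it marks the end of the last kept frame.
    return start / fps, (end + 1) / fps


if __name__ == "__main__":
    toy_attention = [
        [0.02, 0.03, 0.30, 0.40, 0.20, 0.05],
        [0.05, 0.05, 0.25, 0.35, 0.25, 0.05],
    ]
    print(interval_from_attention(toy_attention, fps=1.0))
```

Growing the interval from the peak keeps the result contiguous, which matches the single-interval setting of VTG; a real system would need to choose the threshold and handle multi-peak attention, neither of which the summary specifies.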
Key facts
- Study published on arXiv with ID 2605.21954
- Focuses on video temporal grounding (VTG) in MLLMs
- Identifies a perception-generation gap in MLLMs
- Finds a sparse set of Temporal Grounding Heads (TG-Heads) active during the prefill stage
- TG-Heads concentrate attention on ground-truth intervals
- Answer tokens shift attention away during autoregressive decoding
- Existing remedies are costly or coarse
- Proposes training-free method using attention cues
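One way to picture how heads that "concentrate attention on ground-truth intervals" could be singled out is to score each head by the share of its video attention that falls inside an annotated interval. The sketch below assumes exactly that; the scoring rule, the head labels such as `"L10H3"`, and the helper names are hypothetical and are not taken from the paper.

```python
def grounding_score(frame_attn, gt_start, gt_end):
    """Fraction of a head's per-frame attention that lies in frames [gt_start, gt_end)."""
    total = sum(frame_attn)
    if total == 0:
        return 0.0
    return sum(frame_attn[gt_start:gt_end]) / total


def rank_heads(head_frame_attn, gt_start, gt_end, top_k=2):
    """Return the labels of the top_k heads with the highest in-interval attention share.

    head_frame_attn: dict mapping a head label to its per-frame attention
        (hypothetical input, assumed to be measured at the prefill stage).
    """
    scored = [
        (grounding_score(row, gt_start, gt_end), label)
        for label, row in head_frame_attn.items()
    ]
    scored.sort(reverse=True)  # highest share first
    return [label for _, label in scored[:top_k]]


if __name__ == "__main__":
    toy_heads = {
        "L10H3": [0.05, 0.05, 0.40, 0.40, 0.10],  # sharply peaked on frames 2-3
        "L2H7": [0.20, 0.20, 0.20, 0.20, 0.20],   # uniform, carries no timing signal
        "L15H1": [0.10, 0.10, 0.30, 0.30, 0.20],  # moderately peaked
    }
    print(rank_heads(toy_heads, gt_start=2, gt_end=4, top_k=2))
```

In practice such scores would be averaged over many annotated examples before selecting heads; a single example is used here only to keep the sketch short.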
Entities
Institutions
- arXiv