VideoSEAL: Decoupling Answer Authority to Fix Evidence Misalignment in Long Video QA

ai-technology · 2026-05-14

Researchers propose VideoSEAL, a framework that addresses evidence misalignment in agentic long video understanding: cases where a model gives a correct answer that the evidence it retrieved does not support. Two diagnostics, temporal groundedness and semantic groundedness, point to two root causes: prompt pressure from shared-context saturation at inference time, and reward pressure from outcome-only optimization during training. VideoSEAL's decoupled planner-inspector paradigm separates long-horizon planning from answer authority.

Key facts

arXiv:2605.12571
Long video QA requires locating sparse, time-scattered visual evidence
Current MLLMs perform well on short videos but struggle with long videos
Evidence misalignment: correct answers not supported by retrieved evidence
Two diagnostics: temporal groundedness and semantic groundedness
Prompt pressure from shared-context saturation at inference time
Reward pressure from outcome-only optimization during training
Decoupled planner-inspector framework proposed
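
To make the idea of evidence misalignment concrete, here is a minimal sketch of what a temporal groundedness check could look like. It is not the paper's metric, which this summary does not define. It assumes each question comes with annotated reference evidence intervals in seconds, and the names `temporal_coverage`, `is_misaligned`, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec); assumed annotation format


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals so no span is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def temporal_coverage(retrieved: List[Interval], reference: List[Interval]) -> float:
    """Fraction of the reference evidence duration covered by retrieved segments.

    Hypothetical stand-in for a temporal groundedness score.
    """
    retrieved_m = merge_intervals(retrieved)
    reference_m = merge_intervals(reference)
    total = sum(end - start for start, end in reference_m)
    if total == 0:
        return 0.0
    covered = 0.0
    for ref_start, ref_end in reference_m:
        for ret_start, ret_end in retrieved_m:
            # Overlap length of two intervals, clamped at zero when disjoint.
            covered += max(0.0, min(ref_end, ret_end) - max(ref_start, ret_start))
    return covered / total


def is_misaligned(
    answer_correct: bool,
    retrieved: List[Interval],
    reference: List[Interval],
    threshold: float = 0.5,  # arbitrary illustrative cutoff
) -> bool:
    """Flag a correct answer whose retrieved evidence barely touches the reference."""
    return answer_correct and temporal_coverage(retrieved, reference) < threshold
```

Under these assumptions, a correct answer produced after retrieving only segments far from the annotated evidence would be flagged, while an incorrect answer is never flagged, since misalignment as described here concerns correct answers only.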
