VideoSEAL: Decoupling Answer Authority to Fix Evidence Misalignment in Long Video QA
Researchers propose VideoSEAL, a framework that addresses evidence misalignment in agentic long video understanding: cases where a model produces a correct answer that the evidence it retrieved does not support. Two diagnostics, temporal groundedness and semantic groundedness, trace the problem to two root causes: prompt pressure from shared-context saturation at inference time, and reward pressure from outcome-only optimization during training. The proposed decoupled planner-inspector paradigm separates long-horizon planning from answer authority.
Key facts
- arXiv:2605.12571
- Long video QA requires locating sparse, time-scattered visual evidence
- Current MLLMs perform well on short videos but struggle with long videos
- Evidence misalignment: correct answers not supported by retrieved evidence
- Two diagnostics: temporal groundedness and semantic groundedness
- Prompt pressure from shared-context saturation at inference time
- Reward pressure from outcome-only optimization during training
- Decoupled planner-inspector framework proposed
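
To make the notion of evidence misalignment concrete, here is a minimal sketch of how a temporal check might flag it. This is an illustration under stated assumptions, not the paper's definition: it assumes each QA episode comes with an annotated evidence span, and it treats an answer as temporally grounded if any retrieved clip overlaps that span. The names `Episode`, `temporally_grounded`, `is_misaligned`, and `misalignment_rate` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


@dataclass
class Episode:
    """One QA attempt. All fields are hypothetical, for illustration only."""
    predicted: str             # answer the agent gave
    gold: str                  # reference answer
    retrieved: List[Interval]  # clips the agent actually inspected
    evidence: Interval         # assumed annotation: span that supports the gold answer


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share any stretch of time."""
    return a[0] < b[1] and b[0] < a[1]


def temporally_grounded(ep: Episode) -> bool:
    """Assumed criterion: at least one retrieved clip overlaps the evidence span."""
    return any(overlaps(clip, ep.evidence) for clip in ep.retrieved)


def is_misaligned(ep: Episode) -> bool:
    """Correct answer that the retrieved clips do not support."""
    return ep.predicted == ep.gold and not temporally_grounded(ep)


def misalignment_rate(episodes: List[Episode]) -> float:
    """Fraction of correct answers that are not temporally grounded."""
    correct = [ep for ep in episodes if ep.predicted == ep.gold]
    if not correct:
        return 0.0
    return sum(is_misaligned(ep) for ep in correct) / len(correct)


episodes = [
    Episode("B", "B", [(10.0, 20.0)], (15.0, 30.0)),   # correct and grounded
    Episode("C", "C", [(0.0, 5.0)], (100.0, 120.0)),   # correct but ungrounded
    Episode("A", "D", [(40.0, 50.0)], (45.0, 60.0)),   # wrong answer, excluded
]
rate = misalignment_rate(episodes)
```

An outcome-only metric would score the first two episodes identically; a check of this kind separates them. A semantic counterpart would need to compare the content of the retrieved clips against the answer, which this sketch does not attempt.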
Entities
Institutions
- arXiv