ARTFEED — Contemporary Art Intelligence

VideoSEAL: Decoupling Answer Authority to Fix Evidence Misalignment in Long Video QA

ai-technology · 2026-05-14

Researchers propose VideoSEAL, a framework that addresses evidence misalignment in agentic long video understanding: cases where a model gives a correct answer that the evidence it retrieved does not support. Two diagnostics, temporal groundedness and semantic groundedness, trace the problem to two root causes: prompt pressure from shared-context saturation and reward pressure from outcome-only optimization. The proposed decoupled planner-inspector paradigm separates long-horizon planning from answer authority.

Key facts

  • arXiv:2605.12571
  • Long video QA requires locating sparse, time-scattered visual evidence
  • Current multimodal large language models (MLLMs) perform well on short videos but struggle with long videos
  • Evidence misalignment: correct answers not supported by retrieved evidence
  • Two diagnostics: temporal groundedness and semantic groundedness
  • Prompt pressure from shared-context saturation at inference time
  • Reward pressure from outcome-only optimization during training
  • Decoupled planner-inspector framework proposed
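
To make the idea of evidence misalignment concrete, here is a minimal Python sketch of one possible temporal groundedness check. It is not the paper's definition: this summary does not give the metric, so the coverage formula, the 0.5 threshold, and the names `Prediction`, `temporal_groundedness`, and `is_misaligned` are all illustrative assumptions. The sketch scores how much of an annotated evidence span the retrieved video spans cover, then flags answers that are correct but poorly grounded.

```python
from dataclasses import dataclass
from typing import List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


@dataclass
class Prediction:
    """Hypothetical record of one QA attempt (field names are assumptions)."""
    answer_correct: bool
    retrieved: List[Span]  # spans the agent actually inspected
    evidence: Span         # annotated span that supports the answer


def temporal_groundedness(pred: Prediction) -> float:
    """Fraction of the evidence span covered by the union of retrieved spans.

    Assumed metric for illustration only, not taken from the paper.
    """
    ev_start, ev_end = pred.evidence
    length = ev_end - ev_start
    if length <= 0:
        return 0.0

    # Clip each retrieved span to the evidence span and drop non-overlapping ones.
    clipped = sorted(
        (max(s, ev_start), min(e, ev_end))
        for s, e in pred.retrieved
        if min(e, ev_end) > max(s, ev_start)
    )

    # Sweep left to right so overlapping retrieved spans are not double counted.
    covered = 0.0
    cursor = ev_start
    for s, e in clipped:
        s = max(s, cursor)
        if e > s:
            covered += e - s
            cursor = e
    return covered / length


def is_misaligned(pred: Prediction, threshold: float = 0.5) -> bool:
    """Correct answer whose retrieved spans cover too little of the evidence."""
    return pred.answer_correct and temporal_groundedness(pred) < threshold


if __name__ == "__main__":
    grounded = Prediction(True, [(10.0, 20.0), (15.0, 25.0)], (10.0, 30.0))
    lucky = Prediction(True, [(100.0, 120.0)], (10.0, 30.0))
    print(temporal_groundedness(grounded), is_misaligned(grounded))
    print(temporal_groundedness(lucky), is_misaligned(lucky))
```

Under these assumptions, the second example is the case the article describes: the answer is correct, yet none of the inspected footage overlaps the supporting evidence. A semantic groundedness check would need a separate content-level comparison, which this sketch does not attempt.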

Entities

Institutions

  • arXiv

Sources