StoryTR: AI Benchmark for Narrative Video Retrieval Using Theory of Mind

ai-technology · 2026-04-29

StoryTR is a new benchmark that targets a weakness of existing video moment retrieval models: comprehending narrative. Current models perform well on action-focused tasks but struggle to understand why events matter, because they lack Theory of Mind (ToM), the ability to infer implicit intentions, mental states, and causal relationships in a narrative. StoryTR is the first video moment retrieval benchmark that requires ToM reasoning, with 8.1k samples drawn from narrative short-form videos such as shorts and reels. These videos convey meaning through subtle multimodal cues, so a character's 'smiling' may actually mean 'concealing hostility.' The paper is available on arXiv (2604.23198).

Key facts

StoryTR is the first video moment retrieval benchmark requiring Theory of Mind reasoning.
It comprises 8.1k samples from narrative short-form videos (shorts/reels).
Current models can see what is happening but fail to reason why it matters.
Theory of Mind is the cognitive ability to infer implicit intentions and mental states.
Narrative videos encode meaning through subtle multimodal cues.
A glance paired with a sigh carries different semantics than the glance alone.
The benchmark requires models to recognize that 'smiling' may conceal hostility.
The paper is available on arXiv with identifier 2604.23198.
