SpookyBench Reveals VLMs' Inability to Perceive Temporal Patterns
Researchers have introduced SpookyBench, a benchmark that tests how well vision-language models (VLMs) understand temporal patterns in video. Unlike traditional benchmarks that focus on spatial content, SpookyBench encodes information purely in temporal sequences of noise-like frames, mimicking natural phenomena such as biological signaling and covert communication. Humans recognize the shapes and patterns in these sequences with over 98% accuracy, yet state-of-the-art VLMs score 0%. This exposes a critical gap: VLMs rely heavily on frame-level spatial features and cannot extract meaning from temporal cues alone. Moreover, their temporal understanding degrades far more rapidly than human perception as the spatial signal-to-noise ratio decreases. The findings point to a fundamental limitation of current models and suggest the need for stronger temporal reasoning. The research appeared on arXiv under the ID 2505.24867.
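To make the idea concrete, here is a minimal sketch of one way such a stimulus *could* be constructed (this is an illustrative assumption, not the authors' actual method): every individual frame is indistinguishable from binary noise, but pixels inside a hidden shape are held constant across frames while background pixels refresh, so the shape is recoverable only from temporal statistics.

```python
import numpy as np

# Hypothetical illustration (not the SpookyBench generation code):
# hide a shape in noise so it is visible only through temporal coherence.
rng = np.random.default_rng(0)
T, H, W = 30, 64, 64  # frames, height, width

# Binary mask of the hidden shape: a centered square (illustrative choice).
mask = np.zeros((H, W), dtype=bool)
mask[24:40, 24:40] = True

# Every frame starts as fresh binary noise...
frames = rng.integers(0, 2, size=(T, H, W))
# ...but shape pixels are frozen at their frame-0 values, so each single
# frame is still a ~50% random pattern with no spatial cue to the shape.
frames[:, mask] = frames[0, mask]

# Decoding requires temporal information: shape pixels have zero variance
# over time, while background pixels flicker.
recovered = frames.var(axis=0) == 0
```

A spatial-only observer (or a model that pools features frame by frame) sees uniform noise in every frame; only by comparing frames over time does the shape emerge, which is the kind of signal the summary says VLMs fail to extract.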
Key facts
- SpookyBench is a new benchmark for testing temporal understanding in VLMs.
- Information in SpookyBench is encoded in temporal sequences of noise-like frames.
- Humans achieve over 98% accuracy on SpookyBench.
- State-of-the-art VLMs achieve 0% accuracy on SpookyBench.
- VLMs over-rely on spatial features and cannot extract meaning from temporal cues.
- As the spatial signal-to-noise ratio (SNR) decreases, VLM temporal understanding degrades far more rapidly than human perception.
- The study was published on arXiv with ID 2505.24867.
- The research highlights a critical limitation in current AI systems.
Entities
Institutions
- arXiv