SpookyBench Reveals VLMs' Inability to Perceive Temporal Patterns
Researchers have introduced SpookyBench, a benchmark that tests how well vision-language models (VLMs) understand temporal patterns in video. Unlike traditional benchmarks that focus on spatial content, SpookyBench encodes information purely in temporal sequences of noise-like frames, mimicking natural phenomena such as biological signaling and covert communication. Humans recognize the shapes and patterns in these sequences with over 98% accuracy, yet state-of-the-art VLMs score 0%. This exposes a critical gap: VLMs rely heavily on frame-level spatial features and cannot extract meaning from temporal cues alone. Moreover, their temporal understanding degrades far more rapidly than human perception as the spatial signal-to-noise ratio decreases. The findings point to a fundamental limitation of current models and suggest the need for stronger temporal reasoning. The research appeared on arXiv under the ID 2505.24867.
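To make the idea concrete, here is a minimal sketch of one way such a stimulus *could* be constructed (this is an illustrative assumption, not the authors' actual method): every individual frame is indistinguishable from binary noise, but pixels inside a hidden shape are held constant across frames while background pixels refresh, so the shape is recoverable only from temporal statistics.

```python
import numpy as np

# Hypothetical illustration (not the SpookyBench generation code):
# hide a shape in noise so it is visible only through temporal coherence.
rng = np.random.default_rng(0)
T, H, W = 30, 64, 64  # frames, height, width

# Binary mask of the hidden shape: a centered square (illustrative choice).
mask = np.zeros((H, W), dtype=bool)
mask[24:40, 24:40] = True

# Every frame starts as fresh binary noise...
frames = rng.integers(0, 2, size=(T, H, W))
# ...but shape pixels are frozen at their frame-0 values, so each single
# frame is still a ~50% random pattern with no spatial cue to the shape.
frames[:, mask] = frames[0, mask]

# Decoding requires temporal information: shape pixels have zero variance
# over time, while background pixels flicker.
recovered = frames.var(axis=0) == 0
```

A spatial-only observer (or a model that pools features frame by frame) sees uniform noise in every frame; only by comparing frames over time does the shape emerge, which is the kind of signal the summary says VLMs fail to extract.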
Key facts
- SpookyBench is a new benchmark for testing temporal understanding in VLMs.
- Information in SpookyBench is encoded in temporal sequences of noise-like frames.
- Humans achieve over 98% accuracy on SpookyBench.
- State-of-the-art VLMs achieve 0% accuracy on SpookyBench.
- VLMs over-rely on spatial features and cannot extract meaning from temporal cues.
- As the spatial signal-to-noise ratio (SNR) decreases, VLM temporal understanding degrades far more rapidly than human perception.
- The study was published on arXiv with ID 2505.24867.
- The research highlights a critical limitation in current AI systems.
Entities
Institutions
- arXiv