VGenST-Bench: New Benchmark for Spatio-Temporal Reasoning in MLLMs
Researchers have developed VGenST-Bench, a video benchmark for assessing spatio-temporal reasoning in Multimodal Large Language Models (MLLMs). Unlike existing benchmarks, which rely on static images or passively curated videos, VGenST-Bench uses generative models to actively synthesize a wide range of controlled evaluation scenarios. It is built through a multi-agent pipeline with human quality control, intended to ensure high-quality videos and question-answer pairs. The benchmark introduces a 3x2x2 video taxonomy organized by Spatial Scale, Perspective, and Scene Dynamics, along with a hierarchical task suite that decouples low-level visual perception from higher-level reasoning. The paper is available on arXiv as 2605.22570.
Key facts
- VGenST-Bench is a video benchmark for spatio-temporal reasoning in MLLMs.
- It uses generative models to synthesize controlled evaluation scenarios.
- A multi-agent pipeline with human quality control ensures video and QA quality.
- The benchmark includes a 3x2x2 video taxonomy: Spatial Scale, Perspective, Scene Dynamics.
- A hierarchical task suite decouples low-level visual perception from reasoning.
- The paper is published on arXiv with ID 2605.22570.
- Existing benchmarks rely on static images or passively curated videos.
- VGenST-Bench enables evaluation of fine-grained reasoning capabilities.
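To make the 3x2x2 taxonomy concrete, the short Python sketch below enumerates every cell of such a grid. Only the three axis names come from the summary above. The level labels are placeholders, and assigning three levels to Spatial Scale (with two each for Perspective and Scene Dynamics) is an assumption based on the order in which the axes are listed. It is an illustration of the grid's shape, not the benchmark's actual category list.

```python
from itertools import product

# Axis names are taken from the summary; the level labels are hypothetical
# placeholders, because the summary does not name the levels on each axis.
# Assumption: the "3" in 3x2x2 belongs to Spatial Scale.
TAXONOMY = {
    "Spatial Scale": ["scale_1", "scale_2", "scale_3"],
    "Perspective": ["perspective_1", "perspective_2"],
    "Scene Dynamics": ["dynamics_1", "dynamics_2"],
}


def enumerate_cells(taxonomy):
    """Return one dict per taxonomy cell (Cartesian product of all axes)."""
    axes = list(taxonomy)
    return [dict(zip(axes, combo)) for combo in product(*taxonomy.values())]


cells = enumerate_cells(TAXONOMY)
print(len(cells))  # 3 * 2 * 2 = 12 cells
for cell in cells:
    print(cell)
```

Under these assumptions the grid has 12 cells, so a video falls into exactly one combination of scale, perspective, and dynamics.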
Entities
Institutions
- arXiv