VGenST-Bench: New Benchmark for Spatio-Temporal Reasoning in MLLMs

publication · 2026-05-23

Researchers have introduced VGenST-Bench, a video benchmark for assessing spatio-temporal reasoning in Multimodal Large Language Models (MLLMs). Unlike earlier benchmarks, which depend on static images or passively curated videos, VGenST-Bench uses generative models to actively synthesize a wide range of controlled evaluation scenarios. It is built with a multi-agent pipeline that includes human quality control over the generated videos and question-answer pairs. The benchmark organizes its videos in a taxonomy spanning Spatial Scale, Perspective, and Scene Dynamics, and pairs it with a hierarchical task suite that separates low-level visual perception from higher-level reasoning. The paper is available on arXiv as 2605.22570.

Key facts

VGenST-Bench is a video benchmark for spatio-temporal reasoning in MLLMs.
It uses generative models to synthesize controlled evaluation scenarios.
A multi-agent pipeline with human quality control ensures video and QA quality.
The benchmark organizes its videos in a 3x2x2 taxonomy along three axes: Spatial Scale, Perspective, and Scene Dynamics.
A hierarchical task suite decouples low-level visual perception from reasoning.
The paper is published on arXiv with ID 2605.22570.
Existing benchmarks rely on static images or passively curated videos.
VGenST-Bench enables evaluation of fine-grained reasoning capabilities.
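
To make the 3x2x2 taxonomy concrete, the sketch below enumerates the cells of such a grid. It is illustrative only and is not taken from the paper: the axis names come from the summary above, but the level labels are placeholders, and assigning three levels to Spatial Scale (rather than to another axis) is an assumption based on the order in which the axes are listed.

```python
from itertools import product

# Axis names are from the summary; the level labels are placeholders,
# because the summary gives only the 3x2x2 shape, not the level names.
# Which axis carries three levels is assumed from the listing order.
TAXONOMY = {
    "Spatial Scale": ["scale_1", "scale_2", "scale_3"],
    "Perspective": ["perspective_1", "perspective_2"],
    "Scene Dynamics": ["dynamics_1", "dynamics_2"],
}


def taxonomy_cells(taxonomy):
    """Return every combination of one level per axis, as a list of dicts."""
    axes = list(taxonomy)
    return [dict(zip(axes, combo)) for combo in product(*taxonomy.values())]


cells = taxonomy_cells(TAXONOMY)
print(len(cells))  # 3 * 2 * 2 = 12 video categories
```

Each of the 12 cells would correspond to one video category for which controlled scenarios can be generated and then scored separately.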
