PRISM Benchmark Tests Language Models on Spatial-Temporal Video Generation

ai-technology · 2026-05-20

Researchers have introduced PRISM, a benchmark that assesses how well language models can write code that renders spatially and temporally coherent animations. It contains 10,372 human-calibrated instruction-code pairs, making it 20 times larger than previous benchmarks for programmatic video generation. The pairs are drawn from real-world knowledge visualization scenarios in English and Chinese and span 437 subject categories. PRISM uses a funnel-style evaluation framework with four key metrics: Code-Level Reliability, Spatial Reasoning, Prompt-Aware Dynamic Visual Complexity (PADVC), and Temporal Density (TD). An evaluation of seven leading LLMs revealed a notable gap between execution and spatial reasoning: models can often produce executable code, yet the resulting animations are frequently spatially inaccurate. The benchmark targets programmatic video generation as an alternative to pixel-level diffusion models, a setting in which geometric precision and temporal coherence are essential.
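The funnel idea can be illustrated with a minimal Python sketch. It assumes that "funnel-style" means each stage is applied only to samples that passed the previous one; the summary does not define the metrics or their order, so the stage order, every name below (Sample, FunnelResult, evaluate_funnel, pass_rates) and the four checker callables are hypothetical placeholders, not PRISM's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Sample:
    instruction: str
    code: str  # model-generated animation code


@dataclass
class FunnelResult:
    executed: bool
    spatial_ok: Optional[bool] = None  # None: stage never reached
    padvc: Optional[float] = None
    td: Optional[float] = None


def evaluate_funnel(
    sample: Sample,
    runs: Callable[[Sample], bool],           # hypothetical code-level check
    spatial_check: Callable[[Sample], bool],  # hypothetical spatial check
    padvc_score: Callable[[Sample], float],   # hypothetical PADVC scorer
    td_score: Callable[[Sample], float],      # hypothetical TD scorer
) -> FunnelResult:
    """Apply each stage only if the previous stage passed (assumed funnel)."""
    if not runs(sample):
        return FunnelResult(executed=False)
    if not spatial_check(sample):
        return FunnelResult(executed=True, spatial_ok=False)
    return FunnelResult(
        executed=True,
        spatial_ok=True,
        padvc=padvc_score(sample),
        td=td_score(sample),
    )


def pass_rates(results: list[FunnelResult]) -> dict[str, float]:
    """Share of all samples that survive each of the first two stages."""
    n = len(results)
    if n == 0:
        return {"execution_rate": 0.0, "spatial_rate": 0.0}
    executed = sum(r.executed for r in results)
    spatial = sum(bool(r.spatial_ok) for r in results)
    return {"execution_rate": executed / n, "spatial_rate": spatial / n}
```

Under these assumptions, the difference between execution_rate and spatial_rate is one simple way to express the execution-spatial reasoning gap the article describes: code that runs but does not place objects correctly counts toward the first rate and not the second.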

Key facts

PRISM is a benchmark for programmatic spatial-temporal reasoning.
It contains 10,372 human-calibrated instruction-code pairs.
The benchmark is 20 times larger than prior programmatic video generation benchmarks.
It covers English and Chinese across 437 subject categories.
The funnel-style evaluation framework includes four metrics: Code-Level Reliability, Spatial Reasoning, Prompt-Aware Dynamic Visual Complexity (PADVC), and Temporal Density (TD).
Seven mainstream LLMs were systematically evaluated.
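A significant gap between execution and spatial reasoning was found: models often generate executable code that still produces spatially inaccurate animations.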
The benchmark is grounded in real-world knowledge visualization scenarios.

Sources

arXiv cs.AI — 2026-05-20