OSCBench Introduces New Benchmark for Object State Change in Text-to-Video AI Models
A new benchmark named OSCBench has been introduced to assess object state changes in text-to-video (T2V) generation models, filling a notable gap in current evaluations. Object state changes are transformations such as peeling a potato or slicing a lemon that are explicitly described in the text prompt. Previous benchmarks have concentrated mainly on perceptual quality, text-video alignment, or physical plausibility, neglecting action comprehension. OSCBench is built from instructional cooking data and categorizes action-object interactions into regular, novel, and compositional scenarios, allowing it to evaluate both in-distribution performance and generalization. Human user studies have been conducted on six representative open-source and proprietary T2V models. The paper detailing OSCBench is available on arXiv under the identifier arXiv:2603.11698v2 (listed as replace-cross).
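To make the regular/novel/compositional split concrete, here is a minimal sketch of how action-object prompts might be partitioned into those three scenario types. The exact definitions used by the paper are not given in this summary; the rules below, the seed pairs, and the `categorize` helper are illustrative assumptions, not OSCBench's actual construction.

```python
# Illustrative partitioning of (action, object) interactions, under assumed rules:
#   "regular":       the exact (action, object) pair appears in the seed data
#   "compositional": the action and the object each appear in the seed data,
#                    but never together
#   "novel":         the action or the object is absent from the seed data
from itertools import product

# Hypothetical seed pairs, standing in for instructional cooking data.
seed_pairs = {("peel", "potato"), ("slice", "lemon"), ("chop", "onion")}
seed_actions = {a for a, _ in seed_pairs}
seed_objects = {o for _, o in seed_pairs}

def categorize(action: str, obj: str) -> str:
    """Assign an (action, object) interaction to a scenario bucket."""
    if (action, obj) in seed_pairs:
        return "regular"
    if action in seed_actions and obj in seed_objects:
        return "compositional"
    return "novel"

if __name__ == "__main__":
    # Enumerate all combinations of seen actions/objects plus one unseen of each.
    candidates = product(seed_actions | {"grate"}, seed_objects | {"carrot"})
    for action, obj in sorted(candidates):
        print(f"{action} a {obj}: {categorize(action, obj)}")
```

Under this assumed scheme, "slice a potato" would be compositional (both words seen, never paired), while "grate a carrot" would be novel (neither word seen).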
Key facts
- OSCBench is a benchmark for assessing object state change in text-to-video generation models.
- Object state change involves transformations like peeling a potato or slicing a lemon.
- The benchmark uses instructional cooking data for its construction.
- Action-object interactions are organized into regular, novel, and compositional scenarios.
- Six open-source and proprietary T2V models are evaluated with human user studies.
- Existing benchmarks focus on perceptual quality, text-video alignment, or physical plausibility, neglecting action comprehension.
- The paper is available on arXiv under arXiv:2603.11698v2.
- Text-to-video models have made rapid progress in visual quality and temporal coherence.
Entities
Institutions
- arXiv