OSCBench Introduces New Benchmark for Object State Change in Text-to-Video AI Models
A new benchmark named OSCBench has been introduced to assess object state changes in text-to-video (T2V) generation models, filling a notable gap in current evaluations. Object state changes are transformations such as peeling a potato or slicing a lemon that are explicitly described in the text prompt. Previous benchmarks have concentrated mainly on perceptual quality, text-video alignment, or physical plausibility, neglecting action comprehension. OSCBench is built from instructional cooking data and categorizes action-object interactions into regular, novel, and compositional scenarios, allowing it to evaluate both in-distribution performance and generalization. Human user studies have been conducted on six representative open-source and proprietary T2V models. The paper detailing OSCBench is available on arXiv under the identifier arXiv:2603.11698v2 (listed as replace-cross).
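To make the regular/novel/compositional split concrete, here is a minimal sketch of how action-object prompts might be partitioned into those three scenario types. The exact definitions used by the paper are not given in this summary; the rules below, the seed pairs, and the `categorize` helper are illustrative assumptions, not OSCBench's actual construction.

```python
# Illustrative partitioning of (action, object) interactions, under assumed rules:
#   "regular":       the exact (action, object) pair appears in the seed data
#   "compositional": the action and the object each appear in the seed data,
#                    but never together
#   "novel":         the action or the object is absent from the seed data
from itertools import product

# Hypothetical seed pairs, standing in for instructional cooking data.
seed_pairs = {("peel", "potato"), ("slice", "lemon"), ("chop", "onion")}
seed_actions = {a for a, _ in seed_pairs}
seed_objects = {o for _, o in seed_pairs}

def categorize(action: str, obj: str) -> str:
    """Assign an (action, object) interaction to a scenario bucket."""
    if (action, obj) in seed_pairs:
        return "regular"
    if action in seed_actions and obj in seed_objects:
        return "compositional"
    return "novel"

if __name__ == "__main__":
    # Enumerate all combinations of seen actions/objects plus one unseen of each.
    candidates = product(seed_actions | {"grate"}, seed_objects | {"carrot"})
    for action, obj in sorted(candidates):
        print(f"{action} a {obj}: {categorize(action, obj)}")
```

Under this assumed scheme, "slice a potato" would be compositional (both words seen, never paired), while "grate a carrot" would be novel (neither word seen).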
Key facts
- OSCBench is a benchmark for assessing object state change in text-to-video generation models.
- Object state change involves transformations like peeling a potato or slicing a lemon.
- The benchmark uses instructional cooking data for its construction.
- Action-object interactions are organized into regular, novel, and compositional scenarios.
- Six open-source and proprietary T2V models are evaluated with human user studies.
- Existing benchmarks focus on perceptual quality, text-video alignment, or physical plausibility, neglecting action comprehension.
- The paper is available on arXiv under arXiv:2603.11698v2.
- Text-to-video models have made rapid progress in visual quality and temporal coherence.
Entities
Institutions
- arXiv