New Benchmark Tests Physical Commonsense in Audio-Video AI
Researchers have introduced AV-Phys Bench, a benchmark for assessing whether joint audio-video generation models grasp real-world physics. The benchmark organizes prompts into three scene categories, Steady State, Event Transition, and Environment Transition, each further subdivided according to the physical principles involved. It also includes adversarial Anti-AV-Physics prompts that explicitly request physically inconsistent behavior, probing whether models follow the prompt or the physics. Outputs are judged on five dimensions covering semantic adherence and physical commonsense in each modality, plus cross-modal physical commonsense. Of the seven models evaluated (three proprietary, four open-source), Seedance 2.0 performed best overall; more broadly, the study finds that many models achieve superficial believability without genuine cross-modal physical coherence.
Key facts
- AV-Phys Bench evaluates physical commonsense in joint audio-video generation.
- Three scene categories: Steady State, Event Transition, Environment Transition.
- Includes Anti-AV-Physics prompts requesting physically inconsistent behavior.
- Five evaluation dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, cross-modal physical commonsense.
- Tested three proprietary and four open-source models.
- Seedance 2.0 performed best overall.
- Study reveals models often lack cross-modal physical consistency.
- Research published on arXiv (2605.07061).
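The five evaluation dimensions listed above lend themselves to a simple per-sample scoring scheme. As an illustrative sketch only: the dimension names come from the summary, but the equal weighting and the 1-5 rating scale are assumptions, not the benchmark's published protocol.

```python
# Hypothetical sketch: aggregate the five AV-Phys Bench evaluation
# dimensions into one score. The equal weighting and 1-5 scale are
# assumptions for illustration, not the benchmark's actual protocol.

DIMENSIONS = (
    "visual_semantic_adherence",
    "audio_semantic_adherence",
    "visual_physical_commonsense",
    "audio_physical_commonsense",
    "cross_modal_physical_commonsense",
)

def overall_score(ratings: dict) -> float:
    """Average the five per-dimension ratings (assumed 1-5 scale)."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(ratings[d] for d in DIMENSIONS) / len(DIMENSIONS)

# Example: a clip that looks plausible frame-by-frame but whose audio
# is poorly coupled to the visuals scores low on the cross-modal axis.
sample = {
    "visual_semantic_adherence": 5,
    "audio_semantic_adherence": 4,
    "visual_physical_commonsense": 4,
    "audio_physical_commonsense": 3,
    "cross_modal_physical_commonsense": 2,  # weak audio-video coupling
}
print(overall_score(sample))  # 3.6
```

A separate per-dimension breakdown, rather than a single average, is what lets the study distinguish superficial believability from true cross-modal coherence.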