MTAVG-Bench 2.0: Benchmark for Cinematic Expressiveness in Multi-Talker Video Generation
Researchers have introduced MTAVG-Bench 2.0, a benchmark designed to diagnose failures of cinematic expressiveness in multi-talker audio-video generation (MTAVG) models. Current models perform well on basic measures such as lip-sync and audio-visual alignment, but those measures do not capture the higher-level cinematic qualities of multi-character scenes. The new benchmark targets short-drama and scene-level generation and establishes a taxonomy of failures across acting, narrative, atmosphere, and audio-visual language. It includes more than 10,000 question-answering evaluation instances. The work is published on arXiv under the identifier 2605.28035.
Key facts
- MTAVG-Bench 2.0 is a new benchmark that diagnoses failure modes of cinematic expressiveness in multi-talker audio-video generation.
- Current models show promising performance on basic measures such as lip-sync and audio-visual alignment.
- Those measures are insufficient for assessing cinematic expressiveness in scene-level generation.
- The benchmark targets short-drama and scene-level generation.
- It establishes a high-level failure taxonomy covering acting, narrative, atmosphere, and audio-visual language.
- The benchmark includes more than 10,000 question-answering evaluation instances.
- The work is published on arXiv with identifier 2605.28035.
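To make the question-answering setup concrete, the sketch below shows one way results could be tallied per failure category. It is an illustration only: the record layout (`QAInstance` and its fields), the exact-match scoring rule, and the function name `accuracy_by_category` are assumptions made for this example and are not taken from the benchmark. Only the four category names come from the summary above.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four top-level failure categories named in the summary.
CATEGORIES = ("acting", "narrative", "atmosphere", "audio-visual language")


@dataclass(frozen=True)
class QAInstance:
    """Hypothetical record layout; the benchmark's real schema may differ."""
    video_id: str
    category: str   # one of CATEGORIES
    question: str
    expected: str   # reference answer
    predicted: str  # answer produced by the evaluator under test


def accuracy_by_category(instances):
    """Return {category: fraction of instances answered correctly}.

    Scoring here is case-insensitive exact match, an assumption chosen
    for simplicity rather than the benchmark's documented rule.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for inst in instances:
        if inst.category not in CATEGORIES:
            raise ValueError(f"unknown category: {inst.category!r}")
        total[inst.category] += 1
        if inst.predicted.strip().lower() == inst.expected.strip().lower():
            correct[inst.category] += 1
    # Only categories that actually occur are reported.
    return {cat: correct[cat] / total[cat] for cat in total}


if __name__ == "__main__":
    # Made-up sample records, for demonstration only.
    sample = [
        QAInstance("v1", "acting", "Does the listener react?", "yes", "yes"),
        QAInstance("v2", "acting", "Is the gaze consistent?", "no", "yes"),
        QAInstance("v3", "narrative", "Is turn order coherent?", "yes", "Yes"),
    ]
    print(accuracy_by_category(sample))
```

Reporting one score per category, rather than a single aggregate, matches the benchmark's stated diagnostic aim: a low score in one category points to a specific kind of failure.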
Entities
Institutions
- arXiv