MTAVG-Bench 2.0: Benchmark for Cinematic Expressiveness in Multi-Talker Video Generation

other · 2026-05-28

Researchers have introduced MTAVG-Bench 2.0, a benchmark designed to diagnose failure modes of cinematic expressiveness in multi-talker audio-video generation (MTAVG) models. Current models show promising performance on fundamental metrics such as lip-sync and audio-visual alignment, but those metrics are insufficient for assessing higher-level cinematic qualities in multi-character scenes. The new benchmark targets short-drama and scene-level generation and establishes a high-level failure taxonomy covering acting, narrative, atmosphere, and audio-visual language. It includes more than 10,000 question-answering evaluation instances. The work is published on arXiv under the identifier 2605.28035.

Key facts

MTAVG-Bench 2.0 is a new benchmark for cinematic expressiveness in multi-talker audio-video generation.
Current models show promising performance on fundamental metrics like lip-sync and audio-visual alignment.
Existing metrics are insufficient for assessing cinematic expressiveness in scene-level generation.
The benchmark targets short-drama and scene-level generation.
It establishes a high-level failure taxonomy covering acting, narrative, atmosphere, and audio-visual language.
The benchmark includes more than 10,000 question-answering evaluation instances.
The work is published on arXiv with identifier 2605.28035.
The benchmark diagnoses failure modes of cinematic expressiveness.
