ARTFEED — Contemporary Art Intelligence

MTAVG-Bench 2.0: Benchmark for Cinematic Expressiveness in Multi-Talker Video Generation

other · 2026-05-28

Researchers have introduced MTAVG-Bench 2.0, a benchmark designed to diagnose failure modes of cinematic expressiveness in multi-talker audio-video generation (MTAVG) models. Current models show promising performance on fundamental metrics such as lip-sync and audio-visual alignment, but those metrics are insufficient for assessing higher-level cinematic qualities in multi-character scenes. The benchmark targets short-drama and scene-level generation and establishes a taxonomy of failures across four areas: acting, narrative, atmosphere, and audio-visual language. It includes more than 10,000 question-answering evaluation instances. The work is published on arXiv under the identifier 2605.28035.

Key facts

  • MTAVG-Bench 2.0 is a new benchmark for diagnosing failure modes of cinematic expressiveness in multi-talker audio-video generation (MTAVG).
  • Current models show promising performance on fundamental metrics such as lip-sync and audio-visual alignment.
  • Those fundamental metrics are insufficient for assessing cinematic expressiveness in scene-level generation.
  • The benchmark targets short-drama and scene-level generation.
  • It establishes a high-level failure taxonomy covering acting, narrative, atmosphere, and audio-visual language.
  • The benchmark includes more than 10,000 question-answering evaluation instances.
  • The work is published on arXiv with identifier 2605.28035.
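
To make the idea of a question-answering benchmark organized by failure category more concrete, here is a minimal sketch of how per-category accuracy could be tallied. It is not the authors' evaluation code: the `QAInstance` fields, the exact-match scoring rule, and the `per_category_accuracy` helper are all assumptions made for illustration. Only the four category names come from the article.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four top-level failure categories named in the article.
CATEGORIES = ("acting", "narrative", "atmosphere", "audio-visual language")


@dataclass
class QAInstance:
    """One hypothetical evaluation item; every field name is an assumption."""
    category: str      # expected to be one of CATEGORIES
    question: str
    gold_answer: str   # reference answer
    model_answer: str  # answer being scored


def per_category_accuracy(instances):
    """Return {category: exact-match accuracy}, omitting categories with no items."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in instances:
        if item.category not in CATEGORIES:
            raise ValueError(f"unknown category: {item.category!r}")
        total[item.category] += 1
        # Assumed scoring rule: case-insensitive exact match after trimming.
        if item.model_answer.strip().lower() == item.gold_answer.strip().lower():
            correct[item.category] += 1
    return {cat: correct[cat] / total[cat] for cat in CATEGORIES if total[cat]}


# Toy data for illustration only; these are not real benchmark items.
toy_instances = [
    QAInstance("acting", "Does the speaker's expression match the line?", "yes", "Yes"),
    QAInstance("acting", "Do listeners react to the speaker?", "no", "yes"),
    QAInstance("narrative", "Which option describes the scene's turning point?", "B", "b "),
]
toy_scores = per_category_accuracy(toy_instances)
```

Reporting one score per category, rather than a single aggregate, is what lets this kind of benchmark point to where a model fails (for example, acting versus narrative) instead of only how often.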

Entities

Institutions

  • arXiv
