ARTFEED — Contemporary Art Intelligence

AVBench: New Benchmark for Human-Centric Audio-Video Generation Evaluation

ai-technology · 2026-05-26

Researchers have introduced AVBench, a fully automated benchmark for evaluating human-centric audio-video (AV) generation models. Existing evaluations rely on coarse-grained metrics and generic multimodal LLMs, which leads to inaccurate assessments. AVBench integrates ten evaluation dimensions covering visual quality, audio quality, and multi-level consistency across modalities, tailored to human-related scenarios such as speech and interactions. Its aim is a comprehensive and accurate evaluation of AV generation models that captures the human-centric details current benchmarks often overlook. The work is described in arXiv paper 2605.24652v1.

Key facts

  • AVBench is a fully automated benchmark for human-centric AV generation evaluation.
  • It addresses limitations of existing coarse-grained benchmarks and generic multimodal LLM evaluations.
  • The benchmark integrates ten evaluation dimensions covering visual quality, audio quality, and multi-level consistency across modalities.
  • It focuses on human-related scenarios including speech and interactions.
  • The work is presented in arXiv paper 2605.24652v1.
  • AVBench aims to provide accurate assessments of model capabilities.
  • Existing benchmarks often miss human-related details.
  • The benchmark is designed for real-world human-centered scenarios.

Entities

Institutions

  • arXiv

Sources