AVBench: New Benchmark for Human-Centric Audio-Video Generation Evaluation

ai-technology · 2026-05-26

Researchers have introduced AVBench, a fully automated benchmark for evaluating human-centric audio-video (AV) generation models. Existing evaluations rely on coarse-grained metrics and generic multimodal LLMs, which leads to inaccurate assessments. AVBench integrates ten evaluation dimensions covering visual quality, audio quality, and multi-level consistency across modalities, tailored to human-related scenarios such as speech and interactions. It aims to give a comprehensive and accurate evaluation of AV generation models, with attention to the human-centric details that current benchmarks often overlook. The work is described in arXiv paper 2605.24652v1.

Key facts

AVBench is a fully automated benchmark for human-centric AV generation evaluation.
It addresses limitations of existing coarse-grained benchmarks and generic multimodal LLM evaluations.
The benchmark integrates ten evaluation dimensions for visual quality, audio quality, and multi-level consistency.
It focuses on human-related scenarios including speech and interactions.
The work is presented in arXiv paper 2605.24652v1.
AVBench aims to provide accurate assessments of model capabilities.
Existing benchmarks often miss human-related details.
The benchmark is designed for real-world human-centered scenarios.
