AVBench: New Benchmark for Human-Centric Audio-Video Generation Evaluation
Researchers have introduced AVBench, a fully automated benchmark for evaluating human-centric audio-video (AV) generation models. It addresses a limitation of existing evaluations, which rely on coarse-grained metrics and generic multimodal LLMs and therefore produce inaccurate assessments. AVBench integrates ten evaluation dimensions covering visual quality, audio quality, and multi-level consistency across modalities, tailored to human-related scenarios such as speech and interactions. The goal is a comprehensive and accurate evaluation of AV generation models that captures the human-centric details current benchmarks often overlook. The work is described in arXiv paper 2605.24652v1.
Key facts
- AVBench is a fully automated benchmark for evaluating human-centric AV generation.
- It addresses limitations of existing coarse-grained benchmarks and generic multimodal LLM evaluations.
- The benchmark integrates ten evaluation dimensions for visual quality, audio quality, and multi-level consistency.
- It focuses on human-related scenarios including speech and interactions.
- The work is presented in arXiv paper 2605.24652v1.
- AVBench aims to provide accurate assessments of model capabilities.
- Existing benchmarks often miss human-related details.
- The benchmark is designed for real-world human-centered scenarios.
Entities
Institutions
- arXiv