KARMA-MV: Benchmarking Causal Reasoning in Music Videos
KARMA-MV is a new benchmark built from 2,682 YouTube music videos. The multiple-choice QA dataset evaluates models on combining temporal audio-visual cues and reasoning about how visuals influence the music, through reasoning, prediction, and counterfactual questions. Unlike conventional datasets built from manual annotation, KARMA-MV uses LLM reasoning for scalable question generation and validation, yielding 37,737 multiple-choice questions. The authors also propose a causal knowledge graph (CKG) method that augments vision-language models (VLMs) with structured retrieval of cross-modal relationships. Experiments with state-of-the-art VLMs and LLMs show substantial gains from CKG grounding, especially for smaller models. The paper is available on arXiv under ID 2605.08175.
Key facts
- KARMA-MV is a benchmark for causal question answering on music videos.
- Derived from 2,682 YouTube music videos.
- Contains 37,737 multiple-choice questions.
- Tests reasoning, prediction, and counterfactual questions.
- Uses LLM reasoning for scalable generation and validation.
- Proposes a causal knowledge graph (CKG) approach.
- CKG augments vision-language models with structured retrieval.
- Experiments show gains from CKG grounding, especially for smaller models.
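The CKG-grounding idea in the facts above can be sketched as a retrieval step: store cross-modal causal relations as graph edges, retrieve the edges relevant to a question, and prepend them to the model's prompt. The class and field names below are hypothetical illustrations of this pattern, not the authors' implementation.

```python
# Minimal sketch of causal-knowledge-graph retrieval for grounding a VLM/LLM.
# All names (CausalEdge, CausalKnowledgeGraph, build_prompt) are illustrative
# assumptions, not from the paper.
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalEdge:
    cause: str      # e.g. a visual event: "camera cuts to drummer"
    effect: str     # e.g. an audio event: "drum fill begins"
    relation: str   # e.g. "triggers", "accompanies"


class CausalKnowledgeGraph:
    def __init__(self) -> None:
        self.edges: list[CausalEdge] = []

    def add(self, cause: str, effect: str, relation: str = "triggers") -> None:
        self.edges.append(CausalEdge(cause, effect, relation))

    def retrieve(self, query: str) -> list[CausalEdge]:
        """Return edges sharing at least one token with the query
        (a toy stand-in for real semantic retrieval)."""
        q_tokens = set(query.lower().split())
        return [e for e in self.edges
                if q_tokens & set((e.cause + " " + e.effect).lower().split())]


def build_prompt(question: str, ckg: CausalKnowledgeGraph) -> str:
    """Prepend retrieved causal relations so the model answers
    with structured cross-modal context."""
    lines = [f"- {e.cause} --{e.relation}--> {e.effect}"
             for e in ckg.retrieve(question)]
    return ("Known causal relations:\n" + "\n".join(lines)
            + f"\n\nQuestion: {question}")


ckg = CausalKnowledgeGraph()
ckg.add("camera cuts to drummer", "drum fill begins")
ckg.add("slow-motion shot", "tempo halves", "accompanies")
print(build_prompt("Why does the drum fill start here?", ckg))
```

A production version would replace the token-overlap `retrieve` with embedding-based similarity search over graph edges, but the grounding pattern (retrieve structured relations, then condition the model on them) is the same.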
Entities
Institutions
- arXiv