EmoMM Benchmark Reveals Video Contribution Collapse in MLLMs for Emotion Recognition
Researchers have introduced EmoMM, a benchmark for multimodal emotion recognition (MER) that targets challenges such as modality conflict and missing modalities. Their analysis uncovers a failure mode they term Video Contribution Collapse (VCC), in which multimodal large language models (MLLMs) underweight video evidence because of high video-token redundancy and inherent modality preferences. To counter this, they propose CHASE (Conflict-aware Head-level Attention Steering), a lightweight mechanism that detects modality conflicts and steers head-level attention at inference time, without retraining the backbone model. Experiments show that CHASE consistently improves performance on EmoMM.
Key facts
- EmoMM is a benchmark for multimodal emotion recognition.
- It includes modality-aligned, modality-conflict, and modality-missing subsets.
- Video Contribution Collapse (VCC) occurs in MLLMs.
- VCC is driven by high video-token redundancy and inherent modality preferences.
- CHASE is a lightweight head-level attention steering mechanism (sketched after this list).
- CHASE detects modality conflicts at inference time.
- CHASE does not require retraining the backbone model.
- CHASE consistently improves performance on EmoMM.
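The summary describes CHASE only at a high level, so the PyTorch sketch below is a hedged illustration of the general pattern, detect a collapse in attention to video tokens, then steer attention in selected heads at inference time, not the authors' implementation. The function names, the collapse threshold `tau`, the `boost` factor, and the per-head sensitivity scores `head_scores` are all illustrative assumptions.

```python
import math
import torch

def detect_video_collapse(attn_probs, video_mask, tau=0.2):
    """Flag a collapse case: the share of attention mass landing on video
    tokens falls below tau (the threshold here is an illustrative choice).

    attn_probs: (num_heads, q_len, kv_len) post-softmax attention weights
    video_mask: (kv_len,) bool, True where the key token comes from video
    """
    video_share = attn_probs[..., video_mask].sum(-1).mean()
    return video_share.item() < tau

def steer_video_heads(attn_logits, video_mask, head_scores, top_k=4, boost=2.0):
    """Additively boost video-token logits in the top-k most video-sensitive
    heads; the subsequent softmax renormalizes, shifting attention mass back
    toward video without touching any model weights.

    attn_logits: (num_heads, q_len, kv_len) pre-softmax attention scores
    head_scores: (num_heads,) per-head video sensitivity, assumed measured
                 offline (e.g. by ablating video tokens and scoring the drop)
    """
    steered = attn_logits.clone()
    for h in torch.topk(head_scores, k=top_k).indices.tolist():
        steered[h, :, video_mask] += math.log(boost)
    return steered

# Toy usage: 8 heads, 1 query position, 16 key tokens (first 10 are video).
num_heads, q_len, kv_len = 8, 1, 16
logits = torch.randn(num_heads, q_len, kv_len)
video_mask = torch.zeros(kv_len, dtype=torch.bool)
video_mask[:10] = True
head_scores = torch.rand(num_heads)  # placeholder sensitivities

probs = logits.softmax(-1)
if detect_video_collapse(probs, video_mask):
    logits = steer_video_heads(logits, video_mask, head_scores)
    probs = logits.softmax(-1)  # attention now leans more on video tokens
```

Because the adjustment is an additive shift to pre-softmax logits in a handful of heads, the base model's weights stay untouched, which matches the summary's claim that CHASE requires no retraining.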