Rank-Aware Fusion Improves Blended Emotion Recognition

ai-technology · 2026-05-22

A new multi-encoder framework for blended emotion recognition selectively fuses pre-extracted features from the most informative video and audio encoders. The method projects heterogeneous encoder features into a shared latent space, estimates per-sample encoder importance with an attention-based gating module, and fuses only the top-n encoders. It splits prediction into separate presence and salience heads, aligned through probability-level fusion, and adds feature-level unsupervised domain adaptation, without pseudo-labeling, for robustness. In experiments on the BlEmoRE challenge, it outperforms strong individual encoders and naive multi-encoder baselines.

Key facts

Proposed rank-aware multi-encoder framework for blended emotion recognition
Selectively fuses top-n most informative pre-extracted video and audio encoders
Projects heterogeneous encoder features into a shared latent space
Estimates sample-wise encoder importance via attention-based gating module
Decouples prediction into presence and salience heads
Aligns heads through probability-level fusion
Incorporates feature-level unsupervised domain adaptation without pseudo-labeling
Outperforms strong individual encoders and naive multi-encoder baselines on BlEmoRE challenge
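
Illustrative sketch

The projection, gating, and top-n selection steps listed above could look roughly like the NumPy sketch below. It is not the authors' implementation: the function name rank_aware_fuse, the tanh projection, the single scoring vector gate_w, the softmax over the selected encoders, and all dimensions are assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rank_aware_fuse(features, projections, gate_w, top_n):
    """Hypothetical top-n gated fusion (illustrative only).

    features:    list of K arrays, each (batch, d_k), one per encoder
    projections: list of K matrices, each (d_k, d), mapping to a shared space
    gate_w:      (d,) scoring vector used by the gating step (assumed form)
    top_n:       number of encoders kept per sample (1 <= top_n <= K)
    """
    # Project every encoder into the shared latent space: (batch, K, d).
    latent = np.stack(
        [np.tanh(f @ p) for f, p in zip(features, projections)], axis=1
    )
    # One importance score per sample and encoder: (batch, K).
    scores = latent @ gate_w
    # Indices of the top_n highest-scoring encoders for each sample.
    top_idx = np.argsort(-scores, axis=1)[:, :top_n]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=1)
    # Softmax restricted to the selected encoders; the rest get weight 0.
    weights = softmax(np.where(mask, scores, -np.inf), axis=1)
    # Weighted sum over encoders: (batch, d).
    fused = (weights[:, :, None] * latent).sum(axis=1)
    return fused, weights

# Toy usage: 3 encoders with different feature sizes, batch of 4 samples.
rng = np.random.default_rng(0)
dims, d, batch = [8, 16, 32], 6, 4
features = [rng.normal(size=(batch, dk)) for dk in dims]
projections = [rng.normal(size=(dk, d)) * 0.1 for dk in dims]
gate_w = rng.normal(size=d)
fused, weights = rank_aware_fuse(features, projections, gate_w, top_n=2)
```

Because the selection is made per sample, different clips can draw on different encoders, which is the property the sample-wise gating described above is meant to provide.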

Entities

—

Sources

arXiv cs.AI — 2026-05-21