Rank-Aware Fusion Improves Blended Emotion Recognition
This work proposes a rank-aware multi-encoder framework for blended emotion recognition that selectively fuses features from the most informative of a set of pre-extracted video and audio encoders. The method projects heterogeneous encoder features into a shared latent space, estimates sample-wise encoder importance with an attention-based gating module, and fuses only the top-n encoders. It decouples prediction into presence and salience heads, which are aligned through probability-level fusion, and incorporates feature-level unsupervised domain adaptation (without pseudo-labeling) for robustness. On the BlEmoRE challenge, it outperforms strong individual encoders and naive multi-encoder baselines.
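The projection, gating, and top-n selection steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product gate, the softmax renormalization over the kept encoders, the random weights, and every name in it (`top_n_fusion`, `gate_vector`, `encoder_dims`, `shared_dim`) are assumptions made for illustration. A trained system would learn the projections and the gate.

```python
import math
import random

def linear(x, W, b):
    """Apply y = W x + b to a single feature vector x."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def init_linear(out_dim, in_dim, rng):
    """Random projection weights (hypothetical stand-in for learned ones)."""
    W = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return W, [0.0] * out_dim

def top_n_fusion(features, projections, gate_vector, n):
    """Fuse only the n highest-weighted encoders for one sample.

    features:    one feature vector per encoder (dimensions may differ)
    projections: one (W, b) pair per encoder, mapping into the shared space
    gate_vector: assumed scoring vector in the shared space
    """
    # 1. Project heterogeneous features into the shared latent space.
    latents = [linear(x, W, b) for x, (W, b) in zip(features, projections)]
    # 2. Score each encoder for this sample and normalize to importance weights.
    scores = [sum(q * z for q, z in zip(gate_vector, lat)) for lat in latents]
    weights = softmax(scores)
    # 3. Rank encoders by weight and keep the top n.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    keep = ranked[:n]
    # 4. Renormalize over the kept encoders and take the weighted sum.
    mass = sum(weights[i] for i in keep)
    kept_weights = [weights[i] / mass for i in keep]
    fused = [
        sum(w * latents[i][k] for w, i in zip(kept_weights, keep))
        for k in range(len(gate_vector))
    ]
    return fused, keep, kept_weights

rng = random.Random(0)
encoder_dims = [8, 12, 6, 10]   # hypothetical feature sizes for four encoders
shared_dim = 4
projections = [init_linear(shared_dim, dim, rng) for dim in encoder_dims]
gate_vector = [rng.gauss(0.0, 1.0) for _ in range(shared_dim)]
features = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for dim in encoder_dims]

fused, keep, kept_weights = top_n_fusion(features, projections, gate_vector, n=2)
print("kept encoders:", keep)
print("renormalized weights:", kept_weights)
```

Because the weights are computed per sample, different inputs can select different encoder subsets, which is the point of ranking before fusing rather than averaging all encoders.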
Key facts
- Proposed rank-aware multi-encoder framework for blended emotion recognition
- Selectively fuses top-n most informative pre-extracted video and audio encoders
- Projects heterogeneous encoder features into a shared latent space
- Estimates sample-wise encoder importance via an attention-based gating module
- Decouples prediction into presence and salience heads
- Aligns heads through probability-level fusion
- Incorporates feature-level unsupervised domain adaptation without pseudo-labeling
- Outperforms strong individual encoders and naive multi-encoder baselines on the BlEmoRE challenge
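The summary does not say how the presence and salience heads are combined, so the sketch below shows only one plausible reading of probability-level fusion. It assumes the presence head gives an independent sigmoid probability per emotion, the salience head gives a softmax distribution over emotions, and the two are multiplied and renormalized. The function name `fuse_heads`, the three placeholder labels, and the logit values are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_heads(presence_logits, salience_logits):
    """Combine the two heads at the probability level (assumed scheme).

    presence: is each emotion present at all? (independent sigmoids)
    salience: how much of the blend does each emotion take? (softmax)
    The fused salience down-weights emotions the presence head rejects.
    """
    presence = [sigmoid(v) for v in presence_logits]
    salience = softmax(salience_logits)
    joint = [p * s for p, s in zip(presence, salience)]
    total = sum(joint)
    fused_salience = [j / total for j in joint]
    return presence, fused_salience

labels = ["emotion_a", "emotion_b", "emotion_c"]   # placeholder label names
presence_logits = [2.0, -3.0, 1.0]                  # made-up head outputs
salience_logits = [0.5, 0.2, 1.5]

presence, fused_salience = fuse_heads(presence_logits, salience_logits)
for name, p, s in zip(labels, presence, fused_salience):
    print(f"{name}: presence={p:.3f} fused_salience={s:.3f}")
```

Under this assumed scheme, an emotion with a low presence probability keeps only a small share of the fused salience, so the two heads cannot disagree sharply about which emotions make up the blend.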
Entities
—