EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence
EAD-Net (Emotion-Aware Diffusion Network) is a newly proposed model for generating emotionally expressive talking head videos. The system addresses three key challenges in current methods: the insufficient semantic information carried by simple emotion labels, the lip-sync degradation that arises when high-level semantics are introduced, and poor temporal coherence in long videos. EAD-Net incorporates SyncNet supervision and Temporal Representation Alignment (TREPA) to maintain lip synchronization during multi-modal fusion, and a Spatio-Temporal Directional Attention (STDA) mechanism to model complex spatio-temporal dependencies in long sequences. The paper is listed on arXiv (2604.23325) as a cross-listing.
Key facts
- EAD-Net stands for Emotion-Aware Diffusion Network
- It generates talking head videos with emotional facial expressions and accurate lip synchronization
- Current methods rely on simple emotional labels with insufficient semantic information
- High-level semantics improve expressiveness but cause lip-sync degradation
- SyncNet supervision and TREPA mitigate the lip-sync degradation (sketched above)
- The STDA mechanism captures spatio-temporal dependencies in long videos (see the sketch after this list)
- The paper is available on arXiv with ID 2604.23325
- The arXiv announcement type is cross (a cross-listing)