Self-Supervised Fusion for Deepfake Audio Detection
A new deepfake detection framework uses self-supervised fusion representations to identify manipulated audio in the CompSpoofV2 dataset. The dual-branch approach jointly models speech and environmental sound using pretrained XLS-R and BEATs encoders. A Matching Head combining statistical normalization with multi-head cross-attention enables information exchange between the speech and environmental-sound branches. The method was submitted to the ESDD2 2026 challenge.
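The paper summary above does not include code; the following is a minimal PyTorch sketch of how such a dual-branch fusion could be wired together. The module names (`DualBranchDetector`, `MatchingHead`), the 256-dimensional shared space, and the use of LayerNorm as the statistical normalization are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MatchingHead(nn.Module):
    """Sketch of the Matching Head: statistical normalization of each branch,
    then multi-head cross-attention with a residual connection. Exact
    internals here are assumptions, not the authors' design."""

    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.norm_speech = nn.LayerNorm(dim)  # assumed form of the statistical normalization
        self.norm_env = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, speech: torch.Tensor, env: torch.Tensor) -> torch.Tensor:
        s, e = self.norm_speech(speech), self.norm_env(env)
        # Speech frames query the environmental-sound frames ...
        attended, _ = self.cross_attn(query=s, key=e, value=e)
        # ... and a residual connection feeds the result back to the speech branch.
        return s + attended


class DualBranchDetector(nn.Module):
    """Projects precomputed XLS-R (speech) and BEATs (environment) features
    into a shared space, exchanges information via the Matching Head, and
    scores the utterance as bona fide vs. spoofed."""

    def __init__(self, speech_dim: int = 1024, env_dim: int = 768, dim: int = 256):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, dim)  # 1024 = XLS-R-300M hidden size
        self.env_proj = nn.Linear(env_dim, dim)        # 768 = BEATs hidden size
        self.matching = MatchingHead(dim)
        self.classifier = nn.Linear(dim, 2)

    def forward(self, speech_feats: torch.Tensor, env_feats: torch.Tensor) -> torch.Tensor:
        s = self.speech_proj(speech_feats)         # (B, T_s, dim)
        e = self.env_proj(env_feats)               # (B, T_e, dim)
        fused = self.matching(s, e)                # (B, T_s, dim)
        return self.classifier(fused.mean(dim=1))  # utterance-level logits


# Dummy usage with random tensors standing in for real encoder outputs.
model = DualBranchDetector()
logits = model(torch.randn(2, 100, 1024), torch.randn(2, 50, 768))
print(logits.shape)  # torch.Size([2, 2])
```

Treating the pretrained encoders as frozen feature extractors and fusing only their outputs is one common design choice for this kind of dual-branch detector; the paper may instead fine-tune the branches end to end.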
Key facts
- Submission to Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2) 2026
- Uses CompSpoofV2 dataset
- Dual-branch framework for speech and environmental sound
- Pretrained XLS-R for speech, BEATs for environmental sound (feature extraction sketched after this list)
- Matching Head with statistical normalization and representation interaction
- Multi-head cross-attention for information exchange
- Residual connections used in processing
- Addresses component-level deepfake detection
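As a concrete starting point for the speech branch, XLS-R checkpoints are available through Hugging Face transformers; the snippet below extracts frame-level features from a dummy waveform. The 300M variant is an assumption, since the summary does not state the checkpoint size. BEATs checkpoints are distributed via Microsoft's unilm repository rather than transformers, so the environmental branch is not shown.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Load a pretrained XLS-R encoder (checkpoint size is an assumption).
name = "facebook/wav2vec2-xls-r-300m"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
xlsr = Wav2Vec2Model.from_pretrained(name).eval()

waveform = torch.randn(16000)  # one second of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    speech_feats = xlsr(**inputs).last_hidden_state  # shape (1, T, 1024)
```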