Self-Supervised Fusion for Deepfake Audio Detection
A new deepfake detection framework uses self-supervised fusion representations to identify manipulated audio in the CompSpoofV2 dataset. The dual-branch approach jointly models speech and environmental sound using pretrained XLS-R and BEATs encoders. A Matching Head combining statistical normalization with multi-head cross-attention enables information exchange between the speech and environmental-sound branches. The method was submitted to the ESDD2 2026 challenge.
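The paper summary above does not include code; the following is a minimal PyTorch sketch of how such a dual-branch fusion could be wired together. The module names (`DualBranchDetector`, `MatchingHead`), the 256-dimensional shared space, and the use of LayerNorm as the statistical normalization are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MatchingHead(nn.Module):
    """Sketch of the Matching Head: statistical normalization of each branch,
    then multi-head cross-attention with a residual connection. Exact
    internals here are assumptions, not the authors' design."""

    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.norm_speech = nn.LayerNorm(dim)  # assumed form of the statistical normalization
        self.norm_env = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, speech: torch.Tensor, env: torch.Tensor) -> torch.Tensor:
        s, e = self.norm_speech(speech), self.norm_env(env)
        # Speech frames query the environmental-sound frames ...
        attended, _ = self.cross_attn(query=s, key=e, value=e)
        # ... and a residual connection feeds the result back to the speech branch.
        return s + attended


class DualBranchDetector(nn.Module):
    """Projects precomputed XLS-R (speech) and BEATs (environment) features
    into a shared space, exchanges information via the Matching Head, and
    scores the utterance as bona fide vs. spoofed."""

    def __init__(self, speech_dim: int = 1024, env_dim: int = 768, dim: int = 256):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, dim)  # 1024 = XLS-R-300M hidden size
        self.env_proj = nn.Linear(env_dim, dim)        # 768 = BEATs hidden size
        self.matching = MatchingHead(dim)
        self.classifier = nn.Linear(dim, 2)

    def forward(self, speech_feats: torch.Tensor, env_feats: torch.Tensor) -> torch.Tensor:
        s = self.speech_proj(speech_feats)         # (B, T_s, dim)
        e = self.env_proj(env_feats)               # (B, T_e, dim)
        fused = self.matching(s, e)                # (B, T_s, dim)
        return self.classifier(fused.mean(dim=1))  # utterance-level logits


# Dummy usage with random tensors standing in for real encoder outputs.
model = DualBranchDetector()
logits = model(torch.randn(2, 100, 1024), torch.randn(2, 50, 768))
print(logits.shape)  # torch.Size([2, 2])
```

Treating the pretrained encoders as frozen feature extractors and fusing only their outputs is one common design choice for this kind of dual-branch detector; the paper may instead fine-tune the branches end to end.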
Key facts
- Submission to Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2) 2026
- Uses CompSpoofV2 dataset
- Dual-branch framework for speech and environmental sound
- Pretrained XLS-R for speech, BEATs for environmental sound (feature extraction sketched after this list)
- Matching Head with statistical normalization and representation interaction
- Multi-head cross-attention for information exchange
- Residual connections used in processing
- Addresses component-level deepfake detection
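As a concrete starting point for the speech branch, XLS-R checkpoints are available through Hugging Face transformers; the snippet below extracts frame-level features from a dummy waveform. The 300M variant is an assumption, since the summary does not state the checkpoint size. BEATs checkpoints are distributed via Microsoft's unilm repository rather than transformers, so the environmental branch is not shown.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Load a pretrained XLS-R encoder (checkpoint size is an assumption).
name = "facebook/wav2vec2-xls-r-300m"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
xlsr = Wav2Vec2Model.from_pretrained(name).eval()

waveform = torch.randn(16000)  # one second of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    speech_feats = xlsr(**inputs).last_hidden_state  # shape (1, T, 1024)
```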