ARTFEED — Contemporary Art Intelligence

Self-Supervised Fusion for Deepfake Audio Detection

ai-technology · 2026-05-07

A new deepfake detection framework uses self-supervised fusion representations to identify manipulated audio in the CompSpoofV2 dataset. The dual-branch approach jointly models speech and environmental sound with pretrained XLS-R and BEATs encoders. A Matching Head with statistical normalization and multi-head cross-attention enables information exchange between the two components. The method was submitted to the ESDD2 2026 challenge.

Key facts

  • Submission to Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2) 2026
  • Uses CompSpoofV2 dataset
  • Dual-branch framework for speech and environmental sound
  • Pretrained XLS-R for speech, BEATs for environmental sound
  • Matching Head with statistical normalization and representation interaction
  • Multi-head cross-attention for information exchange
  • Residual connections used in processing
  • Addresses component-level deepfake detection
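The listed ingredients (statistical normalization of branch features, multi-head cross-attention between branches, and a residual connection) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the XLS-R and BEATs frame embeddings are replaced by random stand-ins, and the head/dimension sizes are illustrative.

```python
import numpy as np

def stat_norm(x, eps=1e-6):
    # Statistical normalization: zero-mean, unit-variance per feature dimension
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def cross_attention(q_seq, kv_seq, n_heads=4):
    # One multi-head cross-attention step: queries come from one branch,
    # keys/values from the other, followed by a residual connection.
    T, d = q_seq.shape
    dh = d // n_heads
    out = np.zeros_like(q_seq)
    for h in range(n_heads):
        q = q_seq[:, h * dh:(h + 1) * dh]
        k = kv_seq[:, h * dh:(h + 1) * dh]
        v = kv_seq[:, h * dh:(h + 1) * dh]
        scores = q @ k.T / np.sqrt(dh)
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)            # softmax over key frames
        out[:, h * dh:(h + 1) * dh] = w @ v
    return q_seq + out  # residual connection

rng = np.random.default_rng(0)
speech = stat_norm(rng.normal(size=(50, 64)))  # stand-in for XLS-R speech features
env    = stat_norm(rng.normal(size=(50, 64)))  # stand-in for BEATs environment features
fused  = cross_attention(speech, env)
print(fused.shape)  # → (50, 64)
```

In a full system each branch would attend to the other (speech→environment and environment→speech) before the fused features feed the component-level spoofing classifier; the sketch shows only one direction.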

Entities

Institutions

  • arXiv
