Deep-Learning Framework Proposed for Environmental Sound Deepfake Detection
A deep-learning framework for environmental sound deepfake detection (ESDD) has been introduced, focusing on identifying whether audio recordings contain fake sound scenes or events. Extensive experiments examined how individual spectrograms, network architectures, pre-trained models, and ensembles affect ESDD performance. Results on the EnvSDD and ESDD-Challenge-TestSet benchmark datasets suggest that detecting deepfake audio for sound scenes and for sound events should be treated as separate tasks, and that fine-tuning a pre-trained model is more effective than training from scratch. The best-performing model was fine-tuned from the pre-trained WavLM model using a proposed three-stage training strategy. The work addresses the growing concern over audio deepfakes in environmental contexts and contributes to audio forensics by providing a specialized framework for environmental sound verification. The paper is available on arXiv under the identifier 2604.19652v1 as a cross-listed announcement.
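The summary does not specify which spectrogram front-ends were compared. As a minimal, hedged illustration of the kind of feature such experiments typically start from, the sketch below computes an STFT magnitude spectrogram in plain NumPy (window size, hop length, and the test tone are illustrative choices, not the paper's settings):

```python
import numpy as np

def stft_spectrogram(signal, n_fft=512, hop=256):
    """Magnitude spectrogram via a short-time Fourier transform (Hann window)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # rfft keeps only the non-negative frequency bins: n_fft // 2 + 1 rows.
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq_bins, n_frames)

# Toy input: one second of a 1 kHz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = stft_spectrogram(np.sin(2 * np.pi * 1000 * t))
print(spec.shape)  # (257, 61)
```

With a 512-point FFT at 16 kHz, each frequency bin spans 31.25 Hz, so the 1 kHz tone peaks in bin 32; a detection model would consume such a time-frequency matrix (often log-compressed or mel-warped) as its input.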
Key facts
- A deep-learning framework for environmental sound deepfake detection (ESDD) is proposed.
- Extensive experiments were conducted using spectrograms, network architectures, and pre-trained models.
- Benchmark datasets used include EnvSDD and ESDD-Challenge-TestSet.
- Detecting deepfake audio for sound scenes and for sound events should be treated as separate tasks.
- Fine-tuning a pre-trained model is more effective than training from scratch for ESDD.
- The best model was fine-tuned from the pre-trained WavLM model.
- A three-stage training strategy was proposed for the model.
- The paper is available on arXiv under the identifier 2604.19652v1.
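The three-stage training strategy itself is not detailed in this summary. A common pattern when fine-tuning a pre-trained encoder such as WavLM is a staged unfreezing schedule; the sketch below is an assumption for illustration only (the group names and stage contents are hypothetical, not the authors' recipe):

```python
def trainable_groups(stage):
    """Hypothetical 3-stage fine-tuning schedule: which parameter groups are
    updated at each stage. Illustrative assumption, not the paper's method."""
    schedule = {
        1: {"classifier_head"},                        # warm-up: backbone frozen
        2: {"classifier_head", "top_encoder_layers"},  # partial unfreezing
        3: {"classifier_head", "top_encoder_layers", "full_backbone"},  # full fine-tune
    }
    return schedule[stage]

for stage in (1, 2, 3):
    print(stage, sorted(trainable_groups(stage)))
```

Each stage would typically run for a fixed number of epochs, often with a lower learning rate once more of the backbone is unfrozen.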
Entities
Institutions
- arXiv