New Framework Targets Audio-Visual Deepfake Detection in Singing
Researchers have identified a critical weakness in existing audio-visual deepfake detection methods: their performance degrades substantially when applied to singing. These detectors work by looking for inconsistencies between the audio and video streams. In talking, lip movements and audio are tightly synchronized, but singing involves rhythmic vocalization that weakens this cross-modal coupling, producing a domain shift between talking and singing content. To address this, the team constructed the Singing Head DeepFake (SHDF) dataset using rhythm-aware generative models, filling a gap in available benchmarks. They also propose a Text-guided Audio-Visual Forgery Detection (T-AVFD) framework designed to generalize across both talking and singing scenarios. T-AVFD has two components: a facial authenticity pattern learner that aligns facial features with multi-granularity textual descriptions, and a multi-modal differential weight learning module that preserves intrinsic features. The work uses textual guidance to learn authenticity patterns that generalize across both kinds of content. The paper is available on arXiv under identifier 2605.27944.
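To make the idea of cross-modal coupling concrete, the following is a minimal sketch, not the method from the paper. It assumes coupling can be approximated by the correlation between a per-frame lip-opening signal and an audio energy envelope. The function name `sync_score` and all signals are hypothetical toy constructions chosen for illustration only; they are not drawn from SHDF or any real recording.

```python
import numpy as np

def sync_score(lip_opening, audio_energy):
    """Pearson correlation between two equal-length 1-D signals.

    Hypothetical helper: a crude stand-in for audio-visual coupling.
    """
    lip = np.asarray(lip_opening, dtype=float)
    aud = np.asarray(audio_energy, dtype=float)
    lip = lip - lip.mean()
    aud = aud - aud.mean()
    denom = np.linalg.norm(lip) * np.linalg.norm(aud)
    return float(lip @ aud / denom) if denom > 0 else 0.0

rng = np.random.default_rng(0)
t = np.linspace(0.0, 4.0, 200)                # 4 seconds of toy frames
syllables = np.abs(np.sin(2 * np.pi * 2.0 * t))  # toy syllable pattern

# Toy "talking": audio energy closely follows mouth opening.
talk_lip = syllables
talk_audio = syllables + 0.05 * rng.standard_normal(t.size)

# Toy "singing" (assumption): the mouth stays mostly open on sustained
# notes while audio energy follows a slower melodic contour.
sing_lip = 0.8 + 0.05 * syllables
sing_audio = 0.6 + 0.3 * np.sin(2 * np.pi * 0.5 * t) \
    + 0.05 * rng.standard_normal(t.size)

talk_score = sync_score(talk_lip, talk_audio)
sing_score = sync_score(sing_lip, sing_audio)
print(f"talking coupling: {talk_score:.2f}, singing coupling: {sing_score:.2f}")
```

In this toy setup the talking signals are strongly correlated and the singing signals are not, which is the kind of weakened coupling that would leave a synchronization-based detector with less evidence to work from.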
Key facts
- Existing audio-visual deepfake detection methods rely on detecting inconsistencies between the audio and video streams.
- Singing weakens the coupling between audio and video, causing a domain shift.
- Detection performance degrades substantially for singing content.
- The Singing Head DeepFake (SHDF) dataset was created using rhythm-aware generative models.
- The T-AVFD framework is proposed to handle both talking and singing scenarios.
- T-AVFD includes a facial authenticity pattern learner and a multi-modal differential weight learning module.
- The pattern learner aligns facial features with multi-granularity textual descriptions.
- The paper is available on arXiv with ID 2605.27944.
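The alignment of facial features with multi-granularity textual descriptions can be pictured with a small sketch. This is an illustrative assumption, not the T-AVFD implementation: it supposes a CLIP-style setup in which a face feature is compared by cosine similarity against text embeddings for "real" and "fake" descriptions at two granularity levels, and the similarities are averaged and passed through a softmax. The names `alignment_probs`, `coarse`, and `fine`, the temperature value, and the hand-made vectors are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def alignment_probs(face_feat, text_embs_by_level, temperature=0.07):
    """Softmax over classes of the mean cosine similarity across levels.

    face_feat: shape (d,)
    text_embs_by_level: dict mapping level name -> array (num_classes, d)
    Hypothetical helper; the temperature is an assumed value.
    """
    face = l2_normalize(face_feat)
    sims = [l2_normalize(t) @ face for t in text_embs_by_level.values()]
    logits = np.mean(sims, axis=0) / temperature
    logits = logits - logits.max()      # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Hand-made stand-ins for text embeddings (row 0 = "real", row 1 = "fake").
text_embs = {
    "coarse": np.eye(4)[[0, 1]],   # e.g. a whole-face description
    "fine": np.eye(4)[[2, 3]],     # e.g. a mouth-region description
}

# Toy face feature placed close to the "real" descriptions at both levels.
face = np.array([0.9, 0.1, 0.8, 0.2])

probs = alignment_probs(face, text_embs)
print(dict(zip(["real", "fake"], probs.round(4))))
```

Averaging over granularity levels means a feature must agree with both the coarse and the fine description to score highly for a class. Because this comparison does not use the audio stream, it does not depend on audio-visual synchronization.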
Entities
Institutions
- arXiv