New Framework Targets Audio-Visual Deepfake Detection in Singing

ai-technology · 2026-05-28

Researchers have identified a critical weakness in existing audio-visual deepfake detection methods: their performance degrades substantially when applied to singing. These methods rely on spotting inconsistencies between audio and video. In talking, lip movements and audio are tightly synchronized, but singing involves rhythmic vocalization that weakens this cross-modal coupling, so the cue the detectors depend on becomes less reliable. To address this, the team constructed the Singing Head DeepFake (SHDF) dataset using rhythm-aware generative models, filling a gap in available benchmarks. They also proposed a Text-guided Audio-Visual Forgery Detection (T-AVFD) framework designed to generalize across both talking and singing scenarios. T-AVFD includes a facial authenticity pattern learner that aligns facial features with multi-granularity textual descriptions, and a multi-modal differential weight learning module that preserves intrinsic features. The work frames the failure on singing as a domain shift problem in deepfake detection and uses textual guidance to learn generalizable authenticity patterns. The paper is available on arXiv under identifier 2605.27944.
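
To make the idea of cross-modal coupling concrete, here is a minimal sketch, not the paper's method. It assumes a detector that scores synchronization as the Pearson correlation between a per-frame mouth-openness signal and a per-frame audio-energy signal. The function names and the toy numbers are invented for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no synchronization cue
    return cov / math.sqrt(var_x * var_y)


def sync_score(mouth_openness, audio_energy):
    """Hypothetical synchronization cue: higher means tighter coupling."""
    return pearson(mouth_openness, audio_energy)


# Toy, hand-made signals (assumptions, not measured data).
# Talking: the mouth opens and closes with each burst of audio energy.
talking_mouth = [0.1, 0.8, 0.2, 0.9, 0.1, 0.7]
talking_audio = [0.2, 0.9, 0.3, 1.0, 0.2, 0.8]

# Singing: the mouth stays fairly open on sustained notes while the
# audio energy follows the rhythm of the song.
singing_mouth = [0.6, 0.7, 0.6, 0.7, 0.6, 0.7]
singing_audio = [0.9, 0.3, 0.8, 0.9, 0.2, 0.7]

talking_sync = sync_score(talking_mouth, talking_audio)
singing_sync = sync_score(singing_mouth, singing_audio)
```

On these toy signals the talking pair is strongly correlated and the singing pair is close to uncorrelated, even though both are meant to stand for genuine footage. A detector thresholding on such a cue could therefore misjudge real singing, which is the kind of domain shift the article describes.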

Key facts

Existing audio-visual deepfake detection methods rely on cross-modal inconsistencies.
Singing weakens the coupling between audio and video, causing a domain shift.
Detection performance degrades substantially for singing content.
The Singing Head DeepFake (SHDF) dataset was created using rhythm-aware generative models.
The T-AVFD framework is proposed to handle both talking and singing scenarios.
T-AVFD includes a facial authenticity pattern learner and a multi-modal differential weight learning module.
The pattern learner aligns facial features with multi-granularity textual descriptions.
The paper is available on arXiv with ID 2605.27944.
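
The article does not say how the pattern learner aligns facial features with text. One common way to do this kind of alignment is CLIP-style cosine similarity between a visual feature and text-prompt embeddings. The sketch below assumes that approach purely for illustration; the embeddings, labels, and function names are hypothetical, and it ignores the multi-granularity aspect and the differential weight learning module.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def softmax(scores, temperature=0.1):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def authenticity_probs(face_feature, text_embeddings):
    """Score a facial feature against each textual description.

    text_embeddings maps a label (e.g. "real", "fake") to the embedding
    of its description. In practice these would come from a text
    encoder; here they are made-up vectors.
    """
    labels = list(text_embeddings)
    sims = [cosine(face_feature, text_embeddings[label]) for label in labels]
    return dict(zip(labels, softmax(sims)))


# Hypothetical embeddings of "a genuine face" / "a forged face" prompts.
text_embeddings = {
    "real": [1.0, 0.0, 0.2],
    "fake": [0.0, 1.0, 0.2],
}
face_feature = [0.9, 0.1, 0.2]  # hypothetical visual feature

probs = authenticity_probs(face_feature, text_embeddings)
```

Because the decision here depends on how a face matches a description rather than on lip-audio timing, a cue of this kind would plausibly be less sensitive to whether the subject is talking or singing, which fits the generalization goal the article attributes to T-AVFD.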
