AST Framework Enables Training-Free Precise Speech Editing
A framework named AST (Adaptive, Seamless, and Training-free) has been proposed for precise speech editing. It addresses the shortcomings of existing text-based speech editing methods, which typically require costly task-specific training and can degrade temporal fidelity in unedited regions. AST builds on a pre-trained autoregressive Text-to-Speech (TTS) model and introduces Latent Recomposition, which selectively stitches preserved source segments together with newly synthesized target segments; it also supports precise style editing of specific speech segments. To suppress artifacts at edit boundaries, AST applies Adaptive Weak Fact Guidance (AWFG), which dynamically modulates a mel-space guidance signal to maintain structural coherence. The goal is to modify targeted speech segments while preserving the speaker's identity and acoustic context, mitigating the usual trade-off between edit quality and consistency in TTS-based approaches. The work is described in arXiv preprint 2604.16056v1.
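The core idea of Latent Recomposition, splicing preserved source segments around a newly generated span, can be sketched as a simple frame-level concatenation. This is an illustrative reading, not the paper's implementation; the function and parameter names (`recompose_latents`, `edit_start`, `edit_end`) are hypothetical, and real systems would also handle alignment and smoothing at the seams.

```python
import numpy as np

def recompose_latents(source_latents, generated_latents, edit_start, edit_end):
    """Splice newly generated latent frames into the preserved source
    sequence. edit_start/edit_end index the source frames being replaced.
    All names here are illustrative, not taken from the paper."""
    prefix = source_latents[:edit_start]   # preserved frames before the edit
    suffix = source_latents[edit_end:]     # preserved frames after the edit
    return np.concatenate([prefix, generated_latents, suffix], axis=0)

# Example: replace frames 10..20 of a 50-frame, 64-dim latent sequence
src = np.zeros((50, 64))
gen = np.ones((14, 64))  # the new segment may differ in length from the old one
out = recompose_latents(src, gen, edit_start=10, edit_end=20)
# output length: 50 - (20 - 10) + 14 = 54 frames
```

Because only the edited span is regenerated, frames outside it are bit-identical to the source, which is what preserves temporal fidelity in untouched regions.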
Key facts
- AST is an Adaptive, Seamless, and Training-free precise speech editing framework.
- It uses a pre-trained autoregressive Text-to-Speech (TTS) model.
- Latent Recomposition selectively stitches preserved source segments with newly synthesized targets.
- The framework enables precise style editing for specific speech segments.
- Adaptive Weak Fact Guidance (AWFG) prevents artifacts at edit boundaries.
- AWFG dynamically modulates a mel-space guidance signal.
- Existing methods rely on task-specific training with high data costs.
- The research is documented in arXiv preprint 2604.16056v1.
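One plausible way to picture the boundary-artifact problem AWFG targets is a per-frame guidance weight that is strong in preserved regions and ramps down toward the interior of the edit, so the generated mel frames are pulled toward the reference only near the seams. This is a minimal sketch under that assumption; the function names, the linear ramp, and the blending rule are all hypothetical, not the paper's actual guidance mechanism.

```python
import numpy as np

def boundary_guidance_weights(n_frames, edit_start, edit_end, ramp=3):
    """Per-frame guidance weights: 1.0 in preserved regions, 0.0 deep inside
    the edit, with linear ramps at both edit boundaries. A guess at what a
    boundary-aware guidance schedule might look like; names are illustrative."""
    w = np.ones(n_frames)
    w[edit_start:edit_end] = 0.0
    for i in range(ramp):
        left = edit_start + i          # ramp in from the left boundary
        right = edit_end - 1 - i       # ramp in from the right boundary
        decay = 1.0 - (i + 1) / (ramp + 1)
        if edit_start <= left < edit_end:
            w[left] = max(w[left], decay)
        if edit_start <= right < edit_end:
            w[right] = max(w[right], decay)
    return w

def apply_guidance(pred_mel, ref_mel, weights):
    # Blend predicted mel frames toward the reference, per frame.
    return weights[:, None] * ref_mel + (1.0 - weights[:, None]) * pred_mel

w = boundary_guidance_weights(20, edit_start=5, edit_end=15, ramp=3)
# w: 1.0 for frames 0-4 and 15-19, 0.75/0.5/0.25 ramps just inside the
# edit at both ends, 0.0 in the middle of the edited span
```

The "adaptive" aspect in AWFG presumably means such weights are modulated dynamically rather than fixed; this static ramp only illustrates why weak, boundary-localized guidance can keep seams coherent without constraining the newly generated content.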