LATTE: A Latent Audio Tokenizer for Token-Space Editing
Researchers propose LATTE (Latent Audio Tokenizer for Token-space Editing), a neural audio codec that appends a fixed set of learnable latent tokens to audio feature sequences, creating a compact bottleneck that is not temporally aligned with the input. This design enables token-space interventions, such as swapping latent tokens between utterances to modify global attributes like speaker identity and background noise, while maintaining competitive reconstruction quality in low-bitrate speech coding.
Key facts
- LATTE appends a fixed set of learnable latent tokens to audio feature sequences.
- Only latent tokens are retained for quantization and decoding.
- The bottleneck is non-temporally aligned and aggregates global information across the full utterance.
- Token-space interventions allow swapping latent token positions between utterances.
- Swapping tokens modifies global attributes like speaker identity and background noise.
- Competitive reconstruction quality is preserved in low-bitrate speech coding settings.
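The mechanics above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the dimensions, the mean-pooling stand-in for the encoder's attention, and the `encode` helper are all hypothetical, chosen only to show the shape of the idea (append K latent tokens, keep only those K positions as a fixed-size bottleneck, then swap latent slots between utterances).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): feature dim D, K latent tokens.
D, K = 8, 4

# Learnable latent tokens (K x D); in a real model these are trained parameters.
latent_tokens = rng.normal(size=(K, D))

def encode(features: np.ndarray) -> np.ndarray:
    """Append the K latent tokens to a (T x D) feature sequence and keep ONLY
    the latent positions as the bottleneck. Mean-pooling stands in for the
    encoder layers that would let latent slots attend to the whole utterance."""
    seq = np.concatenate([features, latent_tokens], axis=0)  # (T + K) x D
    pooled = seq[: features.shape[0]].mean(axis=0, keepdims=True)  # global info
    bottleneck = seq[features.shape[0]:] + pooled  # K x D, not time-aligned
    return bottleneck

# Two utterances of different lengths yield bottlenecks of identical shape.
utt_a = rng.normal(size=(50, D))  # utterance A, 50 frames
utt_b = rng.normal(size=(30, D))  # utterance B, 30 frames
z_a, z_b = encode(utt_a), encode(utt_b)

# Token-space intervention: swap latent position 0 between utterances,
# e.g. to transfer a global attribute before decoding.
z_a_swapped = z_a.copy()
z_a_swapped[0] = z_b[0]
```

Because the bottleneck has a fixed number of positions regardless of utterance length, swapping a slot between two utterances is a well-defined edit, which is what makes the token-space interventions described above possible.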