LATTE: A Latent Audio Tokenizer for Token-Space Editing
Researchers propose LATTE (Latent Audio Tokenizer for Token-space Editing), a neural audio codec that appends a fixed set of learnable latent tokens to audio feature sequences, creating a compact bottleneck that is not temporally aligned with the input. This design enables token-space interventions, such as swapping latent tokens between utterances to modify global attributes like speaker identity and background noise, while maintaining competitive reconstruction quality in low-bitrate speech coding.
Key facts
- LATTE appends a fixed set of learnable latent tokens to audio feature sequences.
- Only latent tokens are retained for quantization and decoding.
- The bottleneck is non-temporally aligned and aggregates global information across the full utterance.
- Token-space interventions allow swapping latent token positions between utterances.
- Swapping tokens modifies global attributes like speaker identity and background noise.
- Competitive reconstruction quality is preserved in low-bitrate speech coding settings.
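The mechanics above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the dimensions, the mean-pooling stand-in for the encoder's attention, and the `encode` helper are all hypothetical, chosen only to show the shape of the idea (append K latent tokens, keep only those K positions as a fixed-size bottleneck, then swap latent slots between utterances).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): feature dim D, K latent tokens.
D, K = 8, 4

# Learnable latent tokens (K x D); in a real model these are trained parameters.
latent_tokens = rng.normal(size=(K, D))

def encode(features: np.ndarray) -> np.ndarray:
    """Append the K latent tokens to a (T x D) feature sequence and keep ONLY
    the latent positions as the bottleneck. Mean-pooling stands in for the
    encoder layers that would let latent slots attend to the whole utterance."""
    seq = np.concatenate([features, latent_tokens], axis=0)  # (T + K) x D
    pooled = seq[: features.shape[0]].mean(axis=0, keepdims=True)  # global info
    bottleneck = seq[features.shape[0]:] + pooled  # K x D, not time-aligned
    return bottleneck

# Two utterances of different lengths yield bottlenecks of identical shape.
utt_a = rng.normal(size=(50, D))  # utterance A, 50 frames
utt_b = rng.normal(size=(30, D))  # utterance B, 30 frames
z_a, z_b = encode(utt_a), encode(utt_b)

# Token-space intervention: swap latent position 0 between utterances,
# e.g. to transfer a global attribute before decoding.
z_a_swapped = z_a.copy()
z_a_swapped[0] = z_b[0]
```

Because the bottleneck has a fixed number of positions regardless of utterance length, swapping a slot between two utterances is a well-defined edit, which is what makes the token-space interventions described above possible.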