ARTFEED — Contemporary Art Intelligence

LATTE: A Latent Audio Tokenizer for Token-Space Editing

other · 2026-05-13

Researchers propose the Latent Audio Tokenizer for Token-space Editing (LATTE), a neural audio codec that appends learnable latent tokens to audio feature sequences and keeps only those tokens, creating a compact bottleneck that is not temporally aligned with the input. This design enables token-space interventions, such as swapping the latent tokens at chosen positions between two utterances to modify global attributes like speaker identity and background noise, while maintaining competitive reconstruction quality in low-bitrate speech coding.
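
To make the bottleneck idea concrete, here is a minimal Python sketch, not the authors' implementation. The names `toy_encoder` and `latent_bottleneck` are hypothetical, and the mean-mixing encoder is only a stand-in for whatever network LATTE actually uses. The point it illustrates is that the output length depends on the number of latent tokens, not on the utterance length.

```python
from typing import List

Vector = List[float]


def toy_encoder(sequence: List[Vector]) -> List[Vector]:
    """Hypothetical stand-in for the real encoder.

    Adds the sequence-wide mean to every position, so each output
    position sees information from the whole input.
    """
    dim = len(sequence[0])
    mean = [sum(v[d] for v in sequence) / len(sequence) for d in range(dim)]
    return [[x + m for x, m in zip(v, mean)] for v in sequence]


def latent_bottleneck(frames: List[Vector], latent_tokens: List[Vector]) -> List[Vector]:
    """Append latent tokens to the frame sequence, encode, keep only the latent slots."""
    sequence = frames + latent_tokens      # latent tokens go after the audio frames
    encoded = toy_encoder(sequence)
    return encoded[len(frames):]           # the frame positions are discarded


# Assumed toy data: 2-dimensional features, two latent tokens.
latents = [[0.0, 0.0], [0.0, 1.0]]
out_short = latent_bottleneck([[1.0, 0.0]] * 3, latents)
out_long = latent_bottleneck([[1.0, 0.0]] * 50, latents)
```

Both `out_short` and `out_long` contain exactly two vectors, even though the inputs have 3 and 50 frames. In the described codec these retained vectors would then be quantized and passed to the decoder.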

Key facts

  • LATTE appends a fixed set of learnable latent tokens to audio feature sequences.
  • Only latent tokens are retained for quantization and decoding.
  • The bottleneck is non-temporally aligned and aggregates global information across the full utterance.
  • Token-space interventions include swapping the latent tokens at given positions between utterances.
  • Swapping tokens modifies global attributes like speaker identity and background noise.
  • Competitive reconstruction quality is preserved in low-bitrate speech coding settings.
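
The token swap described in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions: `swap_latent_tokens` is a hypothetical helper, the token ids are made up, and the source does not say which positions carry which attribute, so the positions chosen here are arbitrary.

```python
from typing import List, Sequence, Tuple


def swap_latent_tokens(
    tokens_a: Sequence[int],
    tokens_b: Sequence[int],
    positions: Sequence[int],
) -> Tuple[List[int], List[int]]:
    """Exchange the tokens at the given positions between two utterances.

    Both utterances must use the same fixed number of latent tokens.
    The inputs are left unmodified; edited copies are returned.
    """
    if len(tokens_a) != len(tokens_b):
        raise ValueError("both utterances need the same number of latent tokens")
    edited_a, edited_b = list(tokens_a), list(tokens_b)
    for p in positions:
        edited_a[p], edited_b[p] = tokens_b[p], tokens_a[p]
    return edited_a, edited_b


# Made-up quantized latent token ids for two utterances.
utt_a = [11, 42, 7, 93]
utt_b = [58, 3, 64, 20]

# Hypothetical choice: swap positions 0 and 2.
new_a, new_b = swap_latent_tokens(utt_a, utt_b, positions=[0, 2])
print(new_a)  # [58, 42, 64, 93]
print(new_b)  # [11, 3, 7, 20]
```

Decoding `new_a` instead of `utt_a` would then, per the summary, change whichever global attribute the swapped positions encode. Because the latent sequence has a fixed length, the swap is well defined even when the two utterances differ in duration.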
