Sparse Token Attack Bypasses Audio Language Model Safety
Researchers have developed a sparse jailbreak method for audio language models (ALMs) that achieves high attack success rates while perturbing only a fraction of the audio waveform. The work, titled 'Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization', was posted on arXiv (ID: 2605.04700). The team found that gradient energy in ALMs is concentrated on a small subset of token-aligned audio regions. Building on this observation, they proposed Token-Aware Gradient Optimization (TAGO), which masks low-energy gradients at each iteration and retains only those aligned with high-energy tokens. In tests on three ALMs, TAGO outperformed baseline attacks; notably, on Qwen3-Omni, the attack success rate (ASR_l) remained at 86% even under substantial sparsification. The result challenges the assumption that dense waveform perturbation is necessary for jailbreaking and exposes a vulnerability in current audio safety mechanisms.
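The core mechanism lends itself to a compact illustration. Below is a minimal PyTorch sketch of token-aware gradient masking under stated assumptions: the toy scalar loss, the 320-sample token length, the keep ratio, the learning rate, and the function name tago_step are all hypothetical stand-ins, since the paper's actual loss, model, and hyperparameters are not given here. It is a sketch of the idea, not the authors' implementation.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for an ALM jailbreak objective: push a fixed linear model's
# scalar output toward a target. In the real attack the loss would come from
# the ALM itself; everything below is illustrative.
SAMPLE_RATE = 16_000
TOKEN_LEN = 320                      # assumed samples per audio token
NUM_TOKENS = SAMPLE_RATE // TOKEN_LEN
waveform = torch.randn(SAMPLE_RATE)  # 1 s of audio
w = torch.randn(SAMPLE_RATE)         # toy "model" weights
target = torch.tensor(1.0)

def loss_fn(x: torch.Tensor) -> torch.Tensor:
    return (x @ w - target) ** 2

delta = torch.zeros_like(waveform, requires_grad=True)

def tago_step(delta: torch.Tensor, keep_ratio: float = 0.1,
              lr: float = 1e-4) -> tuple[float, int]:
    """One token-aware sparse update: compute the full gradient, keep only
    the highest-energy token-aligned blocks, zero out the rest."""
    loss = loss_fn(waveform + delta)
    (grad,) = torch.autograd.grad(loss, delta)

    # Per-token gradient energy: squared L2 norm over each token's samples.
    blocks = grad.view(-1, TOKEN_LEN)        # (NUM_TOKENS, TOKEN_LEN)
    energy = blocks.pow(2).sum(dim=1)        # (NUM_TOKENS,)

    # Mask: retain only the top-k highest-energy tokens.
    k = max(1, int(keep_ratio * energy.numel()))
    mask = torch.zeros_like(energy)
    mask[torch.topk(energy, k).indices] = 1.0
    sparse_grad = (blocks * mask.unsqueeze(1)).view_as(grad)

    with torch.no_grad():
        delta -= lr * sparse_grad            # update only masked regions
    return loss.item(), k

for step in range(5):
    loss, k = tago_step(delta)
    print(f"step {step}: loss={loss:.4f}, active tokens={k}/{NUM_TOKENS}")
```

The point the sketch captures is that sparsity is imposed on the gradient, per token-aligned block, rather than on the perturbation directly, so each step touches only the few regions where gradient energy concentrates.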
Key facts
- arXiv paper ID: 2605.04700
- Jailbreak attacks on ALMs typically update the entire waveform densely
- Gradient energy is highly non-uniform across audio tokens (see the diagnostic sketch after this list)
- TAGO enables sparse optimization by masking low-energy gradients
- Evaluated on three ALMs, TAGO outperforms baseline attacks
- On Qwen3-Omni, ASR_l remains at 86% even under substantial sparsification
- Method reveals vulnerability in audio safety mechanisms
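The non-uniformity claim above can be probed with a simple diagnostic: rank token-aligned gradient blocks by energy and check what share of the total the top few carry. This sketch simulates a heavy-tailed gradient rather than backpropagating through a real ALM, so the shapes, the per-token scaling trick, and the fractions tested are all assumptions for illustration.

```python
import torch

torch.manual_seed(0)

# Diagnostic for gradient-energy concentration across token-aligned regions.
# The "gradient" is simulated with heavy-tailed per-token scales; in practice
# it would come from backpropagating the attack loss through an ALM.
NUM_TOKENS, TOKEN_LEN = 50, 320
scales = torch.rand(NUM_TOKENS, 1) ** 4          # a few tokens dominate
grad = torch.randn(NUM_TOKENS, TOKEN_LEN) * scales

energy = grad.pow(2).sum(dim=1)                  # per-token gradient energy
share = energy / energy.sum()
sorted_share, _ = share.sort(descending=True)

for frac in (0.05, 0.10, 0.25):
    k = max(1, int(frac * NUM_TOKENS))
    covered = sorted_share[:k].sum().item()
    print(f"top {frac:.0%} of tokens carry {covered:.1%} of gradient energy")
```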