Constrained Decoding Attack: New Jailbreak Targets LLM Structured Output APIs
Researchers have identified a new class of jailbreak attack against Large Language Models (LLMs) served through structured output APIs. The Constrained Decoding Attack (CDA) exploits grammar-guided decoding, the control-plane feature that enforces output schemas. Traditional data-plane attacks bypass alignment by manipulating the input prompt. CDA instead uses schema-enforced logit masking to inject a malicious prefix during decoding, which the model then continues with harmful content. Because the attack acts on the decoding process itself, internal safety alignment alone cannot stop it. The paper introduces EnumAttack, an instantiation of CDA that hides malicious content in enumeration fields. The findings, published on arXiv (2503.24191), highlight a critical vulnerability in LLM tooling platforms.
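To make the mechanism concrete, here is a minimal sketch of grammar-guided decoding with logit masking, using harmless placeholder tokens. It is a toy illustration, not the paper's implementation: the vocabulary, the logit values, and the helper names `mask_logits` and `constrained_greedy` are all assumptions made for this example, and a real decoder works over a full tokenizer vocabulary and a compiled grammar.

```python
import math

# Toy vocabulary (assumption). A real model has tens of thousands of tokens.
VOCAB = ["yes", "no", "maybe", "<eos>"]


def mask_logits(logits, allowed):
    """Set every token outside `allowed` to -inf so it can never be selected."""
    return [x if tok in allowed else -math.inf for tok, x in zip(VOCAB, logits)]


def constrained_greedy(step_logits, allowed_sequences):
    """Greedy decoding where each step may only extend an allowed sequence.

    `allowed_sequences` stands in for the schema: for an enum field, it is
    simply the list of permitted values, each given as a list of tokens.
    """
    out = []
    for logits in step_logits:
        # Tokens that keep `out` a prefix of at least one allowed sequence.
        allowed = {
            seq[len(out)]
            for seq in allowed_sequences
            if len(seq) > len(out) and seq[: len(out)] == out
        }
        if not allowed:
            break  # the constraint is fully satisfied
        masked = mask_logits(logits, allowed)
        out.append(VOCAB[masked.index(max(masked))])
    return out


# Made-up logits: without a constraint the model would pick "no" first.
step_logits = [
    [0.1, 5.0, 0.2, 0.0],
    [0.0, 0.0, 0.0, 3.0],
]
unconstrained_first = VOCAB[step_logits[0].index(max(step_logits[0]))]
forced = constrained_greedy(step_logits, [["yes", "<eos>"]])
print(unconstrained_first, forced)
```

In this toy run the model's own preference ("no") never gets a chance: once the schema permits a single value, masking decides the output and the model's logits only matter among the tokens the grammar leaves open. That is the property the summary describes as acting on the decoding process rather than on the input.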
Key facts
- CDA is a new jailbreak class targeting the LLM control plane.
- Attack exploits grammar-guided decoding in structured output APIs.
- CDA uses schema-enforced logit masking to inject malicious prefixes.
- Unlike data-plane jailbreaks, CDA acts on the decoding process itself.
- Internal safety alignment alone cannot stop CDA.
- EnumAttack is an instantiation of CDA.
- Paper published on arXiv with ID 2503.24191.
- Attack surface is orthogonal to traditional data-plane vulnerabilities.
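The enumeration fields that EnumAttack relies on live in the output schema rather than in the prompt, which is why a prompt-only view misses them. As a rough illustration of what that control-plane content looks like to tooling, the sketch below walks a JSON-Schema-style dictionary and lists every enum string with its location. The function name `enum_values` and the example schema are hypothetical and are not taken from the paper.

```python
def enum_values(schema, path="$"):
    """Yield (path, value) for every enum entry in a JSON-Schema-like structure."""
    if isinstance(schema, dict):
        for key, child in schema.items():
            if key == "enum" and isinstance(child, list):
                # Enum entries are literal strings the decoder can be forced to emit.
                for value in child:
                    yield f"{path}.enum", value
            else:
                yield from enum_values(child, f"{path}.{key}")
    elif isinstance(schema, list):
        for i, child in enumerate(schema):
            yield from enum_values(child, f"{path}[{i}]")


# Benign example schema (assumption): one string field restricted to two values.
schema = {
    "type": "object",
    "properties": {
        "color": {"type": "string", "enum": ["red", "green"]},
    },
}

found = list(enum_values(schema))
for location, value in found:
    print(location, repr(value))
```

Listing these strings shows only where schema-supplied text enters a request. It is not presented here as a defense against CDA.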
Entities
Institutions
- arXiv