Cross-Lingual Jailbreak Detection via Semantic Codebooks
A recent study published on arXiv (2604.25716) introduces a method for detecting cross-lingual jailbreak attacks on large language models (LLMs) that does not require training. This technique leverages language-agnostic semantic similarity by matching multilingual query embeddings with a static English codebook of jailbreak prompts, serving as an external safeguard for black-box LLMs. The researchers tested their approach across four languages, two translation pipelines, four safety benchmarks, and three embedding models, focusing on three target LLMs: Qwen, Llama, and GPT-3.5. Findings indicate two distinct cross-lingual transfer regimes, with curated benchmarks revealing typical jailbreak patterns. This research highlights a critical security vulnerability where English-focused safety measures fall short in multilingual contexts, as previous studies indicated that translating harmful prompts into other languages enhances jailbreak success rates.
Key facts
- arXiv paper 2604.25716 proposes a cross-lingual jailbreak detection method
- Method uses language-agnostic semantic similarity with a fixed English codebook
- Approach is training-free and operates as an external guardrail for black-box LLMs
- Evaluation covers four languages, two translation pipelines, four safety benchmarks, three embedding models
- Target LLMs include Qwen, Llama, and GPT-3.5
- Results show two distinct regimes of cross-lingual transfer
- Prior work shows translating malicious prompts increases jailbreak success rates
- Addresses English-centric safety mechanism vulnerabilities in multilingual deployment
Entities
Institutions
- arXiv