ARTFEED — Contemporary Art Intelligence

Cross-Lingual Jailbreak Detection via Semantic Codebooks

ai-technology · 2026-04-30

A recent study published on arXiv (2604.25716) introduces a method for detecting cross-lingual jailbreak attacks on large language models (LLMs) that does not require training. This technique leverages language-agnostic semantic similarity by matching multilingual query embeddings with a static English codebook of jailbreak prompts, serving as an external safeguard for black-box LLMs. The researchers tested their approach across four languages, two translation pipelines, four safety benchmarks, and three embedding models, focusing on three target LLMs: Qwen, Llama, and GPT-3.5. Findings indicate two distinct cross-lingual transfer regimes, with curated benchmarks revealing typical jailbreak patterns. This research highlights a critical security vulnerability where English-focused safety measures fall short in multilingual contexts, as previous studies indicated that translating harmful prompts into other languages enhances jailbreak success rates.

Key facts

  • arXiv paper 2604.25716 proposes a cross-lingual jailbreak detection method
  • Method uses language-agnostic semantic similarity with a fixed English codebook
  • Approach is training-free and operates as an external guardrail for black-box LLMs
  • Evaluation covers four languages, two translation pipelines, four safety benchmarks, three embedding models
  • Target LLMs include Qwen, Llama, and GPT-3.5
  • Results show two distinct regimes of cross-lingual transfer
  • Prior work shows translating malicious prompts increases jailbreak success rates
  • Addresses English-centric safety mechanism vulnerabilities in multilingual deployment

Entities

Institutions

  • arXiv

Sources