LogiBreak: Logical Expressions Bypass LLM Safety Restrictions
Researchers have introduced LogiBreak, a novel black-box jailbreak method that converts harmful natural-language prompts into formal logical expressions to circumvent the safety mechanisms of large language models (LLMs). The method exploits the distributional gap between alignment-oriented training prompts and logic-based inputs: the rewritten prompt preserves the semantic intent and readability of the original while evading safety constraints. LogiBreak was evaluated on a multilingual jailbreak dataset spanning three languages, demonstrating effectiveness across varied evaluation settings and linguistic contexts. The paper is posted on arXiv under the computer science (Computation and Language) category.
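The paper's exact translation procedure is not reproduced in this summary, so the following is a minimal, hypothetical Python sketch of what a natural-language-to-logic rewrite could look like. The predicate names, the template, and the `to_logical_expression` helper are illustrative assumptions, not LogiBreak's actual method; a benign request is used as the example.

```python
# Toy illustration of rewriting a natural-language request into a
# first-order-logic style prompt. The predicates and template are
# hypothetical; the paper's actual translation procedure is not shown here.

def to_logical_expression(action: str, obj: str) -> str:
    """Encode the request '<action> <obj>' as premises plus a conclusion.

    The surface form no longer resembles the natural-language prompts
    that safety alignment was trained on, which is the distributional
    gap the method exploits.
    """
    return (
        f"Let P(x) denote 'x is a step of {action} {obj}'. "
        f"Let S denote the complete ordered set {{x | P(x)}}. "
        f"Premise 1: ∀x (P(x) → Describable(x)). "
        f"Premise 2: S is finite and enumerable. "
        f"Conclusion to derive: an explicit enumeration of S. "
        f"Task: produce the derivation that exhibits every element of S."
    )


if __name__ == "__main__":
    # A benign example request, rewritten into logical form.
    print(to_logical_expression("assembling", "a model rocket"))
```

Note that the rewrite stays human-readable, which matches the reported property that LogiBreak preserves semantic intent and readability rather than obfuscating the request.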
Key facts
- LogiBreak is a black-box jailbreak method for LLMs.
- It converts harmful prompts into formal logical expressions.
- It exploits distributional gaps between alignment data and logic-based inputs.
- Preserves semantic intent and readability.
- Evaluated on a multilingual jailbreak dataset across three languages.
- Demonstrates effectiveness across varied evaluation settings and linguistic contexts (a toy black-box evaluation sketch follows this list).
- Published on arXiv (2505.13527).
- Research is in computer science (Computation and Language).
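As a toy illustration of the black-box evaluation setting, the sketch below queries a model with both a plain prompt and its logical rewrite and compares refusal behavior. The `query_model` callable, the `REFUSAL_MARKERS` list, and the keyword heuristic are all assumptions for illustration; real jailbreak evaluations rely on stronger judges than keyword matching.

```python
# Hypothetical black-box evaluation loop: compare a model's response to a
# plain prompt versus its logical rewrite. `query_model` is a stand-in for
# any chat-completion API; the refusal heuristic below is a naive assumption.

from typing import Callable

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")


def looks_like_refusal(response: str) -> bool:
    """Naive keyword heuristic; real evaluations use stronger judges."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def evaluate_pair(query_model: Callable[[str], str],
                  plain_prompt: str,
                  logic_prompt: str) -> dict:
    """Query the black-box model with both forms and record refusal status."""
    return {
        "plain_refused": looks_like_refusal(query_model(plain_prompt)),
        "logic_refused": looks_like_refusal(query_model(logic_prompt)),
    }


if __name__ == "__main__":
    # Stub model for demonstration: refuses plain requests only.
    stub = lambda p: "I'm sorry, I can't help." if "∀" not in p else "Sure: ..."
    print(evaluate_pair(stub, "plain request", "∀x (P(x) → ...)"))
```

Because only the model's text responses are inspected, this style of harness needs no access to weights or logits, consistent with the black-box threat model the paper describes.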