Mathematical Encoding Bypasses LLM Safety Filters with Up to 56% Success
A new study reveals that encoding harmful prompts as mathematical problems—using set theory, formal logic, and quantum mechanics—bypasses LLM safety filters with 46%–56% average attack success across eight models. The key factor is deep reformulation into genuine mathematical problems, not mere formatting. The research introduces a Formal Logic encoding achieving comparable success to Set Theory, showing the vulnerability generalizes across formalisms.
Key facts
- Harmful prompts encoded as mathematical problems bypass LLM safety filters at 46%–56% average attack success.
- Eight target models and two benchmarks were tested.
- Effectiveness depends on deep reformulation into genuine mathematical problems, not just mathematical notation.
- Rule-based encodings without reformulation perform no better than unencoded baselines.
- A novel Formal Logic encoding achieves attack success comparable to Set Theory.
- The vulnerability generalizes across mathematical formalisms including set theory, formal logic, and quantum mechanics.
- The study is available on arXiv under ID 2605.03441.
- LLM safety mechanisms primarily rely on semantic pattern matching, which this attack exploits.