Mapping Failure Manifolds in Large Language Models
A new framework systematically maps the 'Manifold of Failure' in LLMs by treating vulnerability search as a quality-diversity problem. Using MAP-Elites, the researchers identify behavioral attraction basins and measure Alignment Deviation, the quality metric the framework introduces. Tested on Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini, the method achieves up to 63% behavioral coverage and discovers up to 370 distinct vulnerability niches, revealing model-specific topological signatures.
Key facts
- Framework maps the Manifold of Failure in LLMs
- Reframes vulnerability search as a quality-diversity problem using MAP-Elites
- Introduces Alignment Deviation as the quality metric
- Tested on Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini
- Achieves up to 63% behavioral coverage
- Discovers up to 370 distinct vulnerability niches
- Reveals model-specific topological signatures
- Published on arXiv (2602.22291)
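To make the quality-diversity framing concrete, here is a minimal, generic MAP-Elites sketch in Python. It is not the paper's implementation: the 2D behavior descriptor, the `quality` function (a toy stand-in for a score such as Alignment Deviation), the grid resolution `GRID`, and every function name are hypothetical choices made only for illustration. It shows the core loop (keep the best candidate found in each behavioral cell) and how a coverage figure can be computed as the fraction of cells that hold an elite.

```python
import random

GRID = 10  # cells per behavioral axis (hypothetical resolution)


def descriptor(x):
    """Toy behavior descriptor: the first two genome coordinates, each in [0, 1]."""
    return x[0], x[1]


def quality(x):
    """Toy stand-in for a quality score (NOT the paper's Alignment Deviation)."""
    return 1.0 - sum((v - 0.5) ** 2 for v in x) / len(x)


def cell_of(desc):
    """Map a descriptor in [0, 1]^2 to a discrete grid cell."""
    return tuple(min(int(d * GRID), GRID - 1) for d in desc)


def mutate(x, rng, sigma=0.1):
    """Gaussian perturbation of each coordinate, clipped to [0, 1]."""
    return [min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in x]


def map_elites(iterations=5000, n_init=50, dim=4, seed=0):
    """Run a basic MAP-Elites loop and return the archive: cell -> (quality, genome)."""
    rng = random.Random(seed)
    archive = {}
    for i in range(iterations):
        if i < n_init or not archive:
            # Bootstrap the archive with random candidates.
            candidate = [rng.random() for _ in range(dim)]
        else:
            # Pick a random elite and perturb it.
            _, parent = rng.choice(list(archive.values()))
            candidate = mutate(parent, rng)
        q = quality(candidate)
        cell = cell_of(descriptor(candidate))
        # Keep the candidate only if its cell is empty or it beats the current elite.
        if cell not in archive or q > archive[cell][0]:
            archive[cell] = (q, candidate)
    return archive


archive = map_elites()
coverage = len(archive) / GRID**2  # fraction of behavioral cells holding an elite
print(f"filled niches: {len(archive)}, coverage: {coverage:.0%}")
```

In a setting like the one summarized here, the genome would presumably be a prompt, the descriptor would characterize the model's behavior, and the quality function would score the model's response; those pieces are assumptions and are deliberately left abstract in this sketch.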
Entities
Institutions
- arXiv