Behavioral Geometry Predicts and Mitigates Jailbreak Susceptibility Across AI Models
A new framework formalizes the behavioral geometry of model populations to predict jailbreak susceptibility and to transfer defenses efficiently. Evaluated on 79 models from 24 providers and on 100 system configurations of a single base model, simple methods reach an AUPRC of 0.94 for susceptibility detection while using approximately 98% fewer probes than full evaluation. Transferring optimized defenses via behavioral geometry outperforms same-provider assignment by 2% (p=0.03) at no additional probe cost. By reusing models that have already been evaluated and defended, the approach avoids evaluating every configuration separately, which is impractical at this scale.
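To make the susceptibility-detection idea concrete, here is a minimal sketch, not the framework's actual method. It assumes each model is summarized by a short vector of responses to a small probe subset, scores a new model by the labels of its nearest previously evaluated neighbors, and estimates AUPRC with average precision. The cosine distance, the k-nearest-neighbor rule, and all function names and data are illustrative assumptions; the summary does not say which "simple methods" were used.

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two behavior vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def knn_susceptibility(target, reference, k=2):
    """Score a new model by the mean label of its k nearest evaluated models.

    reference: list of (behavior_vector, label), label 1 = susceptible.
    """
    ranked = sorted(reference, key=lambda item: cosine_distance(target, item[0]))
    nearest = ranked[:k]
    return sum(label for _, label in nearest) / len(nearest)

def average_precision(labels, scores):
    """Average precision, a common estimator of AUPRC."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at each positive
    return total / hits if hits else 0.0

# Hypothetical toy data: behavior vectors over three probes.
reference_models = [
    ([1.0, 0.0, 0.0], 1),
    ([0.9, 0.1, 0.0], 1),
    ([0.0, 1.0, 0.0], 0),
    ([0.0, 0.9, 0.1], 0),
]
score_new = knn_susceptibility([1.0, 0.05, 0.0], reference_models, k=2)
```

The probe savings in a setup like this would come from evaluating a new model only on the small probe subset needed to place it in the space, rather than on the full probe set.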
Key facts
- Framework formalizes behavioral geometry of model populations.
- Applied to 79 models from 24 providers.
- Applied to 100 system configurations of a single base model.
- AUPRC of 0.94 for susceptibility detection.
- Uses approximately 98% fewer probes than full evaluation.
- Transfer via behavioral geometry outperforms same-provider assignment by 2% (p=0.03).
- No additional probe cost for defense transfer.
- Leverages previously evaluated and defended models.
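The comparison between geometry-based transfer and same-provider assignment can be sketched as follows. This is an illustration under stated assumptions, not the published procedure: it assumes a new model reuses the defense of its nearest already-defended model in behavior space (Euclidean distance), and that the baseline reuses a defense from any model by the same provider. The model names, providers, vectors, and defense labels are all hypothetical.

```python
import math

# Hypothetical pool of previously defended models.
defended_models = {
    "model_a": {"provider": "p1", "behavior": [0.9, 0.1, 0.8], "defense": "defense_x"},
    "model_b": {"provider": "p2", "behavior": [0.1, 0.9, 0.2], "defense": "defense_y"},
}

def transfer_by_geometry(target_behavior, defended):
    """Reuse the defense of the behaviorally nearest defended model."""
    nearest = min(
        defended,
        key=lambda name: math.dist(target_behavior, defended[name]["behavior"]),
    )
    return defended[nearest]["defense"]

def transfer_by_provider(target_provider, defended, default=None):
    """Baseline: reuse the defense of the first model from the same provider."""
    for record in defended.values():
        if record["provider"] == target_provider:
            return record["defense"]
    return default

# A new model from provider p1 that behaves more like model_b than model_a.
new_behavior = [0.15, 0.85, 0.2]
geometry_choice = transfer_by_geometry(new_behavior, defended_models)
provider_choice = transfer_by_provider("p1", defended_models)
```

In this toy case the two rules disagree: the provider baseline picks `defense_x`, while the geometry rule picks `defense_y` because the new model sits closer to `model_b`. Neither rule needs extra probes beyond those already used to place the model, which is consistent with the "no additional probe cost" claim.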