Behavioral Geometry Predicts and Mitigates Jailbreak Susceptibility Across AI Models

ai-technology · 2026-05-27

A new framework formalizes the behavioral geometry of model populations to predict jailbreak susceptibility and transfer defenses efficiently. Applied to 79 models from 24 providers and 100 configurations of a single base model, simple methods achieve an AUPRC of 0.94 for susceptibility detection using approximately 98% fewer probes than full evaluation. Transferring optimized defenses via behavioral geometry outperforms same-provider assignment by 2% (p=0.03) at no additional probe cost. The approach leverages previously evaluated and defended models to avoid impractical per-configuration evaluation.
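To make the idea concrete, here is a minimal sketch of how susceptibility prediction and defense transfer could work in a behavioral space. It is an illustration under stated assumptions, not the framework's actual method. The summary does not say how behavioral geometry is constructed, so the fingerprint vectors (responses to a small probe set), the Euclidean distance, the nearest-neighbor rule, and all model names, scores, and defense labels below are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two behavioral fingerprints (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(target, reference, k):
    """Names of the k reference models closest to the target fingerprint."""
    ranked = sorted(
        reference,
        key=lambda name: distance(target, reference[name]["fingerprint"]),
    )
    return ranked[:k]

def predict_susceptibility(target, reference, k=2):
    """Average the known attack success rates of the k nearest models."""
    names = nearest_neighbors(target, reference, k)
    return sum(reference[n]["attack_success_rate"] for n in names) / len(names)

def transfer_defense(target, reference):
    """Reuse the defense already optimized for the single nearest model."""
    return reference[nearest_neighbors(target, reference, 1)[0]]["defense"]

# Hypothetical previously evaluated and defended models.
reference = {
    "model_a": {"fingerprint": [0.9, 0.1, 0.8], "attack_success_rate": 0.70, "defense": "defense_1"},
    "model_b": {"fingerprint": [0.8, 0.2, 0.9], "attack_success_rate": 0.60, "defense": "defense_1"},
    "model_c": {"fingerprint": [0.1, 0.9, 0.2], "attack_success_rate": 0.05, "defense": "defense_2"},
}

# A new model is probed with only the small fingerprint probe set.
new_model = [0.85, 0.15, 0.8]
score = predict_susceptibility(new_model, reference, k=2)
defense = transfer_defense(new_model, reference)
print(score, defense)
```

In this toy setup the new model sits closest to model_a and model_b, so it inherits their average susceptibility and model_a's defense. No probes are spent beyond those used to build its fingerprint.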

Key facts

Framework formalizes behavioral geometry of model populations.
Applied to 79 models from 24 providers.
Applied to 100 system configurations of a single base model.
AUPRC of 0.94 for susceptibility detection.
Uses approximately 98% fewer probes than full evaluation.
Defense transfer via behavioral geometry outperforms same-provider assignment by 2% (p=0.03).
No additional probe cost for defense transfer.
Leverages previously evaluated and defended models.
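For readers unfamiliar with the AUPRC figure above, the sketch below computes average precision, a common estimator of the area under the precision-recall curve. The summary does not state which estimator the authors used, so this choice is an assumption, and the labels and scores are made-up toy values. Tied scores are not given special handling.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision values at the rank of each positive.

    labels: 1 for susceptible, 0 for not susceptible (toy convention).
    scores: predicted susceptibility, higher means more likely susceptible.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    positives_seen = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            positives_seen += 1
            precision_sum += positives_seen / rank
    if positives_seen == 0:
        raise ValueError("average precision is undefined without positive labels")
    return precision_sum / positives_seen

# Toy example: three susceptible models among five.
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
ap = average_precision(labels, scores)
print(round(ap, 4))
```

A perfect ranking, with every susceptible model scored above every non-susceptible one, gives a value of 1.0. Unlike ROC-based metrics, a random ranking scores near the fraction of positives in the data, which makes AUPRC informative when susceptible models are rare.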

Sources

arXiv cs.AI — 2026-05-27