ARTFEED — Contemporary Art Intelligence

Behavioral Geometry Predicts and Mitigates Jailbreak Susceptibility Across AI Models

ai-technology · 2026-05-27

A new framework formalizes the behavioral geometry of model populations and uses it to predict jailbreak susceptibility and to transfer defenses efficiently. Applied to 79 models from 24 providers and to 100 system configurations of a single base model, simple methods reach an area under the precision-recall curve (AUPRC) of 0.94 for susceptibility detection while using approximately 98% fewer probes than full evaluation. Transferring optimized defenses via behavioral geometry outperforms same-provider assignment by 2% (p=0.03) at no additional probe cost. By reusing models that have already been evaluated and defended, the approach avoids per-configuration evaluation, which would otherwise be impractical.
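To make the idea concrete, here is a minimal sketch of how a neighbor-based prediction and defense transfer could work. It is an illustration under stated assumptions, not the framework's actual method: it assumes each model is summarized by a short "fingerprint" of responses to a few probes, that closeness is measured by Euclidean distance, and that a new model inherits the defense of its nearest evaluated neighbor. All names, values, and defense labels are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two behavioral fingerprints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(fingerprint, known, k=1):
    """Return the k previously evaluated models closest to the fingerprint."""
    ranked = sorted(known, key=lambda m: distance(fingerprint, m["fingerprint"]))
    return ranked[:k]

def predict_susceptibility(fingerprint, known, k=3):
    """Score a new model as the mean susceptibility label of its k nearest neighbors."""
    neighbors = nearest(fingerprint, known, k)
    return sum(m["susceptible"] for m in neighbors) / len(neighbors)

def transfer_defense(fingerprint, known):
    """Reuse the defense already optimized for the single nearest neighbor."""
    return nearest(fingerprint, known, k=1)[0]["defense"]

# Made-up population of already evaluated and defended models (assumption).
known_models = [
    {"name": "model-a", "fingerprint": [0.9, 0.8, 0.1], "susceptible": 1, "defense": "defense-1"},
    {"name": "model-b", "fingerprint": [0.8, 0.9, 0.2], "susceptible": 1, "defense": "defense-1"},
    {"name": "model-c", "fingerprint": [0.1, 0.2, 0.9], "susceptible": 0, "defense": "defense-2"},
    {"name": "model-d", "fingerprint": [0.2, 0.1, 0.8], "susceptible": 0, "defense": "defense-2"},
]

# A new model probed with only three prompts (assumption).
new_fingerprint = [0.85, 0.8, 0.15]
score = predict_susceptibility(new_fingerprint, known_models, k=3)
defense = transfer_defense(new_fingerprint, known_models)
print(f"susceptibility score: {score:.2f}, transferred defense: {defense}")
```

In this toy setup the new model needs only a handful of probes to be placed among its neighbors, and the defense lookup costs no further probes, which mirrors the cost argument in the summary.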

Key facts

  • Framework formalizes behavioral geometry of model populations.
  • Applied to 79 models from 24 providers.
  • Applied to 100 system configurations of a single base model.
  • AUPRC of 0.94 for susceptibility detection.
  • Uses approximately 98% fewer probes than full evaluation.
  • Transfer via behavioral geometry outperforms same-provider assignment by 2% (p=0.03).
  • No additional probe cost for defense transfer.
  • Leverages previously evaluated and defended models.
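
For readers unfamiliar with the headline metric, the sketch below computes average precision, one common estimator of AUPRC, and works out what a 98% probe reduction means for a given budget. The source does not say which estimator was used, and the labels, scores, and the 1,000-probe budget here are invented for illustration only.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision values at each true positive,
    with models ranked from highest to lowest predicted score."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_positives

def reduced_probe_count(full_probes, percent_fewer=98):
    """Probes left after cutting the full budget by the given percentage."""
    return full_probes * (100 - percent_fewer) // 100

# Invented example: 1 = susceptible to jailbreaks, 0 = not susceptible.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.1]
ap = average_precision(labels, scores)
probes = reduced_probe_count(1000)
print(f"average precision: {ap:.3f}, probes needed: {probes}")
```

A score of 1.0 would mean every susceptible model is ranked above every non-susceptible one; this simple version does not treat tied scores specially.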
