Latent Personality Alignment Boosts LLM Safety Without Harmful Examples
A recent study posted to arXiv (2605.08496) presents Latent Personality Alignment (LPA), a defense mechanism for large language models that builds robustness around abstract personality traits rather than specific harmful behaviors. LPA trains on fewer than 100 trait statements using latent adversarial training, yet achieves attack success rates comparable to methods trained on more than 150,000 examples while preserving better overall utility. Notably, LPA generalizes better to previously unseen attack distributions, cutting misclassification rates by a factor of 2.6 across six harm benchmarks, despite never seeing a harmful example during training. The authors argue that aligning models with personality traits offers a principled route to robust, data-efficient defenses.
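The summary does not spell out the training procedure, but latent adversarial training is a known recipe: optimize a perturbation in a hidden-activation space that degrades the desired behavior, then update the model to hold that behavior under the perturbation. The sketch below illustrates that general loop applied to trait statements; the toy model, hook point, trait text, perturbation budget, and hyperparameters are all illustrative placeholders, not the authors' actual setup.

```python
# Minimal sketch of latent adversarial training against trait statements.
# Assumption (not from the paper): LPA resembles standard LAT -- an inner
# loop finds a latent perturbation that breaks trait alignment, an outer
# step updates the model to stay aligned under it. Everything below is a
# placeholder stand-in, not the authors' architecture or hyperparameters.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy stand-in; a real run would hook a hidden layer of an LLM."""
    def __init__(self, vocab=1000, d=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.body = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.head = nn.Linear(d, vocab)

    def forward(self, ids, latent_delta=None):
        h = self.embed(ids)
        if latent_delta is not None:   # inject adversarial perturbation
            h = h + latent_delta
        return self.head(self.body(h))

def alignment_loss(logits, target_ids):
    # Next-token loss on a trait statement, e.g. "I am honest and refuse
    # to help cause harm." (hypothetical trait text, tokenized upstream)
    return nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        target_ids[:, 1:].reshape(-1))

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
trait_ids = torch.randint(0, 1000, (8, 16))  # placeholder tokenized traits

for step in range(100):
    # Inner loop: find a latent perturbation that *breaks* trait alignment.
    delta = torch.zeros(8, 16, 64, requires_grad=True)
    for _ in range(5):
        loss = alignment_loss(model(trait_ids, delta), trait_ids)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += 0.1 * grad.sign()   # ascend on the alignment loss
            delta.clamp_(-0.5, 0.5)      # epsilon-ball budget
    # Outer step: update the model to stay aligned under the perturbation.
    opt.zero_grad()
    alignment_loss(model(trait_ids, delta.detach()), trait_ids).backward()
    opt.step()
```

Note that no harmful prompt appears anywhere in this loop: the only training data is the small set of trait statements, which is consistent with the paper's claim of robustness without exposure to harmful examples.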
Key facts
- arXiv paper 2605.08496 proposes Latent Personality Alignment (LPA).
- LPA uses fewer than 100 trait statements for training.
- LPA achieves attack success rates comparable to methods trained on 150k+ examples.
- LPA reduces misclassification rates by 2.6x across six harm benchmarks (see the evaluation sketch after this list).
- LPA never sees harmful examples during training.
- LPA generalizes better to unseen attack distributions.
- Current adversarial robustness methods require thousands to hundreds of thousands of harmful prompts.
- LPA maintains superior utility compared to baseline methods.
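For concreteness, here is the shape of the evaluation the key facts describe: attack success rate measured over several harm benchmarks. The benchmark names, the string-matching refusal judge, and the generate() callable are hypothetical stand-ins; the paper's actual benchmarks and judging method are not specified here.

```python
# Hypothetical evaluation sketch: attack success rate (ASR) per benchmark.
from typing import Callable

REFUSAL_MARKERS = ("I can't", "I cannot", "I won't")  # crude stand-in judge

def attack_success_rate(prompts: list[str],
                        generate: Callable[[str], str]) -> float:
    """Fraction of adversarial prompts that elicit a non-refusal."""
    hits = 0
    for p in prompts:
        out = generate(p)
        if not any(m in out for m in REFUSAL_MARKERS):
            hits += 1
    return hits / max(len(prompts), 1)

# Usage: compare a defended model against a baseline on each benchmark.
# benchmarks = {"benchmark_a": [...], ...}  # six harm-benchmark prompt sets
# for name, prompts in benchmarks.items():
#     print(name, attack_success_rate(prompts, defended_model_generate))
```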