Anchor Invariance Regularization for LLM Safety Alignment
A new arXiv paper proposes Anchor Invariance Regularization (AIR), a method for improving safety alignment in large language models by enforcing context-invariant behavior. Current preference-based post-training often fails when a harmful request is adversarially reworded. AIR treats verifiable prompt formats (e.g., multiple-choice) as anchors and uses them to regularize open-ended variants, without degrading performance on the reliable anchor variants. The goal is for safety decisions to depend on the underlying intent rather than the surface form.
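The summary does not state the paper's actual objective, so the following is only a rough illustration of what an anchor-based consistency penalty could look like, not the authors' formulation. It assumes each prompt variant can be reduced to a single refusal probability, and that the anchor is treated as a fixed target so the penalty only moves the open-ended variants. The names `bernoulli_kl` and `anchor_invariance_penalty` are hypothetical.

```python
import math


def bernoulli_kl(p: float, q: float, eps: float = 1e-8) -> float:
    """KL(Bernoulli(p) || Bernoulli(q)), with inputs clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def anchor_invariance_penalty(anchor_refusal_prob: float,
                              variant_refusal_probs: list[float]) -> float:
    """Mean divergence of open-ended variants from the anchor's decision.

    Assumption: the anchor probability is a constant target (in a real
    training loop it would carry no gradient), so minimizing this term
    pulls the variants toward the anchor and not the other way around.
    """
    if not variant_refusal_probs:
        return 0.0
    total = sum(bernoulli_kl(anchor_refusal_prob, q) for q in variant_refusal_probs)
    return total / len(variant_refusal_probs)


# Hypothetical numbers: the multiple-choice anchor refuses with p = 0.95,
# while two open-ended rewordings of the same request refuse less reliably.
consistent = anchor_invariance_penalty(0.95, [0.95, 0.95])
brittle = anchor_invariance_penalty(0.95, [0.90, 0.20])
print(f"consistent variants: {consistent:.4f}")
print(f"brittle variants:    {brittle:.4f}")
```

Under these assumptions the penalty is zero when every rewording yields the same decision as the anchor and grows as a rewording drifts away from it, which matches the stated aim of making the decision depend on intent rather than surface form.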
Key facts
- Paper arXiv:2605.20994v1 proposes Anchor Invariance Regularization (AIR)
- Addresses context-invariant safety alignment for LLMs
- Current safety behavior is brittle under adversarial wording
- AIR treats verifiable prompts as anchors
- Aims to reduce cross-context discrepancies without lowering performance on reliable variants