Anchor Invariance Regularization for LLM Safety Alignment

ai-technology · 2026-05-22

A new arXiv paper proposes Anchor Invariance Regularization (AIR), a method for improving safety alignment in large language models by enforcing context-invariant behavior. The motivating problem is that current preference-based post-training often fails when a harmful request is reworded adversarially. AIR treats verifiable prompts (e.g., multiple-choice) as anchors and uses them to regularize open-ended variants of the same request, with the aim of not degrading performance on the formats that already give reliable signals. The goal is for safety decisions to depend on a request's underlying intent rather than its surface form.
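
The summary above does not state AIR's actual training objective, so the following is only an illustrative sketch of what an anchor-based invariance penalty could look like, not the paper's formulation. It assumes the model's behavior on each prompt is reduced to a single refusal probability, and it measures how far each open-ended variant drifts from the anchor with a Bernoulli KL divergence. The function names (`bernoulli_kl`, `anchor_invariance_penalty`) and the choice of KL are hypothetical.

```python
import math


def bernoulli_kl(p, q, eps=1e-9):
    """KL(Bernoulli(p) || Bernoulli(q)), with inputs clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def anchor_invariance_penalty(anchor_refuse_prob, variant_refuse_probs):
    """Mean divergence of open-ended variants from a fixed anchor.

    Assumption (not from the paper): the anchor probability is treated as a
    constant target, so in a real training loop only the variants would be
    pushed toward it and the anchor behavior would be left untouched.
    """
    if not variant_refuse_probs:
        return 0.0
    total = sum(bernoulli_kl(anchor_refuse_prob, q) for q in variant_refuse_probs)
    return total / len(variant_refuse_probs)


# Hypothetical numbers: the model refuses the multiple-choice anchor with
# probability 0.9, but only 0.2 for an adversarially reworded variant.
consistent = anchor_invariance_penalty(0.9, [0.9, 0.9])
brittle = anchor_invariance_penalty(0.9, [0.2])
print(consistent < brittle)
```

Under these assumptions the penalty is zero when every variant matches the anchor and grows as a reworded variant diverges from it, which mirrors the stated goal of reducing cross-context discrepancies while holding the reliable format fixed.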

Key facts

Paper arXiv:2605.20994v1 proposes Anchor Invariance Regularization (AIR)
Addresses context-invariant safety alignment for LLMs
Current safety behavior is brittle under adversarial wording
AIR treats verifiable prompts as anchors
Aims to reduce cross-context discrepancies without lowering performance on reliable variants

Sources

arXiv cs.AI — 2026-05-21