Curriculum Learning Boosts Safety Alignment in LLMs

ai-technology · 2026-05-27

A recent arXiv paper (2605.26315) presents Staged-Competence, a curriculum learning framework that improves Direct Preference Optimization (DPO) for safety alignment in large language models. The approach organizes preference data by difficulty, uses competence-based sampling, and progressively updates the reference model during training. Staged-Competence reduces harmful response rates in out-of-distribution scenarios by 16% and jailbreak attack success rates by 20% across three model families, while preserving general capabilities with near-zero over-refusal. It matches baseline safety using only 75% of the training data, produces a clearer separation between safe and unsafe responses, and is agnostic to the policy optimization algorithm.
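To make the idea of competence-based sampling concrete, here is a minimal sketch. It is not the paper's implementation: the square-root competence schedule, the starting competence `c0`, the precomputed `difficulty` score, and the function names are all assumptions chosen for illustration. The sketch sorts preference pairs from easy to hard and, at each training step, samples only from the easiest fraction of the data that the current competence allows.

```python
import math
import random

def competence(step, total_steps, c0=0.1):
    """Assumed square-root schedule: about c0 at step 0, 1.0 at total_steps."""
    frac = min(1.0, step / total_steps)
    return min(1.0, math.sqrt(frac * (1.0 - c0 ** 2) + c0 ** 2))

def sample_batch(pairs_sorted, step, total_steps, batch_size, rng=random):
    """Draw a batch uniformly from the easiest competence-fraction of the data."""
    c = competence(step, total_steps)
    cutoff = max(1, math.ceil(c * len(pairs_sorted)))
    pool = pairs_sorted[:cutoff]  # only pairs the model is "ready" for
    return [rng.choice(pool) for _ in range(batch_size)]

# Hypothetical preference pairs with a precomputed difficulty score in [0, 1].
pairs = [{"id": i, "difficulty": i / 99} for i in range(100)]
pairs_sorted = sorted(pairs, key=lambda p: p["difficulty"])

# Early in training only easy pairs are eligible; by the end, all pairs are.
early = sample_batch(pairs_sorted, 0, 1000, 8, rng=random.Random(0))
late = sample_batch(pairs_sorted, 1000, 1000, 8, rng=random.Random(0))
```

How difficulty is scored is left open here; the sketch only assumes that some scalar score per preference pair is available before training starts.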

Key facts

Staged-Competence reduces OOD harmful response rates by 16%.
Jailbreak attack success rates drop by 20%.
Matches baseline safety with 75% of training data.
Framework is agnostic to policy optimization algorithm.
Preserves general capabilities with near-zero over-refusal.
Uses curriculum learning to organize preference data by difficulty.
Employs competence-based sampling.
Progressively updates the reference model during training.
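
The last three points can be sketched together as a staged training loop. This is an illustrative sketch under stated assumptions, not the paper's code: the standard per-pair DPO loss is shown for context, and the reference model is assumed to be refreshed from the current policy at each stage boundary, since the summary does not say exactly when updates happen. The names `dpo_loss`, `train_staged`, and `update_fn` are hypothetical.

```python
import copy
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from sequence log-probabilities."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)), computed in a numerically stable form
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def train_staged(policy, stages, update_fn):
    """Train over stages ordered easy to hard, refreshing the reference between them."""
    reference = copy.deepcopy(policy)  # frozen reference for the first stage
    for stage in stages:
        for pair in stage:
            update_fn(policy, reference, pair)  # e.g. one DPO gradient step
        # Assumption: the reference is replaced by a copy of the policy per stage.
        reference = copy.deepcopy(policy)
    return policy, reference
```

Because `update_fn` is passed in, the same loop could wrap a different preference objective, which is one way to read the claim that the framework is agnostic to the policy optimization algorithm.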
