Curriculum Learning Boosts Safety Alignment in LLMs
A recent arXiv paper (2605.26315) presents Staged-Competence, a curriculum learning framework for improving Direct Preference Optimization (DPO) for safety alignment in large language models. The approach organizes preference data by difficulty, uses competence-based sampling, and progressively updates the reference model during training. Staged-Competence reduces harmful response rates in out-of-distribution (OOD) scenarios by 16% and jailbreak attack success rates by 20% across three model families, while preserving general capabilities with near-zero over-refusal. It matches baseline safety using only 75% of the training data and creates a clearer distinction between safe and unsafe responses. The framework is agnostic to the policy optimization algorithm.
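To make the idea of competence-based sampling over difficulty-ordered data concrete, here is a minimal sketch. It is not the paper's implementation: the square-root competence schedule, the starting fraction c0, the difficulty scores, and the names competence and sample_batch are all assumptions chosen for illustration, since this summary does not specify them.

```python
import math
import random

def competence(step, total_steps, c0=0.1):
    """Fraction of the difficulty-sorted data unlocked at `step`.

    Assumed square-root schedule; c0 is a hypothetical starting fraction.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min(1.0, math.sqrt(progress * (1.0 - c0 ** 2) + c0 ** 2))

def sample_batch(sorted_pairs, step, total_steps, batch_size, rng):
    """Draw a batch uniformly from the easiest unlocked slice of the data."""
    unlocked = math.ceil(competence(step, total_steps) * len(sorted_pairs))
    # Unlock at least one batch worth of examples, never more than exist.
    unlocked = min(len(sorted_pairs), max(batch_size, unlocked))
    return rng.sample(sorted_pairs[:unlocked], batch_size)

# Toy preference pairs with a hypothetical difficulty score in [0, 1].
rng = random.Random(0)
pairs = [{"id": i, "difficulty": rng.random()} for i in range(100)]
pairs.sort(key=lambda p: p["difficulty"])  # easiest first

early = sample_batch(pairs, step=0, total_steps=1000, batch_size=8, rng=rng)
late = sample_batch(pairs, step=1000, total_steps=1000, batch_size=8, rng=rng)
```

In this sketch, early batches come only from the easiest tenth of the data, while by the final step the whole dataset is eligible. How difficulty is actually scored, and which schedule the paper uses, would need to be taken from the paper itself.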
Key facts
- Staged-Competence reduces out-of-distribution (OOD) harmful response rates by 16%.
- Jailbreak attack success rates drop by 20%.
- Matches baseline safety with 75% of training data.
- Framework is agnostic to the policy optimization algorithm.
- Preserves general capabilities with near-zero over-refusal.
- Uses curriculum learning to organize preference data by difficulty.
- Employs competence-based sampling.
- Progressively updates the reference model during training.
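The effect of updating the reference model can be illustrated with the standard DPO loss for a single preference pair. This is a sketch under stated assumptions: the log-probabilities are made-up numbers, beta = 0.1 is an arbitrary choice, and refreshing the reference at a stage boundary is assumed here, since the summary does not say when or how the paper performs the update.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))

# Hypothetical log-probabilities for one (chosen, rejected) pair.
ref = {"chosen": -12.0, "rejected": -11.0}    # reference model
policy = {"chosen": -9.0, "rejected": -14.0}  # policy after a training stage

loss_before = dpo_loss(policy["chosen"], policy["rejected"],
                       ref["chosen"], ref["rejected"])

# Stage boundary (assumed trigger): the reference becomes a snapshot of the policy.
ref = dict(policy)
loss_after = dpo_loss(policy["chosen"], policy["rejected"],
                      ref["chosen"], ref["rejected"])
```

Right after the refresh the margin is zero and the loss returns to log 2, so the next stage has to widen the gap between chosen and rejected responses relative to the updated reference rather than the original one.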
Entities
Institutions
- arXiv