ARTFEED — Contemporary Art Intelligence

Curriculum Learning Boosts Safety Alignment in LLMs

ai-technology · 2026-05-27

A recent paper on arXiv (2605.26315) presents Staged-Competence, a curriculum learning framework designed to improve Direct Preference Optimization (DPO) for safety alignment in large language models. The approach organizes preference data by difficulty, uses competence-based sampling, and progressively updates the reference model during training. Across three model families, Staged-Competence reduces harmful response rates in out-of-distribution scenarios by 16% and lowers jailbreak attack success rates by 20%, while preserving general capabilities with near-zero over-refusal. It matches baseline safety using only 75% of the training data and produces a clearer separation between safe and unsafe responses. The framework is also agnostic to the policy optimization algorithm.
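For context, the reference model matters because the standard DPO objective scores each preference pair by how much more the policy favors the chosen response than the reference does. The sketch below shows that per-pair loss in plain Python. It is the generic DPO loss, not code from the paper; the function name, argument names, and the default beta value are illustrative assumptions.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic per-pair DPO loss: -log(sigmoid(beta * margin)).

    All inputs are sequence log-probabilities. The name and the default
    beta are assumptions made for this illustration.
    """
    # How much more the policy prefers the chosen (safe) response over the
    # rejected one, measured relative to the frozen reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen response relative to the reference. Updating the reference during training therefore changes the point from which that shift is measured.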

Key facts

  • Staged-Competence reduces OOD harmful response rates by 16%.
  • Jailbreak attack success rates drop by 20%.
  • Matches baseline safety with 75% of training data.
  • Framework is agnostic to policy optimization algorithm.
  • Preserves general capabilities with near-zero over-refusal.
  • Uses curriculum learning to organize preference data by difficulty.
  • Employs competence-based sampling.
  • Progressively updates the reference model during training.
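
The last three points (difficulty ordering, competence-based sampling, and staged reference updates) can be sketched as a toy training loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the square-root competence schedule is borrowed from earlier competence-based curriculum learning work, and the difficulty function, stage count, and all names are hypothetical.

```python
import math
import random


def competence(step, total_steps, c0=0.2):
    """Assumed square-root schedule: fraction of the sorted data unlocked at `step`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return min(1.0, math.sqrt(frac * (1.0 - c0 ** 2) + c0 ** 2))


def sample_batch(sorted_pairs, step, total_steps, batch_size, rng):
    """Sample uniformly from the easiest slice allowed by the current competence."""
    cutoff = max(1, math.ceil(competence(step, total_steps) * len(sorted_pairs)))
    pool = sorted_pairs[:cutoff]
    return [rng.choice(pool) for _ in range(batch_size)]


def run_curriculum(pairs, difficulty, total_steps, num_stages, batch_size=4, seed=0):
    """Toy loop: returns the steps at which the reference model is refreshed.

    `difficulty` is a hypothetical scoring function (lower means easier).
    Integer version counters stand in for real model weights.
    """
    rng = random.Random(seed)
    sorted_pairs = sorted(pairs, key=difficulty)  # easy-to-hard ordering
    stage_len = math.ceil(total_steps / num_stages)
    policy_version, ref_version = 0, 0
    ref_updates = []
    for step in range(total_steps):
        if step > 0 and step % stage_len == 0:
            ref_version = policy_version  # staged reference refresh
            ref_updates.append(step)
        batch = sample_batch(sorted_pairs, step, total_steps, batch_size, rng)
        # A real loop would compute a preference loss on `batch` against the
        # reference snapshot `ref_version` and take an optimizer step here.
        policy_version += 1
    return ref_updates
```

Early steps draw only from the easiest pairs, and the pool widens until the full dataset is available. Copying the policy into the reference at stage boundaries is one plausible reading of "progressively updates"; the paper may use a different rule.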

Entities

Institutions

  • arXiv

Sources