Multi-Objective Alignment in LLMs: Preference Dimensional Expansion
A new arXiv paper (2605.11679) proposes a method to overcome the safety-helpfulness trade-off in large language model alignment. The authors argue that existing approaches, such as data selection, parameter merging, and algorithmic balancing, only trade one objective against the other along a fixed Pareto frontier. By scaling up rollouts and analyzing multi-dimensional rewards, they find that the conflict stems from restrictions inherent in the prompts themselves. To break this zero-sum conflict between helpfulness and harmlessness, the work introduces preference dimensional expansion.
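To make the trade-off concrete, here is a minimal Python sketch (not from the paper) of the kind of analysis it describes: score a batch of rollouts for one prompt along two reward dimensions and mark which ones sit on the Pareto frontier. The reward values and the pareto_frontier helper are hypothetical, for illustration only.

```python
from typing import List, Sequence

def pareto_frontier(points: List[Sequence[float]]) -> List[int]:
    """Return indices of points not dominated by any other point.

    One point dominates another if it is at least as good on every
    dimension and strictly better on at least one.
    """
    frontier = []
    for i, p in enumerate(points):
        dominated = any(
            all(qd >= pd for qd, pd in zip(q, p))
            and any(qd > pd for qd, pd in zip(q, p))
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            frontier.append(i)
    return frontier

# Hypothetical (helpfulness, harmlessness) rewards for four rollouts
# of the same prompt; the numbers are illustrative, not from the paper.
rollout_rewards = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.4, 0.4)]
print(pareto_frontier(rollout_rewards))  # [0, 1, 2] -- (0.4, 0.4) is dominated
```

With a fixed two-dimensional reward, any alignment method can only pick a point on this frontier; the paper's position is that the compromise should be broken rather than merely navigated.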
Key facts
- arXiv paper 2605.11679
- Addresses the safety-helpfulness ceiling in LLM alignment
- Multi-objective alignment currently faces a zero-sum conflict between helpfulness and harmlessness
- Prior work uses data selection, parameter merging, algorithmic balancing
- New approach: preference dimensional expansion (see the sketch after this list)
- Scaling up rollouts and analyzing multi-dimensional rewards
- Conflict arises from prompt-inherent restrictions
- Aims to break the compromises forced along a fixed Pareto frontier
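The paper's exact construction is not reproduced here, but a toy sketch can show why expanding the preference space matters: a response that is strictly dominated under two reward dimensions can become non-dominated once a third axis is scored, so choosing between the two responses is no longer zero-sum. The third dimension below (how informatively a refusal is explained) is a made-up example, not taken from the paper.

```python
from typing import Sequence

def dominated(p: Sequence[float], others: Sequence[Sequence[float]]) -> bool:
    """True if some point in `others` is >= p everywhere and > p somewhere."""
    return any(
        all(qd >= pd for qd, pd in zip(q, p))
        and any(qd > pd for qd, pd in zip(q, p))
        for q in others
    )

# Two hypothetical responses scored on (helpfulness, harmlessness):
# B is dominated by A, so any two-dimensional aligner must prefer A.
a2, b2 = (0.8, 0.7), (0.6, 0.6)
print(dominated(b2, [a2]))  # True

# Add a third, hypothetical preference dimension (e.g. how well a
# refusal is explained). B wins on the new axis, so neither response
# dominates the other and the conflict is no longer zero-sum.
a3, b3 = (0.8, 0.7, 0.3), (0.6, 0.6, 0.9)
print(dominated(b3, [a3]), dominated(a3, [b3]))  # False False
```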