Multi-Objective Alignment in LLMs: Preference Dimensional Expansion
A new arXiv paper (2605.11679) proposes a method to overcome the safety-helpfulness trade-off in large language model alignment. The authors argue that existing approaches, such as data selection, parameter merging, and algorithmic balancing, only trade one objective against the other along a fixed Pareto frontier. By scaling up rollouts and analyzing multi-dimensional rewards, they find that the conflict stems from restrictions inherent in the prompts themselves. To break this zero-sum conflict between helpfulness and harmlessness, the work introduces preference dimensional expansion.
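To make the trade-off concrete, here is a minimal Python sketch (not from the paper) of the kind of analysis it describes: score a batch of rollouts for one prompt along two reward dimensions and mark which ones sit on the Pareto frontier. The reward values and the pareto_frontier helper are hypothetical, for illustration only.

```python
from typing import List, Sequence

def pareto_frontier(points: List[Sequence[float]]) -> List[int]:
    """Return indices of points not dominated by any other point.

    One point dominates another if it is at least as good on every
    dimension and strictly better on at least one.
    """
    frontier = []
    for i, p in enumerate(points):
        dominated = any(
            all(qd >= pd for qd, pd in zip(q, p))
            and any(qd > pd for qd, pd in zip(q, p))
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            frontier.append(i)
    return frontier

# Hypothetical (helpfulness, harmlessness) rewards for four rollouts
# of the same prompt; the numbers are illustrative, not from the paper.
rollout_rewards = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.4, 0.4)]
print(pareto_frontier(rollout_rewards))  # [0, 1, 2] -- (0.4, 0.4) is dominated
```

With a fixed two-dimensional reward, any alignment method can only pick a point on this frontier; the paper's position is that the compromise should be broken rather than merely navigated.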
Key facts
- arXiv paper 2605.11679
- Addresses the safety-helpfulness ceiling in LLM alignment
- Multi-objective alignment currently faces a zero-sum conflict between helpfulness and harmlessness
- Prior work uses data selection, parameter merging, algorithmic balancing
- New approach: preference dimensional expansion (see the sketch after this list)
- Scaling up rollouts and analyzing multi-dimensional rewards
- Conflict arises from prompt-inherent restrictions
- Aims to break the compromises forced along a fixed Pareto frontier
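The paper's exact construction is not reproduced here, but a toy sketch can show why expanding the preference space matters: a response that is strictly dominated under two reward dimensions can become non-dominated once a third axis is scored, so choosing between the two responses is no longer zero-sum. The third dimension below (how informatively a refusal is explained) is a made-up example, not taken from the paper.

```python
from typing import Sequence

def dominated(p: Sequence[float], others: Sequence[Sequence[float]]) -> bool:
    """True if some point in `others` is >= p everywhere and > p somewhere."""
    return any(
        all(qd >= pd for qd, pd in zip(q, p))
        and any(qd > pd for qd, pd in zip(q, p))
        for q in others
    )

# Two hypothetical responses scored on (helpfulness, harmlessness):
# B is dominated by A, so any two-dimensional aligner must prefer A.
a2, b2 = (0.8, 0.7), (0.6, 0.6)
print(dominated(b2, [a2]))  # True

# Add a third, hypothetical preference dimension (e.g. how well a
# refusal is explained). B wins on the new axis, so neither response
# dominates the other and the conflict is no longer zero-sum.
a3, b3 = (0.8, 0.7, 0.3), (0.6, 0.6, 0.9)
print(dominated(b3, [a3]), dominated(a3, [b3]))  # False False
```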