CU-DPO Improves LLM Reasoning with Continuous Utility Signals
Continuous Utility Direct Preference Optimization (CU-DPO) replaces binary preference labels with continuous utility scores, enabling a more nuanced assessment of reasoning quality in large language models. The framework aligns models to a portfolio of prompt-based cognitive strategies. On the theory side, the authors show that learning with K strategies yields a Theta(K log K) improvement in sample complexity over binary preferences, and that the DPO objective converges to the entropy-regularized utility-maximizing policy. Training proceeds in two stages: strategy selection, which teaches the model to identify the most effective strategy through best-vs-all comparisons, followed by execution refinement, which uses marginal signals to ensure the chosen strategy is applied correctly. The paper is available on arXiv as 2602.00931.
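For reference, the entropy-regularized objective behind the convergence claim is typically written as below. This is the standard KL-regularized form that DPO-style methods target (entropy regularization relative to a reference policy), with the continuous utility u(x, y) in place of a learned reward; the paper's exact formulation may differ in detail.

```latex
% Standard KL-regularized utility maximization (assumed form):
%   u(x, y)            : continuous utility score of response y to prompt x
%   \pi_{\mathrm{ref}} : reference policy,  \beta : regularization strength
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[\, u(x, y) \,\big]
  \;-\; \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)

% Closed-form maximizer (the entropy-regularized utility-maximizing policy):
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(u(x, y)/\beta\big)
```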
Key facts
- CU-DPO replaces binary labels with continuous scores for reasoning quality.
- Framework aligns models to a portfolio of prompt-based cognitive strategies.
- Learning with K strategies yields Theta(K log K) improvement in sample complexity.
- DPO converges to the entropy-regularized utility-maximizing policy.
- Two-stage pipeline: strategy selection, then execution refinement (sketched after this list).
- Strategy selection uses best-vs-all comparisons.
- Execution refinement uses marginal signals.
- Paper available on arXiv: 2602.00931.
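As a concrete illustration of the two-stage pipeline, here is a minimal sketch of how preference data for each stage could be constructed. The pairing and weighting rules, field names, and utility values are illustrative assumptions, not the paper's implementation; only the high-level structure (best-vs-all pairs for strategy selection, margin-weighted pairs within the chosen strategy for execution refinement) follows the summary.

```python
"""Minimal sketch of two-stage preference-pair construction, under assumptions
stated above. `Sample`, its fields, and the utility values are hypothetical."""
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Sample:
    strategy: str    # which prompt-based cognitive strategy produced the answer
    response: str    # model output under that strategy
    utility: float   # continuous utility score in place of a binary label


def strategy_selection_pairs(samples: list[Sample]) -> list[tuple[Sample, Sample]]:
    """Stage 1: best-vs-all comparisons.

    The single highest-utility strategy is preferred over every other strategy,
    teaching the model which strategy to pick for the prompt.
    """
    best = max(samples, key=lambda s: s.utility)
    return [(best, other) for other in samples if other is not best]


def execution_refinement_pairs(
    samples: list[Sample], chosen_strategy: str
) -> list[tuple[Sample, Sample, float]]:
    """Stage 2: marginal signals within the chosen strategy.

    Pairs of executions of the same strategy are weighted by their utility
    margin, so larger quality gaps carry stronger preference signals.
    """
    same = [s for s in samples if s.strategy == chosen_strategy]
    pairs = []
    for a, b in combinations(same, 2):
        better, worse = (a, b) if a.utility >= b.utility else (b, a)
        pairs.append((better, worse, better.utility - worse.utility))
    return pairs


if __name__ == "__main__":
    candidates = [
        Sample("decompose", "...", 0.9),
        Sample("decompose", "...", 0.6),
        Sample("analogy", "...", 0.4),
        Sample("self-check", "...", 0.7),
    ]
    print(strategy_selection_pairs(candidates))
    print(execution_refinement_pairs(candidates, "decompose"))
```

In this reading, the continuous scores matter twice: they pick the best strategy in stage one and set the margin weight in stage two.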