TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization
Researchers have introduced TUR-DPO, an extension of Direct Preference Optimization (DPO) that integrates topology and uncertainty to better align large language models (LLMs) with human preferences. Standard DPO treats preferences as binary outcomes and is sensitive to noisy labels arising from fragile chains of thought; TUR-DPO instead rewards how answers are derived, representing the reasoning behind each answer with lightweight reasoning topologies. It combines semantic faithfulness, utility, and topology quality into a calibrated uncertainty signal, and a small learnable reward factorized over these signals feeds an uncertainty-weighted DPO objective. The method remains reinforcement-learning-free, relying on either a fixed or an adaptive reference policy. Empirical evaluations on open 7-8B models across math and reasoning benchmarks show notable improvements. The findings are detailed in a paper available on arXiv, ID 2605.00224.
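To make the "uncertainty-weighted DPO" idea concrete, here is a minimal sketch of such an objective, assuming the method scales each pair's standard DPO loss by a confidence weight derived from a calibrated uncertainty. The function name, the `1 - uncertainty` weighting, and the `beta` scale are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_dpo_loss(
    policy_chosen_logps: torch.Tensor,   # log pi(y_w | x), shape (B,)
    policy_rejected_logps: torch.Tensor, # log pi(y_l | x), shape (B,)
    ref_chosen_logps: torch.Tensor,      # log pi_ref(y_w | x), shape (B,)
    ref_rejected_logps: torch.Tensor,    # log pi_ref(y_l | x), shape (B,)
    uncertainty: torch.Tensor,           # calibrated uncertainty in [0, 1], shape (B,)
    beta: float = 0.1,
) -> torch.Tensor:
    # Standard DPO logits: implicit reward margin between chosen and rejected,
    # measured against a fixed (or periodically updated) reference policy.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_rewards - rejected_rewards)
    # Per-pair DPO loss; no reinforcement-learning rollouts are needed.
    per_pair_loss = -F.logsigmoid(logits)
    # Down-weight noisy pairs: low-uncertainty pairs dominate the gradient.
    weights = 1.0 - uncertainty
    return (weights * per_pair_loss).mean()
```

In this sketch the weighting is linear in the confidence; the paper's calibration and weighting scheme may differ.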
Key facts
- TUR-DPO is a variant of Direct Preference Optimization (DPO).
- It addresses sensitivity to noisy preferences from fragile chains of thought.
- It rewards how answers are derived, not just what they say.
- It uses lightweight reasoning topologies.
- It combines semantic faithfulness, utility, and topology quality into a calibrated uncertainty signal.
- It incorporates a small learnable reward factorized over these signals (a sketch of one possible fusion follows this list).
- It remains reinforcement-learning-free.
- It is evaluated empirically on open 7-8B models and math/reasoning benchmarks.
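The following sketch shows one way the three signals could be fused into a calibrated uncertainty by a small learnable head, assuming an additive (factorized) reward over per-signal weights and a learned calibration temperature. The class name, the linear form, and the sigmoid calibration are assumptions for illustration; the source does not specify the fusion.

```python
import torch
import torch.nn as nn

class FactorizedRewardHead(nn.Module):
    """Hypothetical head: a scalar reward factorized over three quality signals."""

    def __init__(self):
        super().__init__()
        # One learnable weight per signal plus a calibration temperature.
        self.signal_weights = nn.Parameter(torch.ones(3))
        self.log_temperature = nn.Parameter(torch.zeros(()))

    def forward(self, signals: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # signals: (B, 3) = [semantic_faithfulness, utility, topology_quality],
        # each assumed pre-normalized to [0, 1].
        reward = signals @ self.signal_weights            # (B,)
        temperature = self.log_temperature.exp()
        confidence = torch.sigmoid(reward / temperature)  # calibrated in (0, 1)
        uncertainty = 1.0 - confidence                    # feeds the weighted DPO loss
        return reward, uncertainty
```

The returned uncertainty would then plug into a weighted DPO objective such as the one sketched above.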
Entities
Institutions
- arXiv