ARTFEED — Contemporary Art Intelligence

HiPO: Hierarchical Preference Optimization Enhances LLM Reasoning

ai-technology · 2026-04-24

A new paper on arXiv introduces HiPO (Hierarchical Preference Optimization), an extension of Direct Preference Optimization (DPO) for aligning large language models with human preferences on complex reasoning tasks. DPO optimizes over entire responses, so it provides no fine-grained training signal on the individual steps of a multi-step solution. Existing methods cover only one side of the problem: KTO and RSO offer stable preference learning, while ReMA and Tree of Thoughts offer structured reasoning, but none delivers both. HiPO splits each response into reasoning segments (query clarification, reasoning steps, and answer) and computes the training loss as a weighted sum of per-segment DPO losses, enabling segment-specific supervision while retaining DPO's computational efficiency.
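
In symbols, and assuming the simplest reading of the summary above (the weighting scheme, segment notation, and conditioning are illustrative assumptions, not the paper's exact formulation), the objective would be a weighted sum of standard DPO terms, one per segment:

    \mathcal{L}_{\mathrm{HiPO}} \;=\; \sum_{k \in \{\text{clarify},\,\text{reason},\,\text{answer}\}} w_k \,\mathcal{L}_{\mathrm{DPO}}^{(k)}

    \mathcal{L}_{\mathrm{DPO}}^{(k)} \;=\; -\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w^{(k)} \mid x)}{\pi_{\mathrm{ref}}(y_w^{(k)} \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l^{(k)} \mid x)}{\pi_{\mathrm{ref}}(y_l^{(k)} \mid x)}\Big)

Here y_w^{(k)} and y_l^{(k)} are the preferred and dispreferred versions of segment k for a single preference pair, w_k are the segment weights, and beta is DPO's usual temperature; each term is the unmodified DPO loss restricted to one segment's tokens, which is why the overall cost stays close to plain DPO.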

Key facts

  • HiPO is proposed as an extension of DPO.
  • It separates responses into reasoning segments.
  • Segments include query clarification, reasoning steps, and answer.
  • The loss is a weighted sum of per-segment DPO losses (see the sketch after this list).
  • Existing methods like KTO and RSO excel at stable preference learning.
  • ReMA and Tree of Thoughts excel at structured reasoning.
  • HiPO aims to combine the complementary strengths of both families of methods.
  • The paper is on arXiv with ID 2604.20140.
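
To make the weighted sum concrete, the following is a minimal PyTorch sketch under stated assumptions: the segment names, the weights, and the interface that supplies per-segment (chosen, rejected) log-probabilities are all illustrative, not the paper's actual implementation.

    import torch
    import torch.nn.functional as F

    # Illustrative segment names and weights (assumptions, not from the paper).
    SEGMENTS = ("query_clarification", "reasoning_steps", "answer")
    WEIGHTS = {"query_clarification": 0.2, "reasoning_steps": 0.5, "answer": 0.3}

    def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
        """Standard DPO loss for one (chosen, rejected) pair of log-probs."""
        margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
        return -F.logsigmoid(margin)

    def hipo_loss(seg_logps, seg_ref_logps, beta=0.1):
        """Weighted sum of per-segment DPO losses.

        seg_logps / seg_ref_logps map each segment name to a (chosen, rejected)
        pair of summed token log-probabilities under the policy / reference model.
        """
        total = torch.tensor(0.0)
        for seg in SEGMENTS:
            lw, ll = seg_logps[seg]      # policy log-probs: chosen, rejected
            rw, rl = seg_ref_logps[seg]  # frozen reference-model log-probs
            total = total + WEIGHTS[seg] * dpo_loss(lw, ll, rw, rl, beta)
        return total

    # Toy usage on one preference pair, with made-up scalar log-probabilities:
    policy = {s: (torch.tensor(-3.0), torch.tensor(-5.0)) for s in SEGMENTS}
    reference = {s: (torch.tensor(-4.0), torch.tensor(-4.0)) for s in SEGMENTS}
    print(hipo_loss(policy, reference))  # one scalar loss per whole response

Because each term is just DPO applied to a slice of the response, the per-example cost stays close to plain DPO; only the log-probability bookkeeping is split by segment.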

Entities

Institutions

  • arXiv

Sources

  • arXiv:2604.20140 · https://arxiv.org/abs/2604.20140