RC-DPO: Mitigating Hallucinations in Multimodal Large Reasoning Models

ai-technology · 2026-05-28

A new paper on arXiv (2605.27906) introduces Reasoning-Conditioned Direct Preference Optimization (RC-DPO), a method for reducing hallucinations in Multimodal Large Reasoning Models. The authors argue that existing response-level Direct Preference Optimization (DPO) treats the Chain-of-Thought (CoT) and the final answer as a single monolithic output, which leaves the CoT itself with insufficient supervision. RC-DPO instead explicitly models the CoT as a condition for answer generation and contrasts preferences under different CoT conditions for the same preferred answer, with the aim of improving reasoning quality and reducing hallucinations.
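
For reference, the "monolithic" treatment can be made concrete with the standard response-level DPO objective and the chain-rule factorization of a response into CoT and answer. The notation is chosen here for illustration and is not taken from the paper: x is the image-and-question input, y = (c, a) is a response made of a CoT c and an answer a, subscripts w and l mark the preferred and rejected responses, \pi_{\mathrm{ref}} is the frozen reference policy, and \beta is the usual DPO temperature.

```latex
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right],
  \qquad y = (c, a)

\log \pi_\theta(y \mid x)
  = \log \pi_\theta(c \mid x) + \log \pi_\theta(a \mid x, c)
```

The loss depends only on the summed log-probability of each response, so it does not distinguish how much of the preference margin comes from the CoT term and how much from the answer term. The conditional term \log \pi_\theta(a \mid x, c) is the quantity that a CoT-conditioned contrast, as described in the summary, would presumably isolate; the paper's exact objective is not reproduced here.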

Key facts

arXiv paper 2605.27906 proposes RC-DPO
RC-DPO addresses hallucinations in Multimodal Large Reasoning Models
Existing response-level DPO treats CoT and answer as a monolithic output
RC-DPO models CoT as condition for answer generation
RC-DPO contrasts preferences under different CoT conditions
Paper reports that response-level DPO performs similarly to answer-only optimization
RC-DPO promotes answer-specific CoT-level supervision
Method aims to improve reasoning and reduce hallucinations
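
The contrast between response-level and CoT-conditioned preference pairs listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names LogProb, response_level_loss, and cot_conditioned_loss are hypothetical, the log-probabilities are plain floats rather than model outputs, and the conditioned variant is only one plausible reading of "contrasting preferences under different CoT conditions for the same preferred answer".

```python
import math
from dataclasses import dataclass


@dataclass
class LogProb:
    """Sequence log-probability under the trained policy and a frozen reference."""
    policy: float
    reference: float

    def ratio(self) -> float:
        # Log-ratio log(pi_theta / pi_ref) used by DPO.
        return self.policy - self.reference

    def __add__(self, other: "LogProb") -> "LogProb":
        # Log-probs of consecutive segments add (chain rule).
        return LogProb(self.policy + other.policy,
                       self.reference + other.reference)


def log_sigmoid(z: float) -> float:
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(chosen: LogProb, rejected: LogProb, beta: float = 0.1) -> float:
    # Standard DPO loss for a single preference pair.
    return -log_sigmoid(beta * (chosen.ratio() - rejected.ratio()))


def response_level_loss(cot_w: LogProb, ans_w: LogProb,
                        cot_l: LogProb, ans_l: LogProb,
                        beta: float = 0.1) -> float:
    # Response-level DPO: CoT and answer are summed into one monolithic score.
    return dpo_loss(cot_w + ans_w, cot_l + ans_l, beta)


def cot_conditioned_loss(ans_given_good_cot: LogProb,
                         ans_given_bad_cot: LogProb,
                         beta: float = 0.1) -> float:
    # ASSUMPTION (hypothetical reading of the summary): score the SAME preferred
    # answer under two different CoT conditions, so that the preference signal
    # depends on the CoT rather than on the answer text.
    return dpo_loss(ans_given_good_cot, ans_given_bad_cot, beta)


if __name__ == "__main__":
    # Made-up numbers: the preferred answer is more likely after the faithful CoT.
    good = LogProb(policy=-0.5, reference=-1.0)
    bad = LogProb(policy=-3.0, reference=-1.0)
    print(cot_conditioned_loss(good, bad))
```

In the conditioned variant both sides of the pair share the same answer tokens, so any margin has to come from the CoT each one is conditioned on. In the response-level variant, a large answer-level margin can dominate the sum regardless of CoT quality.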
