RC-DPO: Mitigating Hallucinations in Multimodal Large Reasoning Models
A new paper on arXiv (2605.27906) introduces Reasoning-Conditioned Direct Preference Optimization (RC-DPO) to reduce hallucinations in Multimodal Large Reasoning Models. The authors argue that existing response-level Direct Preference Optimization (DPO) treats the Chain-of-Thought (CoT) and the final answer as a single monolithic output, which leaves the CoT itself with insufficient supervision. RC-DPO instead models the CoT explicitly as a condition for answer generation and contrasts preferences for the same preferred answer under different CoT conditions, with the aim of improving reasoning quality and reducing hallucinations.
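To make the contrast concrete, here is a minimal Python sketch. The `dpo_loss` function is the standard DPO objective for a single preference pair. Everything else is an assumption made for illustration: the dictionary keys, the function names `response_level_loss` and `reasoning_conditioned_loss`, and in particular the form of the reasoning-conditioned term, which is only one plausible reading of "contrasting the same preferred answer under different CoT conditions" and is not taken from the paper.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, from summed token log-probabilities.

    pol_* come from the policy, ref_* from the frozen reference model;
    *_w is the preferred sample and *_l the dispreferred one.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * margin)


def response_level_loss(pol: dict, ref: dict, beta: float = 0.1) -> float:
    """Response-level DPO: CoT and answer are scored as one sequence.

    Hypothetical keys (log-probabilities):
      "cot_w", "ans_w": log p(c_w | x) and log p(a_w | x, c_w)
      "cot_l", "ans_l": log p(c_l | x) and log p(a_l | x, c_l)
    """
    return dpo_loss(pol["cot_w"] + pol["ans_w"], pol["cot_l"] + pol["ans_l"],
                    ref["cot_w"] + ref["ans_w"], ref["cot_l"] + ref["ans_l"],
                    beta)


def reasoning_conditioned_loss(pol: dict, ref: dict, beta: float = 0.1) -> float:
    """ASSUMED form, not the paper's: same preferred answer, two CoT conditions.

    Hypothetical keys (log-probabilities):
      "ans_w_given_cot_w": log p(a_w | x, c_w)
      "ans_w_given_cot_l": log p(a_w | x, c_l)
    """
    return dpo_loss(pol["ans_w_given_cot_w"], pol["ans_w_given_cot_l"],
                    ref["ans_w_given_cot_w"], ref["ans_w_given_cot_l"],
                    beta)


if __name__ == "__main__":
    # Made-up log-probabilities, purely to exercise the functions.
    pol = {"cot_w": -12.0, "ans_w": -2.0, "cot_l": -11.0, "ans_l": -4.0,
           "ans_w_given_cot_w": -2.0, "ans_w_given_cot_l": -5.0}
    ref = {"cot_w": -12.5, "ans_w": -3.0, "cot_l": -11.0, "ans_l": -3.5,
           "ans_w_given_cot_w": -3.0, "ans_w_given_cot_l": -4.0}
    print("response-level:", response_level_loss(pol, ref))
    print("reasoning-conditioned (assumed):", reasoning_conditioned_loss(pol, ref))
```

In the response-level loss the CoT and answer log-probabilities are simply added, so the gradient cannot tell which part earned the preference. In the assumed conditioned variant the answer is held fixed and only the CoT condition changes, so the margin depends on how well each reasoning chain supports that answer.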
Key facts
- arXiv paper 2605.27906 proposes RC-DPO
- RC-DPO addresses hallucinations in Multimodal Large Reasoning Models
- Existing response-level DPO treats CoT and answer as a monolithic output
- RC-DPO models CoT as condition for answer generation
- RC-DPO contrasts preferences under different CoT conditions
- Paper reports that response-level DPO performs similarly to answer-only optimization
- RC-DPO promotes answer-specific CoT-level supervision
- Method aims to improve reasoning and reduce hallucinations
Entities
Institutions
- arXiv