ARTFEED — Contemporary Art Intelligence

RC-DPO: Mitigating Hallucinations in Multimodal Large Reasoning Models

ai-technology · 2026-05-28

A new paper on arXiv (2605.27906) introduces Reasoning-Conditioned Direct Preference Optimization (RC-DPO), a method for reducing hallucinations in Multimodal Large Reasoning Models. The authors argue that existing response-level Direct Preference Optimization (DPO) treats the Chain-of-Thought (CoT) and the final answer as one monolithic output, which leaves the CoT itself insufficiently supervised. RC-DPO instead models the CoT explicitly as a condition for answer generation and contrasts preferences under different CoT conditions for the same preferred answer, with the aim of improving reasoning quality and reducing hallucinations.
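
For context, the standard DPO objective and the chain-rule factorization of a reasoning response can be written as below. The notation is an assumption made for illustration and is not taken from the paper: x is the multimodal input, c the CoT, a the final answer, and y_w, y_l the preferred and dispreferred responses.

```latex
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right],
\qquad
\log \pi_\theta(y \mid x)
  = \log \pi_\theta(c \mid x) + \log \pi_\theta(a \mid x, c),
\quad y = (c, a).
```

When y is the full response, only the sum of the two log-probability terms enters the preference margin, so the objective never contrasts the CoT on its own. That is one way to read the "monolithic output" criticism; the exact RC-DPO objective should be taken from the paper itself.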

Key facts

  • arXiv paper 2605.27906 proposes RC-DPO
  • RC-DPO addresses hallucinations in Multimodal Large Reasoning Models
  • Existing response-level DPO treats CoT and answer as one monolithic output
  • RC-DPO models CoT as condition for answer generation
  • RC-DPO contrasts preferences under different CoT conditions
  • Paper finds that response-level DPO performs similarly to answer-only optimization
  • RC-DPO promotes answer-specific CoT-level supervision
  • Method aims to improve reasoning and reduce hallucinations
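
The difference between the two kinds of contrast listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `LogProbs` container, the function names, the `beta` value, and every log-probability number are hypothetical, and `cot_conditioned_loss` is only one plausible reading of "contrasting preferences under different CoT conditions for the same preferred answer".

```python
import math
from dataclasses import dataclass


@dataclass
class LogProbs:
    """Summed token log-probabilities for one (CoT, answer) response."""
    cot: float     # log p(CoT | input)
    answer: float  # log p(answer | input, CoT)


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_loss(policy_w, ref_w, policy_l, ref_l, beta=0.1):
    """Standard DPO form: -log sigmoid(beta * implicit reward margin)."""
    margin = (policy_w - ref_w) - (policy_l - ref_l)
    return -log_sigmoid(beta * margin)


def response_level_loss(pol_w, ref_w, pol_l, ref_l, beta=0.1):
    """Response-level DPO: CoT and answer are scored as one sequence."""
    return preference_loss(
        pol_w.cot + pol_w.answer, ref_w.cot + ref_w.answer,
        pol_l.cot + pol_l.answer, ref_l.cot + ref_l.answer,
        beta,
    )


def cot_conditioned_loss(pol_ans_good, ref_ans_good,
                         pol_ans_bad, ref_ans_bad, beta=0.1):
    """ASSUMED reading of the reasoning-conditioned contrast.

    Each argument is log p(preferred answer | input, CoT): the same
    preferred answer scored under a preferred CoT ("good") and under a
    dispreferred CoT ("bad"), for the policy and the reference model.
    """
    return preference_loss(pol_ans_good, ref_ans_good,
                           pol_ans_bad, ref_ans_bad, beta)


if __name__ == "__main__":
    # Hypothetical log-probabilities, chosen only to exercise the functions.
    pol_w, ref_w = LogProbs(cot=-40.0, answer=-2.0), LogProbs(cot=-41.0, answer=-2.5)
    pol_l, ref_l = LogProbs(cot=-42.0, answer=-6.0), LogProbs(cot=-41.5, answer=-5.0)
    print("response-level loss:", response_level_loss(pol_w, ref_w, pol_l, ref_l))
    print("CoT-conditioned loss:", cot_conditioned_loss(-2.0, -2.5, -4.0, -3.0))
```

Both losses share the same Bradley-Terry form; what changes is the quantity being compared. In the response-level version the CoT and answer terms are summed before the comparison, while in the conditioned version the answer is held fixed and only the CoT it is conditioned on varies.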

Entities

Institutions

  • arXiv

Sources