New AI research proposes DT2IT-MRM method for improving multimodal reward models
A recent study introduces DT2IT-MRM, a technique for improving multimodal reward models (MRMs), which align Multimodal Large Language Models (MLLMs) with human preferences. The method tackles three significant shortcomings of current multimodal preference datasets: a lack of granularity in preference strength, textual style bias, and unreliable preference signals. Existing open-source datasets also suffer from substantial noise and lack scalable curation solutions. DT2IT-MRM combines a debiased preference-construction pipeline, a reformulation of text-to-image (T2I) preference data, and an iterative training framework for curating existing datasets. Experimental results indicate that the approach significantly improves the quality of MRM training. The research was published on arXiv under the identifier arXiv:2604.19544v1.
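The paper does not publish its exact training objective, but multimodal reward models are commonly trained with a pairwise Bradley-Terry loss over (chosen, rejected) pairs. The sketch below is a minimal illustration of that standard formulation, with a hypothetical `strength` weight added to show how graded preference strength (one of the gaps the paper highlights) could enter the loss; the function names and the weighting scheme are assumptions, not the authors' method.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Standard Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    The loss shrinks as the model assigns a larger reward margin
    to the human-preferred sample.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def graded_preference_loss(reward_chosen: float,
                           reward_rejected: float,
                           strength: float = 1.0) -> float:
    """Hypothetical strength-weighted variant: a scalar in (0, 1]
    down-weights weak preferences instead of treating every pair
    as an equally confident label."""
    return strength * preference_loss(reward_chosen, reward_rejected)
```

For example, a tied pair (`margin = 0`) costs `ln 2 ≈ 0.693`, and the loss decreases monotonically as the margin grows, which is what pushes the reward model to separate preferred from rejected outputs.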
Key facts
- DT2IT-MRM is a new method for multimodal reward modeling
- Addresses lack of granularity in preference strength in existing datasets
- Targets textual style bias in multimodal preference data
- Aims to correct unreliable preference signals
- Open-source multimodal preference datasets suffer from substantial noise
- Integrates debiased preference construction pipeline
- Reformulates text-to-image (T2I) preference data
- Uses iterative training framework to curate existing datasets
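The iterative curation idea in the last bullet can be sketched as a simple loop: train a reward model, drop pairs the current model scores inconsistently with their labels, and retrain on the cleaner subset. This is a generic illustration under assumed names (`curate`, `reward_fn`, `margin_threshold`), not the paper's actual pipeline.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (chosen, rejected) sample pair

def curate(dataset: List[Pair],
           reward_fn: Callable[[str], float],
           margin_threshold: float = 0.0) -> List[Pair]:
    """Keep only pairs where the current reward model agrees with the
    preference label by more than a margin; the rest are treated as
    likely label noise and removed before the next training round."""
    return [
        (chosen, rejected)
        for chosen, rejected in dataset
        if reward_fn(chosen) - reward_fn(rejected) > margin_threshold
    ]
```

A full pipeline would alternate `train(model, data)` and `data = curate(data, model.score)` for a few rounds; the margin threshold trades off how aggressively noisy pairs are filtered against how much data survives.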
Entities
Institutions
- arXiv