DiNa-LRM: Diffusion-Native Latent Reward Model for Preference Optimization
Researchers propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Rewards from Vision-Language Models (VLMs), when used to optimize diffusion and flow-matching models, are computationally expensive and operate in pixel space, which creates a domain mismatch with the latent states being optimized. DiNa-LRM instead uses a noise-calibrated Thurstone likelihood whose uncertainty depends on the diffusion noise level, and it builds on a pretrained latent diffusion backbone with a timestep-conditioned reward head. It also supports inference-time noise ensembling, which gives it a mechanism for test-time scaling.
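To make the noise-calibrated Thurstone likelihood concrete, here is a minimal Python sketch. In a Thurstone model each item's latent score is Gaussian, so the probability that A is preferred over B is Phi((r_A - r_B) / sqrt(sigma_A^2 + sigma_B^2)), where Phi is the standard normal CDF. The summary above does not give DiNa-LRM's exact formula, so the linear schedule in `noise_sigma`, its bounds, and all function names are illustrative assumptions rather than the authors' implementation.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def noise_sigma(t: float, sigma_min: float = 0.1, sigma_max: float = 1.0) -> float:
    """Hypothetical schedule (an assumption): uncertainty grows linearly
    with the noise level t in [0, 1]."""
    return sigma_min + (sigma_max - sigma_min) * t


def thurstone_nll(r_win: float, r_lose: float, t_win: float, t_lose: float) -> float:
    """Negative log-likelihood that the preferred sample wins.

    r_win / r_lose are scalar rewards predicted on noisy states;
    t_win / t_lose are the noise levels those states were scored at.
    """
    variance = noise_sigma(t_win) ** 2 + noise_sigma(t_lose) ** 2
    p_win = normal_cdf((r_win - r_lose) / math.sqrt(variance))
    # Clamp to avoid log(0) when the margin is extremely negative.
    return -math.log(max(p_win, 1e-12))


if __name__ == "__main__":
    # Equal rewards give p = 0.5, so the loss equals log(2).
    print(thurstone_nll(0.0, 0.0, 0.5, 0.5))
    # Same reward margin, scored at low noise and at high noise.
    print(thurstone_nll(1.0, 0.0, 0.1, 0.1))
    print(thurstone_nll(1.0, 0.0, 0.9, 0.9))
```

In this sketch, a fixed positive reward margin yields a probability closer to 0.5 at higher noise levels and therefore a larger loss, so comparisons made on very noisy states are treated as less certain. That is one plausible reading of "noise-calibrated", not a confirmed detail of the method.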
Key facts
- DiNa-LRM is a diffusion-native latent reward model.
- It formulates preference learning directly on noisy diffusion states.
- It uses a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty.
- It leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head.
- It supports inference-time noise ensembling.
- It avoids the domain mismatch of pixel-space rewards from VLMs.
- It reduces computation and memory cost compared to VLM-based rewards.
- The paper is published on arXiv with ID 2602.11146.
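The inference-time noise ensembling listed above can be sketched as follows: score several independently noised copies of the same latent with a timestep-conditioned reward function and average the results. This is a minimal illustration under stated assumptions. The rectified-flow style interpolation in `add_noise`, the stand-in `toy_reward` head, and every name and default here are hypothetical and are not taken from the paper.

```python
import random
from statistics import fmean
from typing import Callable, List, Sequence

Latent = Sequence[float]
RewardFn = Callable[[Sequence[float], float], float]


def add_noise(z0: Latent, t: float, rng: random.Random) -> List[float]:
    """Assumed interpolation: z_t = (1 - t) * z0 + t * eps, with eps ~ N(0, 1)."""
    return [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in z0]


def ensembled_reward(
    reward_fn: RewardFn,
    z0: Latent,
    timesteps: Sequence[float],
    samples_per_t: int = 4,
    seed: int = 0,
) -> float:
    """Average the reward over several noise draws at each timestep."""
    rng = random.Random(seed)
    scores = [
        reward_fn(add_noise(z0, t, rng), t)
        for t in timesteps
        for _ in range(samples_per_t)
    ]
    return fmean(scores)


def toy_reward(z_t: Sequence[float], t: float) -> float:
    """Stand-in for a timestep-conditioned reward head (requires t < 1):
    the mean latent value, rescaled to undo the (1 - t) shrinkage."""
    return fmean(z_t) / (1.0 - t)


if __name__ == "__main__":
    latent = [1.0, 2.0, 3.0]
    # More draws per timestep lowers the variance of the averaged score.
    print(ensembled_reward(toy_reward, latent, [0.2, 0.4], samples_per_t=2))
    print(ensembled_reward(toy_reward, latent, [0.2, 0.4], samples_per_t=64))
```

Increasing `samples_per_t` or the number of timesteps spends more compute at inference in exchange for a lower-variance score, which is the sense in which ensembling can serve as test-time scaling in this sketch.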
Entities
Institutions
- arXiv