ARTFEED — Contemporary Art Intelligence

PRISM: New Method Aligns Multimodal Models Before Reinforcement Learning

ai-technology · 2026-05-01

Researchers have developed PRISM, a three-stage pipeline that improves post-training of large multimodal models (LMMs) by tackling the distributional drift introduced by supervised fine-tuning (SFT). The standard recipe of SFT followed by reinforcement learning with verifiable rewards (RLVR) suffers from this drift, which can erode a model's original capabilities and misalign supervision; the problem is especially acute in multimodal reasoning, where perception errors and reasoning errors compound. PRISM inserts a dedicated distribution-alignment stage between SFT and RLVR, casting on-policy distillation (OPD) as a black-box adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator whose experts specialize in perception and reasoning, yielding distinct corrective signals that reduce drift. The findings are available as arXiv preprint 2604.28123.
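The pipeline described above can be sketched as a simple stage composition. This is a minimal illustrative sketch, not the authors' implementation: the function names, and the use of a list of tags to stand in for model state, are assumptions made here purely to show where the alignment stage sits.

```python
# Sketch of the PRISM three-stage recipe from the article:
# SFT, then a distribution-alignment stage (on-policy distillation
# framed as an adversarial game), then RLVR. All names are
# illustrative assumptions, not the paper's code.

def sft(model):
    """Stage 1: supervised fine-tuning on labeled demonstrations."""
    return model + ["sft"]

def align_opd(model):
    """Stage 2: on-policy distillation against a discriminator,
    realigning the SFT policy's distribution before RL begins."""
    return model + ["opd-align"]

def rlvr(model):
    """Stage 3: reinforcement learning with verifiable rewards."""
    return model + ["rlvr"]

def prism(model):
    # The key point of PRISM: alignment sits *between* SFT and RLVR,
    # rather than jumping straight from SFT to RL.
    for stage in (sft, align_opd, rlvr):
        model = stage(model)
    return model

print(prism(["base"]))  # ['base', 'sft', 'opd-align', 'rlvr']
```

The ordering is the whole idea: by the time RLVR starts, the policy's output distribution has been pulled back toward the pre-drift distribution.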

Key facts

  • PRISM is a three-stage pipeline for post-training large multimodal models.
  • It addresses distributional drift from supervised fine-tuning (SFT).
  • Standard recipe: SFT then reinforcement learning with verifiable rewards (RLVR).
  • Drift is amplified in multimodal reasoning due to perception and reasoning errors.
  • PRISM inserts an alignment stage between SFT and RLVR.
  • Uses on-policy distillation (OPD) as a black-box adversarial game.
  • Employs a Mixture-of-Experts (MoE) discriminator with perception and reasoning experts.
  • Published on arXiv with ID 2604.28123.
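The MoE discriminator mentioned in the facts above can be sketched as two expert scorers combined by a soft gate, with each expert's score serving as a separate corrective signal. Everything here is a toy assumption for illustration (hand-picked linear weights, two-dimensional features, a scalar gate); the paper's actual discriminator is a learned, LMM-scale model.

```python
import math

def sigmoid(x):
    """Squash a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def expert_score(features, weights):
    """A linear expert followed by a sigmoid 'on-distribution' score."""
    return sigmoid(sum(f * w for f, w in zip(features, weights)))

def moe_discriminator(percep_feats, reason_feats, gate):
    """Combine a perception expert and a reasoning expert via a gate.

    Returns the mixed score plus the per-expert scores, which play the
    role of the distinct corrective signals described in the article.
    Weights are invented for this sketch.
    """
    p = expert_score(percep_feats, [0.9, -0.4])  # perception expert
    r = expert_score(reason_feats, [0.7, 0.2])   # reasoning expert
    g = sigmoid(gate)                            # soft gate in (0, 1)
    mixed = g * p + (1.0 - g) * r
    return mixed, {"perception": p, "reasoning": r}

score, signals = moe_discriminator([1.0, 0.5], [0.3, -0.2], gate=0.0)
```

Separating the experts is what lets the alignment stage tell the policy *which* failure mode (seeing wrong vs. reasoning wrong) is pushing it off-distribution, instead of emitting one blended score.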

Entities

Institutions

  • arXiv
