ARTFEED — Contemporary Art Intelligence

Training-Inference Mismatch Causes LLM RL Collapse

ai-technology · 2026-05-16

A new study from arXiv (submitted May 2025) identifies Training-Inference Mismatch (TIM) as a critical but overlooked failure mode in LLM reinforcement learning. TIM arises when the rollout-generation and policy-optimization stages, owing to implementation differences, produce different token probabilities for the same sequence under identical model weights. The researchers isolated TIM using a zero-mismatch diagnostic setting called VeXact, demonstrating that even small token-level numerical disagreements can independently trigger training collapse. They further show that TIM alters the effective optimization problem and propose a set of potential remedies. The findings reframe TIM as a first-order, systems-level perturbation rather than benign numerical noise, with implications for the stability of modern LLM RL systems.
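
To make the failure mode concrete, here is a minimal, hedged sketch of how a mismatch of this kind can be measured. It is not the paper's VeXact setup: it uses a toy LM head in PyTorch and treats precision (bfloat16 vs. float32) as a stand-in for the kernel, batching, and numerics differences that separate a rollout engine from a training engine; all names are illustrative.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    vocab, hidden, seq = 32_000, 512, 16

    # One shared set of weights: both "engines" below use exactly these.
    head = torch.randn(hidden, vocab)
    states = torch.randn(seq, hidden)
    tokens = torch.randint(0, vocab, (seq,))

    def token_logprobs(dtype):
        # Same math, different numerics: a stand-in for the implementation
        # differences between rollout generation and policy optimization.
        logits = (states.to(dtype) @ head.to(dtype)).float()
        return F.log_softmax(logits, dim=-1)[torch.arange(seq), tokens]

    lp_rollout = token_logprobs(torch.bfloat16)  # "inference" code path
    lp_train = token_logprobs(torch.float32)     # "training" code path

    gap = (lp_rollout - lp_train).abs()
    print(f"max per-token |delta log p|: {gap.max().item():.3e}")
    print(f"sequence-level log-prob gap: {(lp_rollout - lp_train).sum().item():.3e}")

Because per-token gaps compound over the length of a sequence, even a small per-token disagreement can produce a large sequence-level probability ratio, which is how token-level numerical noise can grow into a training-level perturbation.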

Key facts

  • TIM stands for Training-Inference Mismatch in LLM reinforcement learning.
  • Rollout generation and policy optimization stages are expected to produce matching token probabilities.
  • Implementation differences cause TIM, producing different token probabilities for the same sequence under the same weights.
  • TIM is difficult to inspect because it is entangled with off-policy drift and stabilization mechanisms.
  • The study uses VeXact, a zero-mismatch diagnostic setting, to isolate TIM.
  • Small token-level numerical disagreements can independently cause training collapse.
  • TIM changes the effective optimization problem: samples are drawn from the rollout-engine policy, but gradients are taken with respect to the training-engine policy.
  • The paper identifies a set of remedies to mitigate TIM; a generic sketch of one such correction appears after this list.
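
On the remedies point above: the paper's specific fixes are not detailed here, so the following is a hedged sketch of one generic correction in this family, truncated importance sampling, which reweights the policy gradient by the clipped ratio between training-engine and rollout-engine token probabilities. The function name and signature are illustrative assumptions, not the paper's API.

    import torch

    def truncated_is_pg_loss(logp_train, logp_rollout, advantages, clip=2.0):
        # Importance ratio between the policy as computed by the training
        # engine and the policy that actually generated the tokens at rollout
        # time; it is exactly 1.0 whenever the two code paths agree.
        ratio = torch.exp(logp_train.detach() - logp_rollout)
        weight = torch.clamp(ratio, max=clip)  # truncate to bound variance
        # REINFORCE-style surrogate: gradients flow through logp_train only.
        return -(weight * advantages * logp_train).mean()

With zero mismatch the weight collapses to 1 and this reduces to the standard on-policy gradient; clipping trades a small bias for bounded variance when the two engines disagree.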

Entities

Institutions

  • arXiv
