vLLM V1 Matches V0 After Fixing Logprobs, Defaults, and Precision
ServiceNow AI engineers fixed four backend issues to make vLLM V1 match V0 in online RL training: processed rollout logprobs, V1-specific runtime defaults, the inflight weight-update path, and an fp32 lm_head. The migration targeted backend parity before any changes to the RL objective. The reference run used vLLM 0.8.5; the V1 runs used vLLM 0.18.1. Initial V1 attempts diverged from the reference in clip rate, KL, entropy, and reward. The fixes included setting logprobs-mode=processed_logprobs, disabling prefix caching, aligning the inflight weight-update path (mode='keep' with clear_cache=False), and enabling an fp32 lm_head. After the fixes, the final V1 run tracked the V0 reference across all metrics. The team emphasized fixing backend correctness before adding objective-side corrections such as truncated importance sampling.
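As a rough illustration, two of the rollout-side fixes map to vLLM engine arguments. The snippet below is a minimal sketch, assuming vLLM's Python `LLM` entry point and the `logprobs_mode` and `enable_prefix_caching` engine arguments; exact names and defaults may differ across vLLM versions, the model name is a placeholder, and the inflight weight-update settings (mode='keep', clear_cache=False) live in the RL framework's update path rather than in this call.

```python
# Sketch only: rollout engine configured to report processed logprobs and to
# avoid prefix caching (argument names assumed; check your vLLM version).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder model name
    logprobs_mode="processed_logprobs",          # logprobs after sampling-time processing
    enable_prefix_caching=False,                 # remove the V1-only degree of freedom
)

params = SamplingParams(temperature=1.0, max_tokens=512, logprobs=1)  # return per-token logprobs
outputs = llm.generate(["prompt goes here"], params)
```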
Key facts
- vLLM V1 is a substantial rewrite of the V0 engine.
- Four fixes were needed: processed rollout logprobs, V1-specific runtime defaults, inflight weight-update path, and fp32 lm_head.
- The reference run used vLLM 0.8.5; V1 runs used vLLM 0.18.1.
- Initial V1 run showed divergence in clip rate, KL, entropy, and reward.
- Setting logprobs-mode=processed_logprobs fixed the logprob semantics, so rollout logprobs reflect the distribution actually sampled from (after temperature and other sampling-time processing) rather than the raw logits.
- Disabling prefix caching removed a V1-only degree of freedom.
- Inflight weight update used mode='keep' and clear_cache=False.
- fp32 lm_head was needed to match trainer-side logit computation.
- The team fixed backend correctness before adding objective-side corrections.
- The same class of mismatch can surface in PPO, GRPO, or any online RL system (see the sketch after this list).
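To make the failure mode concrete, here is a small, hypothetical sketch (not code from the post) of how a PPO/GRPO-style update consumes rollout logprobs: the trainer recomputes logprobs for the sampled tokens, forms the importance ratio against the rollout logprobs, and clips it. If the rollout engine reports logprobs that do not correspond to the distribution it actually sampled from, or computes logits in a different precision than the trainer, the ratio drifts from 1 and the clip rate rises even with identical weights. Function and variable names below are placeholders, and the fp32 matmul is one illustrative way to match logit precision.

```python
import torch
import torch.nn.functional as F

def trainer_token_logprobs(hidden, lm_head_weight, token_ids):
    """Recompute per-token logprobs on the trainer side with an fp32 lm_head.

    hidden:         [T, d] hidden states at the sampled positions (any dtype)
    lm_head_weight: [V, d] lm_head weight (e.g. bf16); upcast before the matmul
    token_ids:      [T] sampled token ids
    """
    logits = hidden.float() @ lm_head_weight.float().t()   # fp32 logits
    logprobs = F.log_softmax(logits, dim=-1)
    return logprobs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)

def ppo_ratio_stats(trainer_logprobs, rollout_logprobs, clip_eps=0.2):
    """Importance ratio and clip rate; a rollout/trainer logprob mismatch
    shows up directly as ratios away from 1 and an elevated clip rate."""
    ratio = torch.exp(trainer_logprobs - rollout_logprobs)
    clip_rate = ((ratio < 1 - clip_eps) | (ratio > 1 + clip_eps)).float().mean()
    return ratio, clip_rate
```

Under this sketch, a backend that returns raw (pre-temperature) logprobs or lower-precision logits inflates `clip_rate` on the very first step, before any policy change, which is the signature the diverging V1 runs showed.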
Entities
Institutions
- ServiceNow AI
- vLLM