vLLM V1 Matches V0 After Fixing Logprobs, Defaults, and Precision
ServiceNow AI engineers fixed four backend issues to make vLLM V1 match V0 in online RL training: processed rollout logprobs, V1-specific runtime defaults, the inflight weight-update path, and an fp32 lm_head. The migration targeted backend parity before any changes to the RL objective. The reference run used vLLM 0.8.5; the V1 runs used vLLM 0.18.1. Initial V1 attempts diverged from the reference in clip rate, KL, entropy, and reward. The fixes included setting logprobs-mode=processed_logprobs, disabling prefix caching, aligning the inflight weight-update path (mode='keep' with clear_cache=False), and enabling an fp32 lm_head. After the fixes, the final V1 run tracked the V0 reference across all metrics. The team emphasized fixing backend correctness before adding objective-side corrections such as truncated importance sampling.
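As a rough illustration, two of the rollout-side fixes map to vLLM engine arguments. The snippet below is a minimal sketch, assuming vLLM's Python `LLM` entry point and the `logprobs_mode` and `enable_prefix_caching` engine arguments; exact names and defaults may differ across vLLM versions, the model name is a placeholder, and the inflight weight-update settings (mode='keep', clear_cache=False) live in the RL framework's update path rather than in this call.

```python
# Sketch only: rollout engine configured to report processed logprobs and to
# avoid prefix caching (argument names assumed; check your vLLM version).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder model name
    logprobs_mode="processed_logprobs",          # logprobs after sampling-time processing
    enable_prefix_caching=False,                 # remove the V1-only degree of freedom
)

params = SamplingParams(temperature=1.0, max_tokens=512, logprobs=1)  # return per-token logprobs
outputs = llm.generate(["prompt goes here"], params)
```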
Key facts
- vLLM V1 is a substantial rewrite of the V0 engine.
- Four fixes were needed: processed rollout logprobs, V1-specific runtime defaults, inflight weight-update path, and fp32 lm_head.
- The reference run used vLLM 0.8.5; V1 runs used vLLM 0.18.1.
- Initial V1 run showed divergence in clip rate, KL, entropy, and reward.
- Setting logprobs-mode=processed_logprobs fixed the logprob semantics, so rollout logprobs reflect the distribution actually sampled from (after temperature and other sampling-time processing) rather than the raw logits.
- Disabling prefix caching removed a V1-only degree of freedom.
- Inflight weight update used mode='keep' and clear_cache=False.
- fp32 lm_head was needed to match trainer-side logit computation.
- The team fixed backend correctness before adding objective-side corrections.
- The same class of mismatch can surface in PPO, GRPO, or any online RL system (see the sketch after this list).
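To make the failure mode concrete, here is a small, hypothetical sketch (not code from the post) of how a PPO/GRPO-style update consumes rollout logprobs: the trainer recomputes logprobs for the sampled tokens, forms the importance ratio against the rollout logprobs, and clips it. If the rollout engine reports logprobs that do not correspond to the distribution it actually sampled from, or computes logits in a different precision than the trainer, the ratio drifts from 1 and the clip rate rises even with identical weights. Function and variable names below are placeholders, and the fp32 matmul is one illustrative way to match logit precision.

```python
import torch
import torch.nn.functional as F

def trainer_token_logprobs(hidden, lm_head_weight, token_ids):
    """Recompute per-token logprobs on the trainer side with an fp32 lm_head.

    hidden:         [T, d] hidden states at the sampled positions (any dtype)
    lm_head_weight: [V, d] lm_head weight (e.g. bf16); upcast before the matmul
    token_ids:      [T] sampled token ids
    """
    logits = hidden.float() @ lm_head_weight.float().t()   # fp32 logits
    logprobs = F.log_softmax(logits, dim=-1)
    return logprobs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)

def ppo_ratio_stats(trainer_logprobs, rollout_logprobs, clip_eps=0.2):
    """Importance ratio and clip rate; a rollout/trainer logprob mismatch
    shows up directly as ratios away from 1 and an elevated clip rate."""
    ratio = torch.exp(trainer_logprobs - rollout_logprobs)
    clip_rate = ((ratio < 1 - clip_eps) | (ratio > 1 + clip_eps)).float().mean()
    return ratio, clip_rate
```

Under this sketch, a backend that returns raw (pre-temperature) logprobs or lower-precision logits inflates `clip_rate` on the very first step, before any policy change, which is the signature the diverging V1 runs showed.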
Entities
Institutions
- ServiceNow AI
- vLLM