ARTFEED — Contemporary Art Intelligence

vLLM V1 Matches V0 After Fixing Logprobs, Defaults, and Precision

other · 2026-05-06

ServiceNow AI engineers made vLLM V1 match V0 in online RL training by fixing four backend issues: rollout logprob semantics, V1-specific runtime defaults, the inflight weight-update path, and an fp32 lm_head. The migration targeted backend parity before any changes to the RL objective. The V0 reference run used vLLM 0.8.5; the V1 runs used vLLM 0.18.1. Initial V1 attempts diverged from the reference in clip rate, KL, entropy, and reward. The fixes were setting logprobs-mode=processed_logprobs, disabling prefix caching, matching the V0 inflight-update behavior with clear_cache=False, and enabling an fp32 lm_head to match trainer-side logit computation. After these fixes, the final V1 run tracked the V0 reference across all metrics. The team emphasized establishing backend correctness before adding objective-side corrections such as truncated importance sampling.
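The runtime settings named above can be sketched as a rollout-engine configuration. This is a minimal sketch, not the team's actual setup: it assumes a vLLM build where `logprobs_mode` and `enable_prefix_caching` are engine arguments (the exact flag names and the model used are assumptions, not from the article).

```python
# Hypothetical rollout-engine configuration reflecting the fixes described.
# Assumes a vLLM version where `logprobs_mode` and `enable_prefix_caching`
# are LLM constructor arguments; exact names may differ across releases.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",    # placeholder model, not from the article
    logprobs_mode="processed_logprobs",  # report post-processing logprobs, matching trainer semantics
    enable_prefix_caching=False,         # remove a V1-only degree of freedom
    dtype="bfloat16",
)

# logprobs=0 requests only the logprob of each sampled token, which is what
# the trainer needs to form importance ratios.
params = SamplingParams(temperature=1.0, logprobs=0)
```

The key design point is that the rollout engine and the trainer must agree on what a "logprob" means: with processed logprobs, the reported values reflect the same distribution the sampler actually drew from.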

Key facts

  • vLLM V1 is a substantial rewrite of the V0 engine.
  • Four fixes were needed: processed rollout logprobs, V1-specific runtime defaults, inflight weight-update path, and fp32 lm_head.
  • The reference run used vLLM 0.8.5; V1 runs used vLLM 0.18.1.
  • Initial V1 run showed divergence in clip rate, KL, entropy, and reward.
  • Setting logprobs-mode=processed_logprobs fixed the logprob-semantics mismatch between rollout and trainer.
  • Disabling prefix caching removed a V1-only degree of freedom.
  • Inflight weight update used mode='keep' and clear_cache=False.
  • fp32 lm_head was needed to match trainer-side logit computation.
  • The team fixed backend correctness before adding objective-side corrections.
  • The same class of mismatch can surface in PPO, GRPO, or any online RL system.
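The divergence metrics above have a simple mechanical origin: if the rollout engine reports logprobs with different semantics or precision than the trainer recomputes, the per-token importance ratio drifts from 1 and the PPO-style clip rate rises even when the weights are identical. A minimal sketch in plain Python, with made-up numbers (not from the article):

```python
import math

def clip_rate(trainer_logps, rollout_logps, eps=0.2):
    """Fraction of tokens whose importance ratio exp(lp_trainer - lp_rollout)
    falls outside the PPO clipping band [1 - eps, 1 + eps]."""
    clipped = 0
    for lt, lr in zip(trainer_logps, rollout_logps):
        ratio = math.exp(lt - lr)
        if ratio < 1 - eps or ratio > 1 + eps:
            clipped += 1
    return clipped / len(trainer_logps)

trainer = [-1.20, -0.50, -2.30, -0.10]

# Matched semantics: rollout returns the same processed logprobs the
# trainer computes, so every ratio is exactly 1 and nothing is clipped.
assert clip_rate(trainer, trainer) == 0.0

# Mismatched semantics: rollout reports raw (pre-processing) logprobs
# offset from the trainer's, so every ratio leaves the clipping band.
raw = [lp + 0.5 for lp in trainer]
assert clip_rate(trainer, raw) == 1.0
```

This is why the article's ordering matters: objective-side patches like truncated importance sampling can mask such a mismatch, but only backend parity removes it.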

Entities

Institutions

  • ServiceNow AI
  • vLLM

Sources