AIS Framework Corrects Low-Precision Rollout Bias in RL for LLMs
Researchers propose Adaptive Importance Sampling (AIS) to address the rollout-training mismatch in reinforcement learning for large language models. Low-precision rollouts (e.g., FP8) paired with BF16 trainers improve throughput but introduce non-stationary bias that can destabilize training on reasoning benchmarks. AIS adjusts correction strength per batch using three real-time diagnostics: weight reliability, divergence, and gradient variance. The framework aims to preserve the early exploration benefit of low-precision rollouts while mitigating later bias. The work is published on arXiv under ID 2605.13907.
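The per-batch adjustment described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the summary does not give the formulas, so the function name `ais_weights`, the choice of estimators (normalized effective sample size for weight reliability, a non-negative sample estimate of KL for divergence, a squared coefficient of variation for gradient variance), the blending rule for the strength `alpha`, and the parameters `kl_scale` and `clip` are all assumptions made for illustration. The sketch tempers the importance ratio between the trainer and rollout log-probabilities, so that `alpha = 0` applies no correction and `alpha = 1` applies the full ratio.

```python
import math

def ais_weights(logp_train, logp_rollout, grad_norms, kl_scale=0.01, clip=2.0):
    """Illustrative per-batch adaptive importance weights (all rules hypothetical).

    logp_train / logp_rollout: log-probabilities of the sampled sequences under
    the BF16 trainer and the low-precision rollout engine, respectively.
    grad_norms: per-sample gradient-norm proxies (assumed to be available).
    """
    n = len(logp_train)
    log_ratio = [t - r for t, r in zip(logp_train, logp_rollout)]
    raw = [math.exp(x) for x in log_ratio]  # plain importance ratios

    # Diagnostic 1 (assumed form), weight reliability:
    # normalized effective sample size, in (0, 1].
    ess = sum(raw) ** 2 / (n * sum(w * w for w in raw))

    # Diagnostic 2 (assumed form), divergence: non-negative sample estimate of
    # KL(rollout || trainer), using (r - 1) - log r on rollout samples.
    kl = sum(w - 1.0 - x for w, x in zip(raw, log_ratio)) / n

    # Diagnostic 3 (assumed form), gradient variance: squared coefficient of
    # variation of the importance-weighted gradient-norm proxies.
    weighted = [w * g for w, g in zip(raw, grad_norms)]
    mean = sum(weighted) / n
    var = sum((x - mean) ** 2 for x in weighted) / n
    cv2 = var / (mean * mean + 1e-12)

    # Hypothetical blending rule: correct more when the divergence is large,
    # less when the weights are unreliable or the weighted gradient is noisy.
    alpha = (kl / (kl + kl_scale)) * ess / (1.0 + cv2)

    # Tempered, clipped weights: alpha = 0 gives all ones (no correction).
    weights = [min(w ** alpha, clip) for w in raw]
    return {"alpha": alpha, "ess": ess, "kl": kl, "cv2": cv2, "weights": weights}
```

Under this assumed rule, a batch in which the two engines agree exactly receives `alpha = 0` and unit weights, leaving the low-precision samples uncorrected, while a batch with a larger mismatch receives a stronger correction unless the weights themselves are too unreliable or noisy to trust.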
Key facts
- arXiv:2605.13907v1
- Low-precision rollouts (e.g., FP8) paired with a BF16 trainer cause a rollout-training mismatch
- The mismatch is non-stationary and acts as a double-edged sword:
  - Early in training, it provides a stochastic exploration bonus
  - Later in training, it becomes a destabilizing bias as the policy concentrates
- AIS adjusts correction strength per batch using three real-time diagnostics: weight reliability, divergence, and gradient variance
- AIS aims to correct bias while preserving exploration benefits
- Published on arXiv
Entities
Institutions
- arXiv