Cumulative Token IS Ratio Proposed for LLM Policy Optimization
A recent arXiv paper proposes the cumulative token importance sampling (IS) ratio, a new approach to the bias-variance dilemma in off-policy policy-gradient estimation for large language models (LLMs). Current methods such as PPO and GRPO use token-level IS ratios, which introduce bias because they ignore shifts in the prefix state distribution between the behavior and target policies. Full sequence-level ratios correct this exactly at the trajectory level but suffer high variance. The cumulative token IS ratio is designed to balance these two extremes, improving numerical stability in reinforcement learning with verifiable rewards (RLVR).
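To make the trade-off concrete, the three correction schemes can be written in standard IS notation. This formulation is a generic reading of the summary above, not quoted from the paper:

```latex
% r_t: per-token ratio at position t (PPO/GRPO-style), for prompt x, response y,
% target policy pi_theta and behavior policy pi_mu.
\[
r_t = \frac{\pi_\theta(y_t \mid x,\, y_{<t})}{\pi_\mu(y_t \mid x,\, y_{<t})},
\qquad
R = \prod_{t=1}^{T} r_t \ \text{(full sequence ratio: exact, high variance)},
\qquad
\rho_t = \prod_{k=1}^{t} r_k \ \text{(cumulative token ratio up to position } t\text{)}.
\]
```

Note that $\rho_T = R$, so the cumulative ratio recovers the exact sequence-level correction at the final token while earlier positions use shorter, lower-variance products.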
Key facts
- Paper arXiv:2605.07331 proposes cumulative token IS ratio for LLM policy optimization.
- Existing methods face bias-variance dilemma in off-policy policy-gradient estimation.
- PPO and GRPO use token-level IS ratios that introduce bias by neglecting shifts in the prefix state distribution.
- Full sequence ratios provide exact correction but suffer high variance.
- GSPO uses length normalization but deviates from exact IS correction.
- Cumulative token IS ratio is the product of per-token ratios up to position t (see the sketch after this list).
- Work applies to reinforcement learning with verifiable rewards (RLVR).
- The paper cites Schulman et al. (2017) for PPO, Shao et al. (2024) for GRPO, and Zheng et al. (2025) for GSPO.
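As a rough illustration of the cumulative ratio described in the list above, here is a minimal PyTorch sketch. The function name, the log-space computation, and the clipping bound are illustrative assumptions for numerical stability, not details taken from the paper:

```python
import torch

def cumulative_is_ratios(logp_new: torch.Tensor,
                         logp_old: torch.Tensor,
                         clip: float = 10.0) -> torch.Tensor:
    """Cumulative token importance-sampling ratios.

    logp_new, logp_old: per-token log-probabilities of the sampled response
    under the current (target) and behavior policies, shape (T,).
    Returns rho_t = prod_{k<=t} pi_new(y_k|.) / pi_old(y_k|.) for each t,
    computed in log space to avoid overflow/underflow.
    The clip bound is a hypothetical stabilizer, not from the paper.
    """
    log_r = logp_new - logp_old           # per-token log ratios log r_t
    log_rho = torch.cumsum(log_r, dim=0)  # cumulative product in log space
    log_rho = log_rho.clamp(-clip, clip)  # hedge against ratio blow-up
    return log_rho.exp()
```

In an off-policy gradient estimator, one would then weight each token's advantage by rho_t; at t = T the weight equals the full sequence ratio, while earlier tokens receive shorter products with lower variance.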
Entities
Institutions
- arXiv