NFPO: Multi-Step Likelihood-Ratio Correction for RLVR
A new reinforcement learning algorithm, N-Step Forward-Trace Policy Optimization (NFPO), improves the reasoning ability of large language models by correcting the structural bias in PPO surrogate objectives. The method introduces an N-step forward trace that augments the PPO objective using cumulative likelihood ratios of subsequent tokens. NFPO integrates this trace into a masked policy gradient framework, providing a continuous bridge between the PPO surrogate and the exact policy gradient. The work is published on arXiv under identifier 2605.20865.
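The summary does not give the formula for the forward trace, so the following is only a sketch under an assumed definition: for each token position t, the trace is the product of the per-token likelihood ratios (new policy over old policy) of the next N tokens, truncated at the end of the sequence. The function name `forward_trace` and its parameters are hypothetical and do not come from the paper. Whether the paper's trace also includes the ratio at position t, or applies masking or clipping, is not stated here.

```python
import math


def forward_trace(logp_new, logp_old, n_steps):
    """Assumed N-step forward trace (illustrative, not the paper's definition).

    trace[t] = product of pi_new / pi_old over tokens t+1 .. t+n_steps,
    truncated at the end of the sequence.
    """
    if len(logp_new) != len(logp_old):
        raise ValueError("log-probability sequences must have equal length")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")

    length = len(logp_new)

    # prefix[i] holds the sum of the first i per-token log-ratios.
    prefix = [0.0]
    for new, old in zip(logp_new, logp_old):
        prefix.append(prefix[-1] + (new - old))

    traces = []
    for t in range(length):
        end = min(length, t + 1 + n_steps)  # exclusive end of the window
        # Summing log-ratios and exponentiating once is more stable
        # than multiplying raw ratios.
        traces.append(math.exp(prefix[end] - prefix[t + 1]))
    return traces


# Per-token ratios in this toy sequence are 1.0, 2.0 and 0.5.
logp_new = [math.log(0.5), math.log(0.4), math.log(0.2)]
logp_old = [math.log(0.5), math.log(0.2), math.log(0.4)]
trace_n0 = forward_trace(logp_new, logp_old, 0)
trace_n1 = forward_trace(logp_new, logp_old, 1)
```

Under this assumed definition, N = 0 gives a trace of 1 at every position, so nothing beyond the current token's own ratio is used. Larger N folds in more of the subsequent tokens' ratios.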
Key facts
- RLVR (reinforcement learning with verifiable rewards) improves reasoning in large language models.
- PPO surrogate objectives are local approximations of the true objective.
- This local approximation introduces structural bias.
- Trust region mechanisms control that bias.
- NFPO uses an N-step forward trace.
- Forward trace uses cumulative likelihood ratios.
- NFPO integrates the forward trace into a masked policy gradient framework.
- NFPO provides a continuous bridge between the PPO surrogate and the exact policy gradient.
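For context on the trust region point in the list above, here is a minimal sketch of the standard PPO clipped surrogate term for a single token. Clipping is one widely used trust region mechanism, but the summary does not say which mechanism NFPO relies on, so this shows the baseline PPO objective and not NFPO itself. The function name `ppo_clipped_term` and the default `eps=0.2` are illustrative choices.

```python
def ppo_clipped_term(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one token.

    ratio     -- pi_new(token) / pi_old(token)
    advantage -- advantage estimate for that token
    eps       -- clipping radius (0.2 is a common default, assumed here)
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum keeps the pessimistic value, so the objective
    # offers no extra reward for moving the ratio outside [1 - eps, 1 + eps].
    return min(ratio * advantage, clipped_ratio * advantage)


inside_region = ppo_clipped_term(1.0, 2.0)    # ratio within the clip range
clipped_above = ppo_clipped_term(1.5, 1.0)    # positive advantage, ratio too high
clipped_below = ppo_clipped_term(0.5, -1.0)   # negative advantage, ratio too low
```

The surrogate only looks at the current token's ratio, which is the sense in which it is a local approximation.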