NFPO: Multi-Step Likelihood-Ratio Correction for RLVR

other · 2026-05-22

N-Step Forward-Trace Policy Optimization (NFPO) is a new reinforcement learning algorithm that aims to improve the reasoning ability of large language models by correcting the structural bias in PPO surrogate objectives. The method introduces an N-step forward trace, which augments the PPO objective with the cumulative likelihood ratios of subsequent tokens. NFPO integrates this trace into a masked policy gradient framework, giving a continuous bridge between the PPO surrogate and the exact policy gradient. The work is published on arXiv under identifier 2605.20865.
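To make "cumulative likelihood ratios of subsequent tokens" concrete, here is a minimal sketch in Python. It is an illustration only, not the paper's definition: the function name forward_trace_ratios, its parameters (logp_new, logp_old, n_steps), the truncation at the end of the sequence, and the choice to return a plain product are all assumptions made for this example. The summary does not say how NFPO weights or masks these terms.

```python
import math


def forward_trace_ratios(logp_new, logp_old, n_steps):
    """Hypothetical helper: for each position t, return the product of the
    per-token likelihood ratios pi_new / pi_old over the next n_steps tokens
    (positions t+1 .. t+n_steps), truncated at the end of the sequence.

    Assumption: this is only one plausible reading of a "cumulative
    likelihood ratio of subsequent tokens"; it is not taken from the paper.
    """
    if len(logp_new) != len(logp_old):
        raise ValueError("log-probability sequences must have equal length")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")

    # Per-token log ratios: log pi_new(token) - log pi_old(token).
    log_ratio = [new - old for new, old in zip(logp_new, logp_old)]
    length = len(log_ratio)

    # Prefix sums, so any window of log ratios is one subtraction.
    prefix = [0.0]
    for value in log_ratio:
        prefix.append(prefix[-1] + value)

    traces = []
    for t in range(length):
        end = min(length, t + 1 + n_steps)  # exclusive end of the window
        # Sum of log ratios over positions t+1 .. end-1, then exponentiate.
        traces.append(math.exp(prefix[end] - prefix[t + 1]))
    return traces


# Toy example with three tokens whose per-token ratios are 2, 2 and 0.5.
logp_new = [math.log(0.5), math.log(0.4), math.log(0.2)]
logp_old = [math.log(0.25), math.log(0.2), math.log(0.4)]
one_step = forward_trace_ratios(logp_new, logp_old, n_steps=1)
two_step = forward_trace_ratios(logp_new, logp_old, n_steps=2)
```

In this sketch, n_steps=0 yields 1.0 at every position, so only the usual per-token PPO ratio would remain, while larger n_steps folds in the ratios of more of the following tokens. That is one way to read the bridge between the PPO surrogate and the exact policy gradient described in the summary, under the assumptions stated above.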

Key facts

RLVR (reinforcement learning with verifiable rewards) improves reasoning in large language models.
PPO surrogate objectives are local approximations.
This local approximation introduces a structural bias.
Trust-region mechanisms are used to control that bias.
NFPO uses an N-step forward trace.
The forward trace uses the cumulative likelihood ratios of subsequent tokens.
NFPO integrates the trace into a masked policy gradient framework.
NFPO provides a continuous bridge between the PPO surrogate and the exact policy gradient.

Entities

—

Sources

arXiv cs.AI — 2026-05-21