ARTFEED — Contemporary Art Intelligence

NFPO: Multi-Step Likelihood-Ratio Correction for RLVR

other · 2026-05-22

A new reinforcement learning algorithm, N-Step Forward-Trace Policy Optimization (NFPO), improves the reasoning ability of large language models trained with reinforcement learning with verifiable rewards (RLVR) by correcting a structural bias in the Proximal Policy Optimization (PPO) surrogate objective. That surrogate is only a local approximation, which is the source of the bias that trust region mechanisms are normally used to control. NFPO introduces an N-step forward trace that augments the PPO objective with the cumulative likelihood ratios of subsequent tokens. It then integrates this trace into a masked policy gradient framework, providing a continuous bridge between the PPO surrogate and the exact policy gradient. The work is published on arXiv under identifier 2605.20865.
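
The forward-trace idea can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the plain product form of the trace, and the omission of clipping and masking are all assumptions made for illustration. It shows only how a per-token likelihood ratio might be weighted by the cumulative ratios of the next N tokens, so that N = 0 reduces to an unclipped PPO-style ratio surrogate.

```python
import math


def forward_trace_weights(logp_new, logp_old, n_steps):
    """Hypothetical helper: for each token t, the product of likelihood
    ratios of the next n_steps tokens (truncated at the sequence end)."""
    # Per-token log likelihood ratio: log(pi_new / pi_old).
    log_ratio = [a - b for a, b in zip(logp_new, logp_old)]
    length = len(log_ratio)
    weights = []
    for t in range(length):
        end = min(length, t + 1 + n_steps)
        # Summing log-ratios and exponentiating gives the cumulative product.
        weights.append(math.exp(sum(log_ratio[t + 1:end])))
    return weights


def traced_surrogate(logp_new, logp_old, advantages, n_steps):
    """Assumed form of a trace-augmented surrogate (no clipping, no mask):
    mean over tokens of ratio_t * trace_t * advantage_t."""
    ratios = [math.exp(a - b) for a, b in zip(logp_new, logp_old)]
    trace = forward_trace_weights(logp_new, logp_old, n_steps)
    terms = [r * w * adv for r, w, adv in zip(ratios, trace, advantages)]
    return sum(terms) / len(terms)


# Toy log-probabilities for a three-token response (made-up numbers).
logp_new = [-1.0, -0.5, -2.0]
logp_old = [-1.2, -0.5, -1.5]
advantages = [1.0, 1.0, 1.0]

w0 = forward_trace_weights(logp_new, logp_old, 0)
w2 = forward_trace_weights(logp_new, logp_old, 2)
s0 = traced_surrogate(logp_new, logp_old, advantages, 0)
```

With n_steps set to 0 every trace weight is 1.0, so the sketch falls back to the per-token ratio alone. Larger values fold in more of the subsequent tokens' ratios, which is the sense in which the trace can interpolate toward a full-sequence correction.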

Key facts

  • RLVR (reinforcement learning with verifiable rewards) improves reasoning in large language models.
  • PPO surrogate objectives are local approximations.
  • This local approximation introduces a structural bias.
  • Trust region mechanisms are used to control that bias.
  • NFPO uses an N-step forward trace.
  • The forward trace uses cumulative likelihood ratios of subsequent tokens.
  • NFPO integrates the trace into a masked policy gradient framework.
  • NFPO provides a continuous bridge between the PPO surrogate and the exact policy gradient.
