ARTFEED — Contemporary Art Intelligence

SAPO: A New Reinforcement Learning Method for Multi-Modal Reasoning

other · 2026-05-06

A new reinforcement learning method, Segment-Aligned Policy Optimization (SAPO), has been introduced for Large Language Models (LLMs) on multi-modal reasoning tasks. Whereas existing methods optimize at the level of individual tokens or full sequences, SAPO treats coherent reasoning segments as the fundamental units for policy updates. It formulates reasoning as a step-wise Markov decision process over segments, with segment-level value estimation, advantage computation, and importance sampling. Experiments on representative reasoning benchmarks show that SAPO consistently outperforms existing approaches. The paper is available on arXiv under identifier 2605.01327.
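The summary does not spell out SAPO's objective, so the following is only a minimal sketch of what segment-level importance sampling with a clipped PPO-style surrogate might look like. The function names, the (start, end) segment encoding, and the choice of a clipped objective are all assumptions for illustration, not details from the paper.

```python
import math

def segment_log_ratio(new_logps, old_logps, segments):
    """Sum per-token log-prob differences within each reasoning segment.

    segments: list of (start, end) token-index pairs, one per segment.
    Returns one importance-sampling log-ratio per segment (a hypothetical
    reading of "segment-level importance sampling").
    """
    return [sum(new_logps[i] - old_logps[i] for i in range(s, e))
            for s, e in segments]

def sapo_surrogate(new_logps, old_logps, segments, advantages, clip_eps=0.2):
    """Clipped surrogate objective computed per segment rather than per token."""
    total = 0.0
    log_ratios = segment_log_ratio(new_logps, old_logps, segments)
    for log_r, adv in zip(log_ratios, advantages):
        r = math.exp(log_r)  # importance ratio for the whole segment
        clipped = max(min(r, 1.0 + clip_eps), 1.0 - clip_eps)
        total += min(r * adv, clipped * adv)  # standard clipped-surrogate form
    return total / len(segments)
```

Under this reading, the only change from token-level PPO is where the importance ratio and advantage live: one ratio and one advantage per reasoning segment, so an entire step is reinforced or discouraged as a unit.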

Key facts

  • SAPO stands for Segment-Aligned Policy Optimization
  • It is a reinforcement learning paradigm for LLMs
  • It operates at the granularity of reasoning steps rather than tokens or full sequences
  • It uses a step-wise Markov decision process abstraction
  • It includes segment-level value estimation and advantage computation
  • Experiments were conducted on representative reasoning benchmarks
  • SAPO consistently outperforms existing approaches
  • The paper is published on arXiv with ID 2605.01327

Entities

Institutions

  • arXiv
