SRPO: Token-Level Credit Assignment Improves Multimodal Reasoning
A recent arXiv paper (2605.07274) presents Structured Role-Aware Policy Optimization (SRPO), a technique that strengthens reinforcement learning from verifiable rewards (RLVR) for multimodal reasoning in large vision-language models (LVLMs). Conventional RLVR methods such as Group Relative Policy Optimization (GRPO) assign a single sequence-level reward that does not distinguish the functional roles of tokens, so it remains unclear whether a correct answer is actually grounded in relevant visual evidence. SRPO addresses this by decomposing the model's structured output into perception tokens, which extract visual evidence, and reasoning tokens, which derive the answer. It then refines the sequence-level GRPO advantage into role-aware token-level advantages, enabling finer-grained credit assignment without modifying the model architecture.
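The decomposition step can be pictured with a small parsing sketch. This is illustrative only, not the authors' code: it assumes the structured output delimits its two parts with hypothetical <perception> and <reasoning> tags, which the summary above does not specify.

```python
# A minimal sketch of splitting a structured response into its two roles.
# The <perception>/<reasoning> tag names are an assumption for illustration.
import re

def split_roles(response: str) -> dict:
    """Split a structured response into perception and reasoning segments."""
    perception = re.search(r"<perception>(.*?)</perception>", response, re.S)
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", response, re.S)
    return {
        "perception": perception.group(1).strip() if perception else "",
        "reasoning": reasoning.group(1).strip() if reasoning else "",
    }

example = (
    "<perception>The image shows three red apples on a table.</perception>"
    "<reasoning>Three apples at $2 each cost $6.</reasoning>"
)
print(split_roles(example))
# {'perception': 'The image shows three red apples on a table.',
#  'reasoning': 'Three apples at $2 each cost $6.'}
```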
Key facts
- Paper title: Structured Role-Aware Policy Optimization for Multimodal Reasoning
- arXiv identifier: 2605.07274
- Announce type: new
- RLVR with GRPO is used for improving reasoning in LVLMs
- Sequence-level rewards do not distinguish token functional roles
- SRPO decomposes responses into perception and reasoning tokens
- SRPO refines the sequence-level GRPO advantage into role-aware token-level advantages (see the sketch after this list)
- Goal: ensure correct answers are supported by visual evidence
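To make the advantage refinement concrete, the PyTorch sketch below computes the standard GRPO group-normalized advantage and broadcasts it to individual tokens with per-role weights. The weights w_perception and w_reasoning and the 0/1 role mask are illustrative assumptions; the paper's exact role-aware reweighting rule is not reproduced here.

```python
# A sketch of turning a sequence-level GRPO advantage into token-level,
# role-aware advantages. Group normalization follows GRPO; the per-role
# weights are assumed for illustration, not taken from the paper.
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantage: normalize rewards within a rollout group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def role_aware_token_advantages(seq_adv, role_mask,
                                w_perception=1.0, w_reasoning=1.0):
    """Broadcast a sequence-level advantage to tokens, scaled by role.

    role_mask: 0 for perception tokens, 1 for reasoning tokens.
    """
    weights = torch.where(role_mask == 0,
                          torch.tensor(w_perception),
                          torch.tensor(w_reasoning))
    return seq_adv * weights

# Example: a group of 4 rollouts with verifiable 0/1 rewards, then
# token-level advantages for one rollout of 6 tokens (3 per role).
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
adv = grpo_advantages(rewards)                 # per-sequence advantages
role_mask = torch.tensor([0, 0, 0, 1, 1, 1])   # first 3 tokens: perception
token_adv = role_aware_token_advantages(adv[0], role_mask,
                                        w_perception=0.5, w_reasoning=1.0)
print(token_adv)
```

Under this weighting, perception tokens receive a scaled share of the sequence's credit rather than the undifferentiated per-token signal GRPO would give, which is the kind of role-sensitive assignment the paper argues for.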