SRPO: Token-Level Credit Assignment Improves Multimodal Reasoning
A recent arXiv paper (2605.07274) presents Structured Role-Aware Policy Optimization (SRPO), a technique that strengthens reinforcement learning from verifiable rewards (RLVR) for multimodal reasoning in large vision-language models (LVLMs). Conventional RLVR methods such as Group Relative Policy Optimization (GRPO) assign a single sequence-level reward that does not distinguish the functional roles of tokens, so it remains unclear whether a correct answer is actually grounded in relevant visual evidence. SRPO addresses this by decomposing the model's structured output into perception tokens, which extract visual evidence, and reasoning tokens, which derive the answer. It then refines the sequence-level GRPO advantage into role-aware token-level advantages, enabling finer-grained credit assignment without modifying the model architecture.
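The decomposition step can be pictured with a small parsing sketch. This is illustrative only, not the authors' code: it assumes the structured output delimits its two parts with hypothetical <perception> and <reasoning> tags, which the summary above does not specify.

```python
# A minimal sketch of splitting a structured response into its two roles.
# The <perception>/<reasoning> tag names are an assumption for illustration.
import re

def split_roles(response: str) -> dict:
    """Split a structured response into perception and reasoning segments."""
    perception = re.search(r"<perception>(.*?)</perception>", response, re.S)
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", response, re.S)
    return {
        "perception": perception.group(1).strip() if perception else "",
        "reasoning": reasoning.group(1).strip() if reasoning else "",
    }

example = (
    "<perception>The image shows three red apples on a table.</perception>"
    "<reasoning>Three apples at $2 each cost $6.</reasoning>"
)
print(split_roles(example))
# {'perception': 'The image shows three red apples on a table.',
#  'reasoning': 'Three apples at $2 each cost $6.'}
```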
Key facts
- Paper title: Structured Role-Aware Policy Optimization for Multimodal Reasoning
- arXiv identifier: 2605.07274
- Announce type: new
- RLVR with GRPO is used for improving reasoning in LVLMs
- Sequence-level rewards do not distinguish token functional roles
- SRPO decomposes responses into perception and reasoning tokens
- SRPO refines the sequence-level GRPO advantage into role-aware token-level advantages (see the sketch after this list)
- Goal: ensure correct answers are supported by visual evidence
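To make the advantage refinement concrete, the PyTorch sketch below computes the standard GRPO group-normalized advantage and broadcasts it to individual tokens with per-role weights. The weights w_perception and w_reasoning and the 0/1 role mask are illustrative assumptions; the paper's exact role-aware reweighting rule is not reproduced here.

```python
# A sketch of turning a sequence-level GRPO advantage into token-level,
# role-aware advantages. Group normalization follows GRPO; the per-role
# weights are assumed for illustration, not taken from the paper.
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantage: normalize rewards within a rollout group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def role_aware_token_advantages(seq_adv, role_mask,
                                w_perception=1.0, w_reasoning=1.0):
    """Broadcast a sequence-level advantage to tokens, scaled by role.

    role_mask: 0 for perception tokens, 1 for reasoning tokens.
    """
    weights = torch.where(role_mask == 0,
                          torch.tensor(w_perception),
                          torch.tensor(w_reasoning))
    return seq_adv * weights

# Example: a group of 4 rollouts with verifiable 0/1 rewards, then
# token-level advantages for one rollout of 6 tokens (3 per role).
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
adv = grpo_advantages(rewards)                 # per-sequence advantages
role_mask = torch.tensor([0, 0, 0, 1, 1, 1])   # first 3 tokens: perception
token_adv = role_aware_token_advantages(adv[0], role_mask,
                                        w_perception=0.5, w_reasoning=1.0)
print(token_adv)
```

Under this weighting, perception tokens receive a scaled share of the sequence's credit rather than the undifferentiated per-token signal GRPO would give, which is the kind of role-sensitive assignment the paper argues for.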