Bootstrapped Mixed Rewards Improve RL Post-Training for Transformers
A new arXiv paper (2512.04277v3) proposes injecting a canonical action order into reinforcement learning (RL) post-training as one component of a mixed reward, improving Transformer performance even when the models were fine-tuned on randomized solution sequences. The method uses Group Relative Policy Optimization (GRPO) with two reward components: a sparse task reward (1 only when the puzzle is fully solved) and an ordering reward that aligns the model's emission order with a canonical solver order. The components are combined in a fixed mixture, with bootstrapped scaling used to equalize their magnitudes at initialization. On Zebra puzzles, the mixed rewards outperform task-only optimization, suggesting that even coarse ordering signals can effectively steer RL post-training.
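The paper's exact reward definitions aren't reproduced here; below is a minimal Python sketch of the general recipe, assuming a pairwise (Kendall-tau-style) agreement score for the ordering reward and magnitude equalization from initial-policy rollouts. All function names (`task_reward`, `ordering_reward`, `bootstrap_scales`, `mixed_reward`) and the mixture weights are illustrative, not taken from the paper.

```python
import numpy as np

def task_reward(fully_solved: bool) -> float:
    """Sparse task reward: 1.0 only when the puzzle is fully solved."""
    return 1.0 if fully_solved else 0.0

def ordering_reward(emitted, canonical) -> float:
    """Coarse ordering reward (assumed form): fraction of action pairs
    whose relative order matches the canonical solver order, in [0, 1]."""
    pos = {a: i for i, a in enumerate(canonical)}
    common = [a for a in emitted if a in pos]
    pairs = [(i, j) for i in range(len(common)) for j in range(i + 1, len(common))]
    if not pairs:
        return 0.0
    agree = sum(pos[common[i]] < pos[common[j]] for i, j in pairs)
    return agree / len(pairs)

def bootstrap_scales(reward_samples, eps=1e-3):
    """Bootstrapped scaling (assumed form): estimate each component's
    magnitude from rollouts of the *initial* policy, so the fixed
    mixture starts with components of comparable size."""
    return {k: 1.0 / max(np.abs(np.asarray(v)).mean(), eps)
            for k, v in reward_samples.items()}

def mixed_reward(components, scales, weights):
    """Fixed mixture: weighted sum of magnitude-equalized components."""
    return sum(weights[k] * scales[k] * components[k] for k in components)

# Illustrative usage: magnitude estimates come from initial rollouts,
# then each new completion is scored with the fixed mixture.
init_samples = {"task": [0.0, 0.0, 1.0, 0.0], "order": [0.4, 0.6, 0.5, 0.3]}
scales = bootstrap_scales(init_samples)
r = {"task": task_reward(False),
     "order": ordering_reward(["a", "c", "b", "d"], ["a", "b", "c", "d"])}
print(mixed_reward(r, scales, weights={"task": 0.5, "order": 0.5}))
```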
Key facts
- arXiv:2512.04277v3
- Bootstrapped mixed rewards for RL post-training
- Injects canonical action order as a reward signal
- Uses GRPO with a sparse task reward and an ordering reward (see the GRPO sketch after this list)
- Fixed mixtures with bootstrapped scaling
- Tested on Zebra puzzles
- Mixed rewards outperform task-only optimization
- Coarse ordering signals can effectively steer RL post-training
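GRPO itself is critic-free: for each prompt it samples a group of completions, scores each one (here with the mixed reward), and standardizes rewards within the group to obtain advantages. A minimal sketch of that group-relative step, with illustrative reward values:

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative baseline: each completion's advantage is its
    reward standardized against the mean/std of its sampling group,
    replacing a learned value function."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled solutions for one Zebra puzzle, scored with a mixed
# reward (values illustrative): only the third fully solves it.
print(grpo_advantages([0.10, 0.35, 1.00, 0.35]))
```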
Entities
Institutions
- arXiv (preprint repository hosting the paper, not an author affiliation)