Bootstrapped Mixed Rewards Improve RL Post-Training for Transformers
A new arXiv paper (2512.04277v3) proposes injecting a canonical action order into reinforcement learning (RL) post-training as one component of a mixed reward, improving Transformer performance even when the models were fine-tuned on randomized solution sequences. The method uses Group Relative Policy Optimization (GRPO) with two reward components: a sparse task reward (1 only when the puzzle is fully solved) and an ordering reward that aligns the model's emission order with a canonical solver order. The components are combined in a fixed mixture, with bootstrapped scaling used to equalize their magnitudes at initialization. On Zebra puzzles, the mixed rewards outperform task-only optimization, suggesting that even coarse ordering signals can effectively steer RL post-training.
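The paper's exact reward definitions aren't reproduced here; below is a minimal Python sketch of the general recipe, assuming a pairwise (Kendall-tau-style) agreement score for the ordering reward and magnitude equalization from initial-policy rollouts. All function names (`task_reward`, `ordering_reward`, `bootstrap_scales`, `mixed_reward`) and the mixture weights are illustrative, not taken from the paper.

```python
import numpy as np

def task_reward(fully_solved: bool) -> float:
    """Sparse task reward: 1.0 only when the puzzle is fully solved."""
    return 1.0 if fully_solved else 0.0

def ordering_reward(emitted, canonical) -> float:
    """Coarse ordering reward (assumed form): fraction of action pairs
    whose relative order matches the canonical solver order, in [0, 1]."""
    pos = {a: i for i, a in enumerate(canonical)}
    common = [a for a in emitted if a in pos]
    pairs = [(i, j) for i in range(len(common)) for j in range(i + 1, len(common))]
    if not pairs:
        return 0.0
    agree = sum(pos[common[i]] < pos[common[j]] for i, j in pairs)
    return agree / len(pairs)

def bootstrap_scales(reward_samples, eps=1e-3):
    """Bootstrapped scaling (assumed form): estimate each component's
    magnitude from rollouts of the *initial* policy, so the fixed
    mixture starts with components of comparable size."""
    return {k: 1.0 / max(np.abs(np.asarray(v)).mean(), eps)
            for k, v in reward_samples.items()}

def mixed_reward(components, scales, weights):
    """Fixed mixture: weighted sum of magnitude-equalized components."""
    return sum(weights[k] * scales[k] * components[k] for k in components)

# Illustrative usage: magnitude estimates come from initial rollouts,
# then each new completion is scored with the fixed mixture.
init_samples = {"task": [0.0, 0.0, 1.0, 0.0], "order": [0.4, 0.6, 0.5, 0.3]}
scales = bootstrap_scales(init_samples)
r = {"task": task_reward(False),
     "order": ordering_reward(["a", "c", "b", "d"], ["a", "b", "c", "d"])}
print(mixed_reward(r, scales, weights={"task": 0.5, "order": 0.5}))
```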
Key facts
- arXiv:2512.04277v3
- Bootstrapped mixed rewards for RL post-training
- Injects canonical action order as a reward signal
- Uses GRPO with a sparse task reward and an ordering reward (see the GRPO sketch after this list)
- Fixed mixtures with bootstrapped scaling
- Tested on Zebra puzzles
- Mixed rewards outperform task-only optimization
- Coarse ordering signals can effectively steer RL post-training
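GRPO itself is critic-free: for each prompt it samples a group of completions, scores each one (here with the mixed reward), and standardizes rewards within the group to obtain advantages. A minimal sketch of that group-relative step, with illustrative reward values:

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative baseline: each completion's advantage is its
    reward standardized against the mean/std of its sampling group,
    replacing a learned value function."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled solutions for one Zebra puzzle, scored with a mixed
# reward (values illustrative): only the third fully solves it.
print(grpo_advantages([0.10, 0.35, 1.00, 0.35]))
```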
Entities
Institutions
- arXiv (preprint repository hosting the paper, not an author affiliation)