ARTFEED — Contemporary Art Intelligence

Bootstrapped Mixed Rewards Improve RL Post-Training for Transformers

ai-technology · 2026-05-07

A new arXiv paper (2512.04277v3) proposes injecting canonical action order as an auxiliary reward during reinforcement learning (RL) post-training, improving Transformer performance even when the model was fine-tuned on randomized solution sequences. The method trains with Group Relative Policy Optimization (GRPO) on a mixture of two rewards: a sparse task reward (1 only when the puzzle is fully solved) and an ordering reward that measures how well the model's emission order aligns with a canonical solver order. The mixture weights are fixed, with bootstrapped scaling used to equalize the magnitudes of the two components at initialization. On Zebra puzzles, the mixed reward outperforms task-only optimization, suggesting that even coarse ordering signals can effectively steer RL post-training.
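The mixing scheme described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the exact scaling formula, mixture weight `alpha`, and reward ranges are assumptions. The idea is to estimate each component's typical magnitude from rollouts of the initial policy and rescale so both contribute comparably at the start of training.

```python
import numpy as np

def bootstrap_scales(task_rewards, order_rewards, eps=1e-8):
    """Estimate per-component scales from initial-policy rollouts so that
    both reward components have comparable magnitude at initialization
    (the 'bootstrapped scaling' idea; exact formula is an assumption)."""
    s_task = np.mean(np.abs(task_rewards)) + eps
    s_order = np.mean(np.abs(order_rewards)) + eps
    return 1.0 / s_task, 1.0 / s_order

def mixed_reward(task_r, order_r, w_task, w_order, alpha=0.5):
    """Fixed mixture of the sparse task reward and the ordering reward.
    alpha is a hypothetical fixed mixture weight."""
    return alpha * w_task * task_r + (1.0 - alpha) * w_order * order_r

# Hypothetical rollout statistics from the initial policy.
task_rs = np.array([0.0, 0.0, 1.0, 0.0])   # sparse: 1 only when fully solved
order_rs = np.array([0.2, 0.5, 0.9, 0.1])  # ordering alignment in [0, 1]
w_t, w_o = bootstrap_scales(task_rs, order_rs)
r = mixed_reward(task_rs[2], order_rs[2], w_t, w_o)
```

After scaling, the mean absolute magnitudes of the two scaled components are equal, so neither signal dominates the GRPO advantage estimates at initialization.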

Key facts

  • arXiv:2512.04277v3
  • Bootstrapped mixed rewards for RL post-training
  • Injects canonical action order as a reward signal
  • Uses GRPO with sparse task reward and ordering reward
  • Fixed mixtures with bootstrapped scaling
  • Tested on Zebra puzzles
  • Mixed rewards outperform task-only optimization
  • Coarse ordering signals steer RL post-training
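
One way to realize the coarse ordering signal listed above is a pairwise-agreement score between the model's emitted action order and the canonical solver order. This is a minimal sketch of one plausible alignment measure (a normalized Kendall-style agreement); the paper's actual ordering reward may be defined differently.

```python
def ordering_reward(emitted, canonical):
    """Fraction of action pairs whose relative order in the emitted
    sequence matches the canonical solver order (illustrative measure)."""
    pos = {a: i for i, a in enumerate(emitted)}
    # Consider every canonically ordered pair present in the emission.
    pairs = [(a, b) for i, a in enumerate(canonical)
             for b in canonical[i + 1:]
             if a in pos and b in pos]
    if not pairs:
        return 0.0
    agree = sum(pos[a] < pos[b] for a, b in pairs)
    return agree / len(pairs)

# A perfectly aligned emission scores 1.0; swapping the first two
# actions of a three-step canonical order leaves 2 of 3 pairs agreeing.
score = ordering_reward(["B", "A", "C"], ["A", "B", "C"])  # 2/3
```

Because the score only depends on relative order, it stays informative even when the policy emits a partially correct sequence, which is what makes such a coarse signal usable as a dense companion to the sparse task reward.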

Entities

Institutions

  • arXiv
