GAC: Adaptive Mixing for Hybrid SFT-RL Post-Training
Researchers propose GAC, a noise-aware controller for hybrid post-training that adaptively mixes supervised fine-tuning and reinforcement learning signals. The method estimates gradient variance and disagreement between the two signals to compute a dynamic mixing weight, with smoothing, prior guidance, and bounded updates. Experiments on math, code, science, and logic benchmarks show consistent improvements over fixed and rule-based baselines, especially at larger model scales, with less than 1% training overhead.
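The summary above does not give GAC's actual update rule, so the following is only a minimal sketch of what a noise-aware mixing controller of this kind could look like. Everything in it is an assumption made for illustration: the class name `MixingController`, the helper `mix`, the inverse-variance target, the cosine-based disagreement measure, and all default hyperparameters are hypothetical and not taken from the paper.

```python
import math


def _clamp(x, lo, hi):
    return max(lo, min(hi, x))


class MixingController:
    """Illustrative controller; NOT the paper's exact rule (all choices assumed)."""

    def __init__(self, prior=0.5, beta=0.9, prior_strength=0.2,
                 max_step=0.05, w_min=0.05, w_max=0.95):
        self.prior = prior                    # prior guidance for the SFT weight
        self.beta = beta                      # EMA factor used for all smoothing
        self.prior_strength = prior_strength  # baseline pull toward the prior
        self.max_step = max_step              # bound on the per-step weight change
        self.w_min, self.w_max = w_min, w_max
        self.weight = prior                   # current weight on the SFT gradient
        self.mean = {"sft": None, "rl": None}
        self.var = {"sft": 0.0, "rl": 0.0}

    def _update_variance(self, name, grad):
        # Online noise estimate: EMA of the squared deviation from an EMA mean.
        mean = self.mean[name]
        if mean is None:
            self.mean[name] = list(grad)
            return
        dev = sum((g - m) ** 2 for g, m in zip(grad, mean)) / len(grad)
        self.var[name] = self.beta * self.var[name] + (1 - self.beta) * dev
        self.mean[name] = [self.beta * m + (1 - self.beta) * g
                           for g, m in zip(grad, mean)]

    def step(self, g_sft, g_rl):
        self._update_variance("sft", g_sft)
        self._update_variance("rl", g_rl)

        # Disagreement in [0, 1], derived from cosine similarity (assumed measure).
        dot = sum(a * b for a, b in zip(g_sft, g_rl))
        norm = (math.sqrt(sum(a * a for a in g_sft))
                * math.sqrt(sum(b * b for b in g_rl)))
        cos = dot / norm if norm > 0 else 0.0
        disagreement = 0.5 * (1.0 - cos)

        # Inverse-variance target: the noisier RL is, the more weight SFT gets.
        total = self.var["sft"] + self.var["rl"]
        target = self.var["rl"] / total if total > 0 else self.prior

        # Prior guidance: lean on the prior more when the signals disagree.
        pull = self.prior_strength + (1 - self.prior_strength) * disagreement
        target = (1 - pull) * target + pull * self.prior

        # Smoothing, then a bounded update and a bounded final weight.
        smoothed = self.beta * self.weight + (1 - self.beta) * target
        delta = _clamp(smoothed - self.weight, -self.max_step, self.max_step)
        self.weight = _clamp(self.weight + delta, self.w_min, self.w_max)
        return self.weight


def mix(weight, g_sft, g_rl):
    """Combine the two gradients with the current SFT weight."""
    return [weight * a + (1 - weight) * b for a, b in zip(g_sft, g_rl)]


# Toy usage: a steady SFT gradient and an RL gradient that oscillates.
controller = MixingController()
for t in range(100):
    s = 1.0 if t % 2 == 0 else -1.0
    g_sft = [1.0, 1.0, 1.0, 1.0]
    g_rl = [1 + 0.5 * s, 1 - 0.5 * s, 1 + 0.5 * s, 1 - 0.5 * s]
    w = controller.step(g_sft, g_rl)
print("final SFT weight:", w)
```

In this toy setup the RL signal is the noisier one, so the weight drifts from the prior toward the SFT gradient while the per-step change stays within `max_step`. A real implementation would presumably compute these statistics from tensors already present in the training loop rather than from Python lists.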
Key facts
- GAC is a noise-aware adaptive mixing controller for hybrid SFT-RL post-training.
- Fixed mixing schedules cannot adapt when the relative noise of the SFT and RL signals changes.
- GAC derives an adaptive mixing weight from online estimates of gradient variance and of disagreement between the two signals.
- Method adds smoothing, prior guidance, and bounded updates.
- Reuses existing training tensors.
- Experiments on math, code, science, and logic benchmarks.
- Consistent improvements over strong fixed and rule-based baselines.
- Gains are larger at larger model scales; training overhead is less than 1%.
Entities
Institutions
- arXiv