Mutual Reinforcement Learning Framework for Heterogeneous LLMs
Mutual Reinforcement Learning is a novel framework that lets distinct families of large language models (LLMs), despite differing objectives and configurations, learn collaboratively during post-training. The system combines Shared Experience Exchange (SEE) and Multi-Worker Resource Allocation (MWRA) with a Tokenizer Heterogeneity Layer (THL) that efficiently retokenizes text across incompatible vocabularies. Three probes built on Group Relative Policy Optimization (GRPO) are introduced: Peer Rollout Pooling (PRP), Cross-Policy GRPO Advantage Sharing (XGRPO), and Success-Gated Transfer (SGT). A contextual-bandit analysis indicates these methods face a trade-off between stability and support.
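The THL's retokenization step can be pictured as a decode/re-encode round trip between two incompatible vocabularies. The following is a minimal sketch with toy whitespace tokenizers; all names (`ToyTokenizer`, `retokenize`) are illustrative assumptions, not the paper's API, and the "residual" here is just the fraction of tokens that fail the round trip.

```python
# Hypothetical THL sketch: peer token ids -> text -> learner token ids.
# Any token the learner's vocabulary cannot represent becomes an <unk>
# id, which is one concrete way a "THL residual" can arise.

class ToyTokenizer:
    """Whitespace tokenizer with its own id vocabulary (stand-in for a real one)."""
    def __init__(self, vocab):
        self.tok2id = {t: i for i, t in enumerate(vocab)}
        self.id2tok = {i: t for t, i in self.tok2id.items()}
        self.unk = len(vocab)  # out-of-vocabulary id

    def encode(self, text):
        return [self.tok2id.get(t, self.unk) for t in text.split()]

    def decode(self, ids):
        return " ".join(self.id2tok.get(i, "<unk>") for i in ids)

def retokenize(ids, src_tok, dst_tok):
    """THL core: decode with the peer's tokenizer, re-encode with the learner's."""
    return dst_tok.encode(src_tok.decode(ids))

peer = ToyTokenizer(["the", "cat", "sat"])
mine = ToyTokenizer(["sat", "cat", "the", "mat"])

peer_ids = peer.encode("the cat sat")
my_ids = retokenize(peer_ids, peer, mine)
residual = sum(i == mine.unk for i in my_ids) / len(my_ids)  # 0.0: lossless

# When the learner's vocabulary is missing a token, the residual is nonzero:
other = ToyTokenizer(["the", "cat"])
lossy = retokenize(peer_ids, peer, other)
lossy_residual = sum(i == other.unk for i in lossy) / len(lossy)  # one of three tokens lost
```

Real subword tokenizers complicate this (different merge rules, no whitespace alignment), which is presumably why the paper treats the residual as an explicit cost rather than assuming lossless transfer.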
Key facts
- Introduced Mutual Reinforcement Learning for heterogeneous LLMs
- Framework includes SEE, MWRA, and THL components
- THL retokenizes text across incompatible vocabularies
- Three probes: PRP, XGRPO, SGT
- Based on GRPO algorithm
- Contextual-bandit analysis shows stability-support trade-off
- PRP incurs density-ratio variance and THL residual
- Published on arXiv with ID 2605.07244
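The three probes listed above can be sketched together in a few lines. This is a hedged illustration under stated assumptions, not the paper's implementation: function names and the rollout tuple layout are my own, GRPO's group-relative advantage is the standard (reward minus group mean, over group standard deviation), and the density ratio is approximated as exp(logp_self − logp_peer), which is exactly the term whose variance the key facts flag for PRP.

```python
# Illustrative combination of PRP, XGRPO, and SGT (names assumed, not from
# the paper). Each rollout is (reward, logp_self, logp_peer); for a model's
# own rollouts the two log-probs coincide, so its density ratio is 1.
import math
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: (r - group mean) / group std."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0  # guard degenerate groups
    return [(r - mu) / sigma for r in rewards]

def pooled_advantages(self_rollouts, peer_rollouts, success_threshold=1.0):
    # SGT: only peer rollouts that meet the success bar are transferred.
    gated = [r for r in peer_rollouts if r[0] >= success_threshold]
    # PRP: self and (gated) peer rollouts share one advantage group.
    pooled = self_rollouts + gated
    advs = grpo_advantages([reward for reward, _, _ in pooled])
    # XGRPO: shared advantages are reweighted by the density ratio
    # pi_self / pi_peer -- the source of the extra variance noted above.
    return [a * math.exp(lp_s - lp_p)
            for a, (_, lp_s, lp_p) in zip(advs, pooled)]

self_r = [(1.0, -2.0, -2.0), (0.0, -3.0, -3.0)]
peer_r = [(1.0, -2.5, -2.0),   # passes the gate, down-weighted by the ratio
          (0.2, -1.0, -1.5)]   # fails the gate, discarded by SGT
advs = pooled_advantages(self_r, peer_r)
```

Raising `success_threshold` tightens SGT (more stability, less peer support), while lowering it admits noisier peer experience, which is one concrete reading of the stability–support trade-off.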