Mutual Reinforcement Learning Framework for Heterogeneous LLMs
Mutual Reinforcement Learning is a novel framework that lets distinct families of large language models (LLMs), despite differing objectives and configurations, learn collaboratively during post-training. The system combines Shared Experience Exchange (SEE) and Multi-Worker Resource Allocation (MWRA) with a Tokenizer Heterogeneity Layer (THL) that efficiently retokenizes text across incompatible vocabularies. Three probes built on Group Relative Policy Optimization (GRPO) are introduced: Peer Rollout Pooling (PRP), Cross-Policy GRPO Advantage Sharing (XGRPO), and Success-Gated Transfer (SGT). A contextual-bandit analysis indicates these methods face a trade-off between stability and support.
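The THL's retokenization step can be pictured as a decode/re-encode round trip between two incompatible vocabularies. The following is a minimal sketch with toy whitespace tokenizers; all names (`ToyTokenizer`, `retokenize`) are illustrative assumptions, not the paper's API, and the "residual" here is just the fraction of tokens that fail the round trip.

```python
# Hypothetical THL sketch: peer token ids -> text -> learner token ids.
# Any token the learner's vocabulary cannot represent becomes an <unk>
# id, which is one concrete way a "THL residual" can arise.

class ToyTokenizer:
    """Whitespace tokenizer with its own id vocabulary (stand-in for a real one)."""
    def __init__(self, vocab):
        self.tok2id = {t: i for i, t in enumerate(vocab)}
        self.id2tok = {i: t for t, i in self.tok2id.items()}
        self.unk = len(vocab)  # out-of-vocabulary id

    def encode(self, text):
        return [self.tok2id.get(t, self.unk) for t in text.split()]

    def decode(self, ids):
        return " ".join(self.id2tok.get(i, "<unk>") for i in ids)

def retokenize(ids, src_tok, dst_tok):
    """THL core: decode with the peer's tokenizer, re-encode with the learner's."""
    return dst_tok.encode(src_tok.decode(ids))

peer = ToyTokenizer(["the", "cat", "sat"])
mine = ToyTokenizer(["sat", "cat", "the", "mat"])

peer_ids = peer.encode("the cat sat")
my_ids = retokenize(peer_ids, peer, mine)
residual = sum(i == mine.unk for i in my_ids) / len(my_ids)  # 0.0: lossless

# When the learner's vocabulary is missing a token, the residual is nonzero:
other = ToyTokenizer(["the", "cat"])
lossy = retokenize(peer_ids, peer, other)
lossy_residual = sum(i == other.unk for i in lossy) / len(lossy)  # one of three tokens lost
```

Real subword tokenizers complicate this (different merge rules, no whitespace alignment), which is presumably why the paper treats the residual as an explicit cost rather than assuming lossless transfer.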
Key facts
- Introduced Mutual Reinforcement Learning for heterogeneous LLMs
- Framework includes SEE, MWRA, and THL components
- THL retokenizes text across incompatible vocabularies
- Three probes: PRP, XGRPO, SGT
- Based on GRPO algorithm
- Contextual-bandit analysis shows stability-support trade-off
- PRP incurs density-ratio variance and THL residual
- Published on arXiv with ID 2605.07244
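The three probes listed above can be sketched together in a few lines. This is a hedged illustration under stated assumptions, not the paper's implementation: function names and the rollout tuple layout are my own, GRPO's group-relative advantage is the standard (reward minus group mean, over group standard deviation), and the density ratio is approximated as exp(logp_self − logp_peer), which is exactly the term whose variance the key facts flag for PRP.

```python
# Illustrative combination of PRP, XGRPO, and SGT (names assumed, not from
# the paper). Each rollout is (reward, logp_self, logp_peer); for a model's
# own rollouts the two log-probs coincide, so its density ratio is 1.
import math
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: (r - group mean) / group std."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0  # guard degenerate groups
    return [(r - mu) / sigma for r in rewards]

def pooled_advantages(self_rollouts, peer_rollouts, success_threshold=1.0):
    # SGT: only peer rollouts that meet the success bar are transferred.
    gated = [r for r in peer_rollouts if r[0] >= success_threshold]
    # PRP: self and (gated) peer rollouts share one advantage group.
    pooled = self_rollouts + gated
    advs = grpo_advantages([reward for reward, _, _ in pooled])
    # XGRPO: shared advantages are reweighted by the density ratio
    # pi_self / pi_peer -- the source of the extra variance noted above.
    return [a * math.exp(lp_s - lp_p)
            for a, (_, lp_s, lp_p) in zip(advs, pooled)]

self_r = [(1.0, -2.0, -2.0), (0.0, -3.0, -3.0)]
peer_r = [(1.0, -2.5, -2.0),   # passes the gate, down-weighted by the ratio
          (0.2, -1.0, -1.5)]   # fails the gate, discarded by SGT
advs = pooled_advantages(self_r, peer_r)
```

Raising `success_threshold` tightens SGT (more stability, less peer support), while lowering it admits noisier peer experience, which is one concrete reading of the stability–support trade-off.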