Adaptive Layerwise Perturbation for LLM RL Off-Policy Correction
Adaptive Layerwise Perturbation (ALP) tackles off-policy challenges in reinforcement learning (RL) for large language models (LLMs). Problems such as policy staleness and training-inference mismatch undermine both training stability and exploration in LLM RL. Inference-efficiency techniques widen the distribution gap between the inference policy and the updated policy; where the policy is locally sharp, the resulting importance ratios become heavy-tailed, inflating gradients and pushing updates outside the trust region. ALP injects small learnable perturbations into the hidden states of each layer during updates and uses the perturbed policy as the numerator of the importance ratio against the fixed inference policy. This controlled noise in intermediate representations keeps the updated policy aligned with the inference policy. The method is detailed in a paper on arXiv (2603.19470).
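In symbols (the notation below is assumed for illustration; the source does not state an explicit objective): with delta the per-layer learnable perturbations, the perturbed updated policy sits in the numerator and the fixed inference policy in the denominator of the per-token ratio. If plugged into a PPO-style clipped surrogate, which is an assumption here but would enforce the trust region mentioned above, the update would read:

```latex
r_t(\theta,\delta) = \frac{\pi_{\theta,\delta}(a_t \mid s_t)}{\pi_{\mathrm{inf}}(a_t \mid s_t)},
\qquad
L(\theta,\delta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta,\delta)\,\hat{A}_t,\;
\operatorname{clip}\!\big(r_t(\theta,\delta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\big)\right]
```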
Key facts
- ALP addresses off-policy problems in LLM RL.
- Off-policy problems include policy staleness and training-inference mismatch.
- Distribution gap grows due to inference efficiency techniques.
- Heavy-tailed importance ratios arise from locally sharp policies.
- Heavy-tailed ratios inflate gradients and push updates outside trust region.
- ALP injects learnable perturbations into hidden states of each layer (see the sketch after this list).
- Perturbed policy is used as numerator of importance ratio.
- Paper available on arXiv with ID 2603.19470.
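To make the mechanics concrete, here is a minimal PyTorch sketch. The module names, shapes, and hyperparameters (e.g. `init_scale`) are illustrative assumptions, not the paper's reference implementation: a near-zero learnable offset is added to each layer's hidden states, and per-token log-probabilities of the perturbed policy form the numerator of the importance ratio against the fixed inference policy.

```python
# Hedged sketch of the ALP idea; names and hyperparameters are assumptions,
# not the paper's reference implementation.
import torch
import torch.nn as nn


class LayerwisePerturbation(nn.Module):
    """Small learnable additive perturbation for one layer's hidden states."""

    def __init__(self, hidden_size: int, init_scale: float = 1e-3):
        super().__init__()
        # Near-zero init keeps the perturbed policy close to the inference policy.
        self.delta = nn.Parameter(init_scale * torch.randn(hidden_size))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.delta


class PerturbedBlock(nn.Module):
    """Wraps an existing block so its output is perturbed during updates."""

    def __init__(self, block: nn.Module, hidden_size: int):
        super().__init__()
        self.block = block
        self.perturb = LayerwisePerturbation(hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.perturb(self.block(x))


def importance_ratio(perturbed_logprobs: torch.Tensor,
                     inference_logprobs: torch.Tensor) -> torch.Tensor:
    # Perturbed policy in the numerator, fixed inference policy in the
    # denominator, per the ALP formulation.
    return torch.exp(perturbed_logprobs - inference_logprobs)


if __name__ == "__main__":
    hidden = torch.randn(2, 5, 16)   # (batch, seq_len, hidden_size)
    block = nn.Linear(16, 16)        # stand-in for one transformer layer
    layer = PerturbedBlock(block, hidden_size=16)
    out = layer(hidden)              # perturbed hidden states, same shape
    # Per-token log-probs under each policy (random placeholders here).
    ratios = importance_ratio(torch.randn(2, 5), torch.randn(2, 5))
    print(out.shape, ratios.shape)
```

Wrapping every block of a model in `PerturbedBlock` would apply the perturbation layerwise during updates; at inference time the offsets would presumably be omitted so the deployed policy is unchanged.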