ARTFEED — Contemporary Art Intelligence

Shadow Mask Distillation Reduces KV Cache Memory in RL Post-Training

ai-technology · 2026-05-11

A new method called Shadow Mask Distillation (SMD) addresses the memory bottleneck in reinforcement learning (RL) post-training of large language models (LLMs). During online RL, the rollout phase generates exploratory trajectories, and on long-context reasoning tasks those rollouts accumulate a massive Key-Value (KV) cache, creating a "memory wall." Existing KV cache compression techniques are nearly lossless during standard inference, but applying them during rollouts introduces off-policy bias: the policy that generates trajectories no longer matches the policy being optimized, and RL's instability amplifies even small approximation errors.

SMD mitigates this by distilling a full-context teacher into a sparse-context student, enabling memory-efficient alignment without sacrificing performance. The method is compatible with common RL post-training setups, including RLHF and RLAIF pipelines and algorithms such as PPO, GRPO, and Online DPO.

Experiments show SMD reduces KV cache memory by up to 4x while maintaining or improving task accuracy on long-context reasoning benchmarks. The paper is available on arXiv under ID 2605.06850.
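The full-context-teacher-to-sparse-context-student idea can be sketched in a few lines of NumPy. The paper's actual masking rule and loss are not described here, so this toy uses a top-k-by-teacher-attention mask as an illustrative stand-in, a single attention head, and a random toy output head; the KL term is a generic distillation objective, not necessarily SMD's exact one.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
d, T, V = 16, 64, 100               # head dim, cached sequence length, toy vocab
K = rng.normal(size=(T, d))         # cached keys
Vals = rng.normal(size=(T, d))      # cached values
W = rng.normal(size=(V, d))         # toy output projection (assumption)
q = rng.normal(size=(d,))           # current query

def next_token_dist(idx):
    """Attend over the cache entries in idx, then apply the toy output head."""
    attn = softmax(K[idx] @ q / np.sqrt(d))
    h = attn @ Vals[idx]
    return softmax(W @ h)

# Teacher attends to the full cache; the student keeps only a "shadow mask"
# of entries (here: top-k by teacher attention, a hypothetical scoring rule).
attn_teacher = softmax(K @ q / np.sqrt(d))
mask = np.argsort(attn_teacher)[-T // 4:]   # keep 1/4 of the cache -> 4x smaller

p_t = next_token_dist(np.arange(T))         # teacher's next-token distribution
p_s = next_token_dist(mask)                 # student's, from the sparse cache

# Distillation loss: KL(teacher || student) over output distributions,
# pushing the sparse-context student to match full-context behavior.
kl = float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))
```

Distilling on output distributions rather than on the cache contents themselves is what lets the student be trained to behave on-policy despite seeing only a quarter of the context.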

Key facts

  • Shadow Mask Distillation (SMD) is proposed for memory-efficient RL post-training.
  • Online RL requires a rollout phase that creates a large KV cache footprint.
  • KV cache compression during rollouts causes off-policy bias.
  • SMD uses distillation from a full-context teacher to a sparse-context student.
  • The method works with RLHF, RLAIF, PPO, GRPO, and Online DPO.
  • SMD reduces KV cache memory by up to 4x.
  • Task accuracy is maintained or improved on long-context reasoning benchmarks.
  • The paper is arXiv:2605.06850.
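To see where the "memory wall" comes from, a back-of-envelope calculation helps. The model shape, precision, and rollout length below are illustrative assumptions (a Llama-style 32-layer, 32-head model in fp16 with a 32k-token rollout), not figures from the paper.

```python
# KV cache size for one rollout sequence under assumed model dimensions.
layers, heads, head_dim, dtype_bytes, seq_len = 32, 32, 128, 2, 32_768

def kv_cache_bytes(tokens):
    # 2 tensors (K and V) per layer, per head, per cached token
    return 2 * layers * heads * head_dim * dtype_bytes * tokens

full_gb = kv_cache_bytes(seq_len) / 2**30          # 16.0 GiB for the full cache
sparse_gb = kv_cache_bytes(seq_len // 4) / 2**30   # 4.0 GiB at 4x compression
```

At these sizes, a single long rollout's cache rivals the memory of the model weights themselves, which is why even modest compression factors matter during the rollout phase.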

Entities

Institutions

  • arXiv

Sources