GEAR: Adaptive Credit Assignment for LLM Agents via Self-Distillation
Researchers propose GEAR (Granularity-adaptivE Advantage Reweighting), a credit-assignment framework for reinforcement learning with LLM agents. The method addresses the limitation of coarse, outcome-level rewards by deriving token- and segment-level signals from self-distillation: GEAR reshapes the trajectory-level GRPO advantage by comparing an on-policy student with a ground-truth-conditioned teacher, using their divergence to identify adaptive segment boundaries and modulate local advantage weights. Because the divergence signal spikes at the onset of semantic deviations, this improves credit assignment in long-horizon trajectories. The paper is available on arXiv (2605.11853).
Key facts
- GEAR is a credit assignment framework for LLM agents.
- It uses token- and segment-level signals from self-distillation.
- It reshapes trajectory-level GRPO advantage.
- It compares on-policy student with ground-truth-conditioned teacher.
- Divergence signal identifies adaptive segment boundaries.
- Divergence spikes at onset of semantic deviation.
- Paper available on arXiv: 2605.11853.
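The mechanism in the key facts above can be sketched in code. This is a hypothetical illustration, not the paper's implementation: the function name, the z-score spike threshold, and the per-segment weighting scheme (1 + mean divergence, normalized to mean one) are all assumptions; the paper only states that student–teacher divergence locates segment boundaries and modulates local advantage weights.

```python
import numpy as np

def reweight_advantages(student_logps, teacher_logps, base_advantage, spike_z=1.5):
    """Hypothetical sketch of GEAR-style credit assignment.

    student_logps, teacher_logps: per-token log-probs of the sampled tokens
    under the on-policy student and the ground-truth-conditioned teacher.
    base_advantage: scalar trajectory-level (GRPO-style) advantage.
    Returns per-token advantages whose mean stays near base_advantage.
    """
    # Per-token divergence signal: where the teacher (which sees the ground
    # truth) disagrees with the student, a semantic deviation likely begins.
    div = np.abs(np.asarray(teacher_logps) - np.asarray(student_logps))

    # Adaptive segment boundaries at divergence spikes (assumed: z-score test).
    z = (div - div.mean()) / (div.std() + 1e-8)
    boundaries = np.flatnonzero(z > spike_z)

    # Split the token indices into segments at the detected boundaries.
    segments = np.split(np.arange(len(div)), boundaries)

    # Modulate the trajectory advantage per segment: segments with higher
    # mean divergence receive larger weight (assumed weighting scheme).
    weights = np.ones(len(div))
    for seg in segments:
        if len(seg) > 0:
            weights[seg] = 1.0 + div[seg].mean()
    weights /= weights.mean() + 1e-8  # renormalize so the mean weight is ~1

    return base_advantage * weights
```

Under this sketch, a uniform trajectory-level advantage becomes a per-token signal concentrated on the segments where the student drifts from the teacher, while its mean over the trajectory is preserved.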