GEAR: Adaptive Credit Assignment for LLM Agents via Self-Distillation
Researchers propose GEAR (Granularity-adaptivE Advantage Reweighting), a credit-assignment framework for reinforcement learning with LLM agents. The method addresses the limitation of coarse, outcome-level rewards by deriving token- and segment-level signals from self-distillation: GEAR reshapes the trajectory-level GRPO advantage by comparing an on-policy student with a ground-truth-conditioned teacher, using their divergence to identify adaptive segment boundaries and modulate local advantage weights. Because the divergence signal spikes at the onset of semantic deviations, this improves credit assignment in long-horizon trajectories. The paper is available on arXiv (2605.11853).
Key facts
- GEAR is a credit assignment framework for LLM agents.
- It uses token- and segment-level signals from self-distillation.
- It reshapes trajectory-level GRPO advantage.
- It compares on-policy student with ground-truth-conditioned teacher.
- Divergence signal identifies adaptive segment boundaries.
- Divergence spikes at onset of semantic deviation.
- Paper available on arXiv: 2605.11853.
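The mechanism in the key facts above can be sketched in code. This is a hypothetical illustration, not the paper's implementation: the function name, the z-score spike threshold, and the per-segment weighting scheme (1 + mean divergence, normalized to mean one) are all assumptions; the paper only states that student–teacher divergence locates segment boundaries and modulates local advantage weights.

```python
import numpy as np

def reweight_advantages(student_logps, teacher_logps, base_advantage, spike_z=1.5):
    """Hypothetical sketch of GEAR-style credit assignment.

    student_logps, teacher_logps: per-token log-probs of the sampled tokens
    under the on-policy student and the ground-truth-conditioned teacher.
    base_advantage: scalar trajectory-level (GRPO-style) advantage.
    Returns per-token advantages whose mean stays near base_advantage.
    """
    # Per-token divergence signal: where the teacher (which sees the ground
    # truth) disagrees with the student, a semantic deviation likely begins.
    div = np.abs(np.asarray(teacher_logps) - np.asarray(student_logps))

    # Adaptive segment boundaries at divergence spikes (assumed: z-score test).
    z = (div - div.mean()) / (div.std() + 1e-8)
    boundaries = np.flatnonzero(z > spike_z)

    # Split the token indices into segments at the detected boundaries.
    segments = np.split(np.arange(len(div)), boundaries)

    # Modulate the trajectory advantage per segment: segments with higher
    # mean divergence receive larger weight (assumed weighting scheme).
    weights = np.ones(len(div))
    for seg in segments:
        if len(seg) > 0:
            weights[seg] = 1.0 + div[seg].mean()
    weights /= weights.mean() + 1e-8  # renormalize so the mean weight is ~1

    return base_advantage * weights
```

Under this sketch, a uniform trajectory-level advantage becomes a per-token signal concentrated on the segments where the student drifts from the teacher, while its mean over the trajectory is preserved.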