TRACE: Token-Routed Alignment for Critical Reasoning in Math AI
A new training method called TRACE (Token-Routed Alignment for Critical rEasoning) has been developed to improve the reasoning of large language models, particularly in mathematics. It addresses two failure modes of on-policy self-distillation (self-OPD): all-token KL divergence wastes gradient on redundant positions, and privileged-information leakage raises entropy and shortens reasoning. TRACE instead distills only on annotator-marked critical spans, applying forward KL to key spans of correct rollouts, optional reverse KL to localized error spans, and GRPO to all remaining tokens, with the KL channel annealed away after a short warm-up. The results show that forward KL provides a non-vanishing lift on teacher-supported tokens the student underweights, while span masking and decay keep the cumulative privileged gradient under control.
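The token routing described above can be sketched as a per-token loss that dispatches each position to one channel by its span label. This is a minimal illustration, not the paper's implementation: the label names, the `token_routed_loss` function, and the treatment of GRPO tokens (contributing zero to the distillation term here, since they are trained by policy gradient separately) are all assumptions for the sake of the example.

```python
import math

def forward_kl(p_teacher, q_student):
    # KL(p || q): pushes the student toward teacher-supported tokens.
    return sum(p * math.log(p / q) for p, q in zip(p_teacher, q_student) if p > 0)

def reverse_kl(p_teacher, q_student):
    # KL(q || p): mode-seeking; used on localized error spans.
    return sum(q * math.log(q / p) for p, q in zip(p_teacher, q_student) if q > 0)

def token_routed_loss(teacher_probs, student_probs, span_labels, kl_coef):
    """Route each token to one distillation channel by its annotator label.

    span_labels[t] is one of "critical", "error", "other" (hypothetical names).
    "other" tokens get no KL term: they are optimized by GRPO elsewhere,
    so they contribute 0.0 to this distillation loss.
    """
    total = 0.0
    for t, label in enumerate(span_labels):
        if label == "critical":
            total += kl_coef * forward_kl(teacher_probs[t], student_probs[t])
        elif label == "error":
            total += kl_coef * reverse_kl(teacher_probs[t], student_probs[t])
    return total
```

In practice the distributions would be logits over a vocabulary and the masking would be vectorized, but the routing logic is the same: the span mask decides which KL direction, if any, a token receives.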
Key facts
- TRACE is a new AI training method for reasoning in mathematics.
- It addresses issues in on-policy self-distillation (self-OPD).
- All-token KL divergence wastes gradients on redundant positions.
- Privileged-information leakage causes entropy rise and shortened reasoning.
- TRACE distills only on annotator-marked critical spans.
- Forward KL is applied on key spans of correct rollouts.
- Optional reverse KL is applied on localized error spans.
- GRPO is used on all remaining tokens.
- The KL channel is annealed away after a short warm-up.
- Forward KL provides non-vanishing lift to teacher-supported tokens.
- Span masking and decay keep cumulative privileged-gradient under control.
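The warm-up and anneal of the KL channel mentioned above can be illustrated with a simple schedule: hold the coefficient at full strength for a short warm-up, then decay it linearly to zero so that GRPO remains the only objective. The function name, the linear shape, and the default values are illustrative assumptions, not details from the paper.

```python
def kl_coefficient(step, warmup_steps, decay_steps, base_coef=1.0):
    """Anneal the KL channel away after a short warm-up.

    Illustrative schedule: constant during warm-up, then linear decay
    to 0.0 over `decay_steps`, after which only GRPO trains the model.
    """
    if step < warmup_steps:
        return base_coef
    progress = (step - warmup_steps) / decay_steps
    return base_coef * max(0.0, 1.0 - progress)
```

Zeroing the coefficient (rather than keeping a small residual KL) is what bounds the cumulative privileged gradient the teacher can inject.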