TIDE: First Cross-Architecture Distillation Framework for Diffusion LLMs
A team of researchers has unveiled TIDE, the first framework to enable cross-architecture knowledge distillation for diffusion large language models (dLLMs). Unlike prior techniques, which are restricted to same-architecture transfer, TIDE allows the teacher and student models to differ in architecture, attention mechanism, and tokenizer. The framework consists of three modular components: TIDAL, which modulates distillation strength across training progress and diffusion timestep to reflect the teacher's noise-dependent reliability; CompDemo, which enriches the teacher's context through complementary mask splitting to improve predictions under heavy masking; and Reverse CALM, a cross-tokenizer objective that reverses chunk-level likelihood matching to keep gradients bounded. The work addresses a significant gap in dLLM distillation: state-of-the-art dLLMs require billions of parameters for competitive performance, so transferring their capabilities into smaller or differently built students is valuable. The paper is available on arXiv under ID 2604.26951.
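The summary does not specify TIDAL's actual schedule. As a rough illustration only, the PyTorch sketch below weights a per-token KL distillation loss by a factor that decays with the diffusion timestep (treated here as a mask ratio in [0, 1], where the teacher is least reliable) and ramps up with training progress; the power-law shapes, the names `tidal_weight` and `distill_loss`, and the loss form are assumptions, not the paper's formulation.

```python
import torch

def tidal_weight(t: torch.Tensor, progress: float,
                 alpha: float = 2.0, beta: float = 1.0) -> torch.Tensor:
    """Hypothetical TIDAL-style weight (not the paper's exact formula).

    t: diffusion timestep / mask ratio per example, in [0, 1].
    progress: fraction of training completed, in [0, 1].
    """
    reliability = (1.0 - t).clamp(min=0.0) ** alpha  # trust the teacher less as noise grows
    schedule = progress ** beta                      # distill harder as training progresses
    return reliability * schedule

def distill_loss(student_logits, teacher_logits, t, progress, mask):
    """Per-token KL to the teacher, scaled by the TIDAL-style weight and
    averaged over masked positions only. Shapes: logits are
    [batch, seq, vocab], t is [batch], mask is [batch, seq]."""
    mask = mask.float()
    log_p_s = torch.log_softmax(student_logits, dim=-1)
    p_t = torch.softmax(teacher_logits, dim=-1)
    kl = (p_t * (p_t.clamp_min(1e-9).log() - log_p_s)).sum(-1)  # [batch, seq]
    w = tidal_weight(t, progress).unsqueeze(-1)                 # broadcast over sequence
    return (w * kl * mask).sum() / mask.sum().clamp(min=1.0)
```

The shape of the weight is guesswork, but the intent mirrors the stated motivation: a teacher's predictions degrade as more of the sequence is masked, so its signal is trusted less at high noise levels.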
Key facts
- TIDE is the first cross-architecture distillation framework for diffusion large language models.
- It allows teacher and student to differ in architecture, attention mechanism, and tokenizer.
- TIDAL modulates distillation strength across training progress and diffusion timestep.
- CompDemo uses complementary mask splitting to improve predictions under heavy masking (see the sketch after this list).
- Reverse CALM is a cross-tokenizer objective that reverses chunk-level likelihood matching to keep gradients bounded.
- Prior distillation methods for dLLMs only work within a single architecture.
- State-of-the-art dLLMs need billions of parameters for competitive performance.
- The paper is published on arXiv with ID 2604.26951.
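How complementary mask splitting works is not detailed in the summary. One plausible reading, sketched below under stated assumptions, is that the masked positions are split into two complementary halves, and each teacher pass reveals the opposite half from the clean sequence, so the teacher predicts every masked token under a lighter effective mask; the two passes are then merged position-wise. The function names, the random 50/50 split, and the merge rule are hypothetical.

```python
import torch

def complementary_mask_split(mask: torch.Tensor, generator=None):
    """Split a boolean mask (True = position is masked) into two
    complementary halves via a coin flip per masked position."""
    coin = torch.rand(mask.shape, generator=generator) < 0.5
    return mask & coin, mask & ~coin

@torch.no_grad()
def compdemo_teacher_logits(teacher, x, x_clean, mask):
    """Hypothetical CompDemo-style teacher call (illustrative only).

    x: input ids with masked positions already set to the mask token.
    x_clean: the original, fully unmasked token ids.
    For each half of the masked set, the *other* half is revealed from
    x_clean, giving the teacher richer context than the full mask.
    """
    mask_a, mask_b = complementary_mask_split(mask)
    # Pass 1: half A stays masked, half B is revealed from the clean tokens.
    x_a = torch.where(mask_b, x_clean, x)
    # Pass 2: half B stays masked, half A is revealed.
    x_b = torch.where(mask_a, x_clean, x)
    logits_a = teacher(x_a)
    logits_b = teacher(x_b)
    # Each position's prediction comes from the pass in which it was masked.
    return torch.where(mask_a.unsqueeze(-1), logits_a, logits_b)
```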