ARTFEED — Contemporary Art Intelligence

Component-Aware Self-Speculative Decoding for Hybrid Language Models

ai-technology · 2026-05-06

Researchers have introduced component-aware self-speculative decoding, the first technique to exploit the architectural heterogeneity inside hybrid language models for speculative decoding. The method treats the hybrid's SSM/linear-attention subgraph as a zero-cost internal draft model, removing the need for an external drafter: the cheap components propose tokens and the full model verifies them in a single pass. It was evaluated on Falcon-H1, which runs Mamba-2 and attention in parallel within each layer, and on Qwen3.5, which interleaves linear-attention and attention layers sequentially, with the pure-Transformer Qwen2.5 serving as a control. Parallel hybrids reach an acceptance rate of α = 0.68 at draft length k = 2 under greedy decoding, while sequential hybrids reach only α = 0.038, an 18x gap attributable to architecture alone. The result extends self-speculative decoding beyond uniform Transformer stacks.
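
A minimal sketch of the procedure under greedy decoding is below. It assumes a hypothetical components= switch on the model's forward call for running either the full hybrid or only its SSM/linear-attention subgraph; the summary above does not specify the paper's actual integration details, and a real implementation would gate the per-layer branches directly and cache recurrent state rather than re-running the prefix.

    # Minimal sketch of component-aware self-speculative decoding with greedy
    # verification. The `components=` keyword is a hypothetical interface for
    # selecting the full hybrid model or only its SSM/linear-attention
    # subgraph; it is not an API from the paper.

    import torch

    @torch.no_grad()
    def component_aware_generate(model, ids, max_new_tokens=64, k=2):
        """ids: (1, prompt_len) token tensor; returns the extended sequence."""
        prompt_len = ids.shape[-1]
        while ids.shape[-1] - prompt_len < max_new_tokens:
            # 1) Draft: greedily propose k tokens using only the cheap
            #    SSM/linear-attention components of the same model.
            draft = ids
            for _ in range(k):
                logits = model(draft, components="ssm_only")          # hypothetical
                draft = torch.cat([draft, logits[:, -1:].argmax(-1)], dim=-1)

            # 2) Verify: a single full-model pass scores every draft position.
            full_logits = model(draft, components="all")              # hypothetical
            target = full_logits[:, ids.shape[-1] - 1 : -1].argmax(-1)
            proposed = draft[:, ids.shape[-1]:]

            # 3) Greedy acceptance: keep the longest prefix where the draft
            #    matches the full model's argmax, then take one token from the
            #    full model itself (a correction, or a bonus token if all match).
            n_match = int((proposed == target).long().cumprod(-1).sum())
            if n_match == k:
                bonus = full_logits[:, -1:].argmax(-1)
                ids = torch.cat([ids, proposed, bonus], dim=-1)
            else:
                ids = torch.cat([ids, proposed[:, :n_match],
                                 target[:, n_match : n_match + 1]], dim=-1)
        return ids

The sketch re-encodes the whole prefix on every draft step for clarity; an efficient version would reuse the SSM's recurrent state and the attention KV cache across steps.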

Key facts

  • Component-aware self-speculative decoding is introduced for hybrid language models.
  • It isolates the SSM/linear-attention subgraph as a zero-cost internal draft.
  • Evaluated on Falcon-H1 (parallel architecture) and Qwen3.5 (sequential architecture).
  • Pure Transformer Qwen2.5 used as control.
  • Parallel hybrids achieve acceptance rate α = 0.68 at draft length k=2 under greedy decoding.
  • Sequential hybrids yield only α = 0.038, an 18x gap (see the worked estimate after this list).
  • First method to exploit internal architectural heterogeneity for self-speculative decoding.
  • Published on arXiv with ID 2605.01106.
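
To see what the reported acceptance rates imply, the standard speculative-decoding approximation, which assumes each draft token is accepted independently with probability α (an assumption not stated in the source), gives an expected (1 − α^(k+1)) / (1 − α) tokens per full-model verification pass. The short calculation below is a back-of-the-envelope estimate, not a figure from the paper.

    # Back-of-the-envelope estimate: expected tokens emitted per full-model
    # verification pass, assuming i.i.d. per-token acceptance with rate alpha
    # (standard speculative-decoding approximation, not a paper result).

    def expected_tokens_per_pass(alpha: float, k: int) -> float:
        """E[tokens] = (1 - alpha**(k + 1)) / (1 - alpha) for draft length k."""
        return (1 - alpha ** (k + 1)) / (1 - alpha)

    print(expected_tokens_per_pass(0.68, 2))    # parallel hybrid   -> ~2.14
    print(expected_tokens_per_pass(0.038, 2))   # sequential hybrid -> ~1.04

Under that assumption, the parallel hybrid would emit roughly twice as many tokens per verification pass as plain autoregressive decoding, while the sequential hybrid would see almost no gain.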

Entities

Institutions

  • arXiv

Sources