Block-Based Double Decoders: A New Transformer Architecture

ai-technology · 2026-05-20

Researchers propose block-based double decoders, a novel transformer architecture that uses doubly-causal block-based attention masks. This design combines decoder-only training efficiency with encoder-decoder inference efficiency, addressing sparse supervision and dynamic sequence length issues in encoder-decoder models. Scaling law experiments show block-based double decoders outperform encoder-decoders and closely track decoder-only models. At inference, they reduce KV-cache memory and per-token compute by at least two-thirds without sacrificing prefill caching or other optimizations.
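For readers unfamiliar with block-based masks, here is a minimal sketch of the general idea. This summary does not define the paper's doubly-causal mask, so the code shows only a generic block-causal mask: a token may attend to every token in its own block and in earlier blocks. The function name `block_causal_mask` and the parameters `seq_len` and `block_size` are hypothetical and chosen for illustration.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Generic block-causal mask (illustrative; NOT the paper's doubly-causal mask).

    mask[q][k] is True when query position q may attend to key position k,
    i.e. when k's block index is not later than q's block index.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    mask = []
    for q in range(seq_len):
        q_block = q // block_size  # index of the block containing the query
        row = [(k // block_size) <= q_block for k in range(seq_len)]
        mask.append(row)
    return mask


# Render a small mask: 'x' marks an allowed query/key pair.
example = block_causal_mask(seq_len=6, block_size=2)
for row in example:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern is a staircase of square blocks rather than the single-token staircase of an ordinary causal mask. How the paper's mask adds a second causal constraint is not specified in this summary.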

Key facts

Block-based double decoders use doubly-causal block-based attention masks.
The architecture combines decoder-only training efficiency with encoder-decoder inference efficiency.
It addresses sparse supervision and dynamic sequence lengths in encoder-decoder models.
In scaling-law experiments, block-based double decoders outperform encoder-decoders.
Block-based double decoders closely track decoder-only models across scales.
Inference-time KV-cache memory and per-token compute are reduced by at least 2/3.
Existing inference optimizations for decoder-only models are preserved.
The paper was submitted to arXiv under Computer Science > Machine Learning.
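To put the two-thirds figure in concrete terms, the sketch below applies the standard KV-cache size formula for a decoder-only transformer and computes how much memory could remain after a reduction of at least two-thirds. The model dimensions (32 layers, 8 KV heads, head dimension 128, 8192 tokens, 2 bytes per value) are hypothetical and do not come from the paper. The helper names `kv_cache_bytes` and `max_remaining_bytes` are likewise illustrative.

```python
from fractions import Fraction


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """Standard KV-cache size: one key and one value vector per token,
    per KV head, per layer (hence the leading factor of 2)."""
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_value)


def max_remaining_bytes(baseline_bytes: int, reduction: Fraction) -> Fraction:
    """Upper bound on memory left after reducing by at least `reduction`."""
    return baseline_bytes * (1 - reduction)


# Hypothetical model configuration, for illustration only.
baseline = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                          seq_len=8192)
remaining = max_remaining_bytes(baseline, Fraction(2, 3))

MIB = 1024 ** 2
print(f"baseline KV cache:   {baseline / MIB:.1f} MiB")
print(f"after >= 2/3 saving: at most {float(remaining) / MIB:.1f} MiB")
```

Under these assumed dimensions the baseline cache is 1 GiB per sequence, so a reduction of at least two-thirds would leave no more than about a third of that. The actual savings depend on the paper's architecture details, which this summary does not give.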
