Block-Based Double Decoders: A New Transformer Architecture
Researchers propose block-based double decoders, a novel transformer architecture that uses doubly-causal block-based attention masks. This design combines decoder-only training efficiency with encoder-decoder inference efficiency, addressing sparse supervision and dynamic sequence length issues in encoder-decoder models. Scaling law experiments show block-based double decoders outperform encoder-decoders and closely track decoder-only models. At inference, they reduce KV-cache memory and per-token compute by at least two-thirds without sacrificing prefill caching or other optimizations.
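To give a sense of scale for the inference claim, the arithmetic can be sketched as follows. This is a back-of-the-envelope illustration, not the paper's accounting: the cache-size formula is the usual estimate for a decoder-only transformer (one key and one value tensor per layer), the model shape is hypothetical, and the only figure taken from the summary is the "at least two-thirds" reduction.

```python
from fractions import Fraction


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Usual KV-cache estimate for one sequence in a decoder-only model.

    The leading factor of 2 counts the key tensor and the value tensor
    stored at every layer. bytes_per_value=2 assumes 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def reduced_upper_bound(baseline_bytes, min_reduction=Fraction(2, 3)):
    """Largest cache size consistent with a reduction of at least min_reduction."""
    return baseline_bytes * (1 - min_reduction)


# Hypothetical model shape, chosen only for illustration (not from the paper).
baseline = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
bound = reduced_upper_bound(baseline)

print(f"decoder-only baseline: {baseline / 2**30:.2f} GiB")
print(f"upper bound after a two-thirds reduction: {float(bound) / 2**30:.2f} GiB")
```

Under these assumptions the baseline cache is 2 GiB per sequence, so a reduction of at least two-thirds leaves at most about 0.67 GiB. The same one-third bound applies to per-token compute, for which the summary states the same ratio.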
Key facts
- Block-based double decoders use doubly-causal block-based attention masks.
- The architecture combines decoder-only training efficiency with encoder-decoder inference efficiency.
- It addresses sparse supervision and dynamic sequence lengths in encoder-decoder models.
- Scaling law experiments show block-based double decoders outperform encoder-decoders.
- Block-based double decoders closely track decoder-only models across scales.
- Inference-time KV-cache memory and per-token compute are each reduced by at least two-thirds.
- Existing inference optimizations for decoder-only models are preserved.
- The paper was submitted to arXiv under Computer Science > Machine Learning.
Entities
Institutions
- arXiv