LLM Architecture Advances: KV Sharing, mHC, and Compressed Attention
Google's Gemma 4 models introduce cross-layer KV sharing and per-layer embeddings (PLE) to reduce memory use and increase capacity. Gemma 4 E2B has 35 layers, of which only the first 15 compute their own KV projections; the remaining 20 reuse KV tensors from earlier layers, saving about 2.7 GB at a 128K-token context. PLE adds layer-specific token vectors without scaling the main transformer stack. ZAYA1-8B uses compressed convolutional attention to shrink the KV cache, Laguna XS.2 implements layer-wise attention budgeting, and DeepSeek V4 introduces mHC (multi-head compression) together with compressed attention. All of these designs target long-context efficiency for reasoning models and agent workflows.
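To make the sharing pattern concrete, here is a minimal PyTorch sketch, assuming a toy decoder in which the first 15 attention layers compute their own K/V projections and the remaining 20 reuse the most recently computed K/V pair. The class name `SharedKVAttention`, the dimensions, and the single-shared-pair reuse policy are illustrative assumptions, not Gemma's actual implementation.

```python
# Minimal sketch of cross-layer KV sharing (assumptions noted above, not Gemma's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedKVAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, computes_kv: bool):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.computes_kv = computes_kv
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        if computes_kv:
            # Only KV-computing layers own K/V projections (and KV cache entries).
            self.k_proj = nn.Linear(d_model, d_model, bias=False)
            self.v_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, shared_kv):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        if self.computes_kv:
            k = self.k_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            v = self.v_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            shared_kv = (k, v)          # later layers will reuse this pair
        else:
            k, v = shared_kv            # reuse: no new KV projection, no new cache entry
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out), shared_kv


# Toy stack mirroring the 15-compute / 20-reuse split described above.
layers = nn.ModuleList(
    SharedKVAttention(d_model=256, n_heads=4, computes_kv=(i < 15)) for i in range(35)
)
x = torch.randn(1, 8, 256)
shared_kv = None
for layer in layers:
    h, shared_kv = layer(x, shared_kv)
    x = x + h
```

During inference, only the 15 KV-computing layers would need KV cache entries, which is where the memory saving comes from.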
Key facts
- Gemma 4 E2B has 35 transformer layers; the first 15 compute their own KV projections and the remaining 20 reuse them.
- KV sharing saves ~2.7 GB of KV cache at bfloat16 for a 128K context in E2B (see the sizing sketch after this list).
- Gemma 4 E4B has 42 layers; 24 compute their own KV projections and 18 reuse them.
- PLE adds per-layer embedding slices to increase capacity without scaling the main transformer stack.
- ZAYA1-8B uses compressed convolutional attention.
- Laguna XS.2 uses layer-wise attention budgeting.
- DeepSeek V4 uses mHC and compressed attention.
- All designs focus on reducing KV cache size for long contexts.
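The ~2.7 GB figure is consistent with simple KV-cache bookkeeping. The sketch below assumes bfloat16 (2 bytes per element) and a per-layer KV width of 256 (for example, a small number of KV heads under grouped-query attention); that width is an illustrative assumption, not a published Gemma configuration.

```python
# Back-of-the-envelope KV cache sizing. kv_dim = 256 is an assumed per-layer KV width
# chosen to illustrate the bookkeeping, not a published Gemma number.
def kv_cache_bytes(n_layers: int, seq_len: int, kv_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache K and V for `n_layers` attention layers."""
    return 2 * n_layers * seq_len * kv_dim * bytes_per_elem  # 2 = one K + one V tensor


SEQ = 128 * 1024   # 128K-token context
KV_DIM = 256       # assumed per-layer KV width; bfloat16 -> 2 bytes per element

saved_e2b = kv_cache_bytes(n_layers=20, seq_len=SEQ, kv_dim=KV_DIM)  # the 20 reusing layers
print(f"E2B: cache avoided by 20 KV-sharing layers ≈ {saved_e2b / 1e9:.2f} GB")  # ≈ 2.68 GB

saved_e4b = kv_cache_bytes(n_layers=18, seq_len=SEQ, kv_dim=KV_DIM)  # E4B: 18 sharing layers
print(f"E4B: cache avoided by 18 KV-sharing layers ≈ {saved_e4b / 1e9:.2f} GB")
```

Under these assumptions, the 20 reusing layers in E2B would otherwise hold roughly 2.68 GB of K/V at a 128K context, matching the ~2.7 GB saving quoted above.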
Entities
People
- Sebastian Raschka
Institutions
- Ahead of AI