LLM Architecture Advances: KV Sharing, mHC, and Compressed Attention
Google's Gemma 4 models introduce cross-layer KV sharing and per-layer embeddings (PLE) to reduce memory use and increase capacity. Gemma 4 E2B has 35 layers, of which only the first 15 compute their own KV projections; the remaining 20 reuse KV tensors from earlier layers, saving about 2.7 GB at a 128K-token context. PLE adds layer-specific token vectors without scaling the main transformer stack. ZAYA1-8B uses compressed convolutional attention to shrink the KV cache, Laguna XS.2 implements layer-wise attention budgeting, and DeepSeek V4 introduces mHC (multi-head compression) together with compressed attention. All of these designs target long-context efficiency for reasoning models and agent workflows.
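To make the sharing pattern concrete, here is a minimal PyTorch sketch, assuming a toy decoder in which the first 15 attention layers compute their own K/V projections and the remaining 20 reuse the most recently computed K/V pair. The class name `SharedKVAttention`, the dimensions, and the single-shared-pair reuse policy are illustrative assumptions, not Gemma's actual implementation.

```python
# Minimal sketch of cross-layer KV sharing (assumptions noted above, not Gemma's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedKVAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, computes_kv: bool):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.computes_kv = computes_kv
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        if computes_kv:
            # Only KV-computing layers own K/V projections (and KV cache entries).
            self.k_proj = nn.Linear(d_model, d_model, bias=False)
            self.v_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, shared_kv):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        if self.computes_kv:
            k = self.k_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            v = self.v_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            shared_kv = (k, v)          # later layers will reuse this pair
        else:
            k, v = shared_kv            # reuse: no new KV projection, no new cache entry
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out), shared_kv


# Toy stack mirroring the 15-compute / 20-reuse split described above.
layers = nn.ModuleList(
    SharedKVAttention(d_model=256, n_heads=4, computes_kv=(i < 15)) for i in range(35)
)
x = torch.randn(1, 8, 256)
shared_kv = None
for layer in layers:
    h, shared_kv = layer(x, shared_kv)
    x = x + h
```

During inference, only the 15 KV-computing layers would need KV cache entries, which is where the memory saving comes from.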
Key facts
- Gemma 4 E2B has 35 transformer layers; the first 15 compute their own KV projections and the remaining 20 reuse them.
- KV sharing saves ~2.7 GB of KV cache at bfloat16 for a 128K context in E2B (see the sizing sketch after this list).
- Gemma 4 E4B has 42 layers; 24 compute their own KV projections and 18 reuse them.
- PLE adds per-layer embedding slices to increase capacity without scaling the main transformer stack.
- ZAYA1-8B uses compressed convolutional attention.
- Laguna XS.2 uses layer-wise attention budgeting.
- DeepSeek V4 uses mHC and compressed attention.
- All designs focus on reducing KV cache size for long contexts.
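The ~2.7 GB figure is consistent with simple KV-cache bookkeeping. The sketch below assumes bfloat16 (2 bytes per element) and a per-layer KV width of 256 (for example, a small number of KV heads under grouped-query attention); that width is an illustrative assumption, not a published Gemma configuration.

```python
# Back-of-the-envelope KV cache sizing. kv_dim = 256 is an assumed per-layer KV width
# chosen to illustrate the bookkeeping, not a published Gemma number.
def kv_cache_bytes(n_layers: int, seq_len: int, kv_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache K and V for `n_layers` attention layers."""
    return 2 * n_layers * seq_len * kv_dim * bytes_per_elem  # 2 = one K + one V tensor


SEQ = 128 * 1024   # 128K-token context
KV_DIM = 256       # assumed per-layer KV width; bfloat16 -> 2 bytes per element

saved_e2b = kv_cache_bytes(n_layers=20, seq_len=SEQ, kv_dim=KV_DIM)  # the 20 reusing layers
print(f"E2B: cache avoided by 20 KV-sharing layers ≈ {saved_e2b / 1e9:.2f} GB")  # ≈ 2.68 GB

saved_e4b = kv_cache_bytes(n_layers=18, seq_len=SEQ, kv_dim=KV_DIM)  # E4B: 18 sharing layers
print(f"E4B: cache avoided by 18 KV-sharing layers ≈ {saved_e4b / 1e9:.2f} GB")
```

Under these assumptions, the 20 reusing layers in E2B would otherwise hold roughly 2.68 GB of K/V at a 128K context, matching the ~2.7 GB saving quoted above.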
Entities
People
- Sebastian Raschka
Institutions
- Ahead of AI