ARTFEED — Contemporary Art Intelligence

LLM Architecture Advances: KV Sharing, mHC, and Compressed Attention

ai-technology · 2026-05-16

Google's Gemma 4 models introduce cross-layer KV sharing and per-layer embeddings (PLE) to reduce memory use and increase capacity. Gemma 4 E2B has 35 layers, but only the first 15 compute their own KV projections; the remaining 20 reuse KV tensors from earlier layers, saving about 2.7 GB of bfloat16 KV cache at 128K context. PLE adds layer-specific token vectors without scaling the main transformer stack. ZAYA1-8B uses compressed convolutional attention to shrink the KV cache. Laguna XS.2 implements layer-wise attention budgeting. DeepSeek V4 introduces mHC (multi-head compression) and compressed attention. All of these designs target long-context efficiency for reasoning models and agent workflows.
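The cross-layer KV sharing idea can be sketched in a few lines. The layer counts below follow the article's Gemma 4 E2B figures, but the scalar "hidden state", the stub projections, and the share mapping (every shared layer reusing the last KV-owning layer) are illustrative assumptions, not Gemma 4's actual design:

```python
# Minimal sketch of cross-layer KV sharing (assumed mapping, toy math).

NUM_LAYERS = 35   # total transformer layers in E2B (from the article)
KV_LAYERS = 15    # layers that compute and cache their own K/V

def kv_source_layer(layer: int) -> int:
    """Return the layer whose cached K/V this layer reads."""
    return min(layer, KV_LAYERS - 1)  # assumption: reuse the last KV owner

def compute_kv(hidden: float, layer: int):
    """Stand-in for a layer's K/V projections (real models emit tensors)."""
    return hidden + layer, hidden - layer

def attention_block(hidden: float, k: float, v: float) -> float:
    """Stand-in for attention + MLP over the cached K/V."""
    return hidden + 0.001 * (k + v)

def forward(x: float):
    kv_cache = {}  # only KV-owning layers ever write here
    hidden = x
    for layer in range(NUM_LAYERS):
        src = kv_source_layer(layer)
        if src == layer:                  # first 15 layers: project and cache
            kv_cache[layer] = compute_kv(hidden, layer)
        k, v = kv_cache[src]              # last 20 layers: read-only reuse
        hidden = attention_block(hidden, k, v)
    return hidden, kv_cache

_, cache = forward(1.0)
print(len(cache))  # 15 cache entries despite 35 layers
```

The memory win comes from the cache dictionary: only 15 of 35 layers store K/V, so at inference time the KV cache is roughly 15/35 the size of an unshared design with the same dimensions.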

Key facts

  • Gemma 4 E2B has 35 transformer layers; 15 compute their own KV, 20 reuse it.
  • KV sharing saves ~2.7 GB of bfloat16 KV cache at 128K context in E2B.
  • Gemma 4 E4B has 42 layers; 24 compute their own KV, 18 share.
  • PLE adds per-layer embedding slices to increase capacity without scaling the transformer stack.
  • ZAYA1-8B uses compressed convolutional attention.
  • Laguna XS.2 uses layer-wise attention budgeting.
  • DeepSeek V4 uses mHC and compressed attention.
  • All designs focus on reducing KV cache size for long contexts.
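A quick back-of-envelope check of the ~2.7 GB figure. The per-token KV width below (256, e.g. 2 KV heads × head dim 128) is an assumption chosen for illustration; the article does not state Gemma 4 E2B's actual attention dimensions:

```python
# Rough check of the cited ~2.7 GB KV-cache saving, under assumed dims.

SHARED_LAYERS = 20     # layers that reuse KV instead of storing their own
CONTEXT = 128 * 1024   # 128K-token context
KV_WIDTH = 256         # ASSUMED per-token width of K (and of V) per layer
BYTES = 2              # bfloat16

# 2 = one K tensor plus one V tensor per layer
saved = SHARED_LAYERS * CONTEXT * 2 * KV_WIDTH * BYTES
print(f"{saved / 1e9:.2f} GB")  # 2.68 GB, consistent with the cited ~2.7 GB
```

Under these assumed dimensions the saving lands within rounding of the article's ~2.7 GB, which is why the figure scales linearly with context length and with the number of sharing layers.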

Entities

People

  • Sebastian Raschka

Institutions

  • Google
  • Ahead of AI

Sources