DeepSeek V3, OLMo 2, and the State of LLM Architecture in 2025
Seven years after the original GPT architecture, large language model architectures remain structurally similar, but key refinements have emerged. DeepSeek V3 (671B parameters, Dec 2024) combines Multi-Head Latent Attention (MLA) with a Mixture-of-Experts (MoE) design of 256 routed experts plus one shared expert, activating only 37B parameters per token. MLA compresses key/value tensors into a lower-dimensional latent for KV cache efficiency and, per DeepSeek-V2 ablations, outperforms Grouped-Query Attention in modeling performance. OLMo 2 (Allen Institute for AI, Jan 2025) adopts a Post-Norm placement of its RMSNorm layers (still inside the residual connections) and adds QK-Norm, both of which improve training stability; it retains standard Multi-Head Attention rather than GQA or MLA. The article compares these and other 2025 models (Llama 4, Gemma 4, Qwen 3), focusing on architectural choices such as positional embeddings (RoPE), activation functions (SwiGLU), and normalization strategies.
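To make the MLA compression idea concrete, here is a minimal PyTorch sketch of the down-/up-projection step. All dimensions and layer names (`d_latent`, `w_down_kv`, etc.) are illustrative assumptions, not DeepSeek's actual code, and the decoupled RoPE path used in real MLA is omitted.

```python
import torch
import torch.nn as nn

class MLACompressionSketch(nn.Module):
    """Toy key/value compression in the spirit of MLA (not DeepSeek's real code)."""

    def __init__(self, d_model=4096, n_heads=32, d_latent=512):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # Down-project hidden states into a small latent; only this is cached.
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)
        # Up-project the cached latent back to full per-head keys and values.
        self.w_up_k = nn.Linear(d_latent, d_model, bias=False)
        self.w_up_v = nn.Linear(d_latent, d_model, bias=False)

    def forward(self, x):                  # x: (batch, seq, d_model)
        b, s, _ = x.shape
        c_kv = self.w_down_kv(x)           # (batch, seq, d_latent) -> goes in KV cache
        k = self.w_up_k(c_kv).view(b, s, self.n_heads, self.d_head)
        v = self.w_up_v(c_kv).view(b, s, self.n_heads, self.d_head)
        return c_kv, k, v                  # cache c_kv instead of k and v

x = torch.randn(1, 8, 4096)
c_kv, k, v = MLACompressionSketch()(x)
print(c_kv.shape)                          # torch.Size([1, 8, 512])
```

Caching the 512-dim latent instead of the full keys and values (2 × 4096 per token in this toy configuration) shrinks the KV cache by roughly 16x, which is the efficiency argument behind MLA.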
Key facts
- DeepSeek V3 has 671 billion total parameters but activates only 37 billion per token via MoE.
- DeepSeek V3 uses Multi-Head Latent Attention (MLA) instead of Grouped-Query Attention.
- MLA compresses key and value tensors into a lower-dimensional space for KV cache efficiency.
- OLMo 2 uses Post-Norm (RMSNorm after attention and FFN) inside residual connections.
- OLMo 2 adds QK-Norm, an extra RMSNorm applied to the queries and keys inside the attention module, for training stability (see the block sketch after this list).
- OLMo 2 still uses standard Multi-Head Attention (MHA).
- DeepSeek V3 has 256 routed experts per MoE module plus one shared expert; a toy routing sketch follows this list.
- The article was last updated on Apr 2, 2026 (added Gemma 4).
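As a companion to the MoE facts above, the following is a minimal sketch of top-k routing with an always-active shared expert, the pattern attributed to DeepSeek V3. The expert count, `top_k`, and layer shapes are small illustrative stand-ins rather than the real 256-expert configuration, and the loops are written for clarity, not speed.

```python
import torch
import torch.nn as nn

def make_expert(d_model, d_ff):
    # Simplified feed-forward expert (real models typically use SwiGLU).
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))

class SharedExpertMoESketch(nn.Module):
    def __init__(self, d_model=256, d_ff=512, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(make_expert(d_model, d_ff) for _ in range(n_experts))
        self.shared = make_expert(d_model, d_ff)   # runs on every token

    def forward(self, x):                          # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)          # mixing weights for selected experts
        out = self.shared(x)                       # shared expert: always active
        for slot in range(self.top_k):             # routed experts: sparsely active
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] = out[mask] + weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 256)
print(SharedExpertMoESketch()(tokens).shape)       # torch.Size([4, 256])
```

Because only the router-selected experts (plus the shared one) run for each token, total parameter count and per-token compute decouple; this is how 671B total parameters can correspond to only about 37B active per token.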
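For the OLMo 2 items, here is a sketch of the two stability tweaks together: RMSNorm applied to each sublayer's output before it is added back to the residual stream (the Post-Norm-inside-residual placement), and QK-Norm on queries and keys. This sketch normalizes q and k per head, which is one common placement; OLMo 2's exact details (e.g., its SwiGLU feed-forward, RoPE) are simplified away, and `nn.RMSNorm` requires a recent PyTorch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        # QK-Norm: extra RMSNorms on queries and keys (applied per head here).
        self.q_norm = nn.RMSNorm(self.d_head)
        self.k_norm = nn.RMSNorm(self.d_head)

    def forward(self, x):                          # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = self.q_norm(split(q)), self.k_norm(split(k)), split(v)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(b, s, -1))

class OLMo2StyleBlock(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.attn = QKNormAttention(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.attn_norm = nn.RMSNorm(d_model)
        self.ffn_norm = nn.RMSNorm(d_model)

    def forward(self, x):
        x = x + self.attn_norm(self.attn(x))   # norm after attention, inside residual
        x = x + self.ffn_norm(self.ffn(x))     # norm after FFN, inside residual
        return x

x = torch.randn(2, 16, 256)
print(OLMo2StyleBlock()(x).shape)              # torch.Size([2, 16, 256])
```

Note that unlike the original Post-LN, the norm sits before the residual addition, so the identity path through the network stays unnormalized, which is the property credited with the improved training stability.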
Entities
Authors
- Sebastian Raschka
Institutions
- Allen Institute for AI
- Substack