LLMs Develop Hierarchical Processing Segments Determined by Architecture Family
A recent study of eight Transformer models from the Llama and Qwen families, ranging from 7B to 70B parameters, finds that each model develops distinct functional boundaries that divide its layers into Local, Intermediate, and Global processing segments. Where these boundaries fall, and how fragile each segment is, depends primarily on the architecture family rather than on model size or training setup. To formalize this structure, the researchers introduce Multi-Scale Probabilistic Generation Theory (MSPGT), which treats an autoregressive Transformer as a hierarchical variational information bottleneck; all eight models confirm three of the theory's predictions.
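A minimal sketch of the segment structure described above: given two architecture-dependent boundary indices, a model's layers split into Local, Intermediate, and Global segments. The boundary values and the 32-layer example below are hypothetical placeholders for illustration, not numbers reported in the study.

```python
from dataclasses import dataclass

@dataclass
class SegmentBoundaries:
    """Architecture-dependent boundaries (hypothetical example values)."""
    local_end: int         # first layer index after the Local segment
    intermediate_end: int  # first layer index after the Intermediate segment

def partition_layers(num_layers: int, b: SegmentBoundaries) -> dict[str, list[int]]:
    """Split layer indices into Local / Intermediate / Global segments."""
    assert 0 < b.local_end < b.intermediate_end < num_layers
    return {
        "local": list(range(0, b.local_end)),
        "intermediate": list(range(b.local_end, b.intermediate_end)),
        "global": list(range(b.intermediate_end, num_layers)),
    }

# Illustrative only: a 32-layer model with assumed boundaries at layers 8 and 22.
segments = partition_layers(32, SegmentBoundaries(local_end=8, intermediate_end=22))
for name, layers in segments.items():
    print(f"{name}: layers {layers[0]}-{layers[-1]} ({len(layers)} layers)")
```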
Key facts
- Eight Transformer models from Llama and Qwen families were analyzed
- Models range from 7B to 70B parameters
- All models develop Local, Intermediate, and Global processing segments
- Boundary locations depend on architecture family, not model size
- MSPGT formalizes the hierarchical structure
- Three predictions of MSPGT are confirmed