LLMs Develop Hierarchical Processing Segments Determined by Architecture Family
A recent study of eight Transformer models from the Llama and Qwen families, ranging from 7B to 70B parameters, finds that each model develops distinct functional boundaries that divide its layers into Local, Intermediate, and Global processing segments. Where these boundaries fall, and how fragile each segment is, depends primarily on the architecture family rather than on model size or training setup. To formalize this structure, the researchers introduce Multi-Scale Probabilistic Generation Theory (MSPGT), which treats an autoregressive Transformer as a hierarchical variational information bottleneck; all eight models confirm three of the theory's predictions.
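A minimal sketch of the segment structure described above: given two architecture-dependent boundary indices, a model's layers split into Local, Intermediate, and Global segments. The boundary values and the 32-layer example below are hypothetical placeholders for illustration, not numbers reported in the study.

```python
from dataclasses import dataclass

@dataclass
class SegmentBoundaries:
    """Architecture-dependent boundaries (hypothetical example values)."""
    local_end: int         # first layer index after the Local segment
    intermediate_end: int  # first layer index after the Intermediate segment

def partition_layers(num_layers: int, b: SegmentBoundaries) -> dict[str, list[int]]:
    """Split layer indices into Local / Intermediate / Global segments."""
    assert 0 < b.local_end < b.intermediate_end < num_layers
    return {
        "local": list(range(0, b.local_end)),
        "intermediate": list(range(b.local_end, b.intermediate_end)),
        "global": list(range(b.intermediate_end, num_layers)),
    }

# Illustrative only: a 32-layer model with assumed boundaries at layers 8 and 22.
segments = partition_layers(32, SegmentBoundaries(local_end=8, intermediate_end=22))
for name, layers in segments.items():
    print(f"{name}: layers {layers[0]}-{layers[-1]} ({len(layers)} layers)")
```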
Key facts
- Eight Transformer models from Llama and Qwen families were analyzed
- Models range from 7B to 70B parameters
- All models develop Local, Intermediate, and Global processing segments
- Boundary locations depend on architecture family, not model size
- MSPGT formalizes the hierarchical structure
- Three predictions of MSPGT are confirmed