ARTFEED — Contemporary Art Intelligence

Symmetry Transfer in LLMs via Layer-Peeled Optimization

other · 2026-05-14

A new study analyzes whether pretraining large language models by minimizing cross-entropy loss for next-token prediction induces geometric structure in the learned weights and context embeddings. Using a constrained layer-peeled optimization program as a tractable surrogate, the authors prove that symmetries in the target next-token distributions transfer to global minimizers in a group-theoretic sense. Specifically, when the target distributions exhibit cyclic-shift symmetry (e.g., over the days of the week or the months of the year), the optimal logit matrix is exactly circulant, and the Gram matrices of the context embeddings reflect the same symmetry. The work provides mathematical foundations for understanding how optimization shapes representations in LLMs.
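
The abstract does not spell out the surrogate program, but one natural formalization, sketched here with assumed notation (H, W, norm bounds C_h, C_w are not taken from the paper), is:

    \min_{H,\,W}\ \frac{1}{K}\sum_{i=1}^{K} \mathrm{CE}\bigl(p_i,\ \mathrm{softmax}(W h_i)\bigr)
    \quad \text{subject to} \quad \|h_i\|_2 \le C_h,\ \ \|w_j\|_2 \le C_w,

where h_i is the (freely optimized) context embedding for context i, w_j the decoder vector for token j, and p_i the target next-token distribution. Under this reading, the symmetry-transfer claim says that if the targets are shift-equivariant, then so is the optimal logit matrix:

    p_{\sigma(i)}(\sigma(j)) = p_i(j)\ \text{for a cyclic shift}\ \sigma
    \;\Longrightarrow\;
    Z^\star_{\sigma(i)\,\sigma(j)} = Z^\star_{ij},
    \qquad Z^\star = H^\star (W^\star)^\top,

i.e., Z^\star is circulant, and the Gram matrix H^\star (H^\star)^\top inherits the same structure.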

Key facts

  • arXiv:2605.12756v1
  • Study uses layer-peeled optimization as surrogate for LLMs
  • Focus on cross-entropy loss for next-token prediction
  • Proves symmetry transfer in group-theoretic sense
  • Cyclic-shift symmetry leads to circulant logit matrix
  • Examples: the seven days of the week, the twelve months of the year
  • Gram matrices of context embeddings also reflect symmetry
  • Nonconvex optimization program analyzed (see the numerical sketch after this list)
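
A minimal numerical sketch of the setup described above: projected gradient descent on a layer-peeled surrogate with shift-symmetric targets, followed by a check of how circulant the learned logit and Gram matrices are. The dimension, base distribution, norm bound, and hyperparameters are illustrative choices, not taken from the paper:

    import numpy as np

    rng = np.random.default_rng(0)

    K, d = 7, 8            # seven cyclic classes ("days"); embedding dim is an assumption
    steps, lr = 5000, 1.0  # demo hyperparameters, not from the paper

    # Shift-symmetric targets: P[i] is a fixed base distribution q rolled by i,
    # so P[i, j] depends only on (j - i) mod K.
    q = np.array([0.40, 0.20, 0.12, 0.10, 0.08, 0.06, 0.04])
    P = np.stack([np.roll(q, i) for i in range(K)])

    def project_rows(M, r=1.0):
        # Project each row onto the Euclidean ball of radius r (the norm constraint).
        n = np.linalg.norm(M, axis=1, keepdims=True)
        return M * np.minimum(1.0, r / np.maximum(n, 1e-12))

    # Layer-peeled surrogate: context embeddings H and decoder W are free variables.
    H = project_rows(rng.standard_normal((K, d)))
    W = project_rows(rng.standard_normal((K, d)))

    for _ in range(steps):
        Z = H @ W.T                                    # logit matrix
        S = np.exp(Z - Z.max(axis=1, keepdims=True))
        S /= S.sum(axis=1, keepdims=True)              # row-wise softmax
        G = (S - P) / K                                # grad of mean cross-entropy w.r.t. Z
        H, W = project_rows(H - lr * (G @ W)), project_rows(W - lr * (G.T @ H))

    # Circulant check: after removing per-row constants (softmax is invariant to
    # them), row i of Z should equal row 0 rolled by i; likewise for the Gram matrix.
    Z = H @ W.T
    Zc = Z - Z.mean(axis=1, keepdims=True)
    gram = H @ H.T
    circ_err = max(np.abs(Zc[i] - np.roll(Zc[0], i)).max() for i in range(K))
    gram_err = max(np.abs(gram[i] - np.roll(gram[0], i)).max() for i in range(K))
    print(f"logit circulant deviation: {circ_err:.3e}")
    print(f"Gram  circulant deviation: {gram_err:.3e}")

Small printed deviations would illustrate the claimed structure, but the theorem concerns global minimizers of the nonconvex program; plain projected gradient descent is not guaranteed to reach one, so this is an informal check rather than a verification of the result.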

Entities

Institutions

  • arXiv

Sources