Transformer Observability Determined by Architecture, Not Scale
A recent arXiv preprint (2604.24801) studies whether internal signals in autoregressive transformers can be monitored for error detection. It defines observability as the linear readability of per-token decision quality from frozen mid-layer activations, after controlling for max-softmax confidence and activation norm. The control matters: confidence baselines absorb 57.7% of the raw probe signal on average across 13 models from 6 families. Observability is not a universal property of transformers. In controlled Pythia experiments, every tested 24-layer, 16-head configuration collapses to a partial correlation of roughly 0.10, a result that holds across a 3.5x parameter gap and two Pile variants, while six other configurations occupy a healthy band between 0.21 and 0.38. The output-controlled residual collapses at the same points, and neither of the tested nonlinearities recovers the signal.
Key facts
- Observability is defined as linear readability of per-token decision quality from frozen mid-layer activations
- Confidence controls absorb 57.7% of raw probe signal on average across 13 models in 6 families
- Pythia 24-layer, 16-head configuration collapses to rho_partial ~0.10 across 3.5x parameter gap and two Pile variants
- Six other Pythia configurations occupy a healthy band from 0.21 to 0.38
- Output-controlled residual collapses at the same points as observability
- Neither tested nonlinearity recovers the collapsed signal
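The measurement described above can be sketched in code. This is a minimal, hypothetical illustration with synthetic stand-ins for activations and labels, not the paper's pipeline: it fits a linear probe on frozen activations, then computes the partial correlation between probe score and decision quality after residualizing both against the confidence and norm controls. All variable names and the data-generating process are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: frozen mid-layer activations H, binary per-token
# decision quality, and a toy "max-softmax confidence" derived from H.
n, d = 2000, 64
H = rng.normal(size=(n, d))                         # frozen activations
conf = 1.0 / (1.0 + np.exp(-H[:, 0]))               # toy confidence signal
quality = (H[:, 0] + 0.5 * H[:, 1]
           + rng.normal(scale=0.5, size=n) > 0).astype(float)

def fit_linear(X, y):
    """Least-squares linear fit (with intercept); returns fitted values."""
    X1 = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return X1 @ w

# Raw probe signal: correlation of the linear probe's score with quality.
probe_score = fit_linear(H, quality)
rho_raw = np.corrcoef(probe_score, quality)[0, 1]

# Partial correlation: residualize probe score and quality against the
# controls (confidence, activation norm), then correlate the residuals.
controls = np.column_stack([conf, np.linalg.norm(H, axis=1)])
res_probe = probe_score - fit_linear(controls, probe_score)
res_quality = quality - fit_linear(controls, quality)
rho_partial = np.corrcoef(res_probe, res_quality)[0, 1]

print(f"raw rho = {rho_raw:.3f}, partial rho = {rho_partial:.3f}")
```

In this toy setup the confidence control absorbs much of the raw signal, so the partial correlation comes out well below the raw one, mirroring the paper's point that raw probe accuracy overstates observability. A faithful replication would train and evaluate the probe on disjoint token splits rather than in-sample.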
Entities
Institutions
- arXiv