Transformer Observability Determined by Architecture, Not Scale
A recent arXiv preprint (2604.24801) studies whether internal signals in autoregressive transformers can be monitored for error detection. It defines observability as the linear readability of per-token decision quality from frozen mid-layer activations, after controlling for max-softmax confidence and activation norm. The control matters: confidence baselines absorb 57.7% of the raw probe signal on average across 13 models from 6 families. Observability is not a universal property of transformers. In controlled Pythia experiments, every tested 24-layer, 16-head configuration collapses to a partial correlation of roughly 0.10, a result that holds across a 3.5x parameter gap and two Pile variants, while six other configurations occupy a healthy band between 0.21 and 0.38. The output-controlled residual collapses at the same points, and neither of the tested nonlinearities recovers the signal.
Key facts
- Observability is defined as linear readability of per-token decision quality from frozen mid-layer activations
- Confidence controls absorb 57.7% of raw probe signal on average across 13 models in 6 families
- Pythia 24-layer, 16-head configuration collapses to rho_partial ~0.10 across 3.5x parameter gap and two Pile variants
- Six other Pythia configurations occupy a healthy band from 0.21 to 0.38
- Output-controlled residual collapses at the same points as observability
- Neither tested nonlinearity recovers the collapsed signal
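The measurement described above can be sketched in code. This is a minimal, hypothetical illustration with synthetic stand-ins for activations and labels, not the paper's pipeline: it fits a linear probe on frozen activations, then computes the partial correlation between probe score and decision quality after residualizing both against the confidence and norm controls. All variable names and the data-generating process are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: frozen mid-layer activations H, binary per-token
# decision quality, and a toy "max-softmax confidence" derived from H.
n, d = 2000, 64
H = rng.normal(size=(n, d))                         # frozen activations
conf = 1.0 / (1.0 + np.exp(-H[:, 0]))               # toy confidence signal
quality = (H[:, 0] + 0.5 * H[:, 1]
           + rng.normal(scale=0.5, size=n) > 0).astype(float)

def fit_linear(X, y):
    """Least-squares linear fit (with intercept); returns fitted values."""
    X1 = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return X1 @ w

# Raw probe signal: correlation of the linear probe's score with quality.
probe_score = fit_linear(H, quality)
rho_raw = np.corrcoef(probe_score, quality)[0, 1]

# Partial correlation: residualize probe score and quality against the
# controls (confidence, activation norm), then correlate the residuals.
controls = np.column_stack([conf, np.linalg.norm(H, axis=1)])
res_probe = probe_score - fit_linear(controls, probe_score)
res_quality = quality - fit_linear(controls, quality)
rho_partial = np.corrcoef(res_probe, res_quality)[0, 1]

print(f"raw rho = {rho_raw:.3f}, partial rho = {rho_partial:.3f}")
```

In this toy setup the confidence control absorbs much of the raw signal, so the partial correlation comes out well below the raw one, mirroring the paper's point that raw probe accuracy overstates observability. A faithful replication would train and evaluate the probe on disjoint token splits rather than in-sample.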
Entities
Institutions
- arXiv