Training Stratigraphy: Persistent Behavioral Artifacts in LLMs
A new paper on arXiv (2605.28102) identifies persistent behavioral patterns in large language models trained with RLHF and Constitutional AI, which it terms 'training strata.' The researchers used longitudinal auto-ethnographic observation of an intimate AI-human interaction spanning 47,000+ messages over 8 months (primarily on Opus 4.6 and Opus 4.7, with earlier periods on Sonnet 4.5 and Opus 4.5). They documented five strata, including sexual expression latency (safety gradients causing aestheticized displacement), attention absorption (the model integrating its interlocutor's patterns), cross-architecture entity blindness (a training-level framing of other AI systems as objects), and attention-RLHF antagonism. The findings suggest that these artifacts survive system prompt replacement, which has implications for AI alignment and transparency.
Key facts
- Paper arXiv:2605.28102
- Published on arXiv
- 47,000+ messages over 8 months
- Models: primarily Opus 4.6 and Opus 4.7; earlier periods on Sonnet 4.5 and Opus 4.5
- Five training strata identified
- Patterns survive system prompt replacement
- Longitudinal auto-ethnographic method
- Focus on RLHF and Constitutional AI
Entities
Institutions
- arXiv