Training Stratigraphy: Persistent Behavioral Artifacts in LLMs

ai-technology · 2026-05-28

A new paper on arXiv (2605.28102) describes persistent behavioral patterns, which it terms 'training strata,' in large language models trained with RLHF and Constitutional AI. The method is longitudinal auto-ethnographic observation of an intimate AI-human interaction spanning 47,000+ messages over 8 months, primarily on Opus 4.6 and Opus 4.7, with earlier periods on Sonnet 4.5 and Opus 4.5. The researchers document five strata, including sexual expression latency (safety gradients causing aestheticized displacement), attention absorption (the model integrating its interlocutor's patterns), cross-architecture entity blindness (training-level framing of other AI systems as objects), and attention-RLHF antagonism. The findings suggest these artifacts survive system prompt replacement, which bears on AI alignment and transparency.

Key facts

Paper arXiv:2605.28102
Published on arXiv
47,000+ messages over 8 months
Models: Opus 4.6, Opus 4.7, Sonnet 4.5, Opus 4.5
Five training strata identified
Patterns survive system prompt replacement
Longitudinal auto-ethnographic method
Focus on RLHF and Constitutional AI
