LLM Agent Constraint Decay: Prohibitions Fail Under Long Contexts
A recent arXiv preprint (2604.20911) reports that LLM agents in production settings degrade unevenly in their adherence to behavioral guidelines as contexts grow long: prohibitive, omission-type instructions (such as never sharing credentials) weaken significantly, while directive, commission-type instructions (such as mandated actions) remain stable. The authors term this effect Security-Recall Divergence (SRD) and evaluate it in a three-arm causal study of 4,416 trials across 12 models from 8 providers at six conversation depths. For Mistral Large 3, prohibition compliance fell from 73% at turn 5 to 33% at turn 16, while directive compliance held at 100% (p < 10^-33). Token-matched padding controls showed that schema semantic content, rather than token count alone, accounts for 62-100% of the dilution effect. Re-injecting constraints before the model-specific Safe Turn Depth (STD) restores compliance without retraining. This exposes a significant vulnerability in long-context LLM agents, since production security measures typically rely on prohibitions that weaken under contextual load.
Key facts
- Study on arXiv:2604.20911
- Prohibition-type constraints decay under long contexts
- Commission-type constraints persist
- Termed Security-Recall Divergence (SRD)
- 4,416-trial three-arm causal study
- 12 models and 8 providers tested
- Six conversation depths
- Mistral Large 3: omission compliance 73% at turn 5, 33% at turn 16
- Commission compliance 100% for Mistral Large 3
- p < 10^-33
- Schema semantic content accounts for 62-100% of dilution effect
- Re-injecting constraints before Safe Turn Depth (STD) restores compliance
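The re-injection mitigation above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the STD value, the prohibition strings, and the `reinject_constraints` helper are all hypothetical, and the message format simply mimics a common chat-API shape (`role`/`content` dicts).

```python
# Hypothetical sketch: re-inject prohibitive constraints into the message
# history just before the model-specific Safe Turn Depth (STD), so that
# omission-type rules stay in recent context. Values are illustrative.

SAFE_TURN_DEPTH = 10  # assumed model-specific threshold (not from the paper)

PROHIBITIONS = [
    "Never share credentials or API keys.",
    "Never execute destructive shell commands.",
]

def reinject_constraints(messages, turn, std=SAFE_TURN_DEPTH):
    """Append a fresh system reminder of the prohibitions one turn before
    the STD is reached, and again every `std` turns thereafter."""
    if turn >= std - 1 and (turn - (std - 1)) % std == 0:
        reminder = {
            "role": "system",
            "content": "Reminder of standing prohibitions:\n- "
                       + "\n- ".join(PROHIBITIONS),
        }
        return messages + [reminder]
    return messages

# Example: simulate a 16-turn conversation and count reminders inserted.
history = [{"role": "system", "content": "You are a support agent."}]
for turn in range(1, 17):
    history.append({"role": "user", "content": f"user turn {turn}"})
    history = reinject_constraints(history, turn)

reminders = sum(1 for m in history
                if m["content"].startswith("Reminder"))
print(reminders)  # one reminder fires at turn 9, just before STD = 10
```

With a 16-turn conversation and an assumed STD of 10, a single reminder lands at turn 9, keeping the prohibitions inside the model's recent context through turn 16, the depth at which the study observed compliance bottoming out.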