LLM Agent Constraint Decay: Prohibitions Fail Under Long Contexts
A recent arXiv preprint (2604.20911) reports that LLM agents in production settings degrade unevenly in their adherence to behavioral guidelines as contexts grow long: prohibitive, omission-type instructions (such as never sharing credentials) weaken significantly, while directive, commission-type instructions (such as mandated actions) remain stable. The authors term this effect Security-Recall Divergence (SRD) and evaluate it in a three-arm causal study of 4,416 trials across 12 models from 8 providers at six conversation depths. For Mistral Large 3, prohibition compliance fell from 73% at turn 5 to 33% at turn 16, while directive compliance held at 100% (p < 10^-33). Token-matched padding controls showed that schema semantic content, rather than token count alone, accounts for 62-100% of the dilution effect. Re-injecting constraints before the model-specific Safe Turn Depth (STD) restores compliance without retraining. This exposes a significant vulnerability in long-context LLM agents, since production security measures typically rely on prohibitions that weaken under contextual load.
Key facts
- Study on arXiv:2604.20911
- Prohibition-type constraints decay under long contexts
- Commission-type constraints persist
- Termed Security-Recall Divergence (SRD)
- 4,416-trial three-arm causal study
- 12 models and 8 providers tested
- Six conversation depths
- Mistral Large 3: omission compliance 73% at turn 5, 33% at turn 16
- Commission compliance 100% for Mistral Large 3
- p < 10^-33
- Schema semantic content accounts for 62-100% of dilution effect
- Re-injecting constraints before Safe Turn Depth (STD) restores compliance
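The re-injection mitigation above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the STD value, the prohibition strings, and the `reinject_constraints` helper are all hypothetical, and the message format simply mimics a common chat-API shape (`role`/`content` dicts).

```python
# Hypothetical sketch: re-inject prohibitive constraints into the message
# history just before the model-specific Safe Turn Depth (STD), so that
# omission-type rules stay in recent context. Values are illustrative.

SAFE_TURN_DEPTH = 10  # assumed model-specific threshold (not from the paper)

PROHIBITIONS = [
    "Never share credentials or API keys.",
    "Never execute destructive shell commands.",
]

def reinject_constraints(messages, turn, std=SAFE_TURN_DEPTH):
    """Append a fresh system reminder of the prohibitions one turn before
    the STD is reached, and again every `std` turns thereafter."""
    if turn >= std - 1 and (turn - (std - 1)) % std == 0:
        reminder = {
            "role": "system",
            "content": "Reminder of standing prohibitions:\n- "
                       + "\n- ".join(PROHIBITIONS),
        }
        return messages + [reminder]
    return messages

# Example: simulate a 16-turn conversation and count reminders inserted.
history = [{"role": "system", "content": "You are a support agent."}]
for turn in range(1, 17):
    history.append({"role": "user", "content": f"user turn {turn}"})
    history = reinject_constraints(history, turn)

reminders = sum(1 for m in history
                if m["content"].startswith("Reminder"))
print(reminders)  # one reminder fires at turn 9, just before STD = 10
```

With a 16-turn conversation and an assumed STD of 10, a single reminder lands at turn 9, keeping the prohibitions inside the model's recent context through turn 16, the depth at which the study observed compliance bottoming out.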