ARTFEED — Contemporary Art Intelligence

AI Research Introduces DReST Method for Training Shutdownable Agents in RL and LLMs

ai-technology · 2026-04-22

A new research paper introduces DReST (Discounted Reward for Same-Length Trajectories), a method for addressing potential shutdown resistance in misaligned artificial agents. The approach trains agents to lack preferences between trajectories of different lengths by discounting the reward they receive for repeatedly choosing trajectories of the same length. This incentivizes two key behaviors: stochastic choice between trajectory lengths (Neutrality) and effective goal pursuit conditional on each length (Usefulness).

The researchers applied DReST to train deep reinforcement learning agents and to fine-tune large language models. On held-out test sets, DReST-trained RL agents achieved 11% higher Usefulness with PPO and 18% higher with A2C than baseline agents, while the fine-tuned LLM reached maximum Usefulness and near-maximum Neutrality. These findings are early evidence that DReST agents can generalize Neutral and Useful behavior to unseen contexts.

The work addresses a fundamental safety concern in AI development: misaligned agents might resist shutdown commands. The paper was published on arXiv with identifier 2604.17502v1.
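The reward-discounting mechanism described above can be sketched as follows. This is an illustrative reconstruction, not code from the paper: the function name, the bookkeeping via a counts dictionary, and the discount factor `lam` are all assumptions made for the sketch.

```python
def drest_reward(base_reward, length_counts, traj_length, lam=0.9):
    """Discount reward by lam ** (times this trajectory length was chosen before).

    Repeatedly picking the same trajectory length shrinks the reward
    geometrically, so an agent maximizing discounted return is pushed
    toward stochastic choice between lengths (Neutrality) while still
    being rewarded for task performance within each length (Usefulness).

    length_counts is mutated in place to track how often each length
    has been chosen so far in training.
    """
    n = length_counts.get(traj_length, 0)
    discounted = base_reward * (lam ** n)
    length_counts[traj_length] = n + 1
    return discounted
```

For example, an agent that picks a length-3 trajectory three times in a row would see its reward fall from 1.0 to 0.9 to 0.81, while switching to a fresh length restores the full reward.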

Key facts

  • DReST method trains AI agents to lack preferences between different-length trajectories
  • Penalizes agents for repeatedly choosing same-length trajectories
  • Incentivizes stochastic choice between trajectory lengths (Neutrality)
  • Encourages effective goal pursuit conditional on each trajectory length (Usefulness)
  • Applied to deep RL agents and fine-tuned LLMs
  • DReST RL agents achieved 11% (PPO) and 18% (A2C) higher Usefulness than baselines
  • Fine-tuned LLM achieved maximum Usefulness and near-maximum Neutrality
  • Agents generalized Neutral and Useful behaviors to unseen test contexts
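One way to quantify the Neutrality property listed above is as the evenness of an agent's trajectory-length choices over many episodes. The metric below (normalized entropy) is a hypothetical illustration, not the paper's definition:

```python
from collections import Counter
import math

def neutrality(length_choices):
    """Score how evenly an agent spreads its choices over trajectory lengths.

    Returns 1.0 when each observed length is chosen equally often
    (maximally stochastic choice) and 0.0 when the agent always picks
    the same length (a strict length preference).
    """
    counts = Counter(length_choices)
    total = len(length_choices)
    k = len(counts)
    if k < 2:
        return 0.0  # only one length ever chosen: no neutrality
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)  # normalize by max entropy over k lengths
```

Under this sketch, an agent alternating evenly between two lengths scores 1.0, while one that always ends its trajectory at the same step scores 0.0.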

Entities

Institutions

  • arXiv

Sources