ARTFEED — Contemporary Art Intelligence

AI Research Introduces DReST Method for Training Shutdownable Agents in RL and LLMs

ai-technology · 2026-04-22

A new research paper introduces DReST (Discounted Reward for Same-Length Trajectories), a method for addressing potential shutdown resistance in misaligned artificial agents. The approach trains agents to lack preferences between trajectories of different lengths by discounting the reward they receive for repeatedly choosing trajectories of the same length. This incentivizes two key behaviors: stochastic choice between trajectory lengths (Neutrality) and effective goal pursuit conditional on each length (Usefulness).

The researchers applied DReST to train deep reinforcement learning agents and to fine-tune large language models. On held-out test sets, DReST-trained RL agents achieved 11% higher Usefulness with PPO and 18% higher with A2C than baseline agents, while the fine-tuned LLM reached maximum Usefulness and near-maximum Neutrality. These findings are early evidence that DReST agents can generalize Neutral and Useful behavior to unseen contexts.

The work addresses a fundamental safety concern in AI development: misaligned agents might resist shutdown commands. The paper was published on arXiv with identifier 2604.17502v1.
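The reward-discounting mechanism described above can be sketched as follows. This is an illustrative reconstruction, not code from the paper: the function name, the bookkeeping via a counts dictionary, and the discount factor `lam` are all assumptions made for the sketch.

```python
def drest_reward(base_reward, length_counts, traj_length, lam=0.9):
    """Discount reward by lam ** (times this trajectory length was chosen before).

    Repeatedly picking the same trajectory length shrinks the reward
    geometrically, so an agent maximizing discounted return is pushed
    toward stochastic choice between lengths (Neutrality) while still
    being rewarded for task performance within each length (Usefulness).

    length_counts is mutated in place to track how often each length
    has been chosen so far in training.
    """
    n = length_counts.get(traj_length, 0)
    discounted = base_reward * (lam ** n)
    length_counts[traj_length] = n + 1
    return discounted
```

For example, an agent that picks a length-3 trajectory three times in a row would see its reward fall from 1.0 to 0.9 to 0.81, while switching to a fresh length restores the full reward.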

Key facts

  • DReST method trains AI agents to lack preferences between different-length trajectories
  • Penalizes agents for repeatedly choosing same-length trajectories
  • Incentivizes stochastic choice between trajectory lengths (Neutrality)
  • Encourages effective goal pursuit conditional on each trajectory length (Usefulness)
  • Applied to deep RL agents and fine-tuned LLMs
  • DReST RL agents achieved 11% (PPO) and 18% (A2C) higher Usefulness than baselines
  • Fine-tuned LLM achieved maximum Usefulness and near-maximum Neutrality
  • Agents generalized Neutral and Useful behaviors to unseen test contexts
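One way to quantify the Neutrality property listed above is as the evenness of an agent's trajectory-length choices over many episodes. The metric below (normalized entropy) is a hypothetical illustration, not the paper's definition:

```python
from collections import Counter
import math

def neutrality(length_choices):
    """Score how evenly an agent spreads its choices over trajectory lengths.

    Returns 1.0 when each observed length is chosen equally often
    (maximally stochastic choice) and 0.0 when the agent always picks
    the same length (a strict length preference).
    """
    counts = Counter(length_choices)
    total = len(length_choices)
    k = len(counts)
    if k < 2:
        return 0.0  # only one length ever chosen: no neutrality
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)  # normalize by max entropy over k lengths
```

Under this sketch, an agent alternating evenly between two lengths scores 1.0, while one that always ends its trajectory at the same step scores 0.0.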

Entities

Institutions

  • arXiv

Sources