BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning

ai-technology · 2026-05-09

A new framework called BehaviorGuard proposes an online, trigger-agnostic defense against backdoor attacks in deep reinforcement learning (DRL). Unlike existing methods, which rely on reward anomalies and model fine-tuning, BehaviorGuard detects backdoors by monitoring shifts in the policy's action distributions, without requiring knowledge of the trigger. It identifies suspicious behavior in high-quantile regions and distribution tails, then suppresses backdoor actions at runtime. The approach aims to lower defense costs and to improve robustness against complex trigger patterns. The paper is available on arXiv under ID 2605.05977.

Key facts

BehaviorGuard is an online behavior-based backdoor detection and mitigation framework for DRL.
It is trigger-agnostic, detecting backdoors via shifts in action distributions.
Backdoored policies leave detectable traces in high-quantile regions and distribution tails.
The framework suppresses backdoor actions at runtime.
It aims to overcome limitations of current defenses that rely on reward anomalies and fine-tuning.
The paper is available on arXiv with ID 2605.05977.
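To make the idea of tail-based monitoring concrete, here is a minimal sketch under stated assumptions. It is not the paper's algorithm. It assumes a discrete action space, a policy that exposes per-step action probabilities, and a set of clean (trigger-free) rollouts for calibration. The per-step score (KL divergence from the mean clean action distribution), the sliding-window alarm rule, and the suppression heuristic (mask the action that gained the most probability mass relative to the reference) are illustrative stand-ins. The names `TailShiftGuard`, `kl_divergence`, `quantile`, `window`, and `ratio` are all hypothetical.

```python
from collections import deque

import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete action distributions."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))


class TailShiftGuard:
    """Hypothetical sketch, not BehaviorGuard itself: flags runtime shifts in a
    policy's action distribution and suppresses the suspicious action."""

    def __init__(self, clean_dists, quantile=0.99, window=20, ratio=5.0):
        clean = np.asarray(clean_dists, dtype=float)
        # Reference: mean action distribution over clean, trigger-free steps.
        self.reference = clean.mean(axis=0)
        scores = [kl_divergence(p, self.reference) for p in clean]
        # Upper-tail threshold on the per-step divergence score.
        self.threshold = float(np.quantile(scores, quantile))
        # On clean data, roughly (1 - quantile) of steps exceed the threshold.
        self.expected_rate = 1.0 - quantile
        self.ratio = ratio
        self.hits = deque(maxlen=window)

    def step(self, probs):
        """Return (action, alarm) for one policy output `probs`."""
        probs = np.asarray(probs, dtype=float)
        self.hits.append(kl_divergence(probs, self.reference) > self.threshold)
        rate = sum(self.hits) / len(self.hits)
        # Alarm when tail hits in a full window far exceed the clean rate.
        alarm = len(self.hits) == self.hits.maxlen and rate > self.ratio * self.expected_rate
        action = int(np.argmax(probs))
        if alarm and self.hits[-1]:
            # Suppress the action that gained the most mass vs. the reference,
            # then act greedily on the remaining actions.
            masked = probs.copy()
            masked[np.argmax(probs - self.reference)] = -np.inf
            action = int(np.argmax(masked))
        return action, alarm
```

In use, the guard is calibrated once on clean rollouts and then called on every policy output; because it only inspects action probabilities, it never needs to see or model the trigger itself. The window and ratio trade detection delay against false alarms on rare but benign behavior.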
