ARTFEED — Contemporary Art Intelligence

Neuro-Symbolic PPO Enhances Deep Reinforcement Learning Efficiency

ai-technology · 2026-04-30

A team of researchers has introduced a neuro-symbolic extension of Proximal Policy Optimization (PPO) that transfers partial logical policy specifications from simpler task instances to accelerate learning on harder ones. They present two integrations: H-PPO-Product, which biases the action distribution at sampling time, and H-PPO-SymLoss, which adds a symbolic regularization term to the PPO loss. On the OfficeWorld, WaterWorld, and DoorKey benchmarks, both methods learned significantly faster and reached higher returns at convergence than PPO and a Reward Machine baseline, even when the symbolic knowledge was incomplete. The work addresses the data inefficiency of deep reinforcement learning (DRL), particularly under sparse rewards and long planning horizons.
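
To make the product-style integration concrete, here is a minimal sketch assuming a discrete action space and a PyTorch actor. The function name, the epsilon floor, and the representation of the symbolic policy as a per-state preference vector are illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def product_action_dist(policy_logits, symbolic_prefs, eps=1e-2):
        # policy_logits:  (batch, n_actions) raw outputs of the PPO actor.
        # symbolic_prefs: (batch, n_actions) nonnegative action weights
        #                 derived from the partial logical specification.
        neural_probs = F.softmax(policy_logits, dim=-1)
        # The epsilon floor keeps actions the specification disallows or
        # does not cover at nonzero probability, so incomplete symbolic
        # knowledge cannot permanently mask an action.
        combined = neural_probs * (symbolic_prefs + eps)
        combined = combined / combined.sum(dim=-1, keepdim=True)
        return torch.distributions.Categorical(probs=combined)

During rollouts the agent would sample from this combined distribution rather than the raw actor output; the PPO update itself is unchanged in this variant.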

Key facts

  • Proposes neuro-symbolic extension of Proximal Policy Optimization (PPO)
  • Transfers partial logical policy specifications from easier to harder instances
  • Two integrations: H-PPO-Product (biases the sampling-time action distribution; sketched above) and H-PPO-SymLoss (symbolic regularization; see the loss sketch after this list)
  • Evaluated on OfficeWorld, WaterWorld, and DoorKey benchmarks
  • Shows faster learning and higher return at convergence than PPO and Reward Machine baseline
  • Works under imperfect symbolic knowledge
  • Addresses data inefficiency and sparse-reward domains in DRL
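
The SymLoss variant instead leaves sampling alone and augments the PPO objective. As one plausible instantiation (the summary does not give the paper's exact regularizer), the sketch below adds a KL penalty pulling the learned policy toward the symbolic distribution; sym_coef is a hypothetical weighting coefficient.

    import torch

    def ppo_symloss(new_logits, old_log_probs, actions, advantages,
                    symbolic_probs, clip_eps=0.2, sym_coef=0.1):
        dist = torch.distributions.Categorical(logits=new_logits)
        new_log_probs = dist.log_prob(actions)

        # Standard PPO clipped surrogate objective.
        ratio = torch.exp(new_log_probs - old_log_probs)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        policy_loss = -torch.min(ratio * advantages,
                                 clipped * advantages).mean()

        # Symbolic regularizer: a KL term pulls the learned policy toward
        # the logical specification. symbolic_probs should be smoothed
        # (e.g., with an epsilon floor) when the specification is partial,
        # so the KL stays finite.
        sym_dist = torch.distributions.Categorical(probs=symbolic_probs)
        sym_loss = torch.distributions.kl_divergence(dist, sym_dist).mean()

        return policy_loss + sym_coef * sym_loss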
