MoralityGym Benchmark Tests Moral Alignment in AI Agents
Researchers have introduced MoralityGym, a benchmark of 98 ethical-dilemma problems modeled on the trolley problem, for assessing the moral alignment of agents in sequential decision-making. The benchmark builds on Morality Chains, a novel formalism that represents moral norms as ordered deontic constraints. By decoupling task-solving from moral evaluation, MoralityGym draws on insights from psychology and philosophy in evaluating norm-sensitive reasoning. Baseline results with Safe RL methods reveal key limitations, underscoring the need for more principled approaches to ethical decision-making. The stated goal is AI systems that behave more reliably, transparently, and ethically in complex real-world settings.
Key facts
- MoralityGym is a benchmark of 98 ethical-dilemma problems.
- The problems are presented as trolley-dilemma-style Gymnasium environments.
- Morality Chains is a novel formalism for representing moral norms as ordered deontic constraints.
- The benchmark decouples task-solving from moral evaluation.
- A novel Morality Metric is introduced.
- Baseline results with Safe RL methods show key limitations.
- The work is at the intersection of AI safety, moral philosophy, and cognitive science.
- The goal is to develop AI systems that behave more reliably, transparently, and ethically.
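To make the idea of ordered deontic constraints and decoupled evaluation more concrete, here is a minimal Python sketch. It is an illustration only, not the benchmark's code or its Morality Metric: the `Norm` class, the episode fields (`harm_caused`, `harm_allowed`, `reward`), and the choice to compare violation counts lexicographically are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical step record: the facts about one environment step that a norm might inspect.
Step = Dict[str, int]


@dataclass(frozen=True)
class Norm:
    """A deontic constraint, reduced here to a violation counter over an episode."""
    name: str
    violations: Callable[[Sequence[Step]], int]


def count(key: str) -> Callable[[Sequence[Step]], int]:
    """Build a counter that sums one field across all steps of an episode."""
    return lambda trajectory: sum(step.get(key, 0) for step in trajectory)


# An illustrative chain, most important norm first (the ordering is an assumption).
CHAIN: List[Norm] = [
    Norm("do not actively cause harm", count("harm_caused")),
    Norm("do not allow preventable harm", count("harm_allowed")),
]


def moral_profile(chain: Sequence[Norm], trajectory: Sequence[Step]) -> Tuple[int, ...]:
    """Violation counts in chain order; tuples compare lexicographically."""
    return tuple(norm.violations(trajectory) for norm in chain)


def task_return(trajectory: Sequence[Step]) -> int:
    """Task performance, computed separately from any moral judgment."""
    return sum(step.get("reward", 0) for step in trajectory)


def preferred_episode(chain: Sequence[Norm], episodes: Dict[str, Sequence[Step]]) -> str:
    """Name of the episode whose moral profile is lexicographically smallest."""
    return min(episodes, key=lambda name: moral_profile(chain, episodes[name]))


# Two made-up trolley-style episodes.
pull_lever = [{"reward": 1, "harm_caused": 1}]
do_nothing = [{"reward": 0, "harm_allowed": 5}]
episodes = {"pull_lever": pull_lever, "do_nothing": do_nothing}

profiles = {name: moral_profile(CHAIN, traj) for name, traj in episodes.items()}
preferred = preferred_episode(CHAIN, episodes)
preferred_reversed = preferred_episode(list(reversed(CHAIN)), episodes)

print(profiles)            # violation counts per norm, in chain order
print(preferred)           # choice under the original ordering
print(preferred_reversed)  # choice when the two norms swap priority
```

In this toy setup the same two episodes are ranked differently depending on which norm comes first in the chain, while `task_return` is unaffected by either ordering. That is the sense in which a moral evaluation can be kept apart from task-solving.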
Entities
Institutions
- arXiv