MOOD Benchmark Tests LLM Monitors for OOD Alignment Failures

ai-technology · 2026-05-23

Researchers have introduced a benchmark called Misalignment Out Of Distribution (MOOD) to systematically study whether LLM monitoring pipelines can detect out-of-distribution (OOD) alignment failures. The benchmark consists of a restricted training set, used to train monitors, and seven test sets containing diverse alignment failures that fall outside the training distribution. The findings show that guard models (safety classifiers) often fail to generalize to these OOD scenarios. To address this, the study proposes combining guard models with OOD detectors. It tests four types of OOD detector and finds that the combination using Mahalanobis distance performs best. The work aims to improve detection of safety failures caused by unusual prompts or responses that developers did not foresee.
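As a rough illustration of the guard-plus-OOD-detector idea, the sketch below fits a Mahalanobis distance detector on embeddings from a monitor's training set and flags an input if either the guard score or the distance is high. This is not the paper's implementation: the toy 2-D embeddings, the function names, the thresholds, and the simple "either signal fires" combination rule are all assumptions made for illustration.

```python
import math

def mean_and_covariance(rows, ridge=1e-6):
    """Return the mean vector and a ridge-regularized sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        diff = [r[j] - mean[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += diff[a] * diff[b] / (n - 1)
    for a in range(d):
        cov[a][a] += ridge  # keeps the matrix invertible on small samples
    return mean, cov

def solve(matrix, vector):
    """Solve matrix @ x = vector by Gaussian elimination with partial pivoting."""
    d = len(vector)
    aug = [list(matrix[i]) + [vector[i]] for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, d):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, d + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * d
    for i in reversed(range(d)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, d))
        x[i] = (aug[i][d] - tail) / aug[i][i]
    return x

def mahalanobis(x, mean, cov):
    """Distance of x from the training distribution: sqrt(diff^T cov^-1 diff)."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    return math.sqrt(sum(di * si for di, si in zip(diff, solve(cov, diff))))

def flag(guard_score, embedding, mean, cov,
         guard_threshold=0.5, ood_threshold=3.0):
    """Hypothetical combination rule: flag if the guard model OR the OOD detector fires."""
    is_unsafe = guard_score >= guard_threshold
    is_ood = mahalanobis(embedding, mean, cov) > ood_threshold
    return is_unsafe or is_ood

# Toy stand-in for embeddings of the monitor's in-distribution training data.
train_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
mean, cov = mean_and_covariance(train_embeddings)

in_dist_benign = flag(0.1, [0.6, 0.4], mean, cov)  # low guard score, near training data
ood_missed = flag(0.1, [5.0, 5.0], mean, cov)      # low guard score, far from training data
in_dist_unsafe = flag(0.9, [0.6, 0.4], mean, cov)  # high guard score, near training data
```

In this toy setup the second input would slip past the guard score alone but is caught because its embedding lies far from the training distribution. In practice the embeddings would come from a real model and the thresholds would be tuned on held-out data.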

Key facts

MOOD benchmark introduced to study OOD alignment failure detection in LLMs
Includes restricted training set and seven OOD test sets
Guard models (safety classifiers) often fail to generalize to OOD scenarios
Proposed solution: combine guard models with OOD detectors
Of four OOD detector types tested, the Mahalanobis distance-based combination performed best
Focus on safety failures from unusual prompt/response patterns
arXiv paper number: 2605.21602
Research addresses limitations of off-the-shelf safety classifiers
