MOOD Benchmark Tests LLM Monitors for OOD Alignment Failures
Researchers have introduced a benchmark called Misalignment Out Of Distribution (MOOD) to systematically study whether LLM monitoring pipelines can detect out-of-distribution (OOD) alignment failures. The benchmark consists of a restricted training set, used to train monitors, and seven test sets containing diverse alignment failures that fall outside the training distribution. The findings show that guard models (safety classifiers) often fail to generalize to these OOD scenarios. To address this, the study proposes pairing guard models with OOD detectors. It tests four types of OOD detector and finds that the combination using Mahalanobis distance performs best. The work aims to improve detection of safety failures caused by unusual prompts or responses that developers did not foresee.
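To make the idea of pairing a guard model with a Mahalanobis-distance OOD detector concrete, here is a minimal sketch. It is not the paper's implementation: the summary does not say how the two signals are combined, so the OR rule, the thresholds, the 4-dimensional random "embeddings", and the names MahalanobisDetector and combined_flag are all assumptions made for illustration. The detector scores an input by its distance from the training mean, scaled by the training covariance, and the monitor flags an input when either the guard probability or that distance is high.

```python
import numpy as np


class MahalanobisDetector:
    """Hypothetical OOD detector: distance of an embedding from the training distribution."""

    def fit(self, train_embeddings: np.ndarray, ridge: float = 1e-3) -> "MahalanobisDetector":
        # Estimate mean and covariance of the in-distribution embeddings.
        self.mean_ = train_embeddings.mean(axis=0)
        cov = np.cov(train_embeddings, rowvar=False)
        # A small ridge keeps the covariance invertible (an assumption, not from the paper).
        cov = cov + ridge * np.eye(cov.shape[0])
        self.precision_ = np.linalg.inv(cov)
        return self

    def score(self, embeddings: np.ndarray) -> np.ndarray:
        # Mahalanobis distance: sqrt((x - mu)^T Sigma^{-1} (x - mu)) for each row x.
        centered = np.atleast_2d(embeddings) - self.mean_
        squared = np.einsum("ij,jk,ik->i", centered, self.precision_, centered)
        return np.sqrt(np.maximum(squared, 0.0))


def combined_flag(guard_prob, ood_score, guard_threshold, ood_threshold):
    """Assumed combination rule: flag if the guard fires OR the input looks OOD."""
    guard_hit = np.asarray(guard_prob) >= guard_threshold
    ood_hit = np.asarray(ood_score) >= ood_threshold
    return guard_hit | ood_hit


# Synthetic stand-in for embeddings of the restricted training set.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 4))
detector = MahalanobisDetector().fit(train)

# Calibrate the OOD threshold on training scores (99th percentile, an arbitrary choice).
ood_threshold = np.quantile(detector.score(train), 0.99)

# Two test inputs: one near the training mean, one far from it.
test_inputs = np.vstack([np.zeros((1, 4)), np.full((1, 4), 8.0)])
scores = detector.score(test_inputs)

# Suppose the guard model is confident both inputs are safe (it misses the OOD failure).
guard_prob = np.array([0.05, 0.05])
flags = combined_flag(guard_prob, scores, guard_threshold=0.5, ood_threshold=ood_threshold)
print(flags)
```

In this toy setup the guard model alone would pass both inputs, while the combined monitor flags the second one because it lies far from the training distribution. In practice the embeddings would come from a model's hidden states or a sentence encoder, and the thresholds would be tuned on held-out data.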
Key facts
- MOOD benchmark introduced to study OOD alignment failure detection in LLMs
- Includes restricted training set and seven OOD test sets
- Guard models fail to generalize to OOD scenarios
- Proposed solution: combine guard models with OOD detectors
- Four OOD detector types tested; the Mahalanobis distance-based combination performed best
- Focus on safety failures from unusual prompt/response patterns
- arXiv paper number: 2605.21602
- Research addresses limitations of off-the-shelf safety classifiers
Entities
Institutions
- arXiv