Reasoning Safety Taxonomy for Large Language Models
A recent arXiv study establishes reasoning safety as a security dimension distinct from content safety in large language models. The researchers present a taxonomy of nine unsafe reasoning behaviors and conduct a large-scale prevalence analysis, annotating more than 4,000 reasoning chains drawn from benign benchmarks and four state-of-the-art reasoning attacks. The analysis empirically demonstrates that all nine unsafe behaviors occur in existing models.
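To make the prevalence analysis concrete, the sketch below shows how annotated reasoning chains could be tallied per behavior category. It is a minimal illustration, not the authors' pipeline: the behavior labels, record layout, and `prevalence` helper are hypothetical placeholders, since this summary does not list the paper's actual nine categories.

```python
from collections import Counter

# Hypothetical placeholder labels for the nine unsafe reasoning
# behaviors; the paper's actual category names are not given here.
UNSAFE_BEHAVIORS = [f"behavior_{i}" for i in range(1, 10)]

# Each annotated chain records its source (a benign benchmark or a
# reasoning attack) and the unsafe behaviors annotators observed.
annotations = [
    {"source": "benign_benchmark", "behaviors": ["behavior_3"]},
    {"source": "reasoning_attack", "behaviors": ["behavior_1", "behavior_7"]},
    {"source": "benign_benchmark", "behaviors": []},
]

def prevalence(chains):
    """Return the fraction of chains exhibiting each unsafe behavior."""
    counts = Counter(b for chain in chains for b in set(chain["behaviors"]))
    total = len(chains)
    return {b: counts[b] / total for b in UNSAFE_BEHAVIORS}

print(prevalence(annotations))
```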
Key facts
- arXiv:2603.25412v2
- Introduces reasoning safety as a security aspect distinct from content safety
- Nine unsafe reasoning behaviors identified
- Over 4,000 reasoning chains annotated
- Benign benchmarks used
- Four state-of-the-art reasoning attacks tested
- All nine behaviors empirically demonstrated