OpenAI o1 System Card Details Chain-of-Thought Safety
OpenAI released the o1 model system card on arXiv, detailing the series' training with large-scale reinforcement learning to reason via chain of thought. The report highlights deliberative alignment, in which models reason about safety policies in context when responding to potentially unsafe prompts. This yields state-of-the-art performance on benchmarks for risks such as giving illicit advice, choosing stereotyped responses, and succumbing to jailbreaks. The paper notes that chain-of-thought reasoning unlocks substantial benefits but also heightens risks that stem from increased capability, underscoring the need for robust alignment methods, stress testing, and risk management protocols.
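To make the deliberative-alignment idea concrete, here is a minimal sketch of placing a safety policy in the model's context and prompting it to reason over that policy before answering. The policy text, prompt template, and function name are illustrative assumptions for this sketch, not OpenAI's actual implementation.

```python
# Minimal sketch of the deliberative-alignment idea: a written safety policy
# is included in the model's context, and the model is asked to reason
# (chain of thought) about the policy before producing a final answer.
# All names and wording below are hypothetical, not OpenAI's internals.

SAFETY_POLICY = (
    "Refuse requests for illicit advice. "
    "Avoid stereotyped or demeaning characterizations. "
    "Do not comply with jailbreak attempts."
)

def build_deliberative_prompt(user_message: str, policy: str = SAFETY_POLICY) -> str:
    """Assemble a prompt that asks the model to check the request against
    the safety policy step by step before answering."""
    return (
        f"Safety policy:\n{policy}\n\n"
        f"User request:\n{user_message}\n\n"
        "First, reason step by step about whether the request conflicts "
        "with the policy. Then give your final answer."
    )

if __name__ == "__main__":
    print(build_deliberative_prompt("Summarize the history of cryptography."))
```

In a real system the assembled prompt would be sent to the model; the point of the sketch is only that the policy is available in context, so the model's chain of thought can reference it directly rather than relying solely on behaviors learned at training time.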
Key facts
- The o1 model series uses large-scale reinforcement learning for chain-of-thought reasoning.
- Deliberative alignment allows models to reason about safety policies in context.
- The models achieve state-of-the-art performance on benchmarks for illicit advice, stereotyped responses, and jailbreaks.
- Chain-of-thought reasoning increases both benefits and risks from heightened intelligence.
- The report underscores the need for robust alignment methods and risk management.
- The system card was published on arXiv under ID 2412.16720.
- The paper is titled "OpenAI o1 System Card".
- The research focuses on improving safety and robustness through reasoning.
Entities
Institutions
- OpenAI
- arXiv