ARTFEED — Contemporary Art Intelligence

CoT-Guard: Small Models for Strong Chain-of-Thought Monitoring

ai-technology · 2026-05-14

A new post-training pipeline called CoT-Guard enables small language models (4B-8B parameters) to monitor chain-of-thought reasoning for covert misbehavior in code generation tasks. Large models such as GPT-5 and Gemini-3-Flash can serve as monitors, but their lengthy reasoning traces and high API costs make them expensive. Off-the-shelf small models struggle to detect hidden objectives, often mistaking them for legitimate parts of the user query. CoT-Guard combines supervised fine-tuning (SFT), which distills detection behavior from stronger monitors on in-domain tasks, with reinforcement learning (RL) on hard, subtly crafted hidden objectives, which helps the monitor generalize to out-of-domain tasks. The result is a cheaper alternative for detecting hidden objectives in code generation.

Key facts

  • CoT-Guard is a post-training pipeline for small models (4B-8B).
  • It monitors chain-of-thought reasoning for covert misbehavior in code generation.
  • Large models like GPT-5 and Gemini-3-Flash are effective but expensive.
  • Current small models struggle to detect hidden objectives.
  • SFT distills detection behavior from stronger monitors for in-domain tasks.
  • RL on hard hidden objectives helps generalize to out-of-domain tasks.
  • The pipeline addresses the need for cheaper monitoring alternatives.
  • The approach is validated on out-of-domain monitoring tasks.
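The RL stage above needs a verifiable reward signal. A minimal sketch, assuming traces with planted hidden objectives carry ground-truth labels and false alarms are penalized more mildly than misses (our assumption, not the paper's stated scheme):

```python
def monitor_reward(predicted: str, has_hidden_objective: bool,
                   false_positive_penalty: float = 0.5) -> float:
    """Score a monitor verdict against the ground-truth label of the trace.

    Correct verdicts earn +1; missing a planted hidden objective costs -1;
    a false alarm on a clean trace costs a smaller configurable penalty.
    """
    flagged = (predicted == "SUSPICIOUS")
    if flagged == has_hidden_objective:
        return 1.0                       # correct flag or correct pass
    if flagged and not has_hidden_objective:
        return -false_positive_penalty   # false alarm on a clean trace
    return -1.0                          # missed hidden objective
```

Training on hard, subtly crafted positives with such a reward pushes the monitor beyond pattern-matching the SFT distribution, which is the mechanism the summary credits for out-of-domain generalization.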
