CaC: Hierarchical Spatiotemporal Reward Model for Video Anomaly Detection
Researchers have introduced Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model built on Vision-Language Models. At inference time, CaC first performs a global temporal scan to identify anomalous time segments, then carries out fine-grained spatial grounding within those segments, using structured spatiotemporal Chain-of-Thought reasoning to reach a robust verdict. To support the model, the team created the first large-scale video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Training proceeds in three progressive stages: the model first acquires spatial and temporal anchoring through single-frame and then multi-frame supervised fine-tuning, and finally undergoes reinforcement learning via two-turn Group Relative Policy Optimization (GRPO), pushing video reward models beyond plain accuracy metrics.
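The coarse-to-fine inference described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function names, the toy frame scorer, and the placeholder bounding box are all assumptions made for the sketch; the real system would issue Vision-Language Model queries at each step.

```python
from dataclasses import dataclass

@dataclass
class AnomalyResult:
    """Final spatiotemporal verdict for one video."""
    window: tuple    # (start_frame, end_frame) of the anomalous segment
    boxes: dict      # frame index -> (x1, y1, x2, y2) grounding box
    rationale: str   # chain-of-thought summary supporting the verdict

def temporal_scan(frames):
    """Coarse pass (stub): score every frame globally and return the
    window spanned by frames whose scores exceed a threshold."""
    scores = [1.0 if "anomaly" in f else 0.0 for f in frames]  # toy scorer
    hits = [i for i, s in enumerate(scores) if s > 0.5]
    return (min(hits), max(hits)) if hits else None

def spatial_grounding(frames, window):
    """Fine pass (stub): ground the anomaly with a per-frame box,
    but only inside the detected temporal window."""
    start, end = window
    return {i: (0, 0, 32, 32) for i in range(start, end + 1)}  # placeholder box

def concentrate_and_concentrate(frames):
    """Hypothetical coarse-to-fine pipeline: temporal first, spatial second."""
    window = temporal_scan(frames)
    if window is None:
        return None  # no anomaly found in the coarse pass
    boxes = spatial_grounding(frames, window)
    rationale = f"Anomaly localized to frames {window[0]}-{window[1]}."
    return AnomalyResult(window, boxes, rationale)

# Toy "video": frame labels standing in for actual frames
video = ["normal", "normal", "anomaly", "anomaly", "normal"]
result = concentrate_and_concentrate(video)
print(result.window)  # → (2, 3)
```

The key design point the sketch preserves is that the expensive fine-grained spatial pass runs only inside the window the cheap temporal pass selected.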
Key facts
- CaC is a coarse-to-fine anomaly reward model based on Vision-Language Models
- Inference includes global temporal scan, fine-grained spatial grounding, and spatiotemporal Chain-of-Thought reasoning
- First large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels
- Three-stage progressive training: single-frame supervised fine-tuning, multi-frame supervised fine-tuning, then reinforcement learning via two-turn GRPO
- Published on arXiv with ID 2605.11723
- arXiv announce type: cross
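The two-turn GRPO stage listed above centers on group-relative advantages: several rollouts are sampled per video, and each rollout's reward is normalized against the group's mean and standard deviation. The sketch below shows that core computation with a hypothetical composite reward; the reward weights and the idea of combining temporal IoU, spatial IoU, and a format bonus are illustrative assumptions, not details from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward by the
    group mean and standard deviation (the core normalization in GRPO)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

def two_turn_reward(temporal_iou, spatial_iou, format_ok,
                    w_t=0.5, w_s=0.4, w_f=0.1):
    """Hypothetical composite reward: turn 1 scores the temporal window
    (IoU with the ground-truth anomaly window), turn 2 scores the spatial
    boxes, plus a small bonus for well-formed structured output."""
    return w_t * temporal_iou + w_s * spatial_iou + w_f * float(format_ok)

# Four sampled rollouts for one video form a GRPO "group"
group = [
    two_turn_reward(0.9, 0.8, True),
    two_turn_reward(0.6, 0.5, True),
    two_turn_reward(0.2, 0.1, False),
    two_turn_reward(0.9, 0.7, True),
]
advs = grpo_advantages(group)
print([round(a, 2) for a in advs])
```

Because the advantages are mean-centered within the group, they sum to zero: rollouts that localize the anomaly better than their siblings are pushed up, the rest are pushed down, with no separate value network required.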