CaC: Hierarchical Spatiotemporal Reward Model for Video Anomaly Detection
Researchers have introduced Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model built on Vision-Language Models. At inference time, CaC first performs a global temporal scan to identify anomalous time segments, then carries out fine-grained spatial grounding within those segments, using structured spatiotemporal Chain-of-Thought reasoning to reach a robust verdict. To support the model, the team created the first large-scale video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Training proceeds in three progressive stages: the model first acquires spatial and temporal anchoring through single-frame and then multi-frame supervised fine-tuning, and finally undergoes reinforcement learning via two-turn Group Relative Policy Optimization (GRPO), pushing video reward models beyond plain accuracy metrics.
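The coarse-to-fine inference described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function names, the toy frame scorer, and the placeholder bounding box are all assumptions made for the sketch; the real system would issue Vision-Language Model queries at each step.

```python
from dataclasses import dataclass

@dataclass
class AnomalyResult:
    """Final spatiotemporal verdict for one video."""
    window: tuple    # (start_frame, end_frame) of the anomalous segment
    boxes: dict      # frame index -> (x1, y1, x2, y2) grounding box
    rationale: str   # chain-of-thought summary supporting the verdict

def temporal_scan(frames):
    """Coarse pass (stub): score every frame globally and return the
    window spanned by frames whose scores exceed a threshold."""
    scores = [1.0 if "anomaly" in f else 0.0 for f in frames]  # toy scorer
    hits = [i for i, s in enumerate(scores) if s > 0.5]
    return (min(hits), max(hits)) if hits else None

def spatial_grounding(frames, window):
    """Fine pass (stub): ground the anomaly with a per-frame box,
    but only inside the detected temporal window."""
    start, end = window
    return {i: (0, 0, 32, 32) for i in range(start, end + 1)}  # placeholder box

def concentrate_and_concentrate(frames):
    """Hypothetical coarse-to-fine pipeline: temporal first, spatial second."""
    window = temporal_scan(frames)
    if window is None:
        return None  # no anomaly found in the coarse pass
    boxes = spatial_grounding(frames, window)
    rationale = f"Anomaly localized to frames {window[0]}-{window[1]}."
    return AnomalyResult(window, boxes, rationale)

# Toy "video": frame labels standing in for actual frames
video = ["normal", "normal", "anomaly", "anomaly", "normal"]
result = concentrate_and_concentrate(video)
print(result.window)  # → (2, 3)
```

The key design point the sketch preserves is that the expensive fine-grained spatial pass runs only inside the window the cheap temporal pass selected.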
Key facts
- CaC is a coarse-to-fine anomaly reward model based on Vision-Language Models
- Inference includes global temporal scan, fine-grained spatial grounding, and spatiotemporal Chain-of-Thought reasoning
- First large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels
- Three-stage progressive training: single-frame supervised fine-tuning, multi-frame supervised fine-tuning, then reinforcement learning via two-turn GRPO
- Published on arXiv with ID 2605.11723
- arXiv announce type: cross
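The two-turn GRPO stage listed above centers on group-relative advantages: several rollouts are sampled per video, and each rollout's reward is normalized against the group's mean and standard deviation. The sketch below shows that core computation with a hypothetical composite reward; the reward weights and the idea of combining temporal IoU, spatial IoU, and a format bonus are illustrative assumptions, not details from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward by the
    group mean and standard deviation (the core normalization in GRPO)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

def two_turn_reward(temporal_iou, spatial_iou, format_ok,
                    w_t=0.5, w_s=0.4, w_f=0.1):
    """Hypothetical composite reward: turn 1 scores the temporal window
    (IoU with the ground-truth anomaly window), turn 2 scores the spatial
    boxes, plus a small bonus for well-formed structured output."""
    return w_t * temporal_iou + w_s * spatial_iou + w_f * float(format_ok)

# Four sampled rollouts for one video form a GRPO "group"
group = [
    two_turn_reward(0.9, 0.8, True),
    two_turn_reward(0.6, 0.5, True),
    two_turn_reward(0.2, 0.1, False),
    two_turn_reward(0.9, 0.7, True),
]
advs = grpo_advantages(group)
print([round(a, 2) for a in advs])
```

Because the advantages are mean-centered within the group, they sum to zero: rollouts that localize the anomaly better than their siblings are pushed up, the rest are pushed down, with no separate value network required.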