CRPO: Counterfactual RL Improves Spatiotemporal Sensitivity in Video LLMs

ai-technology · 2026-05-23

A new reinforcement learning framework, Counterfactual Relational Policy Optimization (CRPO), aims to improve spatiotemporal sensitivity in video large language models (Video LLMs). Current Video LLMs often rely on shortcuts like single-frame cues and language priors rather than tracking video dynamics, a problem exacerbated by correctness-only rewards during RL post-training. CRPO addresses this by constructing counterfactual videos through horizontal flips and temporal reversals, training on both original and counterfactual branches, and introducing a Counterfactual Relation Reward (CRR) between them. The approach is detailed in a paper on arXiv (2605.21988).
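The two counterfactual transformations named above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the paper's implementation: the (T, H, W, C) frame layout and the names horizontal_flip, temporal_reverse, and make_counterfactuals are assumptions introduced here, and the Counterfactual Relation Reward is not shown because this summary does not specify how it is computed.

```python
import numpy as np


def horizontal_flip(video: np.ndarray) -> np.ndarray:
    """Mirror every frame left-to-right.

    Assumes a (T, H, W, C) layout: time, height, width, channels.
    """
    return video[:, :, ::-1, :]


def temporal_reverse(video: np.ndarray) -> np.ndarray:
    """Play the clip backwards by reversing the time axis."""
    return video[::-1]


def make_counterfactuals(video: np.ndarray) -> dict:
    """Return the original clip plus its two counterfactual branches.

    The dictionary keys are hypothetical labels chosen for this sketch.
    """
    if video.ndim != 4:
        raise ValueError("expected a video of shape (T, H, W, C)")
    return {
        "original": video,
        "hflip": horizontal_flip(video),
        "reverse": temporal_reverse(video),
    }


# Toy clip: 3 frames of 2x4 pixels with a single channel.
toy_video = np.arange(3 * 2 * 4 * 1).reshape(3, 2, 4, 1)
branches = make_counterfactuals(toy_video)
```

Both transformations leave single-frame appearance statistics largely intact while changing the answer to questions about left/right relations or event order. That is presumably why a model leaning on single-frame cues or language priors would tend to give the same answer on both branches.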

Key facts

Video LLMs often use shortcuts like single-frame cues and language priors.
Correctness-only rewards in RL post-training reinforce shortcut policies.
CRPO constructs counterfactual videos through horizontal flips and temporal reversals.
CRPO trains on original and counterfactual branches.
CRPO introduces a Counterfactual Relation Reward (CRR).
The paper is on arXiv with ID 2605.21988.
The method is called Counterfactual Relational Policy Optimization.
The goal is to improve spatiotemporal sensitivity.
