CRPO: Counterfactual RL Improves Spatiotemporal Sensitivity in Video LLMs
A new reinforcement learning framework, Counterfactual Relational Policy Optimization (CRPO), aims to improve spatiotemporal sensitivity in video large language models (Video LLMs). Current Video LLMs often rely on shortcuts like single-frame cues and language priors rather than tracking video dynamics, a problem exacerbated by correctness-only rewards during RL post-training. CRPO addresses this by constructing counterfactual videos through horizontal flips and temporal reversals, training on both original and counterfactual branches, and introducing a Counterfactual Relation Reward (CRR) between them. The approach is detailed in a paper on arXiv (2605.21988).
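The two counterfactual transformations named above can be illustrated with a minimal sketch. It assumes a video is a list of frames in time order, each frame a list of pixel rows. The function names `horizontal_flip` and `temporal_reverse` and the toy data are hypothetical and are not taken from the paper. The sketch covers only the video construction, not the training branches or the CRR.

```python
from typing import List

Frame = List[List[int]]  # one frame: rows of pixel values
Video = List[Frame]      # frames in time order


def horizontal_flip(video: Video) -> Video:
    """Mirror every frame left-to-right; frame order is unchanged."""
    return [[row[::-1] for row in frame] for frame in video]


def temporal_reverse(video: Video) -> Video:
    """Play the frames in reverse order; each frame's content is unchanged."""
    return [[row[:] for row in frame] for frame in reversed(video)]


# Toy video: a single bright pixel moving left to right across a 1x3 frame.
video: Video = [
    [[1, 0, 0]],
    [[0, 1, 0]],
    [[0, 0, 1]],
]

flipped = horizontal_flip(video)         # pixel now moves right to left
reversed_video = temporal_reverse(video)  # same frames, opposite order
```

In this toy case both counterfactuals turn left-to-right motion into right-to-left motion, while the middle frame stays the same in every version. A model that looks only at that single frame cannot tell the versions apart, which is the kind of shortcut the summary says CRPO targets.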
Key facts
- Video LLMs often use shortcuts like single-frame cues and language priors.
- Correctness-only rewards in RL post-training reinforce shortcut policies.
- CRPO constructs counterfactual videos by applying horizontal flips and temporal reversals to the originals.
- CRPO trains on both the original and the counterfactual branches.
- CRPO introduces a Counterfactual Relation Reward (CRR).
- The paper is on arXiv with ID 2605.21988.
- The method is called Counterfactual Relational Policy Optimization.
- The goal is to improve spatiotemporal sensitivity.
Entities
Institutions
- arXiv