ARTFEED — Contemporary Art Intelligence

CRPO: Counterfactual RL Improves Spatiotemporal Sensitivity in Video LLMs

ai-technology · 2026-05-23

A new reinforcement learning (RL) framework, Counterfactual Relational Policy Optimization (CRPO), aims to improve spatiotemporal sensitivity in video large language models (Video LLMs). Current Video LLMs often rely on shortcuts such as single-frame cues and language priors rather than tracking video dynamics. Correctness-only rewards during RL post-training make the problem worse, because they reward a right answer regardless of whether the model reached it by following the video or by taking a shortcut. CRPO addresses this by constructing counterfactual videos through horizontal flips and temporal reversals, training on both the original and counterfactual branches, and introducing a Counterfactual Relation Reward (CRR) between them. The approach is detailed in a paper on arXiv (2605.21988).
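The two counterfactual transforms named above can be illustrated with a minimal sketch. It assumes a video is a plain nested list of grayscale frames; the names `horizontal_flip`, `temporal_reverse`, and `clip` are hypothetical and not taken from the paper, which may build its counterfactuals differently.

```python
from typing import List

Frame = List[List[int]]  # rows of pixel values (grayscale, for brevity)
Video = List[Frame]      # frames in playback order

def horizontal_flip(video: Video) -> Video:
    """Mirror every frame left-to-right by reversing each row."""
    return [[row[::-1] for row in frame] for frame in video]

def temporal_reverse(video: Video) -> Video:
    """Play the clip backwards by reversing the frame order."""
    return [[row[:] for row in frame] for frame in video[::-1]]

# Toy clip (assumed data): two frames, each one row of three pixels.
clip: Video = [
    [[1, 2, 3]],
    [[4, 5, 6]],
]

flipped = horizontal_flip(clip)         # spatial counterfactual
reversed_clip = temporal_reverse(clip)  # temporal counterfactual
```

A question about left versus right should change its answer on `flipped`, and a question about event order should change its answer on `reversed_clip`, which is what makes these clips useful probes for shortcut behavior.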

Key facts

  • Video LLMs often use shortcuts like single-frame cues and language priors.
  • Correctness-only rewards in RL post-training reinforce shortcut policies.
  • CRPO uses counterfactual videos via horizontal flips and temporal reversals.
  • CRPO trains on original and counterfactual branches.
  • CRPO introduces a Counterfactual Relation Reward (CRR).
  • The paper is on arXiv with ID 2605.21988.
  • The method is called Counterfactual Relational Policy Optimization.
  • The goal is to improve spatiotemporal sensitivity.
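This summary does not define how the CRR is computed. As a toy illustration only, the sketch below shows one way a reward could depend on the relation between answers on the original and counterfactual branches: a horizontal flip should swap "left" and "right", and a temporal reversal should swap "before" and "after". The answer vocabulary, the transform names, and the function `relation_reward` are all assumptions made for this example, not the paper's formulation.

```python
# Hypothetical answer mappings: how a grounded answer should change
# when the video is transformed. Answers not listed stay the same.
EXPECTED_SWAP = {
    "hflip": {"left": "right", "right": "left"},
    "reverse": {"before": "after", "after": "before"},
}

def expected_counterfactual_answer(answer: str, transform: str) -> str:
    """Return the answer a video-grounded model should give after the transform."""
    if transform not in EXPECTED_SWAP:
        raise ValueError(f"unknown transform: {transform}")
    return EXPECTED_SWAP[transform].get(answer, answer)

def relation_reward(orig_answer: str, cf_answer: str, transform: str) -> float:
    """Toy reward: 1.0 if the two answers are related as the transform predicts."""
    expected = expected_counterfactual_answer(orig_answer, transform)
    return 1.0 if cf_answer == expected else 0.0

# A model that answers "left" on both the original and the flipped clip
# is likely ignoring the video, so it earns no relation reward here.
shortcut = relation_reward("left", "left", "hflip")
grounded = relation_reward("left", "right", "hflip")
```

Under this toy scheme, a policy that repeats the same answer on both branches cannot score well even when its answer on the original clip is correct, which is the kind of pressure a correctness-only reward lacks.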

Entities

Institutions

  • arXiv
