VINA: A Unified Framework for AI-Generated Image and Video Detection

ai-technology · 2026-05-23

A new research paper proposes VINA (Video as Natural Augmentation), a unified framework for detecting AI-generated images and videos. The authors identify a critical failure mode: state-of-the-art AI image detectors often collapse when applied to video frames. They attribute this to a cross-modal gap that arises from video processing shifts and model-specific fingerprints. VINA trains jointly on image and video data, treats video frames as natural augmentations, and introduces cross-modal supervised contrastive learning to bridge the gap. The paper is available on arXiv under ID 2605.21977.

Key facts

arXiv ID: 2605.21977
Paper title: Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection
Proposes VINA framework
Identifies failure of SOTA image detectors on video frames
Cross-modal gap from video processing shifts and generator fingerprints
Joint training on image and video data
Uses video frames as natural augmentations
Introduces cross-modal supervised contrastive learning
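
The summary does not give the paper's loss formula, so the following is only a minimal sketch of what cross-modal supervised contrastive learning could look like. It uses the generic supervised contrastive formulation and assumes that positives are samples sharing a real/fake label but coming from the other modality. The function name, the temperature value, the label encoding, and the toy embeddings are all hypothetical and are not taken from VINA.

```python
import numpy as np

def cross_modal_supcon_loss(embeddings, labels, modalities, temperature=0.1):
    """Generic supervised contrastive loss with cross-modal positives.

    Assumption (not from the paper): a positive pair shares a label
    (e.g. 0 = real, 1 = generated) and has different modalities
    (still image vs. video frame). Anchors without any such positive
    are skipped.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    modalities = np.asarray(modalities)

    # L2-normalize so the dot product is a cosine similarity.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature

    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)

    # Log-softmax of each anchor over all other samples (stable log-sum-exp).
    sim_masked = np.where(not_self, sim, -np.inf)
    row_max = sim_masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(sim_masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom

    # Positives: same label, different modality.
    same_label = labels[:, None] == labels[None, :]
    cross_modal = modalities[:, None] != modalities[None, :]
    pos = same_label & cross_modal & not_self

    pos_count = pos.sum(axis=1)
    has_pos = pos_count > 0
    if not has_pos.any():
        return 0.0

    # Average log-probability of the positives, per anchor, then over anchors.
    pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
    mean_log_prob_pos = pos_log_prob[has_pos] / pos_count[has_pos]
    return float(-mean_log_prob_pos.mean())

# Toy batch: two real and two generated samples, one image and one frame each.
labels = [0, 0, 1, 1]
modalities = ["image", "video", "image", "video"]

# Embeddings where same-label samples agree across modalities ...
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# ... versus embeddings where the video frames land on the wrong class.
misaligned = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

loss_aligned = cross_modal_supcon_loss(aligned, labels, modalities)
loss_misaligned = cross_modal_supcon_loss(misaligned, labels, modalities)
```

In this toy setup the loss is lower when an image and a video frame with the same label sit close together in embedding space, which is the kind of pressure a cross-modal contrastive term would apply to close the image-to-video gap described above.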
