ARTFEED — Contemporary Art Intelligence

VINA: A Unified Framework for AI-Generated Image and Video Detection

ai-technology · 2026-05-23

A new research paper proposes VINA (Video as Natural Augmentation), a unified framework for detecting AI-generated images and videos. The authors identify a critical failure mode: state-of-the-art AI image detectors often collapse when applied to video frames. They attribute this to a cross-modal gap that arises from shifts introduced by video processing and from generator-specific fingerprints. VINA trains jointly on image and video data, treating video frames as natural augmentations of images, and introduces cross-modal supervised contrastive learning to bridge the gap. The paper is available on arXiv under ID 2605.21977.

Key facts

  • arXiv ID: 2605.21977
  • Paper title: Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection
  • Proposes the VINA framework
  • Identifies the failure of state-of-the-art (SOTA) image detectors on video frames
  • Attributes the failure to a cross-modal gap caused by video processing shifts and generator fingerprints
  • Trains jointly on image and video data
  • Uses video frames as natural augmentations
  • Introduces cross-modal supervised contrastive learning

Entities

Institutions

  • arXiv
