VINA: A Unified Framework for AI-Generated Image and Video Detection
A new research paper proposes VINA (Video as Natural Augmentation), a unified framework for detecting AI-generated images and videos. The authors identify a critical failure mode: state-of-the-art AI image detectors often collapse when applied to video frames. They attribute this to a cross-modal gap caused by shifts introduced by video processing and by generator-specific fingerprints. VINA trains jointly on image and video data, treating video frames as natural augmentations, and introduces cross-modal supervised contrastive learning to bridge the gap. The paper is available on arXiv under ID 2605.21977.
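To make the contrastive idea concrete, here is a minimal sketch of a supervised contrastive loss computed over a mixed batch of image and video-frame embeddings. It is not the paper's implementation: the summary above does not give the exact loss, so this uses the generic supervised contrastive formulation. The function name `supcon_loss`, the `cross_modal_only` switch, the temperature value, and the toy embeddings are all illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supcon_loss(embeddings, labels, modalities, temperature=0.1,
                cross_modal_only=True):
    """Generic supervised contrastive loss over a mixed image/video batch.

    labels:     0 = real, 1 = AI-generated (assumed encoding)
    modalities: "image" or "video" for each sample
    cross_modal_only: hypothetical switch; when True, an anchor's positives
        are same-label samples from the *other* modality only.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, num_anchors = 0.0, 0
    for i in range(n):
        # Temperature-scaled similarity of anchor i to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        positives = [
            j for j in logits
            if labels[j] == labels[i]
            and (not cross_modal_only or modalities[j] != modalities[i])
        ]
        if not positives:
            continue  # anchor has no valid positive in this batch
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        num_anchors += 1
    return total / num_anchors if num_anchors else 0.0


# Toy batch: one image and one video frame per class.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
modalities = ["image", "video", "image", "video"]

# Same-class samples sit close together across modalities.
aligned = supcon_loss(embeddings, [0, 0, 1, 1], modalities)
# Same-class samples sit far apart across modalities.
misaligned = supcon_loss(embeddings, [0, 1, 1, 0], modalities)
```

In this toy batch, `aligned` comes out far smaller than `misaligned`, which is the behavior such a loss is meant to reward: embeddings of the same class end up close together regardless of whether they came from an image or a video frame.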
Key facts
- arXiv ID: 2605.21977
- Paper title: Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection
- Proposes the VINA (Video as Natural Augmentation) framework
- Identifies the failure of state-of-the-art image detectors on video frames
- Attributes the cross-modal gap to video processing shifts and generator-specific fingerprints
- Joint training on image and video data
- Uses video frames as natural augmentations
- Introduces cross-modal supervised contrastive learning
Entities
Institutions
- arXiv