LTE: Two-Stage VLM Framework Detects AI-Generated Images with Region-Level Reasoning

ai-technology · 2026-04-24

Researchers have introduced Locate-Then-Examine (LTE), a two-stage forensic framework that uses vision-language models (VLMs) to identify AI-generated images. The first stage localizes suspicious regions; the second re-examines crops of those regions together with the full image to refine the authenticity verdict. The framework ties its decisions to specific visual evidence through region proposals and region-aware reasoning. To support training and evaluation, the researchers built TRACE, a dataset of 20,000 real and high-quality synthetic images with region-level annotations and forensic explanations generated automatically by a VLM. The work addresses a weakness of conventional one-pass classifiers, which often miss subtle artifacts in high-quality synthetic images and offer little pixel-level grounding for their decisions. The paper is available on arXiv as 2510.04225.
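The locate-then-examine control flow can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `locate_then_examine`, `toy_locate`, and `toy_examine`, the 0.5 threshold, and the rule that any flagged region makes the image synthetic are all assumptions made for illustration. Simple callables stand in for the VLM stages, and a toy nested-list image stands in for real pixels.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


@dataclass
class Verdict:
    is_synthetic: bool
    evidence: List[Box]  # regions the decision is tied to


def crop(image: Image, box: Box) -> Image:
    """Cut the rectangle described by `box` out of `image`."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def locate_then_examine(
    image: Image,
    locate: Callable[[Image], Sequence[Box]],
    examine: Callable[[Image, Image], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Verdict:
    # Stage 1: propose suspicious regions.
    boxes = list(locate(image))
    # Stage 2: score each crop together with the full image.
    flagged = [b for b in boxes if examine(image, crop(image, b)) >= threshold]
    # Assumed aggregation rule: any flagged region marks the image as synthetic.
    return Verdict(is_synthetic=bool(flagged), evidence=flagged)


def toy_locate(image: Image) -> List[Box]:
    # Hypothetical stand-in for stage 1: always propose two fixed 2x2 regions.
    return [(0, 0, 2, 2), (2, 2, 4, 4)]


def toy_examine(full: Image, region: Image) -> float:
    # Hypothetical stand-in for stage 2: fraction of crop pixels that are
    # brighter than the mean of the full image.
    pixels = [p for row in full for p in row]
    mean = sum(pixels) / len(pixels)
    crop_pixels = [p for row in region for p in row]
    return sum(p > mean for p in crop_pixels) / len(crop_pixels)


# A 4x4 image that is dark except for a bright bottom-right corner.
image = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 255, 255], [0, 0, 255, 255]]
verdict = locate_then_examine(image, toy_locate, toy_examine)
print(verdict)
```

In this sketch the verdict carries the flagged boxes as evidence, which mirrors the idea of linking a decision to localized regions rather than returning a bare label.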

Key facts

LTE is a two-stage VLM-based forensic framework for detecting AI-generated images.
Stage 1 localizes suspicious regions in the image.
Stage 2 re-examines crops together with the full image to refine the verdict.
LTE explicitly links decisions to localized visual evidence via region proposals and region-aware reasoning.
TRACE dataset contains 20,000 real and synthetic images with region-level annotations and forensic explanations.
TRACE's forensic explanations were generated automatically by a VLM.
Standard one-pass classifiers often miss subtle artifacts in high-quality synthetic images.
Paper available on arXiv: 2510.04225.
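As a rough illustration of what a TRACE-style sample might carry, the following Python sketch models one record with a label, region boxes, and an explanation. The actual TRACE schema is not described here, so every field name (`image_path`, `is_synthetic`, `regions`, `explanation`) and the helper `boxes_in_bounds` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class TraceRecord:
    # All field names are hypothetical; the real dataset layout may differ.
    image_path: str
    is_synthetic: bool
    regions: List[Box] = field(default_factory=list)  # region-level annotations
    explanation: str = ""                             # VLM-generated forensic text


def boxes_in_bounds(record: TraceRecord, width: int, height: int) -> bool:
    """Check that every annotated box is non-empty and lies inside the image."""
    return all(
        0 <= left < right <= width and 0 <= top < bottom <= height
        for left, top, right, bottom in record.regions
    )


sample = TraceRecord(
    image_path="example.png",  # placeholder path
    is_synthetic=True,
    regions=[(10, 10, 50, 50)],
    explanation="Placeholder explanation text.",
)
print(boxes_in_bounds(sample, width=64, height=64))
```

A check such as `boxes_in_bounds` is the kind of sanity test one might run before training on region annotations, since a box that falls outside the image cannot be cropped for second-stage examination.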
