Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model
Researchers propose IRM, a zero-shot method that detects LLM-generated text by scoring it with implicit reward models derived from publicly available instruction-tuned and base models. Unlike prior reward-based approaches, IRM requires no preference-data collection and no additional training. It outperforms existing zero-shot and supervised detectors on the DetectRL benchmark. The work addresses concerns about misuse of the increasingly human-like text produced by large language models.
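The summary does not spell out how the implicit reward is computed, so the following is a minimal sketch of one plausible reading: score a passage with the DPO-style implicit reward, i.e. the length-normalized log-likelihood ratio score(x) = (log π_tuned(x) − log π_base(x)) / |x| between an instruction-tuned model and its base checkpoint, and flag high-scoring text as LLM-generated. The model names, the log-ratio scoring rule, and the length normalization are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: zero-shot detection via a DPO-style implicit reward
# (log-likelihood ratio of instruction-tuned vs. base model). Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_log_prob(model, tokenizer, text, device="cpu"):
    """Return (sum of token log-probabilities of `text` under `model`, token count)."""
    enc = tokenizer(text, return_tensors="pt").to(device)
    input_ids = enc["input_ids"]
    with torch.no_grad():
        logits = model(**enc).logits  # (1, seq_len, vocab)
    # Predict token t from positions < t: shift logits left, targets right.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    target = input_ids[:, 1:]
    token_lp = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item(), target.numel()

def implicit_reward_score(text, tuned, base, tokenizer, device="cpu"):
    """Length-normalized log-ratio; higher values suggest LLM-generated text."""
    lp_tuned, n_tokens = sequence_log_prob(tuned, tokenizer, text, device)
    lp_base, _ = sequence_log_prob(base, tokenizer, text, device)
    return (lp_tuned - lp_base) / n_tokens

if __name__ == "__main__":
    # Hypothetical model pair; any instruction-tuned model and its base checkpoint
    # sharing a tokenizer could play these roles.
    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").eval()
    tuned = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf").eval()
    score = implicit_reward_score("Text to be classified.", tuned, base, tok)
    print(f"implicit reward score: {score:.4f}")  # compare against a fixed threshold
```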
Key facts
- IRM leverages implicit reward models for zero-shot detection.
- Implicit reward models are derived from publicly available instruction-tuned and base models.
- IRM requires no preference collection or additional training.
- IRM outperforms existing zero-shot and supervised methods on the DetectRL benchmark.
- The method addresses concerns about misuse of LLM-generated text.
- Large language models have demonstrated remarkable capabilities across various tasks.
- The previous reward-based detection method relies on preference construction and task-specific fine-tuning; IRM removes both requirements.
- IRM is evaluated on the DetectRL benchmark.
Entities
- arXiv (preprint repository)
- DetectRL (detection benchmark)