ARTFEED — Contemporary Art Intelligence

Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model

ai-technology · 2026-04-25

Researchers propose IRM, a zero-shot method for detecting LLM-generated text using implicit reward models derived from instruction-tuned and base models. Unlike prior reward-based approaches, IRM requires no preference collection or additional training. It outperforms existing zero-shot and supervised methods on the DetectRL benchmark. The work addresses concerns about misuse of human-like text generation by large language models.
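The core idea can be illustrated with a minimal sketch. Here we assume (following the common DPO-style definition of an implicit reward) that the score is the log-likelihood ratio of a text under an instruction-tuned model versus its base model; the per-token log-probabilities below are toy values, not outputs of real models, and the threshold is a hypothetical choice:

```python
def implicit_reward(logp_tuned, logp_base, beta=1.0):
    """DPO-style implicit reward: scaled, length-normalized log-likelihood
    ratio between an instruction-tuned model and its base model."""
    assert len(logp_tuned) == len(logp_base) and len(logp_tuned) > 0
    return beta * (sum(logp_tuned) - sum(logp_base)) / len(logp_tuned)

def looks_llm_generated(logp_tuned, logp_base, threshold=0.0):
    """Zero-shot decision rule: text scoring above the threshold is
    flagged as LLM-generated (no training data required)."""
    return implicit_reward(logp_tuned, logp_base) > threshold

# Toy per-token log-probs (hypothetical, for illustration only):
# LLM-generated text tends to be much more likely under the tuned model,
# while human text is scored similarly by both.
llm_tuned, llm_base = [-1.2, -0.8, -1.0], [-2.0, -1.9, -2.2]
human_tuned, human_base = [-2.5, -3.0, -2.8], [-2.4, -2.9, -2.7]

print(looks_llm_generated(llm_tuned, llm_base))      # → True
print(looks_llm_generated(human_tuned, human_base))  # → False
```

In practice the two log-probability vectors would come from scoring the candidate text with a publicly available tuned/base model pair, which is what makes the approach zero-shot: no preference data is collected and no detector is trained.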

Key facts

  • IRM leverages implicit reward models for zero-shot detection.
  • Implicit reward models are derived from publicly available instruction-tuned and base models.
  • IRM requires no preference collection or additional training.
  • IRM outperforms existing zero-shot and supervised methods on the DetectRL benchmark.
  • The method addresses concerns about misuse of LLM-generated text.
  • Large language models have demonstrated remarkable capabilities across various tasks.
  • Previous reward-based methods rely on preference data construction and task-specific fine-tuning.
  • IRM is evaluated on the DetectRL benchmark.

Entities

Institutions

  • arXiv

Benchmarks

  • DetectRL
