Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model
Researchers propose IRM, a zero-shot method that detects LLM-generated text by scoring it with implicit reward models derived from publicly available instruction-tuned and base models. Unlike prior reward-based approaches, IRM requires no preference-data collection and no additional training. It outperforms existing zero-shot and supervised detectors on the DetectRL benchmark. The work addresses concerns about misuse of the increasingly human-like text produced by large language models.
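The summary does not spell out how the implicit reward is computed, so the following is a minimal sketch of one plausible reading: score a passage with the DPO-style implicit reward, i.e. the length-normalized log-likelihood ratio score(x) = (log π_tuned(x) − log π_base(x)) / |x| between an instruction-tuned model and its base checkpoint, and flag high-scoring text as LLM-generated. The model names, the log-ratio scoring rule, and the length normalization are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: zero-shot detection via a DPO-style implicit reward
# (log-likelihood ratio of instruction-tuned vs. base model). Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_log_prob(model, tokenizer, text, device="cpu"):
    """Return (sum of token log-probabilities of `text` under `model`, token count)."""
    enc = tokenizer(text, return_tensors="pt").to(device)
    input_ids = enc["input_ids"]
    with torch.no_grad():
        logits = model(**enc).logits  # (1, seq_len, vocab)
    # Predict token t from positions < t: shift logits left, targets right.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    target = input_ids[:, 1:]
    token_lp = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item(), target.numel()

def implicit_reward_score(text, tuned, base, tokenizer, device="cpu"):
    """Length-normalized log-ratio; higher values suggest LLM-generated text."""
    lp_tuned, n_tokens = sequence_log_prob(tuned, tokenizer, text, device)
    lp_base, _ = sequence_log_prob(base, tokenizer, text, device)
    return (lp_tuned - lp_base) / n_tokens

if __name__ == "__main__":
    # Hypothetical model pair; any instruction-tuned model and its base checkpoint
    # sharing a tokenizer could play these roles.
    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").eval()
    tuned = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf").eval()
    score = implicit_reward_score("Text to be classified.", tuned, base, tok)
    print(f"implicit reward score: {score:.4f}")  # compare against a fixed threshold
```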
Key facts
- IRM leverages implicit reward models for zero-shot detection.
- Implicit reward models are derived from publicly available instruction-tuned and base models.
- IRM requires no preference collection or additional training.
- IRM outperforms existing zero-shot and supervised methods on the DetectRL benchmark.
- The method addresses concerns about misuse of LLM-generated text.
- Large language models have demonstrated remarkable capabilities across various tasks.
- The previous reward-based detection method relies on preference construction and task-specific fine-tuning; IRM removes both requirements.
- IRM is evaluated on the DetectRL benchmark.
Entities
- arXiv (preprint repository)
- DetectRL (detection benchmark)