TADDLE: AI Agent Detects Deficient LLM-Generated Peer Reviews
Researchers have developed TADDLE, a tool-augmented agent that identifies deficiencies in peer reviews generated by large language models (LLMs). It is released alongside what the authors describe as the first expert-annotated benchmark for this task: 1,800 reviews of 50 papers submitted to ICLR 2025, multi-label annotated by 18 domain experts against six defect categories plus a non-deficient label. In TADDLE, an agent orchestrates four analysis tools (Verify, Correct, Complete, and Transform), and an integrator synthesizes their outputs into binary classifications. The work targets a growing problem: LLM-generated reviews are fluent and well-structured, which makes their quality hard to assess. The paper is available on arXiv under ID 2605.26911.
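The agent, tools, and integrator arrangement can be pictured with a minimal Python sketch. Only the four tool names and the binary output come from the summary. The tool bodies, the `ToolFinding` record, the fixed call order in `run_agent`, and the "any tool flags it" rule in `integrate` are all hypothetical stand-ins, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Tool names are taken from the summary. What each tool actually does is not
# described there, so the tool bodies below are hypothetical stand-ins.
TOOL_NAMES = ["Verify", "Correct", "Complete", "Transform"]


@dataclass
class ToolFinding:
    tool: str      # which analysis tool produced this finding
    flagged: bool  # hypothetical signal: did the tool spot a problem?
    note: str      # short free-text explanation


# A tool takes (paper_text, review_text) and returns one finding.
ToolFn = Callable[[str, str], ToolFinding]


def make_stub_tool(name: str, trigger: str) -> ToolFn:
    """Build a placeholder tool that flags reviews containing `trigger`."""
    def tool(paper: str, review: str) -> ToolFinding:
        hit = trigger.lower() in review.lower()
        note = f"matched '{trigger}'" if hit else "no issue found"
        return ToolFinding(tool=name, flagged=hit, note=note)
    return tool


def run_agent(paper: str, review: str, tools: Dict[str, ToolFn]) -> List[ToolFinding]:
    """Agent step (assumed): call every tool once, in a fixed order."""
    return [tools[name](paper, review) for name in TOOL_NAMES]


def integrate(findings: List[ToolFinding]) -> bool:
    """Integrator step (assumed rule): deficient if any tool flagged the review."""
    return any(f.flagged for f in findings)


def classify_review(paper: str, review: str, tools: Dict[str, ToolFn]) -> bool:
    """Return True for 'deficient', False for 'non-deficient'."""
    return integrate(run_agent(paper, review, tools))


# Illustrative trigger phrases only; they carry no meaning beyond this demo.
DEMO_TOOLS = {
    "Verify": make_stub_tool("Verify", "no experiments"),
    "Correct": make_stub_tool("Correct", "obviously wrong"),
    "Complete": make_stub_tool("Complete", "no further comments"),
    "Transform": make_stub_tool("Transform", "as an ai"),
}
```

Calling `classify_review(paper, review, DEMO_TOOLS)` yields a single boolean per review. A real system would presumably replace the keyword stubs with LLM-backed tools and let the agent decide which tools to invoke, but the summary does not specify either detail.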
Key facts
- TADDLE is a tool-augmented agent for detecting deficient LLM-generated peer reviews.
- The benchmark includes 1,800 reviews of 50 papers submitted to ICLR 2025.
- 18 domain experts annotated the reviews against six defect categories plus a non-deficient label.
- TADDLE uses four specialized analysis tools: Verify, Correct, Complete, and Transform.
- An integrator synthesizes outputs into binary classifications.
- According to the authors, no prior system detects deficiencies in LLM-generated reviews at the level of individual defect types.
- LLM-generated reviews are uniformly fluent and well-structured, making deficiencies hard to detect.
- The paper is available on arXiv under ID 2605.26911.
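The benchmark figures and labeling scheme listed above can be made concrete with a small Python sketch of one annotation record. The counts (1,800 reviews, 50 papers, 18 experts, six defect categories plus a non-deficient label) come from the summary. The category names are not given there, so `defect_1` through `defect_6` are placeholders, and the `AnnotatedReview` record, its field names, and the rule that the non-deficient label excludes all defect labels are assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Counts reported in the summary.
N_REVIEWS = 1800
N_PAPERS = 50
N_EXPERTS = 18

# The summary does not name the six defect categories; these are placeholders.
DEFECT_LABELS = frozenset(f"defect_{i}" for i in range(1, 7))
NON_DEFICIENT = "non_deficient"


@dataclass(frozen=True)
class AnnotatedReview:
    """Hypothetical record for one multi-label annotated review."""
    paper_id: str
    review_id: str
    labels: FrozenSet[str]

    def __post_init__(self) -> None:
        if not self.labels:
            raise ValueError("a review needs at least one label")
        unknown = self.labels - DEFECT_LABELS - {NON_DEFICIENT}
        if unknown:
            raise ValueError(f"unknown labels: {sorted(unknown)}")
        # Assumption: 'non-deficient' cannot be combined with a defect label.
        if NON_DEFICIENT in self.labels and len(self.labels) > 1:
            raise ValueError("non_deficient excludes defect labels")


def is_deficient(review: AnnotatedReview) -> bool:
    """Collapse the multi-label annotation to the binary target."""
    return bool(review.labels & DEFECT_LABELS)


def mean_reviews_per_paper() -> float:
    """Average number of benchmark reviews per paper (1,800 / 50)."""
    return N_REVIEWS / N_PAPERS
```

The division gives an average of 36 reviews per paper. The summary does not say whether reviews are spread evenly across papers or how the 18 experts divided the annotation work.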
Entities
Institutions
- arXiv
- ICLR