RTLC: Feynman-inspired prompting lifts LLM-as-judge accuracy on JudgeBench
A new prompting paradigm called RTLC (Research, Teach-to-Learn, Critique) significantly improves LLM-as-judge accuracy on the JudgeBench benchmark without fine-tuning, retrieval, or external tools. Inspired by the Feynman Learning Technique, RTLC turns a single black-box LLM into an ensemble-of-thought judge through three stages: a pedagogical scaffold, a set of independent candidate verdicts (N=10 at temperature 0.4), and a final self-critique step (temperature 0). On JudgeBench-GPT, Claude 3.7 Sonnet's pairwise accuracy rose substantially above its 64.6% single-shot vanilla-prompt baseline. The method targets a persistent weakness of LLM judges: objective-correctness pairwise items, on which even strong instruction-tuned models barely exceed random chance. The paper is published on arXiv (2605.13695).
Key facts
- RTLC stands for Research, Teach-to-Learn, Critique
- Inspired by the Feynman Learning Technique
- No fine-tuning, retrieval, or external tools required
- Uses N=10 independent candidate verdicts at temperature 0.4
- Self-critique stage at temperature 0
- Tested on JudgeBench-GPT with 350 hard pairwise items
- Claude 3.7 Sonnet's accuracy improved from a 64.6% single-shot baseline
- Published on arXiv with ID 2605.13695
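
The three-stage pipeline summarized above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the `call_model` helper, all prompt wording, and the majority-tally handed to the critique stage are hypothetical stand-ins, since the summary specifies only the stage order, the sample count (N=10), and the temperatures (0.4 and 0).

```python
from collections import Counter

def call_model(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in for a black-box LLM API call.
    Stubbed to always answer "A" so the sketch runs offline."""
    return "A"

def rtlc_judge(question: str, response_a: str, response_b: str,
               n_samples: int = 10) -> str:
    """Sketch of an RTLC-style judge: Research/Teach scaffold,
    an ensemble of independent verdicts, then a temperature-0 critique."""
    # Stage 1 (Research + Teach-to-Learn): have the model explain the
    # problem as if teaching it, Feynman-style. Prompt text is illustrative.
    lesson = call_model(
        f"Research this problem, then explain it as if teaching a student:\n"
        f"{question}",
        temperature=0.4)

    # Stage 2: N=10 independent candidate verdicts sampled at temperature 0.4.
    verdicts = [
        call_model(
            f"Lesson:\n{lesson}\n\nWhich response is correct?\n"
            f"A: {response_a}\nB: {response_b}\nAnswer with A or B.",
            temperature=0.4)
        for _ in range(n_samples)
    ]

    # Stage 3 (Critique): a deterministic temperature-0 pass reviews the
    # candidate verdicts and commits to a final answer.
    tally = Counter(verdicts)
    final = call_model(
        f"Candidate verdicts: {dict(tally)}. Critique them and state the "
        f"final verdict, A or B.",
        temperature=0.0)
    return final

# With the stub model, the pipeline deterministically returns "A".
print(rtlc_judge("Is 17 prime?", "Yes", "No"))  # → A
```

In a real setting, `call_model` would wrap a chat-completion client, and the Stage 3 prompt would likely include the full reasoning traces rather than just a tally; the sketch only fixes the control flow and sampling parameters reported in the summary.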