RTLC: Feynman-inspired prompting lifts LLM-as-judge accuracy on JudgeBench
A new prompting paradigm called RTLC (Research, Teach-to-Learn, Critique) significantly improves LLM-as-judge accuracy on the JudgeBench benchmark without fine-tuning, retrieval, or external tools. Inspired by the Feynman Learning Technique, RTLC turns a single black-box LLM into an ensemble-of-thought judge through three stages: a pedagogical scaffold, a set of independent candidate verdicts (N=10 at temperature 0.4), and a final self-critique step (temperature 0). On JudgeBench-GPT, Claude 3.7 Sonnet's pairwise accuracy rose substantially above its 64.6% single-shot vanilla-prompt baseline. The method targets a persistent weakness of LLM judges: objective-correctness pairwise items, on which even strong instruction-tuned models barely exceed random chance. The paper is published on arXiv (2605.13695).
Key facts
- RTLC stands for Research, Teach-to-Learn, Critique
- Inspired by the Feynman Learning Technique
- No fine-tuning, retrieval, or external tools required
- Uses N=10 independent candidate verdicts at temperature 0.4
- Self-critique stage at temperature 0
- Tested on JudgeBench-GPT with 350 hard pairwise items
- Claude 3.7 Sonnet's accuracy improved from a 64.6% single-shot baseline
- Published on arXiv with ID 2605.13695
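
The three-stage pipeline summarized above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the `call_model` helper, all prompt wording, and the majority-tally handed to the critique stage are hypothetical stand-ins, since the summary specifies only the stage order, the sample count (N=10), and the temperatures (0.4 and 0).

```python
from collections import Counter

def call_model(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in for a black-box LLM API call.
    Stubbed to always answer "A" so the sketch runs offline."""
    return "A"

def rtlc_judge(question: str, response_a: str, response_b: str,
               n_samples: int = 10) -> str:
    """Sketch of an RTLC-style judge: Research/Teach scaffold,
    an ensemble of independent verdicts, then a temperature-0 critique."""
    # Stage 1 (Research + Teach-to-Learn): have the model explain the
    # problem as if teaching it, Feynman-style. Prompt text is illustrative.
    lesson = call_model(
        f"Research this problem, then explain it as if teaching a student:\n"
        f"{question}",
        temperature=0.4)

    # Stage 2: N=10 independent candidate verdicts sampled at temperature 0.4.
    verdicts = [
        call_model(
            f"Lesson:\n{lesson}\n\nWhich response is correct?\n"
            f"A: {response_a}\nB: {response_b}\nAnswer with A or B.",
            temperature=0.4)
        for _ in range(n_samples)
    ]

    # Stage 3 (Critique): a deterministic temperature-0 pass reviews the
    # candidate verdicts and commits to a final answer.
    tally = Counter(verdicts)
    final = call_model(
        f"Candidate verdicts: {dict(tally)}. Critique them and state the "
        f"final verdict, A or B.",
        temperature=0.0)
    return final

# With the stub model, the pipeline deterministically returns "A".
print(rtlc_judge("Is 17 prime?", "Yes", "No"))  # → A
```

In a real setting, `call_model` would wrap a chat-completion client, and the Stage 3 prompt would likely include the full reasoning traces rather than just a tally; the sketch only fixes the control flow and sampling parameters reported in the summary.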