ARTFEED — Contemporary Art Intelligence

RTLC: Feynman-inspired prompting lifts LLM-as-judge accuracy on JudgeBench

ai-technology · 2026-05-14

A new prompting paradigm called RTLC (Research, Teach-to-Learn, Critique) significantly improves LLM-as-judge accuracy on the JudgeBench benchmark without fine-tuning, retrieval, or external tools. Inspired by the Feynman Learning Technique, RTLC turns a single black-box LLM into an ensemble-of-thought judge through three stages: a pedagogical scaffold, an ensemble of independent verdicts (N=10 at temperature 0.4), and a self-critique step (temperature 0). On JudgeBench-GPT, Claude 3.7 Sonnet's pairwise accuracy improved over its 64.6% single-shot vanilla-prompt baseline. The method addresses a persistent weakness of LLM judges on objective-correctness pairwise items, where even strong instruction-tuned models barely exceed random chance. The paper is published on arXiv (2605.13695).
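The three-stage flow described above can be sketched as a small pipeline. This is a hedged illustration, not the paper's implementation: `call_model` is a hypothetical stand-in for whatever chat-completion API the judge runs on, and all prompt wording is invented for the sketch.

```python
from collections import Counter

def call_model(prompt: str, temperature: float) -> str:
    """Stub LLM. In practice this would call a real chat-completion API;
    here it returns a fixed verdict so the sketch runs end to end."""
    return "A"

def rtlc_judge(item: str, n: int = 10) -> str:
    # Stage 1 (Research / Teach-to-Learn): Feynman-style pedagogical
    # scaffold — ask the model to explain the judging task as a teacher.
    scaffold = call_model(
        f"Explain, as if teaching a student, how to judge:\n{item}",
        temperature=0.4,
    )

    # Stage 2: sample N independent candidate verdicts at temperature 0.4.
    verdicts = [
        call_model(
            f"{scaffold}\n\nWhich response is better, A or B?\n{item}",
            temperature=0.4,
        )
        for _ in range(n)
    ]
    tally = Counter(verdicts)

    # Stage 3 (Critique): a temperature-0 self-critique pass reviews the
    # tallied verdicts and issues the final judgment.
    return call_model(
        f"Critique these {n} verdicts {dict(tally)} and give a final "
        f"verdict (A or B) for:\n{item}",
        temperature=0.0,
    )
```

With a real model behind `call_model`, the temperature-0 critique acts as a deterministic tie-breaker over the sampled ensemble, which is how the article describes RTLC combining diversity (temperature 0.4) with a stable final verdict.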

Key facts

  • RTLC stands for Research, Teach-to-Learn, Critique
  • Inspired by the Feynman Learning Technique
  • No fine-tuning, retrieval, or external tools required
  • Uses N=10 independent candidate verdicts at temperature 0.4
  • Self-critique stage at temperature 0
  • Tested on JudgeBench-GPT with 350 hard pairwise items
  • Claude 3.7 Sonnet accuracy improved over its 64.6% single-shot baseline
  • Published on arXiv with ID 2605.13695

Entities

Institutions

  • arXiv
