Prompt Optimization with LLM-as-a-Judge Feedback Boosts Legal QA
A new study on arXiv (2604.20726) investigates how prompt design and judge selection affect LLM-as-a-Judge evaluation in free-text legal question answering. Using the LEXam benchmark, the researchers applied the ProTeGi method to optimize task prompts with feedback from two judge models (Qwen3-32B and DeepSeek-V3) across four task models. Automatic optimization consistently outperformed human-centered baselines. Feedback from a lenient judge yielded higher and more consistent gains than feedback from a strict one, and prompts optimized under lenient feedback transferred better to strict judges than the reverse. The analysis attributes this to lenient judges giving more permissive feedback, which produces prompts with broader applicability.
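To make the setup concrete, here is a minimal, hypothetical sketch of a ProTeGi-style optimization loop. The helper names (call_llm, judge_answer), the 0-10 grading rubric, and the 7.0 failure threshold are illustrative assumptions, not the paper's implementation; full ProTeGi also generates multiple candidate prompts and selects among them with a bandit procedure, which is omitted here for brevity.

```python
import random

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a chat-completion call; wire this to a real API (assumption)."""
    raise NotImplementedError

def judge_answer(judge_model: str, question: str, answer: str, reference: str) -> float:
    """Ask the judge model for a 0-10 grade of a free-text answer (illustrative rubric)."""
    grading_prompt = (
        "Grade the answer against the reference on a 0-10 scale. "
        "Reply with the number only.\n"
        f"Question: {question}\nReference: {reference}\nAnswer: {answer}"
    )
    return float(call_llm(judge_model, grading_prompt))

def optimize_prompt(task_prompt: str, train_set, task_model: str,
                    judge_model: str, steps: int = 5, batch_size: int = 8) -> str:
    """ProTeGi-style loop: collect judged failures, critique the prompt, then edit it."""
    for _ in range(steps):
        minibatch = random.sample(train_set, min(batch_size, len(train_set)))
        # 1. Run the task model and keep the answers the judge scores poorly.
        failures = []
        for question, reference in minibatch:
            answer = call_llm(task_model, f"{task_prompt}\n\n{question}")
            if judge_answer(judge_model, question, answer, reference) < 7.0:
                failures.append((question, answer))
        if not failures:
            continue
        # 2. Ask for a textual "gradient": a natural-language critique of the prompt.
        critique = call_llm(judge_model,
            "This prompt produced the flawed answers below. Explain how to improve it.\n"
            f"Prompt: {task_prompt}\nFailures: {failures}")
        # 3. Edit the prompt in the direction of the critique.
        task_prompt = call_llm(judge_model,
            "Rewrite the prompt so it addresses the critique. Return only the new prompt.\n"
            f"Prompt: {task_prompt}\nCritique: {critique}")
    return task_prompt
```

Using the judge model itself to generate critiques and edits is one possible wiring; the study's point is that the leniency of the feedback source shapes the optimized prompt.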
Key facts
- Study appears on arXiv with ID 2604.20726
- Uses LEXam benchmark for legal QA
- ProTeGi method used for prompt optimization
- Two judges: Qwen3-32B and DeepSeek-V3
- Four task models tested
- Automatic optimization outperforms human-centered design
- Lenient judge feedback yields higher, more consistent gains than strict feedback
- Lenient-to-strict prompt transfer outperforms the reverse (illustrated in the sketch after this list)
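The lenient/strict contrast and the cross-judge transfer test can be illustrated with two hypothetical judge rubrics, reusing the call_llm placeholder from the sketch above. The rubric wording and function names are invented for illustration and are not taken from the paper.

```python
# Hypothetical rubric wording (not the paper's); reuses call_llm from above.
LENIENT_RUBRIC = ("Give credit whenever the answer captures the key legal "
                  "reasoning, even if wording or minor details differ.")
STRICT_RUBRIC = ("Give credit only if the answer states every required legal "
                 "element precisely; penalize omissions and vague wording.")

def judge_with_rubric(judge_model: str, rubric: str, question: str,
                      answer: str, reference: str) -> float:
    """Grade one answer under a specific leniency rubric."""
    prompt = (f"{rubric}\nScore 0-10 and reply with the number only.\n"
              f"Question: {question}\nReference: {reference}\nAnswer: {answer}")
    return float(call_llm(judge_model, prompt))

def transfer_eval(optimized_prompts: dict, test_set, task_model: str,
                  judge_model: str, rubric: str) -> dict:
    """Score each optimized prompt under one fixed rubric, e.g. prompts
    optimized with lenient feedback evaluated by a strict judge."""
    results = {}
    for name, prompt in optimized_prompts.items():
        scores = [
            judge_with_rubric(judge_model, rubric, question,
                              call_llm(task_model, f"{prompt}\n\n{question}"),
                              reference)
            for question, reference in test_set
        ]
        results[name] = sum(scores) / len(scores)
    return results
```

In these terms, the paper's transfer finding corresponds to the lenient-optimized prompt scoring higher than the strict-optimized one even when transfer_eval is run with STRICT_RUBRIC.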