Prompt Optimization with LLM-as-a-Judge Feedback Boosts Legal QA
A new study on arXiv (2604.20726) investigates how prompt design and judge selection affect LLM-as-a-Judge evaluation in free-text legal question answering. Using the LEXam benchmark, the researchers applied the ProTeGi method to optimize task prompts with feedback from two judge models (Qwen3-32B and DeepSeek-V3) across four task models. Automatic optimization consistently outperformed human-centered baselines. Feedback from a lenient judge yielded higher and more consistent gains than feedback from a strict one, and prompts optimized under lenient feedback transferred better to strict judges than the reverse. The analysis attributes this to lenient judges giving more permissive feedback, which produces prompts with broader applicability.
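To make the setup concrete, here is a minimal, hypothetical sketch of a ProTeGi-style optimization loop. The helper names (call_llm, judge_answer), the 0-10 grading rubric, and the 7.0 failure threshold are illustrative assumptions, not the paper's implementation; full ProTeGi also generates multiple candidate prompts and selects among them with a bandit procedure, which is omitted here for brevity.

```python
import random

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a chat-completion call; wire this to a real API (assumption)."""
    raise NotImplementedError

def judge_answer(judge_model: str, question: str, answer: str, reference: str) -> float:
    """Ask the judge model for a 0-10 grade of a free-text answer (illustrative rubric)."""
    grading_prompt = (
        "Grade the answer against the reference on a 0-10 scale. "
        "Reply with the number only.\n"
        f"Question: {question}\nReference: {reference}\nAnswer: {answer}"
    )
    return float(call_llm(judge_model, grading_prompt))

def optimize_prompt(task_prompt: str, train_set, task_model: str,
                    judge_model: str, steps: int = 5, batch_size: int = 8) -> str:
    """ProTeGi-style loop: collect judged failures, critique the prompt, then edit it."""
    for _ in range(steps):
        minibatch = random.sample(train_set, min(batch_size, len(train_set)))
        # 1. Run the task model and keep the answers the judge scores poorly.
        failures = []
        for question, reference in minibatch:
            answer = call_llm(task_model, f"{task_prompt}\n\n{question}")
            if judge_answer(judge_model, question, answer, reference) < 7.0:
                failures.append((question, answer))
        if not failures:
            continue
        # 2. Ask for a textual "gradient": a natural-language critique of the prompt.
        critique = call_llm(judge_model,
            "This prompt produced the flawed answers below. Explain how to improve it.\n"
            f"Prompt: {task_prompt}\nFailures: {failures}")
        # 3. Edit the prompt in the direction of the critique.
        task_prompt = call_llm(judge_model,
            "Rewrite the prompt so it addresses the critique. Return only the new prompt.\n"
            f"Prompt: {task_prompt}\nCritique: {critique}")
    return task_prompt
```

Using the judge model itself to generate critiques and edits is one possible wiring; the study's point is that the leniency of the feedback source shapes the optimized prompt.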
Key facts
- Study appears on arXiv with ID 2604.20726
- Uses LEXam benchmark for legal QA
- ProTeGi method used for prompt optimization
- Two judges: Qwen3-32B and DeepSeek-V3
- Four task models tested
- Automatic optimization outperforms human-centered design
- Lenient judge feedback yields higher, more consistent gains than strict feedback
- Lenient-to-strict prompt transfer outperforms the reverse (illustrated in the sketch after this list)
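The lenient/strict contrast and the cross-judge transfer test can be illustrated with two hypothetical judge rubrics, reusing the call_llm placeholder from the sketch above. The rubric wording and function names are invented for illustration and are not taken from the paper.

```python
# Hypothetical rubric wording (not the paper's); reuses call_llm from above.
LENIENT_RUBRIC = ("Give credit whenever the answer captures the key legal "
                  "reasoning, even if wording or minor details differ.")
STRICT_RUBRIC = ("Give credit only if the answer states every required legal "
                 "element precisely; penalize omissions and vague wording.")

def judge_with_rubric(judge_model: str, rubric: str, question: str,
                      answer: str, reference: str) -> float:
    """Grade one answer under a specific leniency rubric."""
    prompt = (f"{rubric}\nScore 0-10 and reply with the number only.\n"
              f"Question: {question}\nReference: {reference}\nAnswer: {answer}")
    return float(call_llm(judge_model, prompt))

def transfer_eval(optimized_prompts: dict, test_set, task_model: str,
                  judge_model: str, rubric: str) -> dict:
    """Score each optimized prompt under one fixed rubric, e.g. prompts
    optimized with lenient feedback evaluated by a strict judge."""
    results = {}
    for name, prompt in optimized_prompts.items():
        scores = [
            judge_with_rubric(judge_model, rubric, question,
                              call_llm(task_model, f"{prompt}\n\n{question}"),
                              reference)
            for question, reference in test_set
        ]
        results[name] = sum(scores) / len(scores)
    return results
```

In these terms, the paper's transfer finding corresponds to the lenient-optimized prompt scoring higher than the strict-optimized one even when transfer_eval is run with STRICT_RUBRIC.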