Prompt Optimization Skews LLM Evaluation Rankings
A new paper from computer science researchers shows that model rankings change substantially when each large language model (LLM) is evaluated with a prompt optimized for it rather than with a shared static prompt. Current evaluation frameworks apply the same static prompt template to every model, whereas industry practice uses prompt optimization (PO) to maximize application performance per model. Using both public academic benchmarks and internal industry benchmarks, the study finds that evaluating models without first optimizing their prompts can produce misleading rankings. The authors argue that practitioners should perform PO per model when selecting the best LLM for a given task.
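The methodological point is easiest to see side by side: rank models once with a single shared prompt, and once after a small per-model prompt search on held-out development data. The sketch below is a minimal illustration of that comparison, not the paper's actual pipeline; `query_model`, the candidate templates, and the exact-match accuracy metric are placeholder assumptions.

```python
# Minimal sketch of static-prompt vs. per-model prompt-optimized evaluation.
# Assumptions: a hypothetical query_model(model_name, prompt) -> str API,
# illustrative candidate templates, and exact-match accuracy as the metric.
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input text, gold answer)
QueryFn = Callable[[str, str], str]

def accuracy(model: str, template: str, data: List[Example], query_model: QueryFn) -> float:
    """Fraction of examples the model answers correctly with this template."""
    hits = sum(query_model(model, template.format(x=x)).strip() == gold
               for x, gold in data)
    return hits / len(data)

def optimize_prompt(model: str, candidates: List[str], dev: List[Example],
                    query_model: QueryFn) -> str:
    """Per-model PO: pick the candidate template that scores best on dev data."""
    return max(candidates, key=lambda t: accuracy(model, t, dev, query_model))

def rank(scores: Dict[str, float]) -> List[str]:
    """Model names ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)

def compare_protocols(models: List[str], candidates: List[str],
                      dev: List[Example], test: List[Example],
                      query_model: QueryFn) -> None:
    # Protocol A: one static prompt shared by all models.
    static_template = candidates[0]
    static_scores = {m: accuracy(m, static_template, test, query_model) for m in models}
    # Protocol B: optimize the prompt per model on dev data, then score on test.
    optimized_scores = {
        m: accuracy(m, optimize_prompt(m, candidates, dev, query_model), test, query_model)
        for m in models
    }
    print("static ranking:   ", rank(static_scores))
    print("optimized ranking:", rank(optimized_scores))
```

If the two printed rankings disagree, a model selection made from the static-prompt leaderboard would not match the selection made after per-model PO, which is the discrepancy the paper warns about.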
Key facts
- Current LLM evaluation frameworks use the same static prompt template across all models.
- Industry practice uses prompt optimization (PO) to maximize application performance per model.
- The paper investigates the effect of PO on LLM evaluations.
- Results show PO greatly affects the final ranking of models.
- The study used public academic and internal industry benchmarks.
- Practitioners should perform PO per model when conducting evaluations.
- The paper is titled 'Optimization before Evaluation: Evaluation with Unoptimised Prompts Can be Misleading'.
- It is categorized under Computer Science > Artificial Intelligence.
Entities
Institutions
- arXiv (preprint repository hosting the paper; author affiliations not specified in the summary)