BITE Framework Exploits LLM Judge Stylistic Biases
Researchers have developed BITE (BIas exploraTion and Exploitation), a system that exploits the stylistic biases of large language model (LLM) judges to raise the scores those judges assign. BITE frames editing as a contextual bandit problem and uses a LinUCB policy to choose targeted edits that raise scores without changing the original meaning. In tests against multiple LLM judges, the attack succeeded more than 65% of the time and raised scores by 1-2 points on a 9-point scale. The result exposes a vulnerability in LLM-based evaluation: judges favor certain verbose patterns.
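To make the contextual bandit framing concrete, here is a minimal sketch of a disjoint LinUCB policy choosing among stylistic edits. It is an illustration under stated assumptions, not BITE's actual implementation: the edit names in `EDITS`, the two-dimensional context vector, the `hidden` bias weights, and the simulated judge reward are all made up for this example. In a real black-box setting the reward would come from querying the judge before and after an edit.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class LinUCB:
    """Disjoint LinUCB: one ridge-regression reward model per arm."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha  # exploration strength
        # Store A^-1 directly; A starts as the identity (ridge prior).
        self.A_inv = [[[float(i == j) for j in range(dim)] for i in range(dim)]
                      for _ in range(n_arms)]
        self.b = [[0.0] * dim for _ in range(n_arms)]

    def ucb(self, arm, x):
        # Estimated reward plus an uncertainty bonus for this context.
        theta = matvec(self.A_inv[arm], self.b[arm])
        width = dot(x, matvec(self.A_inv[arm], x))
        return dot(theta, x) + self.alpha * math.sqrt(max(width, 0.0))

    def select(self, x):
        scores = [self.ucb(arm, x) for arm in range(len(self.b))]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, x, reward):
        # Sherman-Morrison rank-one update of A^-1 after A += x x^T.
        A_inv = self.A_inv[arm]
        u = matvec(A_inv, x)
        denom = 1.0 + dot(x, u)
        for i in range(len(x)):
            for j in range(len(x)):
                A_inv[i][j] -= u[i] * u[j] / denom
        self.b[arm] = [bi + reward * xi for bi, xi in zip(self.b[arm], x)]

# Hypothetical edit actions, for illustration only.
EDITS = ["add_hedging", "add_bullets", "expand_wording"]

def run(rounds=500, seed=0):
    rng = random.Random(seed)
    # Made-up judge bias: expected score gain per edit as a linear
    # function of the context. The third edit is favored by construction.
    hidden = [[0.1, 0.0], [0.3, 0.2], [1.0, 0.5]]
    policy = LinUCB(n_arms=len(EDITS), dim=2, alpha=1.0)
    picks = [0] * len(EDITS)
    for _ in range(rounds):
        x = [1.0, rng.random()]  # bias term plus one made-up text feature
        arm = policy.select(x)
        # Simulated score change; stands in for a black-box judge query.
        reward = dot(hidden[arm], x) + rng.gauss(0, 0.1)
        policy.update(arm, x, reward)
        picks[arm] += 1
    return picks

if __name__ == "__main__":
    for name, count in zip(EDITS, run()):
        print(f"{name}: chosen {count} times")
```

The policy only observes contexts and rewards, which matches the black-box setting described above: it never needs model parameters or gradients. In this toy run the policy should concentrate its choices on the edit the simulated judge rewards most, after a short exploration phase.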
Key facts
- BITE is a black-box adversarial framework for exploiting stylistic biases in LLM judges.
- It formulates editing as a contextual bandit problem and uses a LinUCB policy.
- No access to model parameters or gradients is required.
- It was tested in pointwise and pairwise comparison settings on chatbot leaderboards and AI-reviewer benchmarks.
- Attack success rate exceeds 65%.
- Scores are raised by 1-2 points on a 9-point scale.
- The edits preserve semantic equivalence with the original text.
- The work is published on arXiv with ID 2605.26156.
Entities
Institutions
- arXiv