JuICE Benchmark Evaluates LLM Judges on Cultural Errors
A group of researchers has introduced JuICE (Benchmark for LLM-Judge in Identifying Cultural Errors), a new multilingual dataset with 7,470 span-level annotations that mark cultural and linguistic errors in long-form language model responses. The benchmark aims to address a gap in existing cultural evaluations, which typically treat culture as flat facts and use LLMs as judges without verifying that they can recognize nuanced cultural mistakes. JuICE contains 1,050 examples across multiple languages and focuses on responses that may be factually accurate but culturally inappropriate. The work is available as arXiv preprint 2605.26955.
Key facts
- JuICE is a benchmark for evaluating LLM judges on cultural errors.
- The dataset contains 7,470 span-level annotations.
- It covers 1,050 examples in multiple languages.
- Errors include cultural and linguistic inaccuracies in long-form LLM responses.
- Existing benchmarks treat culture as flat facts via fact verification or norm entailment.
- LLM-as-a-Judge is commonly used without validating that the judge can recognize nuanced cultural mistakes.
- The research is published on arXiv with ID 2605.26955.
- The goal is to improve LLM performance across diverse cultural contexts.
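
To make "span-level annotation" concrete, here is a minimal Python sketch of how a judge's flagged spans could be compared against gold error spans in a single response. The record layout (`ErrorSpan`, `start`, `end`, `error_type`) and the character-overlap metric are illustrative assumptions, not the JuICE data format or the scoring method used in the preprint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorSpan:
    """Hypothetical record for one annotated error span (not the JuICE schema)."""
    start: int        # inclusive character offset into the response
    end: int          # exclusive character offset
    error_type: str   # assumed label, e.g. "cultural" or "linguistic"


def covered_chars(spans):
    """Return the set of character offsets covered by any of the spans."""
    chars = set()
    for span in spans:
        chars.update(range(span.start, span.end))
    return chars


def span_overlap_scores(gold, predicted):
    """Character-level precision, recall, and F1 of predicted spans against gold spans."""
    gold_chars = covered_chars(gold)
    pred_chars = covered_chars(predicted)
    overlap = len(gold_chars & pred_chars)
    precision = overlap / len(pred_chars) if pred_chars else 0.0
    recall = overlap / len(gold_chars) if gold_chars else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One gold span and one judge-predicted span that covers half of it.
gold = [ErrorSpan(10, 20, "cultural")]
predicted = [ErrorSpan(15, 25, "cultural")]
precision, recall, f1 = span_overlap_scores(gold, predicted)
print(precision, recall, f1)
```

Scoring on character overlap gives partial credit when a judge flags the right region with imprecise boundaries. A stricter variant could require exact span matches or matching error types.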
Entities
Institutions
- arXiv