Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese
Researchers have introduced Prosa, the first multi-turn chat benchmark built from real Brazilian Portuguese user conversations, comprising 1,000 conversations sampled from WildChat. The findings indicate that binary rubric scoring combined with multi-judge filtering mitigates the judge-model bias that limits holistic scoring. Under filtered rubric scoring, three judges drawn from different model families agreed on all 16 model ranks, compared with only 7 ranks under holistic scoring. The filtering pipeline also widens the average score gap between adjacent models by 47%, improving the benchmark's discriminative power. Evaluating a new model on Prosa costs about $2.10 with Gemini 3 Flash as the judge. The benchmark and filtering code are publicly available.
Key facts
- Prosa is the first real user multi-turn Brazilian Portuguese chat benchmark.
- It includes 1,000 WildChat conversations.
- Three judges from three model families scored 16 models.
- Binary rubric scoring with multi-judge filtering yields full judge agreement on all 16 model ranks.
- Holistic scoring produces judge agreement on only 7 of 16 ranks.
- Rubric filtering increases average score gap between neighboring models by 47%.
- Evaluating a new model costs about $2.10 using Gemini 3 Flash as the judge.
- Benchmark and filtering code are released.
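The multi-judge filtering idea can be sketched as follows. This is a hypothetical illustration, not the paper's released code: the function names, data shapes, and the unanimity rule are assumptions. The core idea from the summary is that each rubric criterion receives a binary verdict from each judge, criteria on which the judges disagree are filtered out, and a model's score is the pass rate over the surviving criteria.

```python
# Hypothetical sketch of binary rubric scoring with multi-judge filtering.
# Data shapes and the unanimity rule are assumptions; the actual Prosa
# pipeline may filter differently.

def filter_criteria(verdicts):
    """verdicts maps criterion name -> list of per-judge binary verdicts (0/1).
    Keep only criteria where all judges agree, removing judge-specific bias."""
    return {c: v[0] for c, v in verdicts.items() if len(set(v)) == 1}

def score_model(verdicts):
    """Score = fraction of surviving (consensus) criteria the model passed."""
    kept = filter_criteria(verdicts)
    if not kept:
        return None
    return sum(kept.values()) / len(kept)

# Example: three judges give binary verdicts on four rubric criteria
# (criterion names are invented for illustration).
verdicts = {
    "answers_in_portuguese": [1, 1, 1],  # unanimous pass -> kept
    "follows_instruction":   [1, 0, 1],  # judges disagree -> filtered out
    "factually_correct":     [0, 0, 0],  # unanimous fail -> kept
    "polite_tone":           [1, 1, 1],  # unanimous pass -> kept
}
print(score_model(verdicts))  # 2 of the 3 kept criteria passed
```

Dropping disputed criteria is what removes judge idiosyncrasies: a criterion that only one judge's model family tends to reward never reaches the final score, which is consistent with the reported gain in rank agreement.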
Entities
Institutions
- arXiv
Locations
- Brazil