Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese
Researchers have introduced Prosa, the first multi-turn chat benchmark built from real Brazilian Portuguese user conversations, comprising 1,000 conversations sampled from WildChat. The findings indicate that binary rubric scoring combined with multi-judge filtering mitigates the judge-model bias that limits holistic scoring. Under filtered rubric scoring, three judges drawn from different model families agreed on all 16 model ranks, compared with only 7 ranks under holistic scoring. The filtering pipeline also widens the average score gap between adjacent models by 47%, improving the benchmark's discriminative power. Evaluating a new model on Prosa costs about $2.10 with Gemini 3 Flash as the judge. The benchmark and filtering code are publicly available.
Key facts
- Prosa is the first real user multi-turn Brazilian Portuguese chat benchmark.
- It includes 1,000 WildChat conversations.
- Three judges from three model families scored 16 models.
- Binary rubric scoring with multi-judge filtering yields full judge agreement on all 16 model ranks.
- Holistic scoring produces judge agreement on only 7 of 16 ranks.
- Rubric filtering increases average score gap between neighboring models by 47%.
- Evaluating a new model costs about $2.10 using Gemini 3 Flash as the judge.
- Benchmark and filtering code are released.
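The multi-judge filtering idea can be sketched as follows. This is a hypothetical illustration, not the paper's released code: the function names, data shapes, and the unanimity rule are assumptions. The core idea from the summary is that each rubric criterion receives a binary verdict from each judge, criteria on which the judges disagree are filtered out, and a model's score is the pass rate over the surviving criteria.

```python
# Hypothetical sketch of binary rubric scoring with multi-judge filtering.
# Data shapes and the unanimity rule are assumptions; the actual Prosa
# pipeline may filter differently.

def filter_criteria(verdicts):
    """verdicts maps criterion name -> list of per-judge binary verdicts (0/1).
    Keep only criteria where all judges agree, removing judge-specific bias."""
    return {c: v[0] for c, v in verdicts.items() if len(set(v)) == 1}

def score_model(verdicts):
    """Score = fraction of surviving (consensus) criteria the model passed."""
    kept = filter_criteria(verdicts)
    if not kept:
        return None
    return sum(kept.values()) / len(kept)

# Example: three judges give binary verdicts on four rubric criteria
# (criterion names are invented for illustration).
verdicts = {
    "answers_in_portuguese": [1, 1, 1],  # unanimous pass -> kept
    "follows_instruction":   [1, 0, 1],  # judges disagree -> filtered out
    "factually_correct":     [0, 0, 0],  # unanimous fail -> kept
    "polite_tone":           [1, 1, 1],  # unanimous pass -> kept
}
print(score_model(verdicts))  # 2 of the 3 kept criteria passed
```

Dropping disputed criteria is what removes judge idiosyncrasies: a criterion that only one judge's model family tends to reward never reaches the final score, which is consistent with the reported gain in rank agreement.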
Entities
Institutions
- arXiv
Locations
- Brazil