ARTFEED — Contemporary Art Intelligence

Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese

ai-technology · 2026-05-06

Researchers have introduced Prosa, the first multi-turn chat benchmark built from real conversations with Brazilian Portuguese users. It comprises 1,000 conversations drawn from WildChat. The findings indicate that binary rubric scoring combined with multi-judge filtering removes the judge-model bias that affects holistic scoring. Under filtered rubric scoring, three judges from three different model families agreed on all 16 model ranks; under holistic scoring they agreed on only 7. The filtering pipeline also widens the average score gap between adjacent models by 47%, which makes the benchmark more discriminative. Evaluating a new model on Prosa costs about $2.1 with Gemini 3 Flash as the judge. The benchmark and filtering code have been made publicly available.
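As a rough illustration of how binary rubric scoring with multi-judge filtering might work, the sketch below keeps only the rubric criteria on which every judge gives the same verdict, then scores a model as the fraction of kept criteria it satisfies. The data layout, the function names, and the unanimity rule are all assumptions made for this example; the summary does not describe Prosa's actual filtering criterion.

```python
from statistics import mean

# Hypothetical layout: verdicts[judge][model][criterion_id] -> True/False
def unanimous_criteria(verdicts):
    """Return criterion ids where all judges give the same verdict for every model.

    Assumption: unanimity is the filter. The real pipeline may use another rule.
    """
    judges = list(verdicts)
    models = list(verdicts[judges[0]])
    criteria = list(verdicts[judges[0]][models[0]])
    kept = []
    for c in criteria:
        if all(len({verdicts[j][m][c] for j in judges}) == 1 for m in models):
            kept.append(c)
    return kept


def rubric_score(model_verdicts, criteria):
    """Fraction of the given binary criteria that are satisfied (0.0 if none)."""
    if not criteria:
        return 0.0
    return mean(1.0 if model_verdicts[c] else 0.0 for c in criteria)


# Toy example with three judges, two models, and three criteria.
verdicts = {
    "judge_a": {"m1": {"c1": True, "c2": True, "c3": False},
                "m2": {"c1": False, "c2": True, "c3": False}},
    "judge_b": {"m1": {"c1": True, "c2": False, "c3": False},
                "m2": {"c1": False, "c2": True, "c3": False}},
    "judge_c": {"m1": {"c1": True, "c2": True, "c3": False},
                "m2": {"c1": False, "c2": True, "c3": False}},
}
kept = unanimous_criteria(verdicts)  # "c2" is dropped: judges disagree on m1
score_m1 = rubric_score(verdicts["judge_a"]["m1"], kept)
```

Because the kept criteria are the ones where judges already agree, the resulting score no longer depends on which judge is consulted, which is the intuition behind the reduced judge bias reported above.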

Key facts

  • Prosa is the first multi-turn chat benchmark built from real user conversations in Brazilian Portuguese.
  • It includes 1,000 WildChat conversations.
  • Three judges from three model families scored 16 models.
  • With binary rubric scoring and multi-judge filtering, the judges agree on all 16 ranks.
  • With holistic scoring, the judges agree on only 7 of 16 ranks.
  • Rubric filtering increases average score gap between neighboring models by 47%.
  • Evaluation cost is about $2.1 using Gemini 3 Flash.
  • Benchmark and filtering code are released.
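The two headline metrics in the list above, rank agreement across judges and the average score gap between neighboring models, can be sketched as follows. This is a minimal example under stated assumptions: the function names are hypothetical, "agreement on a rank" is read as every judge placing the same model at that position, and the scores are invented toy values, not Prosa results.

```python
def ranking(scores):
    """Model names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def agreed_ranks(scores_by_judge):
    """Count rank positions at which every judge places the same model."""
    rankings = [ranking(s) for s in scores_by_judge.values()]
    return sum(1 for position in zip(*rankings) if len(set(position)) == 1)


def mean_adjacent_gap(scores):
    """Average score difference between neighboring models in the ranking."""
    ordered = sorted(scores.values(), reverse=True)
    gaps = [a - b for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


# Toy scores (invented for illustration only).
scores_by_judge = {
    "judge_a": {"m1": 0.9, "m2": 0.7, "m3": 0.4},
    "judge_b": {"m1": 0.8, "m2": 0.5, "m3": 0.6},
}
n_agreed = agreed_ranks(scores_by_judge)             # judges agree only on 1st place
gap_a = mean_adjacent_gap(scores_by_judge["judge_a"])  # (0.2 + 0.3) / 2
```

A larger mean adjacent gap means neighboring models are separated by a wider margin, so small scoring noise is less likely to swap their positions.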

Entities

Institutions

  • arXiv

Locations

  • Brazil
