ARTFEED — Contemporary Art Intelligence

QIMMA Arabic LLM Leaderboard Introduces Quality-First Evaluation for Arabic Language Models

ai-technology · 2026-04-21

The newly launched QIMMA (قِمّة, Arabic for "summit") leaderboard for Arabic LLMs verifies benchmark quality before evaluating models, so that scores reflect genuine Arabic language proficiency. It consolidates 109 subsets from 14 benchmarks into an evaluation suite of more than 52,000 samples across seven domains, 99% of which is native Arabic content.

A thorough quality validation pass exposed systematic problems in widely used Arabic benchmarks. The process runs multi-model automated assessment with Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B, followed by review from native Arabic speakers; the issues it surfaces include translation errors, cultural mismatches, inconsistent annotations, and encoding mistakes.

As of April 2026, Arabic-focused models such as Jais-2-70B-Chat lead cultural and linguistic tasks, while multilingual models such as Qwen3.5-397B lead coding. The evaluation framework builds on LightEval, EvalPlus, and FannOrFlop, using standardized Arabic prompt templates. Notably, mid-sized models can outperform larger ones in specific domains. The leaderboard addresses long-standing fragmentation in Arabic NLP evaluation, which serves more than 400 million speakers across diverse dialects.

Key facts

  • QIMMA validates benchmarks before evaluating Arabic LLMs
  • Consolidates 109 subsets from 14 benchmarks into 52,000+ samples
  • Uses multi-model assessment with Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B
  • Includes human annotation by native Arabic speakers
  • First Arabic leaderboard with code evaluation (HumanEval+, MBPP+)
  • Results as of April 2026 show Jais-2-70B-Chat leads cultural tasks
  • Qwen3.5-397B leads coding domains
  • Addresses evaluation for 400+ million Arabic speakers across diverse dialects
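The validation workflow described above — automated LLM judges first, human review of flagged samples second — can be sketched as a small pipeline. This is a minimal illustration, not QIMMA's actual implementation: the judge functions, the issue taxonomy, and the two-vote threshold are assumptions for the sketch; only the issue categories mirror those named in the article.

```python
from dataclasses import dataclass, field

# Issue categories drawn from the article; the set itself is illustrative.
ISSUE_TYPES = {"translation_error", "cultural_mismatch",
               "annotation_inconsistency", "encoding_error"}

@dataclass
class Sample:
    benchmark: str
    text: str
    issues: set = field(default_factory=set)  # issues flagged by judges

def validate(samples, judges, min_votes=2):
    """Flag a sample for native-speaker review when at least
    `min_votes` automated judges report any quality issue on it."""
    flagged = []
    for s in samples:
        votes = 0
        for judge in judges:
            found = judge(s)          # returns a (possibly empty) set of issues
            if found:
                votes += 1
                s.issues |= found
        if votes >= min_votes:
            flagged.append(s)         # queue for human review
    return flagged

# Stub judges standing in for the two LLM judges named in the article;
# here they only detect replacement characters left by bad encoding.
def judge_a(s):
    return {"encoding_error"} if "\ufffd" in s.text else set()

def judge_b(s):
    return {"encoding_error"} if "\ufffd" in s.text else set()

if __name__ == "__main__":
    samples = [Sample("demo-bench", "سؤال سليم"),
               Sample("demo-bench", "نص تالف \ufffd")]
    flagged = validate(samples, [judge_a, judge_b])
    print(len(flagged), "sample(s) queued for human review")
```

Requiring agreement between independent judges before escalating to humans keeps the (expensive) native-speaker review focused on samples most likely to be genuinely defective.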

Entities

Institutions

  • 3LM
  • Falcon Perception

Sources