ARTFEED — Contemporary Art Intelligence

New Arabic Cultural QA Benchmark Tests LLMs on Dialect Variants and Open-Ended Questions

ai-technology · 2026-04-20

A new study introduces a comprehensive Arabic cultural question-answering benchmark that targets uneven large language model performance on culturally grounded and dialectal content. The approach translates Modern Standard Arabic multiple-choice questions into English and several Arabic dialects, then converts them into open-ended formats. The researchers benchmarked a range of zero-shot and fine-tuned LLMs under both multiple-choice and open-ended settings, and generated chain-of-thought rationales to fine-tune models for step-by-step reasoning. Using this approach, they extended an existing dataset so that questions and answers are aligned in parallel across multiple language varieties. Extensive experiments with both open and closed models show that models consistently underperform on Arabic dialects, highlighting persistent gaps in culturally grounded content. The researchers believe the dataset is the first of its kind for Arabic language varieties.
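
The parallel alignment and the multiple-choice-to-open-ended conversion lend themselves to a small illustration. Below is a minimal Python sketch of what such a record and conversion could look like; the ParallelQA schema, its field names, and the to_open_ended helper are hypothetical illustrations, not the paper's actual code or data format.

    from dataclasses import dataclass

    @dataclass
    class ParallelQA:
        """One question aligned in parallel across language varieties
        (e.g. MSA, English, a dialect); keys and varieties are assumptions."""
        question: dict[str, str]       # variety code -> question text
        choices: dict[str, list[str]]  # variety code -> answer options
        answer_index: int              # position of the gold option

    def to_open_ended(item: ParallelQA, variety: str) -> tuple[str, str]:
        """Turn a multiple-choice item into an open-ended (question,
        reference answer) pair by dropping the options and keeping
        the gold choice as the reference answer."""
        question = item.question[variety]
        reference = item.choices[variety][item.answer_index]
        return question, reference

    # Example usage with a toy item aligned across MSA and English.
    item = ParallelQA(
        question={"msa": "What is the capital of Morocco?",  # MSA text in practice
                  "en": "What is the capital of Morocco?"},
        choices={"msa": ["Rabat", "Fes", "Marrakesh"],       # MSA options in practice
                 "en": ["Rabat", "Fes", "Marrakesh"]},
        answer_index=0,
    )
    print(to_open_ended(item, "en"))  # ('What is the capital of Morocco?', 'Rabat')

Keying every variety off the same record is what makes results directly comparable across Modern Standard Arabic, English, and the dialects.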

Key facts

  • Large Language Models show uneven performance on culturally grounded and dialectal content
  • Method translates Modern Standard Arabic multiple-choice questions into English and Arabic dialects
  • Questions are converted from multiple-choice to open-ended formats
  • Researchers benchmark zero-shot and fine-tuned LLMs under both question formats (see the scoring sketch after this list)
  • Chain-of-thought rationales are generated to fine-tune models for step-by-step reasoning (see the fine-tuning sketch after this list)
  • Existing dataset is extended with parallel alignment across multiple language varieties
  • Extensive experiments conducted with both open and closed models
  • Models underperform on Arabic dialects, revealing gaps in culturally grounded content
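
As a hedged illustration of the rationale-generation step, the sketch below builds one chain-of-thought fine-tuning record; llm_generate is a placeholder for whichever model API produces the rationale, and the prompt template is an assumption rather than the paper's exact wording.

    # Illustrative prompt template; not the paper's actual prompt.
    COT_PROMPT = (
        "Question: {question}\n"
        "Answer: {answer}\n"
        "Explain step by step why this answer is correct."
    )

    def build_cot_record(question: str, answer: str, llm_generate) -> dict:
        """Return one supervised fine-tuning record whose target is a
        generated rationale followed by the final answer."""
        rationale = llm_generate(COT_PROMPT.format(question=question, answer=answer))
        return {"input": question,
                "target": f"{rationale}\nFinal answer: {answer}"}

    # Usage with a stub generator, just to show the record shape.
    def stub(prompt: str) -> str:
        return "Rabat has been Morocco's administrative capital since 1912."

    record = build_cot_record("What is the capital of Morocco?", "Rabat", stub)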
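
For the dual-format evaluation, a common setup (and an assumption here, since the article does not name the paper's metrics) is option-index accuracy for multiple choice and token-level F1 for open-ended answers:

    def mcq_accuracy(predictions: list[int], golds: list[int]) -> float:
        """Fraction of items where the predicted option index matches the gold."""
        return sum(p == g for p, g in zip(predictions, golds)) / len(golds)

    def token_f1(prediction: str, reference: str) -> float:
        """SQuAD-style token-overlap F1 between a free-form answer and its reference."""
        pred, ref = prediction.lower().split(), reference.lower().split()
        overlap = sum(min(pred.count(t), ref.count(t)) for t in set(pred) & set(ref))
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred), overlap / len(ref)
        return 2 * precision * recall / (precision + recall)

Running both scorers over the same parallel records is what would surface the per-dialect gaps the study reports.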
