ArabCulture-Dialogue: Benchmarking LLMs on Arabic Cultural Reasoning

ai-technology · 2026-05-04

Researchers have introduced ArabCulture-Dialogue, a conversational dataset for evaluating cultural reasoning in large language models (LLMs) across Arabic dialects and Modern Standard Arabic (MSA). The dataset covers 13 Arabic-speaking countries, includes both MSA and each country's dialect, and spans 12 daily-life topics with 54 fine-grained subtopics. The authors define three benchmarking tasks: multiple-choice cultural reasoning, machine translation between MSA and dialects, and dialect-steering generation. Experiments reveal a persistent performance gap between MSA and Arabic dialects: models perform worse on all three tasks in dialectal settings than in MSA. The work addresses a significant gap in evaluating cultural nuance in LLMs, since most Arabic benchmarks rely on short MSA text snippets and overlook conversational contexts.
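To make the reported MSA-versus-dialect gap concrete, here is a minimal sketch of how such a gap could be computed for the multiple-choice task. It is an illustration only, not the authors' evaluation code: the `McqResult` record, its field names, the `"msa"`/`"dialect"` labels, the `country_a` placeholder, and the toy results are all assumptions made up for this example and do not reflect the dataset's actual schema or any reported numbers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class McqResult:
    """One scored multiple-choice item (hypothetical schema, not the dataset's)."""
    country: str    # placeholder country identifier
    variety: str    # "msa" or "dialect" (assumed labels)
    gold: str       # label of the correct option
    predicted: str  # label of the option the model chose


def accuracy_by_variety(results):
    """Return {country: {variety: accuracy}} over the scored items."""
    correct = defaultdict(lambda: defaultdict(int))
    total = defaultdict(lambda: defaultdict(int))
    for r in results:
        total[r.country][r.variety] += 1
        correct[r.country][r.variety] += int(r.gold == r.predicted)
    return {
        country: {v: correct[country][v] / n for v, n in varieties.items()}
        for country, varieties in total.items()
    }


def msa_dialect_gap(acc):
    """MSA accuracy minus dialect accuracy per country.

    A positive value means the model did better on the MSA items.
    Countries missing either variety are skipped.
    """
    return {
        country: scores["msa"] - scores["dialect"]
        for country, scores in acc.items()
        if "msa" in scores and "dialect" in scores
    }


# Toy, made-up results used only to exercise the functions.
toy = [
    McqResult("country_a", "msa", "A", "A"),
    McqResult("country_a", "msa", "B", "B"),
    McqResult("country_a", "dialect", "A", "A"),
    McqResult("country_a", "dialect", "B", "C"),
]
acc = accuracy_by_variety(toy)
gap = msa_dialect_gap(acc)
print(gap)  # {'country_a': 0.5}
```

Reporting the gap per country rather than as a single pooled number keeps a large drop in one dialect from being averaged away by smaller drops elsewhere.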

Key facts

ArabCulture-Dialogue is a conversational dataset for cultural reasoning in Arabic.
Dataset covers 13 Arabic-speaking countries.
Includes both Modern Standard Arabic and each country's dialect.
Spans 12 daily-life topics and 54 fine-grained subtopics.
Three benchmarking tasks: multiple-choice cultural reasoning, machine translation, dialect-steering generation.
Experiments show LLMs perform worse on dialectal tasks than on MSA tasks.
Uses conversational data to address a gap in evaluating cultural nuance in LLMs.
Most Arabic benchmarks focus on short text snippets in MSA.

Entities

Institutions

arXiv

Locations

Arabic-speaking countries

Sources

arXiv cs.AI — 2026-05-04