ARTFEED — Contemporary Art Intelligence

ArabCulture-Dialogue: Benchmarking LLMs on Arabic Cultural Reasoning

ai-technology · 2026-05-04

Researchers have introduced ArabCulture-Dialogue, a conversational dataset for evaluating cultural reasoning in large language models (LLMs) across Arabic dialects and Modern Standard Arabic (MSA). The dataset covers 13 Arabic-speaking countries, includes both MSA and each country's dialect, and spans 12 daily-life topics with 54 fine-grained subtopics. It supports three benchmarking tasks: multiple-choice cultural reasoning, machine translation between MSA and dialects, and dialect-steering generation. In the experiments, models performed worse in dialectal setups than in MSA on all three tasks, a gap the researchers describe as persistent. The work fills a gap in evaluating cultural nuance in LLMs: most Arabic benchmarks rely on short text snippets in MSA and overlook conversational contexts.
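To make the MSA-versus-dialect comparison concrete, here is a minimal sketch of how a per-variety accuracy gap could be computed for the multiple-choice task. The record layout (`variety`, `gold`, `pred`), the function names, and the toy values are all assumptions made for illustration; they are not taken from the dataset or from the researchers' evaluation code.

```python
from collections import defaultdict

# Hypothetical toy records, NOT real benchmark data or results.
# Each record is one multiple-choice item answered by a model:
#   variety: "msa" or "dialect" (assumed labels)
#   gold:    index of the correct option
#   pred:    index of the option the model chose
records = [
    {"variety": "msa", "gold": 0, "pred": 0},
    {"variety": "msa", "gold": 2, "pred": 2},
    {"variety": "msa", "gold": 1, "pred": 1},
    {"variety": "msa", "gold": 3, "pred": 0},
    {"variety": "dialect", "gold": 0, "pred": 0},
    {"variety": "dialect", "gold": 2, "pred": 1},
    {"variety": "dialect", "gold": 1, "pred": 1},
    {"variety": "dialect", "gold": 3, "pred": 0},
]


def accuracy_by_variety(items):
    """Return the fraction of correctly answered items per language variety."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["variety"]] += 1
        correct[item["variety"]] += int(item["gold"] == item["pred"])
    return {variety: correct[variety] / total[variety] for variety in total}


def msa_dialect_gap(items):
    """Return MSA accuracy minus dialect accuracy (positive means MSA is easier)."""
    acc = accuracy_by_variety(items)
    return acc["msa"] - acc["dialect"]


if __name__ == "__main__":
    print(accuracy_by_variety(records))
    print(msa_dialect_gap(records))
```

A positive gap on real model outputs would correspond to the pattern reported here, where models score lower in dialectal setups than in MSA. The same grouping could be extended to a per-country breakdown if each record also carried a country field.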

Key facts

  • ArabCulture-Dialogue is a conversational dataset for cultural reasoning in Arabic.
  • Dataset covers 13 Arabic-speaking countries.
  • Includes both Modern Standard Arabic and each country's dialect.
  • Spans 12 daily-life topics and 54 fine-grained subtopics.
  • Three benchmarking tasks: multiple-choice cultural reasoning, machine translation between MSA and dialects, dialect-steering generation.
  • Experiments show LLMs perform worse on dialectal tasks than on MSA tasks.
  • Uses conversational data to address a gap in evaluating cultural nuances in LLMs.
  • Most Arabic benchmarks focus on short text snippets in MSA.

Entities

Institutions

  • arXiv

Locations

  • Arabic-speaking countries

Sources