ArabCulture-Dialogue: Benchmarking LLMs on Arabic Cultural Reasoning
Researchers have introduced ArabCulture-Dialogue, a conversational dataset for evaluating cultural reasoning in large language models (LLMs) across Modern Standard Arabic (MSA) and Arabic dialects. The dataset covers 13 Arabic-speaking countries, includes both MSA and each country's respective dialect, and spans 12 daily-life topics with 54 fine-grained subtopics. It supports three benchmarking tasks: multiple-choice cultural reasoning, machine translation between MSA and dialects, and dialect-steering generation. Experiments reveal a persistent performance gap: on all three tasks, models perform worse in dialectal setups than in MSA. The work targets an underexplored area of LLM evaluation, since most Arabic benchmarks rely on short text snippets in MSA and overlook conversational contexts.
Key facts
- ArabCulture-Dialogue is a conversational dataset for cultural reasoning in Arabic.
- Dataset covers 13 Arabic-speaking countries.
- Includes both Modern Standard Arabic and each country's dialect.
- Spans 12 daily-life topics and 54 fine-grained subtopics.
- Three benchmarking tasks: multiple-choice cultural reasoning, machine translation, dialect-steering generation.
- Experiments show LLMs perform worse on dialectal tasks than on MSA tasks.
- Addresses a gap in evaluating cultural nuances in LLMs by using conversational data.
- Most Arabic benchmarks focus on short text snippets in MSA.
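One way to picture the MSA-versus-dialect gap on the multiple-choice task is to score a model separately on each language variety and subtract. The sketch below is a minimal illustration and is not the authors' evaluation code: the `McqItem` record, its field names, the placeholder country labels, and the toy predictions are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class McqItem:
    """One multiple-choice item; every field name here is hypothetical."""
    country: str    # placeholder label, not a country list from the paper
    variety: str    # "msa" or "dialect"
    gold: int       # index of the correct option
    predicted: int  # index chosen by the model under evaluation


def accuracy_by_variety(items):
    """Return {variety: accuracy} over the given items."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.variety] += 1
        correct[item.variety] += int(item.gold == item.predicted)
    return {variety: correct[variety] / total[variety] for variety in total}


# Toy predictions invented for illustration only (not results from the paper).
items = [
    McqItem("country_a", "msa", gold=0, predicted=0),
    McqItem("country_a", "dialect", gold=0, predicted=1),
    McqItem("country_b", "msa", gold=2, predicted=2),
    McqItem("country_b", "dialect", gold=2, predicted=2),
]

scores = accuracy_by_variety(items)
gap = scores["msa"] - scores["dialect"]  # positive when MSA is easier
print(scores, gap)
```

A positive `gap` means the model answers more items correctly in MSA than in dialect; the same grouping could in principle be keyed on `country` to compare dialects with one another.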
Entities
Institutions
- arXiv
Locations
- Arabic-speaking countries