New Benchmark for Semantic Segmentation in Dialectal Arabic
A new multi-genre benchmark for semantic segmentation in conversational Arabic has been released, targeting the challenges of low-resource spoken dialects. It contains more than 1,000 samples drawn from four genres: transcribed informal telephone conversations, code-switched podcasts, broadcast news, and expressive dialogue from novels, all annotated and validated by native Arabic speakers. Current segmentation models perform well on Modern Standard Arabic (MSA) news but degrade considerably on dialectal transcribed speech. The study proposes a segmentation model that emphasizes local semantic coherence to improve performance on these dialects.
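To make "local semantic coherence" concrete, here is a minimal sketch of one generic way to segment by coherence between neighboring sentences. It is not the model proposed in the study, whose details are not given here. The function names (`bag_of_words`, `cosine`, `segment`), the bag-of-words representation, and the `threshold` value are all assumptions chosen for illustration; a real system would more likely compare learned sentence embeddings.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    # Lowercased token counts. \w+ is Unicode-aware in Python 3,
    # so Arabic letters match as well as Latin ones.
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    # Cosine similarity between two token-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(sentences, threshold=0.3):
    # Hypothetical rule: start a new segment wherever the similarity
    # between two adjacent sentences falls below `threshold`.
    if not sentences:
        return []
    vectors = [bag_of_words(s) for s in sentences]
    segments = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            segments.append([sentences[i]])
        else:
            segments[-1].append(sentences[i])
    return segments
```

Calling `segment` on a list of utterances returns a list of segments, each a list of consecutive utterances. Word overlap is a weak signal for dialectal or code-switched speech, which is one reason an embedding-based similarity would be the more realistic choice.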
Key facts
- New multi-genre benchmark for semantic segmentation in conversational Arabic
- Over 1,000 samples covering four genres: telephone conversations, podcasts, broadcast news, novels
- Annotated and validated by native Arabic annotators
- Existing models degrade on dialectal transcribed speech compared to MSA news
- Proposed model targets local semantic coherence
Entities
Institutions
- arXiv