MORPHOGEN Benchmark Tests Gender-Aware Morphological Generation in Multilingual LLMs
MORPHOGEN, a newly introduced benchmark dataset, measures how well multilingual large language models handle grammatical gender and morphological agreement in three typologically diverse languages: Hindi, Arabic, and French. The researchers built a large-scale synthetic dataset and used it to evaluate 15 widely used multilingual LLMs ranging from 2B to 70B parameters. The core task, GENFORM, asks a model to rewrite a first-person sentence in the opposite gender while preserving its meaning and structure. Multilingual LLMs perform well on high-level tasks such as translation and question answering, but their handling of grammatical gender, which affects verb conjugation, pronouns, and first-person constructions, has not been thoroughly examined. The study finds significant gaps in the models' gender-aware morphological generation. The work was posted to arXiv under identifier 2604.18914v1 as a cross-listed announcement.
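To make the GENFORM setup concrete, here is a minimal Python sketch of what one item and a simple scorer might look like. The `GenformItem` field names, the `normalize` helper, and the exact-match metric are all assumptions made for illustration; the summary does not state how the authors store items or score outputs. The example sentences were written for this sketch and are not drawn from MORPHOGEN.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class GenformItem:
    """One hypothetical GENFORM-style example (field names are assumptions)."""
    language: str
    source: str      # first-person sentence in the original gender
    reference: str   # the same sentence rewritten in the opposite gender


def normalize(text: str) -> str:
    """Apply Unicode NFC and collapse whitespace before comparing strings."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def exact_match_accuracy(items, predictions):
    """Fraction of predictions equal to the reference after normalization.

    Exact match is an assumed metric, chosen here only because it is simple.
    """
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    hits = sum(
        normalize(pred) == normalize(item.reference)
        for item, pred in zip(items, predictions)
    )
    return hits / len(items)


# Illustrative sentences written for this sketch, not taken from the dataset.
items = [
    GenformItem("fr", "Je suis allé au marché.", "Je suis allée au marché."),
    GenformItem("hi", "मैं बाज़ार गया।", "मैं बाज़ार गई।"),
]
predictions = [
    "Je suis allée au marché.",  # correct: participle agrees with a feminine speaker
    "मैं बाज़ार गया।",             # incorrect: verb left in the masculine form
]
print(exact_match_accuracy(items, predictions))  # 0.5
```

In both sentences the only change between source and reference is a single agreement marker (the French past participle, the Hindi verb ending), which is why a strict string comparison is a plausible, though unforgiving, way to check whether a model applied the gender change without altering the rest of the sentence.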
Key facts
- MORPHOGEN is a large-scale, morphologically grounded benchmark dataset for evaluating gender-aware generation
- It tests three typologically diverse grammatically gendered languages: French, Arabic, and Hindi
- The core task GENFORM requires rewriting first-person sentences in the opposite gender while preserving meaning
- Researchers benchmarked 15 popular multilingual LLMs ranging from 2B to 70B parameters
- The study reveals significant gaps in models' gender-aware morphological generation capabilities
- Grammatical gender influences verb conjugation, pronouns, and first-person constructions in morphologically rich languages
- The dataset is synthetic and high-quality, spanning the three target languages
- The research addresses underexplored aspects of multilingual LLM performance beyond high-level tasks
Entities
Languages
- French
- Arabic
- Hindi