MORPHOGEN Benchmark Tests Gender-Aware Morphological Generation in Multilingual LLMs

ai-technology · 2026-04-22

A newly introduced benchmark dataset, MORPHOGEN, assesses how well multilingual large language models handle grammatical gender and morphological agreement across three typologically diverse languages: Hindi, Arabic, and French. The researchers built a large-scale synthetic dataset and used it to evaluate 15 widely used multilingual LLMs ranging from 2B to 70B parameters. The primary task, GENFORM, asks a model to rewrite a first-person sentence in the opposite gender while preserving its meaning and structure. Although multilingual LLMs perform well on tasks such as translation and question answering, their handling of grammatical gender, which affects verb conjugation, pronouns, and first-person forms, has not been thoroughly examined. The study finds significant gaps in gender-aware morphological generation. The paper was posted to arXiv under identifier 2604.18914v1 as a cross-listed announcement.

Key facts

MORPHOGEN is a morphologically grounded large-scale benchmark dataset for evaluating gender-aware generation
It tests three typologically diverse grammatically gendered languages: French, Arabic, and Hindi
The core task GENFORM requires rewriting first-person sentences in the opposite gender while preserving meaning
Researchers benchmarked 15 popular multilingual LLMs ranging from 2B to 70B parameters
The study reveals significant gaps in models' gender-aware morphological generation capabilities
Grammatical gender influences verb conjugation, pronouns, and first-person constructions in morphologically rich languages
The dataset is synthetic and high-quality, spanning the three target languages
The research addresses underexplored aspects of multilingual LLM performance beyond high-level tasks
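
To make the GENFORM task concrete, the following is a minimal sketch of how a single item might be posed to a model and checked. It is not the authors' code: the prompt wording, the function names (build_genform_prompt, normalize, exact_match), and the use of normalized exact match as a score are all assumptions made for illustration, since the summary does not describe the evaluation procedure. The French example pair is hand-written and is not taken from the dataset.

```python
import unicodedata


def build_genform_prompt(sentence: str, language: str, target_gender: str) -> str:
    """Build a hypothetical instruction prompt for a GENFORM-style rewrite."""
    return (
        f"Rewrite the following {language} first-person sentence so that the "
        f"speaker is {target_gender}. Keep the meaning and structure unchanged.\n"
        f"Sentence: {sentence}\n"
        f"Rewritten:"
    )


def normalize(text: str) -> str:
    """Apply Unicode NFC, case folding, and whitespace collapsing."""
    composed = unicodedata.normalize("NFC", text)
    return " ".join(composed.casefold().split())


def exact_match(prediction: str, reference: str) -> bool:
    """Assumed scoring rule: the normalized strings must be identical."""
    return normalize(prediction) == normalize(reference)


# Hand-written illustrative item (not from MORPHOGEN): the French past
# participle agrees with the gender of a first-person subject.
item = {
    "language": "French",
    "source": "Je suis allé au marché.",     # masculine speaker
    "target_gender": "female",
    "reference": "Je suis allée au marché.",  # feminine speaker
}

prompt = build_genform_prompt(
    item["sentence"] if "sentence" in item else item["source"],
    item["language"],
    item["target_gender"],
)

# A model that copies the input unchanged fails; the correct rewrite passes.
copied_input_is_correct = exact_match(item["source"], item["reference"])
correct_rewrite_is_correct = exact_match("Je suis  allée au marché.", item["reference"])
```

Exact match is a deliberately strict choice here: the source and reference differ by a single agreement suffix, so a lenient overlap metric would give a model nearly full credit for simply copying the input.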
