RoLegalGEC Dataset Introduced for Romanian Legal Text Grammatical Error Correction

ai-technology · 2026-04-22

A new dataset, RoLegalGEC, has been developed for detecting and correcting grammatical errors in Romanian legal documents. It aggregates 350,000 examples of errors found in legal passages, each with error annotations. The dataset addresses a shortage of manually annotated data for Romanian, particularly in the specialized legal domain. Legal documents must be clear and correct, so they call for tools that can detect and correct errors in a legal context. Training such tools requires realistic legal data, which has been scarce. Synthetic generation of parallel data is a common alternative, but it demands a structured understanding of Romanian grammar. The dataset, introduced in a paper on arXiv (2604.19593v1), is presented as the first Romanian-language parallel dataset for this task in the legal domain.
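
To make the idea of synthetic parallel data concrete, here is a minimal sketch of rule-based error injection. It is not the method used to build RoLegalGEC, which the summary above does not describe. The single corruption rule (dropping Romanian diacritics), the function names `corrupt` and `make_pair`, the record fields, and the sample sentence are all assumptions chosen for illustration.

```python
import random

# Hypothetical corruption rule: replace Romanian diacritics with their
# unaccented look-alikes, a plausible kind of typing error.
DIACRITIC_MAP = {
    "ă": "a", "â": "a", "î": "i", "ș": "s", "ț": "t",
    "Ă": "A", "Â": "A", "Î": "I", "Ș": "S", "Ț": "T",
}

def corrupt(sentence, rate, rng):
    """Return (corrupted_sentence, annotations) for one clean sentence."""
    chars = []
    annotations = []
    for offset, ch in enumerate(sentence):
        # Each diacritic is dropped independently with probability `rate`.
        if ch in DIACRITIC_MAP and rng.random() < rate:
            wrong = DIACRITIC_MAP[ch]
            chars.append(wrong)
            annotations.append(
                {"offset": offset, "wrong": wrong, "right": ch, "type": "DIACRITIC"}
            )
        else:
            chars.append(ch)
    return "".join(chars), annotations

def make_pair(sentence, rate=0.5, seed=0):
    """Build one parallel example: erroneous source, clean target, annotations."""
    rng = random.Random(seed)  # seeded so the same input gives the same pair
    source, annotations = corrupt(sentence, rate, rng)
    return {"source": source, "target": sentence, "errors": annotations}

# Illustrative sentence, not taken from the dataset.
pair = make_pair("Instanța a respins cererea.", rate=1.0)
print(pair["source"])  # Instanta a respins cererea.
```

A rule like this only needs a character table. Errors such as agreement or case mistakes would need part-of-speech and morphological information, which is why synthetic generation depends on a structured understanding of the grammar.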

Key facts

The dataset is named RoLegalGEC.
It is designed for grammatical error detection and correction in Romanian legal texts.
The dataset contains 350,000 examples of errors in legal passages.
Each example includes error annotations.
It is described as the first Romanian-language parallel dataset for this specific legal domain task.
The dataset addresses a shortage of manually annotated data for Romanian, especially in niche domains.
The paper announcing it is arXiv:2604.19593v1.
Clear and correct text in legal documents is emphasized as critically important.
