ViLegalNLI: First Large-Scale Vietnamese Legal NLI Dataset

other · 2026-05-04

ViLegalNLI is presented as the first large-scale Vietnamese Natural Language Inference (NLI) dataset built specifically for the legal domain. It contains 42,012 premise-hypothesis pairs derived from official legal documents, each annotated with a binary inference label (Entailment or Non-entailment). The pairs span multiple legal domains and reflect realistic legal reasoning, including structured logic, conditional clauses, and specialized terminology. The data was produced with a semi-automatic generation framework that uses large language models to generate hypotheses and check their quality. Artifact mitigation and cross-model validation were applied to improve annotation reliability and maintain legal consistency. The dataset covers several reasoning types, including paraphrase and logical inference.
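
To make the data layout concrete, here is a minimal Python sketch of what one premise-hypothesis record with a binary label might look like, together with one plausible reading of cross-model validation: keep a pair only when enough independent validator models agree with its label. The summary does not describe the actual schema or validation procedure, so the names `NLIPair`, `cross_model_filter`, and `min_agree` are hypothetical and the agreement rule is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence

# The two labels named in the summary.
LABELS = ("Entailment", "Non-entailment")


@dataclass(frozen=True)
class NLIPair:
    """One premise-hypothesis pair (hypothetical schema, not the dataset's own)."""
    premise: str      # passage taken from a legal document
    hypothesis: str   # statement to be judged against the premise
    label: str        # "Entailment" or "Non-entailment"

    def __post_init__(self) -> None:
        if self.label not in LABELS:
            raise ValueError(f"label must be one of {LABELS}, got {self.label!r}")


# A validator stands in for a model: it maps (premise, hypothesis) to a label.
Validator = Callable[[str, str], str]


def cross_model_filter(
    pairs: Iterable[NLIPair],
    validators: Sequence[Validator],
    min_agree: int,
) -> List[NLIPair]:
    """Keep pairs whose label is confirmed by at least `min_agree` validators.

    Assumed agreement rule for illustration only; the authors' procedure
    may differ.
    """
    kept = []
    for pair in pairs:
        votes = sum(
            1 for validate in validators
            if validate(pair.premise, pair.hypothesis) == pair.label
        )
        if votes >= min_agree:
            kept.append(pair)
    return kept
```

In practice each validator would wrap a call to a separate language model, and `min_agree` would set the trade-off between dataset size and label reliability.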

Key facts

ViLegalNLI is the first large-scale Vietnamese NLI dataset for the legal domain.
Dataset contains 42,012 premise-hypothesis pairs.
Pairs derived from official statutory documents.
Annotated with binary labels: Entailment and Non-entailment.
Covers multiple legal domains.
Reflects realistic legal reasoning with structured logic and conditional clauses.
Semi-automatic data generation framework uses large language models.
Includes artifact mitigation and cross-model validation for reliability.

Sources

arXiv cs.AI — 2026-05-04