Cross-Temporal Legal NLP Benchmarks Show Severe Performance Decay

other · 2026-05-26

A recent study tests the assumption of stationarity in legal NLP by evaluating transformer models on Ukrainian court rulings from three distinct geopolitical periods: pre-war (2008–2013), hybrid war (2014–2021), and full-scale invasion (2022–2026). The researchers fine-tuned four transformer encoders (XLM-RoBERTa base and large, along with their legal-domain adaptations) on one period at a time and evaluated each on all three, producing a 3×3 cross-temporal generalization matrix. The findings show substantial forward degradation: models trained on pre-war data lose up to 27.2 percentage points of macro-F1 when applied to full-scale invasion rulings. Backward transfer, from full-scale invasion data to pre-war rulings, is more robust, which is consistent with the view that legal language is additive, with later text building on earlier usage. Legal-domain pretraining offered only limited benefit over the general-domain models, and the results point to a need for temporal awareness in legal AI systems.
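For readers unfamiliar with the metric, the following is a minimal sketch of macro-F1 and of a drop expressed in percentage points. It is not the study's code: the function names `macro_f1` and `drop_in_points` are hypothetical, and the sketch assumes single-label classification, which the summary does not confirm.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list.

    Assumption: one gold label and one predicted label per ruling.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted label was wrong
            fn[gold] += 1  # gold label was missed

    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def drop_in_points(reference_f1, shifted_f1):
    """Difference between two F1 scores (each in [0, 1]) in percentage points."""
    return 100.0 * (reference_f1 - shifted_f1)
```

Because macro-F1 weights every class equally, a rare class that collapses under temporal shift lowers the score as much as a frequent one would, which makes the metric sensitive to the kind of decay described above.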

Key facts

Study tests stationarity assumption in legal NLP using Ukrainian court decisions.
Three temporal epochs defined by geopolitical disruptions: pre-war (2008–2013), hybrid war (2014–2021), full-scale invasion (2022–2026).
Four transformer encoders tested: XLM-RoBERTa base, XLM-RoBERTa large, and their legal-domain variants.
Models trained on one epoch and evaluated on all three (3×3 cross-temporal matrix).
Forward degradation: pre-war trained models lose up to 27.2 percentage points macro-F1 on full-scale invasion data.
Backward transfer (full-scale to pre-war) is more robust than forward transfer.
Legal-domain pretraining showed limited benefit over general-domain models.
Results suggest legal language is additive and non-stationary.
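The train-on-one, evaluate-on-all protocol listed above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `train_fn`, `eval_fn`, the epoch keys, and the `"train"`/`"test"` split names are hypothetical placeholders, and measuring each gap against the in-period (diagonal) score is an assumed baseline that the summary does not specify.

```python
EPOCHS = ["pre_war", "hybrid_war", "full_scale"]  # hypothetical keys, chronological


def cross_temporal_matrix(train_fn, eval_fn, data_by_epoch, epochs=EPOCHS):
    """Train one model per epoch and score it on every epoch's test split.

    train_fn(train_data) -> model and eval_fn(model, test_data) -> score
    are caller-supplied callables (placeholders for fine-tuning and macro-F1).
    Returns a dict keyed by (train_epoch, test_epoch).
    """
    matrix = {}
    for train_epoch in epochs:
        model = train_fn(data_by_epoch[train_epoch]["train"])
        for test_epoch in epochs:
            matrix[(train_epoch, test_epoch)] = eval_fn(
                model, data_by_epoch[test_epoch]["test"]
            )
    return matrix


def transfer_gaps(matrix, epochs=EPOCHS):
    """Split off-diagonal cells into forward and backward transfer gaps.

    Each gap is the in-period score minus the cross-period score, in
    percentage points (assumed baseline: the diagonal cell of the same row).
    """
    order = {epoch: i for i, epoch in enumerate(epochs)}
    gaps = {"forward": {}, "backward": {}}
    for (train_epoch, test_epoch), score in matrix.items():
        if train_epoch == test_epoch:
            continue  # diagonal cells are the in-period reference
        later = order[test_epoch] > order[train_epoch]
        direction = "forward" if later else "backward"
        in_period = matrix[(train_epoch, train_epoch)]
        gaps[direction][(train_epoch, test_epoch)] = 100.0 * (in_period - score)
    return gaps
```

With three epochs the matrix has nine cells: three in-period scores on the diagonal, three forward-transfer cells (trained earlier, tested later), and three backward-transfer cells. Comparing the two off-diagonal groups is what lets forward and backward robustness be contrasted.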
