ARTFEED — Contemporary Art Intelligence

Cross-Temporal Legal NLP Benchmarks Show Severe Performance Decay

other · 2026-05-26

A recent study challenges the stationarity assumption in legal NLP by evaluating transformer models on Ukrainian court rulings from three geopolitical periods: pre-war (2008–2013), hybrid war (2014–2021), and full-scale invasion (2022–2026). The researchers fine-tuned four transformer encoders (XLM-RoBERTa base and large, plus their legal-domain adaptations) on each period separately and evaluated every model on all three periods, producing a 3×3 cross-temporal generalization matrix. The results show pronounced forward degradation: models trained on pre-war data lose up to 27.2 percentage points of macro-F1 when applied to full-scale invasion rulings. Backward transfer, from full-scale invasion data to pre-war rulings, is notably more robust, which is consistent with the idea that legal language is additive, with later text building on earlier usage rather than replacing it. Legal-domain pretraining yielded only limited gains over the general-domain models, underscoring the need for temporal awareness in legal AI systems.

Key facts

  • Study tests stationarity assumption in legal NLP using Ukrainian court decisions.
  • Three time periods defined by geopolitical disruptions: pre-war (2008–2013), hybrid war (2014–2021), full-scale invasion (2022–2026).
  • Four transformer encoders tested: XLM-RoBERTa base, XLM-RoBERTa large, and their legal-domain variants.
  • Each model is fine-tuned on one period and evaluated on all three (3×3 cross-temporal matrix).
  • Forward degradation: models trained on pre-war data lose up to 27.2 percentage points of macro-F1 on full-scale invasion data.
  • Backward transfer (full-scale to pre-war) is more robust than forward transfer.
  • Legal-domain pretraining showed limited benefit over general-domain models.
  • Results suggest legal language is additive and non-stationary.
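The evaluation protocol summarized above can be illustrated with a minimal Python sketch. It is not the study's code: the period keys, the `models` and `test_sets` structures, and the helper names `cross_temporal_matrix` and `drop_pp` are hypothetical, and measuring the drop against the model's in-period score is an assumption, since the summary does not state which baseline the 27.2-point figure uses.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical period keys, in chronological order.
PERIODS = ["pre_war", "hybrid_war", "full_scale"]


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cross_temporal_matrix(
    models: Dict[str, Callable[[List[str]], List[int]]],
    test_sets: Dict[str, Tuple[List[str], List[int]]],
) -> Dict[Tuple[str, str], float]:
    """Score the model trained on each period against every period's test set.

    models[p] is assumed to be a callable mapping texts to predicted labels
    for a model already fine-tuned on period p.
    """
    matrix = {}
    for train_p in PERIODS:
        for test_p in PERIODS:
            texts, gold = test_sets[test_p]
            matrix[(train_p, test_p)] = macro_f1(gold, models[train_p](texts))
    return matrix


def drop_pp(matrix: Dict[Tuple[str, str], float], train_p: str, test_p: str) -> float:
    """In-period macro-F1 minus cross-period macro-F1, in percentage points.

    Assumption: the in-period score is the baseline for the drop.
    """
    return 100.0 * (matrix[(train_p, train_p)] - matrix[(train_p, test_p)])
```

With such a matrix, forward degradation corresponds to `drop_pp(matrix, "pre_war", "full_scale")` and backward transfer to `drop_pp(matrix, "full_scale", "pre_war")`. In practice, scikit-learn's `f1_score(y_true, y_pred, average="macro")` could replace the hand-written `macro_f1`.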

Entities

Locations

  • Ukraine

Sources