ARTFEED — Contemporary Art Intelligence

SCARV: Stable Sample Ranking for Redundant NLP Datasets

other · 2026-05-06

SCARV (Structure-Constrained Aggregation for Stable Sample Ranking in Redundant NLP Datasets) is a new framework that tackles unstable sample-level rankings in NLP datasets containing duplicates and near-duplicates. Existing methods score training examples pointwise, treating them as independent, which breaks down under redundancy. SCARV combines robust multi-seed aggregation with structure-aware aggregation over redundancy clusters. It was evaluated on synthetic redundancy, naturally occurring QQP redundancy, several proxy families, multiple NLP tasks, and full DistilBERT fine-tuning. The paper is available on arXiv under ID 2605.00944.
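The two aggregation steps can be sketched in a few lines. Everything below is a hypothetical illustration (function name, score layout, and the choice of median/mean aggregators are assumptions, not the paper's exact method): per-seed proxy scores are first collapsed with a robust statistic across seeds, then each example inherits its redundancy cluster's aggregate score before ranking.

```python
from collections import defaultdict
from statistics import mean, median


def scarv_style_rank(scores_by_seed, clusters):
    """Hypothetical sketch of SCARV-style aggregation.

    scores_by_seed: {seed: {example_id: proxy_score}}
    clusters:       {example_id: cluster_id} (redundancy clusters)
    Returns example ids ranked by the aggregated score, descending.
    """
    ids = list(next(iter(scores_by_seed.values())))

    # Step 1: robust multi-seed aggregation (median across seeds,
    # chosen here as one plausible robust statistic).
    seed_agg = {i: median(s[i] for s in scores_by_seed.values()) for i in ids}

    # Step 2: structure-aware aggregation over redundancy clusters:
    # every member of a cluster shares one score (here, the cluster mean),
    # so near-duplicates cannot scatter across the ranking.
    members = defaultdict(list)
    for i, c in clusters.items():
        members[c].append(i)
    cluster_score = {c: mean(seed_agg[i] for i in m) for c, m in members.items()}

    final = {i: cluster_score[clusters[i]] for i in ids}
    return sorted(ids, key=lambda i: -final[i])
```

With two seeds and examples "a" and "c" in the same cluster, the duplicates end up adjacent in the ranking regardless of per-seed noise, which is the behavior the framework targets.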

Key facts

  • SCARV stands for Structure-Constrained Aggregation for Stable Sample Ranking in Redundant NLP Datasets
  • Addresses unstable sample-level rankings due to duplicates, near-duplicates, and paraphrases
  • Combines multi-seed aggregation with structure-aware aggregation over redundancy clusters
  • Tested on synthetic and QQP redundancy, multiple proxy families, several NLP tasks, and DistilBERT fine-tuning
  • Paper available on arXiv with ID 2605.00944
  • Existing pipelines score examples pointwise and assume independence
  • Stochastic training causes unstable relative orderings across random seeds
  • SCARV is a modular aggregation framework operating on top of an existing scoring proxy
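The instability that motivates the framework, unstable relative orderings across random seeds, can be quantified with a rank correlation between two seeds' rankings. The helper below is a plain stdlib Spearman implementation offered as an illustration; it is not the paper's evaluation metric, and it assumes untied scores.

```python
def spearman(x, y):
    """Spearman rank correlation between two score lists (no ties assumed).

    Values near 1.0 mean two seeds produced nearly identical orderings;
    lower values indicate the seed-to-seed instability SCARV addresses.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Computing this over the per-example scores from each pair of seeds gives a simple picture of how stable a scoring proxy's ranking is before and after aggregation.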

Entities

Institutions

  • arXiv
