SCARV: Stable Sample Ranking for Redundant NLP Datasets
SCARV (Structure-Constrained Aggregation for Stable Sample Ranking in Redundant NLP Datasets) is a framework that addresses unstable sample-level rankings in NLP datasets containing duplicates and near-duplicates. Existing pipelines score training examples pointwise and treat them as independent, which breaks down in the presence of redundancy. SCARV combines robust multi-seed aggregation with structure-aware aggregation across redundancy clusters, operating as a modular layer on top of an existing scoring proxy. It was evaluated on synthetic redundancy, naturally occurring QQP redundancy, multiple proxy families, several NLP tasks, and full DistilBERT fine-tuning. The paper is available on arXiv under ID 2605.00944.
Key facts
- SCARV stands for Structure-Constrained Aggregation for Stable Sample Ranking in Redundant NLP Datasets
- Addresses unstable sample-level rankings due to duplicates, near-duplicates, and paraphrases
- Combines multi-seed aggregation with structure-aware aggregation over redundancy clusters
- Tested on synthetic and QQP redundancy, multiple proxy families, several NLP tasks, and DistilBERT fine-tuning
- Paper available on arXiv with ID 2605.00944
- Existing pipelines score examples pointwise and assume independence
- Stochastic training causes unstable relative orderings across random seeds
- SCARV is a modular aggregation framework operating on top of an existing scoring proxy
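The two-stage aggregation described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, the use of plain means for both stages, and the input shapes are all assumptions made here for clarity.

```python
import numpy as np

def aggregate_scores(scores, clusters):
    """Hypothetical sketch of two-stage SCARV-style aggregation.

    scores:   array of shape (n_seeds, n_examples) with proxy scores
              from independent training runs (random seeds).
    clusters: sequence of length n_examples assigning each example
              to a redundancy-cluster id.
    Returns one stabilized score per example.
    """
    scores = np.asarray(scores, dtype=float)
    clusters = np.asarray(clusters)

    # Stage 1: multi-seed aggregation — average each example's
    # proxy score across seeds to damp stochastic-training noise.
    seed_avg = scores.mean(axis=0)

    # Stage 2: structure-aware aggregation — pool scores within
    # each redundancy cluster so duplicates share one ranking score.
    out = np.empty_like(seed_avg)
    for c in np.unique(clusters):
        mask = clusters == c
        out[mask] = seed_avg[mask].mean()
    return out
```

For example, with two seeds, four examples, and two redundancy clusters, the duplicates in each cluster end up with identical scores, yielding a ranking that is stable under both seed noise and duplication.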