SCARV: Stable Sample Ranking for Redundant NLP Datasets
SCARV (Structure-Constrained Aggregation for Stable Sample Ranking in Redundant NLP Datasets) is a framework that addresses unstable sample-level rankings in NLP datasets containing duplicates and near-duplicates. Existing pipelines score training examples pointwise and treat them as independent, which breaks down in the presence of redundancy. SCARV combines robust multi-seed aggregation with structure-aware aggregation across redundancy clusters, operating as a modular layer on top of an existing scoring proxy. It was evaluated on synthetic redundancy, naturally occurring QQP redundancy, multiple proxy families, several NLP tasks, and full DistilBERT fine-tuning. The paper is available on arXiv under ID 2605.00944.
Key facts
- SCARV stands for Structure-Constrained Aggregation for Stable Sample Ranking in Redundant NLP Datasets
- Addresses unstable sample-level rankings due to duplicates, near-duplicates, and paraphrases
- Combines multi-seed aggregation with structure-aware aggregation over redundancy clusters
- Tested on synthetic and QQP redundancy, multiple proxy families, several NLP tasks, and DistilBERT fine-tuning
- Paper available on arXiv with ID 2605.00944
- Existing pipelines score examples pointwise and assume independence
- Stochastic training causes unstable relative orderings across random seeds
- SCARV is a modular aggregation framework operating on top of an existing scoring proxy
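The two-stage aggregation described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, the use of plain means for both stages, and the input shapes are all assumptions made here for clarity.

```python
import numpy as np

def aggregate_scores(scores, clusters):
    """Hypothetical sketch of two-stage SCARV-style aggregation.

    scores:   array of shape (n_seeds, n_examples) with proxy scores
              from independent training runs (random seeds).
    clusters: sequence of length n_examples assigning each example
              to a redundancy-cluster id.
    Returns one stabilized score per example.
    """
    scores = np.asarray(scores, dtype=float)
    clusters = np.asarray(clusters)

    # Stage 1: multi-seed aggregation — average each example's
    # proxy score across seeds to damp stochastic-training noise.
    seed_avg = scores.mean(axis=0)

    # Stage 2: structure-aware aggregation — pool scores within
    # each redundancy cluster so duplicates share one ranking score.
    out = np.empty_like(seed_avg)
    for c in np.unique(clusters):
        mask = clusters == c
        out[mask] = seed_avg[mask].mean()
    return out
```

For example, with two seeds, four examples, and two redundancy clusters, the duplicates in each cluster end up with identical scores, yielding a ranking that is stable under both seed noise and duplication.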