CDVM: Optimizing Data Pruning in Low-Data Environments
A new paper on arXiv introduces Constraint-Data-Value-Maximization (CDVM), a method for effective data pruning when only a small fraction of the training data is retained. The authors demonstrate that Shapley-based data values are suboptimal for pruning low-value data in low-data scenarios. CDVM frames pruning as a constrained optimization problem that maximizes total influence while penalizing excessive per-test contributions, and it achieves robust performance on the OpenDataVal benchmark.
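To make the idea of "maximizing total influence while penalizing excessive per-test contributions" concrete, here is a minimal Python sketch. It is not the paper's formulation or solver: the influence matrix, the `cap` and `penalty` parameters, and the greedy selection loop are all illustrative assumptions. The sketch only shows why a per-test penalty can keep a different subset than ranking training points by their summed value.

```python
import numpy as np


def penalized_objective(contrib, cap, penalty):
    """Total influence minus a penalty on per-test contributions above `cap`.

    `cap` and `penalty` are hypothetical knobs chosen for illustration.
    """
    excess = np.maximum(contrib - cap, 0.0)
    return contrib.sum() - penalty * excess.sum()


def greedy_keep(influence, k, cap, penalty):
    """Greedily pick k training points to keep.

    influence[i, j] is an assumed attribution score of training point i
    on test point j. A greedy loop stands in for a proper constrained
    optimizer here.
    """
    n_train, n_test = influence.shape
    kept = []
    contrib = np.zeros(n_test)
    remaining = set(range(n_train))
    for _ in range(k):
        best_i, best_val = None, -np.inf
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            val = penalized_objective(contrib + influence[i], cap, penalty)
            if val > best_val:
                best_i, best_val = i, val
        kept.append(best_i)
        remaining.remove(best_i)
        contrib += influence[best_i]
    return kept


# Toy data: three training points all support test 0, one supports test 1.
influence = np.array([
    [1.0, 0.0],
    [0.9, 0.0],
    [0.8, 0.0],
    [0.0, 0.5],
])

# Baseline: keep the two points with the largest summed value.
naive_kept = np.argsort(-influence.sum(axis=1))[:2].tolist()

# Penalized selection: extra influence on an already covered test point
# is discounted, so the point that covers test 1 is kept instead.
kept = greedy_keep(influence, k=2, cap=1.0, penalty=1.0)
```

In this toy case the summed-value ranking keeps two points that both serve test 0 and leaves test 1 uncovered, while the penalized selection keeps one point per test. That is the kind of failure under heavy pruning that the paper attributes to ranking by individual data values.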
Key facts
- arXiv paper 2605.11312 introduces CDVM.
- CDVM addresses data pruning in low-data environments.
- Shapley-based data values are suboptimal for low-data pruning.
- CDVM casts pruning as a constrained optimization problem.
- It maximizes total influence while penalizing excessive per-test contributions.
- CDVM shows strong performance on the OpenDataVal benchmark.
- The paper is from arXiv, published in 2025.
- Data attribution is the broader research field.
Entities
Institutions
- arXiv
- OpenDataVal