CRAFT: A New Method for Efficient Training Data Selection
Researchers have introduced CRAFT (Clustered Regression for Adaptive Filtering of Training data), a vectorization-agnostic method for selecting high-quality subsets of large datasets for fine-tuning sequence-to-sequence models. Selection proceeds in two stages. First, CRAFT matches the validation source distribution by allocating the selection budget proportionally across k-means clusters. Second, within each cluster, it selects the training pairs that minimize a conditional expected distance derived from the validation target distribution. The proportional cluster allocation bounds the continuous KL divergence between the selected and validation distributions, with the residual controlled by the cluster diameters. CRAFT is evaluated on English language tasks and targets the growing need for efficient fine-tuning as corpora reach tens of millions of datapoints.
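The two-stage selection described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that k-means is fitted on the validation source embeddings, that training pairs are assigned to the nearest centroid, and that the "conditional expected distance" is approximated by the mean Euclidean distance from a candidate's target embedding to the validation target embeddings in the same cluster. All function names (`kmeans`, `nearest`, `proportional_budgets`, `select_subset`) are hypothetical, and any embedding method can supply the input arrays.

```python
import numpy as np


def nearest(x, centroids):
    """Index of the closest centroid for each row of x."""
    d = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)


def kmeans(x, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = nearest(x, centroids)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, nearest(x, centroids)


def proportional_budgets(val_labels, k, budget):
    """Stage 1: split `budget` across clusters in proportion to validation
    counts, using largest-remainder rounding so the parts sum to `budget`."""
    counts = np.bincount(val_labels, minlength=k)
    exact = budget * counts / counts.sum()
    base = np.floor(exact).astype(int)
    leftover = budget - base.sum()
    order = np.argsort(-(exact - base), kind="stable")
    base[order[:leftover]] += 1
    return base


def select_subset(train_src, train_tgt, val_src, val_tgt, budget, k):
    """Return sorted indices of the selected training pairs."""
    # Assumption: clusters are defined by the validation source embeddings.
    centroids, val_labels = kmeans(val_src, k)
    train_labels = nearest(train_src, centroids)
    budgets = proportional_budgets(val_labels, k, budget)

    chosen = []
    for j in range(k):
        cand = np.flatnonzero(train_labels == j)
        ref = val_tgt[val_labels == j]
        if budgets[j] == 0 or len(cand) == 0 or len(ref) == 0:
            continue
        # Stage 2 (assumed proxy): mean distance from each candidate target
        # to the validation targets that fall in the same cluster.
        d = np.linalg.norm(train_tgt[cand][:, None, :] - ref[None, :, :], axis=2)
        score = d.mean(axis=1)
        keep = cand[np.argsort(score, kind="stable")[: budgets[j]]]
        chosen.extend(keep.tolist())
    return sorted(chosen)
```

If a cluster holds fewer training candidates than its budget, this sketch simply returns fewer items for that cluster. How the original method handles such shortfalls is not stated in the summary.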
Key facts
- CRAFT stands for Clustered Regression for Adaptive Filtering of Training data.
- It is a vectorization-agnostic selection method for sequence-to-sequence models.
- The method decomposes the joint source-target distribution.
- Selection involves two stages: proportional budget allocation across k-means clusters, followed by minimization of a conditional expected distance within each cluster.
- Proportional cluster allocation bounds continuous KL divergence between selected and validation distributions.
- The residual is controlled by cluster diameters.
- CRAFT is evaluated on English language tasks.
- The method addresses the challenge of fine-tuning on large corpora with tens of millions of datapoints.
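One way to read the decomposition of the joint source-target distribution is through the chain rule for KL divergence. The identity below is a standard result and is not quoted from the CRAFT paper. The notation is assumed: x is a source, y is a target, p is the validation distribution, and q is the distribution of the selected subset. The direction of the divergence is also an assumption, since the summary does not specify it.

```latex
\[
D_{\mathrm{KL}}\bigl(p(x, y) \,\|\, q(x, y)\bigr)
  = D_{\mathrm{KL}}\bigl(p(x) \,\|\, q(x)\bigr)
  + \mathbb{E}_{x \sim p(x)}\Bigl[ D_{\mathrm{KL}}\bigl(p(y \mid x) \,\|\, q(y \mid x)\bigr) \Bigr]
\]
```

Under this reading, the first term concerns the source marginal, which the proportional cluster allocation targets, and the second term concerns the conditional target distribution, which the within-cluster selection targets.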