GEM: Geometric Entropy Mixing for Optimal LLM Data Curation
The recently introduced GEM (Geometric Entropy Mixing) framework recasts the curation of pre-training data for large language models (LLMs) as a variational problem on the hypersphere with a mixing-balance regularizer. It addresses limitations of human-defined taxonomies and Euclidean clustering by decoupling the generative prior and optimizing with a provable MM algorithm. GEM uses teacher-student distillation to scale to web-scale corpora and introduces the Geometric Influence Score (GIS) for generating interpretable taxonomies. In experiments with 1.1B-parameter models, the authors report a new state of the art.
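To make the hypersphere framing concrete, the sketch below fits a soft clustering of unit-normalized embeddings with an EM loop, a standard member of the MM (minorize-maximize) family. It is not GEM's objective or algorithm, which this summary does not spell out. The function name `fit_spherical_mixture`, the fixed concentration `kappa`, and the Dirichlet-style pseudo-count `alpha` (used here as a simple stand-in for a mixing-balance regularizer) are all assumptions made for illustration.

```python
import numpy as np


def normalize_rows(x):
    """Project each row onto the unit hypersphere (guarding against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def fit_spherical_mixture(x, k, kappa=10.0, alpha=1.0, iters=50, seed=0):
    """Illustrative EM for a mixture of directions with shared, fixed kappa.

    Hypothetical stand-in, not the GEM algorithm:
      - kappa: assumed fixed concentration shared by all clusters
      - alpha: assumed pseudo-count pulling mixing weights toward uniform
    """
    rng = np.random.default_rng(seed)
    x = normalize_rows(np.asarray(x, dtype=float))
    n = x.shape[0]

    # Initialize cluster directions from k distinct data points.
    mu = x[rng.choice(n, size=k, replace=False)]
    pi = np.full(k, 1.0 / k)

    for _ in range(iters):
        # E-step: soft assignments from cosine similarity and mixing weights.
        logits = kappa * (x @ mu.T) + np.log(pi)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: the weighted mean direction, renormalized onto the sphere,
        # maximizes the surrogate objective for each cluster direction.
        mu = normalize_rows(resp.T @ x)

        # MAP update of mixing weights; larger alpha gives more balanced weights.
        counts = resp.sum(axis=0)
        pi = (counts + alpha) / (n + k * alpha)

    return mu, pi, resp


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    embeddings = rng.normal(size=(200, 16))  # placeholder for document embeddings
    mu, pi, resp = fit_spherical_mixture(embeddings, k=4)
    print("mixing weights:", np.round(pi, 3))
```

In this toy setting, raising `alpha` pushes the mixing weights toward uniform, which gives the flavor of a balance term: no single cluster is allowed to absorb most of the corpus.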
Key facts
- GEM reformulates data curation as a variational problem on the hypersphere.
- It uses a mixing-balance regularizer.
- It decouples the generative prior and optimizes via a provable MM algorithm.
- It employs teacher-student distillation for web-scale corpora.
- It introduces the Geometric Influence Score (GIS) for interpretable taxonomy generation.
- Experiments were conducted with 1.1B-parameter models.
- GEM is reported to establish a new state of the art in these experiments.
- The paper is available on arXiv with ID 2605.26121.
Entities
Institutions
- arXiv