GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

ai-technology · 2026-05-27

The recently introduced GEM (Geometric Entropy Mixing) framework recasts the curation of pre-training data for large language models (LLMs) as a variational problem on the hypersphere with a mixing-balance regularizer. It aims to overcome the limitations of human-defined taxonomies and Euclidean clustering by decoupling the generative prior and optimizing with a provable MM algorithm. GEM uses teacher-student distillation to scale to web-scale corpora and introduces the Geometric Influence Score (GIS) for generating interpretable taxonomies. In experiments with 1.1B-parameter models, GEM establishes a new state of the art.

Key facts

GEM reformulates data curation as a variational problem on the hypersphere.
It uses a mixing-balance regularizer.
It decouples the generative prior and optimizes via a provable MM algorithm.
It employs teacher-student distillation for web-scale corpora.
It introduces the Geometric Influence Score (GIS) for interpretable taxonomy generation.
Experiments were conducted with 1.1B-parameter models.
GEM establishes a new state of the art in these experiments.
The paper is available on arXiv with ID 2605.26121.
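
The summary above gives no formulas, so the following is only a rough illustration of the general idea of clustering on the hypersphere with a balance term, not GEM's actual objective or algorithm. It is a minimal sketch under stated assumptions: documents are represented by embedding vectors, clusters are modeled as a mixture of von Mises-Fisher-style components with a shared, fixed concentration `kappa`, and balance is encouraged by a symmetric Dirichlet-style smoothing term `alpha` on the mixing weights. The function name `balanced_spherical_em` and all parameters are hypothetical. The update is standard MAP expectation-maximization, which is one instance of an MM (majorize-minimize / minorize-maximize) scheme.

```python
import numpy as np


def normalize_rows(x):
    """Project each row onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def balanced_spherical_em(x, k, kappa=10.0, alpha=1.0, iters=50, seed=0):
    """Soft clustering of unit vectors with smoothed mixing weights.

    Hypothetical illustration only; this is NOT the GEM algorithm.
    x     : (n, d) array of embeddings (normalized internally)
    k     : number of clusters
    kappa : shared, fixed concentration (assumption)
    alpha : smoothing strength pulling mixing weights toward uniform (assumption)
    """
    rng = np.random.default_rng(seed)
    x = normalize_rows(np.asarray(x, dtype=float))
    n = x.shape[0]

    # Initialize cluster directions from k distinct data points.
    mu = x[rng.choice(n, size=k, replace=False)]
    pi = np.full(k, 1.0 / k)

    for _ in range(iters):
        # E-step: responsibilities from cosine similarity and mixing weights.
        logits = kappa * (x @ mu.T) + np.log(pi)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: smoothed mixing weights (MAP under a symmetric Dirichlet prior).
        counts = resp.sum(axis=0)
        pi = (counts + alpha) / (n + k * alpha)

        # M-step: mean directions, re-projected onto the sphere.
        # The tiny multiple of the old direction guards against a zero vector.
        mu = normalize_rows(resp.T @ x + 1e-12 * mu)

    return mu, pi, resp
```

Calling `balanced_spherical_em(embeddings, k=8)` returns the cluster directions, the mixing weights, and the per-document soft assignments. Larger values of `alpha` push the mixing weights closer to uniform, which is the sense in which this sketch is "balanced"; how GEM's own mixing-balance regularizer is defined is specified in the paper, not here.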
