ARTFEED — Contemporary Art Intelligence

GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

ai-technology · 2026-05-27

The recently introduced GEM (Geometric Entropy Mixing) framework recasts the curation of pre-training data for large language models (LLMs) as a variational problem on the hypersphere with a mixing-balance regularizer. According to the paper, this addresses limitations of human-defined taxonomies and Euclidean clustering by decoupling the generative prior and optimizing with a provable MM algorithm. GEM uses teacher-student distillation to scale to web-scale corpora and introduces the Geometric Influence Score (GIS) for generating interpretable taxonomies. Experiments with 1.1B-parameter models are reported to establish a new state of the art.

Key facts

  • GEM reformulates data curation as a variational problem on the hypersphere.
  • It uses a mixing-balance regularizer.
  • It decouples the generative prior and optimizes via a provable MM algorithm.
  • It employs teacher-student distillation for web-scale corpora.
  • It introduces the Geometric Influence Score (GIS) for interpretable taxonomy generation.
  • Experiments were conducted with 1.1B-parameter models.
  • GEM is reported to establish a new state of the art in these experiments.
  • The paper is available on arXiv with ID 2605.26121.
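
To make the first three facts above concrete, here is a minimal sketch of clustering unit-normalized embeddings with a balance term on the mixing weights. It is not GEM's algorithm, which this summary does not spell out. The function name fit_balanced_sphere_mixture, the parameters kappa (concentration) and lam (balance strength), and the KL-to-uniform penalty are all assumptions chosen for illustration; the penalty was picked only because it gives a closed-form update. The alternation is EM-style, which is one instance of an MM scheme.

```python
import math
import random


def normalize(v):
    """Project a nonzero vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_balanced_sphere_mixture(points, k, kappa=10.0, lam=5.0, iters=50, seed=0):
    """EM-style fit of k directions on the unit sphere with a balance penalty.

    Hypothetical illustration only: kappa, lam, and the KL-to-uniform
    penalty on the mixing weights are assumptions, not taken from GEM.
    """
    rng = random.Random(seed)
    xs = [normalize(p) for p in points]
    n, dim = len(xs), len(xs[0])
    centers = [list(c) for c in rng.sample(xs, k)]
    weights = [1.0 / k] * k

    for _ in range(iters):
        # E-step: responsibilities proportional to weight * exp(kappa * cosine).
        resp = []
        for x in xs:
            logits = [
                math.log(max(weights[j], 1e-300)) + kappa * dot(centers[j], x)
                for j in range(k)
            ]
            top = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(v - top) for v in logits]
            total = sum(exps)
            resp.append([e / total for e in exps])

        # M-step (directions): renormalized responsibility-weighted mean.
        for j in range(k):
            mean = [sum(resp[i][j] * xs[i][d] for i in range(n)) for d in range(dim)]
            if any(abs(m) > 1e-12 for m in mean):
                centers[j] = normalize(mean)

        # M-step (mixing weights): soft counts shrunk toward the uniform mix.
        counts = [sum(resp[i][j] for i in range(n)) for j in range(k)]
        weights = [(counts[j] + lam / k) / (n + lam) for j in range(k)]

    return centers, weights
```

In this sketch, a larger lam pulls the mixing weights toward 1/k, while lam = 0 reduces the weight update to plain soft counts. Cosine similarity replaces Euclidean distance because all points are first projected onto the unit sphere.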

Entities

Institutions

  • arXiv

Sources