GraphSculptor: Efficient Pre-Training via Coreset Selection
Graph self-supervised learning (SSL) typically requires large unlabeled datasets, making pre-training computationally expensive. These datasets are often highly redundant: uniformly subsampling 50% of the graphs retains over 96% of downstream performance. GraphSculptor addresses this by constructing pre-training coresets without labels, drawing on two complementary views: intrinsic structure and contextual semantics. Structural diversity is quantified with intrinsic graph statistics, yielding a feature vector per graph, while semantic diversity is captured by converting each graph to a text description and embedding it with a pre-trained language model. The strategy is label-free, requiring neither extra training-time signals nor reliance on topological statistics alone. Details appear in the arXiv paper (2605.01310).
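As a concrete illustration of the structural view, the sketch below builds a feature vector from intrinsic graph statistics. The paper does not specify which statistics GraphSculptor uses; the ones here (size, edge count, density, degree moments) are illustrative assumptions, computed from a plain edge list.

```python
from statistics import mean, pstdev

def structural_features(num_nodes, edges):
    """Sketch: a structural feature vector from intrinsic graph statistics.

    The exact statistics used by GraphSculptor are not specified in the
    summary; these (size, density, degree moments) are assumed examples.
    """
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n, m = num_nodes, len(edges)
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    return [
        float(n),                        # graph size
        float(m),                        # edge count
        density,                         # edge density
        mean(degree) if degree else 0.0, # average degree
        pstdev(degree) if degree else 0.0,  # degree spread
        float(max(degree)) if degree else 0.0,  # hub strength
    ]

# Example: a 4-node path graph 0-1-2-3
feats = structural_features(4, [(0, 1), (1, 2), (2, 3)])
```

Vectors like this can then be compared across graphs to assess how much structural variety a candidate coreset covers.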
Key facts
- Graph self-supervised learning relies on large unlabeled datasets.
- Uniformly subsampling 50% of graphs retains over 96% of downstream performance.
- GraphSculptor constructs pre-training coresets without labels.
- It uses intrinsic structure and contextual semantics.
- Structural diversity is quantified via intrinsic graph statistics.
- Semantic diversity uses a pre-trained language model on graph-to-text descriptions.
- The method is label-free and avoids additional training-time signals.
- Paper available on arXiv with ID 2605.01310.
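Given per-graph feature vectors (structural, semantic, or a concatenation of both), a diversity-seeking coreset can be selected without labels. The greedy k-center selector below is one common label-free strategy; whether GraphSculptor uses exactly this selector is an assumption made for illustration.

```python
import math

def k_center_greedy(features, k):
    """Sketch: greedy k-center coreset selection over feature vectors.

    Picks k graphs whose features spread out to cover the feature space.
    Assumed for illustration; the summary does not name GraphSculptor's
    actual selection rule.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    selected = [0]  # seed with the first graph
    # Distance from every point to its nearest selected point so far
    min_d = [dist(f, features[0]) for f in features]
    while len(selected) < k:
        # Add the point farthest from the current selection
        far = max(range(len(features)), key=lambda i: min_d[i])
        selected.append(far)
        for i, f in enumerate(features):
            min_d[i] = min(min_d[i], dist(f, features[far]))
    return selected

# Example: two tight clusters; k=2 picks one representative from each
subset = k_center_greedy([[0.0], [1.0], [10.0], [11.0]], k=2)
```

Running SSL pre-training on the selected subset, rather than the full corpus, is what yields the computational savings described above.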