GraphSculptor: Efficient Pre-Training via Coreset Selection
Graph self-supervised learning (SSL) typically requires large unlabeled datasets, making pre-training computationally expensive. These datasets are often highly redundant: uniformly subsampling 50% of the graphs retains over 96% of downstream performance. GraphSculptor addresses this by constructing pre-training coresets without labels, drawing on two complementary views: intrinsic structure and contextual semantics. Structural diversity is quantified with intrinsic graph statistics, yielding a feature vector per graph, while semantic diversity is captured by converting each graph to a text description and embedding it with a pre-trained language model. The strategy is label-free, requiring neither extra training-time signals nor reliance on topological statistics alone. Details appear in the arXiv paper (2605.01310).
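As a concrete illustration of the structural view, the sketch below builds a feature vector from intrinsic graph statistics. The paper does not specify which statistics GraphSculptor uses; the ones here (size, edge count, density, degree moments) are illustrative assumptions, computed from a plain edge list.

```python
from statistics import mean, pstdev

def structural_features(num_nodes, edges):
    """Sketch: a structural feature vector from intrinsic graph statistics.

    The exact statistics used by GraphSculptor are not specified in the
    summary; these (size, density, degree moments) are assumed examples.
    """
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n, m = num_nodes, len(edges)
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    return [
        float(n),                        # graph size
        float(m),                        # edge count
        density,                         # edge density
        mean(degree) if degree else 0.0, # average degree
        pstdev(degree) if degree else 0.0,  # degree spread
        float(max(degree)) if degree else 0.0,  # hub strength
    ]

# Example: a 4-node path graph 0-1-2-3
feats = structural_features(4, [(0, 1), (1, 2), (2, 3)])
```

Vectors like this can then be compared across graphs to assess how much structural variety a candidate coreset covers.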
Key facts
- Graph self-supervised learning relies on large unlabeled datasets.
- Uniformly subsampling 50% of graphs retains over 96% of downstream performance.
- GraphSculptor constructs pre-training coresets without labels.
- It uses intrinsic structure and contextual semantics.
- Structural diversity is quantified via intrinsic graph statistics.
- Semantic diversity uses a pre-trained language model on graph-to-text descriptions.
- The method is label-free and avoids additional training-time signals.
- Paper available on arXiv with ID 2605.01310.
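Given per-graph feature vectors (structural, semantic, or a concatenation of both), a diversity-seeking coreset can be selected without labels. The greedy k-center selector below is one common label-free strategy; whether GraphSculptor uses exactly this selector is an assumption made for illustration.

```python
import math

def k_center_greedy(features, k):
    """Sketch: greedy k-center coreset selection over feature vectors.

    Picks k graphs whose features spread out to cover the feature space.
    Assumed for illustration; the summary does not name GraphSculptor's
    actual selection rule.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    selected = [0]  # seed with the first graph
    # Distance from every point to its nearest selected point so far
    min_d = [dist(f, features[0]) for f in features]
    while len(selected) < k:
        # Add the point farthest from the current selection
        far = max(range(len(features)), key=lambda i: min_d[i])
        selected.append(far)
        for i, f in enumerate(features):
            min_d[i] = min(min_d[i], dist(f, features[far]))
    return selected

# Example: two tight clusters; k=2 picks one representative from each
subset = k_center_greedy([[0.0], [1.0], [10.0], [11.0]], k=2)
```

Running SSL pre-training on the selected subset, rather than the full corpus, is what yields the computational savings described above.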