EmbGen: Synthetic Data Pipeline for Domain-Specific LLM Training

ai-technology · 2026-05-20

EmbGen is a synthetic data pipeline for adapting small instruction-tuned models to specific domains. It decomposes a domain corpus into entity-description pairs, then reassembles them using semantic structure derived from embedding similarity. Question-answer pairs are generated through proximity sampling, intra-cluster sampling, and inter-cluster sampling, with system prompts specialized per cluster. EmbGen was evaluated against EntiGraph, InstructLab, and Knowledge-Instruct on three datasets with varying degrees of semantic heterogeneity, under fixed token budgets of 5 million and 20 million tokens. The approach aims to lower the cost of collecting curated instruction-response examples for supervised fine-tuning (SFT).
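As a rough illustration of what proximity sampling over entity embeddings could look like, the sketch below ranks entities by cosine similarity to a query entity and returns the closest ones as candidates for a shared question-answer prompt. It is a minimal sketch, not EmbGen's actual implementation: the function names, the toy 2-D vectors, and the choice of cosine similarity with a top-k cutoff are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(embeddings, name, k=2):
    """Return the k entities most similar to `name`, most similar first."""
    query = embeddings[name]
    scored = [
        (cosine(query, vec), other)
        for other, vec in embeddings.items()
        if other != name
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:k]]


# Hypothetical toy embeddings; a real pipeline would obtain these
# from an embedding model applied to each entity description.
TOY_EMBEDDINGS = {
    "aspirin": [0.9, 0.1],
    "ibuprofen": [0.8, 0.2],
    "insulin": [0.5, 0.5],
    "mortgage": [0.1, 0.9],
}

neighbors = nearest_neighbors(TOY_EMBEDDINGS, "aspirin", k=2)
print(neighbors)  # ['ibuprofen', 'insulin']
```

Each entity together with its neighbors would then be handed to a generator model as context for one or more question-answer pairs.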

Key facts

EmbGen decomposes a corpus into entity-description pairs
Reassembles pairs using semantic structure from embedding similarity
Generates QA pairs via proximity, intra-cluster, and inter-cluster sampling
Uses cluster-specialized system prompts
Evaluated against EntiGraph, InstructLab, and Knowledge-Instruct
Tested on three datasets with varied semantic heterogeneity
Fixed token budgets of 5 and 20 million tokens
Aims to reduce the cost of collecting supervised fine-tuning (SFT) data
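The sampling and budgeting steps listed above can be sketched as follows, assuming cluster assignments are already available (for example from a clustering pass over the embeddings). Everything here is hypothetical scaffolding rather than EmbGen's code: the helper names, the prompt wording, the toy clusters, and the whitespace-based token count, which stands in for a real tokenizer.

```python
import itertools
import random


def intra_cluster_pairs(clusters):
    """All unordered entity pairs whose members share a cluster."""
    pairs = []
    for members in clusters.values():
        pairs.extend(itertools.combinations(sorted(members), 2))
    return pairs


def inter_cluster_pairs(clusters, rng, per_cluster_pair=1):
    """Sample entity pairs whose members come from two different clusters."""
    pairs = []
    for (_, a), (_, b) in itertools.combinations(sorted(clusters.items()), 2):
        for _ in range(per_cluster_pair):
            pairs.append((rng.choice(sorted(a)), rng.choice(sorted(b))))
    return pairs


def system_prompt(cluster_label):
    """Hypothetical cluster-specialized system prompt."""
    return f"You write question-answer pairs about {cluster_label}."


def fill_budget(examples, budget_tokens):
    """Keep examples in order until the next one would exceed the budget.

    Tokens are approximated by whitespace splitting (an assumption).
    """
    kept, used = [], 0
    for text in examples:
        cost = len(text.split())
        if used + cost > budget_tokens:
            break
        kept.append(text)
        used += cost
    return kept, used


# Hypothetical clusters of entity names.
TOY_CLUSTERS = {
    "pharmacology": ["aspirin", "ibuprofen", "insulin"],
    "finance": ["mortgage", "bond"],
}

intra_pairs = intra_cluster_pairs(TOY_CLUSTERS)
inter_pairs = inter_cluster_pairs(TOY_CLUSTERS, random.Random(0))
kept, used = fill_budget(["a b c", "d e", "f g h i"], budget_tokens=5)
```

Intra-cluster pairs keep questions within one topic, while inter-cluster pairs force the generator to connect entities from different topics. Capping output with a helper like `fill_budget` is one simple way to hold every method to the same token budget when comparing them.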

Sources

arXiv cs.AI — 2026-05-20