EmbGen: Synthetic Data Pipeline for Domain-Specific LLM Training
EmbGen is a synthetic data generation pipeline for adapting small instruction-tuned models to specific domains. It decomposes a domain corpus into entity-description pairs, then reassembles them using semantic structure derived from embedding similarity. Question-answer pairs are generated through proximity sampling, intra-cluster sampling, and inter-cluster sampling, with system prompts specialized for each cluster. EmbGen was evaluated against EntiGraph, InstructLab, and Knowledge-Instruct on three datasets with varying degrees of semantic heterogeneity, under fixed token budgets of 5 million and 20 million tokens. The approach aims to reduce the cost of collecting curated instruction-response examples for supervised fine-tuning (SFT).
Key facts
- EmbGen decomposes a corpus into entity-description pairs
- Reassembles pairs using semantic structure from embedding similarity
- Generates QA pairs via proximity, intra-cluster, and inter-cluster sampling
- Uses cluster-specialized system prompts
- Evaluated against EntiGraph, InstructLab, and Knowledge-Instruct
- Tested on three datasets with varied semantic heterogeneity
- Fixed token budgets of 5 and 20 million tokens
- Aims to reduce cost of SFT data collection
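The three sampling strategies can be illustrated with a minimal sketch. This is not EmbGen's implementation: the toy entities, the 2-D embeddings, the precomputed cluster labels, the prompt strings, and all function names below are hypothetical, and the summary does not say which embedding model or clustering method EmbGen uses. The sketch only shows how entity pairs might be selected before being passed to a generator model.

```python
import math
import random
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def proximity_pairs(embeddings, k=1):
    """Pair each entity with its k most similar other entities."""
    pairs = set()
    for i, u in enumerate(embeddings):
        ranked = sorted(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: cosine(u, embeddings[j]),
            reverse=True,
        )
        for j in ranked[:k]:
            pairs.add((min(i, j), max(i, j)))  # store each pair once
    return sorted(pairs)

def intra_cluster_pairs(labels):
    """All pairs of entities that share a cluster label."""
    return [(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]]

def inter_cluster_pairs(labels, n, rng):
    """Up to n randomly chosen pairs whose entities lie in different clusters."""
    candidates = [(i, j) for i, j in combinations(range(len(labels)), 2)
                  if labels[i] != labels[j]]
    return rng.sample(candidates, min(n, len(candidates)))

# Hypothetical toy data: four entities, 2-D embeddings, two clusters.
entities = ["aspirin", "ibuprofen", "femur", "tibia"]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = [0, 0, 1, 1]  # assumed to come from some clustering step

# Hypothetical cluster-specialized system prompts, keyed by cluster label.
system_prompts = {
    0: "You write question-answer pairs about medications.",
    1: "You write question-answer pairs about anatomy.",
}

near = proximity_pairs(embeddings, k=1)
intra = intra_cluster_pairs(labels)
inter = inter_cluster_pairs(labels, n=2, rng=random.Random(0))

for i, j in intra:
    # Both entities share a cluster, so one specialized prompt applies.
    print(system_prompts[labels[i]], "|", entities[i], "+", entities[j])
```

In this sketch, proximity and intra-cluster sampling select closely related entities, while inter-cluster sampling pairs entities from different clusters. How an inter-cluster pair would be matched to a system prompt is not stated in the summary, so the sketch leaves that step out.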