LLM Synthetic Data for Patent Classification: Volume vs. Fidelity Trade-off
A study posted to arXiv (2605.24296) examines whether LLM-generated synthetic data improves low-resource multi-label patent classification. The experiments cover six open-source LLMs (3.8-12B parameters), four real-data regimes, 64 WIPO assistive-technology labels, two generation strategies, and three classifier families. The headline result is a rise in micro-F1 for BERT-for-Patents from 0.120 to 0.702, but most of that increase is attributable to volume: a duplicate-to-match control, which resamples 165 patents to match the size of the augmented training set, reaches 0.678. That leaves a controlled synthetic gain of only +0.024 (0.702 - 0.678), while the gain over focal-loss reweighting is +0.219. Fidelity metrics also change meaning with scale: at extreme scarcity, MMD correlates positively with classification gain (r=+0.95), but the correlation reverses at a 1:10 ratio (r=-0.73; Fisher z=+6.47, p<0.001). Fixed-budget mixing suggests an optimal synthetic proportion of 20-30%.
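The volume-versus-content split in these numbers is plain subtraction, and a short sketch makes it explicit. The function name `decompose_gain` and its field names are illustrative choices, not terms from the study; the only inputs are the three micro-F1 values quoted above.

```python
def decompose_gain(start, augmented, duplicate_control):
    """Split a raw micro-F1 improvement into a volume part and a content part.

    start             -- micro-F1 before augmentation
    augmented         -- micro-F1 with synthetic data added
    duplicate_control -- micro-F1 when existing patents are duplicated
                         to the same training-set size
    """
    raw = augmented - start                      # total improvement
    volume = duplicate_control - start           # reproduced by duplication alone
    controlled = augmented - duplicate_control   # left over for the synthetic text
    return {
        "raw": raw,
        "volume": volume,
        "controlled": controlled,
        "volume_share": volume / raw,            # fraction of raw gain explained by size
    }


# Values quoted in the summary for BERT-for-Patents.
parts = decompose_gain(start=0.120, augmented=0.702, duplicate_control=0.678)
for name, value in parts.items():
    print(f"{name:>12}: {value:+.3f}")
```

Read this way, the duplicate-to-match control accounts for nearly all of the raw improvement, which is why the summary describes the headline gain as mostly a volume effect.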
Key facts
- Study on arXiv: 2605.24296
- Uses six open-source LLMs (3.8-12B)
- Four real-data regimes tested
- 64 WIPO assistive-technology labels
- Two generation strategies
- Three classifier families
- BERT-for-Patents micro-F1 from 0.120 to 0.702
- Duplicate-to-match control reaches 0.678
- Controlled synthetic gain: +0.024
- Gain over focal-loss reweighting: +0.219
- MMD correlation flips with scale
- At extreme scarcity: r=+0.95
- At 1:10 ratio: r=-0.73 (p<0.001)
- Optimal synthetic proportion: 20-30%
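The reported Fisher z compares the two MMD-versus-gain correlations. A minimal sketch of the standard Fisher r-to-z comparison for two independent correlations follows. Two things here are assumptions, not facts from the study: that the test treats the two regimes as independent samples, and the sample sizes, which this summary does not state. The value `n = 14` per regime is a placeholder chosen for illustration only.

```python
import math


def fisher_z_compare(r1, n1, r2, n2):
    """Two-sided test that two independent Pearson correlations differ.

    Returns (z, p). Each correlation is mapped through atanh (the Fisher
    transformation); the difference is scaled by its standard error.
    """
    z1 = math.atanh(r1)
    z2 = math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Correlations from the key facts; n = 14 is a hypothetical sample size.
z, p = fisher_z_compare(r1=0.95, n1=14, r2=-0.73, n2=14)
print(f"z = {z:+.2f}, p = {p:.2e}")
```

With this placeholder the statistic lands near the reported z, but that does not establish the study's actual sample sizes or which variant of the test it used.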
Entities
Institutions
- arXiv
- WIPO