LLM Synthetic Data for Patent Classification: Volume vs. Fidelity Trade-off

other · 2026-05-26

A study posted on arXiv (2605.24296) examines whether LLM-generated synthetic data improves low-resource multi-label patent classification. It covers six open-source LLMs (3.8-12B parameters), four real-data regimes, 64 WIPO assistive-technology labels, two generation strategies, and three classifier families. The headline micro-F1 for BERT-for-Patents rises from 0.120 to 0.702, but most of that increase comes from volume: a control that duplicates the 165 real patents to match the augmented set's size reaches 0.678, leaving a controlled synthetic gain of only +0.024. The gain over focal-loss reweighting is +0.219. Fidelity metrics also change meaning with scale: at extreme scarcity, MMD correlates positively with classification gain (r=+0.95), but the correlation reverses at a 1:10 ratio (r=-0.73; Fisher z=+6.47, p<0.001). Fixed-budget mixing suggests an optimal synthetic proportion of 20-30%.
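To make the volume-versus-synthetic comparison concrete, here is a minimal sketch of a duplicate-to-match control and the arithmetic behind the reported gains. The function name `duplicate_to_match`, the placeholder patent IDs, and the target size of 1815 are assumptions for illustration only; the summary does not describe the study's exact resampling procedure or the size of the augmented set.

```python
import random


def duplicate_to_match(real_docs, target_size, seed=0):
    """Oversample real documents (with replacement) up to target_size.

    Hypothetical helper: it keeps every real document once and fills the
    remainder with random duplicates, so no new information is added.
    """
    if not real_docs:
        raise ValueError("real_docs must be non-empty")
    if target_size < len(real_docs):
        raise ValueError("target_size must be at least len(real_docs)")
    rng = random.Random(seed)
    extra = rng.choices(real_docs, k=target_size - len(real_docs))
    return list(real_docs) + extra


# Micro-F1 values as reported in the summary.
f1_real_only = 0.120
f1_with_synthetic = 0.702
f1_duplicate_control = 0.678

# Raw gain mixes two effects: more training rows and new synthetic content.
raw_gain = round(f1_with_synthetic - f1_real_only, 3)
# Controlled gain isolates the synthetic content by holding set size fixed.
controlled_gain = round(f1_with_synthetic - f1_duplicate_control, 3)

# Placeholder IDs standing in for the 165 real patents.
real = [f"patent_{i}" for i in range(165)]
# 1815 is an assumed target size, not a figure from the study.
control = duplicate_to_match(real, target_size=1815)

print(f"raw gain: {raw_gain}, controlled gain: {controlled_gain}")
print(f"control set size: {len(control)}")
```

The point of the control is that it matches the augmented set in size while containing only real patents, so any remaining difference in micro-F1 can be attributed to the synthetic text itself.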

Key facts

Study on arXiv: 2605.24296
Uses six open-source LLMs (3.8-12B)
Four real-data regimes tested
64 WIPO assistive-technology labels
Two generation strategies
Three classifier families
BERT-for-Patents micro-F1 from 0.120 to 0.702
Duplicate-to-match control reaches 0.678
Controlled synthetic gain: +0.024
Gain over focal-loss reweighting: +0.219
MMD correlation flips with scale
At extreme scarcity: r=+0.95
At 1:10 ratio: r=-0.73 (p<0.001)
Optimal synthetic proportion: 20-30%
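
The reversal in the MMD correlation is backed by a Fisher z statistic. A minimal sketch of the standard Fisher z-test for the difference between two independent correlations follows. The function name `fisher_z_difference` and the sample sizes (`n1 = n2 = 14`) are assumptions for illustration; this summary does not report the sample sizes, and the study may have used a different variant of the test.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """Two-sided Fisher z-test for two independent Pearson correlations.

    Returns (z, p). Each sample size must exceed 3, and each correlation
    must lie strictly between -1 and 1.
    """
    if n1 <= 3 or n2 <= 3:
        raise ValueError("sample sizes must be greater than 3")
    # Fisher transformation: atanh(r) is approximately normal.
    z1 = math.atanh(r1)
    z2 = math.atanh(r2)
    # Standard error of the difference between the transformed values.
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Correlations from the summary; sample sizes are hypothetical.
z, p = fisher_z_difference(r1=0.95, n1=14, r2=-0.73, n2=14)
print(f"z = {z:.2f}, p = {p:.2e}")
```

Because the statistic depends on the sample sizes, the printed value should only be compared with the reported z=+6.47 once the study's actual sample sizes are known.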
