Automated Dosing Error Detection in Clinical Trial Narratives Using LightGBM
A new automated system detects dosing errors in clinical trial narratives using gradient boosting and multi-modal feature engineering. The approach combines 3,451 features from NLP, semantic embeddings, medical patterns, and transformer models to train a LightGBM model on 42,112 narratives. On the CT-DEB benchmark, it achieves 0.8725 test ROC-AUC with 5-fold ensemble averaging.
Key facts
- System uses LightGBM with 3,451 features
- Features include TF-IDF, character n-grams, all-MiniLM-L6v2 embeddings, BiomedBERT, DeBERTa-v3
- Trained on 42,112 clinical trial narratives
- Achieves 0.8725 test ROC-AUC
- Cross-validation AUC: 0.8833 ± 0.0091
- Dataset has severe class imbalance (4.9% positive rate)
- Features extracted from nine complementary text fields
- Median 5,400 characters per sample
Entities
—