CNN-Transformer Ensembles Improve Interpretable Diabetic Retinopathy Grading
The approach pairs discriminative models with multimodal explanations for diabetic retinopathy (DR) grading, turning raw retinal pixels into clinically interpretable outputs. The researchers evaluated six CNN- and transformer-based architectures on the APTOS 2019 benchmark under stratified five-fold cross-validation. They compared several ensembling strategies, including hard voting, weighted soft voting, and stacking, and examined a hybrid class-level fusion variant that exploits the grade-specific strengths of individual models. For interpretability, the pipeline produces Grad-CAM++ visual attribution maps alongside concise textual rationales generated by vision-language models (VLMs) conditioned on the fundus image and the classifier outputs under conservative prompting constraints.
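The summary gives no code, but the voting strategies are simple enough to picture concretely. The sketch below shows hard voting and weighted soft voting over per-model class probabilities; all model names, weights, and probability values are hypothetical, and stacking would instead fit a meta-learner (e.g. logistic regression) on the concatenated probabilities.

```python
# A minimal sketch of hard voting and weighted soft voting over the per-class
# probabilities of several backbones; all numbers and model names are
# illustrative, not taken from the paper.
import numpy as np

def hard_vote(prob_list):
    """Majority vote over each model's argmax prediction."""
    votes = [int(np.argmax(p)) for p in prob_list]
    return max(set(votes), key=votes.count)

def weighted_soft_vote(prob_list, weights):
    """Argmax of the weighted average of per-class probabilities."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights form a convex combination
    avg = sum(wi * p for wi, p in zip(w, prob_list))
    return int(np.argmax(avg))

# Hypothetical outputs of three backbones over the five DR grades (0-4).
probs = [
    np.array([0.05, 0.70, 0.15, 0.05, 0.05]),  # e.g. a CNN backbone
    np.array([0.10, 0.55, 0.25, 0.05, 0.05]),  # e.g. a ViT backbone
    np.array([0.05, 0.40, 0.45, 0.05, 0.05]),  # e.g. a Swin backbone
]
print(hard_vote(probs))                            # -> grade 1
print(weighted_soft_vote(probs, [0.4, 0.3, 0.3]))  # -> grade 1
```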
Key facts
- Methodology combines discriminative models with multimodal explanations for DR grading
- Evaluated six CNN- and transformer-based backbones on the APTOS 2019 benchmark
- Used stratified five-fold cross-validation (see the cross-validation sketch after this list)
- Compared hard voting, weighted soft voting, and stacking ensembling strategies
- Investigated a hybrid class-level fusion variant
- Produced Grad-CAM++ visual attribution maps (see the attribution sketch after this list)
- Generated short textual rationales using vision-language models (VLMs)
- VLMs conditioned on the fundus image and classifier outputs under conservative prompting (see the prompting sketch after this list)
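The cross-validation setup is standard; a minimal sketch using scikit-learn's StratifiedKFold follows. The .npy paths and the train_and_evaluate helper are hypothetical stand-ins for the paper's data pipeline and training loop.

```python
# A minimal sketch of stratified five-fold cross-validation on APTOS 2019.
# The .npy paths and train_and_evaluate() are hypothetical stand-ins.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def train_and_evaluate(x_tr, y_tr, x_va, y_va):
    """Hypothetical placeholder for one fold of training and validation."""
    raise NotImplementedError

images = np.load("aptos2019_images.npy")  # assumed preprocessed fundus images
grades = np.load("aptos2019_grades.npy")  # DR grades 0-4

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr_idx, va_idx) in enumerate(skf.split(images, grades)):
    # Stratification preserves the grade distribution in every fold, which
    # matters because APTOS 2019 is dominated by grade-0 (no DR) images.
    score = train_and_evaluate(images[tr_idx], grades[tr_idx],
                               images[va_idx], grades[va_idx])
    print(f"fold {fold}: {score:.4f}")
```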
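For the visual side, one common route to Grad-CAM++ maps is the third-party pytorch-grad-cam package; the summary does not say which implementation the authors used, so the library choice, the stand-in ResNet-50 backbone, and the target layer below are all assumptions.

```python
# A minimal Grad-CAM++ sketch using the pytorch-grad-cam package; the paper's
# summary does not name its tooling, so library, backbone, and target layer
# are assumptions. A random array stands in for a preprocessed fundus image.
import numpy as np
import torch
from torchvision.models import resnet50, ResNet50_Weights
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()  # stand-in grader
rgb = np.random.rand(224, 224, 3).astype(np.float32)       # fake image in [0, 1]
x = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)    # NCHW tensor

pred = int(model(x).argmax(dim=1))  # explain the predicted class
cam = GradCAMPlusPlus(model=model, target_layers=[model.layer4[-1]])
heatmap = cam(input_tensor=x, targets=[ClassifierOutputTarget(pred)])[0]
overlay = show_cam_on_image(rgb, heatmap, use_rgb=True)    # uint8 HWC overlay
```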
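Finally, the conservative prompting constraint can be pictured as a prompt template that embeds the classifier's prediction and explicitly forbids the VLM from overreaching. The grade names, wording, and thresholds below are invented for illustration, not the paper's actual prompt; the resulting string would be sent to whichever VLM client is used, alongside the fundus image.

```python
# A minimal sketch of the "conservative prompting" idea: the rationale request
# is conditioned on the classifier's output and explicitly constrains the VLM.
# Grade names, wording, and the example values are assumptions.

GRADE_NAMES = ["no DR", "mild", "moderate", "severe", "proliferative"]

def build_rationale_prompt(pred_grade: int, confidence: float) -> str:
    """Compose a constrained rationale prompt to send with the fundus image."""
    return (
        f"A classifier graded this fundus image as '{GRADE_NAMES[pred_grade]}' "
        f"with confidence {confidence:.2f}. In at most three sentences, "
        "mention only lesions actually visible in the image that support this "
        "grade (e.g. microaneurysms, hemorrhages, hard exudates). Do not "
        "suggest an alternative grade, do not give treatment advice, and "
        "answer 'uncertain' if image quality prevents a judgement."
    )

if __name__ == "__main__":
    print(build_rationale_prompt(pred_grade=2, confidence=0.87))
```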