ARTFEED — Contemporary Art Intelligence

CNN-Transformer Ensembles Improve Interpretable Diabetic Retinopathy Grading

ai-technology · 2026-04-29

A novel approach integrates discriminative models with multimodal explanations for grading diabetic retinopathy (DR), turning retinal pixels into clinically meaningful outputs. Researchers assessed six CNN- and transformer-based architectures on the APTOS 2019 benchmark using stratified five-fold cross-validation. They compared several ensembling techniques, including hard voting, weighted soft voting, and stacking, and also examined a hybrid class-level fusion method to exploit grade-specific strengths of individual models. To enhance interpretability, Grad-CAM++ visual attribution maps and concise textual justifications were generated through vision-language models (VLMs), conditioned on the fundus image and classifier outputs under conservative prompting constraints.
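The two simplest ensembling strategies mentioned above can be illustrated in a few lines. This is a minimal sketch, not the study's implementation: the model probabilities, the tie-breaking rule, and the per-model weights (e.g. derived from validation accuracy) are all illustrative assumptions.

```python
from collections import Counter

def hard_vote(preds):
    """Majority vote over per-model predicted DR grades.

    Tie-breaking rule (lowest grade wins) is an assumption for this sketch.
    """
    counts = Counter(preds)
    top = max(counts.values())
    return min(g for g, c in counts.items() if c == top)

def weighted_soft_vote(probs, weights):
    """Weighted average of per-model class-probability vectors; argmax wins.

    probs: one probability vector (over the 5 DR grades) per model.
    weights: hypothetical per-model weights, e.g. from validation accuracy.
    """
    n_classes = len(probs[0])
    fused = [sum(w * p[c] for p, w in zip(probs, weights))
             for c in range(n_classes)]
    return max(range(n_classes), key=fused.__getitem__)

# Three hypothetical backbones scoring one fundus image over grades 0-4
model_probs = [
    [0.10, 0.60, 0.20, 0.05, 0.05],
    [0.05, 0.30, 0.50, 0.10, 0.05],
    [0.10, 0.55, 0.25, 0.05, 0.05],
]
hard = hard_vote([max(range(5), key=p.__getitem__) for p in model_probs])
soft = weighted_soft_vote(model_probs, weights=[0.4, 0.3, 0.3])
# Both agree on grade 1 here; soft voting can differ when one confident
# model outweighs a majority of uncertain ones.
```

Stacking, the third strategy, instead feeds these per-model probabilities into a learned meta-classifier rather than a fixed fusion rule.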

Key facts

  • Methodology combines discriminative models with multimodal explanations for DR grading
  • Evaluated six CNN- and transformer-based backbones on APTOS 2019 benchmark
  • Used stratified five-fold cross-validation
  • Compared hard voting, weighted soft voting, and stacking ensembling strategies
  • Investigated hybrid class-level fusion variant
  • Produced Grad-CAM++ visual attribution maps
  • Generated short textual rationales using vision-language models
  • VLMs conditioned on the fundus image and classifier outputs under conservative prompting constraints

Entities

Institutions

  • arXiv
