Tree Canopy Segmentation Under Data Scarcity: Pretrained CNNs Outperform Transformers
A recent study on arXiv benchmarks five deep learning models for segmenting tree canopies in aerial imagery under data scarcity: just 150 labeled images from the Solafune Tree Canopy Detection competition. The architectures evaluated are YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2. The convolution-based instance segmentation models, YOLOv11 and Mask R-CNN, generalize best, while DeepLabv3, Swin-UNet, and DINOv2 struggle. The authors attribute the gap to the mismatch between semantic and instance segmentation, the large training-data requirements of Vision Transformers, and specific architectural limitations. The results underscore how much model choice matters for environmental monitoring and urban planning when annotated data is scarce.
Key facts
- Study evaluates five architectures: YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, DINOv2
- Dataset from Solafune Tree Canopy Detection competition: 150 annotated images
- Convolution-based models (YOLOv11, Mask R-CNN) outperform transformer-based models
- DeepLabv3, Swin-UNet, DINOv2 underperform due to task mismatch and data requirements
- Research addresses data annotation scarcity in aerial imagery analysis
- Published on arXiv with ID 2601.10931v2
- Application areas: environmental monitoring, urban planning, ecosystem analysis
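Comparisons like the one above are typically scored with mask IoU (intersection over union) between predicted and ground-truth canopy masks. A minimal pure-Python sketch of that metric, for illustration only and not the study's actual evaluation code:

```python
def mask_iou(pred, target):
    """Intersection-over-Union for two binary masks given as
    2D nested lists of 0/1 values (same shape assumed)."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; define their IoU as 1.
    return inter / union if union else 1.0

# Example: two 2x3 canopy masks that overlap in two pixels
pred   = [[1, 1, 0],
          [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 2 intersecting / 4 union pixels = 0.5
```

In practice, instance segmentation models such as Mask R-CNN are scored with per-instance IoU matching (e.g., average precision at IoU thresholds), while semantic segmentation models like DeepLabv3 are scored with a single class-wise IoU, which is one reason the two task framings are not directly interchangeable.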
Entities
Institutions
- arXiv
- Solafune