Tree Canopy Segmentation Under Data Scarcity: Pretrained CNNs Outperform Transformers
A recent study on arXiv benchmarks five deep learning models for segmenting tree canopies in aerial imagery under data scarcity: just 150 labeled images from the Solafune Tree Canopy Detection competition. The architectures evaluated are YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2. The convolution-based instance segmentation models, YOLOv11 and Mask R-CNN, generalize best, while DeepLabv3, Swin-UNet, and DINOv2 struggle. The authors attribute the gap to the mismatch between semantic and instance segmentation, the large training-data requirements of Vision Transformers, and specific architectural limitations. The results underscore how much model choice matters for environmental monitoring and urban planning when annotated data is scarce.
Key facts
- Study evaluates five architectures: YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, DINOv2
- Dataset from Solafune Tree Canopy Detection competition: 150 annotated images
- Convolution-based models (YOLOv11, Mask R-CNN) outperform transformer-based models
- DeepLabv3, Swin-UNet, DINOv2 underperform due to task mismatch and data requirements
- Research addresses data annotation scarcity in aerial imagery analysis
- Published on arXiv with ID 2601.10931v2
- Application areas: environmental monitoring, urban planning, ecosystem analysis
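Comparisons like the one above are typically scored with mask IoU (intersection over union) between predicted and ground-truth canopy masks. A minimal pure-Python sketch of that metric, for illustration only and not the study's actual evaluation code:

```python
def mask_iou(pred, target):
    """Intersection-over-Union for two binary masks given as
    2D nested lists of 0/1 values (same shape assumed)."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; define their IoU as 1.
    return inter / union if union else 1.0

# Example: two 2x3 canopy masks that overlap in two pixels
pred   = [[1, 1, 0],
          [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 2 intersecting / 4 union pixels = 0.5
```

In practice, instance segmentation models such as Mask R-CNN are scored with per-instance IoU matching (e.g., average precision at IoU thresholds), while semantic segmentation models like DeepLabv3 are scored with a single class-wise IoU, which is one reason the two task framings are not directly interchangeable.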
Entities
Institutions
- arXiv
- Solafune