SAM 3 and DINOv3 Distilled for Edge-Deployable Livestock Monitoring
A recent arXiv article (2604.27128) describes distilling the 446 million-parameter Perception Encoder backbone of SAM 3 into a 40.66 million-parameter student model for tracking individual livestock on edge devices. The student pairs a TinyViT-21M-512 backbone with a Feature Pyramid Network and is trained with a four-term direction-then-scale distillation loss. At inference, backbone-substitution combined with sliding-window session pruning keeps streaming GPU memory bounded. Separately, the DINOv3 series includes a pre-distilled ViT-S/16 model with 21.6 million parameters, released alongside a 6716 million-parameter ViT-7B teacher, making it a practical choice for precision livestock farming on budget-friendly hardware.
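The "direction-then-scale" idea can be illustrated with a minimal sketch: per feature level, one term aligns the *direction* of student and teacher token features (cosine distance) and one term aligns their *scale* (norm mismatch), so two feature levels yield four terms. This is an assumption about how the four terms decompose; the function name, weights, and exact formulation are illustrative, not the paper's.

```python
import numpy as np

def direction_scale_distill_loss(student_feats, teacher_feats,
                                 w_dir=1.0, w_scale=1.0):
    """Hypothetical sketch of a direction-then-scale distillation loss.

    student_feats / teacher_feats: lists of (B, C, H, W) feature maps,
    one per pyramid level. Two levels x two terms = four loss terms.
    """
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):
        # Flatten spatial dims: (B, C, H, W) -> (B*H*W, C) token features.
        s2 = s.transpose(0, 2, 3, 1).reshape(-1, s.shape[1])
        t2 = t.transpose(0, 2, 3, 1).reshape(-1, t.shape[1])
        s_norm = np.linalg.norm(s2, axis=1) + 1e-8
        t_norm = np.linalg.norm(t2, axis=1) + 1e-8
        # Direction term: mean cosine distance between token features.
        cos = np.sum(s2 * t2, axis=1) / (s_norm * t_norm)
        dir_term = float(np.mean(1.0 - cos))
        # Scale term: mean absolute mismatch of per-token feature norms.
        scale_term = float(np.mean(np.abs(s_norm - t_norm)))
        total += w_dir * dir_term + w_scale * scale_term
    return total
```

Decoupling direction from scale lets the student match the teacher's feature geometry even when the two backbones produce activations of very different magnitudes.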
Key facts
- arXiv paper 2604.27128
- SAM 3 Perception Encoder distilled from 446M to 40.66M parameters
- Student encoder uses TinyViT-21M-512 with Feature Pyramid Network
- Four-term direction-then-scale distillation loss used
- Sliding-window session pruning bounds streaming GPU memory
- DINOv3 ViT-S/16 variant has 21.6M parameters
- DINOv3 ViT-7B teacher has 6716M parameters
- DINOv3 ViT-S (21M) adopted as the per-individual embedder
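The sliding-window session pruning above can be sketched as a per-individual memory that retains only the most recent frames, so memory use stays constant regardless of video length. The class and method names are hypothetical; only the bounded-window idea comes from the source.

```python
from collections import deque

class SlidingWindowSession:
    """Hypothetical sketch of sliding-window session pruning.

    Each tracked individual keeps at most `window` frames of memory
    features; older entries are pruned automatically, bounding
    streaming memory for arbitrarily long videos.
    """

    def __init__(self, window=8):
        self.window = window
        self.memory = {}  # track_id -> deque of per-frame features

    def update(self, track_id, frame_feature):
        # deque(maxlen=...) silently evicts the oldest entry when full.
        if track_id not in self.memory:
            self.memory[track_id] = deque(maxlen=self.window)
        self.memory[track_id].append(frame_feature)

    def num_stored(self, track_id):
        return len(self.memory.get(track_id, ()))
```

For example, after streaming 20 frames for one animal with `window=8`, only the last 8 frame features remain in memory.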
Entities
Institutions
- arXiv