ARTFEED — Contemporary Art Intelligence

Event Sparsity-Aware Transformer for Visual Object Tracking

ai-technology · 2026-05-09

Researchers propose a sparsity-aware Mixture-of-Experts Transformer for event-based visual object tracking. Event cameras, which capture asynchronous per-pixel brightness changes, offer advantages over RGB sensors in low light and under fast motion. Existing trackers often ignore event data's spatial sparsity and temporal density, relying instead on a fixed temporal-window sampling strategy. The new framework models event-density variations across multiple temporal scales, injecting sparse, medium-density, and dense event regions into a three-stage Vision Transformer backbone for hierarchical multi-density feature learning; a sparsity-aware routing mechanism adaptively selects the most relevant expert for each region. Experiments on the FE108, VisEvent, and COESOT datasets show state-of-the-art performance, particularly under challenging conditions. By exploiting the sparsity structure of event data directly, the work addresses a key limitation of existing event-based trackers.
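The routing idea described above can be sketched in miniature: measure event density per spatial region, then dispatch each region's features to one of three experts. This is a hedged toy illustration, not the paper's implementation; the thresholds, expert shapes, and the `route` helper are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-region event density: fraction of active pixels in each spatial
# region (token). Values in [0, 1]; sizes and thresholds are illustrative.
num_tokens, dim = 16, 8
density = rng.uniform(0.0, 1.0, size=num_tokens)
tokens = rng.normal(size=(num_tokens, dim))  # per-region feature vectors

# Three toy "experts": small linear maps specialized per density regime.
experts = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3)]

def route(density, low=0.33, high=0.66):
    """Map each region's density to an expert index: 0=sparse, 1=medium, 2=dense."""
    return np.digitize(density, [low, high])

idx = route(density)
out = np.stack([tokens[i] @ experts[idx[i]] for i in range(num_tokens)])
print(out.shape)  # (16, 8)
```

A learned gate (e.g. a softmax over expert logits) would replace the hard thresholds in practice; the hard-threshold version just makes the sparse/medium/dense split explicit.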

Key facts

  • Proposes sparsity-aware Mixture-of-Experts Transformer for event-based tracking
  • Models event-density variations across multiple temporal scales
  • Injects sparse, medium-density, and dense event regions into three-stage Vision Transformer
  • Introduces sparsity-aware routing mechanism for expert selection
  • Achieves state-of-the-art on FE108, VisEvent, and COESOT datasets
  • Addresses limitations of fixed temporal-window sampling in existing trackers
  • Event cameras provide high dynamic range and temporal resolution
  • RGB-based trackers vulnerable to low illumination and fast motion
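The multi-scale temporal modeling listed above can be illustrated with a small sketch: instead of one fixed temporal window, accumulate event counts over several window lengths, giving density maps at different temporal scales. The event layout, window lengths, and `density_map` helper are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic event stream over a 100 ms span: columns = (timestamp_us, x, y).
n_events = 10_000
events = np.column_stack([
    np.sort(rng.uniform(0, 1e5, n_events)),  # microsecond timestamps
    rng.integers(0, 64, n_events),           # x coordinate
    rng.integers(0, 64, n_events),           # y coordinate
])

def density_map(events, t0, t1, hw=(64, 64)):
    """Per-pixel event counts within [t0, t1) — one temporal slice."""
    sel = events[(events[:, 0] >= t0) & (events[:, 0] < t1)]
    hist, _, _ = np.histogram2d(sel[:, 2], sel[:, 1],
                                bins=hw, range=[[0, hw[0]], [0, hw[1]]])
    return hist

# Three temporal scales (short/medium/long windows), all ending at t = 100 ms.
scales_us = [5_000, 20_000, 100_000]
maps = [density_map(events, 1e5 - w, 1e5) for w in scales_us]
print([int(m.sum()) for m in maps])  # longer windows accumulate more events
```

Thresholding such maps is one simple way to separate sparse, medium-density, and dense regions before feeding them to the backbone.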
