ConTrans: New Architecture for Zero-Shot Action Localization
A research paper introduces ConTrans, a multi-scale encoder architecture for Zero-Shot Temporal Action Localization (ZS-TAL), the task of detecting and localizing actions unseen during training in untrimmed videos. The method combines convolutional inductive biases with transformer self-attention to capture both fine-grained local dependencies and long-range global context. It targets two limitations of existing approaches: they neglect relative-offset-based local correlations, and they rely on shallow network architectures. According to the paper, experiments on the ActivityNet-1.3 and THUMOS datasets show improved feature representations.
Key facts
- ConTrans integrates convolutional inductive biases with transformer self-attention.
- It captures fine-grained local dependencies and long-range global context.
- It addresses limitations of existing ZS-TAL methods, which neglect relative-offset-based local correlations and rely on shallow network architectures.
- It was evaluated on the ActivityNet-1.3 and THUMOS datasets.
- It aims to detect and localize unseen actions in untrimmed videos.
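To make the idea of pairing a convolutional (local) branch with a self-attention (global) branch more concrete, here is a minimal sketch in plain Python. It is not the ConTrans implementation. All function names are hypothetical. The fixed kernel weights, the attention without learned projections, the additive fusion with a residual connection, and the average-pool downsampling between scales are illustrative assumptions, not details taken from the paper.

```python
import math

def depthwise_conv1d(x, kernel):
    """Local branch: mix each time step with its temporal neighbours.

    x is a list of T feature vectors. kernel holds one weight per relative
    offset (odd length), shared across channels and positions (assumption).
    """
    T, half = len(x), len(kernel) // 2
    out = []
    for t in range(T):
        acc = [0.0] * len(x[t])
        for j, w in enumerate(kernel):
            s = t + j - half              # neighbour at relative offset j - half
            if 0 <= s < T:                # zero padding at the sequence borders
                for d, v in enumerate(x[s]):
                    acc[d] += w * v
        out.append(acc)
    return out

def self_attention(x):
    """Global branch: single-head scaled dot-product attention.

    Queries, keys and values are the inputs themselves; a real model would
    use learned projections (omitted here to keep the sketch short).
    """
    out = []
    for q in x:
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in x]
        m = max(scores)                   # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, x))
                    for d in range(len(q))])
    return out

def hybrid_block(x, kernel):
    """Fuse local and global context by summation, plus a residual (assumption)."""
    local = depthwise_conv1d(x, kernel)
    glob = self_attention(x)
    return [[xi + li + gi for xi, li, gi in zip(xr, lr, gr)]
            for xr, lr, gr in zip(x, local, glob)]

def downsample(x):
    """Halve the temporal resolution by averaging adjacent pairs of time steps."""
    return [[(a + b) / 2 for a, b in zip(x[t], x[t + 1])]
            for t in range(0, len(x) - 1, 2)]

def multi_scale_encode(x, kernel, num_levels):
    """Return one feature sequence per temporal scale, finest first."""
    pyramid = []
    for _ in range(num_levels):
        x = hybrid_block(x, kernel)
        pyramid.append(x)
        x = downsample(x)
    return pyramid

# Toy input: 8 time steps, 2 channels per step.
features = [[float(t), 1.0] for t in range(8)]
pyramid = multi_scale_encode(features, kernel=[0.25, 0.5, 0.25], num_levels=3)
print([len(level) for level in pyramid])  # → [8, 4, 2]
```

The convolution weights depend only on the relative offset between time steps, which is the kind of local correlation the summary says existing methods neglect. The attention branch lets every time step see the whole sequence. Stacking the block across progressively shorter sequences gives a multi-scale feature pyramid.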