VISTA: V-JEPA Integrated Anticipator for Ego4D STA Challenge
A new technical report introduces VISTA, a model built for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. The task is to anticipate the next human-object interaction from egocentric video. Each submitted prediction consists of an object bounding box, noun and verb categories, a time-to-contact estimate, and a confidence score. VISTA follows a StillFast-style design that combines spatial detection with short-term temporal context. A COCO-pretrained Faster R-CNN ResNet-50 FPN detector supplies object proposals, and a frozen V-JEPA 2.1 branch extracts temporal context. The temporal representation is fused via feature modulation and ROI-level context fusion, and the fused features are passed to multi-head STA predictors.
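To make the prediction format concrete, here is a minimal sketch of one prediction record in Python. The class name, field names, and the choice of seconds for time-to-contact are illustrative assumptions, not the official challenge submission schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class STAPrediction:
    """One anticipated interaction (hypothetical field names, not the official schema)."""

    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed pixel coordinates
    noun: str                               # object category
    verb: str                               # interaction category
    time_to_contact: float                  # assumed to be in seconds
    score: float                            # confidence, assumed to lie in [0, 1]

    def __post_init__(self) -> None:
        # Reject records that cannot describe a valid prediction.
        x1, y1, x2, y2 = self.box
        if x2 <= x1 or y2 <= y1:
            raise ValueError("box must satisfy x1 < x2 and y1 < y2")
        if self.time_to_contact < 0:
            raise ValueError("time_to_contact must be non-negative")
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must lie in [0, 1]")


def top_k(preds: List[STAPrediction], k: int) -> List[STAPrediction]:
    """Return the k highest-confidence predictions, best first."""
    return sorted(preds, key=lambda p: p.score, reverse=True)[:k]
```

Ranking by confidence, as top_k does, is a common way to prepare detections for evaluation; whether the challenge requires a fixed number of predictions per frame is not stated here.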
Key facts
- VISTA is proposed for the Ego4D STA Challenge at EgoVis 2026.
- The task anticipates the next human-object interaction from egocentric video.
- Output includes bounding box, noun, verb, time-to-contact, and confidence.
- VISTA uses a StillFast-style design combining spatial detection and temporal context.
- Object proposals come from a COCO-pretrained Faster R-CNN ResNet-50 FPN detector.
- Temporal context is extracted by a frozen V-JEPA 2.1 branch.
- Temporal representation is fused via feature modulation and ROI-level context fusion.
- Fused features are passed to multi-head STA predictors.
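The two fusion steps listed above can be sketched in plain Python. The summary does not give the operators VISTA actually uses, so this sketch assumes a FiLM-style scale-and-shift for feature modulation and simple concatenation for ROI-level context fusion. All function names, weights, and shapes are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def linear(weights: Matrix, bias: Sequence[float], x: Sequence[float]) -> Vector:
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def modulate(feature: Sequence[float], context: Sequence[float],
             w_gamma: Matrix, b_gamma: Sequence[float],
             w_beta: Matrix, b_beta: Sequence[float]) -> Vector:
    """Assumed FiLM-style modulation: feature * (1 + gamma) + beta.

    gamma and beta are predicted from the temporal context vector, so a
    zero context with zero biases leaves the spatial feature unchanged.
    """
    gamma = linear(w_gamma, b_gamma, context)
    beta = linear(w_beta, b_beta, context)
    return [f * (1.0 + g) + b for f, g, b in zip(feature, gamma, beta)]


def fuse_roi(roi_feature: Sequence[float], context: Sequence[float]) -> Vector:
    """Assumed ROI-level fusion: append the context vector to each ROI feature."""
    return list(roi_feature) + list(context)


def fuse_all(roi_features: Sequence[Sequence[float]], context: Sequence[float],
             w_gamma: Matrix, b_gamma: Sequence[float],
             w_beta: Matrix, b_beta: Sequence[float]) -> List[Vector]:
    """Modulate every ROI feature with the shared context, then concatenate."""
    return [
        fuse_roi(modulate(f, context, w_gamma, b_gamma, w_beta, b_beta), context)
        for f in roi_features
    ]
```

In this sketch each fused vector would then feed separate prediction heads for noun, verb, time-to-contact, and confidence. The identity-at-zero form (1 + gamma) is a design choice of the sketch that keeps the pretrained detector features intact when the context carries no signal.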
Entities
Institutions
- Ego4D
- EgoVis