VISTA: V-JEPA Integrated Anticipator for Ego4D STA Challenge

ai-technology · 2026-05-22

A new technical report presents VISTA, a model built for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. The challenge asks models to anticipate the next human-object interaction from egocentric video. Each submitted prediction consists of an object bounding box, noun and verb categories, a time-to-contact estimate, and a confidence score. VISTA follows a StillFast-style design that combines spatial detection with short-term temporal context. Object proposals come from a COCO-pretrained Faster R-CNN ResNet-50 FPN detector, and temporal context comes from a frozen V-JEPA 2.1 branch.
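
To make the prediction format concrete, the five outputs named above can be sketched as a small record type. This is an illustration only: the class name `STAPrediction`, the field names, the (x1, y1, x2, y2) pixel box convention, seconds as the time-to-contact unit, and the `top_k` helper are all assumptions, not the challenge's official submission schema.

```python
from dataclasses import dataclass, asdict
from typing import List, Tuple


@dataclass(frozen=True)
class STAPrediction:
    """One anticipated interaction (hypothetical layout, not the official schema)."""
    box: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in pixels
    noun: str                               # category of the object to be touched
    verb: str                               # category of the upcoming action
    time_to_contact: float                  # assumed to be in seconds
    score: float                            # confidence, assumed to lie in [0, 1]

    def __post_init__(self) -> None:
        x1, y1, x2, y2 = self.box
        if not (x1 < x2 and y1 < y2):
            raise ValueError("box must satisfy x1 < x2 and y1 < y2")
        if self.time_to_contact < 0:
            raise ValueError("time_to_contact must be non-negative")
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must lie in [0, 1]")


def top_k(preds: List[STAPrediction], k: int) -> List[STAPrediction]:
    """Return the k most confident predictions, highest score first."""
    return sorted(preds, key=lambda p: p.score, reverse=True)[:k]


# Made-up values, for illustration only.
example = STAPrediction(
    box=(10.0, 20.0, 50.0, 80.0),
    noun="cup",
    verb="take",
    time_to_contact=0.8,
    score=0.4,
)
record = asdict(example)  # plain dict, ready for JSON serialization
```

Validating each record at construction time catches malformed boxes or out-of-range scores before a submission file is written.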

Key facts

VISTA is proposed for the Ego4D STA Challenge at EgoVis 2026.
The task anticipates the next human-object interaction from egocentric video.
Output includes bounding box, noun, verb, time-to-contact, and confidence.
VISTA uses a StillFast-style design combining spatial detection and temporal context.
Object proposals come from a COCO-pretrained Faster R-CNN ResNet-50 FPN detector.
Temporal context is extracted by a frozen V-JEPA 2.1 branch.
Temporal representation is fused via feature modulation and ROI-level context fusion.
Fused features are passed to multi-head STA predictors.
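
The fusion steps listed above can be sketched in NumPy. The summary does not specify how VISTA implements them, so everything here is an assumption: feature modulation is shown as a FiLM-style per-channel scale and shift, ROI-level context fusion as concatenating each pooled ROI vector with a clip-level context vector, and the prediction heads as independent linear layers. All function names, shapes, and weights are hypothetical.

```python
import numpy as np


def film_modulate(spatial, context, w_gamma, w_beta):
    """Scale and shift a (C, H, W) feature map using a (D,) context vector.

    Assumed FiLM-style modulation; the report's actual operator may differ.
    """
    gamma = 1.0 + w_gamma @ context  # (C,) per-channel scale, identity at zero weights
    beta = w_beta @ context          # (C,) per-channel shift
    return spatial * gamma[:, None, None] + beta[:, None, None]


def roi_pool_mean(feature_map, box):
    """Average the features inside an integer (x1, y1, x2, y2) box -> (C,)."""
    x1, y1, x2, y2 = box
    return feature_map[:, y1:y2, x1:x2].mean(axis=(1, 2))


def fuse_roi(feature_map, boxes, context):
    """Concatenate each pooled ROI vector with the context -> (N, C + D)."""
    rois = np.stack([roi_pool_mean(feature_map, b) for b in boxes])
    ctx = np.broadcast_to(context, (len(boxes), context.shape[0]))
    return np.concatenate([rois, ctx], axis=1)


def sta_heads(fused, heads):
    """Apply one linear head per output (noun, verb, time-to-contact, score)."""
    return {name: fused @ weight for name, weight in heads.items()}


# Toy dimensions and random stand-ins for real detector / V-JEPA features.
rng = np.random.default_rng(0)
C, H, W, D = 8, 16, 16, 4
spatial = rng.normal(size=(C, H, W))   # stand-in for detector features
context = rng.normal(size=D)           # stand-in for frozen temporal features
modulated = film_modulate(
    spatial, context, rng.normal(size=(C, D)), rng.normal(size=(C, D))
)
boxes = [(0, 0, 8, 8), (4, 4, 12, 16)]  # stand-in object proposals
fused = fuse_roi(modulated, boxes, context)
heads = {
    "noun": rng.normal(size=(C + D, 5)),
    "verb": rng.normal(size=(C + D, 3)),
    "ttc": rng.normal(size=(C + D, 1)),
    "score": rng.normal(size=(C + D, 1)),
}
outputs = sta_heads(fused, heads)
```

In this sketch only the fusion and head weights would be trainable; the context vector is treated as a fixed input, mirroring the frozen temporal branch.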
