ARTFEED — Contemporary Art Intelligence

VISTA: V-JEPA Integrated Anticipator for Ego4D STA Challenge

ai-technology · 2026-05-22

A new technical report introduces VISTA, a model built for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. The challenge asks participants to anticipate the next human-object interaction from egocentric video. Each submitted prediction consists of an object bounding box, noun and verb categories, a time-to-contact estimate, and a confidence score. VISTA follows a StillFast-style design that combines spatial detection with short-term temporal context. Object proposals come from a COCO-pretrained Faster R-CNN ResNet-50 FPN detector, while a frozen V-JEPA 2.1 branch supplies temporal context. The temporal representation is fused through feature modulation and ROI-level context fusion, and the fused features are passed to multi-head STA predictors.
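
To make the prediction format concrete, the five fields of a single STA prediction can be sketched as a small record type. This is an illustration only: the class name, field names, box layout, and the use of seconds for time-to-contact are assumptions, not the challenge's official submission schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class STAPrediction:
    """One anticipated interaction. All field names here are hypothetical."""

    box: tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)
    noun: str     # category of the object about to be interacted with
    verb: str     # category of the anticipated interaction
    ttc: float    # time-to-contact estimate (seconds assumed)
    score: float  # confidence in [0, 1]

    def __post_init__(self) -> None:
        # Reject records that could not describe a real prediction.
        x1, y1, x2, y2 = self.box
        if x2 <= x1 or y2 <= y1:
            raise ValueError("box must have positive width and height")
        if self.ttc < 0:
            raise ValueError("time-to-contact cannot be negative")
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must lie in [0, 1]")


def top_k(predictions: list[STAPrediction], k: int) -> list[STAPrediction]:
    """Return the k most confident predictions, highest score first."""
    return sorted(predictions, key=lambda p: p.score, reverse=True)[:k]
```

Ranking by confidence, as in the hypothetical top_k helper, reflects the common practice of scoring detection-style outputs in descending confidence order; the report summary does not state how the challenge ranks submissions.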

Key facts

  • VISTA is proposed for the Ego4D STA Challenge at EgoVis 2026.
  • The task anticipates the next human-object interaction from egocentric video.
  • Output includes bounding box, noun, verb, time-to-contact, and confidence.
  • VISTA uses a StillFast-style design combining spatial detection and temporal context.
  • Object proposals come from a COCO-pretrained Faster R-CNN ResNet-50 FPN detector.
  • Temporal context is extracted by a frozen V-JEPA 2.1 branch.
  • Temporal representation is fused via feature modulation and ROI-level context fusion.
  • Fused features are passed to multi-head STA predictors.
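
The fusion steps listed above can be illustrated with a toy sketch in plain Python. It assumes that feature modulation is a FiLM-style per-channel scale and shift and that ROI-level context fusion is a simple concatenation; the summary confirms neither detail, and every function name below is hypothetical rather than taken from the VISTA report.

```python
from typing import Callable

Vector = list[float]


def modulate(channels: list[Vector], gamma: Vector, beta: Vector) -> list[Vector]:
    """Scale and shift each channel (FiLM-style; an assumed form of modulation)."""
    if not (len(channels) == len(gamma) == len(beta)):
        raise ValueError("one (gamma, beta) pair is needed per channel")
    return [[g * v + b for v in ch] for ch, g, b in zip(channels, gamma, beta)]


def pool(channels: list[Vector]) -> Vector:
    """Average each channel to a single value, standing in for ROI pooling."""
    return [sum(ch) / len(ch) for ch in channels]


def fuse_roi(roi_vec: Vector, context_vec: Vector) -> Vector:
    """Append clip-level context to a pooled ROI vector (assumed: concatenation)."""
    return roi_vec + context_vec


def predict(fused: Vector, heads: dict[str, Callable[[Vector], float]]) -> dict[str, float]:
    """Run every prediction head on the same fused ROI vector."""
    return {name: head(fused) for name, head in heads.items()}


# Toy usage: gamma, beta, and the context vector would come from the temporal branch.
spatial = [[1.0, 2.0], [3.0, 4.0]]
modulated = modulate(spatial, gamma=[2.0, 0.5], beta=[0.0, 1.0])
fused = fuse_roi(pool(modulated), context_vec=[0.25])
outputs = predict(fused, {"ttc": lambda v: sum(v), "score": lambda v: min(1.0, v[-1])})
```

In a real model the heads would be learned layers for noun, verb, time-to-contact, and confidence; the lambdas here are placeholders that only show how one fused vector feeds several heads.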

Entities

Institutions

  • Ego4D
  • EgoVis

Sources