Systematic Comparison of Asynchronous Inference Methods for VLA Models

publication · 2026-05-12

A recent study compares four techniques for addressing observation staleness in Vision-Language-Action (VLA) models. Under asynchronous inference, the robot keeps executing actions while the policy computes the next ones, so the observation behind the newly computed actions is already out of date by the time they run. The methods reviewed are inference-time inpainting, training-time delay simulation, future-state-aware conditioning, and lightweight residual correction. To keep the comparison fair, the researchers built two unified codebases in which every approach shares the same libraries and datasets. They benchmarked the methods on the Kinetix suite and the LIBERO manipulation benchmark, with inference delays of up to 20 control steps. The paper is available on arXiv under the identifier 2605.08168.

Key facts

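Four methods for mitigating observation staleness in VLA models are compared: IT-RTC (inference-time inpainting), TT-RTC (training-time delay simulation), VLASH (future-state-aware conditioning), and A2C2 (lightweight residual correction).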
Two unified codebases were developed for fair comparison.
Benchmarking was performed on the Kinetix suite with MLPMixer policies and on the LIBERO benchmark with SmolVLA.
Inference delays up to d=20 control steps were tested.
The study is published on arXiv with ID 2605.08168.
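
To make the notion of an inference delay of d control steps concrete, the following is a minimal sketch of how old the underlying observation is at each executed action. It is an illustration only and implements none of the four methods. It assumes a naive schedule in which the first d actions of every chunk are discarded because they are already in the past when inference finishes. The function name observation_ages and the chunk length parameter horizon are hypothetical and do not come from the paper.

```python
def observation_ages(num_steps: int, horizon: int, delay: int) -> list[int]:
    """Age, in control steps, of the observation behind each executed action.

    Assumed toy schedule (not taken from the paper): a chunk predicted from
    the observation at step o covers steps o .. o + horizon - 1, but inference
    only finishes at step o + delay, so the first `delay` actions are dropped.
    The next observation is taken early enough that its chunk is ready exactly
    when the current one runs out.
    """
    if not 0 <= delay < horizon:
        raise ValueError("delay must satisfy 0 <= delay < horizon")

    ages = []
    obs_step = 0  # step at which the current chunk's observation was taken
    # Execution starts at step `delay`, when the first chunk becomes available.
    for t in range(delay, delay + num_steps):
        if t - obs_step >= horizon:
            # Current chunk is exhausted: switch to the chunk whose
            # observation was taken `horizon - delay` steps after the last.
            obs_step += horizon - delay
        ages.append(t - obs_step)
    return ages


if __name__ == "__main__":
    # Hypothetical chunk length of 30 with the largest delay tested, d = 20.
    ages = observation_ages(num_steps=40, horizon=30, delay=20)
    print(min(ages), max(ages))
```

Under this assumed schedule no executed action is ever based on an observation fresher than d steps, which is the gap the compared methods try to close.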
