ARTFEED — Contemporary Art Intelligence

Memory-Efficient VLA Inference on VRAM-Constrained GPUs via CPU-GPU Swapping

ai-technology · 2026-05-13

A new framework enables memory-efficient inference for Vision-Language-Action (VLA) models on commodity GPUs with only 12-16 GB of VRAM, without modifying the model. VLA models for autonomous driving typically require 20-60 GB of GPU memory, so their weights cannot all reside on such hardware at once. The approach proceeds in three stages: Sequential Demand Layering swaps parameters between CPU and GPU memory one layer at a time, shrinking peak VRAM usage to layer-level granularity; Pipelined Demand Layering overlaps each layer's parameter transfer with the preceding layer's computation to hide swap latency; and a GPU-Resident Layer Decision Policy keeps selected layers permanently in VRAM to eliminate the residual transfer overhead. A performance prediction model determines the optimal configuration for a given GPU. The work is published on arXiv (2605.11678).
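
How the swapping works is easiest to see in code. Below is a minimal PyTorch sketch of the second stage, Pipelined Demand Layering, under two assumptions the summary does not state: the model is a stack of architecturally identical blocks, and weights do not change between inferences. The function pipelined_forward and its double-buffering scheme are illustrative, not the paper's implementation; dropping the side stream and copying each block synchronously would reduce it to the first stage, Sequential Demand Layering.

    import copy
    import torch

    @torch.no_grad()
    def pipelined_forward(cpu_blocks, x, device="cuda"):
        # Illustrative sketch (not the paper's code): pipelined demand
        # layering over a stack of architecturally identical blocks.
        # Two GPU "shell" copies are double-buffered; the next block's
        # weights are copied host-to-device on a side stream while the
        # current block computes.
        copy_stream = torch.cuda.Stream()
        compute_stream = torch.cuda.current_stream()

        # Pin CPU weights so host-to-device copies can run asynchronously.
        for block in cpu_blocks:
            for p in block.parameters():
                p.data = p.data.pin_memory()

        # Two GPU shells: only two blocks' worth of weights occupy VRAM.
        shells = [copy.deepcopy(cpu_blocks[0]).to(device) for _ in range(2)]

        def prefetch(i):
            shell = shells[i % 2]
            with torch.cuda.stream(copy_stream):
                # Don't overwrite a shell the compute stream may still be using.
                copy_stream.wait_stream(compute_stream)
                for p_gpu, p_cpu in zip(shell.parameters(), cpu_blocks[i].parameters()):
                    p_gpu.copy_(p_cpu, non_blocking=True)

        prefetch(0)                                   # stage the first block's weights
        x = x.to(device)
        for i in range(len(cpu_blocks)):
            compute_stream.wait_stream(copy_stream)   # block i's weights must have arrived
            if i + 1 < len(cpu_blocks):
                prefetch(i + 1)                       # overlaps with block i's compute
            x = shells[i % 2](x)                      # run block i
        return x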

Key facts

  • VLA models require 20-60 GB of GPU memory
  • Commodity GPUs have only 12-16 GB of VRAM
  • Framework enables inference without model modification
  • Three-stage optimization: Sequential Demand Layering, Pipelined Demand Layering, GPU-Resident Layer Decision Policy
  • Performance prediction model for optimal configuration (see the sketch after this list)
  • Published on arXiv with ID 2605.11678
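
The digest does not describe the performance prediction model or the residency policy in detail, so the sketch below is only a toy rendering of the idea: predict each swapped layer's host-to-device transfer time, charge the pipeline for any transfer that cannot hide under the preceding layer's compute, and greedily pin the costliest layers in whatever VRAM the swapping stages left free. The Layer and plan_resident_layers names, the 16 GB/s bandwidth default, and the greedy policy are assumptions for illustration.

    from dataclasses import dataclass

    @dataclass
    class Layer:
        name: str
        nbytes: int          # parameter size in bytes
        compute_ms: float    # measured per-layer compute time

    def predict_latency_ms(layers, resident, bytes_per_ms):
        # Each non-resident layer's host-to-device copy overlaps the
        # previous layer's compute; leftover copy time stalls the pipeline.
        first = layers[0]
        total = 0.0 if first.name in resident else first.nbytes / bytes_per_ms
        for i, layer in enumerate(layers):
            total += layer.compute_ms
            if i + 1 < len(layers) and layers[i + 1].name not in resident:
                transfer = layers[i + 1].nbytes / bytes_per_ms
                total += max(0.0, transfer - layer.compute_ms)  # pipeline stall
        return total

    def plan_resident_layers(layers, vram_budget_bytes, pcie_gb_per_s=16.0):
        # Greedily pin the layer whose residency saves the most predicted
        # time, until no pinning helps or the VRAM budget is spent.
        bytes_per_ms = pcie_gb_per_s * 1e6            # GB/s -> bytes per ms
        resident, budget = set(), vram_budget_bytes
        while True:
            base = predict_latency_ms(layers, resident, bytes_per_ms)
            best, best_gain = None, 0.0
            for layer in layers:
                if layer.name in resident or layer.nbytes > budget:
                    continue
                gain = base - predict_latency_ms(layers, resident | {layer.name}, bytes_per_ms)
                if gain > best_gain:
                    best, best_gain = layer, gain
            if best is None:
                return resident
            resident.add(best.name)
            budget -= best.nbytes

    # Example: 32 uniform 400 MB blocks, 4 GB of VRAM left after activations.
    blocks = [Layer(f"blk{i}", nbytes=400_000_000, compute_ms=3.0) for i in range(32)]
    print(sorted(plan_resident_layers(blocks, vram_budget_bytes=4_000_000_000)))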

Entities

Institutions

  • arXiv

Sources

  • arXiv 2605.11678