ARTFEED — Contemporary Art Intelligence

Memory-Efficient VLA Inference on VRAM-Constrained GPUs via CPU-GPU Swapping

ai-technology · 2026-05-13

A new framework enables memory-efficient inference for Vision-Language-Action (VLA) models on commodity GPUs with only 12-16 GB of VRAM, without modifying the model. VLA models for autonomous driving typically require 20-60 GB of GPU memory, so their weights cannot all reside on such hardware at once. The approach proceeds in three stages: Sequential Demand Layering swaps parameters between CPU and GPU memory one layer at a time, shrinking peak VRAM usage to layer-level granularity; Pipelined Demand Layering overlaps each layer's parameter transfer with the preceding layer's computation to hide swap latency; and a GPU-Resident Layer Decision Policy keeps selected layers permanently in VRAM to eliminate the residual transfer overhead. A performance prediction model determines the optimal configuration for a given GPU. The work is published on arXiv (2605.11678).
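
How the swapping works is easiest to see in code. Below is a minimal PyTorch sketch of the second stage, Pipelined Demand Layering, under two assumptions the summary does not state: the model is a stack of architecturally identical blocks, and weights do not change between inferences. The function pipelined_forward and its double-buffering scheme are illustrative, not the paper's implementation; dropping the side stream and copying each block synchronously would reduce it to the first stage, Sequential Demand Layering.

    import copy
    import torch

    @torch.no_grad()
    def pipelined_forward(cpu_blocks, x, device="cuda"):
        # Illustrative sketch (not the paper's code): pipelined demand
        # layering over a stack of architecturally identical blocks.
        # Two GPU "shell" copies are double-buffered; the next block's
        # weights are copied host-to-device on a side stream while the
        # current block computes.
        copy_stream = torch.cuda.Stream()
        compute_stream = torch.cuda.current_stream()

        # Pin CPU weights so host-to-device copies can run asynchronously.
        for block in cpu_blocks:
            for p in block.parameters():
                p.data = p.data.pin_memory()

        # Two GPU shells: only two blocks' worth of weights occupy VRAM.
        shells = [copy.deepcopy(cpu_blocks[0]).to(device) for _ in range(2)]

        def prefetch(i):
            shell = shells[i % 2]
            with torch.cuda.stream(copy_stream):
                # Don't overwrite a shell the compute stream may still be using.
                copy_stream.wait_stream(compute_stream)
                for p_gpu, p_cpu in zip(shell.parameters(), cpu_blocks[i].parameters()):
                    p_gpu.copy_(p_cpu, non_blocking=True)

        prefetch(0)                                   # stage the first block's weights
        x = x.to(device)
        for i in range(len(cpu_blocks)):
            compute_stream.wait_stream(copy_stream)   # block i's weights must have arrived
            if i + 1 < len(cpu_blocks):
                prefetch(i + 1)                       # overlaps with block i's compute
            x = shells[i % 2](x)                      # run block i
        return x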

Key facts

  • VLA models require 20-60 GB of GPU memory
  • Commodity GPUs have only 12-16 GB of VRAM
  • Framework enables inference without model modification
  • Three-stage optimization: Sequential Demand Layering, Pipelined Demand Layering, GPU-Resident Layer Decision Policy
  • Performance prediction model for optimal configuration (see the sketch after this list)
  • Published on arXiv with ID 2605.11678
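
The digest does not describe the performance prediction model or the residency policy in detail, so the sketch below is only a toy rendering of the idea: predict each swapped layer's host-to-device transfer time, charge the pipeline for any transfer that cannot hide under the preceding layer's compute, and greedily pin the costliest layers in whatever VRAM the swapping stages left free. The Layer and plan_resident_layers names, the 16 GB/s bandwidth default, and the greedy policy are assumptions for illustration.

    from dataclasses import dataclass

    @dataclass
    class Layer:
        name: str
        nbytes: int          # parameter size in bytes
        compute_ms: float    # measured per-layer compute time

    def predict_latency_ms(layers, resident, bytes_per_ms):
        # Each non-resident layer's host-to-device copy overlaps the
        # previous layer's compute; leftover copy time stalls the pipeline.
        first = layers[0]
        total = 0.0 if first.name in resident else first.nbytes / bytes_per_ms
        for i, layer in enumerate(layers):
            total += layer.compute_ms
            if i + 1 < len(layers) and layers[i + 1].name not in resident:
                transfer = layers[i + 1].nbytes / bytes_per_ms
                total += max(0.0, transfer - layer.compute_ms)  # pipeline stall
        return total

    def plan_resident_layers(layers, vram_budget_bytes, pcie_gb_per_s=16.0):
        # Greedily pin the layer whose residency saves the most predicted
        # time, until no pinning helps or the VRAM budget is spent.
        bytes_per_ms = pcie_gb_per_s * 1e6            # GB/s -> bytes per ms
        resident, budget = set(), vram_budget_bytes
        while True:
            base = predict_latency_ms(layers, resident, bytes_per_ms)
            best, best_gain = None, 0.0
            for layer in layers:
                if layer.name in resident or layer.nbytes > budget:
                    continue
                gain = base - predict_latency_ms(layers, resident | {layer.name}, bytes_per_ms)
                if gain > best_gain:
                    best, best_gain = layer, gain
            if best is None:
                return resident
            resident.add(best.name)
            budget -= best.nbytes

    # Example: 32 uniform 400 MB blocks, 4 GB of VRAM left after activations.
    blocks = [Layer(f"blk{i}", nbytes=400_000_000, compute_ms=3.0) for i in range(32)]
    print(sorted(plan_resident_layers(blocks, vram_budget_bytes=4_000_000_000)))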

Entities

Institutions

  • arXiv

Sources

  • arXiv 2605.11678