ARTFEED — Contemporary Art Intelligence

Physical AI Inference Gap: Batch-1 LLM Decode Not Purely Memory-Bandwidth-Bound

ai-technology · 2026-06-01

A recent study challenges the common assumption that batch-1 autoregressive decoding in Physical AI systems is limited mainly by memory bandwidth. The researchers measured single-stream decoding for three GQA transformers in the 7-8B parameter range on four NVIDIA GPUs (H100 SXM5, A100-80GB SXM4, L40S, L4) at context lengths from 2048 to 16384, yielding 44 valid measurement cells under a controlled bf16 SDPA setup. They found that the fraction of peak memory bandwidth actually achieved falls as peak bandwidth rises. For example, on Qwen-2.5-7B at a context length of 2048, the L4 reaches about 81% of its analytic memory floor, while the H100 reaches a lower fraction. This suggests that factors other than bandwidth, such as compute or memory latency, also shape performance, which matters when optimizing inference for robots, autonomous vehicles, and edge devices.
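To make the "memory floor" idea concrete, here is a minimal Python sketch of one common way to compute it. The floor is read here as the per-token latency that would result if all weights and the KV cache were read once at peak bandwidth. The study's exact definition is not given in this summary, so that reading is an assumption. The model shape values and the peak-bandwidth figures are also assumptions taken from public configs and vendor datasheets, not from the article. The helper names are hypothetical.

```python
from dataclasses import dataclass

BYTES_PER_BF16 = 2  # bf16 stores each value in 2 bytes


@dataclass(frozen=True)
class ModelShape:
    n_params: float  # total parameter count (approximate)
    n_layers: int
    n_kv_heads: int  # GQA: fewer KV heads than query heads
    head_dim: int


# ASSUMPTION: approximate public config values for Qwen-2.5-7B, not from the article.
QWEN25_7B = ModelShape(n_params=7.6e9, n_layers=28, n_kv_heads=4, head_dim=128)

# ASSUMPTION: nominal vendor peak memory bandwidth in bytes/s, not from the article.
PEAK_BW = {
    "H100 SXM5": 3.35e12,
    "A100-80GB SXM4": 2.039e12,
    "L40S": 864e9,
    "L4": 300e9,
}


def kv_cache_bytes(m: ModelShape, ctx_len: int) -> int:
    """Bytes of K and V cache read to attend over ctx_len cached tokens."""
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * ctx_len * BYTES_PER_BF16


def memory_floor_seconds(m: ModelShape, ctx_len: int, peak_bw: float) -> float:
    """Lower bound on per-token latency if every byte moved at peak bandwidth.

    Simplification: counts all parameters as read once per token, which
    slightly overstates traffic (only one embedding row is actually needed).
    """
    total_bytes = m.n_params * BYTES_PER_BF16 + kv_cache_bytes(m, ctx_len)
    return total_bytes / peak_bw


def achieved_fraction(floor_s: float, measured_s: float) -> float:
    """Share of the floor attained; 1.0 would mean fully bandwidth-bound."""
    return floor_s / measured_s


if __name__ == "__main__":
    for gpu, bw in PEAK_BW.items():
        floor_ms = 1e3 * memory_floor_seconds(QWEN25_7B, 2048, bw)
        print(f"{gpu}: floor = {floor_ms:.2f} ms/token")
```

Under this reading, a measured per-token latency is compared against the floor with achieved_fraction. A value near 1.0 indicates a bandwidth-bound workload, and a lower value means time is being spent on something other than streaming bytes from memory.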

Key facts

  • Physical AI systems run batch-1 autoregressive decode rather than batched cloud LLM serving.
  • This workload is usually described as memory-bandwidth-bound.
  • Study measured three 7-8B class GQA transformers on four NVIDIA GPUs: H100 SXM5, A100-80GB SXM4, L40S, L4.
  • Context lengths evaluated: 2048 to 16384.
  • 44 valid cells produced under controlled bf16 SDPA setup.
  • Achieved fraction of peak memory bandwidth falls as peak bandwidth rises.
  • On Qwen-2.5-7B at ctx=2048, the L4 reaches ~81% of its analytic memory floor; the H100 reaches a lower fraction.
  • Findings suggest the workload is not purely memory-bandwidth-bound.

Entities

Institutions

  • NVIDIA
  • arXiv

Sources