ARTFEED — Contemporary Art Intelligence

AWS Building Blocks for Foundation Model Training and Inference

other · 2026-05-12

A technical blog post by AWS and NVIDIA engineers details the infrastructure and software stack required for large-scale foundation model training and inference on AWS. The post, authored by Aman Shanbhag (NVIDIA), Pavel Belevich, and Keita Watanabe (both AWS), outlines a four-layer architecture: infrastructure (EC2 P-series instances with NVIDIA GPUs, EFA networking, tiered storage), resource orchestration (Slurm, Kubernetes, SageMaker HyperPod), the ML software stack (CUDA, NCCL, PyTorch, distributed frameworks), and observability (Prometheus, Grafana, DCGM).

Key hardware includes P5 (H100), P5e (H200), and P6 (B200/B300) instances, as well as P6e-GB200 UltraServers with up to 72 Blackwell GPUs. The post emphasizes convergent requirements across pre-training, post-training, and inference: tightly coupled compute, high-bandwidth low-latency networking, and distributed storage. It also surveys open-source schedulers (Slurm, Kubernetes, Kueue, Volcano) and frameworks (Megatron Core, NeMo, vLLM, SGLang). Observability is highlighted as critical for debugging at scale, with GPU health monitoring via DCGM-Exporter and Grafana dashboards.
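
As a concrete illustration of how these layers interact, here is a minimal distributed data-parallel training sketch in PyTorch (not taken from the post). It assumes a torchrun launch, which sets RANK, LOCAL_RANK, and WORLD_SIZE, and a node image with CUDA, NCCL, and the aws-ofi-nccl plugin installed, so that NCCL collectives are carried over EFA via Libfabric; the model and tensor sizes are placeholders.

    import os
    import torch
    import torch.distributed as dist

    def main():
        # NCCL discovers the aws-ofi-nccl plugin at runtime when it is
        # installed on the node; Libfabric then routes collectives over EFA.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Placeholder model; a real job would build a transformer here
        # with Megatron Core, NeMo, or plain PyTorch.
        model = torch.nn.Linear(4096, 4096).cuda(local_rank)
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[local_rank]
        )

        x = torch.randn(8, 4096, device=f"cuda:{local_rank}")
        loss = model(x).sum()
        loss.backward()  # gradient all-reduce runs over NCCL, hence EFA

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with, for example, torchrun --nproc_per_node 8 train.py, the same script scales from one node to a cluster; the orchestration layer (Slurm, Kubernetes, or HyperPod) only changes how the processes are placed.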

Key facts

  • The post is authored by Aman Shanbhag (NVIDIA), Pavel Belevich, and Keita Watanabe (AWS).
  • It describes a four-layer architecture: infrastructure, resource orchestration, ML software stack, and observability.
  • AWS EC2 instances covered include P5 (H100), P5e (H200), P6 (B200/B300), and P6e-GB200 UltraServers.
  • P6e-GB200 UltraServers connect up to 72 Blackwell GPUs in one NVLink domain.
  • EFA networking versions include EFAv2, EFAv3 (35% lower latency than EFAv2), and EFAv4 (a further 18% improvement over EFAv3).
  • Resource orchestration options include Slurm (via AWS ParallelCluster, AWS Parallel Computing Service (PCS), or SageMaker HyperPod) and Kubernetes (via EKS or HyperPod EKS).
  • HyperPod EKS features task governance, checkpointless training, and elastic training.
  • The ML software stack includes CUDA, NCCL, the aws-ofi-nccl plugin, PyTorch, and frameworks such as Megatron Core, NeMo, vLLM, and SGLang (see the inference sketch after this list).
  • Observability uses Prometheus, Grafana, DCGM-Exporter, and EFA counters (see the monitoring sketch after this list).
  • The post targets machine learning engineers and researchers working with OSS frameworks on AWS.
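
As a small illustration of the inference side of the stack, the sketch below runs offline batch inference with vLLM; the model ID is an arbitrary Hugging Face example, not one named in the post.

    from vllm import LLM, SamplingParams

    # Any Hugging Face model ID works here; this one is illustrative.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(["What is EFA networking?"], params)
    for out in outputs:
        print(out.outputs[0].text)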

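For the observability layer, a minimal monitoring sketch: DCGM-Exporter publishes per-GPU gauges such as DCGM_FI_DEV_GPU_UTIL, which Prometheus scrapes and which can then be read back over the Prometheus HTTP API. The endpoint address below is an assumption for a typical deployment.

    import json
    import urllib.parse
    import urllib.request

    PROM_URL = "http://prometheus:9090"  # assumed service address
    query = "avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)"  # per-GPU utilization gauge

    url = PROM_URL + "/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)

    for series in data["data"]["result"]:
        gpu = series["metric"].get("gpu", "?")
        value = series["value"][1]
        print(f"GPU {gpu}: {value}% utilization")
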
Entities

Institutions

  • Amazon Web Services
  • NVIDIA
  • Hugging Face

Services and platforms

  • Amazon EC2
  • Amazon SageMaker HyperPod
  • Amazon EKS
  • Amazon FSx for Lustre
  • Amazon S3
  • Amazon Managed Service for Prometheus
  • Amazon Managed Grafana
  • AWS ParallelCluster
  • AWS Parallel Computing Service

Software and frameworks

  • PyTorch
  • JAX
  • Slurm
  • Kubernetes
  • Kueue
  • Volcano
  • Karpenter
  • NVIDIA KAI Scheduler
  • Megatron Core
  • NeMo
  • vLLM
  • SGLang
  • NVIDIA Dynamo
  • NVIDIA Inference Xfer Library
  • NVIDIA Collective Communications Library
  • NVIDIA CUDA
  • NVIDIA Triton
  • NVIDIA CuTe
  • CUTLASS
  • FlashAttention
  • DeepSpeed
  • veRL
  • Libfabric
  • Prometheus
  • Grafana
  • DCGM-Exporter
  • GDRCopy
  • UCX
  • GPUDirect Storage
