ARTFEED — Contemporary Art Intelligence

AWS Building Blocks for Foundation Model Training and Inference

other · 2026-05-12

A technical blog post by AWS and NVIDIA engineers details the infrastructure and software stack required for large-scale foundation model training and inference on AWS. The post, authored by Aman Shanbhag (NVIDIA), Pavel Belevich, and Keita Watanabe (both AWS), outlines a four-layer architecture: infrastructure (EC2 P-series instances with NVIDIA GPUs, EFA networking, tiered storage), resource orchestration (Slurm, Kubernetes, SageMaker HyperPod), the ML software stack (CUDA, NCCL, PyTorch, distributed frameworks), and observability (Prometheus, Grafana, DCGM).

Key hardware includes P5 (H100), P5e (H200), and P6 (B200/B300) instances, as well as P6e-GB200 UltraServers with up to 72 Blackwell GPUs. The post emphasizes convergent requirements across pre-training, post-training, and inference: tightly coupled compute, high-bandwidth low-latency networking, and distributed storage. It also surveys open-source schedulers (Slurm, Kubernetes, Kueue, Volcano) and frameworks (Megatron Core, NeMo, vLLM, SGLang). Observability is highlighted as critical for debugging at scale, with GPU health monitoring via DCGM-Exporter and Grafana dashboards.
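
As a concrete illustration of how these layers interact, here is a minimal distributed data-parallel training sketch in PyTorch (not taken from the post). It assumes a torchrun launch, which sets RANK, LOCAL_RANK, and WORLD_SIZE, and a node image with CUDA, NCCL, and the aws-ofi-nccl plugin installed, so that NCCL collectives are carried over EFA via Libfabric; the model and tensor sizes are placeholders.

    import os
    import torch
    import torch.distributed as dist

    def main():
        # NCCL discovers the aws-ofi-nccl plugin at runtime when it is
        # installed on the node; Libfabric then routes collectives over EFA.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Placeholder model; a real job would build a transformer here
        # with Megatron Core, NeMo, or plain PyTorch.
        model = torch.nn.Linear(4096, 4096).cuda(local_rank)
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[local_rank]
        )

        x = torch.randn(8, 4096, device=f"cuda:{local_rank}")
        loss = model(x).sum()
        loss.backward()  # gradient all-reduce runs over NCCL, hence EFA

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with, for example, torchrun --nproc_per_node 8 train.py, the same script scales from one node to a cluster; the orchestration layer (Slurm, Kubernetes, or HyperPod) only changes how the processes are placed.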

Key facts

  • The post is authored by Aman Shanbhag (NVIDIA), Pavel Belevich, and Keita Watanabe (AWS).
  • It describes a four-layer architecture: infrastructure, resource orchestration, ML software stack, and observability.
  • AWS EC2 instances covered include P5 (H100), P5e (H200), P6 (B200/B300), and P6e-GB200 UltraServers.
  • P6e-GB200 UltraServers connect up to 72 Blackwell GPUs in one NVLink domain.
  • EFA networking versions include EFAv2, EFAv3 (35% lower latency than EFAv2), and EFAv4 (a further 18% improvement over EFAv3).
  • Resource orchestration options include Slurm (via AWS ParallelCluster, AWS Parallel Computing Service (PCS), or SageMaker HyperPod) and Kubernetes (via EKS or HyperPod EKS).
  • HyperPod EKS features task governance, checkpointless training, and elastic training.
  • The ML software stack includes CUDA, NCCL, the aws-ofi-nccl plugin, PyTorch, and frameworks such as Megatron Core, NeMo, vLLM, and SGLang (see the inference sketch after this list).
  • Observability uses Prometheus, Grafana, DCGM-Exporter, and EFA counters (see the monitoring sketch after this list).
  • The post targets machine learning engineers and researchers working with OSS frameworks on AWS.
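
As a small illustration of the inference side of the stack, the sketch below runs offline batch inference with vLLM; the model ID is an arbitrary Hugging Face example, not one named in the post.

    from vllm import LLM, SamplingParams

    # Any Hugging Face model ID works here; this one is illustrative.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(["What is EFA networking?"], params)
    for out in outputs:
        print(out.outputs[0].text)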

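For the observability layer, a minimal monitoring sketch: DCGM-Exporter publishes per-GPU gauges such as DCGM_FI_DEV_GPU_UTIL, which Prometheus scrapes and which can then be read back over the Prometheus HTTP API. The endpoint address below is an assumption for a typical deployment.

    import json
    import urllib.parse
    import urllib.request

    PROM_URL = "http://prometheus:9090"  # assumed service address
    query = "avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)"  # per-GPU utilization gauge

    url = PROM_URL + "/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)

    for series in data["data"]["result"]:
        gpu = series["metric"].get("gpu", "?")
        value = series["value"][1]
        print(f"GPU {gpu}: {value}% utilization")
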
Entities

Institutions

  • Amazon Web Services
  • NVIDIA
  • Hugging Face

Services and platforms

  • Amazon EC2
  • Amazon SageMaker HyperPod
  • Amazon EKS
  • Amazon FSx for Lustre
  • Amazon S3
  • Amazon Managed Service for Prometheus
  • Amazon Managed Grafana
  • AWS ParallelCluster
  • AWS Parallel Computing Service

Software and frameworks

  • PyTorch
  • JAX
  • Slurm
  • Kubernetes
  • Kueue
  • Volcano
  • Karpenter
  • NVIDIA KAI Scheduler
  • Megatron Core
  • NeMo
  • vLLM
  • SGLang
  • NVIDIA Dynamo
  • NVIDIA Inference Xfer Library
  • NVIDIA Collective Communications Library
  • NVIDIA CUDA
  • NVIDIA Triton
  • NVIDIA CuTe
  • CUTLASS
  • FlashAttention
  • DeepSpeed
  • veRL
  • Libfabric
  • Prometheus
  • Grafana
  • DCGM-Exporter
  • GDRCopy
  • UCX
  • GPUDirect Storage
