AWS Building Blocks for Foundation Model Training and Inference

other · 2026-05-12

A technical post by AWS and NVIDIA engineers describes the infrastructure and software stack needed for large-scale training and inference of foundation models on AWS. Written by Aman Shanbhag (NVIDIA), Pavel Belevich, and Keita Watanabe (AWS), the post outlines a four-layer architecture: infrastructure (EC2 P instances with NVIDIA GPUs, EFA networking, tiered storage), resource orchestration (Slurm, Kubernetes, SageMaker HyperPod), the ML software stack (CUDA, NCCL, PyTorch, distributed frameworks), and observability (Prometheus, Grafana, DCGM). Key hardware includes P5 (H100), P5e (H200), P6 (B200/B300), and P6e-GB200 UltraServers with up to 72 Blackwell GPUs. The post emphasizes that pre-training, post-training, and inference have converging requirements: tightly coupled compute, high-bandwidth low-latency networking, and distributed storage. It also covers open source tools such as Slurm, Kubernetes, Kueue, and Volcano, and frameworks such as Megatron Core, NeMo, vLLM, and SGLang. Observability is highlighted as critical for debugging at scale, with GPU health monitoring through DCGM-Exporter and Grafana dashboards.

Key facts

The post is written by Aman Shanbhag (NVIDIA), Pavel Belevich, and Keita Watanabe (AWS).
It describes a four-layer architecture: infrastructure, resource orchestration, ML software stack, and observability.
The AWS EC2 instances covered include P5 (H100), P5e (H200), P6 (B200/B300), and P6e-GB200 UltraServers.
P6e-GB200 UltraServers connect up to 72 Blackwell GPUs in a single NVLink domain.
EFA network versions include EFAv2, EFAv3 (35% lower latency), and EFAv4 (an 18% improvement over v3).
Resource orchestration options include Slurm (through AWS ParallelCluster, PCS, SageMaker HyperPod) and Kubernetes (through EKS, HyperPod EKS).
HyperPod EKS offers task governance, checkpointless training, and elastic training.
The ML software stack includes CUDA, NCCL, the aws-ofi-nccl plugin, PyTorch, and frameworks such as Megatron Core, NeMo, vLLM, and SGLang.
Observability uses Prometheus, Grafana, DCGM-Exporter, and EFA counters.
The post is aimed at machine learning engineers and researchers who work with OSS frameworks on AWS.
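
As a rough illustration of the observability layer listed above, the sketch below shows how GPU temperature samples exported by DCGM-Exporter might be checked once they have been scraped into Prometheus. It is not taken from the post. It assumes the metric name DCGM_FI_DEV_GPU_TEMP and the Hostname and gpu labels that DCGM-Exporter commonly emits, and it parses a hard-coded payload shaped like a Prometheus instant-query response instead of calling a live server. The hostnames, readings, the 85 °C threshold, and the hot_gpus helper are all hypothetical.

```python
import json

# Illustrative payload shaped like a Prometheus instant-query response
# (GET /api/v1/query). Hostnames and readings are made up for this sketch.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "DCGM_FI_DEV_GPU_TEMP", "Hostname": "node-a", "gpu": "0"},
       "value": [1747000000, "61"]},
      {"metric": {"__name__": "DCGM_FI_DEV_GPU_TEMP", "Hostname": "node-a", "gpu": "1"},
       "value": [1747000000, "88"]},
      {"metric": {"__name__": "DCGM_FI_DEV_GPU_TEMP", "Hostname": "node-b", "gpu": "0"},
       "value": [1747000000, "59"]}
    ]
  }
}
""")


def hot_gpus(response, threshold_c=85.0):
    """Return (hostname, gpu_index, temperature) for GPUs above threshold_c.

    threshold_c is an arbitrary example value, not a vendor recommendation.
    """
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    flagged = []
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        # Prometheus returns each sample as [timestamp, "value-as-string"].
        _timestamp, raw_value = sample["value"]
        temperature = float(raw_value)
        if temperature > threshold_c:
            flagged.append(
                (labels.get("Hostname", "unknown"), labels.get("gpu", "?"), temperature)
            )
    return flagged


print(hot_gpus(SAMPLE_RESPONSE))
```

In a real cluster the payload would come from a Prometheus-compatible endpoint, and a check like this would more likely live in an alerting rule or a Grafana panel than in a script.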

Entities

Institutions

Amazon Web Services
NVIDIA
Amazon EC2
Amazon SageMaker HyperPod
Amazon EKS
Amazon FSx for Lustre
Amazon S3
Amazon Managed Service for Prometheus
Amazon Managed Grafana
Hugging Face
PyTorch
JAX
Slurm
Kubernetes
Kueue
Volcano
NVIDIA KAI Scheduler
Megatron Core
NeMo
vLLM
SGLang
NVIDIA Dynamo
NVIDIA Inference Xfer Library
NVIDIA Collective Communications Library
NVIDIA CUDA
NVIDIA Triton
NVIDIA CuTe
CUTLASS
FlashAttention
DeepSpeed
veRL
Libfabric
Prometheus
Grafana
DCGM-Exporter
AWS ParallelCluster
AWS Parallel Computing Service
Karpenter
GDRCopy
UCX
GPUDirect Storage

Sources

Hugging Face Blog — 2026-05-11