🎯 GPU Architecture & Types
🏗️ GPU Categories in Data Centers
AI/ML Training GPUs (Flagship Performance)
- NVIDIA H100 SXM: 80GB HBM3, 3.35 TB/s memory bandwidth, 989 TF BF16
- NVIDIA A100 SXM: 80GB HBM2e, 2.0 TB/s memory bandwidth, 312 TF BF16 (624 TF with sparsity)
- AMD MI250X: 128GB HBM2e, 3.2 TB/s memory bandwidth, 383 TF BF16
- Intel Ponte Vecchio: 128GB HBM2e, 2.6 TB/s memory bandwidth, 838 TF BF16
- Use Cases: Large language models, computer vision, deep learning research
- Form Factor: SXM modules in 8-GPU baseboard configurations
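A useful way to compare these training parts is memory bandwidth available per unit of compute, since that determines how easily a workload keeps the tensor cores fed. A minimal Python sketch using the nominal spec-sheet peaks listed above (treat them as theoretical maxima, not sustained figures):

```python
# Compare flagship training GPUs by memory bandwidth per TFLOP of BF16 compute.
# Numbers are the nominal spec-sheet peaks quoted in the list above.
gpus = {
    "H100 SXM":      {"mem_gb": 80,  "bw_tbs": 3.35, "bf16_tf": 989},
    "A100 SXM":      {"mem_gb": 80,  "bw_tbs": 2.0,  "bf16_tf": 624},
    "MI250X":        {"mem_gb": 128, "bw_tbs": 3.2,  "bf16_tf": 383},
    "Ponte Vecchio": {"mem_gb": 128, "bw_tbs": 2.6,  "bf16_tf": 838},
}

def gb_per_tflop(spec):
    """Memory bandwidth (GB/s) available per TFLOP of peak BF16 compute."""
    return spec["bw_tbs"] * 1000 / spec["bf16_tf"]

for name, spec in gpus.items():
    print(f"{name:14s} {gb_per_tflop(spec):5.2f} GB/s per TFLOP")
```

A higher ratio means a part tolerates lower arithmetic intensity before becoming bandwidth-bound, which is why the MI250X fares better on memory-heavy kernels despite its lower peak TFLOPS.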
AI/ML Inference GPUs (Optimized Efficiency)
- NVIDIA L40S: 48GB GDDR6, 864 GB/s bandwidth, 362 TF BF16 (733 TF with sparsity), AV1 encode/decode
- NVIDIA L4: 24GB GDDR6, 300 GB/s bandwidth, 121 TF BF16 (242 TF with sparsity), video acceleration
- AMD MI210: 64GB HBM2e, 1.6 TB/s bandwidth, 181 TF BF16
- Use Cases: Real-time inference, edge AI, video processing, recommendation systems
- Deployment: PCIe cards, higher density per rack unit
Multi-Instance GPUs (MIG) & Virtualization
- NVIDIA A100 MIG: Up to 7 instances per GPU (1g.10gb, 2g.20gb, 3g.40gb, 7g.80gb)
- NVIDIA H100 MIG: Up to 7 instances with enhanced isolation
- GPU Partitioning: Dedicated SM units, memory slices, cache isolation
- Use Cases: Multi-tenant inference, development environments, CI/CD pipelines
- Management: Kubernetes device plugins, NUMA topology awareness
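A MIG layout must fit within a GPU's compute-slice and memory budgets; each profile name encodes both (e.g. `3g.40gb` = 3 compute slices, 40GB). A minimal Python sketch of that budget check for the 80GB profiles listed above — note that real `nvidia-smi` placement rules are stricter than this (slice positions matter), so this only validates the two totals:

```python
# Check that a requested mix of MIG profiles fits one 80GB GPU's budgets.
# Profile name encodes compute slices ("Ng") and memory ("Mgb").
PROFILES = {"1g.10gb": (1, 10), "2g.20gb": (2, 20),
            "3g.40gb": (3, 40), "7g.80gb": (7, 80)}

def fits_on_gpu(requested, max_slices=7, max_mem_gb=80):
    """True if the requested MIG profiles fit a single GPU's budgets."""
    slices = sum(PROFILES[p][0] for p in requested)
    mem = sum(PROFILES[p][1] for p in requested)
    return slices <= max_slices and mem <= max_mem_gb

print(fits_on_gpu(["3g.40gb", "2g.20gb", "1g.10gb", "1g.10gb"]))  # 7 slices, 80GB: fits
print(fits_on_gpu(["7g.80gb", "1g.10gb"]))                        # 8 slices: does not fit
```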
🔗 Compute-GPU Pairing Architecture
💻 CPU-GPU Pairing Strategies
Balanced Pairing (General Workloads)
- CPU Configuration: 2x Intel Xeon 8480+ (56 cores) or AMD EPYC 9654 (96 cores)
- GPU Allocation: 4-8x NVIDIA L40S or A100 per dual-socket node
- Memory Ratio: 512GB-1TB system RAM : 48-80GB GPU memory per card
- PCIe Configuration: PCIe 5.0 x16 per GPU, NUMA-aware placement
- Use Cases: Mixed workloads, multi-user environments, development clusters
Compute-Dense Pairing (Training Workloads)
- CPU Configuration: 2x Intel Xeon 8490H (60 cores) or AMD EPYC 9684X (96 cores)
- GPU Allocation: 8x NVIDIA H100 SXM in NVLink domain
- Memory Configuration: 2TB system RAM, 640GB total GPU memory
- Interconnect: NVLink 4.0 at 900GB/s bidirectional between GPUs
- Use Cases: Large model training, distributed computing, HPC simulations
Inference-Optimized Pairing (Production Serving)
- CPU Configuration: Higher core count for request processing (2x AMD EPYC 9754)
- GPU Allocation: 16x NVIDIA L4 in 4U server for maximum density
- Memory Strategy: Lower GPU memory per card, higher system memory for caching
- Network: 200GbE+ for low-latency request handling
- Use Cases: Real-time inference, API serving, edge deployment
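The three pairing strategies above trade CPU cores and system RAM per GPU differently. The sketch below picks one representative node config per strategy from the bullets; where the text gives a range (e.g. 4-8 GPUs and 512GB-1TB RAM for "balanced") the chosen point is an assumption:

```python
# CPU cores and system RAM per GPU for the three pairing strategies above.
# Each entry is one representative node config; GPU count and RAM for the
# balanced/inference rows are assumptions within the ranges given.
pairings = {
    "balanced":  {"cores": 2 * 56,  "ram_gb": 1024, "gpus": 8},   # 2x Xeon 8480+
    "training":  {"cores": 2 * 60,  "ram_gb": 2048, "gpus": 8},   # 2x Xeon 8490H
    "inference": {"cores": 2 * 128, "ram_gb": 1024, "gpus": 16},  # 2x EPYC 9754
}

for name, p in pairings.items():
    print(f"{name:9s}: {p['cores'] / p['gpus']:5.1f} cores/GPU, "
          f"{p['ram_gb'] / p['gpus']:5.0f} GB RAM/GPU")
```

Note how the inference node keeps cores-per-GPU high despite doubling GPU density: request preprocessing and batching are CPU-side work.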
🌐 GPU Fabric & Interconnect Architecture
⚡ High-Speed GPU Interconnects
NVIDIA NVLink Architecture
- NVLink 4.0 (H100): 900GB/s bidirectional between GPUs
- NVLink 3.0 (A100): 600GB/s bidirectional between GPUs
- NVSwitch Architecture: 64 NVLink ports per third-generation switch, 3.2TB/s aggregate bandwidth
- Multi-Node Scaling: InfiniBand NDR 400Gb/s for inter-node communication
- GPU Direct: Direct memory access between GPUs and network/storage
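The per-GPU figures above roll up from per-link bandwidth: each NVLink 4.0 link carries 50 GB/s bidirectional, with 18 links per H100 versus 12 per A100. A quick sketch of the per-GPU and per-node aggregates:

```python
# NVLink per-GPU bandwidth = links per GPU x bidirectional bandwidth per link.
def nvlink_per_gpu_gbs(links, gbs_per_link=50):
    return links * gbs_per_link

h100 = nvlink_per_gpu_gbs(18)   # NVLink 4.0: 18 links per H100 -> 900 GB/s
a100 = nvlink_per_gpu_gbs(12)   # NVLink 3.0: 12 links per A100 -> 600 GB/s
print(h100, a100)

# Aggregate NVLink bandwidth across an 8-GPU HGX baseboard
print(8 * h100, "GB/s per node")
```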
Multi-Node GPU Cluster Topology
# 8-Node H100 Cluster Configuration
Node Layout (per node):
├── 8x NVIDIA H100 SXM (80GB each)
├── NVSwitch interconnect (all-to-all connectivity)
├── 2x InfiniBand NDR400 HCAs (RDMA capable)
└── 2x Intel Xeon 8490H CPUs
Inter-Node Fabric:
├── InfiniBand Fat-Tree Topology
├── 3:1 oversubscription ratio
├── RDMA over Converged Ethernet (RoCE) alternative
└── GPUDirect RDMA for zero-copy transfers
Total Cluster Specs:
├── 64 H100 GPUs (5,120GB total GPU memory)
├── 63.3 PetaFLOPS BF16 performance (64 x 989 TFLOPS)
├── 57.6TB/s aggregate intra-node NVLink bandwidth (64 x 900GB/s)
└── 6.4Tb/s aggregate inter-node bandwidth (16x NDR400 HCAs)
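The cluster totals follow directly from the per-GPU specs; rolling them up in Python (assuming 80GB, 989 TFLOPS BF16, and 900 GB/s NVLink per H100 as quoted earlier):

```python
# Roll up the 8-node x 8-GPU H100 cluster totals from per-GPU specs.
NODES, GPUS_PER_NODE = 8, 8
H100 = {"mem_gb": 80, "bf16_tf": 989, "nvlink_gbs": 900}

total_gpus = NODES * GPUS_PER_NODE
total_mem_gb = total_gpus * H100["mem_gb"]
total_pflops = total_gpus * H100["bf16_tf"] / 1000
total_nvlink_tbs = total_gpus * H100["nvlink_gbs"] / 1000

print(total_gpus, "GPUs,", total_mem_gb, "GB GPU memory")
print(f"{total_pflops:.1f} PFLOPS BF16, {total_nvlink_tbs:.1f} TB/s NVLink aggregate")
```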
AMD GPU Interconnect (xGMI/Infinity Fabric)
- xGMI Links: 92GB/s per link, up to 8 links per GPU
- Infinity Fabric: Coherent memory access across CPU-GPU domain
- Network Integration: ROCm integration with InfiniBand and Ethernet
- Multi-GPU Scaling: Up to 8x MI250X or 4x MI300X per node
Intel GPU Fabric (Xe-Link)
- Xe-Link: 128GB/s bidirectional between Ponte Vecchio GPUs
- Mesh Topology: Direct connections in 4-GPU configurations
- oneAPI Integration: Unified programming model across CPU-GPU
🛠️ GPU Resource Management & Orchestration
🎛️ Kubernetes GPU Management
GPU Device Plugins Configuration
# NVIDIA GPU device plugin DaemonSet (deployable standalone or via the GPU Operator)
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1
        name: nvidia-device-plugin-ctr
        env:
        - name: MIG_STRATEGY
          value: "mixed"  # Support both MIG and full-GPU allocation
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
Multi-Instance GPU (MIG) Configuration
# Enable MIG mode on H100 GPUs
nvidia-smi -i 0 -mig 1
# Create MIG instances (example for H100 80GB)
nvidia-smi mig -cgi 1g.10gb,2g.20gb,3g.40gb -C
# Verify MIG configuration
nvidia-smi mig -lgip
# Kubernetes MIG resource requests
apiVersion: v1
kind: Pod
metadata:
  name: training-job-small
spec:
  containers:
  - name: pytorch-training
    image: nvcr.io/nvidia/pytorch:23.08-py3
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1  # Request one 1g.10gb MIG slice
        memory: "16Gi"
        cpu: "8"
      requests:
        nvidia.com/mig-1g.10gb: 1
        memory: "16Gi"
        cpu: "8"
  nodeSelector:
    accelerator: nvidia-h100-mig
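For capacity planning, the number of identical MIG slices an 8-GPU node can expose follows from the 7 compute slices per GPU. A rough Python sketch counting compute slices only — real MIG placement rules and memory budgets can reduce these counts:

```python
# Upper bound on identical MIG instances schedulable on one 8-GPU H100 node,
# counting compute slices only (7 per GPU); real placement rules are stricter.
SLICES_PER_GPU, GPUS_PER_NODE = 7, 8
profile_slices = {"1g.10gb": 1, "2g.20gb": 2, "3g.40gb": 3, "7g.80gb": 7}

def instances_per_node(profile):
    return (SLICES_PER_GPU // profile_slices[profile]) * GPUS_PER_NODE

for profile in profile_slices:
    print(f"{profile}: up to {instances_per_node(profile)} instances per node")
```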
Advanced GPU Scheduling
# Node affinity for GPU topology awareness
apiVersion: v1
kind: Pod
metadata:
  name: distributed-training
spec:
  containers:
  - name: training-worker
    image: nvcr.io/nvidia/pytorch:23.08-py3
    resources:
      limits:
        nvidia.com/gpu: 8  # Full 8-GPU node
    env:
    - name: NCCL_IB_DISABLE
      value: "0"  # Enable InfiniBand for NCCL
    - name: NCCL_NET_GDR_LEVEL
      value: "PIX"  # Enable GPUDirect RDMA
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values: ["gpu-h100-8x"]
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["zone-a"]  # Co-locate for low latency
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: distributed-training
          topologyKey: kubernetes.io/hostname
⚡ Performance Optimization & Monitoring
📊 GPU Performance Characteristics
Memory Optimization Strategies
- Memory Bandwidth Utilization: Target >80% of peak bandwidth for compute-bound workloads
- Tensor Core Optimization: Use BF16/FP16 data types for 2-4x performance improvement
- Memory Pooling: Pre-allocate memory pools to avoid allocation overhead
- Unified Memory: Use CUDA Unified Memory for large datasets exceeding GPU memory
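The ">80% of peak bandwidth" target connects to the roofline model: a kernel is bandwidth-bound whenever its arithmetic intensity (FLOPs per byte moved) falls below the machine's balance point, peak FLOPS divided by peak bandwidth. A sketch using the H100 SXM spec peaks quoted earlier:

```python
# Roofline break-even point: FLOPs per byte a kernel needs before it stops
# being memory-bandwidth-bound. H100 SXM spec-sheet peaks from above.
PEAK_BF16_FLOPS = 989e12   # 989 TFLOPS BF16
PEAK_BW_BYTES = 3.35e12    # 3.35 TB/s HBM3

balance = PEAK_BF16_FLOPS / PEAK_BW_BYTES  # ~295 FLOPs/byte

def is_bandwidth_bound(flops_per_byte):
    """True if arithmetic intensity is below the machine balance point."""
    return flops_per_byte < balance

print(f"machine balance: {balance:.0f} FLOPs/byte")
# Elementwise ops (~0.25 FLOPs/byte) are firmly bandwidth-bound, which is
# why they can only hit peak bandwidth, never peak FLOPS; large BF16 matmuls
# can exceed the balance point and become compute-bound.
print(is_bandwidth_bound(0.25), is_bandwidth_bound(500))
```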
CUDA Performance Analysis
# GPU utilization monitoring
nvidia-smi dmon -s puct -d 1
# Detailed profiling with Nsight Systems
nsys profile --stats=true python training_script.py
# Memory usage analysis
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
# GPU topology information
nvidia-smi topo -m
# NVLink utilization monitoring
nvidia-smi nvlink --status
nvidia-smi nvlink -gt d # Get NVLink throughput data
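The CSV query output above is easy to post-process for dashboards or alerts. A small parser using Python's `csv` module; the sample string is hypothetical output, with field names matching the query flags used above:

```python
import csv
import io

# Hypothetical `nvidia-smi --query-gpu=... --format=csv` output for two GPUs.
sample = """memory.total [MiB], memory.used [MiB], memory.free [MiB]
81559 MiB, 40120 MiB, 41439 MiB
81559 MiB, 1024 MiB, 80535 MiB
"""

def parse_smi_csv(text):
    """Parse nvidia-smi CSV output into a list of {field: int} dicts."""
    rows = list(csv.reader(io.StringIO(text)))
    header = [h.strip() for h in rows[0]]
    return [dict(zip(header, (int(v.strip().split()[0]) for v in row)))
            for row in rows[1:] if row]

for gpu in parse_smi_csv(sample):
    used_pct = 100 * gpu["memory.used [MiB]"] / gpu["memory.total [MiB]"]
    print(f"{used_pct:.1f}% memory used")
```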
Multi-GPU Communication Optimization
- NCCL Tuning: Optimize collective operations for distributed training
- NVLink Utilization: Maximize intra-node GPU-GPU bandwidth
- InfiniBand Optimization: Tune network settings for multi-node scaling
- Gradient Compression: Reduce communication overhead in distributed training
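The communication cost these tunings target is well approximated by the ring allreduce formula: each GPU sends 2(N-1)/N times the gradient size per step, so per-GPU traffic approaches a constant 2x the payload as the GPU count grows. A sketch (the 7B-parameter BF16 payload is an illustrative assumption):

```python
# Per-GPU bytes sent in a ring allreduce of size_bytes across n GPUs:
# 2 * (n - 1) / n * size_bytes (reduce-scatter phase + all-gather phase).
def ring_allreduce_bytes_per_gpu(size_bytes, n):
    return 2 * (n - 1) / n * size_bytes

GRAD_BYTES = 7e9 * 2  # e.g. 7B parameters in BF16 (2 bytes each) -- an assumption
for n in (2, 8, 64):
    gb = ring_allreduce_bytes_per_gpu(GRAD_BYTES, n) / 1e9
    print(f"{n:3d} GPUs: {gb:.1f} GB per GPU per allreduce")
```

This is why gradient compression pays off: the per-GPU volume barely shrinks with scale, so reducing the payload itself is the main lever.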
NCCL Performance Tuning
# NCCL environment variables for optimization
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0 # Enable InfiniBand
export NCCL_NET_GDR_LEVEL=PIX # Enable GPUDirect RDMA
export NCCL_IB_HCA=mlx5_0,mlx5_1 # Specify InfiniBand adapters
export NCCL_SOCKET_IFNAME=ib0,ib1 # Network interfaces
# Topology-aware communication
export NCCL_TOPO_FILE=/opt/nccl-topology.xml
export NCCL_TREE_THRESHOLD=0 # Message-size cutoff for the tree algorithm; 0 disables tree
export NCCL_ALGO=Ring # Newer NCCL: select the algorithm explicitly instead
# NCCL test for bandwidth measurement
/opt/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 8
🌡️ Power & Thermal Management
⚡ Power Consumption & Efficiency
GPU Power Profiles
- NVIDIA H100 SXM: 700W TGP, dynamic power scaling based on workload
- NVIDIA A100 SXM: 400W TGP, enterprise efficiency optimizations
- NVIDIA L4: 72W TGP, inference-optimized power consumption
- AMD MI250X: 560W TGP, advanced power management features
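Dividing the BF16 peaks quoted earlier by these TGPs gives a rough efficiency ranking. Note the caveats: these are nominal peaks (a mix of dense and sparsity-inclusive figures as quoted in this document), not measured draw under load:

```python
# BF16 TFLOPS per watt at the quoted TGPs (nominal peaks, not measured draw).
gpus = {
    "H100 SXM": (989, 700),
    "A100 SXM": (624, 400),
    "L4":       (242, 72),
    "MI250X":   (383, 560),
}

def tflops_per_watt(name):
    tf, watts = gpus[name]
    return tf / watts

for name in gpus:
    print(f"{name:9s} {tflops_per_watt(name):.2f} TFLOPS/W")
```

The inference-optimized L4 leads by a wide margin, which is the efficiency argument for dense L4 serving nodes.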
Power Management Commands
# GPU power monitoring
nvidia-smi --query-gpu=power.draw,power.limit --format=csv -l 1
# Set power limits (requires admin privileges)
nvidia-smi -pl 400 # Set power limit to 400W
# Power efficiency monitoring
nvidia-smi --query-gpu=power.draw,utilization.gpu --format=csv -l 1
# Temperature and throttling monitoring
nvidia-smi --query-gpu=temperature.gpu,temperature.memory,clocks_throttle_reasons.hw_slowdown --format=csv -l 1
Cooling System Design
- Liquid Cooling: Direct-to-chip liquid cooling for high-density GPU deployments
- Airflow Management: Hot-aisle/cold-aisle containment; cold-aisle inlet typically 18-27°C (ASHRAE recommended), with hot-aisle exhaust reaching 35-40°C
- Thermal Throttling: Automatic frequency reduction at 83°C+ to prevent damage
- Fan Curves: Dynamic fan speed control based on GPU temperature and workload