S9334: Building And Managing Scalable AI Infrastructure With NVIDIA DGX Pod And DGX Pod Management Software



slide-1
SLIDE 1

S9334: Building And Managing Scalable AI Infrastructure With NVIDIA DGX Pod And DGX Pod Management Software

slide-2
SLIDE 2

2

Agenda

  • Building your AI Data Center with DGX Reference Architectures
  • Creating Network Topologies
  • DGX POD Management Software (DeepOps)
slide-3
SLIDE 3

3

Why Infrastructure Matters to AI

slide-4
SLIDE 4

4

Considerations For On-prem Vs Cloud

Keep Compute Where the Data Lives

CLOUD

  • Early exploration
  • Small datasets in cloud
  • Few experiments
  • Careful prep for each run to save costs

ON-PREM

  • Deep Learning Enterprise
  • Large datasets on-premises
  • Frequent, rapid experimentation
  • Creative exploration, frequent training runs
  • Fixed cost infrastructure = experiment freely

TRAIN CLOSEST TO WHERE YOUR DATA LIVES

  ✓ Data Sovereignty and Security
  ✓ Lowest Cost per Training Run
  ✓ Fail fast, learn FASTER

slide-5
SLIDE 5

5

AI Adopters Impeded By Infrastructure

  • AI Boosts Profit Margins up to 15%
  • 40% see infrastructure as impeding AI

source: 2018 CTA Market Research

slide-6
SLIDE 6

6

Considerations When Selecting An AI Platform

slide-7
SLIDE 7

7

AI Platform Considerations

Factors impacting deep learning platform decisions

TOTAL COST OF OWNERSHIP
“I have limited budget, need lowest up-front cost possible”

SCALING PERFORMANCE
“I want the most GPU bang for the buck”

DEVELOPER PRODUCTIVITY
“Must get started now, line of business wants to deliver results yesterday”

slide-8
SLIDE 8

8

Comparing AI Compute Alternatives

Looking beyond the “spec sheet”

Evaluation Criteria

  • AI/DL Expertise & Innovation
  • AI/DL Software Stack
  • Operating System Image
  • Hardware Architecture

slide-9
SLIDE 9

9

The Value Of AI Infrastructure With DGX Reference Architectures

Reference architectures from NVIDIA and leading storage partners

Simplified, validated, converged infrastructure offers:

  • SCALABLE PERFORMANCE
  • FASTER, SIMPLIFIED DEPLOYMENT
  • TRUSTED EXPERTISE AND SUPPORT

Available through select NPN partners as a turnkey solution

DGX RA Solution

Storage

slide-10
SLIDE 10

10

Simplifying Deployment

slide-11
SLIDE 11

11

AI Success Delayed By Deployment Complexity

Designing, Building and Supporting an AI Infrastructure from Scratch

[Timeline figure, Day 1 to Month 3. Stages shown: Study & exploration; Platform design; HW & SW integration; Troubleshooting; Software engineering; Software optimization; Design and build for scale; Software re-optimization; Productive experimentation; Training at scale; Insights. Cost labels: CAPEX, OPEX. Annotation: “DIY” TCO, time and budget spent on things other than data science.]

slide-12
SLIDE 12

12

The Impact Of DGX R/A Solutions On Timeline

2. Deploying an Integrated, Full-Stack AI Solution using a DGX Reference Architecture

[Timeline figure, Day 1 to Month 3. Stages shown: Study & exploration; Platform design; Install and deploy DGX RA solution; Troubleshooting; Software engineering; Software optimization; Design and build for scale; Software re-optimization; Productive experimentation; Training at scale; Insights. Cost labels: CAPEX, “DIY” TCO, DGX TCO. Annotations: deployment cycle shortened; wasted time/effort eliminated.]

slide-13
SLIDE 13

13

The Impact Of DGX R/A Solutions On Timeline

2. Deploying an Integrated, Full-Stack AI Solution using a DGX Reference Architecture

[Timeline figure, Day 1 to Week 1. Stages shown: Study & exploration; Install and deploy DGX RA solution; Productive experimentation; Training at scale; Insights. Cost labels: CAPEX, “DIY” TCO, DGX TCO.]

slide-14
SLIDE 14

14

Supporting AI Infrastructure

slide-15
SLIDE 15

15

Supporting AI: Alternative Approaches

Installed/running
Problem!

“My PyTorch CNN model is running 30% slower than yesterday!”
“OK, let me look into it” (IT Admin)

slide-16
SLIDE 16

16

Supporting AI: Alternative Approaches

Installed/running
Problem!

Multiple paths to problem resolution:

  • Framework? Libraries? O/S? GPU? Drivers? Server? Network? Storage?
  • Open source / forum
  • Server, Storage & Network Solution Providers

slide-17
SLIDE 17

17

Supporting AI With DGX Reference Architecture Solutions

Problem!
“My PyTorch CNN model is running 30% slower than yesterday!”

IT Admin
AI Expertise, NPN Partner

“Update to PyTorch container XX.XX”
Running!

DGX RA Solution
Storage

slide-18
SLIDE 18

18

Creating Networking Topologies

slide-19
SLIDE 19

19


DGX-1 POD Storage Partner Solutions

slide-20
SLIDE 20

20


NetApp ONTAP AI

Simplify, Accelerate, and Scale the Data Pipeline for Deep Learning

HARDWARE

  • NVIDIA DGX-1 | 5x DGX-1 Systems | 5 PFLOPS
  • NETAPP AFF A800 | HA Pair | 364TB | 1M IOPS
  • CISCO | 2x 100Gb Ethernet Switches with RDMA

SOFTWARE

  • NVIDIA GPU CLOUD DEEP LEARNING STACK | NVIDIA Optimized Frameworks
  • NETAPP ONTAP 9 | Simplified Data Management
  • TRIDENT | Provision Persistent Storage for DL

SUPPORT

  • Single point of contact support
  • Proven support model

slide-21
SLIDE 21

21

NetApp Network Switch Port Configuration

slide-22
SLIDE 22

22


slide-23
SLIDE 23

23

slide-24
SLIDE 24

24

slide-25
SLIDE 25

25

NetApp VLAN Connectivity for DGX-1 Servers and Storage System Ports

slide-26
SLIDE 26

26

NetApp Storage System Configuration

slide-27
SLIDE 27

27

NetApp Host Configuration

slide-28
SLIDE 28

28


AIRI: AI-Ready Infrastructure

Extending the power of DGX-1 at-scale in every enterprise

HARDWARE

  • NVIDIA DGX-1 | 4x DGX-1 Systems | 4 PFLOPS
  • PURE FLASHBLADE™ | 15x 17TB Blades | 1.5M IOPS
  • CISCO or ARISTA | 2x 100Gb Ethernet Switches with RDMA

SOFTWARE

  • NVIDIA GPU CLOUD DEEP LEARNING STACK | NVIDIA Optimized Frameworks
  • AIRI SCALING TOOLKIT | Multi-node Training Made Simple

slide-29
SLIDE 29

29

Pure Storage Network Topology

slide-30
SLIDE 30

Rack Design & Builds

slide-31
SLIDE 31

31


DDN A3I with DGX-1

Making AI-Powered Innovation Easier

HARDWARE

  • NVIDIA DGX-1 | 4x DGX-1 Systems | 4 PFLOPS
  • DDN AI200, AI7990 | 20GB/s | from 30TB | 350K IOPS
  • NETWORK: 2x EDR IB or 100GbE Switches with RDMA

SOFTWARE

  • NVIDIA GPU CLOUD DEEP LEARNING STACK | NVIDIA Optimized Frameworks
  • DDN: High performance, low latency, parallel file system
  • DDN: In-container client for easy deployment, efficiency, performance and reliability

slide-32
SLIDE 32

DDN A3I Reference Architecture 9:1 Configuration

slide-33
SLIDE 33

Network Diagram of DDN A3I Benchmark Testing Environment

slide-34
SLIDE 34

Optimized Data Delivery for DGX-1 server with DDN A3I

slide-35
SLIDE 35

DDN Network Diagram of Port-Level Connectivity

1:1 configuration

slide-36
SLIDE 36

DDN Network Diagram of Port-Level Connectivity

4:1 configuration

slide-37
SLIDE 37

DDN Network Diagram of Port-Level Connectivity

9:1 configuration

slide-38
SLIDE 38

38

DGX POD Management Software (DeepOps)

slide-39
SLIDE 39

39

You've Got A Shiny New DGX POD!

What now?

Other important considerations:

  • OS
  • Cluster Deployment & Maintenance
  • Network Storage
  • Firmware
  • Monitoring
  • Security / Access
  • Job Scheduling
  • Airgap
  • NGC Containers

slide-40
SLIDE 40

40

DeepOps

What is it?

  • Open-source project to facilitate deployment of multi-node GPU clusters for Deep Learning and HPC environments, in an on-premise, optionally air-gapped datacenter or in the cloud
  • DeepOps is also recognized as the DGX POD Management Software
  • The modular nature of the project also allows more experienced administrators to pick and choose items that may be useful, making the process compatible with their existing software or infrastructure
  • GitHub: https://github.com/NVIDIA/deepops

Note: You can use DeepOps to configure any NVIDIA GPU-Accelerated platform (and not just DGX servers).

slide-41
SLIDE 41

41

DeepOps: Components

Building out your GPU cluster

Automated Provisioning

  • PXE Server for OS installation across cluster
  • Automated configuration management

Docker registry

  • Deployment of internal registry
  • Automated mirroring of NGC containers

Monitoring

  • DCGM
  • Prometheus
  • Grafana

Package repository

  • Deployment of internal Apt-repository
  • Mirror packages for air-gapped environments

Logging

  • Filebeat
  • Elasticsearch
  • Kibana

Firmware management

  • Automated, cluster-wide firmware management

Job Scheduling

  • Kubernetes
  • Slurm

slide-42
SLIDE 42

42

Here’s What We’ll Build Today

To cluster and beyond!

Provisioning node:

  • Orchestrates the initial setup of the cluster

Management node(s):

  • Used for cluster management

GPU Compute node(s):

  • For high-performance compute workloads

Network Storage

DeepOps:

  • Prepare provisioning node
  • Prepare management node(s)
  • Deploy Kubernetes on management node(s)
  • Deploy basic services on management node(s)
  • Provision compute node(s)
  • Deploy additional services on management node(s)
  • Deploy Kubernetes on compute node(s)
  • Run GPU-Accelerated jobs
  • Deploy Kubeflow

slide-43
SLIDE 43

43

Architectural Considerations

slide-44
SLIDE 44

44

ARCHITECTURE

Building Multi-node GPU Clusters with DeepOps

  • Odd number of CPU-only management nodes
      ○ Required for etcd key-value store
  • 1x CPU-only login node
  • 1/10Gb Ethernet control & management networks
      ○ Management, connectivity, command & control
  • Fully non-blocking fat-tree 100Gb EDR InfiniBand topology
      ○ Use the biggest EDR IB core switch that fits

[Architecture diagram labels: Mgmt. node(s); Login node(s); Storage; Compute node(s) (Slurm nodes, Kubernetes nodes); Datacenter Network; Management / Communication: 1/10Gb Ethernet, 100Gb EDR InfiniBand / RoCE]
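
The odd-number recommendation follows from how etcd reaches consensus: a cluster of n members needs a majority, floor(n/2) + 1, to keep accepting writes, so an even member count adds a machine without adding failure tolerance. A minimal sketch of that arithmetic (the function names are illustrative and not part of DeepOps):

```shell
# Majority needed for an etcd cluster of N members to accept writes.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster still has quorum.
fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
    echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

Three and four management nodes both tolerate a single failure, which is why three or five is the usual choice.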

slide-45
SLIDE 45

45

ARCHITECTURE

Benefits

  • Augment legacy HPC schedulers with new features:
      ○ Cluster management services
      ○ Jupyter notebooks
      ○ Deep learning inference deployments (TensorRT)
  • Keep data in the same place, no need to have separate clusters
  • Free up deep learning researchers to do DL, not become devsecops/sysadmin

slide-46
SLIDE 46

46

ARCHITECTURE

What we’ll cover today


slide-47
SLIDE 47

47

Prepare Provisioning Node

slide-48
SLIDE 48

48

Clone the DeepOps Project on the Provisioning Server

GitHub: https://github.com/NVIDIA/deepops
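
As a rough sketch, the clone step could look like the following. Only the repository URL comes from the slide; the target directory and the `clone_deepops` helper are hypothetical, and the helper skips the clone when a checkout already exists so that it can be re-run safely.

```shell
DEEPOPS_URL="https://github.com/NVIDIA/deepops"
DEEPOPS_DIR="${DEEPOPS_DIR:-deepops}"   # hypothetical target directory

clone_deepops() {
    # Skip the clone when the checkout already exists, so re-running is harmless.
    if [ -d "$DEEPOPS_DIR/.git" ]; then
        echo "already cloned: $DEEPOPS_DIR"
    else
        git clone "$DEEPOPS_URL" "$DEEPOPS_DIR"
    fi
}

# On the provisioning node (needs git and network access):
#   clone_deepops && cd "$DEEPOPS_DIR"
```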

slide-49
SLIDE 49

49

Prep the provisioning node:

  1. Install software dependencies
  2. Set up config directory

slide-50
SLIDE 50

50

Our Provisioning node now has the necessary software.

But how will it orchestrate the cluster deployment?


slide-51
SLIDE 51

51

Automation: Ansible

  • Open-source automation and configuration management tool
  • Agentless (nothing to install on target nodes)
  • Idempotent (run the same task over and over again without any repercussions)
  • Easier to maintain & scale than custom scripts
  • Playbooks use YAML: easy to learn and read

[Diagram: the Ansible management server (provisioning node) holds the inventory file (where) and the playbook(s) (.yml) (what), and reaches the target nodes over SSH/API]
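
To make the idempotency point concrete, here is a small shell sketch of an operation that is safe to repeat. It is similar in spirit to what Ansible's `lineinfile` module provides but is not Ansible's implementation; the `ensure_line` helper and the sample setting are hypothetical.

```shell
# Append LINE to FILE only if it is not already present (exact, whole-line match).
ensure_line() {
    file=$1
    line=$2
    touch "$file"
    if ! grep -qxF -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

cfg=$(mktemp)
ensure_line "$cfg" "nameserver 10.0.0.2"   # hypothetical setting
ensure_line "$cfg" "nameserver 10.0.0.2"   # second run changes nothing
wc -l < "$cfg"                             # the file still holds a single line
```

A plain `echo ... >> file` would add a duplicate line on every run; checking the desired state first is what lets the same task run repeatedly without side effects.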

slide-52
SLIDE 52

52

Prep the provisioning node:

  3. Configure Ansible inventory
  4. Configure cluster parameters
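
For orientation, an INI-style Ansible inventory groups hosts under bracketed section names. The sketch below writes a sample file and counts the hosts per group without requiring Ansible. The host names, addresses, user, and group names are all hypothetical; DeepOps defines its own group names in its example configuration.

```shell
inventory=$(mktemp)

cat > "$inventory" <<'EOF'
# Hypothetical host names, addresses, and group names.
[management]
mgmt01 ansible_host=10.0.0.11
mgmt02 ansible_host=10.0.0.12
mgmt03 ansible_host=10.0.0.13

[compute]
dgx01 ansible_host=10.0.0.21
dgx02 ansible_host=10.0.0.22

[all:vars]
ansible_user=ubuntu
EOF

# Count hosts in one group without needing Ansible installed.
count_hosts() {
    awk -v group="[$1]" '
        /^\[/ { in_group = ($0 == group); next }
        in_group && NF && $0 !~ /^#/ { n++ }
        END { print n + 0 }
    ' "$inventory"
}

echo "management nodes: $(count_hosts management)"
echo "compute nodes:    $(count_hosts compute)"
```

With Ansible installed, `ansible -i "$inventory" all -m ping` is a quick way to confirm that every listed host is reachable over SSH.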

slide-53
SLIDE 53

53

Provisioning node:

  • Orchestrates the initial setup of the cluster

Management node(s):

  • Used for cluster management

GPU Compute node(s):

  • For high-performance compute workloads

Network Storage

  • Prepare provisioning node

slide-54
SLIDE 54

54

Prepare Management Node(s)

Prepare Provisioning Node

slide-55
SLIDE 55

55

Prep management node(s):

  1. Bootstrap management node(s) with Ansible

slide-56
SLIDE 56

56

  • We will deploy a bunch of cluster services on our Management node(s). For example:
      ○ Monitoring
      ○ Logging
      ○ PXE server
      ○ ...
  • How can we make the services resilient, and easy to deploy & manage?

slide-57
SLIDE 57

57

Kubernetes

What is it?

  • Think of Kubernetes as an operating system for a cluster
  • The cluster’s servers can be on-prem, in the public cloud, or a mix (hybrid)
  • Use Kubernetes to manage nodes in the cluster, administer user access, launch jobs as containers, expose running services externally, and more

[Diagram: a cluster of DGX systems alongside a cloud service provider (CSP)]

slide-58
SLIDE 58

58

Kubernetes

How does it work?

[Diagram: the UI (user interface) and CLI (command-line interface) talk to the API on the master node, which manages three worker nodes]

Create job (Pod). Job definition:

  1. Use CUDA container
  2. Run nvidia-smi command
  3. Require 1 GPU

Worker node 0 meets requirements… deploy there.

Scheduled, fetch CUDA container, etc.
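
The three-part job definition above maps fairly directly onto a Pod manifest. The sketch below only writes the manifest and prints the command to submit it. The Pod name and image tag are assumptions, and the `nvidia.com/gpu` resource is only schedulable on clusters where the NVIDIA device plugin is running.

```shell
manifest=$(mktemp)

cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test               # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:10.0-base   # 1. use a CUDA container (tag is an assumption)
      command: ["nvidia-smi"]        # 2. run the nvidia-smi command
      resources:
        limits:
          nvidia.com/gpu: 1          # 3. require 1 GPU
EOF

echo "Manifest written to $manifest"
echo "Submit with: kubectl apply -f $manifest"
```

The scheduler places the Pod on a worker that advertises at least one free GPU, and that node then pulls the image and runs the command.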

slide-59
SLIDE 59

59

Deploy Kubernetes on Management Node(s)

Prepare Provisioning Node Prepare Management Nodes

slide-60
SLIDE 60

60

Prep management node(s):

  2. Deploy Kubernetes on Management node(s)

slide-61
SLIDE 61

61

Deploy Basic Cluster Services

Prepare Provisioning Node Prepare Management Nodes Kubernetes Management Nodes

slide-62
SLIDE 62

62

What are those initial basic services?

  • Firmware Management for DGX-Servers
  • Internal Package Repository
  • Internal Docker Registry
  • Scheduler
  • ...
  • Optional: DGXIE (DHCP, DNS, PXE-Server) for OS management

slide-63
SLIDE 63

63

Deploying Internal Package Repository

For Air-Gapped Environments

$ kubectl apply -f services/apt.yml

slide-64
SLIDE 64

64

Deploying Internal Docker Registry

For Air-Gapped Environments

$ helm repo add stable https://kubernetes-charts.storage.googleapis.com
$ helm install --values config/registry.yml stable/docker-registry --version 1.4.3
$ ansible-playbook -k ansible/playbooks/docker.yml
$ docker pull busybox:latest
$ docker tag busybox:latest registry.local/busybox
$ docker push registry.local/busybox

The container registry will be available to nodes in the cluster at registry.local
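
The same pull, tag, and push pattern applies when mirroring an NGC container into the internal registry. The sketch below only derives the mirrored name and prints the commands. The `mirror_name` helper is hypothetical, the image tag is an assumed example, and the helper assumes the source reference either starts with a registry host or contains no slash at all.

```shell
# Map an upstream image reference to its name in the internal registry
# by replacing everything up to the first "/" (the registry host).
mirror_name() {
    echo "registry.local/${1#*/}"
}

src="nvcr.io/nvidia/pytorch:19.02-py3"   # example NGC image; tag is an assumption
dst=$(mirror_name "$src")

# Print the commands instead of running them, so this is safe to try anywhere.
echo "docker pull $src"
echo "docker tag  $src $dst"
echo "docker push $dst"
```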

slide-65
SLIDE 65

65

Provisioning node:

  • Orchestrates the initial setup of the cluster

Management node(s):

  • Used for cluster management

GPU Compute node(s):

  • For high-performance compute workloads

Network Storage

  • Prepare provisioning node
  • Prepare management node(s)
  • Deploy Kubernetes on management node(s)
  • Deploy basic services on management node(s)

slide-66
SLIDE 66

66

Provision Compute Node(s)

Prepare Provisioning Node Prepare Management Nodes Kubernetes Management Nodes Basic Cluster Services

slide-67
SLIDE 67

67

Provisioning Compute Nodes

Options

  • Manual:
      ○ BMC Remote Console
      ○ Bootable USB
  • Automated:
      ○ Third-party tools:
          ■ Foreman
          ■ MAAS
          ■ …
      ○ DeepOps:
          ■ OS installation container
          ■ Detailed steps available on GitHub: https://github.com/NVIDIA/deepops/

slide-68
SLIDE 68

68

Provision compute node(s): With the OS installed, we’re ready to configure our nodes

slide-69
SLIDE 69

69

Provisioning node:

  • Orchestrates the initial setup of the cluster

Management node(s):

  • Used for cluster management

Compute node(s):

  • Used for computing stuff

Network Storage

  • Prepare provisioning node
  • Prepare management node(s)
  • Deploy Kubernetes on management node(s)
  • Deploy basic services on management node(s)
  • Provision compute node(s)

slide-70
SLIDE 70

70

Deploy Additional Services

Prepare Provisioning Node Prepare Management Nodes Kubernetes Management Nodes Basic Cluster Services Provision Compute Nodes

slide-71
SLIDE 71

71

Other services:

  • Monitoring
  • Logging
  • Ingress controller
  • Authentication
  • ...

See the DeepOps project for instructions: https://github.com/NVIDIA/deepops

slide-72
SLIDE 72

72

Deploy additional services:

  1. Deploy monitoring service

slide-73
SLIDE 73

73

Deploy Kubernetes on Compute Node(s)

Prepare Provisioning Node Prepare Management Nodes Kubernetes Management Nodes Basic Cluster Services Provision Compute Nodes Additional Cluster Services

slide-74
SLIDE 74

74

But why Kubernetes?

slide-75
SLIDE 75

75

Many users, single node (on-prem)

  • DGX Station: direct access through SSH

Many users, single node (on-prem)

  • DGX Station behind a scheduler

Many users, many nodes (on-prem)

  • Multiple DGX systems behind a scheduler

With a scheduler:

  • Users submit jobs to scheduler & retrieve results (non-interactive; batch)
  • Users request a limited interactive session through the scheduler

slide-76
SLIDE 76

76

Scheduler

Kubernetes: basic scheduling features

  • Share nodes, schedule jobs for GPUs on a node (current best solution: Excel spreadsheet)
  • Covers data permissions and security (LDAP, file permissions)
  • Adds analytics and monitoring (important also for justification of purchase)

SLURM: advanced scheduling features

  • Multi-node jobs
  • Job dependencies, workflows, DAGs
  • Advanced reservations
  • Intelligent scheduling (not just FIFO)
  • … Other HPC-like scheduling functionality
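
As an illustration of the multi-node and job-dependency features, the sketch below writes a Slurm batch script and prints the commands that would submit it. The job name, `train.py`, and `evaluate.sh` are placeholders, and the GPU count assumes 8 GPUs per node, as on a DGX-1.

```shell
job=$(mktemp)

cat > "$job" <<'EOF'
#!/bin/bash
#SBATCH --job-name=train-multinode   # hypothetical job name
#SBATCH --nodes=2                    # multi-node job
#SBATCH --ntasks-per-node=8          # one task per GPU (assumes 8 GPUs per node)
#SBATCH --gres=gpu:8                 # GPUs requested on each node
#SBATCH --time=04:00:00

srun python train.py                 # train.py is a placeholder
EOF

echo "Submit:            sbatch $job"
echo "Chain a follow-up: sbatch --dependency=afterok:<jobid> evaluate.sh"
```

The `--dependency=afterok:<jobid>` form starts the second job only after the first one finishes successfully, which is how simple workflows are chained.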

slide-77
SLIDE 77

77

Deploy additional services:

  1. Deploy Kubernetes on Compute node(s)

slide-78
SLIDE 78

78

Prepare Provisioning Node Prepare Management Nodes Kubernetes Management Nodes Basic Cluster Services Provision Compute Nodes Additional Cluster Services Kubernetes on Compute Nodes

slide-79
SLIDE 79

79

Submitting jobs to compute nodes

Submit a GPU-enabled job to the GPU compute node(s)

slide-80
SLIDE 80

80

Provisioning node:

  • Orchestrates the initial setup of the cluster

Management node(s):

  • Used for cluster management

GPU Compute node(s):

  • For high-performance compute workloads

Network Storage

  • Prepare provisioning node
  • Prepare management node(s)
  • Deploy Kubernetes on management node(s)
  • Deploy basic services on management node(s)
  • Provision compute node(s)
  • Deploy additional services on management node(s)
  • Deploy Kubernetes on compute node(s)
  • Deploy Kubeflow

slide-81
SLIDE 81

81

DeepOps: Components

Summary

Automated Provisioning

  • PXE Server for OS installation across cluster
  • Automated configuration management

Docker registry

  • Deployment of internal registry
  • Automated mirroring of NGC containers

Monitoring

  • DCGM
  • Prometheus
  • Grafana

Package repository

  • Deployment of internal Apt-repository
  • Mirror packages for air-gapped environments

Logging

  • Filebeat
  • Elasticsearch
  • Kibana

Firmware management

  • Automated, cluster-wide firmware management

Job Scheduling

  • Kubernetes
  • Slurm

slide-82
SLIDE 82