S9334: Building And Managing Scalable AI Infrastructure With NVIDIA - PowerPoint PPT Presentation

S9334: Building And Managing Scalable AI Infrastructure With NVIDIA DGX Pod And DGX Pod Management Software

Agenda
• Building your AI Data Center with DGX Reference Architectures
• Creating Network Topologies
• DGX POD Management Software (DeepOps)

Why Infrastructure Matters to AI

Considerations For On-prem Vs Cloud
Keep Compute Where the Data Lives: train closest to where your data lives
CLOUD
• Early exploration
• Small datasets in cloud
• Few experiments
• Careful prep for each run to save costs
ON-PREM
• Deep Learning Enterprise
• Large datasets on-premises
• Frequent, rapid experimentation
• Creative exploration, frequent training runs
• Fixed cost infrastructure = experiment freely
Benefits
✓ Data Sovereignty and Security
✓ Lowest Cost per Training Run
✓ Fail fast, learn FASTER

AI Adopters Impeded By Infrastructure
• AI boosts profit margins up to 15%
• 40% see infrastructure as impeding AI
Source: 2018 CTA Market Research

Considerations When Selecting An AI Platform

AI Platform Considerations
Factors impacting deep learning platform decisions
• TOTAL COST OF OWNERSHIP: “I have limited budget, need lowest up-front cost possible”
• DEVELOPER PRODUCTIVITY: “Must get started now, line of business wants to deliver results yesterday”
• SCALING PERFORMANCE: “I want the most GPU bang for the buck”

Comparing AI Compute Alternatives
Looking beyond the “spec sheet”
Evaluation Criteria
• AI/DL Expertise & Innovation
• AI/DL Software Stack
• Operating System Image
• Hardware Architecture

The Value Of AI Infrastructure With DGX Reference Architectures
• SCALABLE PERFORMANCE: Reference architectures from NVIDIA and leading storage partners
• FASTER, SIMPLIFIED DEPLOYMENT: Simplified, validated, converged infrastructure offers
• TRUSTED EXPERTISE AND SUPPORT: Available through select NPN partners as a turnkey solution
Diagram labels: DGX RA Solution, Storage

Simplifying Deployment

AI Success Delayed By Deployment Complexity
“DIY” TCO: time and budget spent on things other than data science
Cost categories: CAPEX, OPEX
Timeline: Day 1 to Month 3
Stages shown on the timeline:
• Study & exploration
• Platform Design
• HW & SW Integration
• Troubleshooting
• Software eng’g
• Software optimization
• Productive Experimentation
• Design and Build for Scale
• Software re-optimization
• Training at Scale
• Insights
Caption: Designing, Building and Supporting an AI Infrastructure, from Scratch

The Impact Of DGX R/A Solutions On Timeline
“DIY” TCO: wasted time/effort eliminated
DGX TCO: deployment cycle shortened
Cost category: CAPEX
Timeline: Day 1 to Month 3
Stages shown on the timeline:
• Study & exploration
• Install and Deploy DGX RA Solution
• Platform Design
• Troubleshooting
• Software eng’g
• Software optimization
• Productive Experimentation
• Design and Build for Scale
• Software re-optimization
• Training at Scale
• Insights
Caption: 2. Deploying an Integrated, Full-Stack AI Solution using a DGX Reference Architecture

The Impact Of DGX R/A Solutions On Timeline
“DIY” TCO compared with DGX TCO
Cost category: CAPEX
Timeline: Day 1 to Week 1
Stages shown on the timeline:
• Study & exploration
• Install and Deploy DGX RA Solution
• Productive Experimentation
• Training at Scale
• Insights
Caption: 2. Deploying an Integrated, Full-Stack AI Solution using a DGX Reference Architecture

Supporting AI Infrastructure

Supporting AI: Alternative Approaches
• System state: Installed/running, then Problem!
• Reported problem: “My PyTorch CNN model is running 30% slower than yesterday!”
• IT Admin: “OK let me look into it”

Supporting AI: Alternative Approaches
Multiple paths to problem resolution
• System state: Installed/running, then Problem!
• Framework? Libraries? Open source / forum
• O/S? Open source / forum
• GPU? Drivers?
• Server? Network? Storage? Server, Storage & Network Solution Providers

Supporting AI With DGX Reference Architecture Solutions
• Reported problem: “My PyTorch CNN model is running 30% slower than yesterday!”
• Support path: IT Admin, NPN Partner, AI Expertise
• Resolution: “Update to PyTorch container XX.XX”
• DGX RA Solution with Storage: Problem!, then Running!

Creating Networking Topologies

DGX-1 POD Storage Partner Solutions

NetApp ONTAP AI
Simplify, Accelerate, and Scale the Data Pipeline for Deep Learning
HARDWARE
• NVIDIA DGX-1 | 5 x DGX-1 Systems | 5 PFLOPS
• NETAPP AFF A800 | HA Pair | 364TB | 1M IOPS
• CISCO | 2x 100Gb Ethernet Switches with RDMA
SOFTWARE
• NVIDIA GPU CLOUD DEEP LEARNING STACK | NVIDIA Optimized Frameworks
• NETAPP ONTAP 9 | Simplified Data Management
• TRIDENT | Provision Persistent Storage for DL
SUPPORT
• Single point of contact support
• Proven support model

NetApp Network Switch Port Configuration

NetApp VLAN Connectivity for DGX-1 Servers and Storage System Ports

NetApp Storage System Configuration

NetApp Host Configuration

AIRI: AI-Ready Infrastructure
Extending the power of DGX-1 at-scale in every enterprise
HARDWARE
• NVIDIA DGX-1 | 4x DGX-1 Systems | 4 PFLOPS
• PURE FLASHBLADE™ | 15x 17TB Blades | 1.5M IOPS
• CISCO or ARISTA | 2x 100Gb Ethernet Switches with RDMA
SOFTWARE
• NVIDIA GPU CLOUD DEEP LEARNING STACK | NVIDIA Optimized Frameworks
• AIRI SCALING TOOLKIT | Multi-node Training Made Simple

Pure Storage Network Topology

Rack Design & Builds

DDN A3I with DGX-1
Making AI-Powered Innovation Easier
HARDWARE
• NVIDIA DGX-1 | 4 x DGX-1 Systems | 4 PFLOPS
• DDN AI200, AI7990 | 20GB/s | from 30TB | 350K IOPS
• NETWORK: 2 x EDR IB or 100GbE Switches with RDMA
SOFTWARE
• NVIDIA GPU CLOUD DEEP LEARNING STACK | NVIDIA Optimized Frameworks
• DDN: High performance, low latency, parallel file system
• DDN: In-container client for easy deployment, efficiency, performance and reliability
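The three storage partner configurations in this section quote different system counts and headline numbers, so a per-DGX-1 normalization can make them easier to read side by side. The sketch below uses only the figures printed on the slides (system count, PFLOPS, storage IOPS); the `PodConfig` class and its field names are hypothetical, and the quoted IOPS are vendor headline figures, so the result is simple arithmetic rather than a benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodConfig:
    """Headline figures copied from one reference-architecture slide."""
    name: str
    dgx1_systems: int
    total_pflops: float
    storage_iops: int

    @property
    def pflops_per_system(self) -> float:
        # Aggregate compute divided evenly across the DGX-1 systems.
        return self.total_pflops / self.dgx1_systems

    @property
    def iops_per_system(self) -> float:
        # Quoted storage IOPS divided by the number of DGX-1 systems.
        return self.storage_iops / self.dgx1_systems


# Values as listed on the NetApp, Pure Storage, and DDN slides.
PODS = [
    PodConfig("NetApp ONTAP AI", 5, 5.0, 1_000_000),
    PodConfig("Pure Storage AIRI", 4, 4.0, 1_500_000),
    PodConfig("DDN A3I", 4, 4.0, 350_000),
]

for pod in PODS:
    print(
        f"{pod.name}: {pod.pflops_per_system:.1f} PFLOPS and "
        f"{pod.iops_per_system:,.0f} quoted storage IOPS per DGX-1"
    )
```

All three slides work out to 1 PFLOPS per DGX-1 system; the storage figures differ because each partner quotes a different array and configuration.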

DDN A3I Reference Architecture 9:1 Configuration

Network Diagram of DDN A3I Benchmark Testing Environment

Optimized Data Delivery for DGX-1 server with DDN A3I

DDN Network Diagram of Port-Level Connectivity 1:1 configuration

DGX POD Management Software (DeepOps)

You've Got A Shiny New DGX POD! What now?
• Cluster Deployment & Maintenance: OS, Firmware
• Security / Access
• Job Scheduling
• Monitoring
Other important considerations
• Network
• NGC Containers
• Airgap
• Storage

DeepOps: What is it?
• Open-source project to facilitate deployment of multi-node GPU clusters for Deep Learning and HPC environments, in an on-premise, optionally air-gapped datacenter or in the cloud
• DeepOps is also recognized as the DGX POD Management Software
• The modular nature of the project also allows more experienced administrators to pick and choose items that may be useful, making the process compatible with their existing software or infrastructure
• GitHub: https://github.com/NVIDIA/deepops
Note: You can use DeepOps to configure any NVIDIA GPU-Accelerated platform (and not just DGX servers).

DeepOps Components: Building out your GPU cluster
Automated Provisioning
● PXE Server for OS installation across cluster
● Automated configuration management
Firmware management
● Automated, cluster-wide firmware management
Package repository
● Deployment of internal Apt-repository
● Mirror packages for air-gapped environments
Docker registry
● Deployment of internal registry
● Automated mirroring of NGC containers
Job Scheduling
● Kubernetes
● Slurm
Logging
● Filebeat
● Elasticsearch
● Kibana
Monitoring
● DCGM
● Prometheus
● Grafana
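In the monitoring stack listed above, GPU metrics collected through DCGM end up in Prometheus, where they can be read back over the Prometheus HTTP API (`/api/v1/query`). The sketch below only builds such a query URL and makes no network call. The host name is hypothetical, and the metric name `DCGM_FI_DEV_GPU_UTIL` is an assumption: the name actually exposed depends on the DCGM exporter version deployed in a given cluster.

```python
from urllib.parse import urlencode


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Return the URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical Prometheus endpoint on a management node.
PROMETHEUS_URL = "http://prometheus.example.local:9090"

# Assumed metric name; check what the deployed DCGM exporter actually exposes.
GPU_UTIL_METRIC = "DCGM_FI_DEV_GPU_UTIL"

# Average GPU utilization per scraped instance.
promql = f"avg by (instance) ({GPU_UTIL_METRIC})"
url = build_instant_query_url(PROMETHEUS_URL, promql)
print(url)
```

The same PromQL expression could be pasted into a Grafana panel; building the URL by hand is mainly useful for scripted health checks.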

Here’s What We’ll Build Today
Node roles
● Provisioning node: Orchestrates the initial setup of the cluster
● Management node(s): Used for cluster management
● GPU Compute node(s): For high-performance compute workloads
Build steps, from first to last
● Prepare provisioning node
● Prepare management node(s)
● Deploy Kubernetes on management node(s)
● Deploy basic services on management node(s)
● Provision compute node(s)
● Deploy additional services on management node(s)
● Deploy Kubernetes on compute node(s)
● Run GPU-Accelerated jobs
● Deploy Kubeflow: To cluster and beyond!
Diagram labels: DeepOps, Network, Storage
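The build stack on this slide can be read as an ordered checklist in which each step depends on the ones beneath it. A minimal sketch of that idea follows, assuming the order is the slide's stack read from bottom to top. The `run_build` helper and the `execute` callback are hypothetical illustrations, not DeepOps commands; in practice each step would be whatever playbook or script the DeepOps documentation prescribes.

```python
from typing import Callable, List

# Step names taken from the slide, ordered bottom-to-top (an assumption).
BUILD_STEPS: List[str] = [
    "Prepare provisioning node",
    "Prepare management node(s)",
    "Deploy Kubernetes on management node(s)",
    "Deploy basic services on management node(s)",
    "Provision compute node(s)",
    "Deploy additional services on management node(s)",
    "Deploy Kubernetes on compute node(s)",
    "Run GPU-Accelerated jobs",
    "Deploy Kubeflow",
]


def run_build(steps: List[str], execute: Callable[[str], bool]) -> List[str]:
    """Run steps in order and stop at the first failure.

    `execute` is a caller-supplied function returning True on success.
    Returns the list of completed steps.
    """
    completed: List[str] = []
    for number, step in enumerate(steps, start=1):
        if not execute(step):
            raise RuntimeError(f"Step {number} failed: {step}")
        completed.append(step)
    return completed


def dry_run(step: str) -> bool:
    """Print the step instead of performing it."""
    print(f"[dry-run] {step}")
    return True


completed = run_build(BUILD_STEPS, dry_run)
```

Stopping at the first failure mirrors the dependency between layers: there is little point in provisioning compute nodes before the management services they rely on are up.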

Architectural Considerations

ARCHITECTURE
Building Multi-node GPU Clusters with DeepOps
Login node(s)
● 1x CPU-only login node
Mgmt. node(s)
● Odd number of CPU-only management nodes required for etcd key-value store
Management / Communication: 1/10Gb Ethernet
● 1/10Gb Ethernet control & management networks
○ Management, connectivity, command & control
Compute fabric: 100Gb EDR InfiniBand / RoCE
● Fully non-blocking fat-tree 100Gb EDR InfiniBand topology
○ Use the biggest EDR IB core switch that fits
Diagram labels: Datacenter Network, Compute node(s) (Kubernetes Nodes, Slurm Nodes), Storage
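The requirement for an odd number of management nodes follows from how etcd reaches consensus: a cluster of n members needs a majority (quorum) of floor(n/2) + 1 to accept writes, so adding a fourth member to a three-member cluster raises the quorum without tolerating any more failures. The sketch below works through that arithmetic; the function names are illustrative only.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """Number of members that can fail while a majority still remains."""
    return members - quorum_size(members)


for members in range(1, 8):
    print(
        f"{members} management node(s): quorum {quorum_size(members)}, "
        f"tolerates {failures_tolerated(members)} failure(s)"
    )
```

Three and four nodes both tolerate a single failure, and five and six both tolerate two, which is why even cluster sizes add cost without adding resilience.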
