Red Hat and the NVIDIA DGX: Tried, Tested, Trusted NVIDIA GTC 2019 - PowerPoint PPT Presentation

Red Hat and the NVIDIA DGX: Tried, Tested, Trusted NVIDIA GTC 2019 Jeremy Eder, Andre Beausoleil, Red Hat

Agenda ● Red Hat + NVIDIA Partnership Overview ● Announcements / What’s New ● OpenShift + GPU Integration Details NVIDIA GTC 2019: Red hat and the NVIDIA DGX Tried, Tested, Trusted 2

Where Red Hat Partners with NVIDIA ● GPU accelerated workloads in the enterprise ○ AI/ML and HPC ● Deploy and manage NGC containers ○ On-prem or public cloud ● Managing virtualized resources in the data center ○ vGPU for technical workstation ● Fast deployment of GPU resources with Red Hat ○ Easy to use driver framework NVIDIA GTC 2019: Red hat and the NVIDIA DGX Tried, Tested, Trusted

Red Hat/NVIDIA Technology Partnership Timeline RH Summit - AI booth & vGPU/RHV - Nvidia GTC 2018 & OpenShift Commons/ Nvidia GTC 2017: OpenShift Partner Joint Webinar - Kubernetes WG mtg - RH KubeCon; Deep SC2017 Red Hat vGPU Theatre & RH AI/ML Oil & Gas Use vGPU & Kubernetes Learning on RH/Nvidia (booth Roadmap Update Strategy sessions Case sessions, RH sponsorship OpenShift w/GPUs demos, talks) Nov’17 Mar’18 Apr’18 May’18 Oct’18 Mar’19 May‘17 Nov’17 Mar’18 May’18 Jun’18 Dec’18 NVIDIA GTC 2019; RHV4.2/vGPU 6.1 NVIDIA GTC DC; LSF & MM Summit: STAC-A2 Benchmark 2018 Rice Oil & Gas RHEL & OpenShift & CUDA9.2 Annc. RHEL & OpenShift Nouveau Driver (Nvidia/HPe/RHEL-STAC HPC Conf Certification on Certification on demo Conf NYC, RH & Nvidia (vGPU/RHV) DGX-2 / T4 GPU DGX-1 Blogs) Server Configs, RH Sponsorship NVIDIA GTC 2019: Red hat and the NVIDIA DGX Tried, Tested, Trusted

Red Hat + NVIDIA: What’s New? ● Red Hat Enterprise Linux Certification on DGX-1 & DGX-2 systems ○ Support for Kubernetes-based, OpenShift Container Platform ○ NVIDIA GPU Cloud (NGC) containers to run on RHEL and OpenShift ● Red Hat’s OpenShift provides advanced ways of managing hardware to best leverage GPUs in container environments ● NVIDIA developed precompiled driver packages to simplify GPU deployments on Red Hat products ● NVIDIA’s latest T4 GPUs are available on Red Hat Enterprise Linux ○ T4 Server with RHEL support from most major OEM server vendors ○ T4 servers are “NGC-Ready” to run GPU containers NVIDIA GTC 2019: Red hat and the NVIDIA DGX Tried, Tested, Trusted

Red Hat + NVIDIA: Open Source Collaboration Open Source Projects • Heterogeneous Memory Management (HMM) • Memory management between device and CPU • Nouveau Driver • Graphics device driver for NVIDIA GPU • Mediated Devices (mdev) • Enabling vGPU through the Linux kernel framework • Kubernetes Device Plugins • Fast and direct access to GPU hardware • Run GPU enabled containers in Kubernetes cluster NVIDIA GTC 2019: Red hat and the NVIDIA DGX Tried, Tested, Trusted

Red Hat OpenShift Container Platform

OPENSHIFT - CONTAINER PLATFORM FOR AI Enable Kubernetes clusters to seamlessly run accelerated AI workloads in containers Red Hat is delivering required functionality to efficiently run OCP NODE OCP NODE OCP NODE OCP MASTER AI/ML workloads on OpenShift c C C API/AUTHENTICATIO N 3.10, 3.11 C C C ● DATA STORE RHEL RHEL RHEL Device plugins provide access to FPGAs, GPGPUs, SoC and ○ SCHEDULER OCP NODE OCP NODE OCP NODE other specialized HW to applications running in containers CPU Manager provides containers with exclusive access to HEALTH/SCALING ○ C C C C compute resources, like CPU cores, for better utilization RED HAT C ENTERPRISE LINUX Huge Pages Support enables containers with large ○ RHEL RHEL RHEL memory requirements to run more efficiently 4.0 GPU-enabled server ● with Red Hat Enterprise Linux and Multi-network feature allows more than one network ○ interface per container for better traffic management OpenShift Container platform (OCP) 8

One Platform to... NFV OpenShift is the single platform Machine FSI to run any application: Learning ● Old or new HPC ISVs ● Monolithic/Microservice Big Data Animation NVIDIA GTC 2019: Red hat and the NVIDIA DGX Tried, Tested, Trusted 9 9

Data Scientist User Experience (Service Catalog) NVIDIA GTC 2019: Red hat and the NVIDIA DGX Tried, Tested, Trusted

Upstream First: Kubernetes Working Groups ● Resource Management Working Group Features Delivered ○ Device Plugins (GPU/Bypass/FPGA) ■ CPU Manager (exclusive cores) ■ Huge Pages Support ■ Extensive Roadmap ○ Intel, IBM, Google, NVIDIA, Red Hat, many more... ● NVIDIA GTC 2019: Red hat and the NVIDIA DGX Tried, Tested, Trusted 11

Upstream First: Kubernetes Working Groups ● Network Plumbing Working Group Formalized Dec 2017 ○ Implemented a multi-network specification: ● https://github.com/K8sNetworkPlumbingWG/multi-net-spec (collection of CRDs for multiple networks, owned by sig-network) Reference Design implemented in Multus CNI by Red Hat ● Separate control- and data-plane, Overlapping IPs, Fast Data-plane ● IBM, Intel, Red Hat, Huawei, Cisco, Tigera...at least. ● NVIDIA GTC 2019: Red hat and the NVIDIA DGX Tried, Tested, Trusted 12

GPU Cluster Topology

What does an OpenShift (OCP) Cluster look like? Control Plane Infrastructure LB registry registry registry master master master and and and and etcd and etcd and etcd router router router Compute and GPU Nodes GPU GPU GPU GPU NVIDIA GTC 2019: Red hat and the NVIDIA DGX Tried, Tested, Trusted 14

OpenShift Cluster Topology ● How to enable software to take advantage of “special” hardware ● Create Node Pools ○ MachineSets ○ Mark them as “special” ○ Taints/Tolerations Compute and GPU Nodes ○ Priority/Preemption GPU GPU ○ ExtendedResourceTole GPU GPU ration NVIDIA GTC 2019: Red hat and the NVIDIA DGX Tried, Tested, Trusted 15

OpenShift Cluster Topology ● How to enable software to take advantage of “special” hardware ● Tune/Configure the OS ○ Tuned Profiles ○ CPU Isolation ○ sysctls Compute and GPU Nodes GPU GPU GPU GPU NVIDIA GTC 2019: Red hat and the NVIDIA DGX Tried, Tested, Trusted 16

OpenShift Cluster Topology ● How to enable software to take advantage of “special” hardware ● Optimize your workload ○ Dedicate CPU cores ○ Consume hugepages Compute and GPU Nodes GPU GPU GPU GPU NVIDIA GTC 2019: Red hat and the NVIDIA DGX Tried, Tested, Trusted 17

OpenShift Cluster Topology ● How to enable software to take advantage of “special” hardware ● Enable the Hardware ○ Install drivers ○ Deploy Device Plugin ○ Deploy monitoring Compute and GPU Nodes GPU GPU GPU GPU NVIDIA GTC 2019: Red hat and the NVIDIA DGX Tried, Tested, Trusted 18

OpenShift Cluster Topology ● How to enable software to take advantage of “special” hardware ● Consume the Device ○ KubeFlow Template deployment Compute and GPU Nodes GPU GPU GPU GPU NVIDIA GTC 2019: Red hat and the NVIDIA DGX Tried, Tested, Trusted 19

Support Components

Cluster Node Tuning Operator (tuned) OpenShift node-level tuning operator ● Consolidate/Centralize node-level tuning (openshift-ansible) ● Set tunings for Elastic/Router/SDN ● Add more flexibility to add custom tuning specified by customers ● NVIDIA DGX-1 & DGX-2 Tuned Profiles 21 INSERT DESIGNATOR, IF NEEDED

Node Feature Discovery Operator (NFD) Steer workloads based on infrastructure capabilities Git Repos: ● Labels: feature.node.kubernetes.io/cpu-hardware_multithreading=true feature.node.kubernetes.io/cpuid-AVX2=true Upstream ○ feature.node.kubernetes.io/cpuid-SSE4.2=true Downstream ○ feature.node.kubernetes.io/kernel-selinux.enabled=true Client/Server model ● feature.node.kubernetes.io/kernel-version.full=3.10.0-957.5.1.el7.x86_64 Customize with “hooks” ● feature.node.kubernetes.io/pci-0300_10de.present=true feature.node.kubernetes.io/storage-nonrotationaldisk=true feature.node.kubernetes.io/system-os_release.ID=rhcos feature.node.kubernetes.io/system-os_release.VERSION_ID=4.0 feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4 feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=0 NVIDIA GTC 2019: Red hat and the NVIDIA DGX Tried, Tested, Trusted

NFV Partner Engineering along with the Network Plumbing Working Group is using Multus as part of a reference implementation. Multus CNI is a “meta plugin” for https://github.com/intel/multus-cni Kubernetes CNI which enables one to attach multiple network interfaces on each pod. It allows one to assign a CNI plugin to each interface created in the pod. 23

THE PROBLEM (Today) #1 Each pod only has one #2 Each master/node has only network interface one static CNI configuration Kubernetes Kubernetes Master/Node Master/Node so. static. Pod A eth0 flannel 24

THE SOLUTION (Today) Static CNI configuration points Each subsequent CNI plugin, as called by Multus, has to Multus configurations which are defined in CRD objects Kubernetes Kubernetes Master/Node Master/Node CRDs Pod annotation I’d like a flannel interface, and a macvlan interface please. flannel macvlan Pod C eth0 net0 flannel macvlan Sure thing bud, I’ll pull up the configurations stored in CRD objects. 25

Red Hat and the NVIDIA DGX: Tried, Tested, Trusted NVIDIA GTC 2019 - PowerPoint PPT Presentation

Red Hat and the NVIDIA DGX: Tried, Tested, Trusted NVIDIA GTC 2019 Jeremy Eder, Andre Beausoleil, Red Hat Agenda Red Hat + NVIDIA Partnership Overview Announcements / Whats New OpenShift + GPU Integration Details NVIDIA GTC 2019:

[LE,RO] red red red red red red red red red red red red red red red red red red

S9334: Building And Managing Scalable AI Infrastructure With NVIDIA DGX Pod And DGX Pod

S8688 : INSIDE DGX-2 Glenn Dearth, Vyas Venkataraman Mar 28, 2018 Why was DGX-2 created DGX-2

Uniqueness for a class of linear quadratic mean field games with common noise Foguen Tchuendom

Presentation Workout: The 10 Tried-and-Tested Steps Presentation Workout: The 10 Tried-and-Tested

Working with Practitioners to update our Homelessness system Tried and tested Tried and

SUDS: Innovation or a Tried and SUDS: Innovation or a Tried and Tested Practice? Tested

Presentation Workout: The 10 Tried-and-Tested Steps That Will Build Presentation Workout: The 10

S9164 S9164 Adv Advanced nced We Weather In Inform rmatio ion Re Recall wi with th DGX DGX

NVidia vGPU and Red Hat Virtualization Virtual High End Workstations and Compute April 2017

Red Eyes, Red Spots, and Red Flags Red Eyes Common reason for primary care visits Red

S9299 NVIDIA VGPU ON RED HAT LINUX HYPERVISOR (RHV) Shailesh Deshmukh Senior Solution Architect,

How nCipher HSMs enhance security for Red Hat OpenShift Platform www.ncipher.com Red Hat and

Introduction to Red Hat ALBERT WONG Solution Architect, Red Hat # 1 OPEN SOURCE LEADER 90

PERFORMANCE OPTIMIZATION IN RED PERFORMANCE OPTIMIZATION IN RED HAT OPENSTACK PLATFORM HAT

Red Hat Ceph Storage Free Test Drive Environment Introduction Karan Singh Sr. Storage Architect

The Romo Dallaire Child Soldiers Initiative Ending the use of Child Soldiers from a Preventative

New Tax Collector Workshop The Ethical Administration of Governmental Financial Resources March

Mastering Perl by brian d foy The Perl Review version 1.71 July 15, 2010 The Perl Review

Experiences with the Carnegie Mellon Binary Analysis Platform (CMU BAP) Sam L. Thomas, CNRS,

Openness Forum May 23-25, 2017 Kyiv, Ukraine UNCITRAL Perspective on Curbing Corruption

BEST PRACTICES: FIRING MINIMIZING RISK AND AVOIDING LITIGATION June 25, 2013 Presented by:

COIN Secure and Anonymous Decentralized Bitcoin Mixing PARTY Jan

STAR Conference London, 6 October 2016 Company Presentation Agenda 1. Overview 2.

Sambuz

Useful Links

Newsletter

Mail Us