Kubernetes & AI with Run:AI, Red Hat & Excelero AI WEBINAR - PowerPoint PPT Presentation

Kubernetes & AI with Run:AI, Red Hat & Excelero AI WEBINAR Date/Time: Tuesday, June 9 | 9 am PST

What’s next in technology and innovation?
Kubernetes & AI with Run:AI, Red Hat & Excelero AI WEBINAR
Presenter: William Benton, Engineering Manager
Presenter: Omri Geller, CEO & Co-Founder
Presenter: Gil Vitzinger, Software Developer
Your Host: Tom Leyden, VP Marketing

Kubernetes for AI Workloads Omri Geller, CEO and co-founder, Run:AI

A Bit of History
Bare Metal, Virtual Machines, Containers
Drivers: reproducibility and portability; needed flexibility and better utilization
Containers scale easily, they’re lightweight and efficient, they can run any workload, are flexible and can be isolated
…But they need orchestration

Enter Kubernetes
• Track, Schedule and Operationalize
• Create Efficient Cluster Utilization
• Execute Across Different Hardware

Today, 60% of Those Who Deploy Containers Use K8s for Orchestration (source: CNCF)

Now let’s talk about AI

Computing Power Fuels Development of AI
Manual Engineering, Classical Machine Learning, Deep Learning

Artificial Intelligence is a Completely Different Ballgame
• New accelerators
• Distributed computing
• Experimentation R&D

Data Science Workflows and Hardware Accelerators are Highly Coupled
Data scientists, Hardware accelerators
• Constant hassles
• Workflow limitations
• Under-utilized GPUs

This Leads to Frustration on Both Sides
• IT leaders are frustrated: GPU utilization is low
• Data Scientists are frustrated: speed and productivity are low

AI Workloads are Also Built on Containers
• NGC: Nvidia pre-trained models for AI experimentation on Docker containers
• Container ecosystem for Data Science is growing

How Can We Bridge The Divide?

Kubernetes, the “De-facto” Standard for Container Orchestration, Lacks the Following Capabilities:
• Multiple queues
• Automatic queueing/de-queueing
• Advanced priorities & policies
• Advanced scheduling algorithms
• Affinity-aware scheduling
• Efficient management of distributed workloads
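To make the first three missing capabilities concrete, here is a minimal sketch of multiple queues with priorities and automatic de-queueing. It is a toy model only: the class name `MultiQueueScheduler`, the queue names, and the job names are hypothetical, and nothing here reflects how Run:AI or Kubernetes actually implement scheduling.

```python
import heapq
from itertools import count


class MultiQueueScheduler:
    """Toy scheduler: one priority queue per team, sharing a GPU pool."""

    def __init__(self, total_gpus):
        self.free_gpus = total_gpus
        self.queues = {}      # queue name -> heap of (-priority, seq, job, gpus)
        self._seq = count()   # tie-breaker: keeps submission order within a priority

    def submit(self, queue, job, gpus, priority=0):
        """Queue a job; a higher priority value is served first."""
        heap = self.queues.setdefault(queue, [])
        heapq.heappush(heap, (-priority, next(self._seq), job, gpus))

    def schedule(self):
        """De-queue jobs automatically until no queue head fits the free GPUs."""
        started = []
        while True:
            # Consider only the head of each queue, and only if it fits.
            heads = [
                (heap[0][0], heap[0][1], name)
                for name, heap in self.queues.items()
                if heap and heap[0][3] <= self.free_gpus
            ]
            if not heads:
                return started
            _, _, name = min(heads)
            _, _, job, gpus = heapq.heappop(self.queues[name])
            self.free_gpus -= gpus
            started.append(job)


scheduler = MultiQueueScheduler(total_gpus=8)
scheduler.submit("team-a", "train-1", gpus=2, priority=1)
scheduler.submit("team-b", "debug-1", gpus=1, priority=5)
scheduler.submit("team-a", "train-2", gpus=4, priority=9)
started = scheduler.schedule()
print(started)  # ['train-2', 'debug-1', 'train-1']
```

In this sketch a queue whose head job does not fit simply waits, so a large job can hold back smaller jobs behind it in the same queue. Handling that case (for example with backfilling) is the kind of advanced scheduling algorithm the slide lists separately.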

How is Experimentation Different?
Training, Build

Distinguishing Between Build and Training Workflows
Build
• Development & debugging
• Interactive sessions
• Short cycles
• Performance is less important
• Low GPU utilization

Distinguishing Between Build and Training Workflows
Training
• Training & HPO
• Remote execution
• Long workloads
• Throughput is highly important
• High GPU utilization
Build
• Development & debugging
• Interactive sessions
• Short cycles
• Performance is less important
• Low GPU utilization

How to Solve? Guaranteed Quotas
Guaranteed quotas
• Fits training workflows
• Users can go over quota
Fixed quotas
• Fits build workloads
• GPUs are always available

Solution: Guaranteed Quotas
Guaranteed quotas
• Fits training workflows
• Users can go over quota
Fixed quotas
• Fits build workloads
• GPUs are always available
Outcome
• More concurrent experiments
• More multi-GPU training
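The difference between the two quota styles can be sketched in a few lines. The function below is an illustrative assumption, not Run:AI's algorithm: `allocate`, the user names, and the one-GPU-at-a-time sharing rule are all hypothetical. Each user first receives up to their guaranteed share, and GPUs that would otherwise sit idle are then lent to users who asked for more.

```python
def allocate(requests, quotas, total_gpus):
    """Grant GPUs per user under a guaranteed-quota policy (toy model).

    requests: dict of user -> GPUs requested
    quotas:   dict of user -> guaranteed GPUs (sum assumed <= total_gpus)
    """
    # Pass 1: every user gets up to their guaranteed quota.
    granted = {user: min(requests.get(user, 0), quotas[user]) for user in quotas}
    idle = total_gpus - sum(granted.values())

    # Pass 2: lend idle GPUs, one at a time, to users who still want more,
    # starting with whoever is least over quota.
    while idle > 0:
        wanting = [u for u in quotas if requests.get(u, 0) > granted[u]]
        if not wanting:
            break
        for user in sorted(wanting, key=lambda u: granted[u] - quotas[u]):
            if idle == 0:
                break
            granted[user] += 1
            idle -= 1
    return granted


quotas = {"alice": 4, "bob": 4}
result = allocate({"alice": 7, "bob": 2}, quotas, total_gpus=8)
print(result)  # {'alice': 6, 'bob': 2}
```

With a fixed quota, alice would stop at 4 GPUs and two GPUs would sit idle while bob uses only 2. The sketch does not model what happens when bob later asks for his full share; reclaiming lent GPUs is a separate policy decision.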

Queueing Management Mechanism

Run:AI - Stitching it All Together

Run:AI - Applying HPC Concepts to Kubernetes
With the advantages of K8s, plus some concepts from the world of HPC & distributed computing, we can bridge the gap
• Data Science teams gain productivity and speed
• IT teams gain visibility and maximal GPU utilization

Run:AI - Kubernetes-Based Abstraction Layer
• INTEGRABLE: Easily integrates with IT and Data Science platforms
• MULTI-CLOUD: Run on any public, private and hybrid cloud environment
• IT GOVERNANCE: Policy based orchestration and queuing management

Run:AI
• Utilize Kubernetes across IT to improve resource utilization
• Speed up experimentation process and time to market
• Easily scale infrastructure to meet needs of the business

From 28% to 73% utilization, 2X speed, and $1M savings
Challenge
• 28% AVERAGE GPU UTILIZATION: inefficient and underutilized resources
Solution: After implementing Run:AI’s platform
• 73% AVERAGE GPU UTILIZATION
• Enabled 2x more experiments to run
• Saved $1M in additional GPU expenditures for 2020

Run:AI at-a-Glance
• Founded in 2018
• Venture funded: backed by top VCs
• Offices in Tel Aviv, New York, and Boston
• Fortune 500 customers
• Top cloud and virtualization engineers

Thank you

NVMesh in Kubernetes

What is NVMesh CSI Driver
● What is NVMesh CSI Driver?
○ CSI - Container Storage Interface
○ NVMesh as a storage backend in Kubernetes
● Main Features
○ Static Provisioning
○ Dynamic Provisioning
○ Block and File System volumes
○ Access Modes (ReadWriteOnce, ReadWriteMany, ReadOnlyMany)
○ Extend volumes
○ Using NVMesh VPGs

CSI Driver Components
[Diagram: Kubernetes Controller; NVMesh CSI Controller; REST API; three NVMesh CSI Node Drivers, each with an NVMesh Client; NVMesh Management; NVMesh Targets]

Dynamic Provisioning & Attach Flow
User creates a Persistent Volume Claim (PVC)
[Diagram: Kubernetes Controller; NVMesh CSI Controller; "Create Volume"; NVMesh Management; NVMesh Targets]

Dynamic Provisioning & Attach Flow
User creates a POD that uses the PVC
[Diagram: Kubernetes Controller; NVMesh CSI Controller; Node containing the NVMesh CSI Node Driver, User App PODs and NVMesh Client; labels "POD mount", "K8s internal mount", "OS mount", "Attach / Detach", "/dev/nvmesh/v1", "Data"; NVMesh Management; NVMesh Targets]

Exposing NVMesh volume in a Pod
● User App POD 1: kubelet/pod1/volumes/v1
● User App POD 2: kubelet/pod2/volumes/v1
● CSI Publish Volume: bind mount, for each volume for each POD
● kubelet/volume/mount: Block Volume or FileSystem Volume
● CSI Stage Volume: mkfs and mount, once for each Volume on the Node
● NVMesh attach: /dev/nvmesh/v1 (NVMesh Client)
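The "once for each Volume on the Node" versus "for each volume for each POD" distinction is easy to lose, so here is a small bookkeeping sketch. It is a toy model and not the NVMesh driver: the class `NodeModel` and its methods are hypothetical, and it folds the stage step into the first publish call for brevity, whereas the CSI specification defines staging and publishing as separate node calls.

```python
class NodeModel:
    """Toy model of per-node stage/publish bookkeeping for CSI volumes."""

    def __init__(self):
        self.staged = set()    # volumes attached and staged on this node
        self.published = {}    # volume -> set of pods that have it mounted
        self.log = []          # ordered record of the simulated operations

    def publish(self, volume, pod):
        if volume not in self.staged:
            # Stage: happens once per volume on the node (attach, mkfs, mount).
            self.log.append(f"stage {volume}")
            self.staged.add(volume)
        # Publish: happens for every (volume, pod) pair (bind mount into the pod).
        self.log.append(f"publish {volume} -> {pod}")
        self.published.setdefault(volume, set()).add(pod)

    def unpublish(self, volume, pod):
        pods = self.published.get(volume, set())
        pods.discard(pod)
        self.log.append(f"unpublish {volume} -> {pod}")
        if not pods and volume in self.staged:
            # The last pod is gone, so the node-level staging can be undone.
            self.log.append(f"unstage {volume}")
            self.staged.remove(volume)


node = NodeModel()
node.publish("v1", "pod1")
node.publish("v1", "pod2")
print(node.log)  # ['stage v1', 'publish v1 -> pod1', 'publish v1 -> pod2']
```

Two pods sharing volume v1 on the same node trigger one stage and two publishes, which matches the counts stated on the slide.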

Usage Examples

    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: block-pvc
    spec:
      accessModes:
        - ReadWriteMany
      volumeMode: Block
      resources:
        requests:
          storage: 15Gi
      storageClassName: nvmesh-raid10

    kind: StorageClass
    apiVersion: storage.k8s.io/v1
    metadata:
      name: nvmesh-custom-vpg
    provisioner: nvmesh-csi.excelero.com
    parameters:
      vpg: your_custom_vpg
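The slides stop at the claim and the storage class. As a hedged follow-up, the sketch below builds a Pod manifest that consumes the block-pvc claim as a raw block device. Because the claim uses volumeMode: Block, the container refers to it through `volumeDevices` and `devicePath` rather than `volumeMounts`. The pod name, container image, and device path are illustrative assumptions, not values from the presentation.

```python
import json


def pod_using_block_pvc(pod_name, claim_name, device_path, image):
    """Build a Pod manifest that exposes a Block-mode PVC as a raw device."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "containers": [
                {
                    "name": "app",
                    "image": image,
                    # Block-mode volumes are exposed as devices, not mounts.
                    "volumeDevices": [
                        {"name": "data", "devicePath": device_path}
                    ],
                }
            ],
            "volumes": [
                {
                    "name": "data",
                    "persistentVolumeClaim": {"claimName": claim_name},
                }
            ],
        },
    }


# All four arguments except the claim name are hypothetical example values.
manifest = pod_using_block_pvc(
    pod_name="block-consumer",
    claim_name="block-pvc",
    device_path="/dev/xvda",
    image="busybox",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl, which accepts JSON manifests as well as YAML.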

Summary
NVMesh Benefits for Kubernetes:
● Persistent storage that scales for stateful applications
● Predictable application performance: ensure that storage is not a bottleneck
● Scale your performance and capacity linearly
● Containers in a pod can access persistent storage presented to that pod, but with the freedom to restart the pod on an alternate physical node
● Choice of Kubernetes PVC access mode to match the storage to the application and file system requirements

Machine learning discovery, workflows, and systems on Kubernetes William Benton Engineering Manager and Senior Principal Engineer Red Hat, Inc.

codifying problem and metrics; data collection and cleaning; feature engineering; model training and tuning; model validation; model deployment; monitoring, validation


[Figure: components of a machine learning system: configuration, data collection, feature extraction, data verification, machine resource management, analysis tools, process management, serving infrastructure, monitoring]
(Adapted from Sculley et al., “Hidden Technical Debt in Machine Learning Systems.” NIPS 2015)

[Diagram: data engineers. Sources: events, databases, file and object storage. Operations: transform, federate, archive]

[Diagram: data scientists train models on data from events, databases, and file and object storage (transform, federate); a developer UI is also shown]
