Kubernetes & AI with Run:AI, Red Hat & Excelero (AI Webinar)



SLIDE 1

AI WEBINAR

Date/Time: Tuesday, June 9 | 9 am PST

Kubernetes & AI with Run:AI, Red Hat & Excelero

SLIDE 2

AI WEBINAR: What’s next in technology and innovation?

Kubernetes & AI with Run:AI, Red Hat & Excelero

Your Host: Tom Leyden, VP Marketing
Presenter: Omri Geller, CEO & Co-Founder
Presenter: William Benton, Engineering Manager
Presenter: Gil Vitzinger, Software Developer

SLIDE 3

Kubernetes for AI Workloads

Omri Geller, CEO and co-founder, Run:AI

SLIDE 4

A Bit of History

Bare Metal → Virtual Machines: needed flexibility and better utilization
Virtual Machines → Containers: reproducibility and portability

Containers scale easily: they are lightweight and efficient, they can run any workload, they are flexible, and they can be isolated. But they need orchestration.

SLIDE 5

Enter Kubernetes: Track, Schedule and Operationalize

  • Execute across different hardware
  • Create efficient cluster utilization

SLIDE 6

Today, 60% of Those Who Deploy Containers Use K8s for Orchestration*

*CNCF

SLIDE 7

Now let’s talk about AI

SLIDE 8

Computing Power Fuels Development of AI

Manual Engineering → Classical Machine Learning → Deep Learning

SLIDE 9

Artificial Intelligence is a Completely Different Ballgame

  • Experimentation
  • R&D
  • New accelerators
  • Distributed computing

SLIDE 10

Data Science Workflows and Hardware Accelerators are Highly Coupled

Data scientists ↔ hardware accelerators: constant hassles, workflow limitations, under-utilized GPUs

SLIDE 11

This Leads to Frustration on Both Sides

  • Data scientists are frustrated: speed and productivity are low
  • IT leaders are frustrated: GPU utilization is low

SLIDE 12

AI Workloads are Also Built on Containers

The container ecosystem for data science is growing. For example, NGC offers NVIDIA pre-trained models for AI experimentation in Docker containers.
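As a sketch of what running an NGC container on Kubernetes can look like, a Pod can pull an NGC image and request a GPU. The image tag, names, and command below are illustrative, not taken from the deck:

```yaml
# Minimal Pod pulling an NGC image and requesting one GPU.
# Names, tag, and command are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: ngc-tf-experiment
spec:
  restartPolicy: Never
  containers:
    - name: train
      image: nvcr.io/nvidia/tensorflow:20.03-tf2-py3  # an NGC TensorFlow image
      command: ["python", "-c", "import tensorflow as tf; print(tf.__version__)"]
      resources:
        limits:
          nvidia.com/gpu: 1  # requires the NVIDIA device plugin on the cluster
```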

SLIDE 13

How Can We Bridge The Divide?

SLIDE 14

Kubernetes, the “De-facto” Standard for Container Orchestration

Lacks the following capabilities:

  • Multiple queues
  • Automatic queueing/de-queueing
  • Advanced priorities & policies
  • Advanced scheduling algorithms
  • Affinity-aware scheduling
  • Efficient management of distributed workloads
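Stock Kubernetes does ship one basic priority primitive, PriorityClass, but it provides none of the queueing, policy, or gang-scheduling capabilities listed here. A minimal sketch with illustrative names:

```yaml
# A PriorityClass gives pods a relative scheduling priority,
# but offers no queues, quotas, or gang scheduling.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: interactive-build   # illustrative name
value: 1000                 # higher value = scheduled (and preempting) first
globalDefault: false
description: "Priority for interactive build workloads (illustrative)."
```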

SLIDE 15

How is Experimentation Different?

Build vs. Training

SLIDE 16

Distinguishing Between Build and Training Workflows

Build:
  • Development & debugging
  • Interactive sessions
  • Short cycles
  • Performance is less important
  • Low GPU utilization

Training:
  • Training & HPO
  • Remote execution
  • Long workloads
  • Throughput is highly important
  • High GPU utilization
SLIDE 18

Solution: Guaranteed Quotas

Fixed quotas:
  • Fit build workloads
  • GPUs are always available

Guaranteed quotas:
  • Fit training workflows
  • Users can go over quota
  • More concurrent experiments
  • More multi-GPU training
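For comparison, a fixed quota in stock Kubernetes can be expressed as a ResourceQuota. Unlike the guaranteed quotas described above, it is a hard cap: idle GPUs elsewhere in the cluster cannot be borrowed against it. Namespace and names are illustrative:

```yaml
# Hard per-namespace GPU cap: team-a can never request more than 4 GPUs,
# even when the rest of the cluster sits idle.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-gpus     # illustrative
  namespace: team-a     # illustrative
spec:
  hard:
    requests.nvidia.com/gpu: "4"
```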
SLIDE 20

Queueing Management Mechanism

SLIDE 21

Run:AI - Stitching it All Together

SLIDE 22

Run:AI - Applying HPC Concepts to Kubernetes

With the advantages of K8s, plus some concepts from the world of HPC & distributed computing, we can bridge the gap.

Data science teams gain productivity and speed; IT teams gain visibility and maximal GPU utilization.

SLIDE 23

Run:AI - Kubernetes-Based Abstraction Layer

INTEGRABLE: Easily integrates with IT and Data Science platforms
MULTI-CLOUD: Runs on any public, private, or hybrid cloud environment
IT GOVERNANCE: Policy-based orchestration and queueing management

SLIDE 24

Run:AI

  • Utilize Kubernetes across IT to improve resource utilization
  • Speed up the experimentation process and time to market
  • Easily scale infrastructure to meet the needs of the business

SLIDE 25

From 28% to 73% utilization, 2x speed, and $1M savings

Challenge: 28% average GPU utilization - inefficient and underutilized resources

Solution: after implementing Run:AI’s platform, 73% average GPU utilization

  • Enabled 2x more experiments to run
  • Saved $1M in additional GPU expenditures for 2020

SLIDE 26

Run:AI at-a-Glance

  • Founded in 2018
  • Venture funded, backed by top VCs
  • Offices in Tel Aviv, New York, and Boston
  • Fortune 500 customers
  • Top cloud and virtualization engineers
SLIDE 27

Thank you

SLIDE 28

NVMesh in Kubernetes

SLIDE 29

What is the NVMesh CSI Driver?
    ○ CSI: Container Storage Interface
    ○ NVMesh as a storage backend in Kubernetes

Main Features
    ○ Static Provisioning
    ○ Dynamic Provisioning
    ○ Block and File System volumes
    ○ Access Modes (ReadWriteOnce, ReadWriteMany, ReadOnlyMany)
    ○ Extend volumes
    ○ Using NVMesh VPGs

SLIDE 30

CSI Driver Components

Control path: Kubernetes Controller ↔ NVMesh CSI Controller ↔ NVMesh Management (REST API)
Per node: NVMesh CSI Node Driver + NVMesh Client, connected to the NVMesh Targets

SLIDE 31

Dynamic Provisioning & Attach Flow

User creates a Persistent Volume Claim (PVC). The Kubernetes Controller asks the NVMesh CSI Controller to create a volume, which it does through NVMesh Management; the volume is provisioned on the NVMesh Targets.

SLIDE 32

Dynamic Provisioning & Attach Flow

User creates a POD that uses the PVC. The NVMesh CSI Node Driver attaches the volume through the NVMesh Client (exposed on the node as /dev/nvmesh/v1), performs the OS mount and the K8s internal mount, and the POD mount follows; data then flows between the user application PODs on the node and the NVMesh Targets.

SLIDE 33

Exposing NVMesh volume in a Pod

CSI Stage Volume (once for each volume on the node): NVMesh attach; for file system volumes, mkfs and mount under kubelet/volume/mount.
CSI Publish Volume (for each volume, for each POD): bind mount into kubelet/podN/volumes/v1 for file system volumes, or expose the block device /dev/nvmesh/v1 directly for block volumes.

SLIDE 34

Usage Examples

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: block-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Block
  resources:
    requests:
      storage: 15Gi
  storageClassName: nvmesh-raid10

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nvmesh-custom-vpg
provisioner: nvmesh-csi.excelero.com
parameters:
  vpg: your_custom_vpg
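To complete the picture, a Pod can consume a claim like block-pvc above as a raw block device via volumeDevices. The Pod name, image, and device path below are illustrative:

```yaml
# Pod consuming the block-mode PVC as a raw device (illustrative names).
apiVersion: v1
kind: Pod
metadata:
  name: block-consumer            # illustrative
spec:
  containers:
    - name: app
      image: registry.access.redhat.com/ubi8/ubi  # any image would do
      command: ["sleep", "infinity"]
      volumeDevices:              # raw device, matching volumeMode: Block
        - name: data
          devicePath: /dev/xvda   # path the device appears at in the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: block-pvc
```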

SLIDE 35

Summary

NVMesh Benefits for Kubernetes:

  • Persistent storage that scales for stateful applications
  • Predictable application performance: ensure that storage is not a bottleneck
  • Scale your performance and capacity linearly
  • Containers in a pod can access persistent storage presented to that pod, but with the freedom to restart the pod on an alternate physical node
  • Choice of Kubernetes PVC access mode to match the storage to the application and file system requirements

SLIDE 36

William Benton, Engineering Manager and Senior Principal Engineer, Red Hat, Inc.

Machine learning discovery, workflows, and systems on Kubernetes

SLIDE 37

The machine learning workflow: codifying the problem and metrics → data collection and cleaning → feature engineering → model training and tuning → model validation → model deployment → monitoring and validation

SLIDE 43

Surrounding infrastructure in ML systems: configuration, data collection, data verification, feature extraction, process management, analysis tools, machine resource management, serving infrastructure, monitoring.

(Adapted from Sculley et al., “Hidden Technical Debt in Machine Learning Systems.” NIPS 2015)

SLIDE 45

Data engineers: federate events, databases, and file/object storage; transform; archive.

SLIDE 46

Data scientists: work from a developer UI and train models on the transformed data.

SLIDE 47

Application developers: consume the trained models behind management, web and mobile, and reporting applications.

SLIDE 48

Data engineers, data scientists, and application developers all share the same pipeline.

SLIDE 49

The machine learning workflow, revisited: codifying the problem and metrics; data collection and cleaning; feature engineering; model training and tuning; model validation; model deployment; monitoring and validation.

SLIDE 53

How Kubernetes can help

SLIDE 54

Immutable images

Layers: base image; configuration and installation recipes; user application code. Each image is identified by content hashes, so the exact image you built and validated is the model in production.
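On OpenShift, the layered build sketched here might be captured declaratively in a BuildConfig. The repository URL and names below are purely illustrative:

```yaml
# Source-to-image build: base image + install recipes + application code
# produce a content-addressed, immutable image.
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: model-service             # illustrative
spec:
  source:
    git:
      uri: https://example.com/team/model-service.git  # illustrative repo
  strategy:
    sourceStrategy:
      from:
        kind: ImageStreamTag
        name: python:3.8          # the base image layer
  output:
    to:
      kind: ImageStreamTag
      name: model-service:latest  # the resulting immutable image
```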
SLIDE 57

Stateless microservices
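A stateless microservice like the ones sketched on these slides is typically a Deployment whose replicas are interchangeable, so any of them can serve any request. Names and image are illustrative:

```yaml
# Stateless service: three interchangeable replicas behind one label selector.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: feature-service           # illustrative
spec:
  replicas: 3                     # stateless: any replica can serve any request
  selector:
    matchLabels:
      app: feature-service
  template:
    metadata:
      labels:
        app: feature-service
    spec:
      containers:
        - name: web
          image: quay.io/example/feature-service:latest  # illustrative
          ports:
            - containerPort: 8080
```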

SLIDE 65

Declarative app configuration

https://route.my-awesome-app.ai
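Declarative configuration means the route itself is just another object you apply. On OpenShift, a hostname like the one on this slide could come from a Route such as this hypothetical one:

```yaml
# Declarative route: the hostname is configuration, not a manual step.
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: my-awesome-app            # illustrative
spec:
  host: route.my-awesome-app.ai   # the hostname shown on the slide
  to:
    kind: Service
    name: my-awesome-app          # illustrative backing Service
```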

SLIDE 66

Integration and deployment

Tests pass (“OK!”); the image is rebuilt from its layers (base image, configuration and installation recipes, application code) and deployed.

SLIDE 70

Data drift

SLIDE 72

On-demand discovery with the Open Data Hub

SLIDE 73

SLIDE 74

SLIDE 75

SLIDE 76

More storage, sensitive data, more CPUs, better GPUs

SLIDE 77

https://opendatahub.io

SLIDE 78

Open Data Hub components: PostgreSQL, MariaDB, Apache Spark SQL, Apache Kafka (via Strimzi), Red Hat Ceph Storage, TensorFlow Serving, PyTorch Serving, Seldon, Spark, Katib, TF Job, PyTorch, Argo, Kubeflow Pipelines, OpenShift, JupyterHub, Apache Superset, Grafana, Prometheus

SLIDE 79

The machine learning workflow, revisited: the feature engineering, model training and tuning, and model validation stages can be automated with OpenShift Pipelines.

SLIDE 83

REST endpoint with OpenShift Serverless
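With OpenShift Serverless (Knative Serving), a model server can sit behind a scale-to-zero REST endpoint. A sketch with illustrative names and image:

```yaml
# Knative Service: a REST endpoint that scales with request load.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: model-endpoint            # illustrative
spec:
  template:
    spec:
      containers:
        - image: quay.io/example/model-server:latest  # illustrative image
          ports:
            - containerPort: 8080  # HTTP port serving predictions
```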

SLIDE 84

Further resources

Open Data Hub web site: https://opendatahub.io
Contribute: https://github.com/opendatahub-io
Get involved: https://gitlab.com/opendatahub/opendatahub-community
ML workflows on OpenShift and Open Data Hub: https://bit.ly/ml-workflows-ocp

SLIDE 85

SLIDE 86

Thank you!