AI WEBINAR
Kubernetes & AI with Run:AI, Red Hat & Excelero
Date/Time: Tuesday, June 9 | 9 am PST
What’s next in technology and innovation?
Presenter: Omri Geller, CEO & Co-Founder
Presenter: William Benton, Engineering Manager
Presenter: Gil Vitzinger, Software Developer
Your Host: Tom Leyden, VP Marketing
Omri Geller, CEO and co-founder, Run:AI
A Bit of History
Containers scale easily, are lightweight and efficient, can run any workload, are flexible, and can be isolated. But they need orchestration.
Bare Metal
Needed flexibility and better utilization
Virtual Machines
Reproducibility and portability
Containers
Track, Schedule and Operationalize
Enter Kubernetes
Execute Across Different Hardware
Create Efficient Cluster Utilization
Today, 60% of Those Who Deploy Containers Use K8s for Orchestration*
*CNCF
Computing Power Fuels Development of AI
Manual Engineering
Classical Machine Learning
Deep Learning
Artificial Intelligence is a Completely Different Ballgame
Experimentation R&D
New accelerators
Distributed computing
Constant hassles
Data Science Workflows and Hardware Accelerators are Highly Coupled
Data scientists: workflow limitations
Hardware accelerators: under-utilized GPUs
This Leads to Frustration on Both Sides
Data Scientists are frustrated – speed and productivity are low
IT leaders are frustrated – GPU utilization is low
Container ecosystem for Data Science is growing
AI Workloads are Also Built on Containers
NGC – NVIDIA pre-trained models for AI experimentation on Docker containers
How Can We Bridge The Divide?
Kubernetes, the “De-facto” Standard for Container Orchestration
Lacks the following capabilities:
Multiple queues
Automatic queueing/de-queueing
Advanced priorities & policies
Advanced scheduling algorithms
Affinity-aware scheduling
Efficient management of distributed workloads
Build Training
How is Experimentation Different?
Build Training
Distinguishing Between Build and Training Workflows
How to Solve?
Fixed quotas
Guaranteed quotas
Solution: Guaranteed Quotas
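The slide names the two quota models without defining them. The sketch below is one common reading, stated as an assumption: under fixed quotas a project can never exceed its quota, while under guaranteed quotas a project is always entitled to its quota and may borrow GPUs that other projects leave idle. The names Project, fixed_allocation and guaranteed_allocation are hypothetical, the quotas are assumed to sum to no more than the cluster size, and preemption of borrowed GPUs is not modeled. It is an illustration, not Run:AI's scheduler.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    quota: int   # GPUs the project is entitled to
    demand: int  # GPUs the project currently asks for


def fixed_allocation(projects):
    """Fixed quotas: a project never receives more than its quota."""
    return {p.name: min(p.demand, p.quota) for p in projects}


def guaranteed_allocation(projects, total_gpus):
    """Guaranteed quotas (assumed semantics): serve each project up to its
    quota first, then lend idle GPUs to projects that still have demand."""
    alloc = {p.name: min(p.demand, p.quota) for p in projects}
    idle = total_gpus - sum(alloc.values())
    for p in projects:  # leftovers handed out in list order, a simplification
        extra = min(idle, p.demand - alloc[p.name])
        alloc[p.name] += extra
        idle -= extra
    return alloc


projects = [
    Project("team-a", quota=4, demand=7),
    Project("team-b", quota=4, demand=1),
]
fixed = fixed_allocation(projects)
guaranteed = guaranteed_allocation(projects, total_gpus=8)
print(fixed)       # {'team-a': 4, 'team-b': 1}
print(guaranteed)  # {'team-a': 7, 'team-b': 1}
```

In this toy case the fixed model leaves 3 of 8 GPUs idle while team-a waits, and the guaranteed model puts all 8 to work.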
Queueing Management Mechanism
Run:AI - Stitching it All Together
Run:AI - Applying HPC Concepts to Kubernetes
With the advantages of K8s, plus some concepts from the world of HPC & distributed computing, we can bridge the gap
Data Science teams gain productivity and speed
IT teams gain visibility and maximal GPU utilization
Run:AI - Kubernetes-Based Abstraction Layer
INTEGRABLE: Easily integrates with IT and Data Science platforms
MULTI-CLOUD: Run on any public, private and hybrid cloud environment
IT GOVERNANCE: Policy based orchestration and queuing management
Utilize Kubernetes across IT to improve resource utilization
Speed up experimentation process and time to market
Easily scale infrastructure to meet needs of the business
From 28% to 73% utilization, 2X speed, and $1M savings
Challenge
28% AVERAGE GPU UTILIZATION - inefficient and underutilized resources
Solution: After implementing Run:AI’s platform
73% AVERAGE GPU UTILIZATION
expenditures for 2020
Run:AI at-a-Glance
Venture Funded
○ CSI - Container Storage Interface
○ NVMesh as a storage backend in Kubernetes
○ Static Provisioning
○ Dynamic Provisioning
○ Block and File System volumes
○ Access Modes (ReadWriteOnce, ReadWriteMany, ReadOnlyMany)
○ Extend volumes
○ Using NVMesh VPGs
[Architecture diagram. Components: NVMesh Management, NVMesh CSI Controller, Kubernetes Controller, NVMesh Targets, three NVMesh CSI Node Drivers, three NVMesh Clients. Connection label: REST API]
Create Volume
User creates a Persistent Volume Claim (PVC)
[Diagram components: Kubernetes Controller, NVMesh CSI Controller, NVMesh Management, NVMesh Targets]
Attach / Detach
User creates a POD that uses the PVC
[Diagram components: Kubernetes Controller, NVMesh CSI Controller, NVMesh Management, NVMesh Targets, and a Node with the NVMesh CSI Node Driver, the NVMesh Client and the User App PODs. Labels: /dev/nvmesh/v1, OS mount, K8s internal mount, POD mount, Data]
[Diagram: mounting volume v1 on a node for User App POD 1 and User App POD 2]
NVMesh Client: NVMesh attach, /dev/nvmesh/v1
CSI Stage Volume: once for each volume on the node
CSI Publish Volume: for each volume, for each POD
Operations shown: mkfs, mount, bind mount
Paths shown: kubelet/volume/mount, kubelet/pod1/volumes/v1, kubelet/pod2/volumes/v1
Volume types shown: FileSystem Volume, Block Volume
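The stage/publish split above can be illustrated with a toy model: staging happens once per volume on a node, publishing happens once per volume per pod. This is a sketch under stated assumptions, not the NVMesh CSI driver. NodeVolumeTracker and mount_for_pod are hypothetical names, and the comments on what each step does follow the usual CSI convention rather than anything confirmed in the slides.

```python
class NodeVolumeTracker:
    """Toy bookkeeping for one node (hypothetical, not the NVMesh driver)."""

    def __init__(self):
        self.staged = set()     # volumes already staged on this node
        self.published = set()  # (volume, pod) pairs already published
        self.log = []           # ordered record of the steps performed

    def mount_for_pod(self, volume, pod):
        # Stage: node-wide step, done only the first time a volume is used
        # on this node (by convention: attach, and mkfs/mount for file
        # system volumes).
        if volume not in self.staged:
            self.staged.add(volume)
            self.log.append(f"stage {volume}")
        # Publish: per-pod step (by convention: bind mount the staged
        # volume into the pod's own volume path).
        if (volume, pod) not in self.published:
            self.published.add((volume, pod))
            self.log.append(f"publish {volume} -> {pod}")


node = NodeVolumeTracker()
node.mount_for_pod("v1", "pod1")
node.mount_for_pod("v1", "pod2")
print(node.log)
# ['stage v1', 'publish v1 -> pod1', 'publish v1 -> pod2']
```

Two pods sharing v1 on the same node produce one stage step and two publish steps, which matches the "once for each volume" and "for each volume, for each POD" labels.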
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: block-pvc
spec:
  # The access mode entries were not preserved on the slide; supported
  # modes are ReadWriteOnce, ReadWriteMany and ReadOnlyMany.
  accessModes:
  volumeMode: Block
  resources:
    requests:
      storage: 15Gi
  storageClassName: nvmesh-raid10
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nvmesh-custom-vpg
provisioner: nvmesh-csi.excelero.com
parameters:
  vpg: your_custom_vpg
NVMesh Benefits for Kubernetes:
freedom to restart the pod on an alternate physical node
file system requirements
William Benton Engineering Manager and Senior Principal Engineer Red Hat, Inc.
codifying problem and metrics; data collection and cleaning; feature engineering; model training and tuning; model validation; model deployment; monitoring, validation
configuration; data collection; feature extraction; process management; analysis tools; monitoring; serving infrastructure; machine resource management; data verification
(Adapted from Sculley et al., “Hidden Technical Debt in Machine Learning Systems.” NIPS 2015)
[Pipeline diagram. Sources: events, databases, file and object storage. Stages: federate, transform, archive, train. Outputs: models, developer UI, web and mobile, reporting, management]
979229b9 33721112 e8cae4f6 2bb6ab16 a8296f7e a6afd91e 6b8cad3e model in production
https://route.my-awesome-app.ai
base image; configuration and installation recipes; application code
more storage; sensitive data; more CPUs; better GPUs
PostgreSQL, MariaDB, Apache Spark SQL, Apache Kafka (via Strimzi), Red Hat Ceph Storage, TensorFlow Serving, PyTorch Serving, Seldon, Spark, Katib, TF Job, PyTorch, Argo, Kubeflow Pipelines, OpenShift, JupyterHub, Apache Superset, Grafana, Prometheus
feature engineering; model training and tuning; model validation
OpenShift Pipelines
codifying problem and metrics; data collection and cleaning; model validation; model deployment; monitoring, validation
OpenShift Pipelines
REST endpoint OpenShift Serverless