From inception to insight: Accelerating AI productivity with GPUs - - PowerPoint PPT Presentation




SLIDE 1

From inception to insight: Accelerating AI productivity with GPUs

John Zedlewski, Director, RAPIDS Machine Learning @ NVIDIA
Ramesh Radhakrishnan, Technologist @ Server OCTO, Dell EMC

SLIDE 2

Data sizes continue to grow

Prototyping and production diverge

Large-scale cluster
  • Spark, Hadoop
  • High throughput
  • Full data

Workstation
  • Python
  • Fast iteration
  • Small data subset

Challenges!
  • "Tools gap": rewriting Python or R code as Spark/Hadoop jobs to scale to the cluster
  • High latency on the cluster leads to slower iteration
  • Small data subsets on the workstation make it hard to build realistic models

RAPIDS on GPU
  • RAPIDS + Dask
  • Consistent tools: workstation or cluster
  • High throughput / low latency
  • Full data or large subsets

SLIDE 3

Data Processing Evolution

Faster data access, less data movement

Hadoop Processing, reading from disk:
  HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train

Spark In-Memory Processing:
  HDFS Read → Query → ETL → ML Train
  25-100x improvement; less code; language flexible; primarily in-memory

Traditional GPU Processing:
  HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
  5-10x improvement; more code; language rigid; substantially on GPU

SLIDE 4

Data Movement and Transformation

The bane of productivity and performance

[Diagram: APP A and APP B each read and load data through the CPU, with repeated Copy & Convert steps between CPU memory and each app's own GPU data.]

SLIDE 5

Data Movement and Transformation

What if we could keep data on the GPU?

[Diagram: the same APP A / APP B pipeline, but with data staying in GPU memory, eliminating the Copy & Convert steps between CPU and GPU.]

SLIDE 6

Data Processing Evolution

Faster data access, less data movement

Hadoop Processing, reading from disk:
  HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train

Spark In-Memory Processing:
  HDFS Read → Query → ETL → ML Train
  25-100x improvement; less code; language flexible; primarily in-memory

Traditional GPU Processing:
  HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
  5-10x improvement; more code; language rigid; substantially on GPU

RAPIDS:
  Arrow Read → Query → ETL → ML Train
  50-100x improvement; same code; language flexible; primarily on GPU

SLIDE 7

RAPIDS

Scale up and out with accelerated GPU data science

Data preparation, model training, and visualization over shared GPU memory, orchestrated with Dask:
  • cuDF, cuIO: Analytics
  • cuML: Machine Learning
  • cuGraph: Graph Analytics
  • PyTorch, Chainer, MxNet: Deep Learning
  • cuXfilter <> pyViz: Visualization

SLIDE 8

RAPIDS

Scale up and out with accelerated GPU data science

Data preparation, model training, and visualization over shared GPU memory, orchestrated with Dask:
  • cuDF, cuIO (Pandas API): Analytics
  • cuML (sklearn API): Machine Learning
  • cuGraph (NetworkX API): Graph Analytics
  • PyTorch, Chainer, MxNet: Deep Learning
  • cuXfilter <> pyViz: Visualization

SLIDE 9

Scale up with RAPIDS

PyData: NumPy, Pandas, Scikit-Learn, Numba, and many more; single CPU core; in-memory data

Scale Up / Accelerate: RAPIDS and others, accelerated on a single GPU:
  • NumPy → CuPy/PyTorch/…
  • Pandas → cuDF
  • Scikit-Learn → cuML
  • Numba → Numba
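The NumPy → CuPy row is the key to "scale up": array code written once against the NumPy API can move to the GPU by swapping an import. A minimal sketch, shown here running on NumPy (the CuPy swap assumes CuPy is installed; `standardize` is an illustrative helper, not a RAPIDS API):

```python
import numpy as np
# On a GPU machine this could read `import cupy as np` (assuming CuPy is
# installed): CuPy mirrors the NumPy API, so the function body is unchanged.

def standardize(x):
    # Zero-mean, unit-variance scaling, written purely against the array API.
    return (x - x.mean()) / x.std()

data = np.array([1.0, 2.0, 3.0, 4.0])
print(standardize(data).sum())  # sums to ~0 by construction
```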

SLIDE 10

Scale out with RAPIDS + Dask with OpenUCX

PyData: NumPy, Pandas, Scikit-Learn, Numba, and many more; single CPU core; in-memory data

Scale Up / Accelerate: RAPIDS and others, accelerated on a single GPU:
  • NumPy → CuPy/PyTorch/…
  • Pandas → cuDF
  • Scikit-Learn → cuML
  • Numba → Numba

Scale Out / Parallelize: Dask, multi-core and distributed PyData:
  • NumPy → Dask Array
  • Pandas → Dask DataFrame
  • Scikit-Learn → Dask-ML
  • … → Dask Futures

RAPIDS + Dask with OpenUCX: multi-GPU on a single node (DGX) or across a cluster
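Under the hood, Dask expresses work as task graphs: at the lowest level these are plain dicts mapping keys to literal values or to `(function, *args)` tuples, which a scheduler walks in dependency order. A toy, stdlib-only evaluator sketching that model (single-threaded and recursive for clarity; Dask's real schedulers are parallel and far more sophisticated):

```python
from operator import add, mul

def get(graph, key):
    """Evaluate `key` in a Dask-style task graph (a simplified sketch)."""
    task = graph[key]
    if isinstance(task, tuple):  # a task: (function, *arguments)
        func, *args = task
        # Arguments that are graph keys are evaluated recursively;
        # anything else is treated as a literal value.
        return func(*(get(graph, a) if (isinstance(a, str) and a in graph) else a
                      for a in args))
    return task  # a literal value

graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),   # z = x + y
    "w": (mul, "z", 10),    # w = z * 10
}
print(get(graph, "w"))  # prints 30
```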

SLIDE 11

Faster Speeds, Real-World Benefits

cuIO/cuDF (Load and Data Preparation), Data Conversion, XGBoost; time in seconds (shorter is better)

Benchmark: 200GB CSV dataset; data prep includes joins and variable transformations

CPU Cluster Configuration: CPU nodes (61 GiB memory, 8 vCPUs, 64-bit platform), Apache Spark

DGX Cluster Configuration: 5x DGX-1 on InfiniBand network

[Chart: end-to-end times in seconds: 8762, 6148, 3925, 3221, 322, 213]

SLIDE 12

Dask + cuML

Machine Learning at Scale

SLIDE 13

Machine Learning

More models, more problems

[Diagram: the RAPIDS component stack from earlier (cuDF/cuIO, cuML, cuGraph, deep learning frameworks, cuXfilter <> pyViz over shared GPU memory, orchestrated with Dask), focusing on cuML.]

SLIDE 14

Algorithms

GPU-accelerated Scikit-Learn

  • Classification / Regression: Decision Trees / Random Forests, Linear Regression, Logistic Regression, K-Nearest Neighbors
  • Inference: Random Forest / GBDT inference
  • Clustering: K-Means, DBSCAN, Spectral Clustering
  • Decomposition & Dimensionality Reduction: Principal Components, Singular Value Decomposition, UMAP, Spectral Embedding
  • Time Series: Holt-Winters, Kalman Filtering
  • Cross Validation, Hyper-parameter Tuning

More to come!

SLIDE 15

RAPIDS matches common Python APIs

GPU-Accelerated Clustering:

import cudf
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)
X = cudf.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})

Find Clusters:

from cuml import DBSCAN

dbscan = DBSCAN(eps=0.3, min_samples=5)
y_hat = dbscan.fit_predict(X)  # DBSCAN has no separate predict(); fit_predict returns the labels
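Because cuML mirrors the scikit-learn API, the same clustering workflow also runs on CPU by swapping the import back to scikit-learn. A runnable CPU sketch (assumes scikit-learn is installed; the number of clusters found depends on the data and parameters):

```python
from sklearn.cluster import DBSCAN  # with cuML: `from cuml import DBSCAN`
from sklearn.datasets import make_moons

# Same data and parameters as the GPU example, on CPU.
X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)
dbscan = DBSCAN(eps=0.3, min_samples=5)
y_hat = dbscan.fit_predict(X)  # one cluster label per sample; -1 marks noise
print(len(set(y_hat.tolist()) - {-1}))  # number of clusters found
```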

SLIDE 16

Benchmarks: single-GPU cuML vs. scikit-learn

1x V100 vs. 2x 20-core CPU

SLIDE 17

Why Dask?

  • PyData Native
    • Built on top of NumPy, Pandas, Scikit-Learn, etc. (easy to migrate)
    • With the same APIs (easy to train)
    • With the same developer community (well trusted)
  • Scales
    • Easy to install and use on a laptop
    • Scales out to thousand-node clusters
  • Popular
    • Most common parallelism framework today at PyData and SciPy conferences
  • Deployable
    • HPC: SLURM, PBS, LSF, SGE
    • Cloud: Kubernetes
    • Hadoop/Spark: Yarn
SLIDE 18

Using Dask in Practice

Familiar Python APIs:

import pandas as pd
df = pd.read_csv("data-0.csv")  # pandas: one file at a time, one CPU core
df.groupby(df.user_id).value.mean()

import dask_cudf
df = dask_cudf.read_csv("data-*.csv")  # dask_cudf: many files, many GPUs
df.groupby(df.user_id).value.mean().compute()

Dask supports a variety of data structures and backends

SLIDE 19

A quick demo...

SLIDE 20

RAPIDS

How do I get the software?

  • https://ngc.nvidia.com/registry/nvidia-rapidsai-rapidsai
  • https://hub.docker.com/r/rapidsai/rapidsai/
  • https://github.com/rapidsai
  • https://anaconda.org/rapidsai/

SLIDE 21

AI is more than Model Training

  • DISCOVERY: Identify and define the business problem; capture business requirements
  • EXPLORATION: Exploratory data analysis; acquire, prepare & enrich data; preliminary ROI analysis
  • MODELING: Model development; modeling & performance evaluation
  • FINDINGS: Deliver insights & recommendations
  • OPERATIONALIZE: Deploy the solution at scale; measure business effectiveness & ROI; promote business enablement; promote user adoption; continuous model improvement

Source: Adapted from DellTech Data science solutions learnings & presentation

SLIDE 22

A vision for end-to-end ML journey – Fluid and Flexible

Dell Tech AI offerings: cloud to in-house, one vendor with no lock-in

[Diagram: a data preparation layer (public cloud storage, Dell EMC enterprise storage, Elastic Cloud Storage) and a software partner ecosystem (enterprise ISV ML software, open-source deep learning frameworks, accelerator virtualization and pooling), delivered through Dell Tech consulting services, system integrator partners, and direct & consultative sales, on a cloud foundation (public cloud & distributed edge, Ready Bundles for ML and DL infrastructure, hosted private cloud, hybrid cloud appliances) spanning private, public, and hybrid clouds via Boomi connectors.]

SLIDE 23

AI and Deep Learning Workstation Use Cases

[Diagram: NVIDIA GPU Cloud (NGC) container frameworks (ML, PyTorch, RAPIDS) running on Precision workstations with 1-3 GPUs, used as data science sandboxes for development and production; Isilon storage is either accessed in place or copied and staged locally.]

SLIDE 24

Pre-verified GPU-accelerated Deep Learning Platforms

Isilon: the foundation data lake for AI platforms

[Diagram: Isilon F, H, and A nodes (as appropriate) plus ECS object store backing NGC container frameworks (ML, PyTorch, RAPIDS) on Dell PowerEdge, hyper-converged systems, DGX-1 servers, Dell DSS8440, Precision workstations, and DGX-2 servers; Ready Solutions. Some platforms are currently pre-verified, others to be verified in H2 2019.]

White papers, best practices, performance tests, dimensioning guidelines

SLIDE 25

THANK YOU

Ramesh Radhakrishnan (TODO insert email / Twitter)
John Zedlewski, @zstats, jzedlewski@nvidia.com

SLIDE 26

Join the Movement

Everyone can help!

Integrations, feedback, documentation support, pull requests, new issues, and code donations are all welcome!

APACHE ARROW: https://arrow.apache.org/ @ApacheArrow

GPU Open Analytics Initiative: http://gpuopenanalytics.com/ @GPUOAI

RAPIDS: https://rapids.ai @RAPIDSAI

Dask: https://dask.org @Dask_dev

SLIDE 27

cuDF

SLIDE 28

RAPIDS

GPU-accelerated data wrangling and feature engineering

[Diagram: the RAPIDS component stack (cuDF/cuIO, cuML, cuGraph, deep learning frameworks, cuXfilter <> pyViz over shared GPU memory, orchestrated with Dask), focusing on cuDF/cuIO.]

SLIDE 29

GPU-Accelerated ETL

The average data scientist spends 90+% of their time in ETL as opposed to training models

SLIDE 30

ETL - the Backbone of Data Science

cuDF is…

Python Library
  • A Python library for manipulating GPU DataFrames following the Pandas API
  • Python interface to a CUDA C++ library with additional functionality
  • Creating GPU DataFrames from Numpy arrays, Pandas DataFrames, and PyArrow Tables
  • JIT compilation of User-Defined Functions (UDFs) using Numba
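The UDF path follows the familiar `apply` pattern. A pandas illustration of the call shape (in cuDF the same kind of element-wise function is JIT-compiled with Numba and executed on the GPU; the DataFrame and column names here are made up for the example):

```python
import pandas as pd  # cuDF mirrors this API for GPU DataFrames

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})
df["total"] = df["price"] * df["qty"]  # vectorized column arithmetic
# Element-wise UDF over a column; cuDF JIT-compiles functions like this.
df["discounted"] = df["total"].apply(lambda t: t * 0.9 if t > 25 else t)
print(df["discounted"].tolist())  # [10.0, 36.0, 81.0]
```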

SLIDE 31

Benchmarks: single-GPU Speedup vs. Pandas

cuDF v0.8, Pandas 0.23.4
Running on NVIDIA DGX-1; GPU: NVIDIA Tesla V100 32GB; CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz

[Chart: GPU (cuDF) speedup vs. CPU (Pandas)]

SLIDE 32

cuIO under the hood

Eliminate data extraction bottlenecks
  • Follow Pandas APIs and provide >10x speedup
  • CSV Reader - v0.2, CSV Writer - v0.8
  • Parquet Reader - v0.7, Parquet Writer - v0.10
  • ORC Reader - v0.7, ORC Writer - v0.10
  • JSON Reader - v0.8
  • Avro Reader - v0.9
  • GPU Direct Storage integration in progress for bypassing PCIe bottlenecks!
  • Key is GPU-accelerating both parsing and decompression wherever possible

Source: Apache Crail blog: SQL Performance: Part 1 - Input File Formats
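The reader APIs track their pandas counterparts, so call shapes carry over directly. A pandas sketch of the pattern (with cuDF, `pd.read_csv` becomes `cudf.read_csv` and parsing plus decompression run on the GPU; the in-memory CSV is a stand-in for a real file):

```python
import io
import pandas as pd  # with cuDF: `import cudf` and `cudf.read_csv(...)`

# In-memory stand-in for a CSV file on disk.
csv_file = io.StringIO("id,value\n1,10.5\n2,20.25\n3,30.75\n")
df = pd.read_csv(csv_file, dtype={"id": "int32", "value": "float64"})
print(float(df["value"].sum()))  # prints 61.5
```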

SLIDE 33

cuML

SLIDE 34

ML Technology Stack

Layers, top to bottom:
  • Python (Dask cuML, Dask cuDF, cuDF, Numpy)
  • Cython
  • cuML Algorithms
  • cuML Prims
  • CUDA Libraries (Thrust, Cub, cuSolver, nvGraph, CUTLASS, cuSparse, cuRand, cuBlas)
  • CUDA