cuML: A Library for GPU Accelerated Machine Learning

SLIDE 1

Onur Yilmaz, Ph.D. | oyilmaz@nvidia.com | Senior ML/DL Scientist and Engineer
Corey Nolet | cnolet@nvidia.com | Data Scientist and Senior Engineer

cuML: A Library for GPU Accelerated Machine Learning

SLIDE 2

About Us

Corey Nolet

  • Data Scientist & Senior Engineer on the RAPIDS cuML team at NVIDIA
  • Focuses on building and scaling machine learning algorithms to support extreme data loads at light-speed
  • Over a decade of experience building massive-scale exploratory data science & real-time analytics platforms for HPC environments in the defense industry
  • Working towards a PhD in Computer Science, focused on unsupervised representation learning

Onur Yilmaz, Ph.D.

  • Senior ML/DL Scientist and Engineer on the RAPIDS cuML team at NVIDIA
  • Focuses on building single and multi GPU machine learning algorithms to support extreme data loads at light-speed
  • Ph.D. in computer engineering, focusing on ML for finance

SLIDE 3

Agenda

  • Introduction to cuML
  • Architecture Overview
  • cuML Deep Dive
  • Benchmarks
  • cuML Roadmap
SLIDE 4

Introduction

“Details are confusing. It is only by selection, by elimination, by emphasis, that we get to the real meaning of things.”

~ Georgia O'Keeffe

Mother of American Modernism

SLIDE 5

Realities of Data

SLIDE 8

Problem

Data sizes continue to grow

min(variance) min(bias)

SLIDE 21

Problem

Data sizes continue to grow

  • Histograms / Distributions
  • Dimension Reduction
  • Feature Selection
  • Remove Outliers
  • Sampling

Massive Dataset

Better to start with as much data as possible and explore / preprocess to scale to performance needs.

  • Iterate. Cross Validate & Grid Search.

Iterate some more.

Meet reasonable speed vs accuracy tradeoff

Hours? Days?

Time Increases
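
To make the cost of this loop concrete, here is a minimal sketch in plain Python. It is not cuML code. It assumes a toy 1-D k-nearest-neighbors regressor and interleaved folds, and the helper names (`k_fold_indices`, `knn_predict`, `cv_error`, `grid_search`) are hypothetical. What it illustrates is that the number of train/evaluate rounds equals the grid size times the number of folds, which is one reason time increases as the search widens.

```python
def k_fold_indices(n_samples, n_folds):
    """Return a list of (train, test) index lists using interleaved folds."""
    folds = []
    for f in range(n_folds):
        test = [i for i in range(n_samples) if i % n_folds == f]
        train = [i for i in range(n_samples) if i % n_folds != f]
        folds.append((train, test))
    return folds


def knn_predict(train_x, train_y, x, k):
    """Predict by averaging the targets of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def cv_error(xs, ys, k, n_folds):
    """Mean squared error over all held-out points for one value of k."""
    total, count = 0.0, 0
    for train, test in k_fold_indices(len(xs), n_folds):
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        for i in test:
            total += (knn_predict(tx, ty, xs[i], k) - ys[i]) ** 2
            count += 1
    return total / count


def grid_search(xs, ys, k_values, n_folds):
    """Return (best_k, best_error, number of train/evaluate rounds)."""
    results = {k: cv_error(xs, ys, k, n_folds) for k in k_values}
    best_k = min(results, key=results.get)
    return best_k, results[best_k], len(k_values) * n_folds
```

With 4 candidate values and 4 folds this already means 16 train/evaluate rounds on the same data; each extra hyperparameter multiplies that count again.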

SLIDE 23

ML Workflow Stifles Innovation

It Requires Exploration and Iterations

Stages: Manage Data, Training, Evaluate, Deploy

Steps: All Data, ETL, Structured Data Store, Feature Engineering, Model Training, Tuning & Selection, Inference

Accelerating just `Model Training` helps, but it doesn’t address the whole problem.

End-to-end acceleration is needed.

Iterate … Cross Validate … Grid Search … Iterate some more.

SLIDE 24

Architecture

“More data requires better approaches!”

~ Xavier Amatriain

CTO, CurAI

SLIDE 25

RAPIDS: OPEN GPU DATA SCIENCE

cuDF, cuML, and cuGraph mimic well-known libraries

  • Workflow: Data Preparation, Model Training, Visualization
  • Stack: Python, Dask, DL Frameworks, cuDNN, RAPIDS (cuML, cuDF, cuGraph), Apache Arrow, CUDA
  • cuDF: Pandas-like
  • cuML: Scikit-Learn-like
  • cuGraph: NetworkX-like

SLIDE 26

HIGH-LEVEL APIs

  • Python: Dask Multi-GPU ML, Scikit-Learn-Like
  • CUDA/C++: ML Algorithms, ML Primitives, Multi-Node & Multi-GPU Communications
  • Libraries: Dask-cuML, cuML, libcuml
  • Cluster: Host 1 (GPU1, GPU2, GPU3, GPU4), Host 2 (GPU1, GPU2, GPU3, GPU4)

SLIDE 27

cuML API

GPU-accelerated machine learning at every layer

  • Python: Scikit-learn-like interface for data scientists utilizing cuDF & NumPy.
  • Algorithms: CUDA C++ API for developers to utilize accelerated machine learning algorithms.
  • Primitives: Reusable building blocks for composing machine learning algorithms.
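
As a rough illustration of what a "Scikit-learn-like interface" means at the Python layer, the sketch below defines a hypothetical `TinyLinearRegression` class in plain Python. It is not cuML's implementation. It only mirrors the scikit-learn calling convention: `fit(X, y)` returns the estimator, learned parameters are stored in attributes ending in an underscore (`coef_`, `intercept_`), and `predict(X)` takes rows of features. The one-feature restriction is an assumption made to keep the example short.

```python
class TinyLinearRegression:
    """Hypothetical one-feature OLS estimator with a scikit-learn-style API."""

    def fit(self, X, y):
        # X is a list of single-element rows, e.g. [[1.0], [2.0]].
        xs = [row[0] for row in X]
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(y) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (t - mean_y) for x, t in zip(xs, y))
        self.coef_ = [sxy / sxx]
        self.intercept_ = mean_y - self.coef_[0] * mean_x
        return self  # returning self allows Estimator().fit(X, y) chaining

    def predict(self, X):
        return [self.intercept_ + self.coef_[0] * row[0] for row in X]


X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
model = TinyLinearRegression().fit(X, y)
predictions = model.predict([[4.0], [5.0]])
```

Because the calling convention is shared, code written against it does not need to know which backend performs the fit.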

SLIDE 28

Primitives

GPU-accelerated math optimized for feature matrices

Categories: Linear Algebra, Statistics, Matrix / Math, Random, Distance / Metrics, Objective Functions, Sparse Conversions. More to come!

Linear Algebra

  • Element-wise operations
  • Matrix multiply
  • Norms
  • Eigen Decomposition
  • SVD/RSVD
  • Transpose
  • QR Decomposition

SLIDE 29

Algorithms

GPU-accelerated Scikit-Learn

  • Classification / Regression: Decision Trees / Random Forests, Linear Regression, Logistic Regression, K-Nearest Neighbors
  • Statistical Inference: Kalman Filtering, Bayesian Inference, Gaussian Mixture Models, Hidden Markov Models
  • Clustering: K-Means, DBSCAN, Spectral Clustering
  • Decomposition & Dimensionality Reduction: Principal Components, Singular Value Decomposition, UMAP, Spectral Embedding
  • Timeseries Forecasting: ARIMA, Holt-Winters
  • Recommendations: Implicit Matrix Factorization
  • Cross Validation
  • Hyper-parameter Tuning

More to come!

SLIDE 31

HIGH-LEVEL APIs

  • Python: Dask Multi-GPU ML, Scikit-Learn-Like
  • CUDA/C++: ML Algorithms, ML Primitives, Multi-Node / Multi-GPU Communications
  • Cluster: Host 1 (GPU1, GPU2, GPU3, GPU4), Host 2 (GPU1, GPU2, GPU3, GPU4)

Data Distribution, Model Parallelism

  • Portability
  • Efficiency
  • Speed
SLIDE 32

Dask cuML

Distributed Data-parallelism Layer

  • Distributed computation scheduler for Python
  • Scales up and out
  • Distributes data across processes
  • Enables model-parallel cuML algorithms
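
The data-parallel idea in the list above can be sketched without Dask or a GPU. In the plain-Python example below, each worker summarizes only its own partition and a final step merges the summaries. Threads stand in for Dask workers here, which is purely an assumption for illustration, and the function names (`partial_stats`, `combine`, `distributed_mean_var`) are hypothetical rather than part of Dask or cuML.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(chunk):
    """Map step: one worker summarizes its own partition."""
    return len(chunk), sum(chunk), sum(v * v for v in chunk)


def combine(stats):
    """Reduce step: merge per-partition summaries into global statistics."""
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return mean, variance


def distributed_mean_var(values, n_workers):
    # Partition the data round-robin, one chunk per (hypothetical) worker.
    chunks = [values[i::n_workers] for i in range(n_workers)]
    chunks = [c for c in chunks if c]  # drop empty partitions
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        stats = list(pool.map(partial_stats, chunks))
    return combine(stats)
```

Only three numbers per partition cross the worker boundary, no matter how many rows each partition holds, which is why this pattern scales to data that does not fit on one device.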
SLIDE 33

ML Technology Stack

  • Layers: Python, Cython, cuML Algorithms, cuML Prims, CUDA Libraries, CUDA
  • Python libraries: Dask cuML, Dask cuDF, cuDF, NumPy
  • CUDA libraries: Thrust, CUB, cuSolver, nvGraph, CUTLASS, cuSparse, cuRand, cuBLAS

SLIDE 34

cuML Deep Dive

“I would posit that every scientist is a data scientist.”

~ Arun Subramaniyan

V.P. of Data Science & Analytics, Baker Hughes, a GE Company

SLIDE 35

Linear Regression (OLS)

Python Layer

cuDF Pandas

SLIDE 37

Linear Regression (OLS)

Python Layer

cuML Scikit-Learn

SLIDE 40

Linear Regression (OLS)

cuML Algorithms CUDA C++ Layer

SLIDE 48

Linear Regression (OLS)

cuML ML-Prims CUDA C++ Layer

[Figure: Matrix A with columns c1, c2, c3, …, cN, and Vector b]
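
The figure pairs a feature matrix A with a target vector b, which is the usual setup for ordinary least squares: find weights w that minimize ||A w - b||^2. The transcript does not show which solver the ML-Prims layer uses, so the sketch below takes one textbook route as an assumption: form the normal equations A^T A w = A^T b and solve them by Gaussian elimination with partial pivoting. It is plain Python for readability, not CUDA, and all function names are hypothetical.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(m, rhs):
    """Solve m @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(m)
    aug = [list(m[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def ols(A, b):
    """Least-squares weights via the normal equations A^T A w = A^T b."""
    At = transpose(A)
    AtA = matmul(At, A)
    Atb = [sum(x * y for x, y in zip(row, b)) for row in At]
    return solve(AtA, Atb)


# First column of ones models the intercept; data follow b = 1 + 2 * x.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.0, 3.0, 5.0, 7.0]
weights = ols(A, b)
```

The heavy steps here are a matrix multiply and a small dense solve, which correspond to entries in the linear algebra primitives listed earlier in the deck. The normal equations can be numerically fragile for ill-conditioned A, so QR or SVD based solvers are common alternatives.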

SLIDE 49

Benchmarks

SLIDE 50

ALGORITHMS

Benchmarked on DGX-1

SLIDE 51

UMAP

Released in 0.6!

SLIDE 52

cuDF + XGBoost

DGX-2 vs Scale Out CPU Cluster

  • Full end to end pipeline
  • Leveraging Dask + cuDF
  • Store each GPU's results in system memory, then read them back in
  • Arrow to DMatrix (CSR) for XGBoost
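
The last bullet refers to CSR (compressed sparse row), the sparse layout the data are converted to before being handed to XGBoost. The sketch below shows only what that layout looks like, using plain Python lists. It does not use the Arrow or XGBoost APIs, and the function names are hypothetical.

```python
def dense_to_csr(rows):
    """Convert a dense row-major matrix into CSR arrays.

    data    : the non-zero values, row by row
    indices : the column index of each value in data
    indptr  : indptr[i]:indptr[i + 1] is the slice of data belonging to row i
    """
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr


def csr_row(data, indices, indptr, i, n_cols):
    """Rebuild dense row i from the CSR arrays."""
    row = [0] * n_cols
    for k in range(indptr[i], indptr[i + 1]):
        row[indices[k]] = data[k]
    return row


dense = [
    [0, 2, 0],
    [1, 0, 3],
    [0, 0, 0],
]
data, indices, indptr = dense_to_csr(dense)
```

An all-zero row costs only one repeated entry in `indptr`, which is why the format suits wide, sparse feature matrices.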
SLIDE 53

cuDF + XGBoost

Scale Out GPU Cluster vs DGX-2

[Chart: time in seconds for 5x DGX-1 and DGX-2, split into ETL+CSV, ML Prep, and ML; axis from 50 to 350]

  • Full end to end pipeline
  • Leveraging Dask for multi-node + cuDF
  • Store each GPU's results in system memory, then read them back in
  • Arrow to DMatrix (CSR) for XGBoost
SLIDE 54

cuDF + XGBoost

Fully In-GPU Benchmarks

  • Full end to end pipeline
  • Leveraging Dask cuDF
  • No data prep time; all in memory
  • Arrow to DMatrix (CSR) for XGBoost
SLIDE 55

XGBoost Multi-node, Multi-GPU Performance

Chart values by configuration (axis range 0 to 2500):

  • 20 CPU Nodes: 2290
  • 30 CPU Nodes: 1956
  • 50 CPU Nodes: 1999
  • 100 CPU Nodes: 1948
  • DGX-2: 169
  • 5x DGX-1: 157

Benchmark

200GB CSV dataset; Data preparation includes joins, variable transformations.

CPU Cluster Configuration

CPU nodes (61 GiB of memory, 8 vCPUs, 64-bit platform), Apache Spark

DGX Cluster Configuration

DGX nodes on InfiniBand network

SLIDE 56

Single Node Multi-GPU

Linear Regression

  • Reduction: 40 min -> 1 min
  • Size: 225 GB
  • System: DGX-2

tSVD

  • Reduction: 1.6 hrs -> 1.5 min
  • Size: 220 GB
  • System: DGX-2

Nearest Neighbors

  • Reduction: 4+ hrs -> 30 sec
  • Size: 128 GB
  • System: DGX-1

Will be Released in 0.6
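
The before/after times on this slide use mixed units, so the implied speedups are easier to compare once everything is converted to seconds. The short sketch below does that arithmetic. The figures come from the slide; the helper names are hypothetical, and "4+ hrs" is treated as exactly 4 hours, so that entry is a lower bound.

```python
def to_seconds(value, unit):
    factors = {"sec": 1, "min": 60, "hr": 3600}
    return value * factors[unit]


def speedup(before, after):
    """Ratio of the original runtime to the accelerated runtime."""
    return to_seconds(*before) / to_seconds(*after)


benchmarks = {
    "Linear Regression": ((40, "min"), (1, "min")),
    "tSVD": ((1.6, "hr"), (1.5, "min")),
    "Nearest Neighbors": ((4, "hr"), (30, "sec")),  # "4+ hrs": lower bound
}

speedups = {name: speedup(b, a) for name, (b, a) in benchmarks.items()}
```

This works out to roughly 40x for Linear Regression, 64x for tSVD, and at least 480x for Nearest Neighbors.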

SLIDE 57

Roadmap

“Data science is the fourth pillar of the scientific method!”

~ Jensen Huang

SLIDE 58

CUML

Single GPU and XGBoost

Table columns: cuML, SG, MG, MGMN

  • Gradient Boosted Decision Trees (GBDT)
  • GLM
  • Logistic Regression
  • Random Forest (regression)
  • K-Means
  • K-NN
  • DBSCAN
  • UMAP
  • ARIMA
  • Kalman Filter
  • Holt-Winters
  • Principal Components
  • Singular Value Decomposition

SLIDE 59

DASK-CUML

OLS, tSVD, and KNN in RAPIDS 0.6

Table columns: cuML, SG, MG, MGMN

  • Gradient Boosted Decision Trees (GBDT)
  • GLM
  • Logistic Regression
  • Random Forest (regression)
  • K-Means
  • K-NN
  • DBSCAN
  • UMAP
  • ARIMA
  • Kalman Filter
  • Holt-Winters
  • Principal Components
  • Singular Value Decomposition

SLIDE 60

DASK-CUML

K-Means*, DBSCAN & PCA in RAPIDS 0.7/0.8

Table columns: cuML, SG, MG, MGMN

  • Gradient Boosted Decision Trees (GBDT)
  • GLM
  • Logistic Regression
  • Random Forest (regression)
  • K-Means
  • K-NN
  • DBSCAN
  • UMAP
  • ARIMA
  • Kalman Filter
  • Holt-Winters
  • Principal Components
  • Singular Value Decomposition

* Deprecating the current K-Means in 0.6 in favor of a new K-Means built on ML-Prims
SLIDE 61

cuML 0.6

New Algorithms

  • Stochastic Gradient Descent [Single GPU]
  • UMAP [Single GPU]
  • Linear Regression (OLS) [Single Node, Multi-GPU]
  • Truncated SVD [Single Node, Multi-GPU]

Will be released with RAPIDS 0.6 on Friday!

Notable Improvements

  • Exposing support for hyperparameter tuning
  • Removing external requirement on FAISS
  • Lowered Nearest Neighbors memory requirement
SLIDE 62

Thank you!

https://rapids.ai
https://github.com/cuml
https://github.com/dask-cuml

Corey Nolet: @cjnolet
Onur Yilmaz: @Onur02128993