

SLIDE 1

Classical Machine Learning At Scale

Thomas Parnell, Research Staff Member, Data and AI Systems, IBM Research - Zurich

SLIDE 2

Motivation

  • 1. Why do classical machine learning models dominate in many applications?
  • 2. Which classical machine learning workloads might benefit from being deployed in HPC-like environments?

Source: Kaggle Data Science Survey, November 2019

SLIDE 3

Why is classical ML still popular?

§ Deep neural networks dominate machine learning research, and have achieved state-of-the-art accuracy on a number of different tasks:
  – Image classification
  – Natural language processing
  – Speech recognition
§ However, in many industries such as finance and retail, classical machine learning techniques are still widely used in production. Why?
§ The reason is primarily the data itself.
§ Rather than images, natural language or speech, real-world data often looks like…

SLIDE 4

Tabular Data

§ Datasets have a tabular structure and contain a lot of categorical variables.
§ DNNs require feature engineering / embeddings.
§ Whereas, a number of classical ML models can deal with them "out of the box".

Source: https://towardsdatascience.com/encoding-categorical-features-21a2651a065c
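To make the contrast concrete, here is a minimal sketch (not from the slides) on invented toy data: for a GLM the categorical column must be feature-engineered, e.g. one-hot encoded, whereas a tree-based model such as scikit-learn's HistGradientBoostingClassifier can consume integer-encoded categories directly via its categorical_features option.

```python
# Hedged sketch: toy tabular data with one categorical feature (zip code)
# and one numeric feature (age); the data and column meanings are invented.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import HistGradientBoostingClassifier

zip_code = np.array([["8050"], ["8001"], ["8050"], ["8400"]])
age = np.array([[25.0], [42.0], [31.0], [55.0]])
y = np.array([1, 0, 1, 0])

# GLM route: expand the categorical column via one-hot encoding first.
X_linear = np.hstack([OneHotEncoder().fit_transform(zip_code).toarray(), age])
glm = LogisticRegression().fit(X_linear, y)

# Tree route: integer-encoded categories are handled natively, no expansion.
zip_as_int = np.array([[0.0], [1.0], [0.0], [2.0]])
X_tree = np.hstack([zip_as_int, age])
gbm = HistGradientBoostingClassifier(categorical_features=[0]).fit(X_tree, y)
```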

SLIDE 5

Classical ML Models

GLMs, Trees, Forests and Boosting Machines

SLIDE 6

Generalized Linear Models

Pros:
✓ Simple and fast.
✓ Scale well to huge datasets.
✓ Easy to interpret.
✓ Very few hyper-parameters.

Cons:
✗ Cannot learn non-linear relationships between features.
✗ Require extensive feature engineering.

[Diagram: Generalized Linear Models taxonomy. Regression: Ridge Regression, Lasso Regression. Classification: Support Vector Machines, Logistic Regression.]

SLIDE 7

Decision Trees

Pros:
✓ Simple and fast.
✓ Easy to interpret.
✓ Capture non-linear relationships between features.
✓ Native support for categorical variables.

Cons:
✗ Greedy training algorithm.
✗ Can easily overfit the training data.

[Diagram: example decision tree. Root split: Age > 30 (YES/NO); second split: Zip Code == 8050 (YES/NO); leaves predict +1 or -1.]

SLIDE 8

Random Forests

Pros:
✓ Inherits most benefits of decision trees.
✓ Improve generalization via bootstrap sampling + averaging.
✓ Embarrassingly parallel.

Cons:
✗ Somewhat heuristic.
✗ Computationally intense.
✗ Harder to interpret.

Source: https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d

SLIDE 9

Gradient Boosting Machines

Pros:
✓ Inherits most benefits of decision trees.
✓ State-of-the-art generalization.
✓ Theoretically elegant training algorithm.

Cons:
✗ Computationally intense.
✗ Inherently sequential.
✗ Harder to interpret.
✗ A lot of hyper-parameters.

Source: https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-python/boosting?ex=5

SLIDE 10

Distributed Training

Data Parallel vs. Model Parallel

SLIDE 11

Why scale-out?

1. Very large data (e.g. 1+ TB)
  • Dataset does not fit inside the memory of a single machine.
  • The dataset may be stored in a distributed filesystem.
  • Data-parallel training algorithms are a necessity, even for relatively simple linear models.

2. Training acceleration
  • Dataset may fit inside the memory of a single node.
  • However, the model may be very complex (e.g. an RF with 10k trees).
  • We choose to scale-out using model-parallel algorithms to accelerate training.

We will now consider two examples of the above scenarios.

SLIDE 12

Training GLMs on Big Data

§ Training GLMs involves solving an optimization problem of the following form:

$$\min_{\beta} \; f(A\beta) + \sum_{i} g_i(\beta_i)$$

§ Here β denotes the model we would like to learn, A denotes the data matrix, and f and gᵢ denote convex functions specifying the loss and regularization, respectively.
§ We assume that the data matrix A is partitioned across a set of worker machines.
§ One way to solve the above is to use standard mini-batch stochastic gradient descent (SGD), widely used in the deep learning field.
§ However, since the cost of computing gradients for linear models is typically cheap relative to the cost of communication over the network, mini-batch SGD often performs poorly.
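To ground the notation, the following numpy sketch (an illustration, not from the slides) instantiates the objective with a logistic loss f and an L2 regularizer gᵢ(b) = (λ/2)b², and performs one mini-batch SGD step; the data, batch size and step size are invented.

```python
# Hedged sketch: GLM objective min_beta f(A @ beta) + sum_i g_i(beta_i)
# with logistic loss and L2 regularization, plus one mini-batch SGD step.
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 1000, 20, 0.1
A = rng.standard_normal((n, m))        # data matrix: n examples, m features
y = rng.choice([-1.0, 1.0], size=n)    # binary labels in {-1, +1}
beta = np.zeros(m)                     # model to be learned

def objective(beta):
    margins = y * (A @ beta)
    return np.logaddexp(0.0, -margins).sum() + 0.5 * lam * np.sum(beta**2)

def sgd_step(beta, batch, lr=0.01):
    Ab, yb = A[batch], y[batch]
    margins = yb * (Ab @ beta)
    # Gradient of the logistic loss on the mini-batch, plus regularization.
    grad = -(Ab.T @ (yb / (1.0 + np.exp(margins)))) + lam * beta
    return beta - lr * grad

batch = rng.choice(n, size=32, replace=False)   # gradients are cheap to compute
beta = sgd_step(beta, batch)
print(f"objective after one step: {objective(beta):.3f}")
```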

SLIDE 13

CoCoA Framework

§ Let us assume that the data matrix A is partitioned across workers by column (feature).
§ The CoCoA framework (Smith et al. 2018) defines a data-local subproblem:

$$\min_{\beta_{[l]}} \; \mathcal{F}_l\left(A_{[l]}, \beta_{[l]}, w\right)$$

§ Each worker solves its local subproblem with respect to its local model coordinates β[l].
§ This subproblem depends only on the local data A[l] as well as some shared state w.
§ An arbitrary algorithm can be used to solve the sub-problem in an approximate way.
§ The shared state is then updated across all workers, and the process repeats.
§ This method is theoretically guaranteed to converge to the optimal solution and allows one to trade off the ratio of computation vs. communication much more effectively.
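As a rough illustration of this loop (not the implementation from the paper), the following single-process numpy sketch simulates CoCoA for ridge regression: each simulated worker owns a column block A[l], takes a few gradient steps on its local subproblem against the shared state w = Aβ, and the local updates are then aggregated, standing in for an AllReduce. The local solver, step size and iteration counts are illustrative assumptions.

```python
# Hedged sketch: single-process simulation of a CoCoA-style outer loop for
# ridge regression; in reality each block below lives on a separate worker.
import numpy as np

rng = np.random.default_rng(0)
n, m, K, lam, lr = 200, 40, 4, 0.1, 1e-3
A = rng.standard_normal((n, m))
y = rng.standard_normal(n)
blocks = np.array_split(np.arange(m), K)  # column (feature) partitioning
beta = np.zeros(m)
w = A @ beta                              # shared state, replicated everywhere

for outer_round in range(50):
    deltas = []
    for cols in blocks:                   # one worker per block, parallel in reality
        b_local = beta[cols].copy()
        for _ in range(10):               # approximately solve the local subproblem
            residual = w + A[:, cols] @ (b_local - beta[cols]) - y
            grad = A[:, cols].T @ residual + lam * b_local
            b_local -= lr * grad
        deltas.append((cols, b_local - beta[cols]))
    for cols, d in deltas:                # "AllReduce": aggregate all local updates
        beta[cols] += d
        w += A[:, cols] @ d
    # (CoCoA additionally scales the aggregated updates; omitted in this toy sketch.)
```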

slide-14
SLIDE 14

Worker 0 Data Partition 0 Local Solver 𝛽[;] 𝑤(;) AllReduce Local Solver 𝑤

Worker 1 Data Partition 1 Local Solver 𝛽[<] 𝑤(<) AllReduce Local Solver 𝑤

Worker 2 Data Partition 2 Local Solver 𝛽[=] 𝑤(=) AllReduce Local Solver 𝑤

Worker 3 Data Partition 3 Local Solver 𝛽[>] 𝑤(>) AllReduce Local Solver 𝑤

Distributed Training using CoCoA

SLIDE 15

Duality

§ Many GLMs admit two equivalent representations: primal and dual.
§ CoCoA can be applied to either.
§ Primal case:
  – Partition the data by column (feature)
  – β has dimension m
  – w has dimension n
  – Minimal communication when m >> n
§ Dual case:
  – Partition the data by row (example)
  – β has dimension n
  – w has dimension m
  – Minimal communication when n >> m

[Diagram: mapping between the primal problem P and the dual problem D.]

SLIDE 16

Real Example

Dataset: Criteo TB Click Logs (4 billion examples)
Model: Logistic Regression

[Chart: test LogLoss (0.128 to 0.133) vs. training time in minutes (log scale, 1 to 10000), comparing mini-batch SGD and CoCoA systems: LIBLINEAR (1 core), Vowpal Wabbit (12 cores), Spark MLlib (512 cores), TensorFlow (60 worker machines, 29 parameter machines), Snap ML (16 V100 GPUs), TensorFlow (16 V100 GPUs), TensorFlow on Spark (12 executors).]

Snap ML (Dünner et al. 2018) uses a variant of CoCoA + new algorithms for effectively utilizing GPUs + an efficient MPI implementation.

SLIDE 17

Model-parallel Random Forests

§ Scenario: the dataset fits in the memory of a single node.
§ We wish to build a very large forest of trees (e.g. 4000).
§ Replicate the training dataset across the cluster.
§ Each worker builds a partition of the trees, in parallel.
§ Embarrassingly parallel: expect linear speed-up for large enough models.

[Diagram: workers 0-3, each holding a full copy of the dataset, build trees 0-999, 1000-1999, 2000-2999 and 3000-3999 respectively.]
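A minimal single-node sketch of this scheme (an illustration, not the slides' implementation): joblib processes stand in for cluster workers, each training a slice of the forest on the full dataset; the dataset, tree counts and the averaging of predicted probabilities are illustrative choices.

```python
# Hedged sketch: model-parallel random forest, one sub-forest per "worker".
import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
n_workers = 4

def build_slice(X, y, seed):
    # Each worker trains its own slice of trees on a replica of the dataset;
    # distinct seeds keep the slices independent.
    return RandomForestClassifier(n_estimators=1000, random_state=seed).fit(X, y)

forests = Parallel(n_jobs=n_workers)(
    delayed(build_slice)(X, y, seed) for seed in range(n_workers)
)

# Combine the sub-forests by averaging their predicted class probabilities.
proba = np.mean([f.predict_proba(X) for f in forests], axis=0)
```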


SLIDE 18

Scaling Example

SLIDE 19

Distributed Tree Building

§ What if the dataset is too large to fit in the memory of a single node?
§ Partition the dataset across workers in the cluster.
§ Build each tree in the forest in a distributed way.
§ Tree building requires a lot of communication, and scales badly.
§ Can we do something truly data parallel?

[Diagram: workers 0-3, each holding a data partition, all cooperating to build every tree (Build Tree 0, Build Tree 1, …).]

SLIDE 20

Data-parallel + model-parallel Random Forest

§ In a random forest, each tree is trained on a bootstrap sample of the training data.
§ What if we relax this constraint? Instead, we could train each tree on a random partition.
§ We can thus randomly partition the data across the workers in the cluster.
§ And then train a partition of the trees independently on each worker, on a partition of the data.
§ This approach can achieve super-linear scaling, possibly at the expense of accuracy.

[Diagram: workers 0-3 train trees 0-999, 1000-1999, 2000-2999 and 3000-3999 on data partitions 0-3 respectively.]
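A minimal single-node sketch of the relaxed scheme (again illustrative): the rows are randomly partitioned, each sub-forest is trained only on its own partition, and predictions are averaged across sub-forests.

```python
# Hedged sketch: train each worker's sub-forest on a random row partition.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
n_workers = 4
perm = np.random.default_rng(0).permutation(len(X))
row_parts = np.array_split(perm, n_workers)   # random row partitioning

forests = [
    RandomForestClassifier(n_estimators=1000, random_state=i).fit(X[rows], y[rows])
    for i, rows in enumerate(row_parts)       # in reality, one worker per partition
]
proba = np.mean([f.predict_proba(X) for f in forests], axis=0)
```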


SLIDE 21

Accuracy Trade-Off

Dataset: Rossmann Store Sales (800k examples, 20 features)
Model: Random Forest, 100 trees, depth 8, 10 repetitions

Accuracy degrades fairly slowly up to ~10 partitions, but degrades quickly as we approach ~100 partitions.

SLIDE 22

Hyper-parameter tuning

Random Search, Successive Halving and Hyperband

SLIDE 23

Hyper-parameter Tuning

§ GBM-like models have a large number of hyper-parameters:
  – Number of boosting rounds.
  – Learning rate.
  – Subsampling (example and feature) rates.
  – Maximum tree depth.
  – Regularization penalties.
§ The standard approach is to split the training set into an effective training set and a validation set.
§ The validation set is used to evaluate the accuracy for different choices of hyper-parameters.
§ Many different algorithms exist for hyper-parameter tuning (HPT).
§ However, all involve evaluating a large number (e.g. 1000s) of configurations.

→ HPT can lead to HPC-scale workloads even for relatively small datasets.

§ We will now introduce 3 HPT methods that are well-suited for HPC environments.

SLIDE 24

Random Search

§ Random search is perhaps the simplest HPT method.
§ It works as follows:
  – Draw N hyper-parameter configurations at random.
  – Train each one on the training set.
  – Evaluate (or score) each one on the validation set (e.g. measure loss or ROC AUC).
  – Select the configuration that provides the lowest score.
§ Clearly, random search is embarrassingly parallel and can be very effectively parallelized across a large cluster of machines.
§ An optimal implementation requires a little thought, since some configurations may take much longer to train than others (tip: avoid batch synchronization).
§ Despite its simplicity, when used as a baseline, random search is often very hard to beat, especially for high-dimensional problems (Li and Talwalkar 2019).
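A minimal sketch of random search for a GBM-like model (illustrative, not from the slides), assuming a train/validation split and log loss as the score; the search space and its distributions are invented for the example.

```python
# Hedged sketch: random search over GBM hyper-parameters; each evaluation is
# independent, so the loop parallelizes trivially across machines.
import random
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
rng = random.Random(0)

def draw_config():
    # Invented search space; real spaces are model- and problem-specific.
    return {
        "n_estimators": rng.choice([50, 100, 200]),
        "learning_rate": 10 ** rng.uniform(-3, 0),
        "max_depth": rng.randint(2, 8),
        "subsample": rng.uniform(0.5, 1.0),
    }

def evaluate(cfg):
    model = GradientBoostingClassifier(random_state=0, **cfg).fit(X_tr, y_tr)
    return log_loss(y_val, model.predict_proba(X_val))

configs = [draw_config() for _ in range(20)]  # N random configurations
best = min(configs, key=evaluate)             # lowest validation loss wins
```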


SLIDE 25

Successive Halving

§ Random search may waste a lot of time training and evaluating bad configurations.
§ Is there a way to discard bad configurations more quickly?
§ Successive Halving (Jamieson and Nowak 2014) introduces the notion of a resource:
  – Number of gradient descent steps.
  – Number of boosting rounds.
  – Size of random subsample of training dataset.
§ Main idea:
  – Evaluate a large number of configurations, quickly, with a small resource.
  – Carry forward only the best ones for evaluation with a larger resource.

SLIDE 26

Worked Example

Round 0: Draw n₀ configurations at random; evaluate them (in parallel) with resource s₀.
Round 1: Take the best n₁ = n₀/θ configurations; evaluate them (in parallel) with resource s₁ = s₀·θ.
…
Round s: Take the best nₛ = n₀/θˢ configurations; evaluate them (in parallel) with resource sₛ = S.
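A minimal sketch of these rounds (illustrative), using the number of boosting rounds as the resource, one of the resource types listed above; n₀, s₀, θ and S are small invented values and the model and scoring choices are assumptions.

```python
# Hedged sketch: successive halving with boosting rounds as the resource.
import random
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
rng = random.Random(0)

def score(cfg, resource):
    # Train with `resource` boosting rounds; return the validation loss.
    model = GradientBoostingClassifier(n_estimators=resource, random_state=0, **cfg)
    return log_loss(y_val, model.fit(X_tr, y_tr).predict_proba(X_val))

n0, s0, theta, S = 9, 4, 3, 36                # invented example values
configs = [{"max_depth": rng.randint(2, 8),
            "learning_rate": 10 ** rng.uniform(-2, 0)} for _ in range(n0)]

n, s = n0, s0
while True:
    ranked = sorted(configs, key=lambda c: score(c, s))  # parallel in reality
    if s >= S or len(ranked) == 1:
        best = ranked[0]
        break
    configs = ranked[: max(1, n // theta)]    # carry forward the best n/theta configs
    n, s = len(configs), min(s * theta, S)    # grow the resource by a factor theta
```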

SLIDE 27

When can SH go wrong?


Source: Sommer et al. “Learning to tune XGBoost with XGBoost”, MetaLearn 2019

SLIDE 28

Hyperband (Li et al. 2018)

§ Main idea: create multiple "brackets" of successive halving, each one getting progressively more exploitative rather than explorative.

Example with θ = 10 and R = 100:

| j | Bracket 0 (nⱼ, sⱼ) | Bracket 1 (nⱼ, sⱼ) | Bracket 2 (nⱼ, sⱼ) |
|---|--------------------|--------------------|--------------------|
| 0 | 100, 1             | 15, 10             | 3, 100             |
| 1 | 10, 10             | 1, 100             |                    |
| 2 | 1, 100             |                    |                    |

§ The algorithm is very simple: run all brackets in parallel and output the best configuration found.
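The bracket schedule itself is pure bookkeeping. The following sketch (a reconstruction using the standard Hyperband formulas, not code from the slides) computes (nⱼ, sⱼ) per round for each bracket and reproduces the θ = 10, R = 100 table above; it performs no training.

```python
# Hedged sketch: compute the Hyperband bracket schedule (no training).
from math import ceil, floor, log

def hyperband_schedule(R, theta):
    s_max = floor(log(R, theta))               # number of brackets minus one
    for bracket in range(s_max, -1, -1):       # most explorative bracket first
        n = ceil((s_max + 1) / (bracket + 1) * theta ** bracket)
        s = R * theta ** (-bracket)            # initial per-config resource
        yield [(floor(n / theta ** j), round(s * theta ** j))
               for j in range(bracket + 1)]    # successive-halving rounds

for i, rounds in enumerate(hyperband_schedule(R=100, theta=10)):
    print(f"Bracket {i}: {rounds}")
# Bracket 0: [(100, 1), (10, 10), (1, 100)]
# Bracket 1: [(15, 10), (1, 100)]
# Bracket 2: [(3, 100)]
```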

SLIDE 29

Hyperband (Example)

§ Hyperband applied to an HPT task for a kernel-based classifier on CIFAR-10.
§ Hyperband significantly outperforms random search.
§ It also outperforms Bayesian optimization techniques such as SMAC and TPE (which are much harder to parallelize).

Source: Li et al., "Hyperband: A Novel Bandit-Based Approach for Hyperparameter Optimization", JMLR 2018

SLIDE 30

Conclusions

§ Classical machine learning methods still reign supreme in domains where tabular data is abundant.
§ Training may be distributed across a large cluster when either:
  – the dataset is extremely large, but the model is simple,
  – the dataset is relatively small, but the model is complex,
  – …or both!
§ For models like GBMs, with a large number of hyper-parameters, the HPT task can become extremely computationally intensive, even for small datasets.
§ Recent advances in highly-parallel HPT algorithms show a lot of promise and may benefit from being deployed on HPC-scale clusters.