SLIDE 1

High-Performance Distributed Machine Learning in Heterogeneous Compute Environments

Celestine Dünner

Martin Jaggi (EPFL) Thomas Parnell, Kubilay Atasu, Manolis Sifalakis, Dimitrios Sarigiannis, Haris Pozidis (IBM Research)

SLIDE 2

Distributed Training of Large-Scale Linear Models

Motivation of Our Work

  • fast training
  • interpretable models
  • training on large-scale datasets

SLIDE 3

How do the infrastructure and the implementation impact the performance of the algorithm?

Motivation of Our Work

[Diagram: Distributed Training of Large-Scale Linear Models. Choose an Algorithm → Choose an Implementation → Choose an Infrastructure → ?]

SLIDE 4

How can algorithms be optimized and implemented to achieve optimal performance on a given system?

Motivation of Our Work

[Diagram: Distributed Training of Large-Scale Linear Models. Choose an Algorithm → Choose an Implementation → Choose an Infrastructure → ?]

SLIDE 5

Algorithmic Challenge of Distributed Learning

min_x  g(Bᵀx) + h(x)

SLIDE 6

Algorithmic Challenge of Distributed Learning

min_x  g(Bᵀx) + h(x)

[Diagram: the data is partitioned across workers; local models are aggregated in each round]

  • The more frequently you exchange information, the faster your model converges
  • Communication over the network can be very expensive

Trade-off

SLIDE 7

CoCoA Framework [Smith, Jaggi, Takáč, Ma, Forte, Hofmann, Jordan, 2013-2015]

[Diagram: each worker runs H steps of a local solver between communication rounds]

The optimal H depends on:
  • the system
  • the implementation/framework

min_x  g(Bᵀx) + h(x)

Tunable hyper-parameter H
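To make the role of H concrete, here is a minimal sequential sketch of a CoCoA-style outer loop (our own illustration; `local_solver` and the averaging aggregation are placeholder choices, not the paper's implementation): each of the K workers runs H steps of an arbitrary local solver against the current shared model, and only the aggregated updates cross the network.

```python
import numpy as np

def cocoa_outer_loop(partitions, x0, H, rounds, local_solver):
    """Simulate `rounds` communication rounds of a CoCoA-style method.

    partitions   -- list of K per-worker data partitions
    H            -- tunable number of local-solver steps per round
    local_solver -- placeholder: maps (partition, x, H) to a model update
    """
    x = x0.copy()
    for _ in range(rounds):
        # In a real deployment these K calls run in parallel, one per worker.
        local_updates = [local_solver(part, x, H) for part in partitions]
        # One communication round: aggregate the local models.
        x += np.mean(local_updates, axis=0)
    return x
```

Larger H means more local computation per communication round; smaller H means more frequent information exchange. This is exactly the trade-off from the previous slide, now exposed as a single tunable knob.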

SLIDE 8

Implementation: Frameworks for Distributed Computing

MPI: high-performance computing framework
  • Requires advanced system knowledge
  • C++
  • Good performance

Apache Spark*: open-source cloud computing framework
  • Easy to use
  • Powerful APIs: Python, Scala, Java, R
  • Poorly understood overheads

* http://spark.apache.org/


Designed for different purposes → different characteristics

SLIDE 9
Different implementations of CoCoA:
  • (A) Spark reference implementation*
  • (B) pySpark implementation
  • (C) MPI implementation

Offloading the local solver to C++:
  • (A*) Spark+C
  • (B*) pySpark+C

* https://github.com/gingsmith/cocoa

100 iterations of CoCoA for fixed H; (A*), (B*), and (C) execute identical C++ code.

webspam dataset, 8 Spark workers
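For flavor, here is a minimal pySpark rendering of this experiment's structure (a sketch under assumptions: the plain-SGD local solver and all names are our stand-ins, not implementation (B)): the H local steps run inside `mapPartitions`, and each outer iteration costs one broadcast of the shared model plus one `reduce`.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="cocoa-sketch")

d = 100                                              # number of features (assumed)
rows = [(np.random.randn(d), float(np.random.randn())) for _ in range(10_000)]
data = sc.parallelize(rows, numSlices=8).cache()     # 8 workers, as in the experiment

K, H = 8, 1000                                       # partitions, local steps per round

def local_sgd(partition, x, lr=1e-3):
    """Stand-in local solver: H SGD steps for least squares on one partition."""
    part = list(partition)
    x_local = x.copy()
    for k in range(H):
        a, y = part[k % len(part)]
        x_local -= lr * (a @ x_local - y) * a
    yield (x_local - x) / K                          # averaged local update

x = np.zeros(d)
for _ in range(100):                                 # 100 outer iterations
    x_b = sc.broadcast(x)                            # ship the shared model
    x = x + data.mapPartitions(lambda it: local_sgd(it, x_b.value)).reduce(np.add)
```

Framework overheads (serialization, task scheduling, the per-round broadcast and reduce) are exactly what distinguishes implementations (A), (B), and (C) even when the local C++ code is identical.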

SLIDE 10

Communication-Computation Tradeoff

[Plot: communication-computation tradeoff (6x difference), webspam dataset, 8 Spark workers]

Understanding the characteristics of the framework and correctly adapting the algorithm can make orders of magnitude of difference in performance!

SLIDE 11

To the designer: strive to design flexible algorithms that can be adapted to system characteristics.
To the user: be aware that machine learning algorithms need to be tuned to achieve good performance.

  • C. Dünner, T. Parnell, K. Atasu, M. Sifalakis, H. Pozidis, "Understanding and Optimizing the Performance of Distributed Machine Learning Applications on Apache Spark", IEEE International Conference on Big Data, Boston, 2017

SLIDE 12

[Diagram: each worker runs H steps of a local solver]

Which local solver should we use?

SLIDE 13

Stochastic Primal-Dual Coordinate Descent Methods

Primal problem (covers Lasso, Ridge Regression, Logistic Regression, ...):

min_x  g(Bᵀx) + Σ_j h_j(x_j),    g smooth

Dual problem (covers L2-regularized SVM, Ridge Regression, L2-regularized Logistic Regression, ...):

min_β  g*(β) + Σ_j h_j*(−B_{:j}ᵀβ),    g* strongly convex

Examples:

Ridge Regression:  min_x  (1/(2n)) ‖Bᵀx − z‖₂² + (μ/2) ‖x‖₂²
Lasso:             min_x  (1/(2n)) ‖Bᵀx − z‖₂² + μ ‖x‖₁

Good convergence properties. Problem: these methods are inherently sequential, so they cannot leverage the full power of modern CPUs or GPUs → asynchronous implementations.
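To see where the sequential bottleneck comes from, here is a minimal single-threaded sketch of stochastic coordinate descent for ridge regression (our own notation: design matrix A, targets y; not code from the talk). Every coordinate update reads and writes one shared residual vector, which is what makes naive parallelization unsafe.

```python
import numpy as np

def ridge_scd(A, y, lam, epochs=10, seed=0):
    """Sequential SCD for min_x 0.5*||A x - y||^2 + (lam/2)*||x||^2."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    x = np.zeros(m)
    r = y - A @ x                      # shared residual vector r = y - A x
    col_sq = (A * A).sum(axis=0)       # precomputed column norms ||A_j||^2
    for _ in range(epochs * m):
        j = rng.integers(m)
        # Closed-form minimizer along coordinate j.
        delta = (A[:, j] @ r - lam * x[j]) / (col_sq[j] + lam)
        x[j] += delta
        r -= delta * A[:, j]           # every update touches the shared vector:
                                       # this dependency chain is the bottleneck
    return x
```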

SLIDE 14

Asynchronous Stochastic Algorithms

min_x  g(Bᵀx) + Σ_j h_j(x_j)

Parallelized over cores: every core updates a dedicated subset of coordinates.

Problem: write collisions on the shared vector. Ways to handle them:
  • Recompute the shared vector: Liu et al., "AsySCD" (JMLR'15)
  • Memory locking: Tran et al., "Scaling up SDCA" (SIGKDD'15)
  • Live with undefined behavior: Hsieh et al., "PASSCoDe" (ICML'15)
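The third option is easy to sketch. Below, each thread owns a subset of coordinates, but all threads update one shared vector without any synchronization, so lost or torn updates can occur (a deliberately unsafe illustration with Python threads, our own stand-in rather than any cited system's code; note that CPython's GIL serializes the bytecode, so this shows the structure of the race, not real parallel throughput).

```python
import threading
import numpy as np

def worker(A, y, x, v, lam, coords, steps, seed):
    """Lock-free ridge coordinate updates on a dedicated coordinate subset."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        j = rng.choice(coords)
        # Read the shared vector v = A x (possibly stale under concurrency).
        delta = (A[:, j] @ (y - v) - lam * x[j]) / (A[:, j] @ A[:, j] + lam)
        x[j] += delta
        v += delta * A[:, j]           # unsynchronized write: collisions possible

def async_scd(A, y, lam, n_threads=4, steps=1000):
    n, m = A.shape
    x, v = np.zeros(m), np.zeros(n)
    parts = np.array_split(np.arange(m), n_threads)
    threads = [threading.Thread(target=worker,
                                args=(A, y, x, v, lam, p, steps, i))
               for i, p in enumerate(parts)]
    for t in threads: t.start()
    for t in threads: t.join()
    return x
```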

SLIDE 15

Asynchronous Stochastic Algorithms (continued)

[Plot: Ridge Regression on the webspam dataset]
SLIDE 16

2-level Parallelism of GPUs

1st level of parallelism:
  • A GPU consists of streaming multiprocessors
  • Thread blocks get assigned to multiprocessors and are executed asynchronously

2nd level of parallelism:
  • Each thread block consists of up to 1024 threads
  • Threads are grouped into warps (32 threads), which are executed as SIMD operations

[Diagram: GPU with streaming multiprocessors SM1-SM8 and main memory; each thread block has its own shared memory]

SLIDE 17

GPU Acceleration

A Twice Parallel Asynchronous Stochastic Coordinate Descent (TPA-SCD) Algorithm

  • T. Parnell, C. Dünner, K. Atasu, M. Sifalakis, H. Pozidis, "Large-Scale Stochastic Learning Using GPUs", IEEE International Parallel and Distributed Processing Symposium (IPDPS), Lake Buena Vista, FL, 2017

1. Thread blocks are executed in parallel, asynchronously updating one coordinate each.
2. The update computation within a thread block is interleaved across threads to ensure memory locality within a warp, and local memory is used to accumulate partial sums.
3. The atomic-add functionality of modern GPUs is used to update the shared vector.

[Plots: Ridge and SVM on the webspam dataset, 10x and 8x speedups. GPU: GeForce GTX 1080 Ti; CPU: 8-core Intel Xeon E5]
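The following is an illustrative numba re-creation of these three steps for dense ridge regression (our own simplification under stated assumptions; the paper's TPA-SCD is a native GPU implementation and covers more objectives): one thread block per coordinate, an interleaved block-local reduction for the inner product, and `cuda.atomic.add` for the shared-vector update.

```python
import numpy as np
from numba import cuda, float32

TPB = 128  # threads per block (a multiple of the 32-thread warp size)

@cuda.jit
def tpa_scd_round(A, y, x, v, col_sq, lam):
    j = cuda.blockIdx.x                    # step 1: one block per coordinate j,
    t = cuda.threadIdx.x                   # blocks run asynchronously
    n = A.shape[0]

    # Step 2: interleaved partial inner products A_j^T (y - v) in shared memory;
    # consecutive threads read consecutive rows, so accesses coalesce per warp.
    partial = cuda.shared.array(TPB, float32)
    s = 0.0
    for i in range(t, n, TPB):
        s += A[i, j] * (y[i] - v[i])
    partial[t] = s
    cuda.syncthreads()
    k = TPB // 2                           # tree reduction within the block
    while k > 0:
        if t < k:
            partial[t] += partial[t + k]
        cuda.syncthreads()
        k //= 2

    delta = cuda.shared.array(1, float32)
    if t == 0:                             # closed-form ridge coordinate update
        delta[0] = (partial[0] - lam * x[j]) / (col_sq[j] + lam)
        x[j] += delta[0]
    cuda.syncthreads()

    # Step 3: atomic adds tolerate concurrent updates from other blocks.
    for i in range(t, n, TPB):
        cuda.atomic.add(v, i, delta[0] * A[i, j])

# Hypothetical launch: one asynchronous thread block per coordinate.
# tpa_scd_round[m, TPB](A_d, y_d, x_d, v_d, col_sq_d, np.float32(lam))
```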

SLIDE 18

[Diagram: each worker runs H steps of a local solver]

Which local solver should we use? It depends on the available hardware.

SLIDE 19

Heterogeneous System

[Diagram: multi-core CPU and GPU]

SLIDE 20

Heterogeneous System

[Diagram: multi-core CPU and GPU]

SLIDE 21

Dual Heterogeneous Learning [NIPS’17]

Idea: the GPU should work on the part of the data it can learn most from. The contribution of individual data columns to the duality gap is indicative of their potential to improve the model.

A scheme to efficiently use Limited-Memory Accelerators for Linear Learning

[Diagram: CPU with large memory and GPU with limited memory; the coordinates j with the largest duality gaps are selected for the GPU]

[Plot: Lasso on the epsilon dataset]

SLIDE 22

Dual Heterogeneous Learning [NIPS’17]


Duality-gap computation is expensive!
  • We introduce a gap memory
  • The workload is parallelized between CPU and GPU: the GPU runs the algorithm on a subset of the data while the CPU computes the importance values

gap(x) = Σ_k [ x_k ⟨B_{:k}, w⟩ + h_k(x_k) + h_k*(−B_{:k}ᵀw) ],    w: shared vector
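As a concrete worked instance (our own example, not from the slides): for an L2 term h_k(x_k) = (λ/2) x_k² with conjugate h_k*(u) = u²/(2λ), writing z_k = B_{:k}ᵀw, the k-th summand collapses to

\[
\mathrm{gap}_k(x) = x_k z_k + \tfrac{\lambda}{2} x_k^2 + \tfrac{z_k^2}{2\lambda}
= \frac{(\lambda x_k + z_k)^2}{2\lambda} \ge 0,
\]

which vanishes exactly when the coordinate-wise optimality condition z_k = −λ x_k holds. Coordinates with a large gap_k are the ones the current model has not yet fit, which is why they are the most valuable to keep on the accelerator.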

SLIDE 23

DuHL Algorithm

[Diagram: DuHL workload split between GPU and CPU]

  • C. Dünner, T. Parnell, M. Jaggi, "Efficient Use of Limited-Memory Accelerators for Linear Learning on Heterogeneous Systems", NIPS, Long Beach, CA, 2017
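A minimal sequential sketch of the DuHL loop (our own stand-in, assuming the ridge gap formula above and a simulated "GPU" subset; the real system runs the local solver on the accelerator and the gap updates on the CPU in parallel, and keeps a lazily refreshed gap memory rather than recomputing every gap each round):

```python
import numpy as np

def ridge_gaps(A, y, x, lam):
    """Per-coordinate duality gaps gap_k = (lam*x_k + z_k)^2 / (2*lam)
    for ridge, with w = A x - y and z = A^T w (see the worked example above)."""
    z = A.T @ (A @ x - y)
    return (lam * x + z) ** 2 / (2 * lam)

def duhl_ridge(A, y, lam, gpu_capacity, rounds=50, local_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    m = A.shape[1]
    x = np.zeros(m)
    for _ in range(rounds):
        # "CPU": refresh importance values and pick the most informative columns.
        gaps = ridge_gaps(A, y, x, lam)
        S = np.argsort(-gaps)[:gpu_capacity]   # columns swapped onto the "GPU"
        # "GPU": run a local coordinate-descent solver on the resident subset S.
        r = y - A @ x
        for _ in range(local_steps):
            j = rng.choice(S)
            delta = (A[:, j] @ r - lam * x[j]) / (A[:, j] @ A[:, j] + lam)
            x[j] += delta
            r -= delta * A[:, j]
    return x
```

Only the `gpu_capacity` columns in S would need to live in accelerator memory; everything else stays on the CPU side, which is the point when the dataset exceeds the GPU's capacity.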

SLIDE 24

DuHL: Performance Results

Fig 1: superior convergence properties of DuHL over existing schemes (suboptimality vs. iterations, Lasso).
Fig 2: I/O efficiency of DuHL (# swaps vs. iterations, Lasso).

Reduced I/O cost and faster convergence accumulate to a 10x speedup.

ImageNet dataset (30 GB); GPU: NVIDIA Quadro M4000 (8 GB); CPU: 8-core Intel Xeon (64 GB)

SLIDE 25

DuHL: Performance Results

[Plots: suboptimality vs. time for Lasso, and duality gap vs. time for SVM]

Reduced I/O cost and faster convergence accumulate to a 10x speedup.

ImageNet dataset (30 GB); GPU: NVIDIA Quadro M4000 (8 GB); CPU: 8-core Intel Xeon (64 GB)

SLIDE 26

Combining it all

A library for ultra-fast machine learning. Goal: remove training as a bottleneck.

  • Enable seamless retraining of models
  • Enable agile development
  • Enable training on large-scale datasets
  • Enable high-quality insights
  • Exploit the primal-dual structure of ML problems to minimize communication
  • Offer GPU acceleration
  • Implement DuHL for efficient utilization of limited-memory accelerators
  • Improved memory management of Spark
SLIDE 27

Combining it all


Tera-Scale Advertising Application: predict whether a user will click on a given advert, based on an anonymized set of features.

Training: 1 billion examples; testing: 100 million unseen examples.
Infrastructure: POWER8 (Minsky), 8 x P100 GPUs.

Linear Regression (8 executors):

Library        Time (s)   MSE    Speed-up
spark.ml       20,000     6.1%   n/a
our library    7          6.1%   2800x

SLIDE 28

Summary

  • Distributed algorithms that offer tunable hyper-parameters are of particular practical interest
  • A user may expect orders-of-magnitude improvements from optimizing such parameters
  • GPUs can accelerate machine learning workloads by an order of magnitude if algorithms are carefully designed
  • DuHL enables GPU acceleration even if the data exceeds the capacity of the GPU memory
  • Combining all this knowledge, we can remove training time as a bottleneck for ML applications

SLIDE 29

Questions?