Orpheus: Efficient Distributed Machine Learning via System and Algorithm Co-design

Pengtao Xie (Petuum Inc), Jin Kyu Kim (CMU), Qirong Ho (Petuum Inc), Yaoliang Yu (University of Waterloo), Eric P. Xing (Petuum Inc)
Massive Data
Distributed ML Systems
- Parameter Server Systems: Yahoo LDA, DistBelief, Project Adam, Li & Smola PS, Bosen, GeePS
- Graph Processing Systems: Pregel, GraphX
- Dataflow Systems
- Hybrid Systems
Matrix-Parameterized Models (MPMs)
- Model parameters are represented as a matrix
- Other examples: Topic Model, Multiclass Logistic Regression, Distance Metric Learning, Sparse Coding, Group Lasso, etc.
- Example: Neural Network, where a weight matrix connects the neurons in hidden layer 1 to the neurons in hidden layer 2
Parameter Matrices Could Be Very Large
- LightLDA Topic Model (Yuan et al. 2015): the topic matrix has 50 billion entries.
- Google Brain Neural Network (Le et al. 2012): the weight matrices have 1.3 billion entries.
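To make these magnitudes concrete, a quick back-of-the-envelope calculation (ours, assuming 32-bit floats; the slide does not state the precision) for the LightLDA topic matrix:

```latex
% Assumption: 4 bytes (32-bit float) per entry.
5 \times 10^{10} \ \text{entries} \times 4 \ \text{bytes/entry}
  = 2 \times 10^{11} \ \text{bytes} = 200 \ \text{GB}
```

A matrix of this size cannot fit on a single commodity machine, which is why it must be partitioned and synchronized across a cluster.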
Existing Approaches
- Parameter server frameworks communicate matrices for parameter synchronization.
High Communication Cost
Existing Approaches (Cont’d)
- Parameter matrices are checkpointed to stable storage for fault tolerance.
High Disk IO
System and Algorithm Co-design
- System design should be tailored to the unique mathematical properties of ML algorithms
- Algorithms can be re-designed to better exploit the system architecture
Sufficient Vectors (SVs)
- The parameter-update matrix can be computed from a few vectors (referred to as sufficient vectors)

$$\Delta W = u \otimes v = u v^\top$$

The full update matrix has $J \times K$ entries; the sufficient vectors $u$ and $v$ have only $J + K$ entries (Xie et al. 2016).
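As a concrete illustration, here is a minimal NumPy sketch (the multiclass-logistic-regression setting and all names are illustrative, not code from Orpheus) showing how one example's update decomposes into an SV pair:

```python
import numpy as np

# For multiclass logistic regression with weight matrix W (J classes x
# K features), the gradient on one example (x, y) is the outer product
# of two "sufficient vectors" u and v.
def sv_update(W, x, y):
    logits = W @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax(Wx)
    u = probs
    u[y] -= 1.0                          # u = softmax(Wx) - one_hot(y)
    v = x                                # v is the input feature vector
    return u, v                          # J + K numbers, not J * K

W = np.zeros((10, 5000))                 # J = 10, K = 5000
x = np.random.randn(5000)
u, v = sv_update(W, x, y=3)
delta_W = np.outer(u, v)                 # full update recovered as u v^T
```

Transmitting u and v (5,010 numbers) instead of delta_W (50,000 numbers) is exactly the saving the slide describes.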
System and Algorithm Co-design

System Design:
- Random multicast
- Incremental SV checkpoint
- Periodic centralized synchronization
- Parameter-replicas rotation

Algorithm Design:
- SV selection
- Using SVs to represent parameter states
- Automatic identification of SVs

Together these address communication, fault tolerance, consistency, and the programming interface.
Outline
- Introduction
- Communication
- Fault tolerance
- Evaluation
- Conclusions
Peer-to-Peer Transfer of SVs
(Xie et al. 2016)
Cost Comparison
J, K: dimensions of the parameter matrix; P: number of machines

                   Size of one message   Number of messages   Network traffic
P2P SV-Transfer    O(J+K)                O(P²)                O((J+K)P²)
Parameter Server   O(JK)                 O(P)                 O(JKP)
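For intuition, an illustrative calculation (J and K match the MLR experiment later in the deck; P = 12 is our own choice):

```python
# Illustrative traffic comparison; P = 12 is an assumption.
J, K, P = 325_000, 20_000, 12

sv_traffic = (J + K) * P**2   # P2P SV-transfer: O((J+K)P^2)
ps_traffic = J * K * P        # parameter server: O(JKP)

print(f"{sv_traffic:.2e}")    # ~4.97e+07 values per clock
print(f"{ps_traffic:.2e}")    # ~7.80e+10 values per clock
```

Even with the P² message count, SV transfer moves roughly three orders of magnitude less data here, because J + K ≪ JK.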
How to reduce the number of messages in P2P?
Random Multicast
- Send SVs to a random subset of Q (Q ≪ P) machines
- Reduces the number of messages from O(P²) to O(PQ)
Random Multicast (Cont’d)
- Correctness is guaranteed due to the error-tolerant nature of ML.
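A minimal sketch of the destination-selection step (peer-selection logic only; Q and the function name are our assumptions, and the actual transport layer is not shown on the slides):

```python
import random

# Each machine sends its fresh SVs to Q randomly chosen peers per clock,
# reducing the number of messages per clock from P*(P-1) to P*Q.
def multicast_destinations(my_rank, num_machines, q, rng=random):
    peers = [r for r in range(num_machines) if r != my_rank]
    return rng.sample(peers, q)

# e.g. P = 16 machines, Q = 4 destinations per sender:
dests = multicast_destinations(my_rank=0, num_machines=16, q=4)
```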
Mini-Batch
- It is common to use a mini-batch of training examples (instead of a single one) to compute updates
- If represented as matrices, the updates computed w.r.t. different examples can be aggregated into a single update matrix for communication
- Communication cost therefore does not grow with the mini-batch size

[Figure: training examples → per-example update matrices → one aggregated matrix]
Mini-Batch (Cont’d)
- If represented as SVs, the updates computed w.r.t. different examples cannot be aggregated into a single SV pair
- The SV pairs must be transmitted individually
- Communication cost grows linearly with the mini-batch size

[Figure: training examples → SV pairs (u₁, v₁), (u₂, v₂), (u₃, v₃), (u₄, v₄); these cannot be aggregated]
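A tiny NumPy demonstration of this asymmetry (synthetic data and sizes of our own choosing): the sum of B rank-1 updates generally has rank B, so no single pair (u, v) can reproduce it:

```python
import numpy as np

# Synthetic demo (our own data): B per-example updates, each a rank-1
# outer product u_i v_i^T, sum into one matrix of rank B -- so the
# aggregate cannot be written as a single outer product u v^T.
rng = np.random.default_rng(0)
B, J, K = 4, 6, 8
us = [rng.standard_normal(J) for _ in range(B)]
vs = [rng.standard_normal(K) for _ in range(B)]

aggregated = sum(np.outer(u, v) for u, v in zip(us, vs))
print(np.linalg.matrix_rank(aggregated))  # prints 4, not 1
```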
SV Selection
- Select a subset of “representative” SVs to communicate
- Reduces communication cost
- Does not hurt the correctness of updates
- The aggregated update computed from the selected SVs is close to that computed from the entire mini-batch
- The selected SVs represent the remaining ones well
SV Selection (Cont’d)
- Algorithm: joint matrix column subset selection

$$\min_S \sum_{i=1}^{2} \left\| X^{(i)} - X_S^{(i)} \big(X_S^{(i)}\big)^{\dagger} X^{(i)} \right\|_F^2$$

where X⁽¹⁾ and X⁽²⁾ stack the u-vectors and v-vectors of the mini-batch as columns, X_S denotes the columns indexed by the selected subset S, and † is the Moore-Penrose pseudoinverse.
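One possible implementation is sketched below: a greedy orthogonal-deflation heuristic over the stacked SV pairs. This is our own stand-in for the slide's objective, not necessarily the paper's exact algorithm:

```python
import numpy as np

# Greedy column subset selection over SV pairs: repeatedly pick the pair
# whose concatenated vector [u_i; v_i] has the largest residual after
# projecting out the directions of the pairs already chosen.
def select_sv_pairs(us, vs, num_selected):
    X = np.stack([np.concatenate([u, v]) for u, v in zip(us, vs)], axis=1)
    residual = X.copy()
    selected = []
    for _ in range(num_selected):
        i = int(np.argmax(np.linalg.norm(residual, axis=0)))
        selected.append(i)
        q = residual[:, i] / (np.linalg.norm(residual[:, i]) + 1e-12)
        residual -= np.outer(q, q @ residual)   # deflate chosen direction
    return selected
```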
Outline
- Introduction
- Communication
- Fault tolerance
- Evaluation
- Conclusions
SV-based Representation
- SV-based representation of parameters
- At iteration t, the state W_t of the parameter matrix is

$$W_t = W_0 + \sum_{i=1}^{t} \Delta W_i = W_0 + \sum_{i=1}^{t} u_i v_i^\top$$

where W₀ is the initialization and the ΔW_i = u_i v_iᵀ are the update matrices. This is the SV Representation (SVR).
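A minimal sketch of this representation (class and method names are ours):

```python
import numpy as np

# Parameter state kept as the initialization W_0 plus a log of SV pairs;
# the dense matrix is only materialized when needed.
class SVRepresentation:
    def __init__(self, W0):
        self.W0 = W0
        self.log = []                # [(u_1, v_1), ..., (u_t, v_t)]

    def apply(self, u, v):
        self.log.append((u, v))      # O(J+K) per update, no J*K write

    def materialize(self):
        W = self.W0.copy()
        for u, v in self.log:
            W += np.outer(u, v)      # W_t = W_0 + sum_i u_i v_i^T
        return W
```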
Fault Tolerance
- SV-based checkpoint: save the SVs computed in each clock to disk
- Consumes little disk bandwidth
- Does not halt computation
- Recovery: transform the saved SVs back into the parameter matrix
- Can roll back to the state of any clock
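A hedged sketch of what incremental SV checkpointing and clock-level rollback could look like (the file format and function names are our assumptions, not Orpheus's actual API):

```python
import pickle
import numpy as np

# Append only the new SV pairs each clock -- O(J+K) bytes per pair --
# instead of dumping the full J x K matrix.
def checkpoint_svs(path, clock, sv_pairs):
    with open(path, "ab") as f:
        pickle.dump((clock, sv_pairs), f)

# Recovery: replay the SV log on top of W_0, stopping at any clock.
def recover(path, W0, upto_clock):
    W = W0.copy()
    with open(path, "rb") as f:
        while True:
            try:
                clock, sv_pairs = pickle.load(f)
            except EOFError:
                break
            if clock > upto_clock:
                break
            for u, v in sv_pairs:
                W += np.outer(u, v)
    return W
```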
Outline
- Introduction
- Communication
- Fault tolerance
- Evaluation
- Conclusions
Convergence Speed
Multi-class Logistic Regression (MLR); weight matrix: 325K-by-20K

[Figure: convergence time (hours) of Spark-2.0, Gopal, TensorFlow-1.0, Bosen, MXNet-0.7, SVB, and Orpheus]
Breakdown of Network Waiting Time and Computation Time
SV Selection
[Figure: convergence time vs. the number of selected SV pairs, compared against the full batch with no selection]
Random Multicast
[Figure: convergence time vs. the number of destinations each machine sends messages to, compared against full broadcast]
Fault Tolerance
Conclusions
System Design:
- Random multicast
- Incremental SV checkpoint
- Periodic centralized synchronization
- Parameter-replicas rotation

Algorithm Design:
- SV selection
- Using SVs to represent parameter states
- Automatic identification of SVs

Together these address communication, fault tolerance, consistency, and the programming interface.