SLIDE 1

Orpheus: Efficient Distributed Machine Learning via System and Algorithm Co-design

Pengtao Xie (Petuum Inc), Jin Kyu Kim (CMU), Qirong Ho (Petuum Inc), Yaoliang Yu (University of Waterloo), Eric P. Xing (Petuum Inc)

SLIDE 2

Massive Data

SLIDE 3

Distributed ML Systems

[Figure: landscape of distributed ML systems, including parameter server systems (Yahoo LDA, DistBelief, Project Adam, Li & Smola PS, Bosen, GeePS), graph processing systems (Pregel, GraphX), dataflow systems, and hybrid systems]

SLIDE 4

Matrix-Parameterized Models (MPMs)

  • Model parameters are represented as a matrix
  • Other examples: Topic Model, Multiclass Logistic Regression, Distance Metric Learning, Sparse Coding, Group Lasso, etc.

Neural Network

[Figure: neural network example, where the weight matrix sits between the neurons in hidden layer 1 and the neurons in hidden layer 2]

SLIDE 5

Parameter Matrices Could Be Very Large

  • LightLDA Topic Model (Yuan et al. 2015): the topic matrix has 50 billion entries
  • Google Brain Neural Network (Le et al. 2012): the weight matrices have 1.3 billion entries

SLIDE 6

Existing Approaches

  • Parameter server frameworks communicate matrices for parameter synchronization

High Communication Cost

SLIDE 7

Existing Approaches (Cont’d)

  • Parameter matrices are checkpointed to stable storage for fault tolerance

High Disk IO

SLIDE 8

System and Algorithm Co-design

  • System design should be tailored to the unique mathematical properties of ML algorithms
  • Algorithms can be re-designed to better exploit the system architecture

SLIDE 9

Sufficient Vectors (SVs)

  • The parameter-update matrix can be computed from a few vectors (referred to as sufficient vectors)

For a J-by-K parameter matrix, the update factors as ΔW = u vᵀ: the dense matrix ΔW has J × K entries, while the sufficient vectors u and v together have only J + K entries (Xie et al. 2016).
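For concreteness, here is a minimal sketch (not the authors' code) of how a worker running multiclass logistic regression could emit its update as an SV pair rather than a dense matrix. The function name and shapes are illustrative assumptions; the outer-product structure of the gradient is the standard MLR fact the slide relies on.

```python
import numpy as np

def sv_update_mlr(W, x, y_onehot):
    """Gradient of the MLR loss for one example, as sufficient vectors.

    W: (J, K) weight matrix, x: (K,) features, y_onehot: (J,) label.
    The dense update would be Delta_W = outer(u, v), J*K entries;
    the pair (u, v) is only J + K entries.
    """
    logits = W @ x
    p = np.exp(logits - logits.max())
    p /= p.sum()            # softmax probabilities, shape (J,)
    u = p - y_onehot        # J entries
    v = x                   # K entries
    return u, v

# A receiver applies the update by reconstructing it locally:
#   W -= lr * np.outer(u, v)
```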

SLIDE 10
System and Algorithm Co-design

System Design
  • Random multicast
  • Incremental SV checkpoint
  • Periodic centralized synchronization
  • Parameter-replicas rotation

Algorithm Design
  • SV selection
  • Using SVs to represent parameter states
  • Automatic identification of SVs

Together these cover communication, fault tolerance, consistency, and the programming interface.

SLIDE 11

Outline

  • Introduction
  • Communication
  • Fault tolerance
  • Evaluation
  • Conclusions
SLIDE 12

Peer-to-Peer Transfer of SVs

(Xie et al. 2016)

SLIDE 13

Cost Comparison

J, K: dimensions of the parameter matrix; P: number of machines

                    Size of one message   Number of messages   Network traffic
P2P SV-Transfer     O(J + K)              O(P²)                O((J + K)P²)
Parameter Server    O(JK)                 O(P)                 O(JKP)

How to reduce the number of messages in P2P?
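To make the table concrete, a quick back-of-the-envelope comparison. The matrix size is the 325K-by-20K MLR weight matrix from the evaluation later in the deck; P = 16 machines is an arbitrary assumption for illustration.

```python
# Entries on the wire per clock under each scheme (illustrative numbers).
J, K, P = 325_000, 20_000, 16   # P is an assumed cluster size

p2p_sv = (J + K) * P**2   # O((J + K) * P^2): SVs multicast peer-to-peer
ps     = J * K * P        # O(J * K * P): dense matrices via the server

print(f"P2P SV-transfer : {p2p_sv:.2e} entries")   # ~8.8e+07
print(f"Parameter server: {ps:.2e} entries")       # ~1.0e+11
```

Even with the P² message count, the (J + K) versus JK message size dominates for large matrices; the remaining concern is the number of messages, addressed next.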

SLIDE 14

Random Multicast

  • Send SVs to a random subset of Q (Q ≪ P) machines
  • Reduces the number of messages from O(P²) to O(PQ)
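A minimal sketch of the destination-selection step, assuming each worker independently draws Q peers per clock (the helper name is hypothetical):

```python
import random

def multicast_targets(my_rank, num_machines, q, rng=random):
    """Draw a random subset of Q peers (Q << P) to receive this clock's SVs."""
    peers = [r for r in range(num_machines) if r != my_rank]
    return rng.sample(peers, q)

# With P = 16 and Q = 4, each clock produces 16 * 4 messages
# instead of 16 * 15, i.e. O(PQ) rather than O(P^2).
targets = multicast_targets(my_rank=0, num_machines=16, q=4)
```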
SLIDE 15

Random Multicast (Cont’d)

  • Correctness is guaranteed due to the error-tolerant nature of ML.
SLIDE 16

Mini-Batch

  • It is common to use a mini-batch of training examples (instead of one) to compute updates
  • If represented as matrices, the updates computed w.r.t. different examples can be aggregated into a single update matrix to communicate
  • Communication cost does not grow with mini-batch size

[Figure: update matrices from the training examples in a mini-batch are summed into one aggregated matrix]

SLIDE 17

Mini-Batch (Cont’d)

  • If represented as SVs, the updates computed w.r.t. different examples cannot be aggregated into a single SV pair
  • The SVs must be transmitted individually
  • Communication cost grows linearly with mini-batch size

[Figure: each training example i yields its own SV pair (u_i, v_i); the pairs cannot be aggregated]
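The obstruction is rank: a sum of several rank-1 outer products is generally not rank-1, so no single (u, v) pair reproduces it. A small numeric check:

```python
import numpy as np

rng = np.random.default_rng(0)
us = rng.normal(size=(4, 3))   # u_i for 4 examples, each in R^3
vs = rng.normal(size=(4, 5))   # matching v_i, each in R^5

# In matrix form the four updates do collapse into one matrix...
delta_W = sum(np.outer(u, v) for u, v in zip(us, vs))

# ...but its rank is 3 here (> 1), so it is not a single outer
# product u v^T; the four SV pairs must be sent individually.
print(np.linalg.matrix_rank(delta_W))
```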

SLIDE 18

SV Selection

  • Select a subset of “representative” SVs to communicate
  • Reduces communication cost
  • Does not hurt the correctness of updates
  • The aggregated update computed from the selected SVs is close to that computed from the entire mini-batch
  • The selected SVs can represent the others well
SLIDE 19

SV Selection (Cont’d)

  • Algorithm: joint matrix column subset selection

min_S Σ_i ‖X^(i) − X_S^(i) (X_S^(i))⁺ X^(i)‖_F

where each X^(i) stacks the mini-batch's SVs of one type as columns (one matrix for the u-vectors, one for the v-vectors), X_S^(i) keeps only the columns indexed by the selected set S, and (·)⁺ is the pseudo-inverse.
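A greedy sketch of this objective (the paper's actual selection procedure may differ): repeatedly add the column whose inclusion most reduces the combined projection residual of the u-matrix and v-matrix.

```python
import numpy as np

def _resid(X, S):
    """Frobenius error of projecting X onto the span of its columns S."""
    Xs = X[:, S]
    return np.linalg.norm(X - Xs @ np.linalg.pinv(Xs) @ X)

def greedy_joint_css(U, V, k):
    """Pick k column indices that jointly approximate U and V.

    U: (J, N) with the mini-batch's u-vectors as columns,
    V: (K, N) with the matching v-vectors. Illustrative sketch only.
    """
    selected = []
    for _ in range(k):
        candidates = (j for j in range(U.shape[1]) if j not in selected)
        best = min(candidates,
                   key=lambda j: _resid(U, selected + [j]) +
                                 _resid(V, selected + [j]))
        selected.append(best)
    return selected

# Usage: transmit only the SV pairs at the returned indices.
# idx = greedy_joint_css(U, V, k=8)
```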

SLIDE 20

Outline

  • Introduction
  • Communication
  • Fault tolerance
  • Evaluation
  • Conclusions
SLIDE 21

SV-based Representation

  • SV-based representation of parameters
  • At iteration t, the state W_t of the parameter matrix is

W_t = W_0 + ΔW_1 + … + ΔW_t    (initialization plus the update matrices)

Since each update matrix is an outer product of SVs, this becomes

W_t = W_0 + u_1 v_1ᵀ + … + u_t v_tᵀ

the SV Representation (SVR).
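In code, the SVR is just the initialization plus a log of SV pairs; the dense matrix is materialized only when actually needed. A minimal sketch:

```python
import numpy as np

def materialize(W0, sv_log):
    """Rebuild W_t from W_0 and the logged SV pairs.

    sv_log: [(u_1, v_1), ..., (u_t, v_t)], one pair per update.
    """
    W = W0.copy()
    for u, v in sv_log:
        W += np.outer(u, v)   # each Delta_W_i = u_i v_i^T
    return W
```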

SLIDE 22

Fault Tolerance

  • SV-based checkpoint: save the SVs computed in each clock to disk
  • Consumes little disk bandwidth
  • Does not halt computation
  • Recovery: transform the saved SVs back into the parameter matrix
  • Can roll back to the state of any clock
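A sketch of what such a checkpoint could look like (the file format and function names are assumptions, not the system's actual interface): each clock appends its SV pairs to a log, and recovery replays the log up to the desired clock.

```python
import pickle
import numpy as np

def checkpoint_svs(path, clock, sv_pairs):
    """Append one clock's SV pairs to the on-disk log. The writes are
    tiny compared to dumping the full matrix, so computation continues."""
    with open(path, "ab") as f:
        pickle.dump((clock, sv_pairs), f)

def recover(path, W0, up_to_clock):
    """Replay the log to roll the parameter matrix to any saved clock."""
    W = W0.copy()
    with open(path, "rb") as f:
        while True:
            try:
                clock, sv_pairs = pickle.load(f)
            except EOFError:
                break
            if clock > up_to_clock:
                break
            for u, v in sv_pairs:
                W += np.outer(u, v)
    return W
```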
SLIDE 23

Outline

  • Introduction
  • Communication
  • Fault tolerance
  • Evaluation
  • Conclusions
SLIDE 24

Convergence Speed

Multi-class Logistic Regression (MLR), with a 325K-by-20K weight matrix.

[Figure: convergence time in hours for Spark-2.0, Gopal, TensorFlow-1.0, Bosen, MXNet-0.7, SVB, and Orpheus]

SLIDE 25

Breakdown of Network Waiting Time and Computation Time

SLIDE 26

SV Selection

[Figure: convergence as the number of selected SV pairs varies, compared against the full batch with no selection]

SLIDE 27

Random Multicast

[Figure: convergence as the number of multicast destinations per machine varies, compared against full broadcast]

SLIDE 28

Fault Tolerance

SLIDE 29
Conclusions

System Design
  • Random multicast
  • Incremental SV checkpoint
  • Periodic centralized synchronization
  • Parameter-replicas rotation

Algorithm Design
  • SV selection
  • Using SVs to represent parameter states
  • Automatic identification of SVs

Together these cover communication, fault tolerance, consistency, and the programming interface.