PPoPP 20 Feb. 22-26, 2020 San Diego, CA, US spcl.inf.ethz.ch - PowerPoint PPT Presentation

spcl.inf.ethz.ch @spcl_eth
Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations
Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Torsten Hoefler (ETH Zurich), Dan Alistarh (IST Austria)
PPoPP '20, Feb. 22-26, 2020, San Diego, CA, US

Deep learning training: model parallelism
The overall objective function: $f(w) = \mathbb{E}_{\xi \sim D}[F(w; \xi)]$
▪ w denotes the model parameters.
▪ F is the loss function.
▪ ξ is a data point sampled from a distribution D.
Training: optimize w to minimize f (using SGD).
[Figure: processes P0, P1, P2 and the dataset]

Deep learning training: pipeline parallelism
The overall objective function: $f(w) = \mathbb{E}_{\xi \sim D}[F(w; \xi)]$
▪ w denotes the model parameters.
▪ F is the loss function.
▪ ξ is a data point sampled from a distribution D.
Training: optimize w to minimize f (using SGD).
[Figure: processes P0, P1, P2 and the dataset]

Deep learning training: data parallelism
Global synchronization using Allreduce.
The overall objective function: $f(w) = \mathbb{E}_{\xi \sim D}[F(w; \xi)]$
▪ w denotes the model parameters.
▪ F is the loss function.
▪ ξ is a data point sampled from a distribution D.
Training: optimize w to minimize f (using SGD).
[Figure: processes P0, P1, P2 and the dataset]
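Data-parallel SGD with a global allreduce can be sketched in a few lines. This is a single-machine simulation only, not the authors' implementation: the one-parameter model, the synthetic dataset (y = 3x), and the names `allreduce_mean`, `local_gradient`, and `train` are all assumptions made for illustration.

```python
import random


def allreduce_mean(local_grads):
    """Average one gradient per process; every process receives the same result."""
    p = len(local_grads)
    dim = len(local_grads[0])
    avg = [sum(g[i] for g in local_grads) / p for i in range(dim)]
    return [list(avg) for _ in range(p)]


def local_gradient(w, batch):
    """Gradient of the loss F(w; (x, y)) = (w*x - y)^2, averaged over a mini-batch."""
    return [sum(2.0 * (w[0] * x - y) * x for x, y in batch) / len(batch)]


def train(num_procs=3, steps=300, lr=0.1, seed=0):
    rng = random.Random(seed)
    # Hypothetical dataset: noise-free samples of y = 3x.
    data = [(x, 3.0 * x) for x in (rng.uniform(-1, 1) for _ in range(300))]
    # Each process owns a disjoint shard of the data and a replica of w.
    shards = [data[r::num_procs] for r in range(num_procs)]
    weights = [[0.0] for _ in range(num_procs)]
    for _ in range(steps):
        grads = [local_gradient(weights[r], rng.sample(shards[r], 8))
                 for r in range(num_procs)]
        reduced = allreduce_mean(grads)  # global synchronization point
        for r in range(num_procs):
            weights[r][0] -= lr * reduced[r][0]
    return weights
```

Because every replica applies the same averaged gradient, all replicas stay identical after each step; this is also why the slowest process determines when a step can finish.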

Unbalanced training workloads
▪ Load imbalance on application level
  ▪ Recurrent Neural Networks (RNN/LSTM/GRU)
  ▪ Transformers
  [Figure: different types of RNNs (one input, multiple outputs; multiple inputs, one output; multiple inputs, multiple outputs)]
▪ Load imbalance on system level
  ▪ Performance variability on multitenant cloud systems
  ▪ System or network noise (interrupts, daemons, page/cache misses, etc.)
  [Figure: multitenant cloud system]
Challenge: stragglers dominate the performance.

Many-to-one RNN for video classification
RNN: $h_t = f_w(h_{t-1}, x_t)$
Workload is proportional to T.
[Figure: a many-to-one RNN unrolled over inputs x_1 ... x_T with hidden states h_0 ... h_T, followed by layers FC 1 and FC 2 that output class scores (e.g. "Playing Basketball"); the backward pass propagates the loss L(w) through L(W_1) ... L(W_T).]
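To make the "workload is proportional to T" point concrete, here is a toy many-to-one recurrence that counts how many times the cell is applied. The scalar cell, the tanh nonlinearity, and the weights `w_h` and `w_x` are assumptions chosen for brevity, not the model used in the talk.

```python
import math


def rnn_many_to_one(xs, w_h=0.5, w_x=0.3):
    """Apply h_t = tanh(w_h * h_{t-1} + w_x * x_t) to every input.

    Returns the final hidden state and the number of cell applications.
    """
    h = 0.0
    steps = 0
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)
        steps += 1
    return h, steps


# Hypothetical videos at the two extremes of the UCF101 length range.
_, short_steps = rnn_many_to_one([0.1] * 29)
_, long_steps = rnn_many_to_one([0.1] * 1776)
ratio = long_steps / short_steps  # roughly 61x more recurrent steps
```

The forward pass touches each frame once, and backpropagation through time revisits the same T steps, so a process that draws long videos does proportionally more work per sample.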

Workload statistics for video classification
(a) Video length distribution for the UCF101 dataset
    Range: 29 ~ 1,776 frames
    Mean: 187 frames
    Standard deviation: 97 frames
(b) Runtime distribution for the mini-batches to train an LSTM model on P100
    Range: 201 ~ 3,410 ms
    Mean: 1,235 ms
    Standard deviation: 706 ms

Transformer [1]
[Figure: encoder-decoder Transformer translating "知识就是力量。" into "Knowledge is power."]
The workload is proportional to input_size * output_size.
Runtime distribution for the mini-batches to train a Transformer model (using WMT16) on P100
    Range: 179 ~ 3,482 ms
    Mean: 475 ms
    Standard deviation: 144 ms
[1] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in NeurIPS, pp. 5998-6008. 2017.

Training on Cloud
Runtime distribution on Google Cloud with 2xV100 GPUs (batch size=256, ResNet-50 on ImageNet)
    Range: 399 ~ 1,892 ms
    Mean: 454 ms
    Standard deviation: 116 ms
▪ Compared with imbalanced applications (e.g., LSTM, Transformer), the load imbalance on cloud servers is relatively light.
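One way to compare the three runtime distributions reported in these slides is the coefficient of variation (standard deviation divided by mean). The metric is an assumption of this sketch rather than something the talk uses; the means and standard deviations are the ones quoted on the slides.

```python
# (mean_ms, std_ms) as reported on the slides.
runs = {
    "LSTM on UCF101 (P100)": (1235.0, 706.0),
    "Transformer on WMT16 (P100)": (475.0, 144.0),
    "ResNet-50 on Google Cloud (2xV100)": (454.0, 116.0),
}


def coefficient_of_variation(mean_ms, std_ms):
    """Relative spread of the per-mini-batch runtime."""
    return std_ms / mean_ms


cv = {name: coefficient_of_variation(*stats) for name, stats in runs.items()}
for name, value in sorted(cv.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.2f}")
```

Under this measure the cloud trace has the smallest relative spread and the LSTM workload the largest, which matches the slide's statement that system-level imbalance is comparatively light.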

Deep learning training is robust
[Figure panels: Allreduce; gradient sparsification (Top-k); 1-bit gradient quantization; hidden-unit dropout; gossiping among processes P-1, P, P+1]
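Two of the gradient perturbations named on this slide can be sketched as follows. These are simplified textbook formulations, assumed here for illustration: practical schemes usually also keep the discarded residual locally (error feedback), which is omitted, and the function names are hypothetical.

```python
def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad and zero out the rest."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def one_bit_quantize(grad):
    """Encode each entry by its sign, scaled by the mean absolute value."""
    scale = sum(abs(g) for g in grad) / len(grad)
    return [scale if g >= 0 else -scale for g in grad]


sparse = top_k_sparsify([0.1, -2.0, 0.5, 3.0], k=2)
quantized = one_bit_quantize([1.0, -3.0])
```

Both transformations send an approximation of the true gradient, yet training typically still converges; eager-SGD relies on the same tolerance when it allows stale gradients.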

Eager-SGD to solve the load imbalance problem
[Figure: timelines of Process 0 ... Process n computing W(0), W(1), W(2). (a) synch-SGD: idle periods appear before each synch-allreduce. (b) eager-SGD: processes proceed through partial-allreduce.]
Eager-SGD exploits the robustness of the training by allowing allreduce on stale gradients.
Gossip-based SGDs | Communication participants | Number of steps for update propagation | Consistency mode
D-PSGD [1] | 2 | O(P) | synchronous
AD-PSGD [2] | 1 | O(log P) | asynchronous
eager-SGD | P | 1 | asynchronous
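The difference between waiting for the slowest process and not waiting can be illustrated with a deliberately simplified timing model. Everything here is an assumption of the sketch: communication cost is ignored, one randomly chosen process per step is delayed by a fixed amount, and an eager process is assumed never to wait. It is not meant to reproduce the speedups reported in the talk.

```python
import random


def simulate(num_procs=8, steps=1000, base_ms=100.0, delay_ms=300.0, seed=0):
    """Return (synch_total_ms, eager_total_ms) under the simplified model."""
    rng = random.Random(seed)
    sync_total = 0.0
    per_proc_total = [0.0] * num_procs
    for _ in range(steps):
        straggler = rng.randrange(num_procs)  # one delayed process per step
        times = [base_ms + (delay_ms if r == straggler else 0.0)
                 for r in range(num_procs)]
        # synch-SGD: every step ends only when the slowest process arrives.
        sync_total += max(times)
        # eager-SGD (idealized): each process only pays for its own work.
        for r in range(num_procs):
            per_proc_total[r] += times[r]
    eager_total = max(per_proc_total)
    return sync_total, eager_total


sync_ms, eager_ms = simulate()
```

In the synchronous case every step pays the full delay, while in the idealized eager case each process pays it only in the steps where it is the straggler itself.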

Partial Allreduce operations
▪ Two phases: the activation and the collective operation.
▪ Asynchronous execution: an auxiliary thread progresses the execution (activation and collective) in the background.
▪ Multiple initiators: the same operation is executed only once even if there are multiple initiators, i.e. multiple processes arrive at the same time.
[Figure: activation and allreduce phases among processes P0-P3, and the resulting schedule on P3 (operations S0-S3, R0-R3, N0-N1, C0-C1)]
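The "executed only once despite multiple initiators" property can be sketched with threads standing in for processes. The class `PartialAllreduceRound` and its `activate` method are hypothetical names; a real implementation would resolve concurrent activations with messages between processes rather than a shared lock.

```python
import threading


class PartialAllreduceRound:
    """One round that any process may try to activate, but that starts once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._activated = False
        self.executions = 0
        self.initiator = None

    def activate(self, rank):
        """Return True if this caller started the round, False if it was already active."""
        with self._lock:
            if self._activated:
                return False
            self._activated = True
            self.initiator = rank
            self.executions += 1  # the collective would be launched here
            return True


def run_round(num_procs=8):
    """Let num_procs threads race to activate the same round."""
    round_ = PartialAllreduceRound()
    results = [None] * num_procs

    def worker(rank):
        results[rank] = round_.activate(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return round_, results
```

Whichever thread reaches the lock first becomes the initiator; every later caller sees the round as already active and simply joins it.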

Solo allreduce and majority allreduce
▪ Two variants: solo allreduce [3] and majority allreduce.
▪ For solo, at least one process "actively" participates.
▪ For majority, a majority of processes must "actively" participate.
 | Solo allreduce | Majority allreduce
Initiator | The fastest process | A randomly specified process
Attributes | Wait-free | Wait for the randomly specified initiator
The expectation of the participants | Ω(1) | Ω(P/2)
[3] Di Girolamo, Salvatore, Pierre Jolivet, Keith D. Underwood, and Torsten Hoefler. "Exploiting offload enabled network interfaces." In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, pp. 26-33. IEEE, 2015.
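The expected number of active participants for the two variants can be checked with a small Monte Carlo experiment. The model is an assumption of this sketch: arrival times are independent and uniformly distributed, and a process counts as an active participant if it has arrived by the time the initiator arrives.

```python
import random


def active_participants(arrivals, initiator):
    """Count processes that have arrived no later than the initiator."""
    return sum(1 for t in arrivals if t <= arrivals[initiator])


def expected_participants(num_procs=8, trials=20000, seed=0):
    """Return the average active participants for (solo, majority)."""
    rng = random.Random(seed)
    solo_sum = 0
    majority_sum = 0
    for _ in range(trials):
        arrivals = [rng.random() for _ in range(num_procs)]
        # Solo: the fastest process initiates.
        fastest = min(range(num_procs), key=arrivals.__getitem__)
        solo_sum += active_participants(arrivals, fastest)
        # Majority: a randomly specified process initiates.
        majority_sum += active_participants(arrivals, rng.randrange(num_procs))
    return solo_sum / trials, majority_sum / trials


solo_avg, majority_avg = expected_participants()
```

Under this model the random initiator is equally likely to be the 1st, 2nd, ..., P-th process to arrive, so the majority variant averages (P + 1) / 2 active participants, in line with the Ω(P/2) entry in the table, while solo only guarantees the initiator itself.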

Implementation of eager-SGD based on TensorFlow
▪ Customized distributed optimizer based on TensorFlow.
▪ Eager-SGD utilizes the execution engine of TF to exploit the parallelism in the computation DAG.
[Figure: computation DAG with forward and backward passes over Conv-BN-ReLU, Max Pool, Conv-BN, and Addition layers; allreduce operations 1-4 are ordered by control dependencies.]

Execution of eager-SGD
1. Two processes, P0 and P1; P1 is faster.
2. P1 finishes calculating the gradients of step t and triggers partial-allreduce. P0 contributes NULL.
3. P0 finishes step t and discovers that partial-allreduce is already done. P0 copies the stale gradients to its send buffer.
4. P0 catches up with P1 in step t+1. The stale gradients are combined with the latest gradients and then committed to partial-allreduce.
[Figure: send and receive buffers of P0 and P1, each with a computation thread and a communication thread. Step t: P0 sends null and P1 sends $G_t^1$; both receive buffers hold $G_t^1$, which is used for the weight update U. Step t+1: P0 sends $G_{t+1}^{0'} = G_{t+1}^0 + G_t^0$ and P1 sends $G_{t+1}^1$; both receive buffers hold $G_{t+1}^{0'} + G_{t+1}^1$.]
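The four steps can be traced with a tiny buffer model. It is a sketch under stated assumptions: gradients are one-element lists with made-up values, the reduction is a plain sum, and `EagerProcess`, `partial_allreduce`, and `walkthrough` are hypothetical names, not the authors' API.

```python
class EagerProcess:
    """Holds a send buffer; None models the NULL contribution."""

    def __init__(self):
        self.sendbuf = None

    def commit(self, grad):
        """Add a newly computed gradient to whatever is still pending."""
        if self.sendbuf is None:
            self.sendbuf = list(grad)
        else:
            self.sendbuf = [a + b for a, b in zip(self.sendbuf, grad)]


def partial_allreduce(procs, dim):
    """Sum all pending send buffers (None counts as zero), then clear them."""
    total = [0.0] * dim
    for p in procs:
        if p.sendbuf is not None:
            total = [a + b for a, b in zip(total, p.sendbuf)]
            p.sendbuf = None
    return total


def walkthrough():
    p0, p1 = EagerProcess(), EagerProcess()
    # Step t: P1 finishes first and triggers the collective; P0 contributes NULL.
    p1.commit([1.0])                            # G_t^1
    result_t = partial_allreduce([p0, p1], 1)
    # P0 finishes step t late: its gradient is now stale and stays in the send buffer.
    p0.commit([2.0])                            # G_t^0
    # Step t+1: P0 adds its fresh gradient to the stale one; both processes contribute.
    p0.commit([3.0])                            # G_{t+1}^0, buffer becomes G_{t+1}^{0'}
    p1.commit([4.0])                            # G_{t+1}^1
    result_t1 = partial_allreduce([p0, p1], 1)
    return result_t, result_t1
```

In this trace the step-t result contains only P1's gradient, and the step-t+1 result contains P0's stale gradient in addition to both fresh ones, so no computed gradient is dropped.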

Convergence of eager-SGD
▪ For a learning rate value [expression not preserved], eager-SGD converges after [expression not preserved] iterations.
▪ Note the dependence of the total iterations T on τ (the staleness bound) and P - Q (the bound on the number of stale gradients), where P is the total number of processes and Q is the number of processes which contribute the latest gradients.
▪ Eager-SGD would converge slower if too many stale gradients are used.

Evaluation
▪ CSCS Piz Daint supercomputer.
▪ Cray Aries interconnection network.
▪ Cray MPICH 7.7.2 communication library.
▪ Each node contains a 12-core Intel Xeon E5-2690 CPU, and one NVIDIA Tesla P100 GPU.
▪ We compare eager-SGD with the allreduce-based synch-SGD (Horovod and Deep500), the asynchronous centralized SGD (TF parameter server), and the gossip SGDs (D-PSGD, SGP).
Table 1. Neural networks used for evaluation [rows grouped into: simulated load imbalance (traces on cloud machine); inherent load imbalance]

Hyperplane regression (light load imbalance)
▪ Eager-SGD (solo) achieves 1.50x, 1.75x, and 2.01x speedup over synch-SGD (Deep500) for 200, 300, and 400 ms load imbalance injection, respectively.
▪ The loss value is equivalent to that of synch-SGD (Deep500).
[Figure: Synch-SGD vs eager-SGD for hyperplane regression using 8 GPUs. "synch/eager-SGD-200/300/400" represent 200/300/400 ms load imbalance injection for 1 out of 8 processes.]

ResNet-50 on ImageNet (light load imbalance)
▪ Eager-SGD (solo) achieves 1.25x and 1.29x speedup over Deep500, and 1.14x and 1.27x speedup over Horovod, for 300 and 460 ms load imbalance injection, respectively. Top-1 accuracy is almost equivalent (75.2% vs 75.8%).
▪ Eager-SGD (solo) achieves 2.64x, 1.26x, and 1.17x speedup over asynch-PS and the gossip-based SGDs (D-PSGD, SGP), respectively.
[Figure: Synch-SGD vs eager-SGD for ResNet-50 on ImageNet using 64 GPUs. "synch/eager-SGD-300/460" represent 300/460 ms load imbalance injection for 4 out of 64 processes.]
[Figure: throughput (steps/second) of Asynch-PS, D-PSGD, SGP, and eager-SGD]

LSTM on UCF101 (severe load imbalance)
 | eager-SGD (solo) | eager-SGD (majority)
Speedup over Horovod | 1.64x | 1.27x
Top-1 test accuracy | 60.6% on average, up to 70.4% | 69.7% on average, up to 72.8%
[Figure: Top-1 test accuracy and runtime for LSTM on UCF101 using 8 GPUs.]


