SLIDE 1

Distributed Training II

Benjamin Glickenhaus, Brendan Shanahan, Yash Bidasaria

SLIDE 2

Context: Distributed Training

  • Models are getting too big to fit on just one GPU

○ Turing-NLG (Microsoft) has 17 billion parameters

  • Because model training is iterative, communication between different nodes becomes a bottleneck
  • Distributed Training can be split broadly into two different types:

○ Data Parallel
○ Model Parallel

  • Even these approaches result in parallelization performance that is far from optimal

SLIDE 3

Model Size through the years

Source: Microsoft

SLIDE 4

Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent

Xiangru Lian et al., University of Rochester, ETH Zurich, UC Davis, IBM, Tencent

SLIDE 5

Context

  • Decentralized algorithms are treated as a compromise: decentralized communication is something we resort to and pay a price for
  • Current analysis: decentralized PSGD offers no performance advantage over centralized PSGD, assuming a decentralized network topology
  • Popular ML systems (TensorFlow, PyTorch, etc.) are built to support centralized execution

Lee, G. M. "A survey on trust computation in the internet of things." Information and Communications Magazine 33.2 (2016): 10-27.

SLIDE 6

Parallel Stochastic Gradient Descent

  • Centralized network topology, e.g. the parameter-server model
  • Communication bottleneck at the central node(s); performance decreases with increasing network latency
  • Convergence rate: O(1/√(nK)) for n workers after K iterations
  • Communication overhead: O(n) on the central node per iteration

Li, Mu, et al. "Scaling distributed machine learning with the parameter server." 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 2014.

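As a concrete reference point for the centralized setup above, here is a minimal single-process simulation of parameter-server-style parallel SGD (an illustrative sketch, not any particular system's API; the toy quadratic objective and all names are invented for this example):

# A minimal sketch of centralized (parameter-server style) parallel SGD,
# simulated in one process with numpy (illustrative only).
import numpy as np

def psgd_centralized(grad_fn, x0, n_workers, n_iters, lr=0.1):
    """grad_fn(x, rng) returns one worker's stochastic gradient at x."""
    x = x0.copy()                          # parameters live on the central server
    rngs = [np.random.default_rng(i) for i in range(n_workers)]
    for _ in range(n_iters):
        # Each worker pulls x, computes a stochastic gradient, and pushes it back.
        grads = [grad_fn(x, rng) for rng in rngs]
        # The server aggregates all n gradients (the O(n) communication bottleneck)
        # and applies a single update.
        x -= lr * np.mean(grads, axis=0)
    return x

# Toy problem: minimize ||x||^2 with noisy gradients.
grad = lambda x, rng: 2 * x + 0.1 * rng.standard_normal(x.shape)
print(psgd_centralized(grad, np.ones(4), n_workers=8, n_iters=100))
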
SLIDE 7

Parallel Stochastic Gradient Descent

Zinkevich, Martin, et al. "Parallelized stochastic gradient descent." Advances in neural information processing systems. 2010.

SLIDE 8

Decentralized PSGD

  • Requires either that all nodes can access a shared database, or that each node samples from its own local data partition (data-parallel approach)
  • Either setting implies the same convergence rate asymptotically
  • Communication topology is represented by an undirected graph with a doubly-stochastic weight matrix W

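The D-PSGD update itself is simple: each node takes a local stochastic gradient step and averages its model with its neighbors through W. Here is a minimal single-process simulation (an illustrative sketch, not the authors' implementation; the ring weight matrix and toy objective are chosen just for this example):

# A minimal single-process simulation of a D-PSGD-style run (illustrative only):
# each node takes a local stochastic gradient step and then averages its model
# with its neighbors using the doubly-stochastic weight matrix W.
import numpy as np

def dpsgd(grad_fn, x0, W, n_iters, lr=0.1):
    n = W.shape[0]                              # number of nodes
    X = np.tile(x0, (n, 1))                     # one model copy per node (rows)
    rngs = [np.random.default_rng(i) for i in range(n)]
    for _ in range(n_iters):
        G = np.stack([grad_fn(X[i], rngs[i]) for i in range(n)])
        X = W @ X - lr * G                      # neighbor averaging + local SGD step
    return X.mean(axis=0)                       # average of the local models

# Ring of 5 nodes: each node averages itself and its two neighbors.
n = 5
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1 / 3                     # doubly stochastic by construction

grad = lambda x, rng: 2 * x + 0.1 * rng.standard_normal(x.shape)
print(dpsgd(grad, np.ones(4), W, n_iters=200))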

SLIDE 9

Decentralized PSGD

Lian, Xiangru, et al. "Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent." Advances in Neural Information Processing Systems. 2017.

SLIDE 10

Analysis: D-PSGD

Convergence rate for Decentralized PSGD
* Assuming an appropriately chosen learning rate and a sufficiently large number of iterations K

SLIDE 11

Analysis: D-PSGD

Convergence rate for Decentralized PSGD
* Assuming an appropriately chosen learning rate and a sufficiently large number of iterations K

Further assumptions:

  • Lipschitz continuous gradients (bounded curvature/Hessian spectral radius)
  • Weight matrix has bounded spectral gap
  • Bounded variance w.r.t. local data sample
  • Globally bounded variance
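
Under these assumptions, the resulting bound has roughly the following shape (a from-memory sketch with constants and the exact learning-rate choice omitted; see Lian et al. 2017 for the precise statement):

\frac{1}{K}\sum_{k=1}^{K} \mathbb{E}\,\big\|\nabla f(\bar{x}_k)\big\|^2 \;\lesssim\; \frac{f(x_0) - f^{*}}{K} \;+\; \frac{\sigma}{\sqrt{nK}}

where \bar{x}_k is the average of the n local models at iteration k, \sigma^2 bounds the variance of the stochastic gradients, and f^{*} is the optimal value.
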
SLIDE 12

Analysis: D-PSGD

Convergence rate for Decentralized PSGD: O(1/√(nK)) convergence rate, same as C-PSGD

SLIDE 13

Analysis: D-PSGD

Convergence rate for Decentralized PSGD: O(1/√(nK)) convergence rate, same as C-PSGD; an ε-approximate solution has complexity O(1/ε²), shared between the n nodes, giving linear speedup

SLIDE 14

Analysis: D-PSGD

Convergence rate for Decentralized PSGD: O(1/√(nK)) convergence rate, same as C-PSGD; an ε-approximate solution has complexity O(1/ε²), shared between the n nodes, giving linear speedup; communication overhead on each node scales with its degree in the network graph (low for sparse, i.e. very decentralized, networks)

SLIDE 15

Analysis: Ring network

SLIDE 16

Analysis: Ring network

With a shared database (all nodes sample the same data): same convergence rate as C-PSGD once the number of iterations is sufficiently large relative to the number of nodes. With partitioned data (each node samples its own shard): same convergence rate, under a stricter requirement on the number of iterations.


SLIDE 17

Proof:

SLIDE 18

Intuition

  • The graph has a Laplacian with spectrum 0 = λ1 ≤ λ2 ≤ … ≤ λn

SLIDE 19

Intuition

  • The graph has a Laplacian with spectrum 0 = λ1 ≤ λ2 ≤ … ≤ λn
  • The doubly-stochastic weight matrix W has spectrum contained in [-1, 1]

SLIDE 20

Intuition

  • The graph has a Laplacian with spectrum 0 = λ1 ≤ λ2 ≤ … ≤ λn
  • The doubly-stochastic weight matrix W has spectrum contained in [-1, 1]
  • By the Perron-Frobenius theorem, the largest eigenvalue of W is 1, with the all-ones vector as its eigenvector

SLIDE 21

Intuition

  • The graph has a Laplacian with spectrum 0 = λ1 ≤ λ2 ≤ … ≤ λn
  • The doubly-stochastic weight matrix W has spectrum contained in [-1, 1]
  • By the Perron-Frobenius theorem, the largest eigenvalue of W is 1, with the all-ones vector as its eigenvector
  • The spectral gap of W (how far its other eigenvalues are from 1 in magnitude) controls how quickly averaging spreads information across the network

SLIDE 22

Some facts from spectral graph theory

SLIDE 30

Intuition

Recall the spectral quantities from the previous slides:

SLIDE 31

Intuition

Recall: λ2 of the Laplacian (via Cheeger's inequality) ~ “What’s the worst possible bottleneck between two clusters?”

SLIDE 32

Intuition

Recall: λ2 of the Laplacian (via Cheeger's inequality) ~ “What’s the worst possible bottleneck between two clusters?”; the spread of the spectrum ~ “How uniformly are edges distributed between nodes?”

SLIDE 33

Intuition

Recall: λ2 of the Laplacian (via Cheeger's inequality) ~ “What’s the worst possible bottleneck between two clusters?”; the spread of the spectrum ~ “How uniformly are edges distributed between nodes?”; the spectral gap ~ “How random is the network?”

SLIDE 34

Intuition

Number of iterations needed before linear speedup kicks in depends on the spectrum of W:

○ Degenerate spectrum: densely connected network, fewer iterations needed
○ Broad spectrum: weakly connected network, more iterations needed

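One quick way to see this spectrum/connectivity link is to compare the averaging matrix W of a ring against that of a fully connected network (a small numpy illustration; the uniform 1/3 and 1/n weights are just a convenient choice):

# Spectral gap of the averaging matrix W for a ring vs. a fully connected
# network of n nodes (small numpy illustration).
import numpy as np

def spectral_gap(W):
    eig = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return 1.0 - eig[1]            # gap between the leading eigenvalue (1) and the rest

n = 16
ring = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        ring[i, j % n] = 1 / 3     # each node averages itself and its two neighbors

complete = np.full((n, n), 1 / n)  # every node averages with every other node

print(spectral_gap(ring))      # small gap: broad spectrum, weakly connected
print(spectral_gap(complete))  # gap = 1: degenerate spectrum, densely connected
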
SLIDE 35

Evaluation: Image processing

SLIDE 36

Evaluation: Image processing

SLIDE 37

Evaluation: EA(M)-SGD

SLIDE 38

Evaluation: NLP

SLIDE 39

Beyond Data and Model Parallelism for Deep Neural Networks

Zhihao Jia, Matei Zaharia, Alex Aiken, Stanford University

SLIDE 40

Motivation

  • Data and model parallelization have become the go-to choices for distributed training
  • These limited options result in suboptimal parallelization performance
  • A more comprehensive parallelization search space may lead to better parallelization strategies

SLIDE 41

Motivation

  • Data and model parallelization have become the go-to choices for distributed training
  • These limited options result in suboptimal parallelization performance
  • A more comprehensive parallelization search space may lead to better parallelization strategies

SLIDE 42

Proposed Solution

SOAP

Sample Operation Attribute Parameter

SLIDE 43

Proposed Solution

SOAP

Sample, Parameter: the standard data (sample) and model (parameter) parallelism approaches from previous work

SLIDE 44

Proposed Solution

SOAP

Operation: the different operations performed in a DNN (e.g., MatMul, Convolution, etc.)

SLIDE 45

Proposed Solution

SOAP

Attribute: attributes of a particular variable (e.g., length of a sequence, number of channels, etc.)

SLIDE 46

Proposed Solution

SOAP

Sample Operation Attribute Parameter
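
To make the four SOAP dimensions concrete, here is a hypothetical description of one parallelization choice for a single MatMul operator, written as a plain Python dictionary (all field names and device names are invented for this sketch; this is not FlexFlow's actual configuration format):

# Hypothetical illustration of a SOAP-style parallelization choice for one
# operator (names invented for this sketch; not FlexFlow's actual API).
matmul_config = {
    "operation": "MatMul_3",        # Operation: which DNN operator this applies to
    "partition": {
        "sample":    4,             # Sample: split the batch dimension across 4 devices
        "attribute": 1,             # Attribute: e.g. sequence length / channels, not split here
        "parameter": 2,             # Parameter: split the weight matrix across 2 devices
    },
    "devices": [f"gpu{i}" for i in range(8)],   # 4 x 1 x 2 = 8 devices in total
}
print(matmul_config)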

SLIDE 47

Proposed Solution

Execution Simulator

  • Allows FlexFlow to search a much broader search space without needing to actually execute parallelization strategies
  • Assumes operation O on device D takes constant time
  • Takes a device topology D, an operator graph G, and a parallelization strategy S, and predicts the runtime
  • Uses MCMC sampling to iteratively propose strategies S* within the allocated time budget (see the sketch below)

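A minimal sketch of that search loop, assuming a `simulate` function that plays the role of the execution simulator and a `propose` function that randomly perturbs a strategy (both toy stand-ins, not FlexFlow's actual components):

# A minimal sketch of MCMC-style strategy search (illustrative only): propose a
# random change to the current strategy, estimate its runtime with a simulator,
# and accept/reject Metropolis-style until the time budget runs out.
import math, random, time

def mcmc_search(initial, propose, simulate, budget_s, beta=0.5):
    current, current_cost = initial, simulate(initial)
    best, best_cost = current, current_cost
    deadline = time.time() + budget_s
    while time.time() < deadline:
        candidate = propose(current)                 # e.g. re-place one operator
        cost = simulate(candidate)                   # predicted runtime, no real execution
        # Accept better strategies always, worse ones with probability exp(-beta * delta).
        if cost < current_cost or random.random() < math.exp(-beta * (cost - current_cost)):
            current, current_cost = candidate, cost
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost

# Toy usage: a "strategy" is just a batch-split degree; the fake simulator prefers 4.
propose = lambda s: max(1, s + random.choice([-1, 1]))
simulate = lambda s: abs(s - 4) + 1.0
print(mcmc_search(2, propose, simulate, budget_s=0.1))
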
SLIDE 48

Results - Per-iteration Throughput

SLIDE 49

Results - Communication Overhead

SLIDE 50

Results - Novelty

SLIDE 51

PipeDream: Generalized Pipeline Parallelism for DNN Training

Deepak Narayanan et al., Microsoft, CMU, Stanford

SLIDE 52

Intra Batch Parallelism

  • Data Parallelism

○ Communication between workers is a bottleneck.

SLIDE 53

Intra Batch Parallelism

  • Model Parallelism

○ Unused resources
○ Partitioning the model is left to the programmer

SLIDE 54

Inter Batch Parallelism: GPipe (Huang et al.)

  • Uses pipelining in the context of model-parallel training for very large models
  • Does not specify a partitioning algorithm
  • Splits a minibatch into m microbatches (see the toy example below)
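
A toy check of the microbatching idea: accumulating size-weighted gradients over m microbatches reproduces the full-minibatch gradient, shown here for a least-squares objective (illustrative only; GPipe does this layer by layer through the pipeline and applies one synchronous update per minibatch):

# Microbatching as in GPipe (toy illustration): gradients from m microbatches
# are accumulated so the update matches the full-minibatch gradient.
import numpy as np

def minibatch_grad(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)          # least-squares gradient

def microbatched_grad(w, X, y, m):
    chunks = list(zip(np.array_split(X, m), np.array_split(y, m)))
    grads = [minibatch_grad(w, Xc, yc) for Xc, yc in chunks]
    sizes = [len(yc) for _, yc in chunks]
    return np.average(grads, axis=0, weights=sizes)  # weighted by microbatch size

rng = np.random.default_rng(0)
X, y, w = rng.standard_normal((32, 3)), rng.standard_normal(32), np.zeros(3)
print(np.allclose(minibatch_grad(w, X, y), microbatched_grad(w, X, y, m=4)))  # True
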
SLIDE 55

Introducing Pipeline Parallelism

  • Combination of inter-batch and intra-batch parallelism
  • Model layers are mapped to stages; each stage consists of consecutive layers and is mapped to a separate GPU

SLIDE 56

Introducing Pipeline Parallelism

  • Multiple minibatches are inserted into the pipeline together to take advantage of all machines

SLIDE 57

Three Challenges

  • Work Partitioning

○ How to partition the DNN model into stages?

  • Work Scheduling

○ How does scheduling work in this bi-directional pipeline?

  • Effective Learning

○ How to use correct and updated weights for faster learning?

SLIDE 58

Challenge 1: Work Partitioning

  • Goals:

○ Each stage performs roughly the same amount of computation
○ Inter-stage communication is minimized

  • Profiling:

○ Computation time (forward/backward)
○ Size of layer outputs

  • Partitioning Algorithm (see the sketch below):

○ Partitioning of layers into stages
○ Replication factor (number of workers for each stage)
○ Optimal number of minibatches to keep workers busy
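
A minimal sketch of the partitioning idea under simplifying assumptions (contiguous stages, profiled per-layer compute times only, no replication and no communication cost; not PipeDream's actual algorithm):

# Split a chain of layers into `num_stages` contiguous stages so that the
# slowest stage is as fast as possible (simplified sketch).
from functools import lru_cache

def partition(layer_times, num_stages):
    n = len(layer_times)
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(i, s):
        # Minimal achievable "max stage time" for layers i..n-1 using s stages.
        if s == 1:
            return prefix[n] - prefix[i]
        result = float("inf")
        for j in range(i + 1, n - s + 2):        # first stage = layers i..j-1
            stage_time = prefix[j] - prefix[i]
            result = min(result, max(stage_time, best(j, s - 1)))
        return result

    return best(0, num_stages)

# Example: 6 profiled per-layer times split across 3 GPUs (stages).
print(partition([4.0, 2.0, 3.0, 1.0, 5.0, 2.0], 3))  # -> 7.0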

SLIDE 59

Challenge 2: Work Scheduling

  • Bidirectional Pipeline

○ Each active minibatch in the pipeline may be in a different stage

  • Alternate between forward and backward passes (1F1B); a sketch of the resulting schedule follows
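
A simplified sketch of a 1F1B-style schedule for a single stage (illustrative only; PipeDream's real scheduler also handles stage replication and the bidirectional pipeline):

# Produce the operation order ("F i" / "B i") for one stage under a 1F1B-style
# schedule: warm-up forward passes, then alternate one forward and one backward.
def one_f_one_b(stage, num_stages, num_minibatches):
    warmup = num_stages - stage - 1          # forwards done before the first backward
    ops, fwd, bwd = [], 0, 0
    for _ in range(min(warmup, num_minibatches)):
        ops.append(f"F{fwd}"); fwd += 1
    while bwd < num_minibatches:             # steady state: 1 forward, 1 backward
        if fwd < num_minibatches:
            ops.append(f"F{fwd}"); fwd += 1
        ops.append(f"B{bwd}"); bwd += 1
    return ops

for s in range(3):
    print(s, one_f_one_b(s, num_stages=3, num_minibatches=5))
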
SLIDE 60

Challenge 3: Effective Learning

  • Weight Stashing (see the sketch below)

○ Maintain weights for each in-flight minibatch
○ Forward pass: use the latest weights
○ Backward pass: use the corresponding stashed weights

  • Vertical Sync

○ After performing the backward pass of a minibatch, apply the latest updates to create new weights
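
A minimal sketch of weight stashing on one stage, using a toy scalar layer y = w·x so the example is self-contained (illustrative only, not PipeDream's implementation):

# Weight stashing on one pipeline stage: each in-flight minibatch records the
# weight version used in its forward pass, and its backward pass reuses it.
class Stage:
    def __init__(self, w, lr=0.1):
        self.w = w           # latest weight on this stage
        self.lr = lr
        self.stash = {}      # minibatch id -> (weight version, input) used in forward

    def forward(self, mb_id, x):
        self.stash[mb_id] = (self.w, x)   # remember the version used for this minibatch
        return self.w * x

    def backward(self, mb_id, grad_out):
        w, x = self.stash.pop(mb_id)      # same weight version as the forward pass
        grad_w = grad_out * x             # d(loss)/dw for y = w * x
        grad_in = grad_out * w            # gradient passed to the previous stage
        self.w -= self.lr * grad_w        # the latest weight advances
        return grad_in

stage = Stage(w=1.0)
y0 = stage.forward(0, x=2.0)     # minibatch 0 forward with w = 1.0
y1 = stage.forward(1, x=3.0)     # minibatch 1 forward, also with w = 1.0 (stashed)
stage.backward(0, grad_out=1.0)  # backward for minibatch 0 uses its stashed w
stage.backward(1, grad_out=1.0)  # minibatch 1 still uses the w it saw in forward
print(stage.w)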

SLIDE 61

Experiments and Results

SLIDE 62

Experiments and Results

SLIDE 63

Thank You!