SLIDE 1

Distributed Training II

Benjamin Glickenhaus, Brendan Shanahan, Yash Bidasaria

SLIDE 2

Context: Distributed Training

  • Models are getting too big to fit on just one GPU

○ Turing-NLG (Microsoft) has 17 billion parameters

  • Because model training is iterative, communication between different nodes becomes a bottleneck
  • Distributed Training can be split broadly into two different types:

○ Data Parallel
○ Model Parallel

  • Even these approaches result in parallelization performance that is far from optimal

SLIDE 3

Model Size through the years

Source: Microsoft

SLIDE 4

Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent

Xiangru Lian et al., University of Rochester, ETH Zurich, UC Davis, IBM, Tencent

SLIDE 5

Context

  • Decentralized algorithms are treated as a compromise: decentralized communication is something we resort to and pay a price for
  • Current analysis: decentralized PSGD offers no performance advantage over centralized PSGD, assuming a decentralized network topology
  • Popular ML systems (TensorFlow, PyTorch, etc.) are built to support centralized execution

Lee, G. M. "A survey on trust computation in the internet of things." Information and Communications Magazine 33.2 (2016): 10-27.

SLIDE 6

Parallel Stochastic Gradient Descent

  • Centralized network topology, e.g. the parameter-server model
  • Communication bottleneck at the central node(s); performance decreases with increasing network latency
  • Convergence rate: O(1/√(nK)) for n workers after K iterations
  • Communication overhead: O(n) on the central node per iteration

Li, Mu, et al. "Scaling distributed machine learning with the parameter server." 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 2014.

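As a concrete reference point for the centralized setup above, here is a minimal single-process simulation of parameter-server-style parallel SGD (an illustrative sketch, not any particular system's API; the toy quadratic objective and all names are invented for this example):

# A minimal sketch of centralized (parameter-server style) parallel SGD,
# simulated in one process with numpy (illustrative only).
import numpy as np

def psgd_centralized(grad_fn, x0, n_workers, n_iters, lr=0.1):
    """grad_fn(x, rng) returns one worker's stochastic gradient at x."""
    x = x0.copy()                          # parameters live on the central server
    rngs = [np.random.default_rng(i) for i in range(n_workers)]
    for _ in range(n_iters):
        # Each worker pulls x, computes a stochastic gradient, and pushes it back.
        grads = [grad_fn(x, rng) for rng in rngs]
        # The server aggregates all n gradients (the O(n) communication bottleneck)
        # and applies a single update.
        x -= lr * np.mean(grads, axis=0)
    return x

# Toy problem: minimize ||x||^2 with noisy gradients.
grad = lambda x, rng: 2 * x + 0.1 * rng.standard_normal(x.shape)
print(psgd_centralized(grad, np.ones(4), n_workers=8, n_iters=100))
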
SLIDE 7

Parallel Stochastic Gradient Descent

Zinkevich, Martin, et al. "Parallelized stochastic gradient descent." Advances in neural information processing systems. 2010.

SLIDE 8

Decentralized PSGD

  • Requires either that all nodes can access a shared database, or that each node samples from its own local data partition (data-parallel approach)
  • Either setting implies the same convergence rate asymptotically
  • Communication topology is represented by an undirected graph with a doubly-stochastic weight matrix W

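The D-PSGD update itself is simple: each node takes a local stochastic gradient step and averages its model with its neighbors through W. Here is a minimal single-process simulation (an illustrative sketch, not the authors' implementation; the ring weight matrix and toy objective are chosen just for this example):

# A minimal single-process simulation of a D-PSGD-style run (illustrative only):
# each node takes a local stochastic gradient step and then averages its model
# with its neighbors using the doubly-stochastic weight matrix W.
import numpy as np

def dpsgd(grad_fn, x0, W, n_iters, lr=0.1):
    n = W.shape[0]                              # number of nodes
    X = np.tile(x0, (n, 1))                     # one model copy per node (rows)
    rngs = [np.random.default_rng(i) for i in range(n)]
    for _ in range(n_iters):
        G = np.stack([grad_fn(X[i], rngs[i]) for i in range(n)])
        X = W @ X - lr * G                      # neighbor averaging + local SGD step
    return X.mean(axis=0)                       # average of the local models

# Ring of 5 nodes: each node averages itself and its two neighbors.
n = 5
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1 / 3                     # doubly stochastic by construction

grad = lambda x, rng: 2 * x + 0.1 * rng.standard_normal(x.shape)
print(dpsgd(grad, np.ones(4), W, n_iters=200))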

SLIDE 9

Decentralized PSGD

Lian, Xiangru, et al. "Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent." Advances in Neural Information Processing Systems. 2017.

SLIDE 10

Analysis: D-PSGD

Convergence rate for Decentralized PSGD
* Assuming an appropriately chosen learning rate and a sufficiently large number of iterations K

SLIDE 11

Analysis: D-PSGD

Convergence rate for Decentralized PSGD
* Assuming an appropriately chosen learning rate and a sufficiently large number of iterations K

Further assumptions:

  • Lipschitz continuous gradients (bounded curvature/Hessian spectral radius)
  • Weight matrix has bounded spectral gap
  • Bounded variance w.r.t. local data sample
  • Globally bounded variance
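
Under these assumptions, the resulting bound has roughly the following shape (a from-memory sketch with constants and the exact learning-rate choice omitted; see Lian et al. 2017 for the precise statement):

\frac{1}{K}\sum_{k=1}^{K} \mathbb{E}\,\big\|\nabla f(\bar{x}_k)\big\|^2 \;\lesssim\; \frac{f(x_0) - f^{*}}{K} \;+\; \frac{\sigma}{\sqrt{nK}}

where \bar{x}_k is the average of the n local models at iteration k, \sigma^2 bounds the variance of the stochastic gradients, and f^{*} is the optimal value.
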
SLIDE 12

Analysis: D-PSGD

Convergence rate for Decentralized PSGD: O(1/√(nK)) convergence rate, same as C-PSGD

SLIDE 13

Analysis: D-PSGD

Convergence rate for Decentralized PSGD: O(1/√(nK)) convergence rate, same as C-PSGD; an ε-approximate solution has complexity O(1/ε²), shared between the n nodes, giving linear speedup

SLIDE 14

Analysis: D-PSGD

Convergence rate for Decentralized PSGD: O(1/√(nK)) convergence rate, same as C-PSGD; an ε-approximate solution has complexity O(1/ε²), shared between the n nodes, giving linear speedup; communication overhead on each node scales with its degree in the network graph (low for sparse, i.e. very decentralized, networks)

SLIDE 15

Analysis: Ring network

SLIDE 16

Analysis: Ring network

With a shared database (all nodes sample the same data): same convergence rate as C-PSGD once the number of iterations is sufficiently large relative to the number of nodes. With partitioned data (each node samples its own shard): same convergence rate, under a stricter requirement on the number of iterations.


SLIDE 17

Proof:

SLIDE 18

Intuition

  • The graph has a Laplacian with spectrum 0 = λ1 ≤ λ2 ≤ … ≤ λn

SLIDE 19

Intuition

  • The graph has a Laplacian with spectrum 0 = λ1 ≤ λ2 ≤ … ≤ λn
  • The doubly-stochastic weight matrix W has spectrum contained in [-1, 1]

SLIDE 20

Intuition

  • The graph has a Laplacian with spectrum 0 = λ1 ≤ λ2 ≤ … ≤ λn
  • The doubly-stochastic weight matrix W has spectrum contained in [-1, 1]
  • By the Perron-Frobenius theorem, the largest eigenvalue of W is 1, with the all-ones vector as its eigenvector

SLIDE 21

Intuition

  • The graph has a Laplacian with spectrum 0 = λ1 ≤ λ2 ≤ … ≤ λn
  • The doubly-stochastic weight matrix W has spectrum contained in [-1, 1]
  • By the Perron-Frobenius theorem, the largest eigenvalue of W is 1, with the all-ones vector as its eigenvector
  • The spectral gap of W (how far its other eigenvalues are from 1 in magnitude) controls how quickly averaging spreads information across the network

SLIDE 22

Some facts from spectral graph theory

SLIDE 30

Intuition

Recall the spectral quantities from the previous slides:

SLIDE 31

Intuition

Recall: λ2 of the Laplacian (via Cheeger's inequality) ~ “What’s the worst possible bottleneck between two clusters?”

SLIDE 32

Intuition

Recall: λ2 of the Laplacian (via Cheeger's inequality) ~ “What’s the worst possible bottleneck between two clusters?”; the spread of the spectrum ~ “How uniformly are edges distributed between nodes?”

SLIDE 33

Intuition

Recall: λ2 of the Laplacian (via Cheeger's inequality) ~ “What’s the worst possible bottleneck between two clusters?”; the spread of the spectrum ~ “How uniformly are edges distributed between nodes?”; the spectral gap ~ “How random is the network?”

SLIDE 34

Intuition

Number of iterations needed before linear speedup kicks in depends on the spectrum of W:

○ Degenerate spectrum: densely connected network, fewer iterations needed
○ Broad spectrum: weakly connected network, more iterations needed

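One quick way to see this spectrum/connectivity link is to compare the averaging matrix W of a ring against that of a fully connected network (a small numpy illustration; the uniform 1/3 and 1/n weights are just a convenient choice):

# Spectral gap of the averaging matrix W for a ring vs. a fully connected
# network of n nodes (small numpy illustration).
import numpy as np

def spectral_gap(W):
    eig = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return 1.0 - eig[1]            # gap between the leading eigenvalue (1) and the rest

n = 16
ring = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        ring[i, j % n] = 1 / 3     # each node averages itself and its two neighbors

complete = np.full((n, n), 1 / n)  # every node averages with every other node

print(spectral_gap(ring))      # small gap: broad spectrum, weakly connected
print(spectral_gap(complete))  # gap = 1: degenerate spectrum, densely connected
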
SLIDE 35

Evaluation: Image processing

SLIDE 36

Evaluation: Image processing

SLIDE 37

Evaluation: EA(M)-SGD

SLIDE 38

Evaluation: NLP

SLIDE 39

Beyond Data and Model Parallelism for Deep Neural Networks

Zhihao Jia, Matei Zaharia, Alex Aiken, Stanford University

SLIDE 40

Motivation

  • Data and model parallelization have become the go-to choices for distributed training
  • These limited options result in suboptimal parallelization performance
  • A more comprehensive parallelization search space may lead to better parallelization strategies

SLIDE 41

Motivation

  • Data and model parallelization have become the go-to choices for distributed training
  • These limited options result in suboptimal parallelization performance
  • A more comprehensive parallelization search space may lead to better parallelization strategies

SLIDE 42

Proposed Solution

SOAP

Sample Operation Attribute Parameter

SLIDE 43

Proposed Solution

SOAP

Sample, Parameter: the standard data (sample) and model (parameter) parallelism approaches from previous work

SLIDE 44

Proposed Solution

SOAP

Operation: the different operations performed in a DNN (e.g., MatMul, Convolution, etc.)

SLIDE 45

Proposed Solution

SOAP

Attribute: attributes of a particular variable (e.g., length of a sequence, number of channels, etc.)

SLIDE 46

Proposed Solution

SOAP

Sample Operation Attribute Parameter
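
To make the four SOAP dimensions concrete, here is a hypothetical description of one parallelization choice for a single MatMul operator, written as a plain Python dictionary (all field names and device names are invented for this sketch; this is not FlexFlow's actual configuration format):

# Hypothetical illustration of a SOAP-style parallelization choice for one
# operator (names invented for this sketch; not FlexFlow's actual API).
matmul_config = {
    "operation": "MatMul_3",        # Operation: which DNN operator this applies to
    "partition": {
        "sample":    4,             # Sample: split the batch dimension across 4 devices
        "attribute": 1,             # Attribute: e.g. sequence length / channels, not split here
        "parameter": 2,             # Parameter: split the weight matrix across 2 devices
    },
    "devices": [f"gpu{i}" for i in range(8)],   # 4 x 1 x 2 = 8 devices in total
}
print(matmul_config)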

SLIDE 47

Proposed Solution

Execution Simulator

  • Allows FlexFlow to search a much broader search space without needing to actually execute parallelization strategies
  • Assumes operation O on device D takes constant time
  • Takes a device topology D, an operator graph G, and a parallelization strategy S, and predicts the runtime
  • Uses MCMC sampling to iteratively propose strategies S* within the allocated time budget (see the sketch below)

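A minimal sketch of that search loop, assuming a `simulate` function that plays the role of the execution simulator and a `propose` function that randomly perturbs a strategy (both toy stand-ins, not FlexFlow's actual components):

# A minimal sketch of MCMC-style strategy search (illustrative only): propose a
# random change to the current strategy, estimate its runtime with a simulator,
# and accept/reject Metropolis-style until the time budget runs out.
import math, random, time

def mcmc_search(initial, propose, simulate, budget_s, beta=0.5):
    current, current_cost = initial, simulate(initial)
    best, best_cost = current, current_cost
    deadline = time.time() + budget_s
    while time.time() < deadline:
        candidate = propose(current)                 # e.g. re-place one operator
        cost = simulate(candidate)                   # predicted runtime, no real execution
        # Accept better strategies always, worse ones with probability exp(-beta * delta).
        if cost < current_cost or random.random() < math.exp(-beta * (cost - current_cost)):
            current, current_cost = candidate, cost
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost

# Toy usage: a "strategy" is just a batch-split degree; the fake simulator prefers 4.
propose = lambda s: max(1, s + random.choice([-1, 1]))
simulate = lambda s: abs(s - 4) + 1.0
print(mcmc_search(2, propose, simulate, budget_s=0.1))
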
SLIDE 48

Results - Per-iteration Throughput

SLIDE 49

Results - Communication Overhead

SLIDE 50

Results - Novelty

SLIDE 51

PipeDream: Generalized Pipeline Parallelism for DNN Training

Deepak Narayanan et al., Microsoft, CMU, Stanford

SLIDE 52

Intra Batch Parallelism

  • Data Parallelism

○ Communication between workers is a bottleneck.

SLIDE 53

Intra Batch Parallelism

  • Model Parallelism

○ Unused resources
○ Partitioning the model is left to the programmer

SLIDE 54

Inter Batch Parallelism: GPipe (Huang et al.)

  • Uses pipelining in the context of model-parallel training for very large models
  • Does not specify a partitioning algorithm
  • Splits a minibatch into m microbatches (see the toy example below)
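
A toy check of the microbatching idea: accumulating size-weighted gradients over m microbatches reproduces the full-minibatch gradient, shown here for a least-squares objective (illustrative only; GPipe does this layer by layer through the pipeline and applies one synchronous update per minibatch):

# Microbatching as in GPipe (toy illustration): gradients from m microbatches
# are accumulated so the update matches the full-minibatch gradient.
import numpy as np

def minibatch_grad(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)          # least-squares gradient

def microbatched_grad(w, X, y, m):
    chunks = list(zip(np.array_split(X, m), np.array_split(y, m)))
    grads = [minibatch_grad(w, Xc, yc) for Xc, yc in chunks]
    sizes = [len(yc) for _, yc in chunks]
    return np.average(grads, axis=0, weights=sizes)  # weighted by microbatch size

rng = np.random.default_rng(0)
X, y, w = rng.standard_normal((32, 3)), rng.standard_normal(32), np.zeros(3)
print(np.allclose(minibatch_grad(w, X, y), microbatched_grad(w, X, y, m=4)))  # True
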
SLIDE 55

Introducing Pipeline Parallelism

  • Combination of inter-batch and intra-batch parallelism
  • Model layers are mapped to stages; each stage consists of consecutive layers and is mapped to a separate GPU

SLIDE 56

Introducing Pipeline Parallelism

  • Multiple minibatches are inserted into the pipeline together to take advantage of all machines

SLIDE 57

Three Challenges

  • Work Partitioning

○ How to partition the DNN model into stages?

  • Work Scheduling

○ How does scheduling work in this bi-directional pipeline?

  • Effective Learning

○ How to use correct and updated weights for faster learning?

SLIDE 58

Challenge 1: Work Partitioning

  • Goals:

○ Each stage performs roughly the same amount of computation
○ Inter-stage communication is minimized

  • Profiling:

○ Computation time (forward/backward)
○ Size of layer outputs

  • Partitioning Algorithm (see the sketch below):

○ Partitioning of layers into stages
○ Replication factor (number of workers for each stage)
○ Optimal number of minibatches to keep workers busy
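
A minimal sketch of the partitioning idea under simplifying assumptions (contiguous stages, profiled per-layer compute times only, no replication and no communication cost; not PipeDream's actual algorithm):

# Split a chain of layers into `num_stages` contiguous stages so that the
# slowest stage is as fast as possible (simplified sketch).
from functools import lru_cache

def partition(layer_times, num_stages):
    n = len(layer_times)
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(i, s):
        # Minimal achievable "max stage time" for layers i..n-1 using s stages.
        if s == 1:
            return prefix[n] - prefix[i]
        result = float("inf")
        for j in range(i + 1, n - s + 2):        # first stage = layers i..j-1
            stage_time = prefix[j] - prefix[i]
            result = min(result, max(stage_time, best(j, s - 1)))
        return result

    return best(0, num_stages)

# Example: 6 profiled per-layer times split across 3 GPUs (stages).
print(partition([4.0, 2.0, 3.0, 1.0, 5.0, 2.0], 3))  # -> 7.0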

SLIDE 59

Challenge 2: Work Scheduling

  • Bidirectional Pipeline

○ Each active minibatch in the pipeline may be in a different stage

  • Alternate between forward and backward passes (1F1B); a sketch of the resulting schedule follows
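
A simplified sketch of a 1F1B-style schedule for a single stage (illustrative only; PipeDream's real scheduler also handles stage replication and the bidirectional pipeline):

# Produce the operation order ("F i" / "B i") for one stage under a 1F1B-style
# schedule: warm-up forward passes, then alternate one forward and one backward.
def one_f_one_b(stage, num_stages, num_minibatches):
    warmup = num_stages - stage - 1          # forwards done before the first backward
    ops, fwd, bwd = [], 0, 0
    for _ in range(min(warmup, num_minibatches)):
        ops.append(f"F{fwd}"); fwd += 1
    while bwd < num_minibatches:             # steady state: 1 forward, 1 backward
        if fwd < num_minibatches:
            ops.append(f"F{fwd}"); fwd += 1
        ops.append(f"B{bwd}"); bwd += 1
    return ops

for s in range(3):
    print(s, one_f_one_b(s, num_stages=3, num_minibatches=5))
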
SLIDE 60

Challenge 3: Effective Learning

  • Weight Stashing (see the sketch below)

○ Maintain weights for each in-flight minibatch
○ Forward pass: use the latest weights
○ Backward pass: use the corresponding stashed weights

  • Vertical Sync

○ After performing the backward pass of a minibatch, apply the latest updates to create new weights
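
A minimal sketch of weight stashing on one stage, using a toy scalar layer y = w·x so the example is self-contained (illustrative only, not PipeDream's implementation):

# Weight stashing on one pipeline stage: each in-flight minibatch records the
# weight version used in its forward pass, and its backward pass reuses it.
class Stage:
    def __init__(self, w, lr=0.1):
        self.w = w           # latest weight on this stage
        self.lr = lr
        self.stash = {}      # minibatch id -> (weight version, input) used in forward

    def forward(self, mb_id, x):
        self.stash[mb_id] = (self.w, x)   # remember the version used for this minibatch
        return self.w * x

    def backward(self, mb_id, grad_out):
        w, x = self.stash.pop(mb_id)      # same weight version as the forward pass
        grad_w = grad_out * x             # d(loss)/dw for y = w * x
        grad_in = grad_out * w            # gradient passed to the previous stage
        self.w -= self.lr * grad_w        # the latest weight advances
        return grad_in

stage = Stage(w=1.0)
y0 = stage.forward(0, x=2.0)     # minibatch 0 forward with w = 1.0
y1 = stage.forward(1, x=3.0)     # minibatch 1 forward, also with w = 1.0 (stashed)
stage.backward(0, grad_out=1.0)  # backward for minibatch 0 uses its stashed w
stage.backward(1, grad_out=1.0)  # minibatch 1 still uses the w it saw in forward
print(stage.w)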

SLIDE 61

Experiments and Results

SLIDE 62

Experiments and Results

SLIDE 63

Thank You!