Distributed Training II
Benjamin Glickenhaus, Brendan Shanahan, Yash Bidasaria
PowerPoint PPT Presentation
Context: Distributed Training
- Models are getting too big to fit on just one GPU
○ Turing-NLG (Microsoft) has 17 billion parameters
- As model training is iterative, communication between different nodes
becomes a bottleneck
- Distributed Training can be split broadly into two different types:
○ Data Parallel
○ Model Parallel
- Even these approaches result in far from optimal parallelization performance
Model Size through the years
Source: Microsoft
Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
Xiangru Lian et al. University of Rochester, ETH Zurich, UC Davis, IBM, Tencent
Context
- Decentralized algorithms have been treated as a compromise:
decentralized communication is something we resort to, and pay a price for
- Prior analysis: decentralized PSGD offers no performance
advantage over centralized PSGD, given a decentralized network topology
- Popular ML systems (TensorFlow, PyTorch, etc.) are
built to support centralized execution
Parallel Stochastic Gradient Descent
- Centralized network topology, e.g. the
parameter-server model
- Communication bottleneck at the central
node(s); performance decreases with increasing network latency
- Convergence rate: O(1/√(nK)) for n workers and K iterations
- Communication overhead: O(n) on the central node, since every worker exchanges gradients with it
Li, Mu, et al. "Scaling distributed machine learning with the parameter server." 11th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 14). 2014.
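As an illustration of the centralized setup described above, here is a minimal, hedged sketch of C-PSGD with a parameter server; the function names and the toy quadratic losses are illustrative, not the paper's implementation:

```python
import numpy as np

def centralized_psgd(grad_fns, x0, lr=0.1, steps=100):
    """Sketch of C-PSGD with a parameter server.

    grad_fns: one stochastic-gradient function per worker node.
    The central node averages all worker gradients each iteration,
    so its communication cost grows linearly with the number of workers.
    """
    x = x0.copy()
    for _ in range(steps):
        # Each worker computes a stochastic gradient at the current parameters.
        grads = [g(x) for g in grad_fns]
        # Parameter server aggregates (the O(n) bottleneck) and updates.
        x -= lr * np.mean(grads, axis=0)
    return x

# Toy usage: 4 workers, each holding a quadratic loss centered at a different point;
# the iterates approach the average of the centers (the global minimizer).
rng = np.random.default_rng(0)
centers = [rng.normal(size=3) for _ in range(4)]
grad_fns = [lambda x, c=c: x - c for c in centers]
x_star = centralized_psgd(grad_fns, np.zeros(3))
```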
Parallel Stochastic Gradient Descent
Zinkevich, Martin, et al. "Parallelized stochastic gradient descent." Advances in neural information processing systems. 2010.
Decentralized PSGD
- Requires either a shared dataset accessible to all nodes (shared database)
or a local partition per node (data-parallel approach)
- Implies, asymptotically, the same convergence rate as the centralized
algorithm (details in the analysis below)
- Communication topology is represented by an undirected graph with a
doubly-stochastic weight matrix W
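One D-PSGD iteration can be sketched as a gossip-averaging step with the weight matrix W followed by a local gradient step. This is a hedged NumPy sketch with toy quadratic losses, not the paper's implementation; the ring weights chosen here (self plus two neighbors, each 1/3) are one common doubly-stochastic choice:

```python
import numpy as np

def ring_weight_matrix(n):
    """Doubly-stochastic W for a ring: average self and both neighbors."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def dpsgd_step(X, W, grad_fns, lr):
    """One D-PSGD iteration.

    X: (n, d) array, row i = parameters held by node i.
    Each node gossips with its neighbors (W @ X) and then applies its
    own stochastic gradient -- no central server is involved.
    """
    G = np.stack([g(X[i]) for i, g in enumerate(grad_fns)])
    return W @ X - lr * G

# Toy usage: 5 nodes with quadratic local losses; the nodes drift into an
# approximate consensus near the average of the local minimizers.
n, d = 5, 2
rng = np.random.default_rng(1)
centers = rng.normal(size=(n, d))
grad_fns = [lambda x, c=c: x - c for c in centers]
X = np.zeros((n, d))
W = ring_weight_matrix(n)
for _ in range(300):
    X = dpsgd_step(X, W, grad_fns, 0.1)
```

With a fixed learning rate the nodes stay within an O(lr)-sized neighborhood of consensus; exact agreement would require a decaying step size.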
Decentralized PSGD
Lian, Xiangru, et al. "Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent." Advances in Neural Information Processing Systems. 2017.
Analysis: D-PSGD
Convergence rate for Decentralized PSGD: O(1/√(nK)) *
* assuming a suitably chosen learning rate γ and a sufficiently large number of iterations K
Further assumptions:
- Lipschitz continuous gradients (bounded curvature/Hessian spectral radius)
- Weight matrix W has a bounded spectral gap
- Bounded variance w.r.t. the local data sample
- Globally bounded variance across nodes
Analysis: D-PSGD
Convergence rate for Decentralized PSGD: O(1/√(nK))
- Same convergence rate as C-PSGD
- An ε-approximate solution has computational complexity O(1/ε²), shared
between the n nodes: linear speedup
- Communication overhead per node is O(deg(G)), i.e. O(1) for sparse
(very decentralized) networks such as rings
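Written out in hedged big-O form (constants and exact learning-rate choice from Lian et al. omitted), the linear-speedup claim is:

```latex
% Average of the n local models after k iterations: \bar{x}_k = \tfrac{1}{n}\sum_{i=1}^n x_{k,i}
\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\left\|\nabla f(\bar{x}_k)\right\|^2
  = O\!\left(\frac{1}{\sqrt{nK}}\right)
\quad\Longrightarrow\quad
K_\varepsilon = O\!\left(\frac{1}{n\,\varepsilon^2}\right)
```

That is, the iteration count needed for an ε-approximate stationary point shrinks linearly in the number of nodes n.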
Analysis: Ring network
- Shared database: D-PSGD attains the same convergence rate as C-PSGD when
sharing across nodes, provided the node count grows slowly enough relative
to the iteration count K
- Partitioned data: same convergence rate when sharing across nodes, under a
stricter bound on the node count
Proof:
Intuition
- The graph G has Laplacian L = D − A with spectrum 0 = λ₁ ≤ λ₂ ≤ … ≤ λₙ
- The weight matrix W has spectrum 1 = μ₁ ≥ μ₂ ≥ … ≥ μₙ > −1
- By the Perron–Frobenius Theorem, the largest eigenvalue of W is exactly 1
(with the all-ones eigenvector), so the spectral gap 1 − max(|μ₂|, |μₙ|)
governs how quickly the local models mix
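To make the spectral picture concrete, here is a small, hedged NumPy check (illustrative only): the averaging matrix of a complete graph has all non-principal eigenvalues equal to 0 (spectral gap 1, fast mixing), while a ring's second eigenvalue sits close to 1 (small gap, slow mixing):

```python
import numpy as np

def second_eig(W):
    """Second-largest eigenvalue magnitude of a symmetric doubly-stochastic W.

    1 - second_eig(W) is the spectral gap that controls how fast local
    models mix toward consensus in D-PSGD.
    """
    eigs = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return eigs[1]

def ring_W(n):
    # Average self and both neighbors, weight 1/3 each.
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3
    return W

def complete_W(n):
    # Fully connected: uniform averaging over all nodes.
    return np.full((n, n), 1 / n)

n = 16
print(second_eig(complete_W(n)))  # degenerate spectrum: all non-principal eigenvalues 0
print(second_eig(ring_W(n)))      # close to 1: slow mixing on the ring
```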
Some facts from spectral graph theory
Intuition
Recall:
- λ₂ (the algebraic connectivity) ~ “What’s the worst possible bottleneck
between two clusters?” (Cheeger’s inequality)
- The spectral gap ~ “How uniformly are edges distributed between nodes?”
~ “How random is the network?”
Intuition
- Number of iterations required depends on the spectrum of W:
○ Degenerate spectrum: densely connected graph, fast mixing
○ Broad spectrum: weakly connected graph, slow mixing
Evaluation: Image processing
Evaluation: EA(M)-SGD
Evaluation: NLP
Beyond Data and Model Parallelism for Deep Neural Networks
Zhihao Jia, Matei Zaharia, Alex Aiken Stanford University
Motivation
- Data and Model parallelization have become the
go-to choices for distributed training
- These limited options result in suboptimal
parallelization performance
- A more comprehensive parallelization search space
may lead to more optimal parallelization strategies
Proposed Solution
SOAP
FlexFlow searches a parallelization space spanning four dimensions:
- Sample: the standard data-parallel dimension from previous work
- Operation: different operations performed in a DNN (e.g., MatMul, Convolution, etc)
- Attribute: attributes of a particular variable (e.g., length of a sequence, number of channels, etc)
- Parameter: the standard model-parallel dimension from previous work
Proposed Solution
Execution Simulator
- Allows FlexFlow to search a much broader search space without needing to
actually execute parallelization strategies
- Assumes an operation O on a device D takes constant time
- Takes a device topology D, an operator graph G, and a parallelization
strategy S, and predicts the runtime
- Uses MCMC sampling to iteratively propose new strategies S* within the
allocated time budget
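The MCMC search can be sketched as a Metropolis-style loop: propose a random modification to the current strategy, always accept improvements under the simulated cost, and accept regressions with exponentially decaying probability. This is a hedged sketch, not FlexFlow's code; `propose` and `simulate_cost` are hypothetical stand-ins for its strategy mutations and execution simulator:

```python
import math
import random

def mcmc_search(init_strategy, propose, simulate_cost, budget=1000, beta=1.0):
    """MCMC search over parallelization strategies (FlexFlow-style sketch).

    Proposals that lower the simulated cost are always accepted; worse ones
    are accepted with probability exp(-beta * (new - old)), which lets the
    search escape local minima within the allotted budget.
    """
    current = best = init_strategy
    cur_cost = best_cost = simulate_cost(current)
    for _ in range(budget):
        cand = propose(current)
        cost = simulate_cost(cand)
        if cost < cur_cost or random.random() < math.exp(-beta * (cost - cur_cost)):
            current, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = cand, cost
    return best, best_cost

# Toy usage: a "strategy" is a tuple of per-operation parallelism degrees;
# the fake cost model prefers degree 4 for every operation.
random.seed(0)

def propose(s):
    i = random.randrange(len(s))
    s = list(s)
    s[i] = random.choice([1, 2, 4, 8])
    return tuple(s)

def simulate_cost(s):
    return sum((d - 4) ** 2 for d in s)

best, cost = mcmc_search((1, 1, 1, 1), propose, simulate_cost)
```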
Results - Per-iteration Throughput
Results - Communication Overhead
Results - Novelty
PipeDream: Generalized Pipeline Parallelism for DNN Training
Deepak Narayanan et al. Microsoft, CMU, Stanford
Intra Batch Parallelism
- Data Parallelism
○ Communication between workers is a bottleneck.
Intra Batch Parallelism
- Model Parallelism
○ Unused resources
○ Partitioning the model is left to the programmer
Inter Batch Parallelism: GPipe (Huang et al.)
- Uses pipelining in the context of model-parallel training for very large models
- Does not specify a partitioning algorithm
- Splits a minibatch into m microbatches
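The point of splitting a minibatch into m microbatches is to shrink the pipeline "bubble" (the ramp-up/ramp-down idle time). A small sketch, assuming the standard idealized model where every stage takes one time slot per microbatch:

```python
def gpipe_bubble_fraction(num_stages, num_microbatches):
    """Fraction of idle ('bubble') time in a GPipe-style pipeline.

    With s stages and m microbatches, one sweep through the pipeline takes
    m + s - 1 time slots, of which s - 1 are ramp-up/ramp-down bubbles.
    More microbatches per minibatch keep the stages busier.
    """
    s, m = num_stages, num_microbatches
    return (s - 1) / (m + s - 1)

# With 4 stages: one microbatch leaves the pipeline idle 75% of the time,
# and the bubble shrinks as the minibatch is split more finely.
for m in (1, 4, 8, 32):
    print(m, gpipe_bubble_fraction(4, m))
```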
Introducing Pipeline Parallelism
- Combination of inter-batch and
intra-batch parallelism
- Model layers are partitioned into stages; each stage consists of
consecutive layers and is mapped to a separate GPU
Introducing Pipeline Parallelism
- Multiple minibatches are
inserted together to take advantage of all machines
Three Challenges
- Work Partitioning
○ How to partition the DNN model into stages?
- Work Scheduling
○ How does scheduling work in this bi-directional pipeline?
- Effective Learning
○ How to use correct and updated weights for faster learning?
Challenge 1: Work Partitioning
- Goals:
○ Each stage performs roughly the same amount of computation
○ Inter-stage communication is minimized
- Profiling:
○ Computation time (forward/backward)
○ Size of layer outputs
- Partitioning Algorithm outputs:
○ Partitioning of layers into stages
○ Replication factor (number of workers for each stage)
○ Optimal number of minibatches to keep workers busy
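A much-simplified version of such a partitioner can be sketched with dynamic programming: split profiled per-layer times into contiguous stages so the slowest stage is as fast as possible. This is a hedged sketch, not PipeDream's algorithm; it ignores communication costs and stage replication, which the real partitioner also optimizes:

```python
def partition_layers(costs, num_stages):
    """Split consecutive layer costs into stages, minimizing the slowest stage.

    costs: profiled per-layer computation times, in order.
    Returns (bottleneck_time, stages), where stages is a list of layer-index
    lists, one per stage.
    """
    n = len(costs)
    INF = float("inf")
    prefix = [0.0]
    for c in costs:
        prefix.append(prefix[-1] + c)
    # dp[j][i]: best achievable bottleneck using j stages for the first i layers.
    dp = [[INF] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    dp[0][0] = 0.0
    for j in range(1, num_stages + 1):
        for i in range(1, n + 1):
            for k in range(j - 1, i):
                cand = max(dp[j - 1][k], prefix[i] - prefix[k])
                if cand < dp[j][i]:
                    dp[j][i], cut[j][i] = cand, k
    # Walk the recorded cut points backwards to recover the stages.
    stages, i = [], n
    for j in range(num_stages, 0, -1):
        k = cut[j][i]
        stages.append(list(range(k, i)))
        i = k
    return dp[num_stages][n], stages[::-1]

# Toy usage: 6 profiled layer times split across 3 stages/GPUs.
bottleneck, stages = partition_layers([2, 3, 1, 4, 2, 2], 3)
```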
Challenge 2: Work Scheduling
- Bidirectional Pipeline
○ Each active minibatch in the pipeline may be in a different stage
- Alternate between Forward and Backward Passes (1F1B)
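The 1F1B schedule for a single stage can be sketched as a warm-up of forward passes, a steady state that strictly alternates one backward with one forward, and a final drain of backward passes. A hedged, simplified sketch (real PipeDream also interleaves stage replicas):

```python
def one_f_one_b(stage, num_stages, num_minibatches):
    """Sequence of ('F'|'B', minibatch_id) ops one stage runs under 1F1B.

    stage: 0-indexed stage id. During warm-up a stage admits
    num_stages - stage forward passes; it then alternates one backward
    with one forward, and drains the remaining backwards at the end.
    """
    warmup = min(num_stages - stage, num_minibatches)
    ops, f, b = [], 0, 0
    for _ in range(warmup):            # warm-up: forwards only
        ops.append(("F", f)); f += 1
    while f < num_minibatches:         # steady state: 1 backward, 1 forward
        ops.append(("B", b)); b += 1
        ops.append(("F", f)); f += 1
    while b < num_minibatches:         # drain: backwards only
        ops.append(("B", b)); b += 1
    return ops

# The last stage (stage = num_stages - 1) warms up with a single forward,
# so it alternates F, B, F, B, ... from the start.
sched = one_f_one_b(stage=3, num_stages=4, num_minibatches=4)
```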
Challenge 3: Effective Learning
- Weight Stashing
○ Maintain a weight version for each active minibatch
○ Forward pass: use the latest weights, and stash them
○ Backward pass: use the corresponding stashed weights
- Vertical Sync
○ After performing the backward pass of a minibatch, apply the latest updates to create new weights
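The weight-stashing idea above can be sketched as a small bookkeeping class: the forward pass records the exact weights it used, and the matching backward pass retrieves them, even if newer updates landed in between. A hedged sketch with a hypothetical dict-of-arrays weight representation, not PipeDream's implementation:

```python
import copy

class WeightStash:
    """Weight stashing sketch (PipeDream, Challenge 3)."""

    def __init__(self, weights):
        self.latest = weights
        self.stash = {}

    def forward_weights(self, minibatch_id):
        # Forward pass: use the latest weights and stash a frozen copy.
        self.stash[minibatch_id] = copy.deepcopy(self.latest)
        return self.latest

    def backward_weights(self, minibatch_id):
        # Backward pass: use the version stashed by this minibatch's forward,
        # so gradients match the weights actually seen on the forward pass.
        return self.stash.pop(minibatch_id)

    def apply_update(self, update):
        # Updates only touch self.latest; stashed versions stay frozen.
        self.latest = {k: v + update[k] for k, v in self.latest.items()}

# Toy usage: minibatch 0's backward still sees the pre-update weights.
ws = WeightStash({"w": 1.0})
w_fwd = ws.forward_weights(0)
ws.apply_update({"w": 0.5})        # another minibatch's update lands
w_bwd = ws.backward_weights(0)     # the weights minibatch 0's forward used
```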