Distributed Training on HPC
Presented By: Aaron D. Saxton, PhD
7/11/19
Statistics Review
- Simple y = m·x + b regression
- Least Squares to find m, b
- With data set {(x_i, y_i)}_{i=1,…,N}
- Very special, often hard to measure y_i
- Let the error be
- E(m, b) = Σ_{i=1}^{N} [y_i − (m·x_i + b)]^2
- Minimize E with respect to m and b.
- Simultaneously solve
- E_m(m, b) = 0
- E_b(m, b) = 0
- Linear system (see the sketch below)
- We will consider more general y = f(x)
- E_m(m, b) = 0 and E_b(m, b) = 0 may not be linear
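As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the simple linear fit; the data arrays x and y are made-up placeholders, and np.linalg.lstsq solves the same linear system that E_m = 0, E_b = 0 defines.

```python
import numpy as np

# Toy data; in practice x, y come from measurements.
x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x + 1.0 + np.random.normal(scale=0.5, size=x.shape)

# Setting E_m(m, b) = 0 and E_b(m, b) = 0 gives the normal equations
# A^T A [m, b]^T = A^T y with design matrix A = [x, 1].
A = np.stack([x, np.ones_like(x)], axis=1)
m, b = np.linalg.lstsq(A, y, rcond=None)[0]
print(f"m = {m:.3f}, b = {b:.3f}")
```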
Statistics Review
- Regressions with parameterized sets of functions, e.g. (see the fitting sketch below)
- y = a·x^2 + b·x + c (quadratic)
- y = Σ_i a_i·x^i (polynomial)
- y = a·e^{b·x} (exponential)
- y = 1 / (1 + e^{−(m·x + b)}) (logistic)
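A small, hypothetical example of fitting one of these parameterized families (the quadratic) by least squares with NumPy; the data are synthetic placeholders.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 40)
y = 2.0 * x**2 - x + 0.5 + np.random.normal(scale=0.1, size=x.shape)

# The quadratic y = a*x^2 + b*x + c is linear in (a, b, c),
# so least squares still reduces to a linear solve.
a, b, c = np.polyfit(x, y, deg=2)
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}")
```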
Statistics Review
- Polynomial model of degree "n"
- "Degrees of freedom" - the model's capacity
Deep Learning, Goodfellow et al., MIT Press, http://www.deeplearningbook.org, 2016
Gradient Descent
- Searching for a minimum
- ∇L = (∂L/∂w_1, ∂L/∂w_2, …, ∂L/∂w_n)
- Update: w_{t+1} = w_t − δ·∇L
- δ: Learning Rate
- Recall, Loss depends on data
- Expand notation, L → L(w_t; {(x_i, y_i)}_{i=1,…,N})
- Recall L and ∇L are sums over all N examples
- Intuitively, want ∇L using ALL DATA….. ? (L = Σ_{i=1}^{N} [y_i − f_w(x_i)]^2, sketch below)
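A minimal sketch of the update rule above, assuming the toy linear model f_w(x) = w[0]·x + w[1] and a made-up data set; the learning rate value is illustrative only.

```python
import numpy as np

def loss_and_grad(w, x, y):
    # L(w) = sum_i [y_i - f_w(x_i)]^2 for the toy linear model f_w(x) = w[0]*x + w[1]
    resid = y - (w[0] * x + w[1])
    loss = np.sum(resid ** 2)
    grad = np.array([-2.0 * np.sum(resid * x),    # dL/dw0
                     -2.0 * np.sum(resid)])       # dL/dw1
    return loss, grad

x = np.linspace(0.0, 1.0, 100)
y = 3.0 * x + 1.0
w = np.zeros(2)
delta = 0.003                                     # learning rate
for step in range(2000):
    loss, grad = loss_and_grad(w, x, y)           # gradient uses ALL N examples
    w = w - delta * grad                          # w_{t+1} = w_t - delta * grad(L)
print(w)                                          # approaches [3.0, 1.0]
```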
Gradient Descent
Stochastic Gradient Descent
- Recall L is a sum over all N examples (L = Σ_{i=1}^{N} [y_i − f_w(x_i)]^2)
- Single training example (x_i, y_i): sum over only one training example
- ∇L|_{x_i, y_i} = (∂L/∂w_1, ∂L/∂w_2, …, ∂L/∂w_n)|_{x_i, y_i}
- Update: w_{t+1} = w_t − δ·∇L|_{x_i, y_i}
- δ: Learning Rate
- Choose the next (x_{i+1}, y_{i+1}) (shuffled training set)
- SGD with mini batches
- Many training examples (x_i, y_i): sum over a mini batch of training examples
- Batch Size or Mini Batch Size (this gets ambiguous with distributed training)
- SGD often outperforms traditional GD; want small batches (mini-batch sketch below).
- https://arxiv.org/abs/1609.04836, On Large-Batch Training β¦ Sharp Minima
- https://arxiv.org/abs/1711.04325, Extremely Large ... in 15 Minutes
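A hedged sketch of mini-batch SGD for the same toy linear model used above (an assumption, not the presenter's code): shuffle the training set each epoch, then update from one mini batch at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10_000)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)

w = np.zeros(2)
delta, batch_size = 0.05, 32
for epoch in range(5):
    order = rng.permutation(len(x))               # shuffled training set
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        resid = yb - (w[0] * xb + w[1])
        grad = np.array([-2.0 * np.mean(resid * xb),
                         -2.0 * np.mean(resid)])
        w = w - delta * grad                      # update from one mini batch only
print(w)                                          # approaches [3.0, 1.0]
```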
Neural Networks
- Activation functions
- Softmax
- σ_j(x_1, x_2, …, x_k) = e^{x_j} / Σ_i e^{x_i} (sketch below)
[Plots: Logistic, Arctan, and ReLU (Rectified Linear Unit) activation functions σ(x)]
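A small NumPy sketch of softmax and two of the activation functions named above; it subtracts max(x) before exponentiating, a standard numerical stabilization not shown on the slide.

```python
import numpy as np

def softmax(x):
    # sigma_j(x) = exp(x_j) / sum_i exp(x_i); subtracting max(x) avoids overflow
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

print(softmax(np.array([1.0, 2.0, 3.0])))   # components sum to 1.0
```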
Neural Networks
- Parameterized function
- h_j = σ(β_{0j} + β_j·x)
- z_k = γ_{0k} + γ_k·h
- f_k(x) = σ_k(z)
- Linear transformations with pointwise evaluation of a nonlinear function, σ
- γ_{0k}, γ_k, β_{0j}, β_j
- Weights to be optimized (forward-pass sketch below)
[Network diagram: x → h → z → f(x)]
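A minimal forward-pass sketch of the parameterized function above, assuming a logistic σ for the hidden layer and softmax for the output; layer sizes and random weights are placeholders.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3                  # placeholder sizes

# Weights to be optimized: beta for the hidden layer, gamma for the output layer
beta0 = rng.normal(size=d_hidden)
beta = rng.normal(size=(d_hidden, d_in))
gamma0 = rng.normal(size=d_out)
gamma = rng.normal(size=(d_out, d_hidden))

def forward(x):
    h = logistic(beta0 + beta @ x)               # h_j = sigma(beta_0j + beta_j . x)
    z = gamma0 + gamma @ h                       # z_k = gamma_0k + gamma_k . h
    return softmax(z)                            # f_k(x) = sigma_k(z)

print(forward(rng.normal(size=d_in)))            # probabilities over d_out classes
```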
Faux Model Example
Distributed Training, data distributed
Distributed Training, All Reduce Collective
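One way to realize data-distributed training with an all-reduce collective is plain MPI. The sketch below is an illustration, not the presenter's code: it uses mpi4py to average per-rank gradients of the toy linear model each step. Launch with, e.g., mpiexec -n 4 python script.py (or the site's aprun/srun equivalent).

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank holds its own shard of the training data (data-distributed).
rng = np.random.default_rng(rank)
x = rng.uniform(0.0, 1.0, size=1_000)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)

w = np.zeros(2)
delta = 0.05
for step in range(200):
    resid = y - (w[0] * x + w[1])
    local_grad = np.array([-2.0 * np.mean(resid * x),
                           -2.0 * np.mean(resid)])
    global_grad = np.empty_like(local_grad)
    # All-reduce collective: every rank ends up with the summed gradient.
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    w = w - delta * (global_grad / size)          # average over ranks, then update

if rank == 0:
    print("w =", w)
```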
Distributed TensorFlow: Parameter Server/Worker Default, Bad Way on HPC
[Diagram: parameter server tasks ps:0 and ps:1 aggregate gradients and update parameters; worker tasks worker:0, worker:1, and worker:2 each hold a copy of the model, compute the loss (cross entropy), and optimize with gradient descent.]
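For reference, a rough sketch of the legacy TensorFlow 1.x parameter-server setup the diagram describes; host names, job name, and task index are placeholders, and exact APIs vary by TensorFlow version.

```python
import tensorflow as tf  # TensorFlow 1.x style API

# Cluster with two parameter servers and three workers (host names are placeholders).
cluster = tf.train.ClusterSpec({
    "ps":     ["node01:2222", "node02:2222"],
    "worker": ["node03:2222", "node04:2222", "node05:2222"],
})

# Each process starts a server for its own job/task (values come from the job launcher).
job_name, task_index = "worker", 0
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # parameter servers just host and update the shared variables
else:
    # Variables are placed on the ps tasks, compute ops on this worker.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        w = tf.get_variable("w", shape=[2], initializer=tf.zeros_initializer())
        # ... build the model, cross-entropy loss, and gradient-descent optimizer here ...
```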
Other models: Sequence Modeling
- Autoregression
- Autocorrelation
- Other tasks
- Semantic Labeling
X_t = c + Σ_{i=1}^{p} φ_i·B^i·X_t + ε_t
Back Shift Operator: B^i·X_t = X_{t−i}
R_XX(t_1, t_2) = E[X_{t_1}·X_{t_2}] (sketch below)
[art.] [adj.] [adj.] [n.] [v.] [adverb] [art.] [adj.] [adj.] [d.o.] The quick red fox jumps over the lazy brown dog
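A small NumPy sketch of the autoregressive model and sample autocorrelation above; the coefficients and noise scale are made-up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(2) process: X_t = c + phi_1*X_{t-1} + phi_2*X_{t-2} + eps_t
c, phi = 0.1, np.array([0.6, -0.2])
x = np.zeros(1_000)
for t in range(2, len(x)):
    x[t] = c + phi[0] * x[t - 1] + phi[1] * x[t - 2] + rng.normal(scale=0.1)

# Sample (normalized) autocorrelation ~ E[X_t X_{t+lag}] at a given lag
def autocorr(x, lag):
    x0 = x - x.mean()
    return np.mean(x0[:-lag] * x0[lag:]) / np.var(x0)

print([round(autocorr(x, k), 3) for k in (1, 2, 3)])
```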
Recurrent Neural Networks: Sequence Modeling
- Few projects use pure RNNs; this example is only for pedagogy
- An RNN is a model that is as "deep" as the modeled sequence is long (sketch below)
- LSTMs, Gated Recurrent Units
- No model-parallel distributed training on the market (June 2019)
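To make the "as deep as the sequence is long" point concrete, here is a tiny unrolled vanilla-RNN forward pass in NumPy; it is purely illustrative, with placeholder weights and sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 3, 5
W_x = rng.normal(scale=0.1, size=(d_hidden, d_in))
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)

def rnn_forward(sequence):
    """Unrolled vanilla RNN: one tanh cell applied per time step,
    so the computation graph is as deep as the sequence is long."""
    h = np.zeros(d_hidden)
    for x_t in sequence:                         # one "layer" per sequence element
        h = np.tanh(W_x @ x_t + W_h @ h + b)     # hidden state carries the context
    return h

sequence = rng.normal(size=(10, d_in))           # a length-10 toy sequence
print(rnn_forward(sequence))
```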