Distributed Training on HPC
Presented By: Aaron D. Saxton, PhD
7/11/19
Statistics Review
- Simple y = m·x + b regression
- Least Squares to find m, b
- With data set {(x_i, y_i)}_{i=1,…,N}
- Very special, often hard to measure y_i
- Let the error be
- E(m, b) = Σ_{i=1}^{N} [y_i − (m·x_i + b)]^2
- Minimize E with respect to m and b.
- Simultaneously solve
- E_m(m, b) = 0
- E_b(m, b) = 0
- Linear system (see the sketch below)
- We will consider more general y = f(x)
- E_m(m, b) = 0 and E_b(m, b) = 0 may not be linear
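As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the simple linear fit; the data arrays x and y are made-up placeholders, and np.linalg.lstsq solves the same linear system that E_m = 0, E_b = 0 defines.

```python
import numpy as np

# Toy data; in practice x, y come from measurements.
x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x + 1.0 + np.random.normal(scale=0.5, size=x.shape)

# Setting E_m(m, b) = 0 and E_b(m, b) = 0 gives the normal equations
# A^T A [m, b]^T = A^T y with design matrix A = [x, 1].
A = np.stack([x, np.ones_like(x)], axis=1)
m, b = np.linalg.lstsq(A, y, rcond=None)[0]
print(f"m = {m:.3f}, b = {b:.3f}")
```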
Statistics Review
- Regressions with parameterized sets of functions, e.g. (see the fitting sketch below)
- y = a·x^2 + b·x + c (quadratic)
- y = Σ_i a_i·x^i (polynomial)
- y = a·e^{b·x} (exponential)
- y = 1 / (1 + e^{−(m·x + b)}) (logistic)
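A small, hypothetical example of fitting one of these parameterized families (the quadratic) by least squares with NumPy; the data are synthetic placeholders.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 40)
y = 2.0 * x**2 - x + 0.5 + np.random.normal(scale=0.1, size=x.shape)

# The quadratic y = a*x^2 + b*x + c is linear in (a, b, c),
# so least squares still reduces to a linear solve.
a, b, c = np.polyfit(x, y, deg=2)
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}")
```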
Statistics Review
- Polynomial model of degree "n"
- "Degrees of freedom" - the model's capacity
Deep Learning, Goodfellow et al., MIT Press, http://www.deeplearningbook.org, 2016
Gradient Descent
- Searching for a minimum
- ∇L = (∂L/∂w_1, ∂L/∂w_2, …, ∂L/∂w_n)
- Update: w_{t+1} = w_t − δ·∇L
- δ: Learning Rate
- Recall, Loss depends on data
- Expand notation, L → L(w_t; {(x_i, y_i)}_{i=1,…,N})
- Recall L and ∇L are sums over all N examples
- Intuitively, want ∇L using ALL DATA….. ? (L = Σ_{i=1}^{N} [y_i − f_w(x_i)]^2, sketch below)
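A minimal sketch of the update rule above, assuming the toy linear model f_w(x) = w[0]·x + w[1] and a made-up data set; the learning rate value is illustrative only.

```python
import numpy as np

def loss_and_grad(w, x, y):
    # L(w) = sum_i [y_i - f_w(x_i)]^2 for the toy linear model f_w(x) = w[0]*x + w[1]
    resid = y - (w[0] * x + w[1])
    loss = np.sum(resid ** 2)
    grad = np.array([-2.0 * np.sum(resid * x),    # dL/dw0
                     -2.0 * np.sum(resid)])       # dL/dw1
    return loss, grad

x = np.linspace(0.0, 1.0, 100)
y = 3.0 * x + 1.0
w = np.zeros(2)
delta = 0.003                                     # learning rate
for step in range(2000):
    loss, grad = loss_and_grad(w, x, y)           # gradient uses ALL N examples
    w = w - delta * grad                          # w_{t+1} = w_t - delta * grad(L)
print(w)                                          # approaches [3.0, 1.0]
```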
Gradient Descent
Stochastic Gradient Descent
- Recall L is a sum over all N examples (L = Σ_{i=1}^{N} [y_i − f_w(x_i)]^2)
- Single training example (x_i, y_i): sum over only one training example
- ∇L|_{x_i, y_i} = (∂L/∂w_1, ∂L/∂w_2, …, ∂L/∂w_n)|_{x_i, y_i}
- Update: w_{t+1} = w_t − δ·∇L|_{x_i, y_i}
- δ: Learning Rate
- Choose the next (x_{i+1}, y_{i+1}) (shuffled training set)
- SGD with mini batches
- Many training examples (x_i, y_i): sum over a mini batch of training examples
- Batch Size or Mini Batch Size (this gets ambiguous with distributed training)
- SGD often outperforms traditional GD; want small batches (mini-batch sketch below).
- https://arxiv.org/abs/1609.04836, On Large-Batch Training β¦ Sharp Minima
- https://arxiv.org/abs/1711.04325, Extremely Large ... in 15 Minutes
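A hedged sketch of mini-batch SGD for the same toy linear model used above (an assumption, not the presenter's code): shuffle the training set each epoch, then update from one mini batch at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10_000)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)

w = np.zeros(2)
delta, batch_size = 0.05, 32
for epoch in range(5):
    order = rng.permutation(len(x))               # shuffled training set
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        resid = yb - (w[0] * xb + w[1])
        grad = np.array([-2.0 * np.mean(resid * xb),
                         -2.0 * np.mean(resid)])
        w = w - delta * grad                      # update from one mini batch only
print(w)                                          # approaches [3.0, 1.0]
```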
Neural Networks
- Activation functions
- Softmax
- σ_j(x_1, x_2, …, x_k) = e^{x_j} / Σ_i e^{x_i} (sketch below)
[Plots: Logistic, Arctan, and ReLU (Rectified Linear Unit) activation functions σ(x)]
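A small NumPy sketch of softmax and two of the activation functions named above; it subtracts max(x) before exponentiating, a standard numerical stabilization not shown on the slide.

```python
import numpy as np

def softmax(x):
    # sigma_j(x) = exp(x_j) / sum_i exp(x_i); subtracting max(x) avoids overflow
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

print(softmax(np.array([1.0, 2.0, 3.0])))   # components sum to 1.0
```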
Neural Networks
- Parameterized function
- h_j = σ(β_{0j} + β_j·x)
- z_k = γ_{0k} + γ_k·h
- f_k(x) = σ_k(z)
- Linear transformations with pointwise evaluation of a nonlinear function, σ
- γ_{0k}, γ_k, β_{0j}, β_j
- Weights to be optimized (forward-pass sketch below)
[Network diagram: x → h → z → f(x)]
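A minimal forward-pass sketch of the parameterized function above, assuming a logistic σ for the hidden layer and softmax for the output; layer sizes and random weights are placeholders.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3                  # placeholder sizes

# Weights to be optimized: beta for the hidden layer, gamma for the output layer
beta0 = rng.normal(size=d_hidden)
beta = rng.normal(size=(d_hidden, d_in))
gamma0 = rng.normal(size=d_out)
gamma = rng.normal(size=(d_out, d_hidden))

def forward(x):
    h = logistic(beta0 + beta @ x)               # h_j = sigma(beta_0j + beta_j . x)
    z = gamma0 + gamma @ h                       # z_k = gamma_0k + gamma_k . h
    return softmax(z)                            # f_k(x) = sigma_k(z)

print(forward(rng.normal(size=d_in)))            # probabilities over d_out classes
```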
Faux Model Example
Distributed Training, data distributed
Distributed Training, All Reduce Collective
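One way to realize data-distributed training with an all-reduce collective is plain MPI. The sketch below is an illustration, not the presenter's code: it uses mpi4py to average per-rank gradients of the toy linear model each step. Launch with, e.g., mpiexec -n 4 python script.py (or the site's aprun/srun equivalent).

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank holds its own shard of the training data (data-distributed).
rng = np.random.default_rng(rank)
x = rng.uniform(0.0, 1.0, size=1_000)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)

w = np.zeros(2)
delta = 0.05
for step in range(200):
    resid = y - (w[0] * x + w[1])
    local_grad = np.array([-2.0 * np.mean(resid * x),
                           -2.0 * np.mean(resid)])
    global_grad = np.empty_like(local_grad)
    # All-reduce collective: every rank ends up with the summed gradient.
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    w = w - delta * (global_grad / size)          # average over ranks, then update

if rank == 0:
    print("w =", w)
```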
Distributed TensorFlow: Parameter Server/Worker Default, Bad Way on HPC
[Diagram: parameter server tasks ps:0 and ps:1 aggregate gradients and update parameters; worker tasks worker:0, worker:1, and worker:2 each hold a copy of the model, compute the loss (cross entropy), and optimize with gradient descent.]
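For reference, a rough sketch of the legacy TensorFlow 1.x parameter-server setup the diagram describes; host names, job name, and task index are placeholders, and exact APIs vary by TensorFlow version.

```python
import tensorflow as tf  # TensorFlow 1.x style API

# Cluster with two parameter servers and three workers (host names are placeholders).
cluster = tf.train.ClusterSpec({
    "ps":     ["node01:2222", "node02:2222"],
    "worker": ["node03:2222", "node04:2222", "node05:2222"],
})

# Each process starts a server for its own job/task (values come from the job launcher).
job_name, task_index = "worker", 0
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # parameter servers just host and update the shared variables
else:
    # Variables are placed on the ps tasks, compute ops on this worker.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        w = tf.get_variable("w", shape=[2], initializer=tf.zeros_initializer())
        # ... build the model, cross-entropy loss, and gradient-descent optimizer here ...
```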
Other models: Sequence Modeling
- Autoregression
- Autocorrelation
- Other tasks
- Semantic Labeling
X_t = c + Σ_{i=1}^{p} φ_i·B^i·X_t + ε_t
Back Shift Operator: B^i·X_t = X_{t−i}
R_XX(t_1, t_2) = E[X_{t_1}·X_{t_2}] (sketch below)
[art.] [adj.] [adj.] [n.] [v.] [adverb] [art.] [adj.] [adj.] [d.o.] The quick red fox jumps over the lazy brown dog
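A small NumPy sketch of the autoregressive model and sample autocorrelation above; the coefficients and noise scale are made-up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(2) process: X_t = c + phi_1*X_{t-1} + phi_2*X_{t-2} + eps_t
c, phi = 0.1, np.array([0.6, -0.2])
x = np.zeros(1_000)
for t in range(2, len(x)):
    x[t] = c + phi[0] * x[t - 1] + phi[1] * x[t - 2] + rng.normal(scale=0.1)

# Sample (normalized) autocorrelation ~ E[X_t X_{t+lag}] at a given lag
def autocorr(x, lag):
    x0 = x - x.mean()
    return np.mean(x0[:-lag] * x0[lag:]) / np.var(x0)

print([round(autocorr(x, k), 3) for k in (1, 2, 3)])
```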
Recurrent Neural Networks: Sequence Modeling
- Few projects use pure RNNs; this example is only for pedagogy
- An RNN is a model that is as "deep" as the modeled sequence is long (sketch below)
- LSTMs, Gated Recurrent Units
- No model-parallel distributed training on the market (June 2019)
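To make the "as deep as the sequence is long" point concrete, here is a tiny unrolled vanilla-RNN forward pass in NumPy; it is purely illustrative, with placeholder weights and sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 3, 5
W_x = rng.normal(scale=0.1, size=(d_hidden, d_in))
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)

def rnn_forward(sequence):
    """Unrolled vanilla RNN: one tanh cell applied per time step,
    so the computation graph is as deep as the sequence is long."""
    h = np.zeros(d_hidden)
    for x_t in sequence:                         # one "layer" per sequence element
        h = np.tanh(W_x @ x_t + W_h @ h + b)     # hidden state carries the context
    return h

sequence = rng.normal(size=(10, d_in))           # a length-10 toy sequence
print(rnn_forward(sequence))
```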