

slide-1
SLIDE 1

Distributed Training on HPC

Presented By: Aaron D. Saxton, PhD

7/11/19

slide-2
SLIDE 2

Statistics Review

  • Simple y = mx + b regression
  • Least squares to find m, b
  • With data set {(x_j, y_j)}, j = 1, …, N
  • Very special; often hard to measure y_j
  • Let the error be
  • S = Σ_{j=1}^{N} [y_j − (m x_j + b)]²
  • Minimize S with respect to m and b
  • Simultaneously solve
  • S_m(m, b) = 0
  • S_b(m, b) = 0
  • Linear system
  • We will consider more general y = f(x)
  • S_m(m, b) = 0 and S_b(m, b) = 0 may not be linear (see the fitting sketch after this slide)

2
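
A minimal sketch of the least-squares fit described on this slide, using numpy; the data arrays x and y are hypothetical placeholders.

    import numpy as np

    # Hypothetical data set {(x_j, y_j)}, j = 1..N
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

    # Solving S_m(m, b) = 0 and S_b(m, b) = 0 is a linear system;
    # here it is handed to a linear least-squares solver.
    A = np.vstack([x, np.ones_like(x)]).T
    m, b = np.linalg.lstsq(A, y, rcond=None)[0]

    S = np.sum((y - (m * x + b)) ** 2)   # residual error S
    print(m, b, S)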

slide-3
SLIDE 3

Statistics Review

  • Regressions with parameterized sets of functions, e.g. (a fitting sketch follows this slide)
  • y = a x² + b x + c (quadratic)
  • y = Σ_j a_j x^j (polynomial)
  • y = A e^{bx} (exponential)
  • y = 1 / (1 + exp(−(ax + b))) (logistic)

3
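
A minimal sketch of fitting one of these parameterized families, the quadratic, by least squares with numpy.polyfit; the data are hypothetical.

    import numpy as np

    # Hypothetical data drawn from a noisy quadratic
    x = np.linspace(-2.0, 2.0, 50)
    y = 1.5 * x**2 - 0.5 * x + 2.0 + 0.1 * np.random.randn(50)

    # Least-squares fit of the quadratic family y = a*x^2 + b*x + c
    a, b, c = np.polyfit(x, y, deg=2)
    print(a, b, c)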

slide-4
SLIDE 4

Statistics Review

  • Polynomial model of degree 'n'
  • "Degrees of freedom": the model's capacity

4 Deep Learning, Goodfellow et al., MIT Press, http://www.deeplearningbook.org, 2016

slide-5
SLIDE 5

Gradient Descent

5

  • Searching for a minimum
  • ∇S = (S_{w_1}, S_{w_2}, …, S_{w_n})
  • w⃗_{t+1} = w⃗_t − ε ∇S
  • ε: learning rate
  • Recall, the loss depends on the data
  • Expanded notation: S(w⃗_t; {(x_j, y_j)}, j = 1, …, N)
  • Recall S and ∇S are sums over j
  • Intuitively, want S with ALL the data … ? (S = Σ_{j=1}^{N} [y_j − f_{w⃗}(x_j)]²) (see the sketch below)
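
A minimal full-batch gradient descent sketch for the least-squares loss S above; the linear model, data, and learning rate value are hypothetical.

    import numpy as np

    # Hypothetical data for a linear model f_w(x) = w[0]*x + w[1]
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

    w = np.zeros(2)          # weights w = (m, b)
    eps = 0.01               # learning rate epsilon

    for step in range(1000):
        resid = y - (w[0] * x + w[1])
        # Gradient of S = sum_j [y_j - f_w(x_j)]^2 over ALL the data
        grad = np.array([-2.0 * np.sum(resid * x), -2.0 * np.sum(resid)])
        w = w - eps * grad   # step against the gradient
    print(w)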

slide-6
SLIDE 6

Gradient Descent

6

slide-7
SLIDE 7

Stochastic Gradient Descent

7

  • Recall S is a sum over j (S = Σ_{j=1}^{N} [y_j − f_{w⃗}(x_j)]²)
  • Single training example (x_j, y_j): sum over only one training example
  • ∇S|_{x_j, y_j} = (S_{w_1}, S_{w_2}, …, S_{w_n})|_{x_j, y_j}
  • w⃗_{t+1} = w⃗_t − ε ∇S|_{x_j, y_j}
  • ε: learning rate
  • Choose the next (x_{j+1}, y_{j+1}) (shuffled training set)
  • SGD with mini-batches (see the sketch after this list)
  • Many training examples (x_j, y_j): sum over many training examples
  • Batch size or mini-batch size (this gets ambiguous with distributed training)
  • SGD often outperforms traditional GD; want small batches.
  • https://arxiv.org/abs/1609.04836, On Large-Batch Training … Sharp Minima
  • https://arxiv.org/abs/1711.04325, Extremely Large ... in 15 Minutes
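
A minimal mini-batch SGD sketch for the same least-squares loss; the data set, batch size, and learning rate are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical data set of N examples
    N = 1000
    x = rng.uniform(-1.0, 1.0, N)
    y = 3.0 * x + 1.0 + 0.1 * rng.standard_normal(N)

    w = np.zeros(2)          # weights (m, b)
    eps = 0.05               # learning rate
    batch_size = 32          # mini-batch size

    for epoch in range(10):
        order = rng.permutation(N)           # shuffled training set
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = x[idx], y[idx]
            resid = yb - (w[0] * xb + w[1])
            # Gradient of the loss over only this mini-batch
            grad = np.array([-2.0 * np.mean(resid * xb), -2.0 * np.mean(resid)])
            w = w - eps * grad
    print(w)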
slide-8
SLIDE 8

Neural Networks

  • Activation functions
  • Softmax
  • h_i(x_1, x_2, …, x_k) = e^{x_i} / Σ_j e^{x_j} (see the sketch below)

8

[Plots of three activation functions τ(x): Logistic, Arctan, and ReLU (Rectified Linear Unit)]
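
A minimal numpy sketch of the softmax and the three activation functions pictured on this slide.

    import numpy as np

    def softmax(x):
        # h_i(x_1, ..., x_k) = exp(x_i) / sum_j exp(x_j)
        e = np.exp(x - np.max(x))       # shift for numerical stability
        return e / np.sum(e)

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    def arctan(x):
        return np.arctan(x)

    def relu(x):
        # Rectified Linear Unit
        return np.maximum(0.0, x)

    print(softmax(np.array([1.0, 2.0, 3.0])))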

slide-9
SLIDE 9

Neural Networks

9

  • Parameterized function
  • a = τ(β_0 + β X)
  • U = γ_0 + γ a
  • f(X) = h(U) (softmax output)
  • Linear transformations with pointwise evaluation of a nonlinear function, τ
  • γ_0, γ, β_0, β: weights to be optimized
  • X → a → U → Z (a forward-pass sketch follows this slide)

slide-10
SLIDE 10

Faux Model Example

10

slide-11
SLIDE 11

Distributed Training, data distributed

11

slide-12
SLIDE 12

Distributed Training, data distributed

12

slide-13
SLIDE 13

Distributed Training, All Reduce Collective

13
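
The slides above describe data-distributed training with an all-reduce collective: each rank computes gradients on its own shard of the data, the gradients are summed across ranks, and every rank applies the same averaged update. A minimal sketch with mpi4py; the local gradient computation is a hypothetical stand-in.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    w = np.zeros(2)              # identical initial weights on every rank
    eps = 0.01                   # learning rate

    def local_gradient(w):
        # Hypothetical stand-in: gradient of the loss over this rank's data shard
        return np.ones_like(w) * (rank + 1)

    for step in range(100):
        g_local = local_gradient(w)
        g_global = np.empty_like(g_local)
        # All-reduce: sum the per-rank gradients; the result lands on every rank
        comm.Allreduce(g_local, g_global, op=MPI.SUM)
        w -= eps * (g_global / size)   # average gradient, same update everywhere

Launched with one process per node or GPU, e.g. mpiexec -n 4 python train.py, every rank stays in lockstep because all ranks apply the same update.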

slide-14
SLIDE 14

Distributed TensorFlow: Parameter Server/Worker Default, a Bad Way on HPC

14

[Diagram: parameter-server layout. Parameter servers ps:0 and ps:1 aggregate gradients and update parameters; workers worker:0, worker:1, and worker:2 each hold a copy of the model, compute the loss (cross entropy), and run the optimizer (gradient descent). A configuration sketch follows.]
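
A minimal TensorFlow 1.x style sketch of the parameter-server/worker layout pictured above; the host names, ports, and task assignment are hypothetical.

    import tensorflow as tf  # TensorFlow 1.x API

    # Hypothetical cluster: two parameter servers, three workers
    cluster = tf.train.ClusterSpec({
        "ps":     ["node0:2222", "node1:2222"],
        "worker": ["node2:2222", "node3:2222", "node4:2222"],
    })

    # job_name / task_index would normally come from the launcher
    job_name, task_index = "worker", 0
    server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

    if job_name == "ps":
        server.join()        # parameter servers host and update the variables
    else:
        # Variables are placed on the ps tasks, compute ops on this worker
        with tf.device(tf.train.replica_device_setter(cluster=cluster)):
            pass             # build model, cross-entropy loss, gradient-descent optimizer here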

slide-15
SLIDE 15

Other models: Sequence Modeling

  • Autoregression
  • Autocorrelation
  • Other tasks
  • Semantic Labeling

π‘ŒR = 𝑑 + βˆ‘

)-. i

𝜚)𝐢)π‘ŒR + πœ—R

Back Shift Operatior: 𝐢)

𝑆{{(𝑒., 𝑒7) = 𝐹[π‘ŒR~π‘ŒRM]

Semantic labeling example: [art.] [adj.] [adj.] [n.] [v.] [adverb] [art.] [adj.] [adj.] [d.o.] The quick red fox jumps over the lazy brown dog
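
A minimal sketch of fitting the autoregressive model above by least squares with numpy; the series, noise level, and order p are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical AR(2) series: X_t = c + phi_1 X_{t-1} + phi_2 X_{t-2} + eps_t
    c, phi = 0.5, np.array([0.6, -0.3])
    X = np.zeros(500)
    for t in range(2, 500):
        X[t] = c + phi[0] * X[t-1] + phi[1] * X[t-2] + 0.1 * rng.standard_normal()

    # Least-squares estimate of (c, phi_1, phi_2) from lagged copies (B^i X)_t = X_{t-i}
    p = 2
    A = np.column_stack([np.ones(500 - p)] + [X[p - i:-i] for i in range(1, p + 1)])
    coef = np.linalg.lstsq(A, X[p:], rcond=None)[0]
    print(coef)   # close to [0.5, 0.6, -0.3]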

slide-16
SLIDE 16

Recurrent Neural Networks: Sequence Modeling

  • Few projects use pure RNNs; this example is only for pedagogy
  • An RNN is a model that is as "deep" as the modeled sequence is long (see the sketch below)
  • LSTMs, gated recurrent units
  • No model-parallel distributed training on the market (June 2019)

16
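
A minimal numpy sketch of a plain RNN forward pass, unrolled one step per sequence element, which is why the model is as "deep" as the sequence is long; the sizes and random weights are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical sizes: input dim 5, hidden dim 8, sequence length 10
    Wx = rng.standard_normal((8, 5))
    Wh = rng.standard_normal((8, 8)) * 0.1
    b = np.zeros(8)

    def rnn_forward(sequence):
        h = np.zeros(8)
        # One update per element: the unrolled graph grows with the sequence
        for x_t in sequence:
            h = np.tanh(Wx @ x_t + Wh @ h + b)
        return h

    print(rnn_forward(rng.standard_normal((10, 5))))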