

SLIDE 1

Training Deep Neural Nets

Paweł Liskowski

Institute of Computing Science, Poznań University of Technology

29 October 2018

1 / 20

SLIDE 2

Bias/Variance

SLIDE 3

Bias/Variance

Train set error:   3%    18%    18%    1%
Test set error:   14%    20%    30%    2%

High bias and high variance? What if the optimal error is close to, e.g., 15%?

How to deal with bias/variance issues?
• High bias → use a bigger network and/or train longer; repeat until you fit the training set reasonably well
• High variance → use regularization and/or get more data
• An alternative architecture may also prove effective

There used to be a tradeoff between bias and variance...
...but now we can easily drive down both.

Takeaway: training a bigger network almost never hurts [but...]

SLIDE 4

Regularization

Regularization allows us to penalize overly complex models (Ockham's razor).

The p-norm is defined as ||x||_p = (Σ_i |x_i|^p)^(1/p)

L1 regularization – penalizes ||w||_1:

J_L1(w) = J(w) + λ Σ_i |w_i|

L2 regularization – penalizes ||w||_2²:

J_L2(w) = J(w) + λ Σ_i w_i²

Questions: interpretation of λ? how to find a good value of λ? what about b? differences between L1 and L2?
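The penalties above are a one-liner each in numpy; a minimal sketch (function names are illustrative, not from the slides):

```python
import numpy as np

def p_norm(x, p):
    """General p-norm: ||x||_p = (sum_i |x_i|^p)^(1/p)."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

def l1_penalized(loss, w, lam):
    """J_L1(w) = J(w) + lambda * sum_i |w_i|"""
    return loss + lam * np.sum(np.abs(w))

def l2_penalized(loss, w, lam):
    """J_L2(w) = J(w) + lambda * sum_i w_i^2"""
    return loss + lam * np.sum(w ** 2)

w = np.array([0.5, -2.0, 1.0])
print(p_norm(w, 2))               # Euclidean norm, sqrt(5.25)
print(l1_penalized(1.0, w, 0.1))  # 1.0 + 0.1 * 3.5 = 1.35
print(l2_penalized(1.0, w, 0.1))  # 1.0 + 0.1 * 5.25 = 1.525
```

Note that only the weights w are penalized, not the bias b, which is the usual practice.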

SLIDE 5

L2 regularization

Backprop with L2 regularization:

∂J_L2/∂w_ji = · · · + λ w_ji

w_ji ← w_ji − α ∂J_L2/∂w_ji    (1)

L2 regularization is frequently called weight decay [why?]

Why does regularization work?
• large λ → weights close to zero [what happens to units?]
• a simpler network is less prone to overfitting
• smaller weights → linear regime of the activation function [example]
• activations roughly linear [linear network]
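Update (1) can be written two equivalent ways, which is where the name "weight decay" comes from: folding the λw term into the step is the same as first shrinking the weights by a factor (1 − αλ) and then taking a plain gradient step. A minimal numpy sketch (function names are illustrative):

```python
import numpy as np

def sgd_step_l2(w, grad, alpha=0.1, lam=0.01):
    """Update (1) with the L2 term folded into the gradient:
    w <- w - alpha * (grad + lam * w)"""
    return w - alpha * (grad + lam * w)

def sgd_step_decay(w, grad, alpha=0.1, lam=0.01):
    """Equivalent form: decay the weights, then take a plain step."""
    return (1.0 - alpha * lam) * w - alpha * grad

w = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])
print(sgd_step_l2(w, g))
print(sgd_step_decay(w, g))  # identical to the line above
```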

SLIDE 6

Dropout regularization

Dropout is another popular regularization technique

(a) Standard neural network (b) After applying dropout

Main idea:
• disable each unit with a certain probability p
• perform the forward and backward pass with such a slimmed network
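The idea above can be sketched in a few lines of numpy. This is the common "inverted dropout" variant, which scales the surviving activations by 1/(1 − p) during training so that nothing needs to change at test time (an assumption on my part; the slides do not specify the variant):

```python
import numpy as np

def dropout_forward(a, p, rng, train=True):
    """Drop each unit with probability p; scale survivors by
    1/(1-p) so the expected activation is unchanged.
    At test time the input passes through untouched."""
    if not train:
        return a
    mask = (rng.random(a.shape) >= p) / (1.0 - p)
    return a * mask

rng = np.random.default_rng(0)
a = np.ones((4, 5))
print(dropout_forward(a, p=0.5, rng=rng))  # mixture of 0.0 and 2.0 entries
```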

SLIDE 7

Dropout regularization

Dropout is another popular regularization technique

(a) Standard neural network (b) After applying dropout

Why does it work?
• on every iteration, we work with a smaller network
• units can't rely on particular features, so the weights are spread out
• effect: dropout shrinks weights

Dropout proved to be particularly useful in computer vision [why?]

Any downsides?

SLIDE 8

Other regularization methods

Data augmentation – a way to provide a learning algorithm with more data.

How to augment?
• horizontal and (sometimes) vertical flips
• rotate and crop
• other random distortions and translations
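Flips and crops are cheap array operations; a minimal numpy sketch of a random flip-and-crop for an (H, W, C) image (the function name and crop size are illustrative):

```python
import numpy as np

def random_flip_crop(img, rng, crop=24):
    """Horizontal flip with probability 0.5, then a random crop."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]              # horizontal flip
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)    # random crop offsets
    left = rng.integers(0, w - crop + 1)
    return img[top:top + crop, left:left + crop, :]

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
out = random_flip_crop(img, rng)
print(out.shape)  # (24, 24, 3)
```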

SLIDE 9

Other regularization methods

Early stopping – stop training before the validation error starts to grow.

[Figure: training and validation error vs. epochs]

Comments:
• prevents overfitting by not allowing the weights to become large
• downside: mixes optimization with variance reduction
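A common implementation monitors the validation error and stops after it fails to improve for a fixed number of epochs ("patience" – a standard detail, though not named on the slide). A minimal sketch:

```python
def early_stopping(val_errors, patience=3):
    """Return the epoch with the best validation error, stopping
    once `patience` epochs pass without any improvement."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# toy curve: validation error falls, then starts to grow
errs = [0.9, 0.5, 0.3, 0.25, 0.27, 0.3, 0.35, 0.4]
print(early_stopping(errs))  # 3 (error 0.25 was the minimum)
```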

SLIDE 10

Vanishing/exploding gradients

Gradients become small as the algorithm progresses down to the lower layers.

Vanishing gradient problem: weights in lower layers are virtually unchanged. The opposite is sometimes also true

Exploding gradient problem: gradients grow big, updates are insanely large. See also: "Understanding the Difficulty of Training Deep Feedforward Neural Networks", X. Glorot, Y. Bengio.

SLIDE 11

Xavier and He initialization

One way to alleviate the problem of vanishing gradients is to use proper initialization. For the signal to flow properly:
• the variance of the outputs should be equal to the variance of the inputs
• gradients should have equal variance before and after flowing through a layer

Typically it is not possible to guarantee both. A good compromise – Xavier (Glorot) initialization: use a Normal distribution with µ = 0 and σ = √(2 / (n_inputs + n_outputs))

Also popular – He initialization: use a Normal distribution with µ = 0 and σ = √(2 / n_inputs)
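Both schemes amount to drawing from a zero-mean Normal with the right σ; a minimal numpy sketch (function names are illustrative):

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    """Glorot: N(0, sigma) with sigma = sqrt(2 / (n_in + n_out))."""
    sigma = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, sigma, size=(n_in, n_out))

def he_init(n_in, n_out, rng):
    """He: N(0, sigma) with sigma = sqrt(2 / n_in); pairs well with ReLU."""
    sigma = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, sigma, size=(n_in, n_out))

rng = np.random.default_rng(0)
W = he_init(512, 256, rng)
print(W.std())  # close to sqrt(2/512), about 0.0625
```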

SLIDE 12

Nonsaturating activation functions

Rectified Linear Unit (ReLU) – behaves much better than sigmoid:

g(z) = max(0, z)

• does not saturate for positive values
• quite fast to compute

Any issues? Dying ReLU – some neurons stop outputting anything other than 0. Not a big deal in practice.

SLIDE 13

Nonsaturating activation functions

Leaky ReLU – has a small slope for negative values, instead of zero:

g(z) = 0.01z if z < 0, z if z ≥ 0

Parametric Leaky ReLU – α learned during training:

g(z) = αz if z < 0, z if z ≥ 0

• fixes the problem of dying ReLU
• may speed up learning

See also: "Empirical Evaluation of Rectified Activations in Convolutional Network", B. Xu et al.

SLIDE 14

Nonsaturating activation functions

Exponential Linear Unit (ELU) – instead of a straight line for negative values, it uses an exponential curve:

g(z) = α(exp(z) − 1) if z < 0, z if z ≥ 0

• smooth everywhere (for α = 1)
• faster convergence rate
• slower to compute than ReLU

See also: "Fast and Accurate Deep Network Learning by Exponential Linear Units", D. Clevert et al.
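The three activations from the last slides side by side, as a minimal numpy sketch:

```python
import numpy as np

def relu(z):
    """max(0, z): zero for negatives, identity for positives."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Small fixed slope alpha for negatives instead of zero."""
    return np.where(z < 0, alpha * z, z)

def elu(z, alpha=1.0):
    """Exponential curve alpha*(exp(z)-1) for negatives."""
    return np.where(z < 0, alpha * (np.exp(z) - 1.0), z)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))        # negatives clipped to 0
print(leaky_relu(z))  # -2 becomes -0.02
print(elu(z))         # -2 becomes exp(-2) - 1, about -0.865
```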

SLIDE 15

Advanced optimization algorithms

Momentum optimization – imagine a ball rolling down a gentle slope:
• it starts slowly, but quickly picks up momentum
• regular gradient descent just takes small regular steps down the slope

m ← β1 m + α∇J(w)
w ← w − m    (2)

Common choice: β1 = 0.9.

Nesterov Accelerated Gradient (NAG) – measure the gradient of J slightly ahead in the direction of the momentum:
• almost always faster than vanilla momentum optimization
• works because momentum already points in the right direction

m ← β1 m + α∇J(w − β1 m)
w ← w − m    (3)
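Updates (2) and (3) differ only in where the gradient is evaluated; a minimal numpy sketch on a toy quadratic (function names are illustrative):

```python
import numpy as np

def momentum_step(w, m, grad_fn, alpha=0.1, beta1=0.9):
    """Plain momentum, eq. (2): m <- beta1*m + alpha*grad; w <- w - m."""
    m = beta1 * m + alpha * grad_fn(w)
    return w - m, m

def nag_step(w, m, grad_fn, alpha=0.1, beta1=0.9):
    """NAG, eq. (3): evaluate the gradient slightly ahead, at w - beta1*m."""
    m = beta1 * m + alpha * grad_fn(w - beta1 * m)
    return w - m, m

# toy objective J(w) = 0.5 * ||w||^2, so grad J(w) = w
grad = lambda w: w
w, m = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, m = nag_step(w, m, grad)
print(w)  # converges toward the optimum at [0, 0]
```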

SLIDE 16

Advanced optimization algorithms

RMSProp – implements an adaptive learning rate:
• updates pointed more directly toward the global optimum
• helps to dampen oscillations
• requires less tuning of the learning rate hyperparameter α

s ← β2 s + (1 − β2) ∇J(w)²
w ← w − α ∇J(w) / √(s + ε)    (4)

Almost always works better than AdaGrad (not discussed here). Converges faster than momentum and NAG. The decay rate β2 is typically set to 0.9.

SLIDE 17

Advanced optimization algorithms

Adam optimization – combines momentum and RMSProp: it keeps track of an exponentially decaying average of past gradients and an exponentially decaying average of past squared gradients.

m ← β1 m + (1 − β1) ∇J(w)
s ← β2 s + (1 − β2) ∇J(w)²
m̂ ← m / (1 − β1^t)
ŝ ← s / (1 − β2^t)
w ← w − α m̂ / (√ŝ + ε)    (5)

Common choice of parameters: α = 0.001, β1 = 0.9, β2 = 0.999, ε = 10⁻⁸. Currently the default choice for many problems.
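The five lines of (5) map directly onto code; a minimal numpy sketch of one Adam update (the function name is illustrative; t counts steps from 1 for the bias correction):

```python
import numpy as np

def adam_step(w, m, s, grad, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, eq. (5), with bias-corrected m_hat and s_hat."""
    m = beta1 * m + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(s_hat) + eps)
    return w, m, s

# toy objective J(w) = 0.5 * w^2, so grad J(w) = w
w, m, s = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    w, m, s = adam_step(w, m, s, w.copy(), t)
print(w)  # approaches the optimum at 0
```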

SLIDE 18

Batch normalization

Normalizing input features speeds up learning, i.e.:

X ← (X − µ) / σ    (6)

What about activations deeper in the network? Idea: normalize Z^[L−1] to make learning of W^[L] more efficient. For the l-th layer:

µ ← (1/m) Σ_i z^(i)
σ² ← (1/m) Σ_i (z^(i) − µ)²
z_norm^(i) ← (z^(i) − µ) / √(σ² + ε)
ẑ^(i) ← γ z_norm^(i) + β    (7)

Parameters γ and β are trained via backprop. What happens when γ = √(σ² + ε) and β = µ?

See also: "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", S. Ioffe et al.
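The forward pass of (7) over a mini-batch, as a minimal numpy sketch (the function name is illustrative; axis 0 indexes the examples in the batch):

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    """Eq. (7): normalize each feature over the mini-batch,
    then scale by gamma and shift by beta."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta

rng = np.random.default_rng(0)
z = rng.normal(5.0, 3.0, size=(64, 10))   # activations of one layer
out = batchnorm_forward(z, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(), out.std())  # roughly 0 and 1
```

Setting γ = √(σ² + ε) and β = µ makes the layer recover its input exactly, i.e. the transform can learn the identity.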

SLIDE 19

Batch normalization

Comments:
• at test time, use the whole training set's mean and standard deviation
• batch norm significantly reduces the vanishing gradient problem
• the network is less sensitive to weight initialization
• much larger learning rates can be used
• downside: adds complexity and a runtime penalty

SLIDE 20

Mini-batch gradient descent

A variation of the gradient descent algorithm that splits the training dataset into small batches:
• seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent
• the most common implementation of gradient descent used in the field of deep learning
• typical batch sizes: 32, 64, . . . , 512

Advantages:
• more frequent updates → more robust convergence
• allows efficient processing of datasets that do not fit in memory
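A typical epoch shuffles the data once and then walks through it in fixed-size chunks; a minimal numpy sketch (the generator name is illustrative):

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Shuffle once per epoch, then yield successive mini-batches."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
X, y = np.arange(100).reshape(100, 1), np.arange(100)
sizes = [len(xb) for xb, _ in minibatches(X, y, 32, rng)]
print(sizes)  # [32, 32, 32, 4] – the last batch is smaller
```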
