 
Training Deep Neural Nets
Paweł Liskowski
Institute of Computing Science, Poznań University of Technology
29 October 2018
Bias/Variance
Train set error:    3%   18%   18%    1%
Test set error:    14%   20%   30%    2%

Which column shows high bias and high variance? Assuming the optimal error is close to 0%, the columns read, left to right: high variance, high bias, high bias and high variance, low bias and low variance.
What if the optimal error is close to, e.g., 15%? Then a training error of 18% no longer indicates high bias.

How to deal with bias/variance issues?
- High bias → use a bigger network and/or train longer. Repeat until you fit the training set reasonably well.
- High variance → use regularization and/or get more data.
- An alternative architecture may also prove effective.

There used to be a tradeoff between bias and variance, but now we can easily drive down both.
Takeaway: training a bigger network almost never hurts [but...]
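
The reading of the error table can be sketched as a small heuristic. The function name `diagnose`, the `optimal_err` argument and the 5-percentage-point tolerance `tol` are assumptions made for this illustration; the slides do not prescribe any thresholds.

```python
def diagnose(train_err, test_err, optimal_err=0.0, tol=0.05):
    """Label a (train, test) error pair. Thresholds are illustrative assumptions."""
    # High bias: training error is far from the best achievable error.
    high_bias = (train_err - optimal_err) > tol
    # High variance: large gap between training and test error.
    high_variance = (test_err - train_err) > tol
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias and low variance"


# The four columns of the table, assuming the optimal error is 0%.
for train, test in [(0.03, 0.14), (0.18, 0.20), (0.18, 0.30), (0.01, 0.02)]:
    print(train, test, diagnose(train, test))
```

Calling `diagnose(0.18, 0.20, optimal_err=0.15)` shows how the verdict changes once the optimal error is assumed to be about 15%.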
Regularization
Regularization allows us to penalize overly complex models (Ockham's razor).

The p-norm is defined as
$$\|x\|_p = \Big(\sum_i |x_i|^p\Big)^{1/p}$$

L1 regularization uses $\|w\|_1$:
$$J_{L1}(w) = J(w) + \lambda \sum_i |w_i|$$

L2 regularization uses $\|w\|_2^2$:
$$J_{L2}(w) = J(w) + \lambda \sum_i w_i^2$$

Questions:
- interpretation of λ?
- how to find a good value of λ?
- what about b?
- differences between L1 and L2
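
A minimal NumPy sketch of the two penalty terms, assuming the weights are stored in a single array `w` and the unregularized cost `cost` has already been computed. The function names are illustrative, not taken from the slides.

```python
import numpy as np


def l1_penalty(w, lam):
    """lam * ||w||_1, i.e. lam times the sum of absolute weights."""
    return lam * np.sum(np.abs(w))


def l2_penalty(w, lam):
    """lam * ||w||_2^2, i.e. lam times the sum of squared weights."""
    return lam * np.sum(w ** 2)


def regularized_cost(cost, w, lam, kind="l2"):
    """Add the chosen penalty to an already computed cost J(w)."""
    penalty = l1_penalty(w, lam) if kind == "l1" else l2_penalty(w, lam)
    return cost + penalty
```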
L2 regularization
Backprop with L2 regularization:
$$\frac{\partial J_{L2}}{\partial w_{ji}} = \cdots + 2\lambda w_{ji}$$
(the factor 2 comes from differentiating $\lambda w_{ji}^2$ and is often absorbed into λ)

L2 regularization is frequently called weight decay [why?]
$$w_{ji} \leftarrow w_{ji} - \alpha \frac{\partial J_{L2}}{\partial w_{ji}} \qquad (1)$$

Why does regularization work?
- large λ → weights close to zero [what happens to units?]
- a simpler network is less prone to overfitting
- smaller weights → linear regime of the activation function [example]
- activations roughly linear [linear network]
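
One way to see why the name "weight decay" fits: with the penalty $\lambda \sum_i w_i^2$, the update (1) can be rewritten as first shrinking the weights by a constant factor and then taking an ordinary gradient step. The sketch below assumes that form of the penalty; the function names are hypothetical.

```python
import numpy as np


def sgd_step_l2(w, grad_J, alpha, lam):
    """Gradient step on J_L2(w) = J(w) + lam * sum(w**2)."""
    return w - alpha * (grad_J + 2.0 * lam * w)


def sgd_step_weight_decay(w, grad_J, alpha, lam):
    """The same update written as 'decay the weights, then step'."""
    return (1.0 - 2.0 * alpha * lam) * w - alpha * grad_J
```

Both functions compute the same new weights; the second form makes the multiplicative shrinkage of `w` explicit.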
Dropout regularization
Dropout is another popular regularization technique.

[Figure: (a) standard neural network, (b) the same network after applying dropout]

Main idea:
- Disable each unit with a certain probability p.
- Perform the forward and backward pass with the resulting slimmed-down network.
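
The main idea can be sketched in NumPy as follows. Here `p` is the probability of disabling a unit, as on the slide. Rescaling the surviving activations by `1 / (1 - p)` ("inverted dropout") is an assumption of this sketch rather than something the slides state; it keeps the expected activation unchanged so that nothing needs to be rescaled at test time.

```python
import numpy as np


def dropout_forward(a, p, rng, training=True):
    """Disable each unit with probability p; return output and the mask used."""
    if not training or p == 0.0:
        return a, np.ones(a.shape, dtype=bool)
    mask = rng.random(a.shape) >= p          # True = unit is kept
    return a * mask / (1.0 - p), mask        # inverted-dropout rescaling (assumption)


def dropout_backward(grad_out, mask, p):
    """Gradients flow only through the units that were kept."""
    return grad_out * mask / (1.0 - p)
```

For example, `dropout_forward(a, 0.5, np.random.default_rng(0))` zeroes roughly half of the entries of `a` and doubles the rest.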
Why does it work?
- On every iteration, we work with a smaller network.
- Units can't rely on particular features, so weights are spread out.
- Effect: dropout shrinks weights.

Dropout has proved particularly useful in computer vision [why?]
Any downsides?
Other regularization methods
Data augmentation: a way to provide a learning algorithm with more data.

How to augment?
- horizontal and (sometimes) vertical flips
- rotations and crops
- other random distortions and translations
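
A minimal sketch of two of the augmentations listed above, a random horizontal flip followed by a random crop. It assumes images are NumPy arrays of shape (height, width, channels); the function name `augment` and the 50% flip probability are illustrative choices.

```python
import numpy as np


def augment(image, crop_size, rng):
    """Randomly flip an (H, W, C) image horizontally, then take a random crop."""
    h, w = image.shape[:2]
    crop_h, crop_w = crop_size
    if rng.random() < 0.5:
        image = image[:, ::-1]                   # horizontal flip
    top = rng.integers(0, h - crop_h + 1)        # upper bound is exclusive
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]
```

Each call with the same image can yield a different training example, which is how augmentation effectively enlarges the data set.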
Early stopping: stop training before the validation error starts to grow, i.e. near its minimum.

[Figure: training and validation error plotted against training epochs]

Comments:
- prevents overfitting by not allowing weights to become large
- downside: mixes optimization with variance reduction
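
In practice the minimum of the validation error is only known in hindsight, so a common approach is to stop once the error has not improved for a few epochs and to keep the best model seen so far. The sketch below assumes the per-epoch validation errors are already available as a list; the `patience` parameter is an assumption of this example, not part of the slides.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, best_error), stopping after `patience` epochs without improvement."""
    best_err = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break                      # give up: no improvement for `patience` epochs
    return best_epoch, best_err
```

In a real training loop the same logic would wrap the per-epoch training and evaluation calls and save a checkpoint whenever a new best error is found.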
Vanishing/exploding gradients
Gradients often become smaller and smaller as backpropagation progresses down to the lower layers.
→ Vanishing gradient problem: weights in the lower layers are left virtually unchanged.

The opposite sometimes also happens.
→ Exploding gradient problem: gradients grow large, so the updates become extremely large.

See also: "Understanding the Difficulty of Training Deep Feedforward Neural Networks", X. Glorot, Y. Bengio.
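
A toy calculation can illustrate both effects. It assumes a chain of one-unit sigmoid layers that all share the same weight and operate at the same pre-activation `z`; this is a deliberate simplification for illustration, not a model of a real network. Each layer multiplies the backpropagated gradient by `weight * sigmoid'(z)`, and the sigmoid derivative is at most 0.25.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def backprop_scale(depth, weight, z=0.0):
    """Factor applied to a gradient after flowing back through `depth` toy layers."""
    s = sigmoid(z)
    per_layer = weight * s * (1.0 - s)     # derivative of sigmoid(weight * a) w.r.t. a
    return per_layer ** depth


print(backprop_scale(10, 1.0))   # shrinks geometrically (vanishing)
print(backprop_scale(10, 8.0))   # grows geometrically (exploding)
```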
Xavier and He initialization
One way to alleviate the problem of vanishing gradients is to use proper initialization.

For the signal to flow properly:
- the variance of a layer's outputs should be equal to the variance of its inputs
- gradients should have equal variance before and after flowing through a layer

Typically, it is not possible to guarantee both.

Good compromise: Xavier (Glorot) initialization, which uses a Normal distribution with $\mu = 0$ and
$$\sigma = \sqrt{\frac{2}{n_\text{inputs} + n_\text{outputs}}}$$

Also popular: He initialization, which uses a Normal distribution with $\mu = 0$ and
$$\sigma = \sqrt{\frac{2}{n_\text{inputs}}}$$
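
Both schemes amount to drawing the weight matrix from a zero-mean Normal distribution with the standard deviation given above. A NumPy sketch follows; the function names and the (n_inputs, n_outputs) weight layout are assumptions of this example.

```python
import numpy as np


def xavier_normal(n_inputs, n_outputs, rng):
    """Weights ~ N(0, sigma^2) with sigma = sqrt(2 / (n_inputs + n_outputs))."""
    sigma = np.sqrt(2.0 / (n_inputs + n_outputs))
    return rng.normal(0.0, sigma, size=(n_inputs, n_outputs))


def he_normal(n_inputs, n_outputs, rng):
    """Weights ~ N(0, sigma^2) with sigma = sqrt(2 / n_inputs)."""
    sigma = np.sqrt(2.0 / n_inputs)
    return rng.normal(0.0, sigma, size=(n_inputs, n_outputs))
```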
Nonsaturating activation functions
Rectified Linear Unit (ReLU): behaves much better than sigmoid
$$g(z) = \max(0, z)$$
- does not saturate for positive values
- quite fast to compute

Any issues?
Dying ReLU: some neurons stop outputting anything other than 0. Not a big deal in practice.
Leaky ReLU: has a small slope for negative values, instead of zero.
$$g(z) = \begin{cases} 0.01\,z & \text{if } z < 0 \\ z & \text{if } z \ge 0 \end{cases}$$

Parametric Leaky ReLU: α is learned during training
$$g(z) = \begin{cases} \alpha z & \text{if } z < 0 \\ z & \text{if } z \ge 0 \end{cases}$$
- fixes the problem of dying ReLU
- may speed up learning

See also: "Empirical Evaluation of Rectified Activations in Convolutional Network", B. Xu et al.
Exponential Linear Unit (ELU): instead of a straight line for negative values, it uses an exponential curve
$$g(z) = \begin{cases} \alpha\,(\exp(z) - 1) & \text{if } z < 0 \\ z & \text{if } z \ge 0 \end{cases}$$
- smooth everywhere (for α = 1)
- faster convergence rate
- slower to compute than ReLU

See also: "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)", D. Clevert et al.
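
The three activation functions can be written in a few lines of NumPy. This is a sketch: the default `alpha` values (0.01 for Leaky ReLU, 1.0 for ELU) are common choices assumed here, and in Parametric Leaky ReLU `alpha` would be a trained parameter rather than a fixed argument.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def leaky_relu(z, alpha=0.01):
    return np.where(z < 0, alpha * z, z)


def elu(z, alpha=1.0):
    # expm1(x) = exp(x) - 1; clipping at 0 avoids overflow for large positive z,
    # whose exponential branch is discarded by np.where anyway.
    return np.where(z < 0, alpha * np.expm1(np.minimum(z, 0.0)), z)
```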
Advanced optimization algorithms
Momentum optimization: imagine a ball rolling down a gentle slope
- it starts slowly, but quickly picks up momentum
- regular gradient descent just takes small regular steps down the slope
$$m \leftarrow \beta_1 m + \alpha \nabla J(w), \qquad w \leftarrow w - m \qquad (2)$$
- common choice: $\beta_1 = 0.9$

Nesterov Accelerated Gradient (NAG): measure the gradient of J slightly ahead in the direction of momentum
- almost always faster than vanilla momentum optimization
- works because momentum generally already points in the right direction
$$m \leftarrow \beta_1 m + \alpha \nabla J(w - \beta_1 m), \qquad w \leftarrow w - m \qquad (3)$$
Because the update is $w \leftarrow w - m$, the look-ahead point is $w - \beta_1 m$.
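
Both updates can be sketched in NumPy. With the sign convention w ← w − m, this sketch evaluates the Nesterov look-ahead gradient at w − β₁m, which is the point the momentum term is about to move to. The quadratic cost, its matrix `A`, the starting point and the learning rate 0.05 are assumptions chosen only to make the example runnable.

```python
import numpy as np


def momentum_step(w, m, grad_fn, alpha=0.05, beta1=0.9):
    m = beta1 * m + alpha * grad_fn(w)
    return w - m, m


def nesterov_step(w, m, grad_fn, alpha=0.05, beta1=0.9):
    # Gradient is measured slightly ahead, at the look-ahead point w - beta1 * m.
    m = beta1 * m + alpha * grad_fn(w - beta1 * m)
    return w - m, m


# Toy problem (assumption): J(w) = 0.5 * w^T A w, whose gradient is A w.
A = np.diag([1.0, 10.0])


def grad_fn(w):
    return A @ w


def run(step, n_steps=500):
    w, m = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(n_steps):
        w, m = step(w, m, grad_fn)
    return w
```

Calling `run(momentum_step)` and `run(nesterov_step)` drives `w` toward the minimum at the origin in both cases.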
RMSProp: implements an adaptive learning rate
- updates point more directly toward the global optimum
- helps to dampen oscillations
- requires less tuning of the learning rate hyperparameter α
$$s \leftarrow \beta_2 s + (1 - \beta_2)\,\nabla J(w)^2, \qquad w \leftarrow w - \alpha \frac{\nabla J(w)}{\sqrt{s + \epsilon}} \qquad (4)$$
(the square, the division and the square root are applied element-wise)
- Almost always works better than Adagrad (not discussed here).
- Often converges faster than momentum and NAG.
- The decay rate $\beta_2$ is typically set to 0.9.
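
A single RMSProp update following equation (4), as a NumPy sketch. The default α = 0.01 is an assumption for the example, and ε is placed inside the square root as in the formula.

```python
import numpy as np


def rmsprop_step(w, s, grad, alpha=0.01, beta2=0.9, eps=1e-8):
    """One RMSProp update; returns the new weights and the new running average s."""
    s = beta2 * s + (1.0 - beta2) * grad ** 2      # element-wise squared gradient
    w = w - alpha * grad / np.sqrt(s + eps)        # per-coordinate step size
    return w, s


w = np.array([1.0, -2.0])
grad = 2.0 * w                                     # gradient of the toy cost sum(w**2)
w_new, s_new = rmsprop_step(w, np.zeros(2), grad)
```

In this first step both coordinates move by the same amount even though their gradients differ by a factor of two, which illustrates how the division by √s equalizes step sizes.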
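Adam optimization: combines momentum and RMSProp
- keeps track of an exponentially decaying average of past gradients and an exponentially decaying average of past squared gradients
$$\begin{aligned}
m &\leftarrow \beta_1 m + (1 - \beta_1)\,\nabla J(w) \\
s &\leftarrow \beta_2 s + (1 - \beta_2)\,\nabla J(w)^2 \\
\hat{m} &\leftarrow \frac{m}{1 - \beta_1^t} \\
\hat{s} &\leftarrow \frac{s}{1 - \beta_2^t} \\
w &\leftarrow w - \alpha \frac{\hat{m}}{\sqrt{\hat{s} + \epsilon}}
\end{aligned} \qquad (5)$$
Here t is the iteration number, starting at 1; the hatted quantities are bias-corrected copies, so m and s themselves keep accumulating.

Common choice of parameters: $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
Currently the default choice for many problems.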
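
The Adam update (5) as a NumPy sketch, using the parameter values listed on the slide as defaults. The function name and the convention that the caller passes the iteration counter `t` (starting at 1) are assumptions of this example.

```python
import numpy as np


def adam_step(w, m, s, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t >= 1; returns new (w, m, s)."""
    m = beta1 * m + (1.0 - beta1) * grad           # decaying average of gradients
    s = beta2 * s + (1.0 - beta2) * grad ** 2      # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction for the zero start
    s_hat = s / (1.0 - beta2 ** t)
    w = w - alpha * m_hat / np.sqrt(s_hat + eps)
    return w, m, s


w = np.array([1.0, -2.0])
grad = 2.0 * w                                     # gradient of the toy cost sum(w**2)
w_new, m_new, s_new = adam_step(w, np.zeros(2), np.zeros(2), grad, t=1)
```

On the first iteration the bias-corrected ratio is close to the sign of the gradient, so each coordinate moves by about α regardless of the gradient's magnitude.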
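Batch normalization
Normalizing input features speeds up learning, i.e.:
$$X \leftarrow \frac{X - \mu}{\sigma} \qquad (6)$$
What about activations deeper in the network?
Idea: normalize $Z^{[l-1]}$ to make learning of $W^{[l]}$ more efficient.

For the l-th layer, over a mini-batch of m examples:
$$\begin{aligned}
\mu &\leftarrow \frac{1}{m} \sum_i z^{(i)} \\
\sigma^2 &\leftarrow \frac{1}{m} \sum_i \big(z^{(i)} - \mu\big)^2 \\
z^{(i)}_\text{norm} &\leftarrow \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}} \\
\tilde{z}^{(i)} &\leftarrow \gamma\, z^{(i)}_\text{norm} + \beta
\end{aligned} \qquad (7)$$
Parameters γ and β are trained via backprop.
What happens when $\gamma = \sqrt{\sigma^2 + \epsilon}$ and $\beta = \mu$?

See also: "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", S. Ioffe, C. Szegedy.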
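
The training-time forward pass of batch normalization can be sketched in NumPy as below. It assumes `z` holds the pre-activations of one layer for one mini-batch, with shape (m, number of units), and it omits the backward pass and the statistics needed at test time.

```python
import numpy as np


def batchnorm_forward(z, gamma, beta, eps=1e-8):
    """Normalize each unit over the mini-batch, then scale by gamma and shift by beta."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)                          # uses 1/m, matching the formula for sigma^2
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta, mu, var


rng = np.random.default_rng(0)
z = rng.normal(3.0, 2.0, size=(64, 5))
out, mu, var = batchnorm_forward(z, gamma=1.0, beta=0.0)

# Choosing gamma = sqrt(var + eps) and beta = mu undoes the normalization.
restored, _, _ = batchnorm_forward(z, gamma=np.sqrt(var + 1e-8), beta=mu)
```

The second call shows that the layer can still represent the identity mapping, so normalization does not reduce what the network can express.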
Comments:
- at test time, use the mean and standard deviation of the whole training set (in practice usually estimated as running averages during training)
- batch norm significantly reduces the vanishing gradient problem
- the network is less sensitive to weight initialization
- much larger learning rates can be used
- downside: adds complexity and a runtime penalty
Mini-batch gradient descent
A variation of the gradient descent algorithm that splits the training dataset into small batches:
- seeks a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent
- the most common implementation of gradient descent in deep learning
- typical batch sizes: 32, 64, ..., 512

Advantages:
- more frequent updates → more robust convergence
- allows efficient processing of datasets that do not fit in memory
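
Splitting a dataset into shuffled mini-batches can be sketched as a small generator. It assumes the data fits in two in-memory NumPy arrays `X` and `y`; for datasets that do not fit in memory the same loop would read each batch from disk instead. The function name is illustrative.

```python
import numpy as np


def iterate_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs covering the data once, in random order."""
    n = X.shape[0]
    order = rng.permutation(n)                   # reshuffle at the start of every epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]    # the last batch may be smaller
        yield X[idx], y[idx]
```

One epoch then consists of a gradient update per yielded batch, for example `for X_b, y_b in iterate_minibatches(X, y, 32, rng): ...`.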