

SLIDE 1

GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)

Deep Learning: Training

Juhan Nam

SLIDE 2

Training Deep Neural Networks

  • Gradient-based Learning
  • Issue: vanishing or exploding gradients

○ The gradient in the lower layers is computed as a cascaded multiplication of local gradients from the upper layers (a small numerical demo follows the figure below)
○ Some of its elements can decay or explode exponentially
○ Learning becomes diluted or unstable

[Figure: forward pass (hidden unit activations) through Layer 1, Layer 2, ..., Layer L-1, Layer L, producing the output $y$ and the loss $L(y, \hat{y})$; backward pass (gradient flow) carrying the cascaded gradients $\partial L / \partial x^{(l)}$ from the top layer back down]
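The cascade can be made concrete with a small NumPy demo (not from the slides; the layer width, depth, and weight scales are arbitrary): depending on the weight scale, the gradient norm reaching the first layer shrinks or grows exponentially.

```python
import numpy as np

rng = np.random.default_rng(0)
for scale in (0.5, 1.0, 2.0):              # relative std of the random weights
    g = np.ones(64)                        # gradient arriving from the top layer
    for _ in range(50):                    # cascade through 50 linear layers
        W = rng.normal(0.0, scale / np.sqrt(64), size=(64, 64))
        g = W.T @ g                        # one local (linear) backward step
    print(f"scale={scale}: |grad| at layer 1 = {np.linalg.norm(g):.2e}")
```

With scale 0.5 the norm collapses toward zero (vanishing), with 2.0 it blows up (exploding); only a carefully chosen scale keeps it stable, which motivates the remedies on the next slide.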

SLIDE 3

Training Deep Neural Networks

  • Gradient-based Learning
  • Remedy: keep the distribution of the hidden unit activations in a controlled range

○ Normalize the input: once, as a preprocessing step over the entire training set
○ Set the variance of the randomly initialized weights: once, as a model setup step
○ Normalize the hidden units: run-time processing during training → batch normalization

[Figure: the same forward/backward diagram as on the previous slide]

SLIDE 4
Input Normalization

  • Standardization: zero mean and unit variance
  • PCA whitening: zero mean and decorrelated unit variance

[Figure: standardization = mean subtraction, then division by the standard deviation; PCA whitening = mean subtraction, then rotation and division by the standard deviation]

Add a small number to the standard deviation in the division (to avoid dividing by zero).

SLIDE 5

Input Normalization

  • In practice

○ Zero mean and unit variance are commonly used in music classification tasks when the input is a log-compressed spectrogram (in image classification, however, unit variance is not very common)
○ PCA whitening is not very common

  • Common pitfall

○ The mean and standard deviation must be computed from the training data only (not from the entire dataset)
○ The mean and standard deviation from the training set should be used consistently for the validation and test sets (see the sketch below)
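A minimal NumPy sketch of this recipe (the stand-in data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)                      # stand-in data: 128-dim features
X_train, X_valid, X_test = (rng.normal(size=(n, 128)) for n in (1000, 200, 200))

def fit_standardizer(X, eps=1e-8):
    """Compute mean/std on the TRAINING split only."""
    return X.mean(axis=0), X.std(axis=0) + eps      # eps avoids division by zero

def standardize(X, mean, std):
    return (X - mean) / std

mean, std = fit_standardizer(X_train)               # never use valid/test data here
X_train_n = standardize(X_train, mean, std)
X_valid_n = standardize(X_valid, mean, std)         # same training-set statistics
X_test_n  = standardize(X_test,  mean, std)
```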

SLIDE 6

Weight Initialization

  • Set the variance of the random initial weights so that the variance of the input is equal to the variance of the output at each layer: this speeds up training
  • Glorot (or Xavier) initialization (2010)

○ The variance is set to $\sigma_W^2 = \dfrac{2}{n_{in} + n_{out}}$, i.e., $1 / \text{fan}_{avg}$ with $\text{fan}_{avg} = \dfrac{n_{in} + n_{out}}{2}$
○ Concerned with both the forward and backward passes
○ Used when the activation function is tanh or sigmoid

  • He initialization (2015)

○ The variance is set to $\sigma_W^2 = \dfrac{2}{n_{in}}$
○ Concerned with the forward pass only
○ Used when the activation function is ReLU or its variants (a sketch follows below)
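The slides don't fix a framework; assuming PyTorch, both schemes are available as built-in initializers:

```python
import torch.nn as nn

fc_tanh = nn.Linear(512, 256)                  # layer followed by tanh/sigmoid
nn.init.xavier_normal_(fc_tanh.weight)         # Glorot: var = 2 / (n_in + n_out)
nn.init.zeros_(fc_tanh.bias)

fc_relu = nn.Linear(512, 256)                  # layer followed by ReLU
nn.init.kaiming_normal_(fc_relu.weight, nonlinearity='relu')  # He: var = 2 / n_in
nn.init.zeros_(fc_relu.bias)
```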

SLIDE 7

Batch Normalization

  • Normalize the output of each layer for a mini-batch input, as run-time processing during training (Ioffe and Szegedy, 2015)

○ First, normalize the filter output to have zero mean and unit variance over the mini batch
○ Then, rescale and shift the normalized output with trainable parameters $(\gamma, \delta)$

■ This lets the output exploit the non-linearity: inputs with zero mean and unit variance fall mostly in the linear range of saturating activations (e.g., sigmoid or tanh)

SLIDE 8

Batch Normalization

  • Implemented as an additional layer

○ Located between the FC (or Conv) layer and the activation function layer
○ A simple element-wise scaling and shifting operation on its input (the mean and variance of the mini batch are regarded as a constant vector)

  • Batch normalization in the test phase

○ Only a single example may be available at test time, so mini-batch statistics cannot be used
○ Instead, use the moving average of the mean and variance of the mini batches from the training phase: $\nu^{(mov)} \leftarrow \beta \cdot \nu^{(mov)} + (1 - \beta) \cdot \nu^{(new)}$
○ In summary, the batch norm layer holds four types of parameters: mean (moving average), variance (moving average), rescaling (trainable), and shift (trainable); see the sketch below
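A PyTorch sketch (assumed framework) of the placement and the train/test behavior; `BatchNorm1d` stores exactly the four parameter types listed above:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256, bias=False),   # bias is redundant before batch norm
    nn.BatchNorm1d(256),               # between the FC layer and the activation
    nn.ReLU(),
    nn.Linear(256, 10),
)

model.train()                          # uses mini-batch statistics, updates the moving averages
y = model(torch.randn(32, 128))        # a mini batch of 32 examples
model.eval()                           # switches to the stored moving averages
y1 = model(torch.randn(1, 128))        # so even a single test example works
```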

SLIDE 9

Batch Normalization

  • Advantages

○ Improves the gradient flow through the network, allowing a higher learning rate; as a result, training becomes much faster
○ Reduces the dependence on weight initialization

[Figure 2 of Ioffe and Szegedy (2015): single-crop validation accuracy on ImageNet (0.4-0.8) over 5M-30M training steps for Inception and its batch-normalized variants (BN-Baseline, BN-x5, BN-x30, BN-x5-Sigmoid); the BN variants need far fewer steps to match Inception's accuracy]

SLIDE 10

Optimization

  • Stochastic gradient descent is the basic optimizer in deep learning
  • However, it has limitations

○ Convergence of the loss is very slow when the loss function has a high condition number
○ If the gradient becomes zero, the update gets stuck at local minima or saddle points

  • Condition number: the ratio of the largest to the smallest singular value of the Hessian matrix

$$\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t)$$

[Figure: a local minimum and a saddle point; at both, $\nabla L(\theta) = 0$, so the update stops]

Source: Stanford CS231n slides

SLIDE 11
  • Momentum
  • Update the parameters using not only the current gradient but also the previous history

○ Adding inertia (or gravity) to the parameter point
○ Analogous to classical mechanics → $\theta$: displacement, $v$: velocity

$$v_{t+1} = \rho v_t - \eta \, \nabla L(\theta_t) \qquad \theta_{t+1} = \theta_t + v_{t+1}$$

(accumulated gradients $\rho v_t$ plus the current gradient; a sketch follows the figure below)

[Figure: the same local-minimum / saddle-point illustration; the accumulated velocity can carry the parameter point past regions where $\nabla L(\theta) = 0$]

Source: Stanford CS231n slides
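A minimal NumPy sketch of the momentum update on a toy quadratic loss (all values illustrative; the condition number of 50 makes plain SGD slow):

```python
import numpy as np

A = np.diag([1.0, 50.0])              # toy quadratic loss L(theta) = 0.5 * theta @ A @ theta
def grad(theta):                      # its gradient
    return A @ theta

theta, v = np.array([1.0, 1.0]), np.zeros(2)
eta, rho = 0.02, 0.9                  # learning rate, momentum coefficient

for _ in range(100):
    v = rho * v - eta * grad(theta)   # accumulated gradients + current gradient
    theta = theta + v                 # move by the velocity
print(theta)                          # close to the minimum at the origin
```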

SLIDE 12

Nesterov Momentum

  • Faster convergence by computing the gradient at the tip of the velocity

○ At the current parameter point $\theta_t$, we already know that the history term $\rho v_t$ will be added to it. Thus, we compute the local gradient at the anticipated point $\theta_t + \rho v_t$

$$v_{t+1} = \rho v_t - \eta \, \nabla L(\theta_t + \rho v_t) \qquad \theta_{t+1} = \theta_t + v_{t+1}$$

[Figure: update vectors compared; momentum adds $-\eta \nabla L(\theta_t)$ to $\rho v_t$, while Nesterov momentum adds $-\eta \nabla L(\theta_t + \rho v_t)$, both arriving at $\theta_t + v_{t+1}$; a code variant follows below]
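Continuing the momentum sketch above, the only change is where the gradient is evaluated:

```python
theta, v = np.array([1.0, 1.0]), np.zeros(2)    # restart from the same initial point
for _ in range(100):
    v = rho * v - eta * grad(theta + rho * v)   # gradient at the anticipated point
    theta = theta + v
print(theta)
```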

SLIDE 13

Per-Parameter Optimization

  • Every parameter has its own learning rate
  • AdaGrad (Duchi, 2011)

○ Use an adaptive learning rate for each parameter
○ Increase the learning rate for less-updated parameters, and vice versa

$$\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t) \;\;\rightarrow\;\; \theta_{t+1}(j) = \theta_t(j) - \eta(j) \, \nabla L(\theta_t)(j)$$
$$\theta_{t+1}(j) = \theta_t(j) - \frac{\eta}{\sqrt{h(j)} + \epsilon} \, \nabla L(\theta_t)(j), \qquad h(j) = \sum_t \nabla L(\theta_t)(j)^2$$

(a sketch follows below)
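A per-parameter NumPy sketch of the AdaGrad update on the same toy quadratic as before (`eps` stands for the small constant $\epsilon$ in the denominator):

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # same toy quadratic loss as in the momentum sketch
grad = lambda theta: A @ theta

theta = np.array([1.0, 1.0])
h = np.zeros(2)                          # accumulated squared gradients, one per parameter
eta, eps = 0.5, 1e-8

for _ in range(100):
    g = grad(theta)
    h = h + g ** 2                       # h only grows, so the effective rate only shrinks
    theta = theta - eta / (np.sqrt(h) + eps) * g
print(theta)                             # moves toward the minimum despite very different curvatures
```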

SLIDE 14

Per-Parameter Optimization

  • RMSProp (Tieleman and Hinton, 2012)

○ Fix the continuously growing $h(j)$ in AdaGrad by using a moving average instead:
$$h_{t+1}(j) = \gamma h_t(j) + (1 - \gamma) \, \nabla L(\theta_t)(j)^2, \qquad \theta_{t+1}(j) = \theta_t(j) - \frac{\eta}{\sqrt{h_{t+1}(j)} + \epsilon} \, \nabla L(\theta_t)(j)$$

  • Adam (Adaptive Moment Estimation)

○ Put it all together: momentum + a per-parameter learning rate (RMSProp)
$$m_{t+1} = \rho m_t + (1 - \rho) \, \nabla L(\theta_t), \qquad \theta_{t+1}(j) = \theta_t(j) - \frac{\eta}{\sqrt{h_{t+1}(j)} + \epsilon} \, m_{t+1}(j)$$
○ The most widely used optimizer (usage sketch below)

SLIDE 15

Optimizer Animation

  • Comparison of optimizers
  • Also, check this interactive article “why momentum really works” in Distill

○ https://distill.pub/2017/momentum/

Source: https://rnrahman.com/blog/visualising-stochastic-optimisers/

SLIDE 16

Annealing Learning Rate

  • Decay the learning rate under certain conditions

○ Step decay: by a factor (e.g., 5 or 10) every fixed number of epochs

■ Exponential decay ($\eta = \eta_0 e^{-kt}$) or 1/t decay ($\eta = \dfrac{\eta_0}{1 + kt}$) is also possible

○ Reduce on plateau: decay around every early-stopping point

  • Reset the learning rate with a "warm restart"

○ Cosine (Loshchilov, 2017) and cyclic (Smith, 2017) schedules
○ Start with a high learning rate and restart with better initial weights

[Figure: loss vs. epoch; left: decaying the learning rate; right: resetting the learning rate (a new start); scheduler sketches follow below]
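PyTorch sketches of these schedules (assumed framework; the decay factors and periods are illustrative):

```python
import torch

opt = torch.optim.SGD([torch.nn.Parameter(torch.randn(10))], lr=0.1)

# step decay: divide the learning rate by 10 every 30 epochs
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)

# reduce on plateau: sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.1, patience=10)
# warm restarts:     sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=10)

for epoch in range(90):
    # ... train one epoch, then advance the schedule ...
    sched.step()   # ReduceLROnPlateau takes the metric instead: sched.step(val_loss)
```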

SLIDE 17

Regularization

  • Deep learning models can easily overfit the training data

[Figure: training and validation loss vs. epoch; overfitting starts where the validation loss turns upward while the training loss keeps falling. Left panel: use early stopping. Right panel: use weight decay, dropout, or data augmentation]

SLIDE 18

Dropout

  • Turn off the hidden-layer units randomly in each forward pass (Srivastava et al., 2014)

○ Implemented by multiplying the hidden units with binary random variables (a sketch follows below)
○ The dropout probability is a hyper-parameter
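A NumPy sketch of the masking, using the "inverted" variant that most libraries implement (it adds a rescaling by $1/(1-p)$ so that no change is needed at test time):

```python
import numpy as np

def dropout(h, p, training=True, rng=np.random.default_rng(0)):
    """Zero each unit with probability p; rescale so the expected output is unchanged."""
    if not training or p == 0.0:
        return h                             # no-op at test time
    mask = rng.random(h.shape) >= p          # the binary random mask
    return h * mask / (1.0 - p)              # 'inverted' dropout keeps the activation scale

h = np.ones((4, 8))                          # a toy batch of hidden activations
print(dropout(h, p=0.5))                     # roughly half the units are zeroed
```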

SLIDE 19

Dropout

  • Prevents co-adaptation of hidden units

○ Co-adaptation: two or more hidden units detect the same feature
○ Dropout prevents this waste of resources

  • Dropout enables a large ensemble of models

○ The ensemble reduces the variance of the model
○ A fully-connected layer with 1024 units has $2^{1024}$ combinations of masks (all sharing the same parameters)

SLIDE 20

Data Augmentation

  • Increase the quantity of training data based on domain knowledge
  • Commonly used digital audio effects (a sketch follows this list)

○ Pitch shifting
○ Time stretching
○ Equalization
○ Adding noise

  • Check whether the output label is affected by the audio effects

○ You can also change the label according to how the data is augmented
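A sketch using librosa (an assumed library choice; equalization is omitted for brevity):

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex('trumpet'))                    # example clip shipped with librosa

y_pitch   = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # pitch shift: up 2 semitones
y_stretch = librosa.effects.time_stretch(y, rate=1.1)          # time stretch: 10% faster
y_noisy   = y + 0.005 * np.random.randn(len(y))                # additive noise

# label check: these effects preserve, e.g., an instrument label,
# but pitch shifting would change the label in a note-classification task
```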

SLIDE 21

In Summary: Recommended Settings

  • Data

○ Standardization, augmentation (optional)

  • Build a neural network model

○ Add batch normalization
○ He initialization
○ Dropout, weight decay (optional)

  • Optimizer

○ ADAM or SGD with Nesterov Momentum

  • Training

○ Early stopping
○ Learning-rate annealing

A combined sketch of these settings follows below.
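Putting the recommended settings together in one PyTorch sketch (assumed framework; sizes, rates, and patience values are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256, bias=False),   # bias is redundant before batch norm
    nn.BatchNorm1d(256),               # batch normalization
    nn.ReLU(),
    nn.Dropout(p=0.5),                 # optional
    nn.Linear(256, 10),
)
for m in model.modules():              # He initialization for the ReLU layers
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')

# ADAM; weight decay is the optional regularizer
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.2, patience=5)

best, patience = float('inf'), 0
for epoch in range(100):
    # ... train one epoch on standardized (and possibly augmented) data ...
    val_loss = 1.0 / (epoch + 1)       # stand-in for the real validation loss
    sched.step(val_loss)               # annealing (reduce on plateau)
    if val_loss < best:
        best, patience = val_loss, 0
    else:
        patience += 1
        if patience >= 10:             # early stopping
            break
```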