

SLIDE 1

GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)

Deep Learning: Training

Juhan Nam

SLIDE 2

Training Deep Neural Networks

  • Gradient-based Learning
  • Issue: vanishing or exploding gradients

○ The gradient in the lower layers is computed as a cascaded multiplication of local gradients from the upper layers (a small numerical demo follows the figure below)
○ Some of its elements can decay or explode exponentially
○ Learning becomes diluted or unstable

[Figure: forward pass (hidden unit activations) through Layer 1, Layer 2, ..., Layer L-1, Layer L, producing the output $y$ and the loss $L(y, \hat{y})$; backward pass (gradient flow) carrying the cascaded gradients $\partial L / \partial x^{(l)}$ from the top layer back down]
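The cascade can be made concrete with a small NumPy demo (not from the slides; the layer width, depth, and weight scales are arbitrary): depending on the weight scale, the gradient norm reaching the first layer shrinks or grows exponentially.

```python
import numpy as np

rng = np.random.default_rng(0)
for scale in (0.5, 1.0, 2.0):              # relative std of the random weights
    g = np.ones(64)                        # gradient arriving from the top layer
    for _ in range(50):                    # cascade through 50 linear layers
        W = rng.normal(0.0, scale / np.sqrt(64), size=(64, 64))
        g = W.T @ g                        # one local (linear) backward step
    print(f"scale={scale}: |grad| at layer 1 = {np.linalg.norm(g):.2e}")
```

With scale 0.5 the norm collapses toward zero (vanishing), with 2.0 it blows up (exploding); only a carefully chosen scale keeps it stable, which motivates the remedies on the next slide.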

SLIDE 3

Training Deep Neural Networks

  • Gradient-based Learning
  • Remedy: keep the distribution of the hidden unit activations in a controlled range

○ Normalize the input: once, as a preprocessing step over the entire training set
○ Set the variance of the randomly initialized weights: once, as a model setup step
○ Normalize the hidden units: run-time processing during training → batch normalization

[Figure: the same forward/backward diagram as on the previous slide]

SLIDE 4
Input Normalization

  • Standardization: zero mean and unit variance
  • PCA whitening: zero mean and decorrelated unit variance

[Figure: standardization = mean subtraction, then division by the standard deviation; PCA whitening = mean subtraction, then rotation and division by the standard deviation]

Add a small number to the standard deviation in the division (to avoid dividing by zero).

SLIDE 5

Input Normalization

  • In practice

○ Zero mean and unit variance are commonly used in music classification tasks when the input is a log-compressed spectrogram (in image classification, however, unit variance is not very common)
○ PCA whitening is not very common

  • Common pitfall

○ The mean and standard deviation must be computed from the training data only (not from the entire dataset)
○ The mean and standard deviation from the training set should be used consistently for the validation and test sets (see the sketch below)
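A minimal NumPy sketch of this recipe (the stand-in data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)                      # stand-in data: 128-dim features
X_train, X_valid, X_test = (rng.normal(size=(n, 128)) for n in (1000, 200, 200))

def fit_standardizer(X, eps=1e-8):
    """Compute mean/std on the TRAINING split only."""
    return X.mean(axis=0), X.std(axis=0) + eps      # eps avoids division by zero

def standardize(X, mean, std):
    return (X - mean) / std

mean, std = fit_standardizer(X_train)               # never use valid/test data here
X_train_n = standardize(X_train, mean, std)
X_valid_n = standardize(X_valid, mean, std)         # same training-set statistics
X_test_n  = standardize(X_test,  mean, std)
```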

SLIDE 6

Weight Initialization

  • Set the variance of the random initial weights so that the variance of the input is equal to the variance of the output at each layer: this speeds up training
  • Glorot (or Xavier) initialization (2010)

○ The variance is set to $\sigma_W^2 = \dfrac{2}{n_{in} + n_{out}}$, i.e., $1 / \text{fan}_{avg}$ with $\text{fan}_{avg} = \dfrac{n_{in} + n_{out}}{2}$
○ Concerned with both the forward and backward passes
○ Used when the activation function is tanh or sigmoid

  • He initialization (2015)

○ The variance is set to $\sigma_W^2 = \dfrac{2}{n_{in}}$
○ Concerned with the forward pass only
○ Used when the activation function is ReLU or its variants (a sketch follows below)
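The slides don't fix a framework; assuming PyTorch, both schemes are available as built-in initializers:

```python
import torch.nn as nn

fc_tanh = nn.Linear(512, 256)                  # layer followed by tanh/sigmoid
nn.init.xavier_normal_(fc_tanh.weight)         # Glorot: var = 2 / (n_in + n_out)
nn.init.zeros_(fc_tanh.bias)

fc_relu = nn.Linear(512, 256)                  # layer followed by ReLU
nn.init.kaiming_normal_(fc_relu.weight, nonlinearity='relu')  # He: var = 2 / n_in
nn.init.zeros_(fc_relu.bias)
```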

SLIDE 7

Batch Normalization

  • Normalize the output of each layer for a mini-batch input, as run-time processing during training (Ioffe and Szegedy, 2015)

○ First, normalize the filter output to have zero mean and unit variance over the mini batch
○ Then, rescale and shift the normalized output with trainable parameters $(\gamma, \delta)$

■ This lets the output exploit the non-linearity: inputs with zero mean and unit variance fall mostly in the linear range of saturating activations (e.g., sigmoid or tanh)

SLIDE 8

Batch Normalization

  • Implemented as an additional layer

○ Located between the FC (or Conv) layer and the activation function layer
○ A simple element-wise scaling and shifting operation on its input (the mean and variance of the mini batch are regarded as a constant vector)

  • Batch normalization in the test phase

○ Only a single example may be available at test time, so mini-batch statistics cannot be used
○ Instead, use the moving average of the mean and variance of the mini batches from the training phase: $\nu^{(mov)} \leftarrow \beta \cdot \nu^{(mov)} + (1 - \beta) \cdot \nu^{(new)}$
○ In summary, the batch norm layer holds four types of parameters: mean (moving average), variance (moving average), rescaling (trainable), and shift (trainable); see the sketch below
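A PyTorch sketch (assumed framework) of the placement and the train/test behavior; `BatchNorm1d` stores exactly the four parameter types listed above:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256, bias=False),   # bias is redundant before batch norm
    nn.BatchNorm1d(256),               # between the FC layer and the activation
    nn.ReLU(),
    nn.Linear(256, 10),
)

model.train()                          # uses mini-batch statistics, updates the moving averages
y = model(torch.randn(32, 128))        # a mini batch of 32 examples
model.eval()                           # switches to the stored moving averages
y1 = model(torch.randn(1, 128))        # so even a single test example works
```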

SLIDE 9

Batch Normalization

  • Advantages

○ Improves the gradient flow through the network, allowing a higher learning rate; as a result, training becomes much faster
○ Reduces the dependence on weight initialization

[Figure 2 of Ioffe and Szegedy (2015): single-crop validation accuracy on ImageNet (0.4-0.8) over 5M-30M training steps for Inception and its batch-normalized variants (BN-Baseline, BN-x5, BN-x30, BN-x5-Sigmoid); the BN variants need far fewer steps to match Inception's accuracy]

SLIDE 10

Optimization

  • Stochastic gradient descent is the basic optimizer in deep learning
  • However, it has limitations

○ Convergence of the loss is very slow when the loss function has a high condition number
○ If the gradient becomes zero, the update gets stuck at local minima or saddle points

  • Condition number: the ratio of the largest to the smallest singular value of the Hessian matrix

$$\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t)$$

[Figure: a local minimum and a saddle point; at both, $\nabla L(\theta) = 0$, so the update stops]

Source: Stanford CS231n slides

SLIDE 11
  • Momentum
  • Update the parameters using not only the current gradient but also the previous history

○ Adding inertia (or gravity) to the parameter point
○ Analogous to classical mechanics → $\theta$: displacement, $v$: velocity

$$v_{t+1} = \rho v_t - \eta \, \nabla L(\theta_t) \qquad \theta_{t+1} = \theta_t + v_{t+1}$$

(accumulated gradients $\rho v_t$ plus the current gradient; a sketch follows the figure below)

[Figure: the same local-minimum / saddle-point illustration; the accumulated velocity can carry the parameter point past regions where $\nabla L(\theta) = 0$]

Source: Stanford CS231n slides
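A minimal NumPy sketch of the momentum update on a toy quadratic loss (all values illustrative; the condition number of 50 makes plain SGD slow):

```python
import numpy as np

A = np.diag([1.0, 50.0])              # toy quadratic loss L(theta) = 0.5 * theta @ A @ theta
def grad(theta):                      # its gradient
    return A @ theta

theta, v = np.array([1.0, 1.0]), np.zeros(2)
eta, rho = 0.02, 0.9                  # learning rate, momentum coefficient

for _ in range(100):
    v = rho * v - eta * grad(theta)   # accumulated gradients + current gradient
    theta = theta + v                 # move by the velocity
print(theta)                          # close to the minimum at the origin
```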

SLIDE 12

Nesterov Momentum

  • Faster convergence by computing the gradient at the tip of the velocity

○ At the current parameter point $\theta_t$, we already know that the history term $\rho v_t$ will be added to it. Thus, we compute the local gradient at the anticipated point $\theta_t + \rho v_t$

$$v_{t+1} = \rho v_t - \eta \, \nabla L(\theta_t + \rho v_t) \qquad \theta_{t+1} = \theta_t + v_{t+1}$$

[Figure: update vectors compared; momentum adds $-\eta \nabla L(\theta_t)$ to $\rho v_t$, while Nesterov momentum adds $-\eta \nabla L(\theta_t + \rho v_t)$, both arriving at $\theta_t + v_{t+1}$; a code variant follows below]
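Continuing the momentum sketch above, the only change is where the gradient is evaluated:

```python
theta, v = np.array([1.0, 1.0]), np.zeros(2)    # restart from the same initial point
for _ in range(100):
    v = rho * v - eta * grad(theta + rho * v)   # gradient at the anticipated point
    theta = theta + v
print(theta)
```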

SLIDE 13

Per-Parameter Optimization

  • Every parameter has its own learning rate
  • AdaGrad (Duchi, 2011)

○ Use an adaptive learning rate for each parameter
○ Increase the learning rate for less-updated parameters, and vice versa

$$\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t) \;\;\rightarrow\;\; \theta_{t+1}(j) = \theta_t(j) - \eta(j) \, \nabla L(\theta_t)(j)$$
$$\theta_{t+1}(j) = \theta_t(j) - \frac{\eta}{\sqrt{h(j)} + \epsilon} \, \nabla L(\theta_t)(j), \qquad h(j) = \sum_t \nabla L(\theta_t)(j)^2$$

(a sketch follows below)
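A per-parameter NumPy sketch of the AdaGrad update on the same toy quadratic as before (`eps` stands for the small constant $\epsilon$ in the denominator):

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # same toy quadratic loss as in the momentum sketch
grad = lambda theta: A @ theta

theta = np.array([1.0, 1.0])
h = np.zeros(2)                          # accumulated squared gradients, one per parameter
eta, eps = 0.5, 1e-8

for _ in range(100):
    g = grad(theta)
    h = h + g ** 2                       # h only grows, so the effective rate only shrinks
    theta = theta - eta / (np.sqrt(h) + eps) * g
print(theta)                             # moves toward the minimum despite very different curvatures
```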

SLIDE 14

Per-Parameter Optimization

  • RMSProp (Tieleman and Hinton, 2012)

○ Fix the continuously growing $h(j)$ in AdaGrad by using a moving average instead:
$$h_{t+1}(j) = \gamma h_t(j) + (1 - \gamma) \, \nabla L(\theta_t)(j)^2, \qquad \theta_{t+1}(j) = \theta_t(j) - \frac{\eta}{\sqrt{h_{t+1}(j)} + \epsilon} \, \nabla L(\theta_t)(j)$$

  • Adam (Adaptive Moment Estimation)

○ Put it all together: momentum + a per-parameter learning rate (RMSProp)
$$m_{t+1} = \rho m_t + (1 - \rho) \, \nabla L(\theta_t), \qquad \theta_{t+1}(j) = \theta_t(j) - \frac{\eta}{\sqrt{h_{t+1}(j)} + \epsilon} \, m_{t+1}(j)$$
○ The most widely used optimizer (usage sketch below)

SLIDE 15

Optimizer Animation

  • Comparison of optimizers
  • Also, check this interactive article “why momentum really works” in Distill

○ https://distill.pub/2017/momentum/

Source: https://rnrahman.com/blog/visualising-stochastic-optimisers/

SLIDE 16

Annealing Learning Rate

  • Decay the learning rate under certain conditions

○ Step decay: by a factor (e.g., 5 or 10) every fixed number of epochs

■ Exponential decay ($\eta = \eta_0 e^{-kt}$) or 1/t decay ($\eta = \dfrac{\eta_0}{1 + kt}$) is also possible

○ Reduce on plateau: decay around every early-stopping point

  • Reset the learning rate with a "warm restart"

○ Cosine (Loshchilov, 2017) and cyclic (Smith, 2017) schedules
○ Start with a high learning rate and restart with better initial weights

[Figure: loss vs. epoch; left: decaying the learning rate; right: resetting the learning rate (a new start); scheduler sketches follow below]
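PyTorch sketches of these schedules (assumed framework; the decay factors and periods are illustrative):

```python
import torch

opt = torch.optim.SGD([torch.nn.Parameter(torch.randn(10))], lr=0.1)

# step decay: divide the learning rate by 10 every 30 epochs
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)

# reduce on plateau: sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.1, patience=10)
# warm restarts:     sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=10)

for epoch in range(90):
    # ... train one epoch, then advance the schedule ...
    sched.step()   # ReduceLROnPlateau takes the metric instead: sched.step(val_loss)
```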

SLIDE 17

Regularization

  • Deep learning models can easily overfit the training data

[Figure: training and validation loss vs. epoch; overfitting starts where the validation loss turns upward while the training loss keeps falling. Left panel: use early stopping. Right panel: use weight decay, dropout, or data augmentation]

SLIDE 18

Dropout

  • Turn off the hidden-layer units randomly in each forward pass (Srivastava et al., 2014)

○ Implemented by multiplying the hidden units with binary random variables (a sketch follows below)
○ The dropout probability is a hyper-parameter
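A NumPy sketch of the masking, using the "inverted" variant that most libraries implement (it adds a rescaling by $1/(1-p)$ so that no change is needed at test time):

```python
import numpy as np

def dropout(h, p, training=True, rng=np.random.default_rng(0)):
    """Zero each unit with probability p; rescale so the expected output is unchanged."""
    if not training or p == 0.0:
        return h                             # no-op at test time
    mask = rng.random(h.shape) >= p          # the binary random mask
    return h * mask / (1.0 - p)              # 'inverted' dropout keeps the activation scale

h = np.ones((4, 8))                          # a toy batch of hidden activations
print(dropout(h, p=0.5))                     # roughly half the units are zeroed
```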

SLIDE 19

Dropout

  • Prevents co-adaptation of hidden units

○ Co-adaptation: two or more hidden units detect the same feature
○ Dropout prevents this waste of resources

  • Dropout enables a large ensemble of models

○ The ensemble reduces the variance of the model
○ A fully-connected layer with 1024 units has $2^{1024}$ combinations of masks (all sharing the same parameters)

SLIDE 20

Data Augmentation

  • Increase the quantity of training data based on domain knowledge
  • Commonly used digital audio effects (a sketch follows this list)

○ Pitch shifting
○ Time stretching
○ Equalization
○ Adding noise

  • Check whether the output label is affected by the audio effects

○ You can also change the label according to how the data is augmented
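A sketch using librosa (an assumed library choice; equalization is omitted for brevity):

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex('trumpet'))                    # example clip shipped with librosa

y_pitch   = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # pitch shift: up 2 semitones
y_stretch = librosa.effects.time_stretch(y, rate=1.1)          # time stretch: 10% faster
y_noisy   = y + 0.005 * np.random.randn(len(y))                # additive noise

# label check: these effects preserve, e.g., an instrument label,
# but pitch shifting would change the label in a note-classification task
```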

SLIDE 21

In Summary: Recommended Settings

  • Data

○ Standardization, augmentation (optional)

  • Build a neural network model

○ Add batch normalization
○ He initialization
○ Dropout, weight decay (optional)

  • Optimizer

○ ADAM or SGD with Nesterov Momentum

  • Training

○ Early stopping
○ Learning-rate annealing

A combined sketch of these settings follows below.
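Putting the recommended settings together in one PyTorch sketch (assumed framework; sizes, rates, and patience values are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256, bias=False),   # bias is redundant before batch norm
    nn.BatchNorm1d(256),               # batch normalization
    nn.ReLU(),
    nn.Dropout(p=0.5),                 # optional
    nn.Linear(256, 10),
)
for m in model.modules():              # He initialization for the ReLU layers
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')

# ADAM; weight decay is the optional regularizer
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.2, patience=5)

best, patience = float('inf'), 0
for epoch in range(100):
    # ... train one epoch on standardized (and possibly augmented) data ...
    val_loss = 1.0 / (epoch + 1)       # stand-in for the real validation loss
    sched.step(val_loss)               # annealing (reduce on plateau)
    if val_loss < best:
        best, patience = val_loss, 0
    else:
        patience += 1
        if patience >= 10:             # early stopping
            break
```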