 
CS7015 (Deep Learning) : Lecture 9
Greedy Layerwise Pre-training, Better activation functions, Better weight initialization methods, Batch Normalization
Mitesh M. Khapra
Department of Computer Science and Engineering, Indian Institute of Technology Madras
Module 9.1 : A quick recap of training deep neural networks
We already saw how to train this network (a single sigmoid neuron with input $x$, weight $w$ and output $y$):

$$w = w - \eta \nabla w$$

where

$$\nabla w = \frac{\partial \mathcal{L}(w)}{\partial w} = (f(x) - y) * f(x) * (1 - f(x)) * x$$

What about a wider network with more inputs ($x_1, x_2, x_3$ with weights $w_1, w_2, w_3$)?

$$w_1 = w_1 - \eta \nabla w_1$$
$$w_2 = w_2 - \eta \nabla w_2$$
$$w_3 = w_3 - \eta \nabla w_3$$

where

$$\nabla w_i = (f(x) - y) * f(x) * (1 - f(x)) * x_i$$
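As a minimal sketch of the update rule above, the following NumPy snippet trains a single sigmoid neuron on one example. It assumes the loss is $\frac{1}{2}(f(x) - y)^2$, which is the loss whose derivative matches the gradient formula on the slide; the data values, learning rate and function names are illustrative and not taken from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    # Assumed loss: 0.5 * (f(x) - y)^2 with f(x) = sigmoid(w . x)
    return 0.5 * (sigmoid(np.dot(w, x)) - y) ** 2

def grad_w(w, x, y):
    # (f(x) - y) * f(x) * (1 - f(x)) * x_i for every weight w_i
    fx = sigmoid(np.dot(w, x))
    return (fx - y) * fx * (1.0 - fx) * x

def train(x, y, w, eta=1.0, steps=1000):
    # Repeated application of w = w - eta * grad(w)
    for _ in range(steps):
        w = w - eta * grad_w(w, x, y)
    return w

# Illustrative data (hypothetical values)
x = np.array([0.5, -1.0, 2.0])
y = 0.9
w0 = np.zeros(3)

w = train(x, y, w0)
loss_before = loss(w0, x, y)
loss_after = loss(w, x, y)
```

Each component of the gradient is the same scalar error term multiplied by the corresponding input $x_i$, which is why a single vectorized expression covers both the one-input and the three-input case.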
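What if we have a deeper network? (Figure: a chain $x = h_0 \rightarrow a_1 \rightarrow h_1 \rightarrow a_2 \rightarrow h_2 \rightarrow a_3 \rightarrow y$ with weights $w_1, w_2, w_3$ and a sigmoid $\sigma$ at every layer.)

Here $a_i = w_i h_{i-1}$ and $h_i = \sigma(a_i)$, so that $a_1 = w_1 * x = w_1 * h_0$.

We can now calculate $\nabla w_1$ using the chain rule:

$$\frac{\partial \mathcal{L}(w)}{\partial w_1} = \frac{\partial \mathcal{L}(w)}{\partial y} \cdot \frac{\partial y}{\partial a_3} \cdot \frac{\partial a_3}{\partial h_2} \cdot \frac{\partial h_2}{\partial a_2} \cdot \frac{\partial a_2}{\partial h_1} \cdot \frac{\partial h_1}{\partial a_1} \cdot \frac{\partial a_1}{\partial w_1} = \frac{\partial \mathcal{L}(w)}{\partial y} * \ldots * h_0$$

In general,

$$\nabla w_i = \frac{\partial \mathcal{L}(w)}{\partial y} * \ldots * h_{i-1}$$

Notice that $\nabla w_i$ is proportional to the corresponding input $h_{i-1}$ (we will use this fact later).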
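The chain-rule product above can be sketched for a chain of scalar weights as follows. This is an illustrative implementation, not the lecture's code: it assumes a squared-error loss $\frac{1}{2}(y_{out} - y)^2$, and the weight values and function names are hypothetical. In the code, `hs[i]` holds $h_i$ and `ws[i]` holds $w_{i+1}$.

```python
def sigmoid(z):
    import math
    return 1.0 / (1.0 + math.exp(-z))

def forward(ws, x):
    # hs[0] = h_0 = x, hs[i] = sigmoid(w_i * hs[i-1])
    hs = [x]
    for w in ws:
        hs.append(sigmoid(w * hs[-1]))
    return hs

def loss(ws, x, y):
    # Assumed loss: 0.5 * (output - y)^2
    return 0.5 * (forward(ws, x)[-1] - y) ** 2

def gradients(ws, x, y):
    hs = forward(ws, x)
    grads = [0.0] * len(ws)
    dL_dh = hs[-1] - y                                  # dL/dy for the assumed loss
    for i in reversed(range(len(ws))):
        dL_da = dL_dh * hs[i + 1] * (1.0 - hs[i + 1])   # sigmoid'(a) = h * (1 - h)
        grads[i] = dL_da * hs[i]                        # da/dw = input to this weight
        dL_dh = dL_da * ws[i]                           # da/dh_prev = w
    return grads

# Illustrative values (hypothetical)
ws = [0.5, -1.2, 0.8]
x, y = 1.5, 1.0
g = gradients(ws, x, y)
```

The line `grads[i] = dL_da * hs[i]` is where the input $h_{i-1}$ enters as the final factor: if that input is zero, the gradient for the weight is zero regardless of the rest of the chain.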
What happens if we have a network which is deep and wide? How do you calculate $\nabla w_2$? It will be given by the chain rule applied across multiple paths (we saw this in detail when we studied backpropagation).
Things to remember
- Training Neural Networks is a Game of Gradients (played using any of the existing gradient based approaches that we discussed).
- The gradient tells us the responsibility of a parameter towards the loss.
- The gradient w.r.t. a parameter is proportional to the input to that parameter (recall the "$\ldots * x$" term or the "$\ldots * h_{i-1}$" term in the formula for $\nabla w_i$).
Backpropagation was made popular by Rumelhart et al. in 1986. However, when used for really deep networks it was not very successful. In fact, till 2006 it was very hard to train very deep networks. Typically, even after a large number of epochs the training did not converge.
Module 9.2 : Unsupervised pre-training
What has changed now? How did Deep Learning become so popular despite this problem with training large networks? Well, until 2006 it wasn't so popular. The field got revived after the seminal work of Hinton and Salakhutdinov in 2006 [1].

[1] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.
Let's look at the idea of unsupervised pre-training introduced in this paper (note that the paper introduced the idea in the context of RBMs, but we will discuss it in the context of Autoencoders).
Consider the deep neural network shown in this figure. Let us focus on the first two layers of the network ($x$ and $h_1$). We will first train the weights between these two layers using an unsupervised objective. Note that we are trying to reconstruct the input ($x$) from the hidden representation ($h_1$), giving a reconstruction $\hat{x}$:

$$\min \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2$$

We refer to this as an unsupervised objective because it does not involve the output label ($y$) and only uses the input data ($x$).
At the end of this step, the weights in layer 1 are trained such that $h_1$ captures an abstract representation of the input $x$. We now fix the weights in layer 1 and repeat the same process with layer 2, reconstructing $h_1$ (as $\hat{h}_1$) from $h_2$:

$$\min \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{h}_{1,ij} - h_{1,ij})^2$$

At the end of this step, the weights in layer 2 are trained such that $h_2$ captures an abstract representation of $h_1$. We continue this process till the last hidden layer (i.e., the layer before the output layer) so that each successive layer captures an abstract representation of the previous layer.
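The layer-by-layer procedure can be sketched in NumPy roughly as below. This is a minimal illustration under stated assumptions, not the setup used in the cited papers: each layer is an autoencoder with a sigmoid encoder and a linear decoder, trained by full-batch gradient descent on the reconstruction objective, and the function names, layer sizes, learning rate and epoch count are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, epochs=300, lr=0.1, seed=0):
    """Train one autoencoder on X (m x n); return encoder (W, b) and the loss history."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.normal(0.0, 0.1, size=(n, n_hidden))   # encoder weights (kept)
    b = np.zeros(n_hidden)
    V = rng.normal(0.0, 0.1, size=(n_hidden, n))   # decoder weights (discarded later)
    c = np.zeros(n)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W + b)                     # hidden representation
        X_hat = H @ V + c                          # reconstruction of the input
        err = X_hat - X
        losses.append(np.sum(err ** 2) / m)        # (1/m) sum_i sum_j (x_hat_ij - x_ij)^2
        d_out = 2.0 * err / m                      # gradient w.r.t. X_hat
        gV = H.T @ d_out
        gc = d_out.sum(axis=0)
        d_hid = (d_out @ V.T) * H * (1.0 - H)      # backprop through the sigmoid
        gW = X.T @ d_hid
        gb = d_hid.sum(axis=0)
        W -= lr * gW
        b -= lr * gb
        V -= lr * gV
        c -= lr * gc
    return (W, b), losses

def greedy_pretrain(X, layer_sizes):
    """Pre-train one layer at a time; earlier layers stay fixed once trained."""
    params, all_losses = [], []
    inp = X
    for k, size in enumerate(layer_sizes):
        (W, b), losses = train_autoencoder(inp, size, seed=k)
        params.append((W, b))
        all_losses.append(losses)
        inp = sigmoid(inp @ W + b)                 # frozen output feeds the next layer
    return params, all_losses
```

A call such as `greedy_pretrain(X, [5, 3])` on an `m x 8` data matrix would return encoder weights of shapes `(8, 5)` and `(5, 3)`, which could then serve as the initial weights of the first two layers of a supervised network.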
After this layerwise pre-training, we add the output layer and train the whole network using the task-specific objective:

$$\min_{\theta} \frac{1}{m} \sum_{i=1}^{m} (y_i - f(x_i))^2$$

Note that, in effect, we have initialized the weights of the network using the greedy unsupervised objective and are now fine-tuning these weights using the supervised objective.
Why does this work better? Is it because of better optimization? Is it because of better regularization? Let's see what these two questions mean and try to answer them based on some (among many) existing studies [1, 2].

[1] The difficulty of training deep architectures and effect of unsupervised pre-training, Erhan et al., 2009
[2] Exploring Strategies for Training Deep Neural Networks, Larochelle et al., 2009
Is it because of better optimization?
What is the optimization problem that we are trying to solve?

$$\text{minimize} \quad \mathcal{L}(\theta) = \frac{1}{m} \sum_{i=1}^{m} (y_i - f(x_i))^2$$

Is it the case that in the absence of unsupervised pre-training we are not able to drive $\mathcal{L}(\theta)$ to 0 even for the training data (hence poor optimization)? Let us see this in more detail.
The error surface of the supervised objective of a Deep Neural Network is highly non-convex, with many hills, plateaus and valleys. Given the large capacity of DNNs, it is still easy to land in one of these 0-error regions. Indeed, Larochelle et al. [1] show that if the last layer has large capacity then $\mathcal{L}(\theta)$ goes to 0 even without pre-training. However, if the capacity of the network is small, unsupervised pre-training helps.

[1] Exploring Strategies for Training Deep Neural Networks, Larochelle et al., 2009
Is it because of better regularization?
What does regularization do? It constrains the weights to certain regions of the parameter space.
- L-1 regularization: constrains most weights to be 0
- L-2 regularization: prevents most weights from taking large values

[1] Image source: The Elements of Statistical Learning, T. Hastie, R. Tibshirani, and J. Friedman, p. 71
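To make the contrast between the two penalties concrete, here is a small NumPy sketch of one update step under each. It is an illustration under assumptions: the L-2 case uses a plain gradient step on $\mathcal{L}(w) + \frac{\lambda}{2}\|w\|^2$, and the L-1 case uses a proximal (soft-thresholding) step, which is one common way to optimize an L-1 penalty and is not something the slides prescribe. The function names and numbers are hypothetical.

```python
import numpy as np

def l2_step(w, grad, lr, lam):
    # Gradient step on L(w) + (lam / 2) * ||w||^2: every weight shrinks proportionally
    return w - lr * (grad + lam * w)

def l1_step(w, grad, lr, lam):
    # Proximal step on L(w) + lam * ||w||_1: weights within lr * lam of zero become exactly 0
    z = w - lr * grad
    return np.sign(z) * np.maximum(np.abs(z) - lr * lam, 0.0)

# Illustrative weights; the data gradient is set to zero to isolate the penalty's effect
w = np.array([0.05, -0.02, 1.0, -2.0])
grad = np.zeros_like(w)

w_l2 = l2_step(w, grad, lr=0.1, lam=1.0)   # all weights scaled by 0.9
w_l1 = l1_step(w, grad, lr=0.1, lam=1.0)   # the two small weights are zeroed
```

With these numbers the L-2 step leaves all four weights nonzero but smaller, while the L-1 step sets the two small weights to exactly zero and shifts the large ones by a constant amount.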
Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space. Specifically, it constrains the weights to lie in regions where the characteristics of the data are captured well (as governed by the unsupervised objective).

Unsupervised objective:

$$\Omega(\theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \hat{x}_{ij})^2$$

We can think of this unsupervised objective as an additional constraint on the optimization problem.

Supervised objective:

$$\mathcal{L}(\theta) = \frac{1}{m} \sum_{i=1}^{m} (y_i - f(x_i))^2$$

This unsupervised objective ensures that the learning is not greedy w.r.t. the supervised objective (and also satisfies the unsupervised objective).
Some other experiments [1] have also shown that pre-training is more robust to random initializations. One accepted hypothesis is that pre-training leads to better weight initializations (so that the layers capture the internal characteristics of the data).

[1] The difficulty of training deep architectures and effect of unsupervised pre-training, Erhan et al., 2009
So what has happened since 2006-2009?
Deep Learning has evolved:
- Better optimization algorithms
- Better regularization methods
- Better activation functions
- Better weight initialization strategies
Module 9.3 : Better activation functions
Before we look at activation functions, let's try to answer the following question: "What makes Deep Neural Networks powerful?"