CS7015 (Deep Learning): Lecture 4
Feedforward Neural Networks, Backpropagation
Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras


  1. CS7015 (Deep Learning): Lecture 4: Feedforward Neural Networks, Backpropagation. Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras.

  2. References/Acknowledgments: see the excellent videos by Hugo Larochelle on Backpropagation.

  3. Module 4.1: Feedforward Neural Networks (a.k.a. multilayered network of neurons).

  4. The input to the network is an $n$-dimensional vector. The network contains $L-1$ hidden layers (2, in this case) having $n$ neurons each. Finally, there is one output layer containing $k$ neurons (say, corresponding to $k$ classes). Each neuron in the hidden layers and the output layer can be split into two parts: pre-activation and activation ($a_i$ and $h_i$ are vectors). The input layer can be called the 0-th layer and the output layer can be called the $L$-th layer. $W_i \in \mathbb{R}^{n \times n}$ and $b_i \in \mathbb{R}^{n}$ are the weight and bias between layers $i-1$ and $i$ $(0 < i < L)$. $W_L \in \mathbb{R}^{n \times k}$ and $b_L \in \mathbb{R}^{k}$ are the weight and bias between the last hidden layer and the output layer ($L = 3$ in this case).
     (Figure: a network with inputs $x_1, x_2, \dots, x_n$, pre-activations $a_1, a_2, a_3$, activations $h_1, h_2$, weights $W_1, W_2, W_3$, biases $b_1, b_2, b_3$, and output $h_L = \hat{y} = f(x)$.)
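
To make these dimensions concrete, here is a minimal NumPy sketch that allocates parameters with exactly the shapes described above; the sizes n = 3, k = 4, L = 3 are illustrative values chosen for the example, not prescribed by the lecture.

```python
import numpy as np

# Illustrative sizes (not prescribed by the lecture): n inputs and hidden units, k outputs, L layers.
n, k, L = 3, 4, 3
rng = np.random.default_rng(0)

# W_i in R^{n x n} and b_i in R^n between layers i-1 and i, for 0 < i < L.
W = {i: rng.standard_normal((n, n)) for i in range(1, L)}
b = {i: np.zeros(n) for i in range(1, L)}

# W_L in R^{n x k} and b_L in R^k between the last hidden layer and the output layer.
W[L] = rng.standard_normal((n, k))
b[L] = np.zeros(k)

for i in range(1, L + 1):
    print(f"W_{i}: {W[i].shape}  b_{i}: {b[i].shape}")
```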

  5. The pre-activation at layer $i$ is given by
       $a_i(x) = b_i + W_i h_{i-1}(x)$
     The activation at layer $i$ is given by
       $h_i(x) = g(a_i(x))$
     where $g$ is called the activation function (for example, logistic, tanh, linear, etc.). The activation at the output layer is given by
       $f(x) = h_L(x) = O(a_L(x))$
     where $O$ is the output activation function (for example, softmax, linear, etc.). To simplify notation we will refer to $a_i(x)$ as $a_i$ and $h_i(x)$ as $h_i$.

  6. The pre-activation at layer $i$ is given by
       $a_i = b_i + W_i h_{i-1}$
     The activation at layer $i$ is given by
       $h_i = g(a_i)$
     where $g$ is called the activation function (for example, logistic, tanh, linear, etc.). The activation at the output layer is given by
       $f(x) = h_L = O(a_L)$
     where $O$ is the output activation function (for example, softmax, linear, etc.).
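
As a sketch of this forward pass in NumPy, assuming a logistic $g$ and a linear $O$ (either could be swapped for the other choices listed above); the shapes follow the earlier slides, and $W_L$ is applied with a transpose so that $a_L$ lands in $\mathbb{R}^k$ given the $n \times k$ convention for $W_L$. All concrete numbers are illustrative.

```python
import numpy as np

def logistic(a):
    # One possible activation function g: g(a) = 1 / (1 + exp(-a)).
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W, b, L):
    """Compute f(x) = h_L for an L-layer network with parameters W, b."""
    h = x
    for i in range(1, L):
        a = b[i] + W[i] @ h     # pre-activation: a_i = b_i + W_i h_{i-1}
        h = logistic(a)         # activation:     h_i = g(a_i)
    a_L = b[L] + W[L].T @ h     # W_L is stored as n x k, hence the transpose
    return a_L                  # linear output:  O(a_L) = a_L

# Illustrative usage with randomly initialized parameters (n = 3, k = 4, L = 3).
n, k, L = 3, 4, 3
rng = np.random.default_rng(0)
W = {i: rng.standard_normal((n, n)) for i in range(1, L)}
b = {i: np.zeros(n) for i in range(1, L)}
W[L], b[L] = rng.standard_normal((n, k)), np.zeros(k)

x = rng.standard_normal(n)
print(forward(x, W, b, L))      # a vector in R^k
```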

  7. Data: $\{x_i, y_i\}_{i=1}^{N}$
     Model:
       $\hat{y}_i = f(x_i) = O(W_3\, g(W_2\, g(W_1 x + b_1) + b_2) + b_3)$
     Parameters: $\theta = W_1, \dots, W_L, b_1, b_2, \dots, b_L$ $(L = 3)$
     Algorithm: gradient descent with backpropagation (we will see soon)
     Objective/Loss/Error function: say,
       $\min \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k} (\hat{y}_{ij} - y_{ij})^2$
     In general, $\min \mathscr{L}(\theta)$, where $\mathscr{L}(\theta)$ is some function of the parameters.
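
A short sketch of this squared error objective, assuming the predictions have already been collected into an N x k array (for instance via a forward pass like the one sketched earlier); the data here is random and purely illustrative.

```python
import numpy as np

def squared_error_loss(Y_hat, Y):
    # L(theta) = (1/N) * sum_i sum_j (y_hat_ij - y_ij)^2
    N = Y.shape[0]
    return np.sum((Y_hat - Y) ** 2) / N

# Illustrative usage: N examples with k-dimensional targets (random data).
rng = np.random.default_rng(0)
N, k = 5, 4
Y = rng.standard_normal((N, k))
Y_hat = rng.standard_normal((N, k))   # in practice, row i would be f(x_i)
print(squared_error_loss(Y_hat, Y))
```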

  8. Module 4.2: Learning Parameters of Feedforward Neural Networks (Intuition).

  9. The story so far: We have introduced feedforward neural networks. We are now interested in finding an algorithm for learning the parameters of this model.

  10. Recall our gradient descent algorithm.
      Algorithm: gradient_descent()
        t ← 0;
        max_iterations ← 1000;
        Initialize w_0, b_0;
        while t++ < max_iterations do
          w_{t+1} ← w_t − η∇w_t;
          b_{t+1} ← b_t − η∇b_t;
        end

  11. Recall our gradient descent algorithm. We can write it more concisely as
      Algorithm: gradient_descent()
        t ← 0;
        max_iterations ← 1000;
        Initialize θ_0 = [w_0, b_0];
        while t++ < max_iterations do
          θ_{t+1} ← θ_t − η∇θ_t;
        end
      where $\nabla\theta_t = \left[\frac{\partial \mathscr{L}(\theta)}{\partial w_t}, \frac{\partial \mathscr{L}(\theta)}{\partial b_t}\right]^T$.
      Now, in this feedforward neural network, instead of $\theta = [w, b]$ we have $\theta = [W_1, W_2, \dots, W_L, b_1, b_2, \dots, b_L]$. We can still use the same algorithm for learning the parameters of our model.

  12. Recall our gradient descent algorithm. We can write it more concisely as
      Algorithm: gradient_descent()
        t ← 0;
        max_iterations ← 1000;
        Initialize θ_0 = [W_1^0, ..., W_L^0, b_1^0, ..., b_L^0];
        while t++ < max_iterations do
          θ_{t+1} ← θ_t − η∇θ_t;
        end
      where $\nabla\theta_t = \left[\frac{\partial \mathscr{L}(\theta)}{\partial W_{1,t}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial W_{L,t}}, \frac{\partial \mathscr{L}(\theta)}{\partial b_{1,t}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial b_{L,t}}\right]^T$.
      Now, in this feedforward neural network, instead of $\theta = [w, b]$ we have $\theta = [W_1, W_2, \dots, W_L, b_1, b_2, \dots, b_L]$. We can still use the same algorithm for learning the parameters of our model.
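
A minimal sketch of this generic update loop with θ held as a list of NumPy arrays. The gradient computation is passed in as a function because the actual ∇θ for the network comes from backpropagation, which the lecture develops later; the toy objective used below is only there to show that the loop runs.

```python
import numpy as np

def gradient_descent(theta, grad_fn, eta=0.1, max_iterations=1000):
    """theta:   list of parameter arrays, e.g. [W_1, ..., W_L, b_1, ..., b_L]
    grad_fn: function returning the matching list of gradients (backpropagation)."""
    for t in range(max_iterations):
        grads = grad_fn(theta)                                # nabla theta_t
        theta = [p - eta * g for p, g in zip(theta, grads)]   # theta_{t+1} = theta_t - eta * nabla theta_t
    return theta

# Toy illustration (not the network's loss): minimize ||p||^2 for a single parameter array.
theta0 = [np.array([3.0, -2.0])]
toy_grad = lambda theta: [2.0 * theta[0]]                     # d/dp ||p||^2 = 2p
print(gradient_descent(theta0, toy_grad, eta=0.1, max_iterations=100))   # close to [0, 0]
```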

  13. Except that now our $\nabla\theta$ looks much more nasty:
      $\nabla\theta = \begin{bmatrix}
        \frac{\partial \mathscr{L}(\theta)}{\partial W_{111}} & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{11n}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{211}} & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{21n}} & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{L,11}} & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{L,1k}} & \frac{\partial \mathscr{L}(\theta)}{\partial b_{11}} & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial b_{L1}} \\
        \frac{\partial \mathscr{L}(\theta)}{\partial W_{121}} & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{12n}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{221}} & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{22n}} & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{L,21}} & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{L,2k}} & \frac{\partial \mathscr{L}(\theta)}{\partial b_{12}} & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial b_{L2}} \\
        \vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots & \vdots & & \vdots \\
        \frac{\partial \mathscr{L}(\theta)}{\partial W_{1n1}} & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{1nn}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{2n1}} & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{2nn}} & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{L,n1}} & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{L,nk}} & \frac{\partial \mathscr{L}(\theta)}{\partial b_{1n}} & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial b_{Lk}}
      \end{bmatrix}$
      $\nabla\theta$ is thus composed of $\nabla W_1, \nabla W_2, \dots, \nabla W_{L-1} \in \mathbb{R}^{n \times n}$, $\nabla W_L \in \mathbb{R}^{n \times k}$, $\nabla b_1, \nabla b_2, \dots, \nabla b_{L-1} \in \mathbb{R}^{n}$ and $\nabla b_L \in \mathbb{R}^{k}$.

  14. We need to answer two questions. How to choose the loss function $\mathscr{L}(\theta)$? How to compute $\nabla\theta$, which is composed of $\nabla W_1, \nabla W_2, \dots, \nabla W_{L-1} \in \mathbb{R}^{n \times n}$, $\nabla W_L \in \mathbb{R}^{n \times k}$, $\nabla b_1, \nabla b_2, \dots, \nabla b_{L-1} \in \mathbb{R}^{n}$ and $\nabla b_L \in \mathbb{R}^{k}$?

  15. Module 4.3: Output Functions and Loss Functions.

  16. We need to answer two questions. How to choose the loss function $\mathscr{L}(\theta)$? How to compute $\nabla\theta$, which is composed of $\nabla W_1, \nabla W_2, \dots, \nabla W_{L-1} \in \mathbb{R}^{n \times n}$, $\nabla W_L \in \mathbb{R}^{n \times k}$, $\nabla b_1, \nabla b_2, \dots, \nabla b_{L-1} \in \mathbb{R}^{n}$ and $\nabla b_L \in \mathbb{R}^{k}$?

  17. The choice of loss function depends on the problem at hand. We will illustrate this with the help of two examples. Consider our movie example again, but this time we are interested in predicting ratings (the imdb rating, the Critics rating, and the RT rating). Here $y_i \in \mathbb{R}^{3}$, for example $y_i = \{7.5, 8.2, 7.7\}$. The loss function should capture how much $\hat{y}_i$ deviates from $y_i$. If $y_i \in \mathbb{R}^{n}$ then the squared error loss can capture this deviation:
       $\mathscr{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{3} (\hat{y}_{ij} - y_{ij})^2$
     (Figure: a neural network with $L-1$ hidden layers taking the movie features $x_i$, such as isActor Damon and isDirector Nolan, as input and producing the three ratings as output.)
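
A tiny worked instance of the inner term of this loss for a single movie; only the target ratings [7.5, 8.2, 7.7] come from the slide, and the prediction is a made-up value for illustration.

```python
import numpy as np

y_i = np.array([7.5, 8.2, 7.7])       # true (imdb, Critics, RT) ratings from the slide
y_hat_i = np.array([7.0, 8.0, 8.0])   # a hypothetical prediction

# Per-example squared error: sum_j (y_hat_ij - y_ij)^2
print(np.sum((y_hat_i - y_i) ** 2))   # 0.25 + 0.04 + 0.09, i.e. about 0.38
```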

  18. A related question: what should the output function $O$ be if $y_i \in \mathbb{R}$? More specifically, can it be the logistic function? No, because it restricts $\hat{y}_i$ to a value between 0 and 1, but we want $\hat{y}_i \in \mathbb{R}$. So, in such cases it makes sense to have $O$ as a linear function:
       $f(x) = h_L = O(a_L) = W_O a_L + b_O$
     $\hat{y}_i = f(x_i)$ is no longer bounded between 0 and 1.
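
A small numeric sketch of why the logistic function is the wrong output here: it squashes any $a_L$ into (0, 1), whereas a linear output of the form $W_O a_L + b_O$ can reach arbitrary real values such as the ratings above. The numbers and the weights $W_O$, $b_O$ are illustrative.

```python
import numpy as np

a_L = np.array([4.2])                         # some pre-activation value (illustrative)

logistic_out = 1.0 / (1.0 + np.exp(-a_L))     # always inside (0, 1): cannot represent y_i = 7.5
W_O, b_O = np.array([[2.0]]), np.array([1.5])
linear_out = W_O @ a_L + b_O                  # O(a_L) = W_O a_L + b_O: unbounded

print(logistic_out)   # about 0.985, stuck inside (0, 1)
print(linear_out)     # 9.9, can be any real number
```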

  19. Intentionally left blank.

  20. Now let us consider another problem for which a different loss function would be appropriate. Suppose we want to classify an image into 1 of $k$ classes, say Apple, Mango, Orange, or Banana, with $y = [1\ 0\ 0\ 0]$. Here again we could use the squared error loss to capture the deviation. But can you think of a better function?
     (Figure: a neural network with $L-1$ hidden layers taking the image as input and producing the four class outputs.)

  21. Notice that $y$ is a probability distribution. Therefore we should also ensure that $\hat{y}$ is a probability distribution. What choice of the output activation $O$ will ensure this?
       $a_L = W_L h_{L-1} + b_L$
       $\hat{y}_j = O(a_L)_j = \frac{e^{a_{L,j}}}{\sum_{i=1}^{k} e^{a_{L,i}}}$
     $O(a_L)_j$ is the $j$-th element of $\hat{y}$ and $a_{L,j}$ is the $j$-th element of the vector $a_L$. This function is called the softmax function.
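
A minimal NumPy sketch of the softmax output function defined above; subtracting the maximum before exponentiating is a standard numerical-stability trick that leaves the result unchanged. The class scores are made up for illustration.

```python
import numpy as np

def softmax(a_L):
    # y_hat_j = exp(a_{L,j}) / sum_i exp(a_{L,i})
    e = np.exp(a_L - np.max(a_L))   # shift by the max for numerical stability
    return e / np.sum(e)

# Illustrative scores for the four classes (Apple, Mango, Orange, Banana).
a_L = np.array([3.0, 1.0, 0.5, -1.0])
y_hat = softmax(a_L)
print(y_hat)          # each entry lies in (0, 1)
print(y_hat.sum())    # 1.0, so y_hat is a valid probability distribution
```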
