CS7015 (Deep Learning) : Lecture 5


  1. CS7015 (Deep Learning) : Lecture 5 Gradient Descent (GD), Momentum Based GD, Nesterov Accelerated GD, Stochastic GD, AdaGrad, RMSProp, Adam Mitesh M. Khapra Department of Computer Science and Engineering Indian Institute of Technology Madras 1 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  2. Acknowledgements: For most of the lecture, I have borrowed ideas from the videos by Ryan Harris on "visualize backpropagation" (available on YouTube). Some content is based on the course CS231n by Andrej Karpathy and others (http://cs231n.stanford.edu/2016/). 2 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  3. Module 5.1: Learning Parameters : Infeasible (Guess Work) 3 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  4. [Figure: a single sigmoid neuron with input x, bias 1, and output y = f(x)] Input for training: {x_i, y_i}_{i=1}^N → N pairs of (x, y). Model: f(x) = 1 / (1 + e^(−(w·x + b))). Training objective: find w and b such that L(w, b) = Σ_{i=1}^N (y_i − f(x_i))² is minimized over (w, b). What does it mean to train the network? Suppose we train the network with (x, y) = (0.5, 0.2) and (2.5, 0.9). At the end of training we expect to find w*, b* such that: f(0.5) → 0.2 and f(2.5) → 0.9. 4 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
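To make this setup concrete, here is a minimal sketch (not from the slides; the function names `f` and `L` and the use of plain Python are my own choices) of the single sigmoid neuron and the sum-of-squared-errors objective:

```python
import math

def f(x, w, b):
    # Sigmoid neuron: f(x) = 1 / (1 + e^(-(w*x + b)))
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def L(w, b, data):
    # Training objective from this slide: sum of squared errors over all pairs (x_i, y_i)
    return sum((y - f(x, w, b)) ** 2 for x, y in data)

# The two training points used throughout the lecture
data = [(0.5, 0.2), (2.5, 0.9)]
```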

  5. [Figure: the same sigmoid neuron] In other words... we hope to find a sigmoid function f(x) = 1 / (1 + e^(−(w·x + b))) such that (0.5, 0.2) and (2.5, 0.9) lie on this sigmoid. 5 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  6. Let us see this in more detail.... 6 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  7. Can we try to find such a w*, b* manually? Let us try a random guess (say, w = 0.5, b = 0). Clearly not good, but how bad is it? Let us revisit L(w, b) to see how bad it is:
      L(w, b) = (1/2) · Σ_{i=1}^N (y_i − f(x_i))²
              = (1/2) · ((y_1 − f(x_1))² + (y_2 − f(x_2))²)
              = (1/2) · ((0.9 − f(2.5))² + (0.2 − f(0.5))²)
              = 0.073
      We want L(w, b) to be as close to 0 as possible. 7 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
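As a quick check of the arithmetic on this slide, a small sketch (helper names are mine) that evaluates the ½-scaled loss used here at the guess w = 0.5, b = 0:

```python
import math

def f(x, w, b):
    # Sigmoid: 1 / (1 + e^(-(w*x + b)))
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b, data):
    # L(w, b) = 1/2 * sum_i (y_i - f(x_i))^2, as on this slide
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in data)

data = [(0.5, 0.2), (2.5, 0.9)]
print(round(loss(0.5, 0.0, data), 3))  # ~0.073, matching the slide
```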

  8. Let us try some other values of w, b:
         w       b      L(w, b)
         0.50    0.00   0.0730
        -0.10    0.00   0.1481
         0.94   -0.94   0.0214
         1.42   -1.73   0.0028
         1.65   -2.08   0.0003
         1.78   -2.27   0.0000
      Oops!! the second guess made things even worse... Perhaps it would help to push w and b in the other direction (as the remaining rows show)... 8 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  9. Let us look at something better than our “guess work” algorithm.... 9 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  10. Since we have only 2 points and 2 parameters (w, b), we can easily plot L(w, b) for different values of (w, b) and pick the one where L(w, b) is minimum. But of course this becomes intractable once you have many more data points and many more parameters!! Further, even here we have plotted the error surface only for a small range of (w, b) [from (−6, 6) and not from (−∞, ∞)]. 10 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
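A brute-force version of this idea, sketched under my own naming and an arbitrary grid resolution, would simply sweep (w, b) over [−6, 6] and keep the pair with the lowest loss; the nested loop makes the intractability for more parameters obvious:

```python
import math

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b, data):
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in data)

data = [(0.5, 0.2), (2.5, 0.9)]

best = None
steps = 200  # grid resolution; finer grids cost quadratically more work
for i in range(steps + 1):
    for j in range(steps + 1):
        w = -6 + 12 * i / steps
        b = -6 + 12 * j / steps
        current = loss(w, b, data)
        if best is None or current < best[0]:
            best = (current, w, b)

print(best)  # (lowest loss found on this grid, w, b)
```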

  11. Let us look at the geometric interpretation of our “guess work” algorithm in terms of this error surface 11 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  12.–18. [Figures: successive guesses from the "guess work" procedure plotted on the error surface] 12–18 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  19. Module 5.2: Learning Parameters : Gradient Descent 19 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  20. Now let’s see if there is a more efficient and principled way of doing this 20 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  21. Goal Find a better way of traversing the error surface so that we can reach the minimum value quickly without resorting to brute force search! 21 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  22. Let θ = [w, b] be the vector of parameters, say, randomly initialized, and let Δθ = [Δw, Δb] be the change in the values of w, b. We move in the direction of Δθ. Let us be a bit conservative: move only by a small amount η, i.e., θ_new = θ + η · Δθ. Question: What is the right Δθ to use? The answer comes from Taylor series. 22 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  23. For ease of notation, let Δθ = u. Then from the Taylor series we have:
      L(θ + ηu) = L(θ) + η · u^T ∇L(θ) + (η²/2!) · u^T ∇²L(θ) u + (η³/3!) · ... + (η⁴/4!) · ...
                ≈ L(θ) + η · u^T ∇L(θ)   [η is typically small, so η², η³, ... → 0]
      Note that the move (ηu) would be favorable only if L(θ + ηu) − L(θ) < 0 [i.e., if the new loss is less than the previous loss]. This implies u^T ∇L(θ) < 0. 23 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
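A quick numerical sanity check of this condition (a sketch; the helper names, the finite-difference gradient, and the chosen η are my own): stepping along u = −∇L(θ) makes u^T ∇L(θ) negative and, for small η, the loss indeed decreases.

```python
import math

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b, data):
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in data)

def num_grad(w, b, data, eps=1e-6):
    # Finite-difference estimate of (dL/dw, dL/db)
    gw = (loss(w + eps, b, data) - loss(w - eps, b, data)) / (2 * eps)
    gb = (loss(w, b + eps, data) - loss(w, b - eps, data)) / (2 * eps)
    return gw, gb

data = [(0.5, 0.2), (2.5, 0.9)]
w, b, eta = 0.5, 0.0, 0.1
gw, gb = num_grad(w, b, data)
u = (-gw, -gb)                                    # step direction: opposite to the gradient
print(u[0] * gw + u[1] * gb < 0)                  # u^T grad L(theta) < 0 -> True
print(loss(w + eta * u[0], b + eta * u[1], data) < loss(w, b, data))  # loss decreases -> True
```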

  24. Okay, so we have u^T ∇L(θ) < 0. But what is the range of u^T ∇L(θ)? Let's see... Let β be the angle between u and ∇L(θ). Then we know that:
      −1 ≤ cos(β) = u^T ∇L(θ) / (||u|| · ||∇L(θ)||) ≤ 1
      Multiplying throughout by k = ||u|| · ||∇L(θ)||:
      −k ≤ k · cos(β) = u^T ∇L(θ) ≤ k
      Thus, L(θ + ηu) − L(θ) ≈ η · u^T ∇L(θ) = η · k · cos(β) will be most negative when cos(β) = −1, i.e., when β is 180°. 24 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  25. Gradient Descent Rule: the direction u that we intend to move in should be at 180° w.r.t. the gradient. In other words, move in a direction opposite to the gradient. Parameter Update Equations:
      w_{t+1} = w_t − η∇w_t
      b_{t+1} = b_t − η∇b_t
      where ∇w_t = ∂L(w, b)/∂w evaluated at w = w_t, b = b_t, and ∇b_t = ∂L(w, b)/∂b evaluated at w = w_t, b = b_t.
      So we now have a more principled way of moving in the w-b plane than our "guess work" algorithm. 25 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  26. Let's create an algorithm from this rule...
      Algorithm 1: gradient_descent()
        t ← 0;
        max_iterations ← 1000;
        while t < max_iterations do
          w_{t+1} ← w_t − η∇w_t;
          b_{t+1} ← b_t − η∇b_t;
          t ← t + 1;
        end
      To see this algorithm in practice let us first derive ∇w and ∇b for our toy neural network. 26 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  27. [Figure: the same sigmoid neuron] Let's assume there is only 1 point (x, y) to fit. With f(x) = 1 / (1 + e^(−(w·x + b))):
      L(w, b) = (1/2) · (f(x) − y)²
      ∇w = ∂L(w, b)/∂w = ∂/∂w [(1/2) · (f(x) − y)²]
      27 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  28. ∇w = ∂/∂w [(1/2) · (f(x) − y)²]
         = (1/2) · [2 · (f(x) − y) · ∂/∂w (f(x) − y)]
         = (f(x) − y) · ∂/∂w (f(x))
         = (f(x) − y) · f(x) · (1 − f(x)) · x
      where the last step uses:
      ∂/∂w (f(x)) = ∂/∂w [1 / (1 + e^(−(wx + b)))]
                  = (−1 / (1 + e^(−(wx + b)))²) · ∂/∂w (e^(−(wx + b)))
                  = (−1 / (1 + e^(−(wx + b)))²) · e^(−(wx + b)) · ∂/∂w (−(wx + b))
                  = (−1 / (1 + e^(−(wx + b)))²) · e^(−(wx + b)) · (−x)
                  = [1 / (1 + e^(−(wx + b)))] · [e^(−(wx + b)) / (1 + e^(−(wx + b)))] · x
                  = f(x) · (1 − f(x)) · x
      28 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
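One way to gain confidence in this derivation is a finite-difference gradient check; the sketch below uses my own helper names and a sample point from the training data, and is not part of the slides.

```python
import math

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b, x, y):
    # Single-point loss, as on this slide
    return 0.5 * (f(x, w, b) - y) ** 2

def grad_w(w, b, x, y):
    # Analytical gradient from the derivation above
    fx = f(x, w, b)
    return (fx - y) * fx * (1 - fx) * x

def num_grad_w(w, b, x, y, eps=1e-6):
    # Numerical gradient via central differences
    return (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)

w, b, x, y = 0.5, 0.0, 2.5, 0.9
print(grad_w(w, b, x, y), num_grad_w(w, b, x, y))  # the two values should agree closely
```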

  29. [Figure: the same sigmoid neuron] So if there is only 1 point (x, y), we have:
      ∇w = (f(x) − y) · f(x) · (1 − f(x)) · x
      For two points:
      ∇w = Σ_{i=1}^2 (f(x_i) − y_i) · f(x_i) · (1 − f(x_i)) · x_i
      ∇b = Σ_{i=1}^2 (f(x_i) − y_i) · f(x_i) · (1 − f(x_i))
      29 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
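Putting the update rule of Algorithm 1 together with these gradients, here is a minimal runnable sketch for the two training points; the function names, the initialisation (−2, −2), η = 1.0, and the iteration count are my own choices, not prescribed by the slides.

```python
import math

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def error(w, b, data):
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in data)

def do_gradient_descent(data, eta=1.0, max_iterations=1000):
    w, b = -2.0, -2.0  # arbitrary initialisation
    for t in range(max_iterations):
        # Gradients summed over the two training points, as derived on this slide
        dw = sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) * x for x, y in data)
        db = sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) for x, y in data)
        w, b = w - eta * dw, b - eta * db  # gradient descent update
    return w, b

data = [(0.5, 0.2), (2.5, 0.9)]
w, b = do_gradient_descent(data)
print(w, b, error(w, b, data))  # the loss should be close to 0
```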

  30. [Figure: gradient descent traversal on the error surface of the toy network] 30 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  31. [Figure: plot of f(x) = x² + 1 with a steep segment (Δy₁/Δx₁) and a gentle segment (Δy₂/Δx₂)] When the curve is steep the gradient (Δy₁/Δx₁) is large. When the curve is gentle the gradient (Δy₂/Δx₂) is small. Recall that our weight updates are proportional to the gradient: w = w − η∇w. Hence in the areas where the curve is gentle the updates are small, whereas in the areas where the curve is steep the updates are large. 31 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
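For instance, since the gradient of f(x) = x² + 1 is 2x, a tiny sketch (the sample points and η are my own) makes the steep-versus-gentle contrast concrete:

```python
def df(x):
    # Gradient of f(x) = x^2 + 1 is 2x
    return 2 * x

eta = 0.1
for x in (3.0, 0.3):              # a steep region vs. a gentle region of the curve
    print(x, df(x), eta * df(x))  # larger gradient -> larger update step eta * grad
```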

  32. Let's see what happens when we start from a different point 32 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  33. Irrespective of where we start from, once we hit a surface which has a gentle slope, the progress slows down 33 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  34. Module 5.3 : Contours 34 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5

  35. Visualizing things in 3D can sometimes become a bit cumbersome. Can we do a 2D visualization of this traversal along the error surface? Yes, let's take a look at something known as contours. 35 / 94 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
