  1. Back-Propagation 16-385 Computer Vision (Kris Kitani) Carnegie Mellon University

  2. back to the… World's Smallest Perceptron! [diagram: input x → weight w → f → output y]  y = wx (a.k.a. the line equation, linear regression): a function of ONE parameter!

  3. Training the world's smallest perceptron: this is just gradient descent, which means the update term should be the gradient of the loss function. Now, where does this come from?

  4. dL/dw is the rate at which this (the loss function L = ½(y − ŷ)²) will change per unit change of this (the weight parameter w, where ŷ = wx). Let's compute the derivative…

  5. Compute the derivative:
        dL/dw = d/dw [ ½(y − ŷ)² ]
              = −(y − ŷ) d(wx)/dw
              = −(y − ŷ) x
              = ∇w                      (just shorthand)
     That means the weight update for gradient descent is:
        w = w − ∇w                      (move in the direction of the negative gradient)
          = w + (y − ŷ) x
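A quick numeric sanity check of this closed-form derivative against a finite-difference estimate; the values of w, x, and y below are made up for illustration and are not from the slides.

```python
# Check dL/dw = -(y - y_hat) * x for L = 0.5 * (y - w*x)**2  (illustrative values)
w, x, y = 0.5, 2.0, 3.0
loss = lambda w_: 0.5 * (y - w_ * x) ** 2

analytic = -(y - w * x) * x                              # closed form from the slide
numeric = (loss(w + 1e-6) - loss(w - 1e-6)) / 2e-6       # finite-difference estimate
print(analytic, numeric)                                 # both approximately -4.0
```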

  6. Gradient Descent (world's smallest perceptron)
     For each sample {xᵢ, yᵢ}:
     1. Predict
        a. Forward pass:      ŷ = w xᵢ
        b. Compute loss:      Lᵢ = ½(yᵢ − ŷ)²
     2. Update
        a. Back-propagation:  dLᵢ/dw = −(yᵢ − ŷ) xᵢ = ∇w
        b. Gradient update:   w = w − ∇w
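A minimal Python sketch of this loop for the one-parameter perceptron ŷ = wx. The toy data and the learning rate lr are my own additions (the slides fold the step size in later as η).

```python
# World's smallest perceptron (y_hat = w * x), trained sample by sample.
samples = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]    # illustrative (x_i, y_i) pairs, roughly y = 2x
w = 0.0
lr = 0.1                                          # step size (eta on later slides)

for epoch in range(100):
    for x_i, y_i in samples:
        y_hat = w * x_i                           # 1a. forward pass
        loss = 0.5 * (y_i - y_hat) ** 2           # 1b. compute loss
        grad_w = -(y_i - y_hat) * x_i             # 2a. back-propagation: dL/dw
        w = w - lr * grad_w                       # 2b. gradient update

print(w)                                          # approaches ~2.0
```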

  7. Training the world’s smallest perceptron

  8. World's (second) smallest perceptron! [diagram: inputs x₁, x₂ → weights w₁, w₂ → output y]  a function of TWO parameters!

  9. Gradient Descent
     For each sample {xᵢ, yᵢ}:
     1. Predict
        a. Forward pass
        b. Compute loss
     2. Update
        a. Back-propagation
        b. Gradient update
     We just need to compute the partial derivatives for this network.

  10. Back-Propagation
      ∂L/∂w₁ = ∂/∂w₁ [ ½(y − ŷ)² ] = −(y − ŷ) ∂ŷ/∂w₁ = −(y − ŷ) ∂(Σᵢ wᵢ xᵢ)/∂w₁ = −(y − ŷ) ∂(w₁ x₁)/∂w₁ = −(y − ŷ) x₁ = ∇w₁
      ∂L/∂w₂ = ∂/∂w₂ [ ½(y − ŷ)² ] = −(y − ŷ) ∂ŷ/∂w₂ = −(y − ŷ) ∂(Σᵢ wᵢ xᵢ)/∂w₂ = −(y − ŷ) ∂(w₂ x₂)/∂w₂ = −(y − ŷ) x₂ = ∇w₂
      Why do we have partial derivatives now?

  11. Back-Propagation
      ∇w₁ = −(y − ŷ) x₁ and ∇w₂ = −(y − ŷ) x₂ (derived on the previous slide)
      Gradient update:
      w₁ = w₁ − η ∇w₁ = w₁ + η (y − ŷ) x₁
      w₂ = w₂ − η ∇w₂ = w₂ + η (y − ŷ) x₂

  12. Gradient Descent (stochastic, since the gradients are approximated from a single random sample)
      For each sample {xᵢ, yᵢ}:
      1. Predict
         a. Forward pass:      ŷ = f(xᵢ; θ)
         b. Compute loss:      Lᵢ = ½(yᵢ − ŷ)²
      2. Update
         a. Back-propagation (two BP lines now):
            ∇w₁ = −(yᵢ − ŷ) x₁ᵢ
            ∇w₂ = −(yᵢ − ŷ) x₂ᵢ
         b. Gradient update (η is an adjustable step size):
            w₁ = w₁ + η (yᵢ − ŷ) x₁ᵢ
            w₂ = w₂ + η (yᵢ − ŷ) x₂ᵢ
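A small sketch of this stochastic loop, assuming the linear model ŷ = w₁x₁ + w₂x₂ from slide 8; the toy data and the value of η are illustrative choices, not from the slides.

```python
import random

# Two-parameter perceptron: y_hat = w1*x1 + w2*x2 (toy targets generated from y = 3*x1 - 2*x2)
samples = [((1.0, 2.0), -1.0), ((2.0, 1.0), 4.0), ((0.5, 0.5), 0.5)]
w1, w2 = 0.0, 0.0
eta = 0.05                                        # adjustable step size

for step in range(2000):
    (x1, x2), y = random.choice(samples)          # stochastic: one random sample per update
    y_hat = w1 * x1 + w2 * x2                     # forward pass
    grad_w1 = -(y - y_hat) * x1                   # back-propagation (two BP lines now)
    grad_w2 = -(y - y_hat) * x2
    w1 -= eta * grad_w1                           # gradient update: w = w - eta * grad
    w2 -= eta * grad_w2

print(w1, w2)                                     # approaches ~3.0 and ~-2.0
```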

  13. We haven’t seen a lot of ‘propagation’ yet because our perceptrons only had one layer…

  14. Multi-layer perceptron: [diagram: input x → (w₁, b₁) → h₁ → (w₂) → h₂ → (w₃) → output y]  a function of FOUR parameters and FOUR layers!

  15. [network diagram: input x (layer 1) → weight w₁, bias b₁ → sum a₁ → activation f₁ (hidden layer 2) → weight w₂ → sum a₂ → activation f₂ (hidden layer 3) → weight w₃ → sum a₃ → activation f₃ → output y (layer 4)]

  17. [same network diagram]  a₁ = w₁ · x + b₁

  19. [same network diagram]  a₁ = w₁ · x + b₁,  a₂ = w₂ · f₁(w₁ · x + b₁)

  21. [same network diagram]  a₁ = w₁ · x + b₁,  a₂ = w₂ · f₁(w₁ · x + b₁),  a₃ = w₃ · f₂(w₂ · f₁(w₁ · x + b₁))

  23. [same network diagram]
      a₁ = w₁ · x + b₁
      a₂ = w₂ · f₁(w₁ · x + b₁)
      a₃ = w₃ · f₂(w₂ · f₁(w₁ · x + b₁))
      y  = f₃(w₃ · f₂(w₂ · f₁(w₁ · x + b₁)))
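A minimal Python sketch of this forward pass. The sigmoid activation is an assumption borrowed from slide 35 (the slides keep f₁, f₂, f₃ generic at this point), and the parameter values below are placeholders.

```python
import math

def sigmoid(a):
    # Assumed activation for f1, f2, f3 (introduced on slide 35)
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w1, b1, w2, w3):
    """Forward pass of the four-layer scalar MLP from the slides."""
    a1 = w1 * x + b1              # layer 2 weighted sum
    a2 = w2 * sigmoid(a1)         # layer 3 weighted sum: w2 * f1(a1)
    a3 = w3 * sigmoid(a2)         # layer 4 weighted sum: w3 * f2(a2)
    return sigmoid(a3)            # output: y = f3(a3)

print(forward(x=1.0, w1=0.5, b1=0.1, w2=-0.3, w3=0.8))   # placeholder parameter values
```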

  24. The entire network can be written out as one long equation: y = f₃(w₃ · f₂(w₂ · f₁(w₁ · x + b₁))). We need to train the network: what is known? What is unknown?

  25. The entire network can be written out as one long equation: y = f₃(w₃ · f₂(w₂ · f₁(w₁ · x + b₁))), where the input x and the target output y are known (the training data). We need to train the network: what is known? What is unknown?

  26. The entire network can be written out as one long equation: y = f₃(w₃ · f₂(w₂ · f₁(w₁ · x + b₁))), where the weights and bias are unknown (and the activation functions sometimes have parameters of their own). We need to train the network: what is known? What is unknown?

  27. Learning an MLP: given a set of samples {xᵢ, yᵢ} and an MLP y = f_MLP(x; θ), estimate the parameters of the MLP, θ = {f, w, b}.

  28. Stochastic Gradient Descent
      For each random sample {xᵢ, yᵢ}:
      1. Predict
         a. Forward pass:      ŷ = f_MLP(xᵢ; θ)
         b. Compute loss
      2. Update
         a. Back-propagation:  ∂L/∂θ   (vector of parameter partial derivatives)
         b. Gradient update:   θ ← θ − η ∇θ   (vector of parameter update equations)
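Step 2b written generically as an update over the whole parameter vector θ; a minimal sketch in which the parameter names, gradient values, and η are assumed for illustration.

```python
def sgd_step(theta, grads, eta=0.01):
    """One gradient update: theta <- theta - eta * dL/dtheta, applied per parameter."""
    return {name: value - eta * grads[name] for name, value in theta.items()}

# Hypothetical usage with the four parameters of this MLP
theta = {"w1": 0.4, "b1": 0.1, "w2": -0.6, "w3": 0.9}
grads = {"w1": 0.02, "b1": 0.01, "w2": -0.05, "w3": 0.03}   # would come from back-propagation
print(sgd_step(theta, grads))
```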

  29. So we need to compute the partial derivatives: ∂L/∂θ = [ ∂L/∂w₃   ∂L/∂w₂   ∂L/∂w₁   ∂L/∂b₁ ]

  30. Remember, the partial derivative ∂L/∂w₁ describes how this (the weight w₁, near the input) affects this (the loss layer, at the output). [network diagram: x → (w₁, b₁) → a₁ → f₁ → (w₂) → a₂ → f₂ → (w₃) → a₃ → f₃ → y]  So, how do you compute it?

  31. The Chain Rule

  32. According to the chain rule…
      ∂L/∂w₃ = (∂L/∂f₃) (∂f₃/∂a₃) (∂a₃/∂w₃)
      Intuitively, the effect of the weight on the loss function, ∂L/∂w₃:
      [diagram: rest of the network ··· → f₂ → (w₃) → a₃ → f₃ → ŷ → L(y, ŷ)]
      L depends on f₃ (∂L/∂f₃), which depends on a₃ (∂f₃/∂a₃), which depends on w₃ (∂a₃/∂w₃).

  33. [diagram: rest of the network ··· → f₂ → (w₃) → a₃ → f₃ → ŷ → L(y, ŷ)]
      Chain rule!  ∂L/∂w₃ = (∂L/∂f₃) (∂f₃/∂a₃) (∂a₃/∂w₃)

  34. [same diagram as slide 33]
      ∂L/∂w₃ = (∂L/∂f₃) (∂f₃/∂a₃) (∂a₃/∂w₃)
             = −(y − ŷ) (∂f₃/∂a₃) (∂a₃/∂w₃)
      The first factor is just the partial derivative of the L2 loss.

  35. [same diagram as slide 33]
      ∂L/∂w₃ = (∂L/∂f₃) (∂f₃/∂a₃) (∂a₃/∂w₃)
             = −(y − ŷ) (∂f₃/∂a₃) (∂a₃/∂w₃)
      Let's use a sigmoid activation function:  ds(x)/dx = s(x)(1 − s(x))

  36. [same diagram as slide 33]
      ∂L/∂w₃ = (∂L/∂f₃) (∂f₃/∂a₃) (∂a₃/∂w₃)
             = −(y − ŷ) (∂f₃/∂a₃) (∂a₃/∂w₃)
             = −(y − ŷ) f₃(1 − f₃) (∂a₃/∂w₃)        (sigmoid: ds(x)/dx = s(x)(1 − s(x)))

  37. [same diagram as slide 33]
      ∂L/∂w₃ = (∂L/∂f₃) (∂f₃/∂a₃) (∂a₃/∂w₃)
             = −(y − ŷ) (∂f₃/∂a₃) (∂a₃/∂w₃)
             = −(y − ŷ) f₃(1 − f₃) (∂a₃/∂w₃)
             = −(y − ŷ) f₃(1 − f₃) f₂
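A short numeric check of this closed form, assuming sigmoid activations throughout and the scalar forward pass from slide 23; the data and parameter values are placeholders, not from the slides.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Placeholder values
x, y = 1.0, 0.2
w1, b1, w2, w3 = 0.4, 0.1, -0.6, 0.9

# Forward pass (slide 23), keeping the intermediate activations
f1 = sigmoid(w1 * x + b1)
f2 = sigmoid(w2 * f1)
f3 = sigmoid(w3 * f2)                         # f3 is the prediction y_hat

# Closed form from this slide: dL/dw3 = -(y - y_hat) * f3 * (1 - f3) * f2
grad_w3 = -(y - f3) * f3 * (1 - f3) * f2

# Finite-difference estimate of dL/dw3 for comparison
loss = lambda w3_: 0.5 * (y - sigmoid(w3_ * f2)) ** 2
numeric = (loss(w3 + 1e-6) - loss(w3 - 1e-6)) / 2e-6
print(grad_w3, numeric)                       # should agree to several decimal places
```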

  38. [full network diagram: x → (w₁, b₁) → a₁ → f₁ → (w₂) → a₂ → f₂ → (w₃) → a₃ → f₃ → y]
      ∂L/∂w₂ = (∂L/∂f₃) (∂f₃/∂a₃) (∂a₃/∂f₂) (∂f₂/∂a₂) (∂a₂/∂w₂)

  39. [full network diagram: x → (w₁, b₁) → a₁ → f₁ → (w₂) → a₂ → f₂ → (w₃) → a₃ → f₃ → y]
      ∂L/∂w₂ = (∂L/∂f₃) (∂f₃/∂a₃) (∂a₃/∂f₂) (∂f₂/∂a₂) (∂a₂/∂w₂)
      The first two factors were already computed for ∂L/∂w₃: re-use them (propagate)!

  40. The chain rule says…
      [network diagram with a chain of "depends on" links: L depends on f₃, which depends on a₃, on f₂, on a₂, on f₁, on a₁, and finally on w₁ (and b₁)]
      ∂L/∂w₁ = (∂L/∂f₃) (∂f₃/∂a₃) (∂a₃/∂f₂) (∂f₂/∂a₂) (∂a₂/∂f₁) (∂f₁/∂a₁) (∂a₁/∂w₁)

  41. The chain rule says…
      [same "depends on" diagram as slide 40]
      ∂L/∂w₁ = (∂L/∂f₃) (∂f₃/∂a₃) (∂a₃/∂f₂) (∂f₂/∂a₂) (∂a₂/∂f₁) (∂f₁/∂a₁) (∂a₁/∂w₁)
      The leading factors were already computed for ∂L/∂w₂: re-use them (propagate)!
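To tie the chain-rule factors together, here is a compact sketch of back-propagation for all four parameters of this network, assuming sigmoid activations and the L2 loss from the earlier slides; the variable names and sample values are my own. Each ∂L/∂a term is computed once and re-used (propagated) by the layer below it.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def backprop(x, y, w1, b1, w2, w3):
    """One forward/backward pass; returns dL/dw3, dL/dw2, dL/dw1, dL/db1."""
    # Forward pass (slide 23), caching the activations
    f1 = sigmoid(w1 * x + b1)
    f2 = sigmoid(w2 * f1)
    f3 = sigmoid(w3 * f2)                     # prediction y_hat

    # Backward pass: each dL/da is computed once and re-used (propagated)
    dL_da3 = -(y - f3) * f3 * (1 - f3)        # (dL/df3)(df3/da3)
    dL_da2 = dL_da3 * w3 * f2 * (1 - f2)      # re-uses dL_da3: times (da3/df2)(df2/da2)
    dL_da1 = dL_da2 * w2 * f1 * (1 - f1)      # re-uses dL_da2: times (da2/df1)(df1/da1)

    return (dL_da3 * f2,                      # dL/dw3:  da3/dw3 = f2
            dL_da2 * f1,                      # dL/dw2:  da2/dw2 = f1
            dL_da1 * x,                       # dL/dw1:  da1/dw1 = x
            dL_da1)                           # dL/db1:  da1/db1 = 1

print(backprop(x=1.0, y=0.2, w1=0.4, b1=0.1, w2=-0.6, w3=0.9))
```

Each returned gradient would then feed the update θ ← θ − η ∇θ from slide 28.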
