
Neural Networks: Backpropagation - Machine Learning (PowerPoint presentation transcript)



  1. Neural Networks: Backpropagation. Machine Learning. Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others.

  2. This lecture • What is a neural network? • Predicting with a neural network • Training neural networks – backpropagation • Practical concerns

  3. Training a neural network • Given: a network architecture (the layout of neurons, their connectivity and activations) and a dataset of labeled examples S = {(x_i, y_i)} • The goal: learn the weights of the neural network • Remember: for a fixed architecture, a neural network is a function parameterized by its weights – Prediction: y = NN(x, w)

  4. Recall: Learning as loss minimization • We have a classifier NN that is completely defined by its weights • Learn the weights by minimizing a loss L: min_w Σ_i L(NN(x_i, w), y_i), perhaps with a regularizer • So far, we saw that this strategy worked for: 1. logistic regression, 2. support vector machines, 3. the perceptron, 4. LMS regression – each minimizes a different loss function • All of these are linear models; the same idea works for non-linear models too!
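To make the objective concrete, here is a minimal Python (NumPy) sketch of the empirical loss being minimized. The names squared_loss, empirical_loss and predict, and the optional L2 regularizer, are illustrative assumptions rather than anything fixed by the slides.

    import numpy as np

    def squared_loss(y_pred, y_true):
        # L(y_pred, y_true) = 1/2 * (y_pred - y_true)^2
        return 0.5 * (y_pred - y_true) ** 2

    def empirical_loss(predict, w, S, reg=0.0):
        """Sum of per-example losses over S = [(x_i, y_i), ...], plus an optional
        L2 regularizer. predict(x, w) plays the role of NN(x, w); w is a flat
        NumPy array of weights."""
        total = sum(squared_loss(predict(x_i, w), y_i) for x_i, y_i in S)
        return total + reg * np.dot(w, w)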

  5. Back to our running example • Given an input x, how is the output predicted?
     output: y = w^o_{01} + w^o_{11} z_1 + w^o_{21} z_2
     where z_1 = σ(w^h_{01} + w^h_{11} x_1 + w^h_{21} x_2) and z_2 = σ(w^h_{02} + w^h_{12} x_1 + w^h_{22} x_2)

  6. Back to our running example • Given an input x, how is the output predicted?
     output: y = w^o_{01} + w^o_{11} z_1 + w^o_{21} z_2
     where z_1 = σ(w^h_{01} + w^h_{11} x_1 + w^h_{21} x_2) and z_2 = σ(w^h_{02} + w^h_{12} x_1 + w^h_{22} x_2)
     • Suppose the true label for this example is a number y*. We can write the square loss for this example as L = ½ (y − y*)²
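A minimal sketch of this forward pass and square loss in Python (NumPy), assuming the logistic function for σ; the weight layout (biases in column 0 of w_h and position 0 of w_o) and the function names are illustrative assumptions.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def forward(x, w_h, w_o):
        """Running example: 2 inputs, 2 sigmoid hidden units, linear output.
        w_h: 2x3 array, row j = (w^h_{0j}, w^h_{1j}, w^h_{2j}) for hidden unit j.
        w_o: length-3 array (w^o_{01}, w^o_{11}, w^o_{21})."""
        z1 = sigmoid(w_h[0, 0] + w_h[0, 1] * x[0] + w_h[0, 2] * x[1])
        z2 = sigmoid(w_h[1, 0] + w_h[1, 1] * x[0] + w_h[1, 2] * x[1])
        y = w_o[0] + w_o[1] * z1 + w_o[2] * z2
        return y, (z1, z2)

    def square_loss(x, y_star, w_h, w_o):
        # L = 1/2 * (y - y*)^2 for a single example
        y, _ = forward(x, w_h, w_o)
        return 0.5 * (y - y_star) ** 2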

  7. Learning as loss minimization • We have a classifier NN that is completely defined by its weights • Learn the weights by minimizing a loss L: min_w Σ_i L(NN(x_i, w), y_i), perhaps with a regularizer • How do we solve the optimization problem?

  8. Stochastic gradient descent for min_w Σ_i L(NN(x_i, w), y_i)
     Given a training set S = {(x_i, y_i)}, x ∈ ℝ^d:
     1. Initialize parameters w
     2. For epoch = 1 … T:
        1. Shuffle the training set
        2. For each training example (x_i, y_i) ∈ S:
           • Treat this example as the entire dataset
           • Compute the gradient of the loss: ∇L(NN(x_i, w), y_i)
           • Update: w ← w − γ_t ∇L(NN(x_i, w), y_i)
     3. Return w
     Notes: the objective is not convex, so initialization can be important. γ_t is the learning rate; many tweaks to it are possible. Have we solved everything? (A code sketch of this loop follows below.)
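A minimal Python sketch of the loop above. The gradient function grad_loss, the epoch count, and the particular decaying learning-rate schedule are illustrative assumptions; the slides only say that many tweaks to γ_t are possible.

    import random

    def sgd(S, w0, grad_loss, epochs=10, gamma0=0.1):
        """Plain stochastic gradient descent, one example at a time.
        S:         list of (x_i, y_i) training examples
        w0:        initial weights as a NumPy array; initialization matters,
                   since the objective is not convex
        grad_loss: function (x_i, y_i, w) -> gradient of L(NN(x_i, w), y_i)"""
        w = w0.copy()
        t = 0
        for epoch in range(epochs):
            random.shuffle(S)                 # step 2.1: shuffle the training set
            for x_i, y_i in S:                # step 2.2: treat each example as the dataset
                gamma_t = gamma0 / (1.0 + t)  # one of many possible learning-rate schedules
                w = w - gamma_t * grad_loss(x_i, y_i, w)
                t += 1
        return w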

  9. The derivative of the loss function: ∇L(NN(x_i, w), y_i)
     • If the neural network is a differentiable function, we can find the gradient – or maybe its sub-gradient – this is decided by the activation functions and the loss function
     • It was easy for SVMs and logistic regression – only one layer
     • But how do we find the (sub-)gradient of a more complex function? E.g., a recent paper used a ~150-layer neural network for image classification!
     • We need an efficient algorithm: backpropagation (a brute-force alternative is sketched below)
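For intuition about why an efficient algorithm matters, here is a hedged sketch of the brute-force alternative: estimating each partial derivative with finite differences. This is not the method the slides develop; it costs two extra evaluations of the loss per weight, which is exactly the expense backpropagation avoids, and it is mainly useful as a gradient check.

    import numpy as np

    def numerical_gradient(loss_fn, w, eps=1e-6):
        """Central-difference estimate of dL/dw for every weight.
        loss_fn: function w -> scalar loss, e.g. L(NN(x_i, w), y_i).
        Cost: 2 * len(w) loss evaluations, prohibitive for deep networks."""
        grad = np.zeros_like(w, dtype=float)
        for j in range(len(w)):
            w_plus, w_minus = w.copy(), w.copy()
            w_plus[j] += eps
            w_minus[j] -= eps
            grad[j] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)
        return grad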

  10. Checkpoint: where are we? • If we have a neural network (structure, activations and weights), we can make a prediction for an input • If we have the true label of the input, then we can define the loss for that example • If we can take the derivative of the loss with respect to each of the weights, we can take a gradient step in SGD • Questions?

  11. Reminder: Chain rule for derivatives – If z is a function of y, and y is a function of x • Then z is a function of x as well – Question: how to find dz/dx? (Slide courtesy Richard Socher)
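For reference, the single-path form of the chain rule, with a small worked example (the specific functions are illustrative):

    \frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx},
    \qquad \text{e.g. } z = y^{3},\; y = x^{2}+1
    \;\Rightarrow\; \frac{dz}{dx} = 3y^{2}\cdot 2x = 6x\,(x^{2}+1)^{2}.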

  12. Reminder: Chain rule for derivatives – If z = (a function of y_1) + (a function of y_2), and the y_i's are functions of x • Then z is a function of x as well – Question: how to find dz/dx? (Slide courtesy Richard Socher)

  13. Reminder: Chain rule for derivatives – If z is a sum of functions of the y_i's, and the y_i's are functions of x • Then z is a function of x as well – Question: how to find dz/dx? (Slide courtesy Richard Socher)
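For the multi-path case in the last two reminders, the contributions along each path add up; a short statement in LaTeX (the notation f_i for the summands is an illustrative choice):

    z = f_{1}(y_{1}) + \dots + f_{n}(y_{n}), \quad y_{i} = y_{i}(x)
    \;\Longrightarrow\;
    \frac{dz}{dx} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_{i}}\,\frac{dy_{i}}{dx}
                  = \sum_{i=1}^{n} f_{i}'(y_{i})\,\frac{dy_{i}}{dx}.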

  14. Backpropagation: applying the chain rule to compute the gradient (and remembering partial computations along the way to speed things up)
      L = ½ (y − y*)²
      y = w^o_{01} + w^o_{11} z_1 + w^o_{21} z_2
      z_1 = σ(w^h_{01} + w^h_{11} x_1 + w^h_{21} x_2)
      z_2 = σ(w^h_{02} + w^h_{12} x_1 + w^h_{22} x_2)
      We want to compute ∂L/∂w^o_{ij} and ∂L/∂w^h_{ij} (a worked sketch follows below).
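A worked Python (NumPy) sketch of these derivatives for the running example, assuming the logistic function for σ; the weight layout (biases in column 0) and the function name are illustrative assumptions. The shared factor (y − y*) and the cached hidden activations are computed once and reused, which is the "remembering partial computations" the slide refers to.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def loss_and_grads(x, y_star, w_h, w_o):
        """Loss L = 1/2 (y - y*)^2 and its gradients for the 2-2-1 running example.
        w_h: 2x3 array, row j = (w^h_{0j}, w^h_{1j}, w^h_{2j}) for hidden unit j.
        w_o: length-3 array (w^o_{01}, w^o_{11}, w^o_{21})."""
        # Forward pass; cache the hidden activations, they are reused below.
        inputs = np.array([1.0, x[0], x[1]])
        z = sigmoid(w_h @ inputs)                       # z[0] = z_1, z[1] = z_2
        y = w_o @ np.concatenate(([1.0], z))
        loss = 0.5 * (y - y_star) ** 2

        # Backward pass; delta = dL/dy is shared by every weight.
        delta = y - y_star
        grad_o = delta * np.concatenate(([1.0], z))     # dL/dw^o = delta * dy/dw^o
        # dL/dw^h_{ij} = delta * (dy/dz_j) * sigma'(a_j) * input_i,
        # with dy/dz_j = the output weight on z_j and sigma'(a_j) = z_j (1 - z_j).
        grad_h = (delta * w_o[1:] * z * (1.0 - z))[:, None] * inputs[None, :]
        return loss, grad_o, grad_h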
