Slide 1: CSC321 Lecture 6: Backpropagation

Roger Grosse

Slide 2: Overview

We’ve seen that multilayer neural networks are powerful. But how can we actually learn them? Backpropagation is the central algorithm in this course.

It’s an algorithm for computing gradients. Really, it’s an instance of reverse-mode automatic differentiation, which is much more broadly applicable than just neural nets.

This is “just” a clever and efficient use of the Chain Rule for derivatives. David Duvenaud will tell you more about this next week.


Slide 3: Overview

Design choices so far:
- Task: regression, binary classification, multiway classification
- Model/Architecture: linear, log-linear, multilayer perceptron
- Loss function: squared error, 0–1 loss, cross-entropy, hinge loss
- Optimization algorithm: direct solution, gradient descent, perceptron

Compute gradients using backpropagation


Slide 4: Recap: Gradient Descent

- Recall: gradient descent moves opposite the gradient (the direction of steepest descent).
- Weight space for a multilayer neural net: one coordinate for each weight or bias of the network, in all the layers.
- Conceptually, not any different from what we’ve seen so far, just higher dimensional and harder to visualize!
- We want to compute the cost gradient $\partial\mathcal{E}/\partial\mathbf{w}$, which is the vector of partial derivatives. This is the average of $\partial\mathcal{L}/\partial\mathbf{w}$ over all the training examples, so in this lecture we focus on computing $\partial\mathcal{L}/\partial\mathbf{w}$.
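A one-step sketch of this update in Python (a minimal illustration; `gradient_descent_step`, `grad_L`, and the learning rate `alpha` are my own names, not from the slides):

```python
import numpy as np

# A minimal sketch of one gradient descent update, assuming a user-supplied
# grad_L(w, x, t) that returns dL/dw for a single training example.
def gradient_descent_step(w, xs, ts, grad_L, alpha=0.1):
    # The cost gradient dE/dw is the average of dL/dw over the training set.
    dE_dw = np.mean([grad_L(w, x, t) for x, t in zip(xs, ts)], axis=0)
    return w - alpha * dE_dw  # move opposite the gradient
```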


Slide 5: Univariate Chain Rule

We’ve already been using the univariate Chain Rule. Recall: if $f(x)$ and $x(t)$ are univariate functions, then
$$\frac{d}{dt} f(x(t)) = \frac{df}{dx} \cdot \frac{dx}{dt}.$$


Slide 6: Univariate Chain Rule

Recall: the univariate logistic least squares model
$$z = wx + b, \qquad y = \sigma(z), \qquad \mathcal{L} = \tfrac{1}{2}(y - t)^2.$$
Let’s compute the loss derivatives.


Slide 7: Univariate Chain Rule

How you would have done it in calculus class

$$\begin{aligned}
\mathcal{L} &= \tfrac{1}{2}\left(\sigma(wx + b) - t\right)^2 \\
\frac{\partial \mathcal{L}}{\partial w} &= \frac{\partial}{\partial w}\left[\tfrac{1}{2}\left(\sigma(wx + b) - t\right)^2\right] \\
&= \tfrac{1}{2}\,\frac{\partial}{\partial w}\left(\sigma(wx + b) - t\right)^2 \\
&= \left(\sigma(wx + b) - t\right)\frac{\partial}{\partial w}\left(\sigma(wx + b) - t\right) \\
&= \left(\sigma(wx + b) - t\right)\sigma'(wx + b)\,\frac{\partial}{\partial w}(wx + b) \\
&= \left(\sigma(wx + b) - t\right)\sigma'(wx + b)\,x
\end{aligned}$$

$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial b} &= \frac{\partial}{\partial b}\left[\tfrac{1}{2}\left(\sigma(wx + b) - t\right)^2\right] \\
&= \tfrac{1}{2}\,\frac{\partial}{\partial b}\left(\sigma(wx + b) - t\right)^2 \\
&= \left(\sigma(wx + b) - t\right)\frac{\partial}{\partial b}\left(\sigma(wx + b) - t\right) \\
&= \left(\sigma(wx + b) - t\right)\sigma'(wx + b)\,\frac{\partial}{\partial b}(wx + b) \\
&= \left(\sigma(wx + b) - t\right)\sigma'(wx + b)
\end{aligned}$$

What are the disadvantages of this approach?


Slide 8: Univariate Chain Rule

A more structured way to do it

Computing the loss:
$$z = wx + b, \qquad y = \sigma(z), \qquad \mathcal{L} = \tfrac{1}{2}(y - t)^2$$

Computing the derivatives:
$$\frac{d\mathcal{L}}{dy} = y - t, \qquad \frac{d\mathcal{L}}{dz} = \frac{d\mathcal{L}}{dy}\,\sigma'(z), \qquad \frac{\partial\mathcal{L}}{\partial w} = \frac{d\mathcal{L}}{dz}\,x, \qquad \frac{\partial\mathcal{L}}{\partial b} = \frac{d\mathcal{L}}{dz}$$

Remember, the goal isn’t to obtain closed-form solutions, but to be able to write a program that efficiently computes the derivatives.
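As a concrete illustration, here is a minimal Python sketch of such a program (the function and variable names are mine, not from the slides):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss_and_grads(w, b, x, t):
    # Forward pass: compute the loss step by step.
    z = w * x + b
    y = sigmoid(z)
    L = 0.5 * (y - t) ** 2

    # Backward pass: reuse the intermediate values z and y rather than
    # re-deriving a closed-form expression for each parameter.
    dL_dy = y - t
    dL_dz = dL_dy * y * (1 - y)  # sigma'(z) = sigma(z) * (1 - sigma(z))
    dL_dw = dL_dz * x
    dL_db = dL_dz
    return L, dL_dw, dL_db
```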


Slide 9: Univariate Chain Rule

We can diagram out the computations using a computation graph. The nodes represent all the inputs and computed quantities, and the edges represent which nodes are computed directly as a function of which other nodes.


Slide 10: Univariate Chain Rule

A slightly more convenient notation:

Use $\bar{y}$ to denote the derivative $d\mathcal{L}/dy$, sometimes called the error signal. This emphasizes that the error signals are just values our program is computing (rather than a mathematical operation). This is not a standard notation, but I couldn’t find another one that I liked.

Computing the loss:
$$z = wx + b, \qquad y = \sigma(z), \qquad \mathcal{L} = \tfrac{1}{2}(y - t)^2$$

Computing the derivatives:
$$\bar{y} = y - t, \qquad \bar{z} = \bar{y}\,\sigma'(z), \qquad \bar{w} = \bar{z}\,x, \qquad \bar{b} = \bar{z}$$


Slide 11: Multivariate Chain Rule

Problem: what if the computation graph has fan-out > 1? This requires the multivariate Chain Rule!

L2-regularized regression:
$$z = wx + b, \qquad y = \sigma(z), \qquad \mathcal{L} = \tfrac{1}{2}(y - t)^2, \qquad \mathcal{R} = \tfrac{1}{2}w^2, \qquad \mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda\mathcal{R}$$

Multiclass logistic regression:
$$z_\ell = \sum_j w_{\ell j} x_j + b_\ell, \qquad y_k = \frac{e^{z_k}}{\sum_\ell e^{z_\ell}}, \qquad \mathcal{L} = -\sum_k t_k \log y_k$$


Slide 12: Multivariate Chain Rule

Suppose we have a function $f(x, y)$ and functions $x(t)$ and $y(t)$. (All the variables here are scalar-valued.) Then
$$\frac{d}{dt} f(x(t), y(t)) = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$$

Example:
$$f(x, y) = y + e^{xy}, \qquad x(t) = \cos t, \qquad y(t) = t^2$$

Plug in to Chain Rule:
$$\frac{df}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} = (y e^{xy}) \cdot (-\sin t) + (1 + x e^{xy}) \cdot 2t$$
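A derivation like this can be sanity-checked against a finite-difference approximation; a minimal sketch (the function names are mine):

```python
import math

def df_dt(t):
    # Analytic derivative from the multivariate Chain Rule.
    x, y = math.cos(t), t ** 2
    return (y * math.exp(x * y)) * (-math.sin(t)) + (1 + x * math.exp(x * y)) * 2 * t

def df_dt_numeric(t, eps=1e-6):
    # Centered finite-difference approximation of d/dt f(x(t), y(t)),
    # where f(x, y) = y + exp(x*y) with x = cos(t) and y = t^2.
    f = lambda s: s ** 2 + math.exp(math.cos(s) * s ** 2)
    return (f(t + eps) - f(t - eps)) / (2 * eps)

print(df_dt(1.3), df_dt_numeric(1.3))  # should agree to several decimal places
```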


Slide 13: Multivariable Chain Rule

In the context of backpropagation, in our notation:
$$\bar{t} = \bar{x}\,\frac{dx}{dt} + \bar{y}\,\frac{dy}{dt}$$


Slide 14: Backpropagation

Full backpropagation algorithm:

Let $v_1, \ldots, v_N$ be a topological ordering of the computation graph (i.e., parents come before children). $v_N$ denotes the variable we’re trying to compute derivatives of (e.g., the loss).

Forward pass: for $i = 1, \ldots, N$, compute $v_i$ as a function of its parents $\mathrm{Pa}(v_i)$.

Backward pass: set $\bar{v}_N = 1$; then for $i = N-1, \ldots, 1$:
$$\bar{v}_i = \sum_{j \in \mathrm{Ch}(v_i)} \bar{v}_j\,\frac{\partial v_j}{\partial v_i}$$
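A minimal sketch of this procedure in Python (the `Node` class, its fields, and the example graph are my own illustration, not code from the lecture):

```python
import math

class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # computed during the forward pass
        self.parents = parents          # nodes this one depends on
        self.local_grads = local_grads  # partial d(value)/d(parent), one per parent
        self.grad = 0.0                 # the error signal, v-bar

def backprop(nodes):
    # nodes: topological ordering v_1, ..., v_N (parents before children);
    # nodes[-1] is the variable we differentiate, e.g. the loss.
    nodes[-1].grad = 1.0
    for node in reversed(nodes):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local  # sum messages from children

# Example: z = w*x + b, y = sigmoid(z), L = 0.5*(y - t)^2
w, x, b, t = 0.5, 1.0, 0.1, 1.0
z = w * x + b
y = 1 / (1 + math.exp(-z))
n_w, n_b = Node(w), Node(b)
n_z = Node(z, parents=(n_w, n_b), local_grads=(x, 1.0))
n_y = Node(y, parents=(n_z,), local_grads=(y * (1 - y),))
n_L = Node(0.5 * (y - t) ** 2, parents=(n_y,), local_grads=(y - t,))
backprop([n_w, n_b, n_z, n_y, n_L])
print(n_w.grad, n_b.grad)  # dL/dw and dL/db
```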


Slide 15: Backpropagation

Example: univariate logistic least squares regression.

Forward pass:
$$z = wx + b, \qquad y = \sigma(z), \qquad \mathcal{L} = \tfrac{1}{2}(y - t)^2, \qquad \mathcal{R} = \tfrac{1}{2}w^2, \qquad \mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda\mathcal{R}$$

Backward pass:
$$\begin{aligned}
\bar{\mathcal{L}}_{\text{reg}} &= 1 \\
\bar{\mathcal{R}} &= \bar{\mathcal{L}}_{\text{reg}}\,\frac{d\mathcal{L}_{\text{reg}}}{d\mathcal{R}} = \bar{\mathcal{L}}_{\text{reg}}\,\lambda \\
\bar{\mathcal{L}} &= \bar{\mathcal{L}}_{\text{reg}}\,\frac{d\mathcal{L}_{\text{reg}}}{d\mathcal{L}} = \bar{\mathcal{L}}_{\text{reg}} \\
\bar{y} &= \bar{\mathcal{L}}\,\frac{d\mathcal{L}}{dy} = \bar{\mathcal{L}}\,(y - t) \\
\bar{z} &= \bar{y}\,\frac{dy}{dz} = \bar{y}\,\sigma'(z) \\
\bar{w} &= \bar{z}\,\frac{\partial z}{\partial w} + \bar{\mathcal{R}}\,\frac{d\mathcal{R}}{dw} = \bar{z}\,x + \bar{\mathcal{R}}\,w \\
\bar{b} &= \bar{z}\,\frac{\partial z}{\partial b} = \bar{z}
\end{aligned}$$
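Here $w$ has fan-out 2 (it feeds both $z$ and $\mathcal{R}$), so its error signal sums two messages. A minimal Python sketch of this backward pass (function and variable names are mine):

```python
import math

def regularized_loss_and_grads(w, b, x, t, lam):
    # Forward pass
    z = w * x + b
    y = 1 / (1 + math.exp(-z))
    L = 0.5 * (y - t) ** 2
    R = 0.5 * w ** 2
    L_reg = L + lam * R

    # Backward pass: the *_bar variables are the error signals.
    L_reg_bar = 1.0
    R_bar = L_reg_bar * lam
    L_bar = L_reg_bar
    y_bar = L_bar * (y - t)
    z_bar = y_bar * y * (1 - y)    # sigma'(z) = sigma(z)(1 - sigma(z))
    w_bar = z_bar * x + R_bar * w  # w has fan-out 2: messages from z and R
    b_bar = z_bar
    return L_reg, w_bar, b_bar
```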


Slide 16: Backpropagation

Multilayer Perceptron (multiple outputs):

Forward pass:
$$z_i = \sum_j w^{(1)}_{ij} x_j + b^{(1)}_i, \qquad h_i = \sigma(z_i), \qquad y_k = \sum_i w^{(2)}_{ki} h_i + b^{(2)}_k, \qquad \mathcal{L} = \tfrac{1}{2} \sum_k (y_k - t_k)^2$$

Backward pass:
$$\begin{aligned}
\bar{\mathcal{L}} &= 1 \\
\bar{y}_k &= \bar{\mathcal{L}}\,(y_k - t_k) \\
\bar{w}^{(2)}_{ki} &= \bar{y}_k\, h_i \\
\bar{b}^{(2)}_k &= \bar{y}_k \\
\bar{h}_i &= \sum_k \bar{y}_k\, w^{(2)}_{ki} \\
\bar{z}_i &= \bar{h}_i\, \sigma'(z_i) \\
\bar{w}^{(1)}_{ij} &= \bar{z}_i\, x_j \\
\bar{b}^{(1)}_i &= \bar{z}_i
\end{aligned}$$


Slide 17: Backpropagation

In vectorized form:

Forward pass:
$$\mathbf{z} = \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}, \qquad \mathbf{h} = \sigma(\mathbf{z}), \qquad \mathbf{y} = \mathbf{W}^{(2)}\mathbf{h} + \mathbf{b}^{(2)}, \qquad \mathcal{L} = \tfrac{1}{2}\|\mathbf{t} - \mathbf{y}\|^2$$

Backward pass:
$$\begin{aligned}
\bar{\mathcal{L}} &= 1 \\
\bar{\mathbf{y}} &= \bar{\mathcal{L}}\,(\mathbf{y} - \mathbf{t}) \\
\bar{\mathbf{W}}^{(2)} &= \bar{\mathbf{y}}\,\mathbf{h}^\top \\
\bar{\mathbf{b}}^{(2)} &= \bar{\mathbf{y}} \\
\bar{\mathbf{h}} &= \mathbf{W}^{(2)\top}\bar{\mathbf{y}} \\
\bar{\mathbf{z}} &= \bar{\mathbf{h}} \circ \sigma'(\mathbf{z}) \\
\bar{\mathbf{W}}^{(1)} &= \bar{\mathbf{z}}\,\mathbf{x}^\top \\
\bar{\mathbf{b}}^{(1)} &= \bar{\mathbf{z}}
\end{aligned}$$
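A minimal NumPy sketch of these vectorized equations (the function name and argument layout are my own; `x` and `t` are 1-D arrays):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def mlp_loss_and_grads(W1, b1, W2, b2, x, t):
    # Forward pass
    z = W1 @ x + b1
    h = sigmoid(z)
    y = W2 @ h + b2
    L = 0.5 * np.sum((t - y) ** 2)

    # Backward pass, mirroring the equations above (L_bar = 1 folded in).
    y_bar = y - t
    W2_bar = np.outer(y_bar, h)   # y_bar h^T
    b2_bar = y_bar
    h_bar = W2.T @ y_bar
    z_bar = h_bar * h * (1 - h)   # elementwise; sigma'(z) = h(1 - h)
    W1_bar = np.outer(z_bar, x)
    b1_bar = z_bar
    return L, W1_bar, b1_bar, W2_bar, b2_bar
```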


Slide 18: Backpropagation

Backprop as message passing: Each node receives a bunch of messages from its children, which it aggregates to get its error signal. It then passes messages to its parents. This provides modularity, since each node only has to know how to compute derivatives with respect to its arguments, and doesn’t have to know anything about the rest of the graph.


Slide 19: Computational Cost

Computational cost of the forward pass: one add-multiply operation per weight:
$$z_i = \sum_j w^{(1)}_{ij} x_j + b^{(1)}_i$$

Computational cost of the backward pass: two add-multiply operations per weight:
$$\bar{w}^{(2)}_{ki} = \bar{y}_k\, h_i, \qquad \bar{h}_i = \sum_k \bar{y}_k\, w^{(2)}_{ki}$$

Rule of thumb: the backward pass is about as expensive as two forward passes. For a multilayer perceptron, this means the cost is linear in the number of layers, quadratic in the number of units per layer.
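As a rough worked example (the layer-size symbols $D$, $H$, $K$ are my own, not from the slides): for an MLP with input dimension $D$, $H$ hidden units, and $K$ outputs,
$$\text{forward cost} \approx DH + HK \ \text{add-multiplies}, \qquad \text{backward cost} \approx 2(DH + HK),$$
so adding a layer adds one more term to the sum (linear in depth), while doubling the number of units per layer roughly quadruples each term (quadratic in width).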


Slide 20: Backpropagation

Backprop is used to train the overwhelming majority of neural nets today. Even optimization algorithms much fancier than gradient descent (e.g., second-order methods) use backprop to compute the gradients.

Despite its practical success, backprop is believed to be neurally implausible: there is no evidence for biological signals analogous to error derivatives, and all the biologically plausible alternatives we know about learn much more slowly (on computers). So how on earth does the brain learn?


Slide 21: Backpropagation

By now, we’ve seen three different ways of looking at gradients:

- Geometric: visualization of gradient in weight space
- Algebraic: mechanics of computing the derivatives
- Implementational: efficient implementation on the computer

When thinking about neural nets, it’s important to be able to shift between these different perspectives!
