Back-Propagation
16-385 Computer Vision (Kris Kitani)
Carnegie Mellon University
Back to the world's smallest perceptron!

y = wx    (a.k.a. the line equation, linear regression)

[Diagram: input x → weight w → f → output y]

A function of ONE parameter!
Training the world’s smallest perceptron
This is just gradient descent; that means the update term should be the gradient of the loss function. Now where does this come from?

L = ½(y − ŷ)²

The gradient is the rate at which the loss function will change per unit change of the weight parameter.

Let's compute the derivative:

dL/dw = −(y − ŷ)x

That means the weight update for gradient descent is:

w = w − η∇w L    (∇w L is just shorthand for dL/dw; move in the direction of the negative gradient)
Gradient Descent (world's smallest perceptron)

For each sample {xᵢ, yᵢ}:

ŷ = wxᵢ    Lᵢ = ½(yᵢ − ŷ)²

dLᵢ/dw = −(yᵢ − ŷ)xᵢ = ∇w Lᵢ

w = w − η∇w Lᵢ
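The loop above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the slides; the training data, learning rate, and epoch count are made-up values:

```python
# Gradient descent for the world's smallest perceptron: y_hat = w * x
# Per-sample loss: L_i = 0.5 * (y_i - y_hat)^2, so dL_i/dw = -(y_i - y_hat) * x_i
def train_smallest_perceptron(samples, w=0.0, lr=0.1, epochs=50):
    for _ in range(epochs):
        for x_i, y_i in samples:
            y_hat = w * x_i                # forward pass
            grad_w = -(y_i - y_hat) * x_i  # gradient of the loss w.r.t. w
            w = w - lr * grad_w            # move in direction of negative gradient
    return w

# Data generated from y = 3x, so w should converge toward 3
samples = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w = train_smallest_perceptron(samples)
```

Because the model is linear and the data are noise-free, the learned w matches the generating slope almost exactly.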
Training the world's smallest perceptron: now a function of TWO parameters!

ŷ = w₁x₁ + w₂x₂

[Diagram: inputs x₁, x₂ → weights w₁, w₂ → output y]
Gradient Descent

For each sample {xᵢ, yᵢ}, we just need to compute the partial derivatives for this network.
Back-Propagation

∂L/∂w₁ = ∂/∂w₁ {½(y − ŷ)²}
= −(y − ŷ) ∂ŷ/∂w₁
= −(y − ŷ) ∂(Σᵢ wᵢxᵢ)/∂w₁
= −(y − ŷ) ∂(w₁x₁)/∂w₁
= −(y − ŷ)x₁ = ∇w₁

∂L/∂w₂ = ∂/∂w₂ {½(y − ŷ)²}
= −(y − ŷ) ∂ŷ/∂w₂
= −(y − ŷ) ∂(Σᵢ wᵢxᵢ)/∂w₂
= −(y − ŷ) ∂(w₂x₂)/∂w₂
= −(y − ŷ)x₂ = ∇w₂

Why do we have partial derivatives now? Because the loss is now a function of more than one parameter.
Gradient Update

w₁ = w₁ − η∇w₁ = w₁ + η(y − ŷ)x₁
w₂ = w₂ − η∇w₂ = w₂ + η(y − ŷ)x₂
Gradient Descent

For each sample {xᵢ, yᵢ}:

ŷ = fMLP(xᵢ; θ)    Lᵢ = ½(yᵢ − ŷ)²

∇w₁ = −(yᵢ − ŷ)x₁ᵢ    ∇w₂ = −(yᵢ − ŷ)x₂ᵢ    (two back-propagation lines now)

w₁ = w₁ + η(yᵢ − ŷ)x₁ᵢ    w₂ = w₂ + η(yᵢ − ŷ)x₂ᵢ

(η is an adjustable step size; the gradients are approximated from a stochastic sample)
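The two-parameter version differs only in that each weight gets its own partial derivative and update line. A minimal sketch with made-up data and step size:

```python
# Gradient descent for y_hat = w1*x1 + w2*x2 with loss L = 0.5*(y - y_hat)^2
def train_two_weight_perceptron(samples, w1=0.0, w2=0.0, lr=0.05, epochs=200):
    for _ in range(epochs):
        for (x1, x2), y in samples:
            y_hat = w1 * x1 + w2 * x2
            grad_w1 = -(y - y_hat) * x1   # dL/dw1
            grad_w2 = -(y - y_hat) * x2   # dL/dw2
            w1 -= lr * grad_w1            # two back-propagation lines now
            w2 -= lr * grad_w2
    return w1, w2

# Data generated from y = 2*x1 - x2
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0),
           ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
w1, w2 = train_two_weight_perceptron(samples)
```

Since a weight vector exists that fits every sample exactly, the per-sample updates contract toward it and (w1, w2) recovers (2, −1).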
We haven’t seen a lot of ‘propagation’ yet because our perceptrons only had one layer…
Now a function of FOUR parameters and FOUR layers!

[Diagram: input x → weight w₁ with bias b₁ → hidden unit h₁ → weight w₂ → hidden unit h₂ → weight w₃ → output y]
[Network diagram: input x (layer 1) → weight w₁ with bias b₁ → sum a₁ → activation f₁ (hidden layer 2) → weight w₂ → sum a₂ → activation f₂ (hidden layer 3) → weight w₃ → sum a₃ → activation f₃ → output y (layer 4)]
The entire network can be written out as one long equation:

ŷ = f₃(w₃ · f₂(w₂ · f₁(w₁x + b₁)))

What is known? The input x and the target output y (the training samples).
What is unknown? The weights w₁, w₂, w₃ and the bias b₁.

We need to train the network:
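That one long equation is easy to write down directly. A sketch, assuming sigmoid activations at every layer and a bias only at the first layer (as the diagram suggests); the parameter values in the usage line are made up:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w1, w2, w3, b1):
    # The whole 4-layer network as one nested expression:
    # y_hat = f3( w3 * f2( w2 * f1( w1*x + b1 ) ) )
    return sigmoid(w3 * sigmoid(w2 * sigmoid(w1 * x + b1)))

y_hat = forward(0.5, w1=0.2, w2=-0.4, w3=0.7, b1=0.1)  # a value in (0, 1)
```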
(the activation functions sometimes have parameters of their own)

Given a set of samples {xᵢ, yᵢ} and an MLP, estimate the parameters θ of the MLP.
Stochastic Gradient Descent

For each random sample {xᵢ, yᵢ}:

ŷ = fMLP(xᵢ; θ)

Compute the vector of parameter partial derivatives ∂L/∂θ, then apply the vector of parameter update equations: θ ← θ − η∇θL
So we need to compute the partial derivatives
[Diagram: x → w₁ → a₁ → f₁ → w₂ → a₂ → f₂ → w₃ → a₃ → f₃ → ŷ, with bias b₁ at layer 1]

The partial derivative ∂L/∂w₁ describes how this weight affects the loss (at the loss layer). So, how do you compute it?
Remember:

[Diagram: rest of the network → f₂ → weight w₃ → sum a₃ → activation f₃ = ŷ → loss L(y, ŷ)]

Intuitively, the effect of the weight on the loss function: L depends on f₃, f₃ depends on a₃, and a₃ depends on w₃. According to the chain rule:

∂L/∂w₃ = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂w₃
[Diagram: rest of the network → f₂ → weight w₃ → sum a₃ → activation f₃ = ŷ → loss L(y, ŷ)]

∂L/∂w₃ = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂w₃
= −(y − ŷ) · ∂f₃/∂a₃ · ∂a₃/∂w₃    (just the partial derivative of the L2 loss, since f₃ = ŷ)
= −(y − ŷ) · f₃(1 − f₃) · ∂a₃/∂w₃    (let's use a sigmoid activation: ds(x)/dx = s(x)(1 − s(x)))
= −(y − ŷ) · f₃(1 − f₃) · f₂    (since a₃ = w₃f₂)
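This chain-rule expression can be checked numerically against a finite-difference approximation of the same loss. A sketch assuming a sigmoid activation at the output; the values of y, f₂, and w₃ are arbitrary test values, not from the slides:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dL_dw3(y, f2, w3):
    # Chain rule: dL/dw3 = dL/df3 * df3/da3 * da3/dw3
    f3 = sigmoid(w3 * f2)                    # f3 = y_hat = s(a3), with a3 = w3*f2
    return -(y - f3) * f3 * (1.0 - f3) * f2  # -(y - y_hat) * s'(a3) * f2

def loss(y, f2, w3):
    # L = 0.5 * (y - y_hat)^2 with y_hat = sigmoid(w3 * f2)
    return 0.5 * (y - sigmoid(w3 * f2)) ** 2

# Central finite difference agrees with the analytic chain-rule gradient
y, f2, w3, eps = 1.0, 0.8, 0.3, 1e-6
numeric = (loss(y, f2, w3 + eps) - loss(y, f2, w3 - eps)) / (2 * eps)
```

Comparing `dL_dw3(y, f2, w3)` with `numeric` is a standard sanity check for hand-derived gradients.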
[Diagram: x → w₁ → a₁ → f₁ → w₂ → a₂ → f₂ → w₃ → a₃ → f₃ → ŷ, with bias b₁ at layer 1]

∂L/∂f₃ and ∂f₃/∂a₃ were already computed for ∂L/∂w₃: re-use (propagate) them!
The chain rule says: L depends on f₃, f₃ depends on a₃, a₃ depends on f₂, f₂ depends on a₂, a₂ depends on f₁, f₁ depends on a₁, and a₁ depends on w₁ (and on b₁). Terms already computed for earlier gradients can be re-used (propagated)!
[Diagram: x → w₁ → a₁ → f₁ → w₂ → a₂ → f₂ → w₃ → a₃ → f₃ → ŷ, with bias b₁ at layer 1]

∂L/∂w₃ = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂w₃
∂L/∂w₂ = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂f₂ · ∂f₂/∂a₂ · ∂a₂/∂w₂
∂L/∂w₁ = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂f₂ · ∂f₂/∂a₂ · ∂a₂/∂f₁ · ∂f₁/∂a₁ · ∂a₁/∂w₁
∂L/∂b = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂f₂ · ∂f₂/∂a₂ · ∂a₂/∂f₁ · ∂f₁/∂a₁ · ∂a₁/∂b
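The four chain-rule products share their leading factors, which is exactly what back-propagation exploits: compute each ∂L/∂aₖ once and re-use it for everything deeper. A sketch under the same assumptions as before (sigmoid activations everywhere, bias only at layer 1):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_backward(x, y, w1, w2, w3, b1):
    # Forward pass
    a1 = w1 * x + b1; f1 = sigmoid(a1)
    a2 = w2 * f1;     f2 = sigmoid(a2)
    a3 = w3 * f2;     f3 = sigmoid(a3)        # f3 = y_hat
    # Backward pass: each delta re-uses (propagates) the previous one
    delta3 = -(y - f3) * f3 * (1 - f3)        # dL/da3
    grad_w3 = delta3 * f2                     # dL/dw3 = dL/da3 * da3/dw3
    delta2 = delta3 * w3 * f2 * (1 - f2)      # dL/da2, re-uses delta3
    grad_w2 = delta2 * f1                     # dL/dw2
    delta1 = delta2 * w2 * f1 * (1 - f1)      # dL/da1, re-uses delta2
    grad_w1 = delta1 * x                      # dL/dw1
    grad_b1 = delta1                          # dL/db1, since da1/db1 = 1
    return grad_w1, grad_w2, grad_w3, grad_b1
```

Each `delta` is one of the shared prefixes in the equations above, so the cost of all four gradients is a single backward sweep rather than four separate products.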
Stochastic Gradient Descent

For each random sample {xᵢ, yᵢ}:

ŷ = fMLP(xᵢ; θ)    Lᵢ = ½(yᵢ − ŷ)²

Vector of parameter partial derivatives ∂L/∂θ:

∂L/∂w₃ = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂w₃
∂L/∂w₂ = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂f₂ · ∂f₂/∂a₂ · ∂a₂/∂w₂
∂L/∂w₁ = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂f₂ · ∂f₂/∂a₂ · ∂a₂/∂f₁ · ∂f₁/∂a₁ · ∂a₁/∂w₁
∂L/∂b = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂f₂ · ∂f₂/∂a₂ · ∂a₂/∂f₁ · ∂f₁/∂a₁ · ∂a₁/∂b

Vector of parameter update equations, θ ← θ − η∇θL:

w₃ = w₃ − η∇w₃    w₂ = w₂ − η∇w₂    w₁ = w₁ − η∇w₁    b = b − η∇b