Back-Propagation
16-385 Computer Vision (Kris Kitani)
Carnegie Mellon University
Back to the world's smallest perceptron!

y = wx    (a.k.a. the line equation, linear regression)

[Diagram: input x → weight w → f → output y]

A function of ONE parameter!
Training the world’s smallest perceptron
This is just gradient descent; that means the update term should be the gradient of the loss function. Now where does this come from?

L = ½(y − ŷ)²

The gradient is the rate at which the loss function will change per unit change of the weight parameter.

Let's compute the derivative:

dL/dw = −(y − ŷ)x

That means the weight update for gradient descent is:

w = w − η∇w L    (∇w L is just shorthand for dL/dw; move in the direction of the negative gradient)
Gradient Descent (world's smallest perceptron)

For each sample {xᵢ, yᵢ}:

ŷ = wxᵢ    Lᵢ = ½(yᵢ − ŷ)²

dLᵢ/dw = −(yᵢ − ŷ)xᵢ = ∇w Lᵢ

w = w − η∇w Lᵢ
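The loop above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the slides; the training data, learning rate, and epoch count are made-up values:

```python
# Gradient descent for the world's smallest perceptron: y_hat = w * x
# Per-sample loss: L_i = 0.5 * (y_i - y_hat)^2, so dL_i/dw = -(y_i - y_hat) * x_i
def train_smallest_perceptron(samples, w=0.0, lr=0.1, epochs=50):
    for _ in range(epochs):
        for x_i, y_i in samples:
            y_hat = w * x_i                # forward pass
            grad_w = -(y_i - y_hat) * x_i  # gradient of the loss w.r.t. w
            w = w - lr * grad_w            # move in direction of negative gradient
    return w

# Data generated from y = 3x, so w should converge toward 3
samples = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w = train_smallest_perceptron(samples)
```

Because the model is linear and the data are noise-free, the learned w matches the generating slope almost exactly.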
Training the world's smallest perceptron: now a function of TWO parameters!

ŷ = w₁x₁ + w₂x₂

[Diagram: inputs x₁, x₂ → weights w₁, w₂ → output y]
Gradient Descent

For each sample {xᵢ, yᵢ}, we just need to compute the partial derivatives for this network.
Back-Propagation

∂L/∂w₁ = ∂/∂w₁ {½(y − ŷ)²}
= −(y − ŷ) ∂ŷ/∂w₁
= −(y − ŷ) ∂(Σᵢ wᵢxᵢ)/∂w₁
= −(y − ŷ) ∂(w₁x₁)/∂w₁
= −(y − ŷ)x₁ = ∇w₁

∂L/∂w₂ = ∂/∂w₂ {½(y − ŷ)²}
= −(y − ŷ) ∂ŷ/∂w₂
= −(y − ŷ) ∂(Σᵢ wᵢxᵢ)/∂w₂
= −(y − ŷ) ∂(w₂x₂)/∂w₂
= −(y − ŷ)x₂ = ∇w₂

Why do we have partial derivatives now? Because the loss is now a function of more than one parameter.
Gradient Update

w₁ = w₁ − η∇w₁ = w₁ + η(y − ŷ)x₁
w₂ = w₂ − η∇w₂ = w₂ + η(y − ŷ)x₂
Gradient Descent

For each sample {xᵢ, yᵢ}:

ŷ = fMLP(xᵢ; θ)    Lᵢ = ½(yᵢ − ŷ)²

∇w₁ = −(yᵢ − ŷ)x₁ᵢ    ∇w₂ = −(yᵢ − ŷ)x₂ᵢ    (two back-propagation lines now)

w₁ = w₁ + η(yᵢ − ŷ)x₁ᵢ    w₂ = w₂ + η(yᵢ − ŷ)x₂ᵢ

(η is an adjustable step size; the gradients are approximated from a stochastic sample)
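The two-parameter version differs only in that each weight gets its own partial derivative and update line. A minimal sketch with made-up data and step size:

```python
# Gradient descent for y_hat = w1*x1 + w2*x2 with loss L = 0.5*(y - y_hat)^2
def train_two_weight_perceptron(samples, w1=0.0, w2=0.0, lr=0.05, epochs=200):
    for _ in range(epochs):
        for (x1, x2), y in samples:
            y_hat = w1 * x1 + w2 * x2
            grad_w1 = -(y - y_hat) * x1   # dL/dw1
            grad_w2 = -(y - y_hat) * x2   # dL/dw2
            w1 -= lr * grad_w1            # two back-propagation lines now
            w2 -= lr * grad_w2
    return w1, w2

# Data generated from y = 2*x1 - x2
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0),
           ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
w1, w2 = train_two_weight_perceptron(samples)
```

Since a weight vector exists that fits every sample exactly, the per-sample updates contract toward it and (w1, w2) recovers (2, −1).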
We haven’t seen a lot of ‘propagation’ yet because our perceptrons only had one layer…
Now a function of FOUR parameters and FOUR layers!

[Diagram: input x → weight w₁ with bias b₁ → hidden unit h₁ → weight w₂ → hidden unit h₂ → weight w₃ → output y]
[Network diagram: input x (layer 1) → weight w₁ with bias b₁ → sum a₁ → activation f₁ (hidden layer 2) → weight w₂ → sum a₂ → activation f₂ (hidden layer 3) → weight w₃ → sum a₃ → activation f₃ → output y (layer 4)]
The entire network can be written out as one long equation:

ŷ = f₃(w₃ · f₂(w₂ · f₁(w₁x + b₁)))

What is known? The input x and the target output y (the training samples).
What is unknown? The weights w₁, w₂, w₃ and the bias b₁.

We need to train the network:
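That one long equation is easy to write down directly. A sketch, assuming sigmoid activations at every layer and a bias only at the first layer (as the diagram suggests); the parameter values in the usage line are made up:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w1, w2, w3, b1):
    # The whole 4-layer network as one nested expression:
    # y_hat = f3( w3 * f2( w2 * f1( w1*x + b1 ) ) )
    return sigmoid(w3 * sigmoid(w2 * sigmoid(w1 * x + b1)))

y_hat = forward(0.5, w1=0.2, w2=-0.4, w3=0.7, b1=0.1)  # a value in (0, 1)
```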
(the activation functions sometimes have parameters of their own)

Given a set of samples {xᵢ, yᵢ} and an MLP, estimate the parameters θ of the MLP.
Stochastic Gradient Descent

For each random sample {xᵢ, yᵢ}:

ŷ = fMLP(xᵢ; θ)

Compute the vector of parameter partial derivatives ∂L/∂θ, then apply the vector of parameter update equations: θ ← θ − η∇θL
So we need to compute the partial derivatives
[Diagram: x → w₁ → a₁ → f₁ → w₂ → a₂ → f₂ → w₃ → a₃ → f₃ → ŷ, with bias b₁ at layer 1]

The partial derivative ∂L/∂w₁ describes how this weight affects the loss (at the loss layer). So, how do you compute it?
Remember:

[Diagram: rest of the network → f₂ → weight w₃ → sum a₃ → activation f₃ = ŷ → loss L(y, ŷ)]

Intuitively, the effect of the weight on the loss function: L depends on f₃, f₃ depends on a₃, and a₃ depends on w₃. According to the chain rule:

∂L/∂w₃ = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂w₃
[Diagram: rest of the network → f₂ → weight w₃ → sum a₃ → activation f₃ = ŷ → loss L(y, ŷ)]

∂L/∂w₃ = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂w₃
= −(y − ŷ) · ∂f₃/∂a₃ · ∂a₃/∂w₃    (just the partial derivative of the L2 loss, since f₃ = ŷ)
= −(y − ŷ) · f₃(1 − f₃) · ∂a₃/∂w₃    (let's use a sigmoid activation: ds(x)/dx = s(x)(1 − s(x)))
= −(y − ŷ) · f₃(1 − f₃) · f₂    (since a₃ = w₃f₂)
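This chain-rule expression can be checked numerically against a finite-difference approximation of the same loss. A sketch assuming a sigmoid activation at the output; the values of y, f₂, and w₃ are arbitrary test values, not from the slides:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dL_dw3(y, f2, w3):
    # Chain rule: dL/dw3 = dL/df3 * df3/da3 * da3/dw3
    f3 = sigmoid(w3 * f2)                    # f3 = y_hat = s(a3), with a3 = w3*f2
    return -(y - f3) * f3 * (1.0 - f3) * f2  # -(y - y_hat) * s'(a3) * f2

def loss(y, f2, w3):
    # L = 0.5 * (y - y_hat)^2 with y_hat = sigmoid(w3 * f2)
    return 0.5 * (y - sigmoid(w3 * f2)) ** 2

# Central finite difference agrees with the analytic chain-rule gradient
y, f2, w3, eps = 1.0, 0.8, 0.3, 1e-6
numeric = (loss(y, f2, w3 + eps) - loss(y, f2, w3 - eps)) / (2 * eps)
```

Comparing `dL_dw3(y, f2, w3)` with `numeric` is a standard sanity check for hand-derived gradients.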
[Diagram: x → w₁ → a₁ → f₁ → w₂ → a₂ → f₂ → w₃ → a₃ → f₃ → ŷ, with bias b₁ at layer 1]

∂L/∂f₃ and ∂f₃/∂a₃ were already computed for ∂L/∂w₃: re-use (propagate) them!
The chain rule says: L depends on f₃, f₃ depends on a₃, a₃ depends on f₂, f₂ depends on a₂, a₂ depends on f₁, f₁ depends on a₁, and a₁ depends on w₁ (and on b₁). Terms already computed for earlier gradients can be re-used (propagated)!
[Diagram: x → w₁ → a₁ → f₁ → w₂ → a₂ → f₂ → w₃ → a₃ → f₃ → ŷ, with bias b₁ at layer 1]

∂L/∂w₃ = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂w₃
∂L/∂w₂ = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂f₂ · ∂f₂/∂a₂ · ∂a₂/∂w₂
∂L/∂w₁ = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂f₂ · ∂f₂/∂a₂ · ∂a₂/∂f₁ · ∂f₁/∂a₁ · ∂a₁/∂w₁
∂L/∂b = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂f₂ · ∂f₂/∂a₂ · ∂a₂/∂f₁ · ∂f₁/∂a₁ · ∂a₁/∂b
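The four chain-rule products share their leading factors, which is exactly what back-propagation exploits: compute each ∂L/∂aₖ once and re-use it for everything deeper. A sketch under the same assumptions as before (sigmoid activations everywhere, bias only at layer 1):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_backward(x, y, w1, w2, w3, b1):
    # Forward pass
    a1 = w1 * x + b1; f1 = sigmoid(a1)
    a2 = w2 * f1;     f2 = sigmoid(a2)
    a3 = w3 * f2;     f3 = sigmoid(a3)        # f3 = y_hat
    # Backward pass: each delta re-uses (propagates) the previous one
    delta3 = -(y - f3) * f3 * (1 - f3)        # dL/da3
    grad_w3 = delta3 * f2                     # dL/dw3 = dL/da3 * da3/dw3
    delta2 = delta3 * w3 * f2 * (1 - f2)      # dL/da2, re-uses delta3
    grad_w2 = delta2 * f1                     # dL/dw2
    delta1 = delta2 * w2 * f1 * (1 - f1)      # dL/da1, re-uses delta2
    grad_w1 = delta1 * x                      # dL/dw1
    grad_b1 = delta1                          # dL/db1, since da1/db1 = 1
    return grad_w1, grad_w2, grad_w3, grad_b1
```

Each `delta` is one of the shared prefixes in the equations above, so the cost of all four gradients is a single backward sweep rather than four separate products.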
Stochastic Gradient Descent

For each random sample {xᵢ, yᵢ}:

ŷ = fMLP(xᵢ; θ)    Lᵢ = ½(yᵢ − ŷ)²

Vector of parameter partial derivatives ∂L/∂θ:

∂L/∂w₃ = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂w₃
∂L/∂w₂ = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂f₂ · ∂f₂/∂a₂ · ∂a₂/∂w₂
∂L/∂w₁ = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂f₂ · ∂f₂/∂a₂ · ∂a₂/∂f₁ · ∂f₁/∂a₁ · ∂a₁/∂w₁
∂L/∂b = ∂L/∂f₃ · ∂f₃/∂a₃ · ∂a₃/∂f₂ · ∂f₂/∂a₂ · ∂a₂/∂f₁ · ∂f₁/∂a₁ · ∂a₁/∂b

Vector of parameter update equations, θ ← θ − η∇θL:

w₃ = w₃ − η∇w₃    w₂ = w₂ − η∇w₂    w₁ = w₁ − η∇w₁    b = b − η∇b