CSC 411: Lecture 10: Neural Networks I
Class based on Raquel Urtasun & Rich Zemel’s lectures
Sanja Fidler
University of Toronto
Feb 10, 2016
◮ If these functions are fixed (Gaussian, sigmoid, polynomial basis functions), then we only learn the weights of their linear combination
◮ Or we can make these functions depend on additional parameters → those parameters must be learned as well → this gives a neural network
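To make the contrast concrete, here is a rough sketch (not from the slides; the number of features and the Gaussian centers are arbitrary choices) of fixed basis functions versus basis functions with their own learnable parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))      # toy 1-D inputs

# Fixed basis: Gaussian bumps at preset centers; only the mixing weights w are learned.
centers = np.linspace(-3, 3, 10)
phi = np.exp(-(x - centers) ** 2)          # (100, 10) fixed features
w = rng.normal(size=10)
y_fixed = phi @ w

# Adaptive basis: the features sigmoid(x V + c) have their own parameters V, c,
# which must be learned along with w; this is a one-hidden-layer neural network.
V = rng.normal(size=(1, 10))
c = np.zeros(10)
h = 1.0 / (1.0 + np.exp(-(x @ V + c)))     # (100, 10) learnable features
y_adaptive = h @ w
```

In the second case the feature parameters V and c must be learned jointly with w, which is exactly what training a neural network does.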
[Pic credit: http://cs231n.github.io/neural-networks-1/]
Two common activation functions:
$$\text{sigmoid:}\quad g(z) = \frac{1}{1+\exp(-z)} \qquad\qquad \text{tanh:}\quad g(z) = \frac{\exp(z)-\exp(-z)}{\exp(z)+\exp(-z)}$$
[http://cs231n.github.io/neural-networks-1/]
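A direct NumPy transcription of these two nonlinearities (a minimal sketch; the function names are ours):

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # squashes any real number into (-1, 1)
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

print(sigmoid(0.0), tanh(0.0))   # 0.5 0.0
```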
Example: a feed-forward network with one layer of input units, 4 hidden units and 2 output units
◮ One layer of hidden units
◮ One output layer
[http://cs231n.github.io/neural-networks-1/]

Example: a deeper feed-forward network with more than one hidden layer and 1 output unit
◮ N − 1 layers of hidden units
◮ One output layer
[http://cs231n.github.io/neural-networks-1/]
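To make the layer sizes concrete, here is a sketch of the corresponding weight shapes and a single forward pass (the choice of 3 input units and of sigmoid hidden units is an assumption for illustration; the slides only fix 4 hidden and 2 output units):

```python
import numpy as np

D, J, K = 3, 4, 2                 # input, hidden, output units (D = 3 assumed)
rng = np.random.default_rng(0)

V = rng.normal(size=(J, D))       # input-to-hidden weights, one row per hidden unit
b1 = np.zeros(J)
W = rng.normal(size=(K, J))       # hidden-to-output weights, one row per output unit
b2 = np.zeros(K)

x = rng.normal(size=D)                     # a single input vector
h = 1.0 / (1.0 + np.exp(-(V @ x + b1)))    # 4 hidden activations
o = W @ h + b2                             # 2 outputs
print(h.shape, o.shape)                    # (4,) (2,)
```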
Proof in: Approximation by Superpositions of a Sigmoidal Function, Cybenko (1989)
[Caruana, paper]
[http://cs231n.github.io/neural-networks-1/]
◮ Forward pass: performs inference
◮ Backward pass: performs learning
The forward pass computes the activations layer by layer: with $D$ input dimensions and $J$ hidden units,
$$u_j = \sum_{i=1}^{D} v_{ji}\, x_i, \qquad h_j = f(u_j), \qquad z_k = \sum_{j=1}^{J} w_{kj}\, h_j, \qquad o_k = g(z_k)$$
where $f$ and $g$ are the activation functions of the hidden and output layers.
[http://cs231n.github.io/neural-networks-1/]
We learn the weights $w$ by minimizing a loss summed over the $N$ training examples. Common per-example losses:
◮ Squared loss: $\sum_k \frac{1}{2}\left(o_k^{(n)} - t_k^{(n)}\right)^2$
◮ Cross-entropy loss: $-\sum_k t_k^{(n)} \log o_k^{(n)}$
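Both losses are one-liners; a sketch for a single example, with made-up outputs o and a one-hot target t:

```python
import numpy as np

o = np.array([0.7, 0.2, 0.1])   # network outputs for one example
t = np.array([1.0, 0.0, 0.0])   # targets (here a one-hot label)

squared_loss = 0.5 * np.sum((o - t) ** 2)
cross_entropy = -np.sum(t * np.log(o))
print(squared_loss, cross_entropy)
```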
These activation functions have convenient derivatives: for the sigmoid $g(z) = \frac{1}{1+\exp(-z)}$ we get $g'(z) = g(z)(1 - g(z))$, and for $g(z) = \frac{\exp(z)-\exp(-z)}{\exp(z)+\exp(-z)} = \tanh(z)$ we get $g'(z) = 1 - g(z)^2$. The sigmoid derivative is used repeatedly below.
Training: for each example $n$, compute the output $o^{(n)}$ under the current weights (forward pass), then use the error to update the weights.
The idea behind backpropagation:
◮ Instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities
◮ Each hidden activity can affect many output units and can therefore have many separate effects on the error; these effects must be combined
◮ We can compute error derivatives for all the hidden units efficiently
◮ Once we have the error derivatives for the hidden activities, it’s easy to get the error derivatives for the weights going into a hidden unit
Consider first a network with no hidden layer: $z_k^{(n)} = \sum_i w_{ki}\, x_i^{(n)}$, with logistic output units $o_k^{(n)} = g(z_k^{(n)})$ and squared loss. The error derivatives are

$$\frac{\partial E}{\partial o_k^{(n)}} = o_k^{(n)} - t_k^{(n)} := \delta^o_k$$

$$o_k^{(n)} = g(z_k^{(n)}) = \left(1 + \exp(-z_k^{(n)})\right)^{-1} \quad\Rightarrow\quad \frac{\partial o_k^{(n)}}{\partial z_k^{(n)}} = o_k^{(n)}\,(1 - o_k^{(n)})$$

Applying the chain rule and summing over the $N$ training examples:

$$\frac{\partial E}{\partial w_{ki}} = \sum_{n=1}^{N} \frac{\partial E}{\partial o_k^{(n)}}\,\frac{\partial o_k^{(n)}}{\partial z_k^{(n)}}\,\frac{\partial z_k^{(n)}}{\partial w_{ki}} = \sum_{n=1}^{N} \left(o_k^{(n)} - t_k^{(n)}\right) o_k^{(n)}\,(1 - o_k^{(n)})\, x_i^{(n)}$$

The gradient descent update is

$$w_{ki} \leftarrow w_{ki} - \eta\,\frac{\partial E}{\partial w_{ki}} = w_{ki} - \eta \sum_{n=1}^{N} \left(o_k^{(n)} - t_k^{(n)}\right) o_k^{(n)}\,(1 - o_k^{(n)})\, x_i^{(n)}$$
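A vectorized sketch of this gradient and update (the data, sizes, and learning rate are made up; the weights are stored as a K x D matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 100, 5, 3
X = rng.normal(size=(N, D))                            # inputs x_i^(n)
T = rng.integers(0, 2, size=(N, K)).astype(float)      # targets t_k^(n)
W = 0.01 * rng.normal(size=(K, D))                     # weights w_ki
eta = 0.1

Z = X @ W.T                                  # z_k^(n)
O = 1.0 / (1.0 + np.exp(-Z))                 # o_k^(n) = g(z_k^(n))

# dE/dW[k, i] = sum_n (o - t) * o * (1 - o) * x_i   (squared loss, logistic output)
dE_dW = ((O - T) * O * (1 - O)).T @ X

W -= eta * dE_dW                             # gradient descent step
```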
Now add a hidden layer, so that $u_j^{(n)} = \sum_i v_{ji}\, x_i^{(n)}$, $h_j^{(n)} = f(u_j^{(n)})$, $z_k^{(n)} = \sum_j w_{kj}\, h_j^{(n)}$, and $o_k^{(n)} = g(z_k^{(n)})$. Writing $\delta^{z,(n)}_k := \partial E / \partial z_k^{(n)}$, the hidden-to-output weights get

$$\frac{\partial E}{\partial w_{kj}} = \sum_{n=1}^{N} \frac{\partial E}{\partial o_k^{(n)}}\,\frac{\partial o_k^{(n)}}{\partial z_k^{(n)}}\,\frac{\partial z_k^{(n)}}{\partial w_{kj}} = \sum_{n=1}^{N} \delta^{z,(n)}_k\, h_j^{(n)}$$

Each hidden activity affects every output unit, so its error derivative sums over $k$:

$$\frac{\partial E}{\partial h_j^{(n)}} = \sum_k \frac{\partial E}{\partial o_k^{(n)}}\,\frac{\partial o_k^{(n)}}{\partial z_k^{(n)}}\,\frac{\partial z_k^{(n)}}{\partial h_j^{(n)}} = \sum_k \delta^{z,(n)}_k\, w_{kj} := \delta^{h,(n)}_j$$

For the input-to-hidden weights:

$$\frac{\partial E}{\partial v_{ji}} = \sum_{n=1}^{N} \frac{\partial E}{\partial h_j^{(n)}}\,\frac{\partial h_j^{(n)}}{\partial u_j^{(n)}}\,\frac{\partial u_j^{(n)}}{\partial v_{ji}} = \sum_{n=1}^{N} \delta^{h,(n)}_j\, f'(u_j^{(n)})\,\frac{\partial u_j^{(n)}}{\partial v_{ji}} = \sum_{n=1}^{N} \delta^{u,(n)}_j\, x_i^{(n)}$$

where $\delta^{u,(n)}_j := \delta^{h,(n)}_j\, f'(u_j^{(n)})$.
Gradient descent then updates both layers of weights:

$$w_{kj} \leftarrow w_{kj} - \eta \sum_{n=1}^{N} \delta^{z,(n)}_k\, h_j^{(n)}, \qquad v_{ji} \leftarrow v_{ji} - \eta \sum_{n=1}^{N} \delta^{u,(n)}_j\, x_i^{(n)}$$
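A batch implementation of these gradients, checked against a numerical derivative (a sketch; the sizes and data are made up, and using sigmoids for both f and g is an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, J, K = 50, 4, 6, 3
X = rng.normal(size=(N, D))
T = rng.integers(0, 2, size=(N, K)).astype(float)
V = 0.1 * rng.normal(size=(J, D))            # input-to-hidden weights v_ji
W = 0.1 * rng.normal(size=(K, J))            # hidden-to-output weights w_kj

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(V, W):
    U = X @ V.T                               # u_j^(n)
    H = sigmoid(U)                            # h_j^(n) = f(u_j)
    Z = H @ W.T                               # z_k^(n)
    O = sigmoid(Z)                            # o_k^(n) = g(z_k)
    return U, H, Z, O

U, H, Z, O = forward(V, W)
E = 0.5 * np.sum((O - T) ** 2)                # squared loss

delta_z = (O - T) * O * (1 - O)               # delta_k^{z,(n)} = dE/dz_k
dE_dW = delta_z.T @ H                         # sum_n delta_z * h_j
delta_h = delta_z @ W                         # delta_j^{h,(n)} = sum_k delta_z * w_kj
delta_u = delta_h * H * (1 - H)               # delta_j^{u,(n)} = delta_h * f'(u_j)
dE_dV = delta_u.T @ X                         # sum_n delta_u * x_i

# numerical check of one entry of dE/dV
eps = 1e-6
Vp = V.copy(); Vp[0, 0] += eps
Ep = 0.5 * np.sum((forward(Vp, W)[3] - T) ** 2)
print(dE_dV[0, 0], (Ep - E) / eps)            # the two numbers should closely agree
```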
For multi-class classification we use a softmax output layer with the cross-entropy loss:

$$E = -\sum_n \sum_k t_k^{(n)} \log o_k^{(n)}, \qquad o_k^{(n)} = \frac{\exp(z_k^{(n)})}{\sum_j \exp(z_j^{(n)})}$$

The softmax derivatives are $\partial o_k / \partial z_k = o_k(1 - o_k)$ and $\partial o_j / \partial z_k = -o_j\, o_k$ for $j \neq k$, so the chain rule gives a particularly simple error signal at the output:

$$\frac{\partial E}{\partial z_k} = \sum_j \frac{\partial E}{\partial o_j}\,\frac{\partial o_j}{\partial z_k} = o_k - t_k$$
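A quick numerical check that the softmax/cross-entropy gradient reduces to o - t (a sketch with made-up logits and a one-hot target):

```python
import numpy as np

z = np.array([2.0, -1.0, 0.5])               # logits z_k for one example
t = np.array([0.0, 1.0, 0.0])                # one-hot target t_k

o = np.exp(z) / np.sum(np.exp(z))            # softmax output o_k
E = -np.sum(t * np.log(o))                   # cross-entropy loss

grad = o - t                                 # claimed dE/dz_k

# finite-difference check
eps = 1e-6
num = np.zeros_like(z)
for k in range(len(z)):
    zp = z.copy(); zp[k] += eps
    op = np.exp(zp) / np.sum(np.exp(zp))
    num[k] = (-np.sum(t * np.log(op)) - E) / eps
print(grad, num)                             # the two vectors should closely agree
```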
The weights can be updated:
◮ after a full sweep through the training data (batch gradient descent)
◮ after each training case (stochastic gradient descent)
◮ after a mini-batch of training cases
For the step size we can:
◮ use a fixed learning rate
◮ adapt the learning rate
◮ add momentum
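A minimal sketch of mini-batch stochastic gradient descent with momentum, reusing the single-layer gradient from above (the batch size, learning rate, and momentum coefficient are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, K = 1000, 5, 3
X = rng.normal(size=(N, D))
T = (rng.random(size=(N, K)) < 0.5).astype(float)
W = 0.01 * rng.normal(size=(K, D))
velocity = np.zeros_like(W)
lr, momentum, batch_size = 0.1, 0.9, 50

for epoch in range(10):
    order = rng.permutation(N)
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]
        Xb, Tb = X[idx], T[idx]
        O = 1.0 / (1.0 + np.exp(-(Xb @ W.T)))           # forward pass on the mini-batch
        grad = ((O - Tb) * O * (1 - O)).T @ Xb / len(idx)
        velocity = momentum * velocity - lr * grad      # momentum update
        W += velocity
```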
[http://cs231n.github.io/neural-networks-3/, Alec Radford]
[http://cs231n.github.io/neural-networks-3/]
Example task: classification, e.g., mapping an input image to the label “dog”
[Picture from M. Ranzato]
A deeper network is a chain of simple modules:
$$x \;\rightarrow\; h_1 = \max(0,\, W_1^T x + b_1) \;\rightarrow\; h_2 = \max(0,\, W_2^T h_1 + b_2) \;\rightarrow\; y = W_3^T h_2 + b_3$$
◮ $x$ is the input
◮ $y$ is the output (what we want to predict)
◮ $h_i$ is the $i$-th hidden layer
◮ $W_i$ are the parameters of the $i$-th layer
The forward pass evaluates these modules in order: first $h_1$, then $h_2$, then $y$.
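This forward pass is a few lines of NumPy (a sketch; the layer widths and random weights are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
D, H1, H2, K = 8, 16, 16, 4
W1, b1 = 0.1 * rng.normal(size=(D, H1)), np.zeros(H1)
W2, b2 = 0.1 * rng.normal(size=(H1, H2)), np.zeros(H2)
W3, b3 = 0.1 * rng.normal(size=(H2, K)), np.zeros(K)

x = rng.normal(size=D)
h1 = np.maximum(0, W1.T @ x + b1)   # first hidden layer, ReLU
h2 = np.maximum(0, W2.T @ h1 + b2)  # second hidden layer, ReLU
y = W3.T @ h2 + b3                  # linear output layer (scores)
```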
The output scores $y$ are converted into class probabilities with a softmax,
$$p(\text{class } k \mid x) = \frac{\exp(y_k)}{\sum_{j=1}^{K} \exp(y_j)}$$
and the network is trained by minimizing the (cross-entropy) loss with respect to all of the parameters $w$.
Learning again uses backpropagation, now viewed as passing gradients backwards through the chain of modules:
$$\frac{\partial \ell}{\partial y} \;\rightarrow\; \frac{\partial \ell}{\partial h_2} \;\rightarrow\; \frac{\partial \ell}{\partial h_1}$$
◮ The loss gradient at the output, $\partial \ell / \partial y$, follows from the softmax/cross-entropy loss above
◮ We can compute $\partial \ell / \partial h_2$ from $\partial \ell / \partial y$ if we can compute the Jacobian of each module
◮ We can compute $\partial \ell / \partial h_1$ from $\partial \ell / \partial h_2$ in the same way, and along the way we obtain the gradients with respect to the parameters $W_3$, $W_2$, $W_1$ and the biases
Putting the pieces together: forward pass, cross-entropy loss, backward pass, and parameter update (pseudocode adapted from M. Ranzato's slides; h{0} denotes the input mini-batch, and there are nr_layers-1 weight matrices):

```matlab
% F-PROP: nonlinear hidden layers, then a final linear layer
for i = 1 : nr_layers - 2
    [h{i}, jac{i}] = nonlinearity(W{i} * h{i-1} + b{i});
end
h{nr_layers-1} = W{nr_layers-1} * h{nr_layers-2} + b{nr_layers-1};
prediction = softmax(h{nr_layers-1});

% CROSS-ENTROPY LOSS
loss = -sum(sum(log(prediction) .* target)) / batch_size;

% B-PROP: gradient w.r.t. the pre-softmax activations, then back through each layer
dh{nr_layers-1} = prediction - target;
for i = nr_layers - 1 : -1 : 1
    Wgrad{i} = dh{i} * h{i-1}';
    bgrad{i} = sum(dh{i}, 2);
    if i > 1
        dh{i-1} = (W{i}' * dh{i}) .* jac{i-1};
    end
end

% UPDATE
for i = 1 : nr_layers - 1
    W{i} = W{i} - (lr / batch_size) * Wgrad{i};
    b{i} = b{i} - (lr / batch_size) * bgrad{i};
end
```
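The same steps in NumPy, for one mini-batch update of the two-hidden-layer ReLU network above (a sketch; the data, layer sizes, and learning rate are made up, and the ReLU "Jacobian" is just the 0/1 mask of where its input is positive):

```python
import numpy as np

rng = np.random.default_rng(4)
B, D, H, K = 32, 8, 16, 4                    # batch size and layer widths
X = rng.normal(size=(B, D))
labels = rng.integers(0, K, size=B)
T = np.eye(K)[labels]                        # one-hot targets
W1, W2, W3 = (0.1 * rng.normal(size=s) for s in [(D, H), (H, H), (H, K)])
b1, b2, b3 = np.zeros(H), np.zeros(H), np.zeros(K)
lr = 0.1

# forward pass (weights stored as (in, out), so each layer is X @ W)
h1 = np.maximum(0, X @ W1 + b1)
h2 = np.maximum(0, h1 @ W2 + b2)
y = h2 @ W3 + b3
p = np.exp(y - y.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)            # softmax probabilities
loss = -np.sum(T * np.log(p)) / B

# backward pass
dy = (p - T) / B                             # dloss/dy for softmax + cross-entropy
dW3, db3 = h2.T @ dy, dy.sum(axis=0)
dh2 = (dy @ W3.T) * (h2 > 0)                 # chain through the ReLU
dW2, db2 = h1.T @ dh2, dh2.sum(axis=0)
dh1 = (dh2 @ W2.T) * (h1 > 0)
dW1, db1 = X.T @ dh1, dh1.sum(axis=0)

# update
for W, dW in [(W1, dW1), (W2, dW2), (W3, dW3)]:
    W -= lr * dW
b1 -= lr * db1; b2 -= lr * db2; b3 -= lr * db3
```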
The training data contains information about the real regularities in the mapping from input to output, but it also contains noise:
◮ The target values may be unreliable.
◮ There is sampling error: there will be accidental regularities just because of the particular training cases that were chosen.
The model cannot tell which regularities are real and which are caused by sampling error:
◮ So it fits both kinds of regularity.
◮ If the model is very flexible it can model the sampling error really well.
Use a model that has the right capacity:
◮ enough to model the true regularities
◮ not enough to also model the spurious regularities (assuming they are weaker)
Standard ways to limit the capacity of a neural net:
◮ Limit the number of hidden units.
◮ Limit the norm of the weights.
◮ Stop the learning before it has time to overfit.
One way to limit the norm of the weights is weight decay: add a penalty on the squared weights to the cost,
$$C = E + \frac{\lambda}{2}\sum_i w_i^2$$
Keeping the weights small:
◮ This can often improve generalization a lot.
◮ It helps to stop it from fitting the sampling error.
◮ It makes a smoother model in which the output changes more slowly as the input changes.
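Weight decay only adds a λw term to the gradient; a minimal sketch of the modified update (the learning rate and λ are assumed hyperparameters):

```python
import numpy as np

def weight_decay_step(W, grad_E, lr=0.1, lam=1e-3):
    """One gradient step on C = E + (lam / 2) * sum(W ** 2)."""
    grad_C = grad_E + lam * W        # dC/dW = dE/dW + lam * W
    return W - lr * grad_C

# usage: W = weight_decay_step(W, dE_dW)
```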
Divide the available data into three sets:
◮ Training data is used for learning the parameters of the model.
◮ Validation data is not used for learning but is used for deciding what settings of the hyperparameters work best.
◮ Test data is used to get a final, unbiased estimate of how well the network works.
When the weights are very small, each sigmoid hidden unit computes a nearly linear function of its inputs:
◮ So a net with a large layer of hidden units behaves like a linear network.
◮ It has no more capacity than a linear model that maps the inputs directly to the outputs.