Machine Learning 2
DS 4420 - Spring 2020
Neural Networks & backprop
Byron C Wallace
Neural Networks!

In 2020, neural networks are the dominant technology in machine learning (for better or worse)!

Today, we'll spend a day going over the fundamentals of NNs and modern libraries (we saw a preview last time, with auto-diff)!
Last time we thought in probabilistic terms and discussed maximum likelihood estimation for "generative" models.

Today we'll take the view of learning as search/optimization.

We'll start with linear models, review gradient descent, and then talk about neural nets + backprop.
The simplest loss is probably 0/1 loss:
  0 if we're correct
  1 if we're wrong

What's an algorithm that minimizes this?
Consider a simple linear model with parameters w
(assumes bias term moved into x or omitted)

[Figure: training data plotted in (x, y) space]

The learning problem is to estimate w.
What is our criterion for a good w? Minimal loss.
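Concretely, 0/1 loss on a linear model just counts sign mistakes. A minimal NumPy sketch (illustrative, not from the slides; labels assumed to be in {−1, +1}):

```python
import numpy as np

# 0/1 loss of a linear classifier sign(w . x) over a dataset.
# Sketch only: labels y in {-1, +1}; bias assumed folded into x (as on the slide).
def zero_one_loss(w, X, y):
    margins = y * (X @ w)          # positive exactly when the prediction is correct
    return np.mean(margins <= 0)   # fraction of examples we get wrong
```

Note that this loss is flat almost everywhere as a function of w, which is what makes it awkward to minimize directly.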
Algorithm 5 PerceptronTrain(D, MaxIter)
  w_d ← 0, for all d = 1 ... D                   // initialize weights
  b ← 0                                          // initialize bias
  for iter = 1 ... MaxIter do
    for all (x, y) ∈ D do
      a ← Σ_{d=1..D} w_d x_d + b                 // compute activation for this example
      if y a ≤ 0 then
        w_d ← w_d + y x_d, for all d = 1 ... D   // update weights
        b ← b + y                                // update bias
      end if
    end for
  end for
  return w_0, w_1, ..., w_D, b
Fig and Alg from CIML [Daume]
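The same loop in NumPy, as a minimal illustrative sketch (not the course's code; D is assumed to be a list of (x, y) pairs with y ∈ {−1, +1}):

```python
import numpy as np

# Sketch of PerceptronTrain: D is a list of (x, y) pairs with y in {-1, +1}.
def perceptron_train(D, max_iter):
    w = np.zeros(len(D[0][0]))       # initialize weights
    b = 0.0                          # initialize bias
    for _ in range(max_iter):
        for x, y in D:
            a = np.dot(w, x) + b     # compute activation for this example
            if y * a <= 0:           # mistake (or zero margin): update
                w = w + y * x        # update weights
                b = b + y            # update bias
    return w, b
```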
[Hand-drawn slide: sketch of the hinge loss, with annotations "wrong by .9999" and "instances are not linearly separable".]
Idea: Introduce a "smooth" loss function to make optimization easier. Example: Hinge loss.
[Figure: losses plotted as a function of the signed margin yŷ — small when we're correct, large when we're wrong.]
Zero/one:     ℓ^(0/1)(y, ŷ) = 1[yŷ ≤ 0]
Hinge:        ℓ^(hin)(y, ŷ) = max{0, 1 − yŷ}
Logistic:     ℓ^(log)(y, ŷ) = (1 / log 2) · log(1 + exp[−yŷ])
Exponential:  ℓ^(exp)(y, ŷ) = exp[−yŷ]
Squared:      ℓ^(sqr)(y, ŷ) = (y − ŷ)²
Fig and Eq’s from CIML [Daume]
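These surrogate losses are straightforward to code up; a small illustrative sketch (treating ŷ as the raw score w · x + b):

```python
import numpy as np

# Surrogate losses as functions of the label y in {-1, +1} and the raw
# score yhat = w . x + b (illustrative sketch).
def zero_one(y, yhat):    return float(y * yhat <= 0)
def hinge(y, yhat):       return max(0.0, 1.0 - y * yhat)
def logistic(y, yhat):    return np.log1p(np.exp(-y * yhat)) / np.log(2)
def exponential(y, yhat): return np.exp(-y * yhat)
def squared(y, yhat):     return (y - yhat) ** 2
```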
min_{w,b}  Σ_n ℓ(y_n, w · x_n + b) + λ R(w, b)

The regularizer R prevents w from "getting too crazy".
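For example, with hinge loss and an L2 regularizer R(w) = ½||w||², this objective can be written as (an illustrative sketch; shapes assumed: X is n×d, y ∈ {−1, +1}ⁿ):

```python
import numpy as np

# Regularized objective: sum of hinge losses plus an L2 penalty on w.
# Sketch only; lam plays the role of the trade-off parameter lambda.
def regularized_objective(w, b, X, y, lam):
    margins = y * (X @ w + b)                  # signed margins y_n (w . x_n + b)
    hinge = np.maximum(0.0, 1.0 - margins)     # per-example hinge loss
    return hinge.sum() + lam * 0.5 * np.dot(w, w)
```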
[Figure: gradient descent on a surface. Image: Gradient descent.png by Olegalexandrov, derivative work by Zerodamage, Public Domain, https://commons.wikimedia.org/w/index.php?curid=20569355]
Algorithm 21 GradientDescent(F, K, η_1, ...)
  z^(0) ← ⟨0, 0, ..., 0⟩                 // initialize variable we are optimizing
  for k = 1 ... K do
    g^(k) ← ∇_z F |_{z^(k−1)}            // compute gradient at current location
    z^(k) ← z^(k−1) − η^(k) g^(k)        // take a step down the gradient
  end for
  return z^(K)
Alg from CIML [Daume]
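A direct translation of this loop (a sketch; the gradient of F is assumed to be supplied as a function grad_F, and etas is a list of step sizes):

```python
import numpy as np

# Sketch of GradientDescent: grad_F(z) returns the gradient of F at z,
# etas[k] is the step size for iteration k (names assumed for illustration).
def gradient_descent(grad_F, dim, K, etas):
    z = np.zeros(dim)            # initialize variable we are optimizing
    for k in range(K):
        g = grad_F(z)            # compute gradient at current location
        z = z - etas[k] * g      # take a step down the gradient
    return z
```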
∇_w L = ∇_w Σ_n exp[−y_n(w · x_n + b)] + ∇_w (λ/2) ||w||²
      = Σ_n (∇_w −y_n(w · x_n + b)) exp[−y_n(w · x_n + b)] + λw
      = −Σ_n y_n x_n exp[−y_n(w · x_n + b)] + λw
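The final expression is easy to implement directly; an illustrative NumPy sketch (shapes assumed: X is n×d, y ∈ {−1, +1}ⁿ):

```python
import numpy as np

# Gradient of the regularized exponential loss, following the derivation above.
# Sketch only; X is (n, d), y is (n,) with entries in {-1, +1}.
def exp_loss_grad(w, b, X, y, lam):
    e = np.exp(-y * (X @ w + b))        # exp[-y_n (w . x_n + b)] per example
    grad_w = -(y * e) @ X + lam * w     # -sum_n y_n x_n exp[...] + lambda w
    grad_b = -np.sum(y * e)             # corresponding gradient for the bias
    return grad_w, grad_b
```

Plugging grad_w (and grad_b) into a gradient-descent loop like the one above gives a complete trainer for this model.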
[Hand-drawn figure: a two-layer network with weight matrices W1 and W2, hidden units, and output y]

Idea: Basically stack together a bunch of linear models, joined by (non-linear) activation functions. This introduces hidden units, which are neither observations (x) nor outputs (y).
The challenge: How do we update weights associated with each node in this multi-layer regime?
back-propagation = gradient descent + chain rule
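As a tiny illustration (assumed example, not from the slides) of "gradient descent + chain rule": for a single tanh hidden unit h = tanh(w · x), output ŷ = v · h, and squared loss (y − ŷ)², the chain rule gives ∂ℓ/∂w = −2(y − ŷ) · v · (1 − tanh²(w · x)) · x. The sketch below checks that hand-derived gradient against a finite-difference estimate:

```python
import numpy as np

# Chain rule by hand: h = tanh(w . x), yhat = v * h, L = (y - yhat)^2.
# dL/dw = dL/dyhat * dyhat/dh * dh/dw = -2 (y - yhat) * v * (1 - h^2) * x.
def manual_grad_w(w, v, x, y):
    h = np.tanh(np.dot(w, x))
    yhat = v * h
    return -2 * (y - yhat) * v * (1 - h**2) * x

# Finite-difference check of the same gradient.
def numeric_grad_w(w, v, x, y, eps=1e-6):
    g = np.zeros_like(w)
    for d in range(len(w)):
        wp, wm = w.copy(), w.copy()
        wp[d] += eps; wm[d] -= eps
        Lp = (y - v * np.tanh(np.dot(wp, x)))**2
        Lm = (y - v * np.tanh(np.dot(wm, x)))**2
        g[d] = (Lp - Lm) / (2 * eps)
    return g

w, x = np.array([0.1, -0.2]), np.array([1.0, 2.0])
v, y = 0.5, 1.0
print(manual_grad_w(w, v, x, y), numeric_grad_w(w, v, x, y))  # should match closely
```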
Algorithm 27 ForwardPropagation(x)
  for all input nodes u do
    h_u ← corresponding feature of x
  end for
  for all nodes v in the network whose parents are computed do
    a_v ← Σ_{u ∈ par(v)} w_(u,v) h_u
    h_v ← tanh(a_v)
  end for
  return a_y
Tanh is another common activation function
Algorithm 28 BackPropagation(x, y)
  run ForwardPropagation(x) to compute activations
  e_y ← y − a_y                                    // compute overall network error
  for all nodes v in the network whose error e_v is computed do
    for all u ∈ par(v) do
      g_(u,v) ← −e_v h_u                           // compute gradient of this edge
      e_u ← e_u + e_v w_(u,v) (1 − tanh²(a_u))     // compute the "error" of the parent node
    end for
  end for
  return all gradients g_e
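Putting the two algorithms together for the special case of one hidden tanh layer and a single linear output (an illustrative sketch, not the course's code; weight shapes are assumed):

```python
import numpy as np

# Forward pass: hidden layer h1 = tanh(W1 x), linear output ay = w2 . h1.
def forward(x, W1, w2):
    a1 = W1 @ x            # hidden pre-activations
    h1 = np.tanh(a1)       # hidden activations
    ay = w2 @ h1           # output activation (linear output node)
    return a1, h1, ay

# Backward pass mirroring Algorithm 28 (squared-error network error e_y = y - a_y).
def backward(x, y, W1, w2):
    a1, h1, ay = forward(x, W1, w2)
    ey = y - ay                        # overall network error
    g_w2 = -ey * h1                    # gradients for output-layer weights
    e1 = ey * w2 * (1 - h1**2)         # "error" passed back to each hidden node
    g_W1 = -np.outer(e1, x)            # gradients for input-to-hidden weights
    return g_W1, g_w2

# One gradient-descent step on a single example (hypothetical usage).
rng = np.random.default_rng(0)
x, y = np.array([1.0, -2.0]), 0.5
W1, w2 = rng.normal(size=(3, 2)) * 0.1, rng.normal(size=3) * 0.1
g_W1, g_w2 = backward(x, y, W1, w2)
W1 -= 0.1 * g_W1
w2 -= 0.1 * g_w2
```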
If you’re interested in learning more…