Lecture #03 – Multi-layer Perceptrons
Aykut Erdem // Hacettepe University // Spring 2018
CMP784
DEEP LEARNING
Image: Jose-Luis Olivares
Breaking news!
— Practical 1 is out! Learning neural word embeddings
— Due Friday, Mar. 16, 23:59:59
Paper presentations start next week!
− Discuss your slides with me 3-4 days prior to your presentation.
− Submit your final slides by the night before the class.
− We don't have any code walker or demonstrator.
Previously on CMP784
Lecture overview
Disclaimer: Much of the material and slides for this lecture were borrowed from:
— Hugo Larochelle's Neural networks slides
— Nick Locascio's MIT 6.S191 slides
— Efstratios Gavves and Max Welling's UvA deep learning class
— Leonid Sigal's CPSC532L class
— Richard Socher's CS224d class
— Dan Jurafsky's CS124 class
A Brief History of Neural Networks
[Timeline figure spanning the history of neural networks up to today. Image: VUNI Inc.]
The Perceptron
[Diagram: inputs x0, x1, x2, …, xn with weights w0, w1, w2, …, wn and a bias input (1) with weight b feed a weighted sum ∑, followed by a non-linearity.]
Perceptron Forward Pass
Pre-activation (or input activation):
a(x) = b + ∑_i w_i x_i = b + wᵀx
Output activation:
h(x) = g(a(x)) = g(b + ∑_i w_i x_i)
where
— w are the weights (parameters)
— b is the bias term
— g(·) is called the activation function
Output Activation of the Neuron
h(x) = g(a(x)) = g(b + ∑_i w_i x_i)
— The range of the output is determined by g(·)
— The bias only shifts the position of the threshold
(from Pascal Vincent's slides) Image credit: Pascal Vincent
Linear Activation Function
g(a) = a
h(x) = g(b + ∑_i w_i x_i) = b + ∑_i w_i x_i
— Performs no squashing: the neuron simply outputs its pre-activation
Sigmoid Activation Function
g(a) = σ(a) = 1 / (1 + exp(−a))
— Squashes the neuron's output between 0 and 1
— Always positive
— Bounded
— Strictly increasing
Perceptron Forward Pass: Example
[Diagram: inputs 2, 3, −1, 5 and a bias input 1, with weights 0.1, 0.5, 2.5, 0.2 and bias weight 3.0, feeding a weighted sum and a sigmoid non-linearity.]
Weighted sum:
a = (2 × 0.1) + (3 × 0.5) + (−1 × 2.5) + (5 × 0.2) + (1 × 3.0) = 3.2
Output:
h(x) = g(3.2) = σ(3.2) = 1 / (1 + e^(−3.2)) ≈ 0.96
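The forward pass above can be sketched in a few lines of Python (the function names are illustrative, not from the lecture):

```python
import math

def perceptron_forward(x, w, b, g):
    """Weighted sum of inputs plus bias, passed through the activation g."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + b
    return g(a)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# The slide's example: inputs [2, 3, -1, 5], weights [0.1, 0.5, 2.5, 0.2], bias weight 3.0
out = perceptron_forward([2, 3, -1, 5], [0.1, 0.5, 2.5, 0.2], 3.0, sigmoid)
print(out)  # ≈ 0.96, matching the slide
```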
Hyperbolic Tangent (tanh) Activation Function
g(a) = tanh(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a)) = (exp(2a) − 1) / (exp(2a) + 1)
— Squashes the neuron's output between −1 and 1
— Bounded
— Strictly increasing
Rectified Linear (ReLU) Activation Function
g(a) = reclin(a) = max(0, a)
— Bounded below by 0 (always non-negative)
— Not upper bounded
— Monotonically increasing
— Tends to produce units with sparse activities
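The three non-linear activations covered so far can be written directly from their definitions (a minimal sketch; `math.tanh` could of course be used instead of the explicit formula):

```python
import math

def sigmoid(a):
    # squashes to (0, 1); always positive; strictly increasing
    return 1.0 / (1.0 + math.exp(-a))

def tanh(a):
    # squashes to (-1, 1); equals (e^(2a) - 1) / (e^(2a) + 1)
    return (math.exp(2 * a) - 1) / (math.exp(2 * a) + 1)

def relu(a):
    # non-negative, unbounded above; yields sparse activations
    return max(0.0, a)
```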
Decision Boundary of a Neuron
—with sigmoid, one can interpret neuron as estimating p(y = 1 | x) —also known as logistic regression classifier —if activation is greater than 0.5, predict 1 —otherwise predict 0
Same idea can be applied to a tanh activation
17Image credit: Pascal Vincent
(from Pascal Vincent’s slides)
han
Decision boundary is linear
Capacity of Single Neuron
A single neuron can implement linearly separable Boolean functions:
[Figure: truth tables and linear decision boundaries for OR(x1, x2) and AND(x1, x2); in each case a single line separates the 0s from the 1s.]
Capacity of Single Neuron
Can a single neuron implement XOR(x1, x2)?
[Figure: the XOR(x1, x2) truth table; no single line separates the two classes, so XOR is not linearly separable. It can, however, be obtained by combining AND-like units such as AND(x1, x̄2) and AND(x̄1, x2).]
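The capacity argument can be checked concretely. Below is a sketch with hand-picked (illustrative) weights: one threshold neuron suffices for AND and OR, while XOR needs a hidden layer:

```python
def neuron(w, b, x):
    """A single threshold neuron: fires iff the weighted sum plus bias exceeds 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Linearly separable functions: one neuron each (weights are illustrative choices)
def OR(x1, x2):  return neuron([1, 1], -0.5, [x1, x2])
def AND(x1, x2): return neuron([1, 1], -1.5, [x1, x2])

# XOR is NOT linearly separable: no single neuron computes it,
# but a tiny two-layer network does (XOR = OR and not-AND)
def XOR(x1, x2):
    h1 = OR(x1, x2)
    h2 = 1 - AND(x1, x2)                   # NAND via a second hidden unit
    return neuron([1, 1], -1.5, [h1, h2])  # AND of the two hidden units

for a in (0, 1):
    for b in (0, 1):
        print(a, b, XOR(a, b))  # prints 0, 1, 1, 0 for the four input pairs
```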
Perceptron Diagram Simplified
[Diagram: the weights, bias, sum and non-linearity are collapsed into a single node; only the inputs x0, x1, x2, …, xn and the output are drawn.]
Multi-Output Perceptron
— We need multiple outputs (1 output per class)
— We need to estimate the conditional probability p(y = c | x)
— Discriminative learning
Use the softmax activation function at the output:
o(a) = softmax(a) = [ exp(a1) / ∑_c exp(a_c), …, exp(aC) / ∑_c exp(a_c) ]ᵀ
— Strictly positive
— Sums to one
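A direct transcription of the softmax formula (with the standard max-shift for numerical stability, which leaves the result unchanged):

```python
import math

def softmax(a):
    """o(a)_c = exp(a_c) / sum_c' exp(a_c'); shifted by max(a) for numerical stability."""
    m = max(a)
    exps = [math.exp(ac - m) for ac in a]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# strictly positive and sums to one -> a valid distribution over classes
```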
Single Hidden Layer Neural Network
[Diagram: inputs x1, …, xn (plus bias unit x0) feed a hidden layer h1, …, hn (plus bias unit h0), which feeds the output layer.]
Hidden layer pre-activation:
a(x) = b^(1) + W^(1) x    (i.e., a(x)_i = b_i^(1) + ∑_j W_{i,j}^(1) x_j)
Hidden layer activation:
h(x) = g(a(x))
Output layer activation:
f(x) = o( b^(2) + w^(2)ᵀ h^(1)(x) )
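The three equations above translate line-for-line into a NumPy forward pass (a sketch with sigmoid chosen for both g and o, and randomly initialized parameters):

```python
import numpy as np

def single_hidden_layer_forward(x, W1, b1, w2, b2):
    """f(x) = o(b2 + w2^T h(x)) with h(x) = g(b1 + W1 x), g = o = sigmoid."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    a1 = b1 + W1 @ x              # hidden pre-activation: a(x) = b^(1) + W^(1) x
    h1 = sigmoid(a1)              # hidden activation:     h(x) = g(a(x))
    return sigmoid(b2 + w2 @ h1)  # output activation

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
w2, b2 = rng.normal(size=4), 0.0
y = single_hidden_layer_forward(x, W1, b1, w2, b2)  # a scalar in (0, 1)
```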
Multi-Layer Perceptron (MLP)
Could have L hidden layers:
— layer pre-activation for k > 0:  a^(k)(x) = b^(k) + W^(k) h^(k−1)(x)   (with h^(0)(x) = x)
— hidden layer activation for k from 1 to L:  h^(k)(x) = g(a^(k)(x))
— output layer activation (k = L + 1):  h^(L+1)(x) = o(a^(L+1)(x)) = f(x)
Deep Neural Network
[Diagram: inputs feed multiple stacked hidden layers, which feed the output layer.]
Capacity of Neural Networks
[Figure: a single hidden layer combines several linear decision boundaries; labels (translated from French): input (entrée) i, hidden (cachée) j, output (sortie) k, bias (biais), with weights w_ji from input to hidden and w_kj from hidden to output.]
(from Pascal Vincent's slides) Image credit: Pascal Vincent
Capacity of Neural Networks
[Figure: with enough hidden units y1, …, y4, the network output z can carve out increasingly complex decision regions by combining the hidden units' linear boundaries.]
(from Pascal Vincent's slides) Image credit: Pascal Vincent
Universal Approximation
— Universal approximation theorem: "a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units"
— This is an existence result: it says nothing about how to find the necessary parameter values.
Example Problem: Will my Flight be Delayed?
Input features: Temperature: −20 °F, Wind Speed: 45 mph → x = [−20, 45]
[Diagram: the two inputs x0, x1 feed a hidden layer h0, h1, h2, which produces a single output.]
Predicted: 0.05
Actual: 1
Quantifying Loss
For the example above, the network predicted 0.05 but the actual label was 1. The per-example loss compares the prediction with the ground truth:
ℓ( f(x^(i); θ), y^(i) )
where f(x^(i); θ) is the predicted value and y^(i) is the actual label.
Total Loss
The total loss averages the per-example losses over the dataset:
J(θ) = (1/N) ∑_i ℓ( f(x^(i); θ), y^(i) )
Input: [ [−20, 45], [80, 0], [4, 15], [45, 60] ]
Predicted: [ 0.05, 0.02, 0.96, 0.35 ]
Actual: [ 1, 1, 1 ]
Binary Cross Entropy Loss
J_cross-entropy(θ) = −(1/N) ∑_i [ y^(i) log f(x^(i); θ) + (1 − y^(i)) log(1 − f(x^(i); θ)) ]
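The formula can be computed directly. A minimal sketch, using single-example lists to show why a confident wrong prediction (0.05 when the true label is 1) is penalized far more than a confident correct one:

```python
import math

def binary_cross_entropy(y_pred, y_true):
    """J = -(1/N) * sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]"""
    n = len(y_pred)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(y_pred, y_true)) / n

loss_good = binary_cross_entropy([0.96], [1])  # -log(0.96) ≈ 0.04
loss_bad  = binary_cross_entropy([0.05], [1])  # -log(0.05) ≈ 3.00
```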
Mean Squared Error Loss
J_MSE(θ) = (1/N) ∑_i ( f(x^(i); θ) − y^(i) )²
In general, training minimizes an average loss plus a regularization term:
arg min_θ  (1/T) ∑_t ℓ( f(x^(t); θ), y^(t) )  +  λ Ω(θ)
           └── Loss function ──┘               └ Regularizer ┘
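The regularized objective can be sketched as below, assuming (for illustration) an L2 regularizer Ω(θ) = ‖θ‖²; the slide leaves Ω(θ) generic:

```python
def regularized_objective(losses, theta, lam):
    """(1/T) * sum_t loss_t + lambda * Omega(theta), with Omega = squared L2 norm."""
    data_term = sum(losses) / len(losses)  # average loss over the T examples
    omega = sum(t * t for t in theta)      # L2 regularizer Omega(theta) = ||theta||^2
    return data_term + lam * omega

J = regularized_objective([0.5, 1.5], [1.0, -2.0], lam=0.1)
# data term = 1.0, omega = 5.0  ->  J = 1.0 + 0.1 * 5.0 = 1.5
```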
Training
— For classification problems, we would like to minimize the classification error
— The loss function can sometimes be viewed as a surrogate for what we actually want to optimize (e.g., an upper bound)
Training Neural Networks: Objective
We want to minimize the loss function: θ* = arg min_θ J(θ)
Loss is a function of the model's parameters.
How to minimize the loss?
1. Start at a random point θ
2. Compute the gradient ∂J(θ)/∂θ
3. Move in the direction opposite the gradient
4. Repeat!
This is called Stochastic Gradient Descent (SGD).
(MIT 6.S191 | Intro to Deep Learning | IAP 2017)
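The start-compute-move-repeat loop can be sketched in a few lines; here applied to a toy one-dimensional objective J(θ) = (θ − 3)² whose minimum and gradient are known in closed form:

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Start at an initial point, move opposite the gradient, repeat."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)  # theta <- theta - eta * dJ/dtheta
    return theta

# J(theta) = (theta - 3)^2 has gradient 2 * (theta - 3) and minimum at theta = 3
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=-5.0)
print(theta_star)  # ≈ 3.0
```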
Stochastic Gradient Descent (SGD)
○ For each training example (x, y):
  ■ Compute the loss gradient: ∇_θ ℒ
  ■ Update θ with the update rule: θ^(t+1) = θ^(t) − η_t ∇_θ ℒ
Why is it Stochastic Gradient Descent?
— The single-example gradient is only an estimate of the true gradient!
Mini-batch version (more accurate estimate):
○ For each training batch {(x0, y0), …, (xB, yB)}:
  ■ Compute the loss gradient averaged over the batch: ∇_θ ℒ = (1/B) ∑_b ∇_θ ℒ_b
  ■ Update θ with the update rule: θ^(t+1) = θ^(t) − η_t ∇_θ ℒ
(MIT 6.S191 | Intro to Deep Learning | IAP 2017)
Minibatches Reduce Gradient Variance
Advantages:
⎯ More accurate estimate of the gradient
⎯ Smoother convergence
⎯ Allows for larger learning rates
⎯ Computation can be parallelized, achieving significant speed-ups on GPUs
Training epoch = one iteration over all examples
Stochastic Gradient Descent (SGD)
Algorithm to train a neural network:
— Initialize θ (θ ≡ {W^(1), b^(1), …, W^(L+1), b^(L+1)})
— For N iterations:
  — For each training example (or batch) (x^(t), y^(t)):
    Δ = −∇_θ ℓ( f(x^(t); θ), y^(t) ) − λ ∇_θ Ω(θ)
    θ ← θ + η Δ    (equivalently, θ^(t+1) = θ^(t) − η_t ∇_θ ℒ)
To apply this algorithm, we need:
— the loss function ℓ( f(x^(t); θ), y^(t) )
— a procedure to compute the parameter gradients ∇_θ ℓ
— the regularizer Ω(θ) (and its gradient ∇_θ Ω(θ))
— an initialization method
Training epoch = one iteration over all examples
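Putting the pieces together, the SGD algorithm above can be sketched for the simplest trainable "network", a single logistic neuron with the cross-entropy loss and no regularizer (the dataset and hyperparameters are illustrative):

```python
import math
import random

def sgd_logistic(data, lr=0.5, epochs=200, seed=0):
    """SGD on binary cross-entropy for a 2-input logistic neuron.
    One epoch = one iteration over all examples."""
    rng = random.Random(seed)
    w, b = [rng.gauss(0, 0.1), rng.gauss(0, 0.1)], 0.0    # initialize theta
    for _ in range(epochs):
        for x, y in data:                                 # for each training example
            p = 1.0 / (1.0 + math.exp(-(w[0]*x[0] + w[1]*x[1] + b)))
            err = p - y                # gradient of BCE w.r.t. the pre-activation
            w[0] -= lr * err * x[0]    # theta <- theta - eta * gradient
            w[1] -= lr * err * x[1]
            b    -= lr * err
    return w, b

# Learn the (linearly separable) AND function
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = sgd_logistic(data)
```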
What is a neural network again?
A neural network is a hierarchical composition of parametric functions:
aL(x; θ1,...,L) = hL(hL−1(. . . h1(x, θ1), θL−1), θL)
Training finds the parameters that minimize the loss over the data:
θ∗ ← arg min_θ ∑_{(x,y)∈(X,Y)} ℓ( y, aL(x; θ1,...,L) )
Neural network models
[Diagram: Input → h1(xi; θ) → h2(xi; θ) → h3(xi; θ) → h4(xi; θ) → h5(xi; θ) → Loss]
Forward connections (Feedforward architecture)
Neural network models
[Diagram: modules h1(xi; θ) through h5(xi; θ) between Input and Loss, with additional skip connections between non-adjacent modules.]
Interweaved connections (Directed Acyclic Graph architecture – DAGNN)
Neural network models
[Diagram: modules h1(xi; θ) through h5(xi; θ) between Input and Loss, with a backward edge forming a loop.]
Loopy connections (Recurrent architecture, special care needed)
Neural network models
Functions → Modules
[Diagrams: the same three architectures (feedforward, DAG, recurrent), now viewed as compositions of modules h1(xi; θ), …, h5(xi; θ) between Input and Loss.]
What is a module
⎯ A module contains trainable parameters (θ)
⎯ It receives an input x as an argument
⎯ And returns an output a based on its activation function h(...)
The activation function should be differentiable (almost) everywhere
For fast computations, each module should store its input:
⎯ easy to get the module output fast
⎯ easy to compute the derivatives
[Diagram: modules h1(xi; θ) through h5(xi; θ) between Input and Loss.]
Anything goes, or do special constraints exist?
⎯ A neural network is a composition of modules that implements complicated computations
⎯ Modules must be connected in a directed acyclic graph, or form recurrent connections (revisited later)
What is a module
⎯ A neural network is a directed graph of modules h_l(...); the forward pass executes them in the right (topological) order
[Diagram: Input → h1(xi; θ) → … → h5(xi; θ) → Loss]
The output of one module is the input to the next:
a_l = h_l(x_l; θ_l),   x_{l+1} = a_l   (equivalently, x_l = a_{l−1})
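The module view can be sketched directly: each module stores its input (as required above) and the network forward pass simply chains them in order. The class and function names are illustrative, not from the lecture:

```python
import numpy as np

class LinearModule:
    """A module: holds parameters theta = (W, b) and maps input x_l to output a_l."""
    def __init__(self, W, b):
        self.W, self.b = W, b
    def forward(self, x):
        self.x = x                  # store the module input (needed later for gradients)
        return self.W @ x + self.b  # a_l = h_l(x_l; theta_l)

def network_forward(modules, x):
    """Execute the modules in order: x_{l+1} = a_l = h_l(x_l; theta_l)."""
    for m in modules:
        x = m.forward(x)
    return x

rng = np.random.default_rng(0)
mods = [LinearModule(rng.normal(size=(4, 3)), np.zeros(4)),
        LinearModule(rng.normal(size=(2, 4)), np.zeros(2))]
out = network_forward(mods, rng.normal(size=3))  # shape (2,)
```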
What is a module
⎯ To train the network for our data, we need the gradients of each module h_l(x_l; θ_l) w.r.t. its inputs x_l and parameters θ_l
⎯ We take the computational graph and traverse it backwards, computing the modules' gradients one by one
⎯ This is done with the backpropagation algorithm
[Diagram: the module graph h1(xi; θ), …, h5(xi; θ) traversed in reverse, from dLoss(Input) back to the input.]
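Backward traversal can be sketched on the smallest possible chain: a linear module followed by a sigmoid module and a squared-error loss. The backward pass applies the chain rule module by module, reusing the stored input x (the setup is an illustrative toy, not the lecture's):

```python
import numpy as np

def forward_backward(W, x, y):
    """Chain: a1 = W x, a2 = sigmoid(a1), loss = 0.5 * ||a2 - y||^2.
    The backward pass traverses the chain in reverse, applying the chain rule."""
    a1 = W @ x                           # module 1 forward
    a2 = 1.0 / (1.0 + np.exp(-a1))       # module 2 forward (sigmoid)
    loss = 0.5 * np.sum((a2 - y) ** 2)
    d_a2 = a2 - y                        # dLoss/da2
    d_a1 = d_a2 * a2 * (1.0 - a2)        # through sigmoid: da2/da1 = a2 (1 - a2)
    d_W = np.outer(d_a1, x)              # dLoss/dW reuses the stored module input x
    return loss, d_W

W = np.array([[0.5, -0.3]])
loss, d_W = forward_backward(W, np.array([1.0, 2.0]), np.array([1.0]))
```

A quick finite-difference check confirms the analytic gradient matches (loss2 − loss)/ε for a small perturbation of W.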
Again, what is a neural network?
aL(x; θ1,...,L) = hL(hL−1(. . . h1(x, θ1), θL−1), θL)
⎯ x: input, θl: parameters for layer l, al = hl(x, θl): (non-)linear function
Learning proceeds by gradient descent on the loss:
θ∗ ← arg min_θ ∑_{(x,y)∈(X,Y)} ℓ( y, aL(x; θ1,...,L) )
θ^(t+1) = θ^(t) − η_t ∂ℒ/∂θ^(t)
⎯ To run gradient descent, we need the gradients ∂ℒ/∂θl, l = 1, . . . , L
⎯ But how do we compute the gradients for such a complicated function enclosing other functions, like aL(...)?