Aykut Erdem // Hacettepe University // Fall 2019
Lecture 11:
Multi-layer Perceptron Forward Pass
BBM406
Fundamentals of Machine Learning
Image: Jose-Luis Olivares
Last time… Linear Discriminant Function
y(x) = w^T x + w_0
where w is called the weight vector and w_0 is the bias. The classifier is
C(x) = sign(w^T x + w_0)
where the step function sign(·) is defined as
sign(a) = +1 if a > 0,  −1 if a < 0
For a point x on the decision surface, y(x) = 0, so w^T x / ||w|| = −w_0 / ||w||.
slide by Ce Liu
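As a small illustration (not from the slides), a minimal NumPy sketch of the discriminant y(x) = w^T x + w_0 and the classifier C(x) = sign(y(x)); the weights and the test point are made up:

```python
import numpy as np

def linear_discriminant(x, w, w0):
    # y(x) = w^T x + w0
    return np.dot(w, x) + w0

def classify(x, w, w0):
    # C(x) = sign(w^T x + w0); returns +1 or -1
    return 1 if linear_discriminant(x, w, w0) > 0 else -1

# Toy example with made-up parameters
w = np.array([2.0, -1.0])
w0 = 0.5
x = np.array([1.0, 3.0])
print(classify(x, w, w0))   # -1, since y(x) = 2 - 3 + 0.5 = -0.5 < 0
```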
Last time… Properties of Linear Discriminant Functions
[Figure: decision surface y = 0 separating regions R1 (y > 0) and R2 (y < 0), with weight vector w, a point x and its projection x⊥, offset w_0/||w|| from the origin, and signed distance y(x)/||w||]
The decision surface is perpendicular to w, and its displacement from the origin is controlled by the bias parameter w_0: the distance from the origin to the decision surface is −w_0/||w||.
The signed perpendicular distance r of a general point x from the decision surface is given by y(x)/||w||.
slide by Ce Liu
Last time… Multiple Classes: Simple Extension
One-versus-the-rest: train a classifier per class, separating points in C_k from points not in C_k.
One-versus-one: train a classifier per pair of classes (C1 vs C2, C1 vs C3, C2 vs C3, …).
[Figure: both constructions leave ambiguous regions ("?") between R1, R2, R3]
slide by Ce Liu
Last time… Multiple Classes: K-Class Discriminant
Use K linear functions
y_k(x) = w_k^T x + w_k0
and assign
C(x) = k, if y_k(x) > y_j(x) for all j ≠ k.
The decision boundary between classes C_k and C_j is given by y_k(x) = y_j(x), i.e.
(w_k − w_j)^T x + (w_k0 − w_j0) = 0
slide by Ce Liu
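A hedged sketch of the K-class rule above: stack the w_k as rows of a matrix, compute every y_k(x), and pick the largest. The class count and all numbers below are made up:

```python
import numpy as np

def k_class_discriminant(x, W, w0):
    # y_k(x) = w_k^T x + w_k0 for all k at once; W holds one w_k per row
    return W @ x + w0

def classify(x, W, w0):
    # C(x) = k such that y_k(x) > y_j(x) for all j != k
    return int(np.argmax(k_class_discriminant(x, W, w0)))

# Toy example: K = 3 classes, D = 2 inputs
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, 0.1, 0.2])
print(classify(np.array([0.5, 2.0]), W, w0))   # class 1 wins here
```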
Last time… Fisher's Linear Discriminant
A way to view a linear classification model is in terms of dimensionality reduction: project the D-dimensional input onto one dimension with y = w^T x, and choose w so that the classes can be maximally separated.
[Figure: projecting onto the difference of the class means vs. onto Fisher's linear discriminant]
Class means:
m_1 = (1/N_1) Σ_{n ∈ C1} x_n,   m_2 = (1/N_2) Σ_{n ∈ C2} x_n
Fisher's criterion:
J(w) = between-class variance / within-class variance = (w^T S_B w) / (w^T S_W w)
slide by Ce Liu
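The slide defines the criterion J(w); its maximizer (a standard result, not shown above) is proportional to S_W^{-1}(m_2 − m_1). A minimal sketch with made-up toy data:

```python
import numpy as np

def fisher_direction(X1, X2):
    # Class means m1, m2
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_W, summed over both classes
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Maximizer of J(w) = (w^T S_B w) / (w^T S_W w), up to scale
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

# Toy 2-D data (made up)
rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 0.5, size=(50, 2))
X2 = rng.normal([2, 1], 0.5, size=(50, 2))
print(fisher_direction(X1, X2))
```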
Last time… Linear classification
[Recap figures on linear classification from the previous lecture]
Interactive demo: http://vision.stanford.edu/teaching/cs231n/linear-classify-demo/
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
f(x) = Σ_i w_i x_i = ⟨w, x⟩
[Diagram: inputs x_1, x_2, x_3, …, x_n feeding a single unit through synaptic weights w_1, …, w_n]
slide by Alex Smola
Data and compute over the decades:

        1970s   1980s   1990s    2000s    2010s
Data    10^2    10^3    10^5     10^8     10^11
RAM     ?       1 MB    100 MB   10 GB    1 TB
CPU     ?       10 MF   1 GF     100 GF   1 PF (GPU)

Data grows at a higher exponent than compute; the dominant methods went from deep nets to kernel methods and back to deep nets.
slide by Alex Smola
Option 1 — Non-linear features: map the input into non-linear feature spaces and keep a linear classifier.
Option 2 — Non-linear classifiers: make the classifier itself a non-linear function of the input with parameters w, e.g., a neural network.
slide by Dhruv Batra
A biological neuron:
Cell body - combines signals; it combines the inputs from several other nerve cells
Synapse - interface and parameter store between neurons
Axon - may be up to 1 m long and transports the activation signal to neurons at different locations
slide by Alex Smola
Linear Neuron
[Diagram: inputs x_1, …, x_D with weights θ_1, …, θ_D, plus a constant input 1 with bias weight θ_0, feeding a single unit that outputs f(x, θ)]
y = θ_0 + Σ_i x_i θ_i
slide by Dhruv Batra

Perceptron
z = θ_0 + Σ_i x_i θ_i
y = 1 if z ≥ 0, and 0 otherwise
slide by Dhruv Batra

Logistic Neuron
z = θ_0 + Σ_i x_i θ_i
y = 1 / (1 + e^(−z))
The logistic output is smooth, which gives a differentiable loss function for gradient descent training; a single neuron still produces a linear decision boundary.
slide by Dhruv Batra
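A minimal sketch of the three units above, written as plain functions (the parameter values in the example are made up):

```python
import numpy as np

def linear_neuron(x, theta, theta0):
    # y = theta0 + sum_i x_i * theta_i
    return theta0 + np.dot(theta, x)

def perceptron(x, theta, theta0):
    # y = 1 if z >= 0, else 0, with z = theta0 + sum_i x_i * theta_i
    z = theta0 + np.dot(theta, x)
    return 1 if z >= 0 else 0

def logistic_neuron(x, theta, theta0):
    # y = 1 / (1 + e^(-z)): a smooth, differentiable squashing of z
    z = theta0 + np.dot(theta, x)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0])
theta, theta0 = np.array([2.0, 1.0]), 0.1   # made-up parameters
print(linear_neuron(x, theta, theta0), perceptron(x, theta, theta0),
      logistic_neuron(x, theta, theta0))
```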
Kernels vs. Deep Nets
Deep nets: y_1i(x) = σ(⟨w_1i, x⟩), y_2(x) = σ(⟨w_2, y_1⟩), and all weights are learned.
Kernels: the first layer is fixed, y_1i = k(x_i, x).
Stacking more layers:
y_1i(x) = σ(⟨w_1i, x⟩),   y_2i(x) = σ(⟨w_2i, y_1⟩),   y_3(x) = σ(⟨w_3, y_2⟩)
slide by Alex Smola
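A sketch of the three-layer composition above, taking σ to be the logistic sigmoid and using random weight matrices in place of learned parameters:

```python
import numpy as np

def sigma(z):
    # Logistic sigmoid, applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

def deep_net(x, W1, W2, w3):
    y1 = sigma(W1 @ x)           # y1_i(x) = sigma(<w1_i, x>), all i at once
    y2 = sigma(W2 @ y1)          # y2_i(x) = sigma(<w2_i, y1>)
    return sigma(np.dot(w3, y2)) # y3(x)   = sigma(<w3, y2>)

rng = np.random.default_rng(1)
x = rng.normal(size=4)           # made-up 4-dimensional input
W1 = rng.normal(size=(5, 4))     # 5 first-layer units
W2 = rng.normal(size=(3, 5))     # 3 second-layer units
w3 = rng.normal(size=3)          # single output unit
print(deep_net(x, W1, W2, w3))
```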
A neural network with a single hidden layer is a universal approximator (it can represent any function).
Proof in: Approximation by Superpositions of a Sigmoidal Function, Cybenko, 1989.
In practice, we use more hidden units and more hidden layers.
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
Recognizing handwritten shapes with two layers of neurons:
Neurons in the top layer represent known shapes.
Neurons in the bottom layer represent pixel intensities.
A pixel gets to vote if it has ink on it.
Each inked pixel can vote for several different shapes.
The shape that gets the most votes wins.
[Diagram: pixel inputs connected to ten output units y, one per digit class 0–9]
slide by Geoffrey Hinton
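Not from the slides: a rough sketch of this voting scheme as a linear scorer over binary "ink" pixels. The weights here are random placeholders; they would be learned as described on the following slides, and the image size is made up:

```python
import numpy as np

def vote_for_digit(pixels, weights):
    # pixels: flat 0/1 vector ("a pixel gets to vote if it has ink on it")
    # weights[k]: how strongly each pixel votes for digit class k
    votes = weights @ pixels      # each inked pixel adds its weight to every class
    return int(np.argmax(votes))  # the shape with the most votes wins

rng = np.random.default_rng(0)
num_classes, num_pixels = 10, 16 * 16   # made-up image size
weights = rng.normal(size=(num_classes, num_pixels))
image = (rng.random(num_pixels) > 0.8).astype(float)   # random binary "ink"
print(vote_for_digit(image, weights))
```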
Give each output unit its own “map” of the input image and display the weight coming from each pixel in the location of that pixel in the map. Use a black or white blob with the area representing the magnitude of the weight and the color representing the sign.
[Figure: the input image and the weight maps for classes 1–9 and 0]
slide by Geoffrey Hinton
Show the network an image and increment the weights from active pixels to the correct class. Then decrement the weights from active pixels to whatever class the network guesses.
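A literal (and hedged) sketch of the rule just described: add the active pixels to the correct class's weights and subtract them from the guessed class's weights; when the guess is already correct, the two updates cancel.

```python
import numpy as np

def update_weights(weights, pixels, correct_class):
    # Guess with the current weights (most votes wins)
    guess = int(np.argmax(weights @ pixels))
    # Increment weights from active pixels to the correct class ...
    weights[correct_class] += pixels
    # ... and decrement weights from active pixels to the guessed class
    weights[guess] -= pixels
    return weights
```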
[Figures over several slides: the input image and the weight maps for classes 1–9 and 0, evolving as more training images are shown]
The details of the learning algorithm will be explained later.
slide by Geoffrey Hinton
Why the simple learning algorithm is insufficient: a two-layer network with a single winner in the top layer is equivalent to having a rigid template for each shape.
The ways in which handwritten digits vary are much too complicated to be captured by simple template matches of whole shapes.
To capture all the allowable variations of a digit, we need to learn the features that it is composed of.
slide by Geoffrey Hinton
Each layer applies a linear mapping W_i x_i and a nonlinear function σ:
y_i = W_i x_i,   x_{i+1} = σ(y_i)
with a loss l(y, y_i) to measure the quality of the estimate so far.
[Diagram: x_1 → W_1 → x_2 → W_2 → x_3 → W_3 → x_4 → W_4 → y]
slide by Alex Smola
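A sketch of this layered view as a loop, using the logistic sigmoid for σ and random matrices in place of trained weights (all sizes are made up):

```python
import numpy as np

def forward(x, weight_matrices):
    # Repeatedly apply y_i = W_i x_i followed by x_{i+1} = sigma(y_i)
    activation = x
    for W in weight_matrices:
        y = W @ activation                      # linear mapping
        activation = 1.0 / (1.0 + np.exp(-y))   # nonlinear function sigma
    return activation

rng = np.random.default_rng(0)
x1 = rng.normal(size=8)                                   # input x_1
Ws = [rng.normal(size=(6, 8)), rng.normal(size=(4, 6)),
      rng.normal(size=(2, 4)), rng.normal(size=(1, 2))]   # W_1 ... W_4
print(forward(x1, Ws))                                    # final estimate y
```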
Forward Pass: What does the Network Compute?
Output of the network (j indexes the hidden units, k indexes the output units, D is the number of inputs):
o_k(x) = g(w_k0 + Σ_{j=1..J} h_j(x) w_kj),   with hidden units   h_j(x) = f(v_j0 + Σ_{i=1..D} x_i v_ji)
Typical choices for the activation functions f and g:
σ(z) = 1 / (1 + exp(−z)),   tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)),   ReLU(z) = max(0, z)
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
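A sketch that follows the two equations above directly; taking f = tanh for the hidden layer and g = sigmoid for the output is one possible choice, and the sizes and weights below are made up:

```python
import numpy as np

def forward_pass(x, V, v0, W, w0):
    # Hidden units: h_j(x) = f(v_j0 + sum_i x_i v_ji), here f = tanh
    h = np.tanh(v0 + V @ x)
    # Outputs: o_k(x) = g(w_k0 + sum_j h_j(x) w_kj), here g = sigmoid
    return 1.0 / (1.0 + np.exp(-(w0 + W @ h)))

D, J, K = 3, 5, 2                      # inputs, hidden units, outputs (made up)
rng = np.random.default_rng(0)
V, v0 = rng.normal(size=(J, D)), np.zeros(J)
W, w0 = rng.normal(size=(K, J)), np.zeros(K)
print(forward_pass(rng.normal(size=D), V, v0, W, w0))
```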
[Figure: a multi-layer fully-connected network, from http://cs231n.github.io/neural-networks-1/]
What are the parameters? The weight matrices (e.g., W1, W2, W3) and the biases.
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
What does the network compute with no hidden units and a sigmoid output activation function?
o_k(x) = 1 / (1 + exp(−z_k)),   z_k = w_k0 + Σ_{j=1..J} x_j w_kj
This is exactly logistic regression.
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
With a hidden layer and a sigmoid output (e.g., for a one-vs.-all classification problem), the network computes
o_k(x) = 1 / (1 + exp(−z_k)),   z_k = w_k0 + Σ_{j=1..J} h_j(x) w_kj
To train it we must adjust all the parameters, but the output is a complicated function of them.
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
Training: find weights that minimize the total loss over the training set,
w* = argmin_w Σ_{n=1..N} loss(o^(n), t^(n))
where o = f(x; w) is the output of the neural network.
Squared loss: Σ_k (1/2) (o_k^(n) − t_k^(n))^2
Cross-entropy loss: − Σ_k t_k^(n) log o_k^(n)
Train the network via gradient descent:
w_{t+1} = w_t − η ∂E/∂w_t
where η is the learning rate (and E is the error/loss).
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
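A sketch of one gradient descent step for the squared loss on a single example. To keep the gradient in closed form without backpropagation (which comes later), the "network" here is purely linear, o = Wx + b; all shapes and data are made up:

```python
import numpy as np

def gradient_step(W, b, x, t, lr):
    # Forward: o = W x + b (a purely linear model for simplicity)
    o = W @ x + b
    # Squared loss E = sum_k 1/2 (o_k - t_k)^2, so dE/do = (o - t)
    delta = o - t
    # Gradient descent update: w_{t+1} = w_t - eta * dE/dw
    W_new = W - lr * np.outer(delta, x)
    b_new = b - lr * delta
    return W_new, b_new

rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, 3)), np.zeros(2)   # made-up shapes
x, t = rng.normal(size=3), np.array([1.0, 0.0])
W, b = gradient_step(W, b, x, t, lr=0.1)
```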
Name      Function                                            Derivative
Sigmoid   σ(z) = 1 / (1 + exp(−z))                            σ(z) · (1 − σ(z))
Tanh      tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z))   1 / cosh^2(z)
ReLU      ReLU(z) = max(0, z)                                 1 if z > 0, 0 if z ≤ 0
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
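The table above written out as functions, as a small sketch (derivatives expressed exactly as in the table):

```python
import numpy as np

def sigmoid(z):   return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z): return sigmoid(z) * (1.0 - sigmoid(z))

def tanh(z):      return np.tanh(z)            # (e^z - e^-z) / (e^z + e^-z)
def d_tanh(z):    return 1.0 / np.cosh(z) ** 2

def relu(z):      return np.maximum(0.0, z)
def d_relu(z):    return (z > 0).astype(float)  # 1 if z > 0, else 0

z = np.linspace(-2, 2, 5)
print(sigmoid(z), d_sigmoid(z))
print(tanh(z), d_tanh(z))
print(relu(z), d_relu(z))
```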