Lecture 11:
−Multi-layer Perceptron −Forward Pass −Backpropagation
Aykut Erdem
November 2016 Hacettepe University
Administrative
Assignment 2 due Nov. 10, 2016!
Midterm exam on Monday, Nov. 14, 2016
− You are responsible for all the material from the beginning to the end
− You can prepare and bring a full-page cheat sheet
Next assignment:
− It is due December 1, 2016
− You will implement a 2-layer Neural Network
2
3
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
4
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
5
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
6
http://vision.stanford.edu/teaching/cs231n/linear-classify-demo/
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
7
f(x) = Σ_i w_i x_i = ⟨w, x⟩
[Figure: a single unit with inputs x_1, x_2, x_3, …, x_n and weights w_1, …, w_n]
slide by Alex Smola
8
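The linear score above is just an inner product between a weight vector and the input. A minimal NumPy sketch (the concrete numbers are illustrative, not from the slides):

import numpy as np

# inputs x_1 ... x_n and weights w_1 ... w_n as vectors
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.2, 0.4, -0.1])

# f(x) = sum_i w_i * x_i = <w, x>
f = np.dot(w, x)
print(f)   # -0.68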
9
10
          1970s    1980s    1990s    2000s    2010s
Data      10^2     10^3     10^5     10^8     10^11
RAM       ?        1MB      100MB    10GB     1TB
CPU       ?        10MF     1GF      100GF    1PF (GPU at higher exponent)
slide by Alex Smola
11
slide by Alex Smola
12
slide by Dhruv Batra
13
slide by Dhruv Batra
14
slide by Dhruv Batra
Cell body: combines the signals
Dendrites: combine the inputs from several other nerve cells
Synapse: interface and parameter store between neurons
Axon: may be up to 1m long and will transport the activation signal to neurons at different locations
15
slide by Alex Smola
16
slide by Dhruv Batra
We need a differentiable loss function for gradient descent training.
17
Linear Neuron:
y = θ_0 + Σ_i x_i θ_i

Perceptron:
z = θ_0 + Σ_i x_i θ_i,   y = 1 if z ≥ 0, 0 otherwise

Logistic Neuron:
z = θ_0 + Σ_i x_i θ_i,   y = 1 / (1 + e^{−z})
slide by Dhruv Batra
18
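The three neuron types differ only in what they do with the same pre-activation z. A small sketch, assuming a NumPy vector x and a parameter vector theta whose first entry is the bias θ_0 (names are illustrative):

import numpy as np

def pre_activation(theta, x):
    # z = theta_0 + sum_i x_i * theta_i
    return theta[0] + np.dot(theta[1:], x)

def linear_neuron(theta, x):
    return pre_activation(theta, x)                          # y = z

def perceptron(theta, x):
    return 1.0 if pre_activation(theta, x) >= 0 else 0.0     # y = 1 if z >= 0 else 0

def logistic_neuron(theta, x):
    z = pre_activation(theta, x)
    return 1.0 / (1.0 + np.exp(-z))                           # y = 1 / (1 + e^{-z})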
slide by Dhruv Batra
19
Deep Nets:  y_{1i}(x) = σ(⟨w_{1i}, x⟩),   y_2(x) = σ(⟨w_2, y_1⟩)
Kernels:    y_{1i} = k(x_i, x)
slide by Alex Smola
20
y_{1i}(x) = σ(⟨w_{1i}, x⟩),   y_{2i}(x) = σ(⟨w_{2i}, y_1⟩),   y_3(x) = σ(⟨w_3, y_2⟩)
slide by Alex Smola
A neural network with at least one hidden layer is a universal approximator (can represent any function).
Proof in: Approximation by Superpositions of a Sigmoidal Function, Cybenko, 1989 (paper)
The capacity of the network increases with more hidden units and more hidden layers.
21
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
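A sketch of the one-hidden-layer network from the equations above, where the number of hidden units J controls the capacity; the weight shapes and random initialization are illustrative assumptions:

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
J = 8                                # number of hidden units (capacity grows with J)
x = rng.normal(size=4)               # input

W1 = rng.normal(size=(J, 4))         # rows are the w_{1i}
w2 = rng.normal(size=J)

y1 = sigma(W1 @ x)                   # y_{1i}(x) = sigma(<w_{1i}, x>)
y2 = sigma(w2 @ y1)                  # y_2(x)    = sigma(<w_2, y_1>)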
Consider a simple two-layer network for recognizing handwritten digits:
− Neurons in the top layer represent known shapes.
− Neurons in the bottom layer represent pixel intensities.
− Each inked pixel can vote for several different shapes; the shape that gets the most votes wins.
22
[Figure: output units for the digit classes 0-9 connected to an array of input pixels]
slide by Geoffrey Hinton
23
Give each output unit its own “map” of the input image and display the weight coming from each pixel in the location of that pixel in the map. Use a black or white blob with the area representing the magnitude of the weight and the color representing the sign.
The input image
slide by Geoffrey Hinton
24
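A sketch of that visualization with matplotlib, assuming a weight matrix W of shape (10, 28*28) with one weight map per output unit; the shape, random weights, and colormap choice are assumptions, and a diverging colormap stands in for the black/white blob display:

import numpy as np
import matplotlib.pyplot as plt

W = np.random.randn(10, 28 * 28)      # one weight map per output unit (assumed shape)

fig, axes = plt.subplots(1, 10, figsize=(15, 2))
for k, ax in enumerate(axes):
    # show each weight in the location of its pixel; color encodes sign, intensity encodes magnitude
    ax.imshow(W[k].reshape(28, 28), cmap="bwr")
    ax.set_title(str(k))
    ax.axis("off")
plt.show()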
Show the network an image and increment the weights from active pixels to the correct class. Then decrement the weights from active pixels to whatever class the network guesses.
[Figure: the image and the weight maps for each output unit, shown as training progresses]
slide by Geoffrey Hinton
25
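A sketch of that update rule, assuming a binary image vector x (1 = active pixel), a weight matrix W with one row per class, and an integer label; all names are illustrative:

import numpy as np

def update(W, x, correct_class):
    # the guess is the class whose total input (votes from active pixels) is largest
    guess = np.argmax(W @ x)
    # increment the weights from active pixels to the correct class ...
    W[correct_class] += x
    # ... then decrement the weights from active pixels to the guessed class
    W[guess] -= x
    return W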
slide by Geoffrey Hinton
26
slide by Geoffrey Hinton
27
slide by Geoffrey Hinton
28
slide by Geoffrey Hinton
29
slide by Geoffrey Hinton
30
The details of the learning algorithm will be explained later.
slide by Geoffrey Hinton
To capture all the allowable variations of a digit we need to learn the features that it is composed of.
31
slide by Geoffrey Hinton
32
Each layer applies a linear mapping W_i x_i followed by a nonlinear function σ:
y_i = W_i x_i,   x_{i+1} = σ(y_i)
A loss l(y, y_i) measures the quality of the estimate so far.
[Figure: layered network x_1 → x_2 → x_3 → x_4 → y]
slide by Alex Smola
33
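A minimal sketch of this layered forward pass, assuming a list of dimension-compatible weight matrices and a squared-error loss at the end (the concrete loss choice is an assumption):

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, x, y_target):
    # x_{i+1} = sigma(y_i)  with  y_i = W_i x_i, applied layer by layer
    for W in weights:
        x = sigma(W @ x)
    # l(y, y_i): squared error between the final estimate and the target
    loss = 0.5 * np.sum((x - y_target) ** 2)
    return x, loss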
(j indexing hidden units, k indexing the output units, D number of inputs)
34
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
h_j(x) = f(v_{j0} + Σ_{i=1}^{D} x_i v_{ji}),   o_k(x) = g(w_{k0} + Σ_{j=1}^{J} h_j(x) w_{kj})

Typical choices for the activation functions f and g:
σ(z) = 1 / (1 + exp(−z)),   tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)),   ReLU(z) = max(0, z)
biases and W3?
35
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
[http://cs231n.github.io/neural-networks-1/]
function?
36
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
o_k = 1 / (1 + exp(−z_k)),   z_k = w_{k0} + Σ_{j=1}^{J} x_j w_{kj}
37
… (one vs. all)? We want to adjust all the parameters, to minimize the loss, but this is a complicated function of the weights.
o_k = 1 / (1 + exp(−z_k)),   z_k = w_{k0} + Σ_{j=1}^{J} h_j(x) w_{kj}
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
Find the weights that minimize the total loss over the training set:
w* = argmin_w Σ_{n=1}^{N} loss(o^{(n)}, t^{(n)})
where o = f(x; w) is the output of a neural network.
Common loss functions:
− Squared error: Σ_k ½ (o_k^{(n)} − t_k^{(n)})²
− Cross-entropy: − Σ_k t_k^{(n)} log o_k^{(n)}
38
Train with gradient descent:
w_{t+1} = w_t − η ∂E/∂w_t
where η is the learning rate (and E is error/loss).
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
39
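A sketch of this training loop using a numerically estimated gradient (finite differences), which works before any backpropagation equations are introduced; the tiny logistic model, toy data, and step size are all illustrative assumptions:

import numpy as np

def loss_fn(w, X, T):
    # squared error over the training set: sum_n sum_k 0.5 * (o_k - t_k)^2
    O = 1.0 / (1.0 + np.exp(-(X @ w)))        # logistic outputs o^{(n)}
    return np.sum(0.5 * (O - T) ** 2)

def numerical_gradient(w, X, T, eps=1e-5):
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss_fn(w_plus, X, T) - loss_fn(w_minus, X, T)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # toy inputs (assumed)
T = (X[:, 0] > 0).astype(float)               # toy targets (assumed)
w = np.zeros(3)
eta = 0.5                                     # learning rate
for t in range(100):
    # gradient descent: w_{t+1} = w_t - eta * dE/dw_t
    w -= eta * numerical_gradient(w, X, T)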
Name      Function                                            Derivative
Sigmoid   σ(z) = 1 / (1 + exp(−z))                            σ(z) · (1 − σ(z))
Tanh      tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z))   1 / cosh²(z)
ReLU      ReLU(z) = max(0, z)                                 1 if z > 0, 0 if z ≤ 0
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
40
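The table above translates directly into code; a NumPy sketch with the derivatives written in the same closed forms as in the table (np.tanh covers tanh itself):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)             # sigma(z) * (1 - sigma(z))

def tanh_prime(z):
    return 1.0 / np.cosh(z) ** 2     # 1 / cosh^2(z)

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(float)     # 1 if z > 0, else 0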
41
[Figure: example class scores produced by the linear score function for a few training images]
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
We defined a (linear) score function: f(x, W) = Wx
We defined a loss function that quantifies our unhappiness with the scores across the training data.
TODO: efficiently finding the parameters that minimize the loss function (optimization).
42
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
43
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
44
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
45
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
46
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
47
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
48
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
49
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
50
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
51
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
52
53
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
54
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
55
There are also more fancy update formulas (momentum, Adagrad, RMSProp, Adam, …)
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
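As one example of those fancier updates, a sketch of the classical momentum rule compared with vanilla gradient descent; the velocity initialization and the hyperparameter values are illustrative assumptions, not from the slides:

import numpy as np

# vanilla update:    w = w - eta * grad
# momentum update:   v = mu * v - eta * grad ;  w = w + v
def momentum_step(w, grad, v, eta=0.01, mu=0.9):
    v = mu * v - eta * grad      # accumulate a velocity that smooths the gradient direction
    w = w + v
    return w, v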
56
(image credits to Alec Radford)
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
57
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
58
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson