Neural Networks
Hugo Larochelle ( @hugo_larochelle ) Google Brain
2
Topics: online videos
For a more detailed description of neural networks, see the online course:
http://info.usherbrooke.ca/hlarochelle/neural_networks
3
[Figure: feedforward neural network with inputs x_1, ..., x_j, ..., x_d, hidden layers and bias units]
Making predictions with feedforward neural networks
5
Topics: connection weights, bias, activation function
Neuron pre-activation (or input activation): a(x) = b + Σ_i w_i x_i = b + w^T x
Neuron output activation: h(x) = g(a(x)) = g(b + Σ_i w_i x_i)
where w are the connection weights, b is the neuron bias and g(·) is called the activation function.
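To make these definitions concrete, here is a minimal NumPy sketch of a single neuron; the sigmoid choice for g and all names are illustrative assumptions, not from the slides.

import numpy as np

def sigmoid(a):
    # activation function g: squashes the pre-activation into (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w, b):
    a = b + np.dot(w, x)   # pre-activation a(x) = b + w^T x
    return sigmoid(a)      # activation h(x) = g(a(x))

# example: a neuron with 3 inputs
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.2
print(neuron(x, w, b))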
6
Topics: connection weights, bias, activation function
[Figure: output surface y_1 of a single neuron over inputs x_1, x_2 (from Pascal Vincent's slides)]
The output range is determined by g(·); the bias b only changes the position of the ridge.
7
Topics: single hidden layer neural network
[Figure: single hidden layer neural network with inputs x_1, x_2, hidden units z_1, ..., z_k and outputs y_1, y_2; weights w_ji (input i to hidden j) and w_kj (hidden j to output k); labels: output k, input i, hidden j, bias (from Pascal Vincent's slides)]
8
Topics: single hidden layer neural network
[Figure: hidden unit activations y_1, ..., y_4 and resulting outputs z_1 over the input space x_1, x_2 (from Pascal Vincent's slides)]
9
Topics: single hidden layer neural network
[Figure: decision regions R_1, R_2 over inputs x_1, x_2 obtained with a three-layer network (from Pascal Vincent's slides)]
10
Topics: universal approximation
Universal approximation theorem: ''a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units'' (Hornik, 1991)
This applies for many hidden layer activation functions (e.g. sigmoid, tanh).
This is a good result, but it doesn't mean there is a learning algorithm that can find the necessary parameter values!
11
Topics: multilayer neural network
[Figure: multilayer neural network with input x = (x_1, ..., x_d) = h^(0)(x), hidden layers h^(1)(x), h^(2)(x), output f(x), weights W^(1), W^(2), W^(3) and biases b^(1), b^(2), b^(3)]
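A minimal NumPy sketch of forward propagation through the layers pictured above (with h^(0)(x) = x); the choice of sigmoid hidden activations and a softmax output is an assumption for illustration.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, weights, biases):
    """weights = [W1, ..., W(L+1)], biases = [b1, ..., b(L+1)]."""
    h = x  # h^(0)(x) = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = b + W @ h        # pre-activation a^(k)(x)
        h = sigmoid(a)       # hidden activation h^(k)(x)
    # output layer activation f(x)
    return softmax(biases[-1] + weights[-1] @ h)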
12
Topics: sigmoid activation function
Squashes the neuron's pre-activation between 0 and 1:
g(a) = sigm(a) = 1 / (1 + exp(-a))
13
Topics: hyperbolic tangent (‘‘tanh’’) activation function
Squashes the neuron's pre-activation between -1 and 1; the output can be positive or negative:
g(a) = tanh(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a)) = (exp(2a) - 1) / (exp(2a) + 1)
14
Topics: rectified linear activation function
g(a) = reclin(a) = max(0, a)
Bounded below by 0 (always non-negative); tends to give neurons with sparse activities.
15
Topics: softmax activation function
o(a) = softmax(a) = [ exp(a_1) / Σ_c exp(a_c), ..., exp(a_C) / Σ_c exp(a_c) ]^T
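The four activation functions above written out in NumPy (a sketch; function names are mine):

import numpy as np

def sigm(a):          # squashes into (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):          # squashes into (-1, 1)
    return np.tanh(a)

def reclin(a):        # rectified linear: non-negative, sparse activities
    return np.maximum(0.0, a)

def softmax(a):       # strictly positive, sums to one over classes
    e = np.exp(a - a.max())   # subtract max for numerical stability
    return e / e.sum()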
16
Topics: flow graph
Forward propagation can be represented as an acyclic flow graph.
It's a nice way of implementing forward propagation in a modular way:
each box could be an object with an fprop method, which computes the value of the box given its parents
calling the fprop method of each box in the right order yields forward propagation
[Figure: flow graph with nodes for x, W^(1), b^(1), W^(2), b^(2) and the output f(x)]
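One possible realization of this modular flow-graph idea, as a sketch; the Box/Input/Affine class names are assumptions, not a specific library's API.

class Box:
    """A node in the flow graph; parents are other boxes."""
    def __init__(self, *parents):
        self.parents = parents
        self.value = None

    def fprop(self):
        raise NotImplementedError

class Input(Box):
    def set(self, value):
        self.value = value

    def fprop(self):
        return self.value

class Affine(Box):
    """Computes b + W x from its parents (x, W, b)."""
    def fprop(self):
        x, W, b = (p.value for p in self.parents)
        self.value = b + W @ x
        return self.value

# calling fprop on each box in topological order yields forward propagation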
Training feedforward neural networks
18
Topics: empirical risk minimization, regularization
arg min_θ (1/T) Σ_t l(f(x^(t); θ), y^(t)) + λ Ω(θ)
where l is the loss function and Ω(θ) is a regularizer.
19
Topics: stochastic gradient descent (SGD)
For each training example (x^(t), y^(t)):
Δ = -∇_θ l(f(x^(t); θ), y^(t)) - λ ∇_θ Ω(θ)
θ ← θ + α Δ
training epoch = iteration over all examples
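A sketch of the SGD loop implementing this update; grad_loss and grad_reg are assumed callables returning ∇_θ l and ∇_θ Ω (e.g. computed by backpropagation).

def sgd(theta, data, grad_loss, grad_reg, alpha=0.01, lam=0.0, n_epochs=10):
    """theta: list of parameter arrays; data: list of (x, y) examples."""
    for epoch in range(n_epochs):            # one epoch = pass over all examples
        for x, y in data:
            g_loss = grad_loss(theta, x, y)  # ∇_θ l(f(x; θ), y)
            g_reg = grad_reg(theta)          # ∇_θ Ω(θ)
            for p, gl, gr in zip(theta, g_loss, g_reg):
                p -= alpha * (gl + lam * gr)  # θ ← θ - α(∇_θ l + λ ∇_θ Ω)
    return theta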
20
Topics: loss function for classification
Negative log-likelihood, using the natural log (ln):
l(f(x), y) = -Σ_c 1_(y=c) log f(x)_c = -log f(x)_y
where y is the target class y^(t) and f(x)_c estimates p(y=c | x).
21
Topics: backpropagation algorithm
compute the output gradient (before activation):
∇_{a^(L+1)(x)} -log f(x)_y = -(e(y) - f(x))
for k from L+1 to 1:
compute the gradients of the hidden layer parameters:
∇_{W^(k)} -log f(x)_y = (∇_{a^(k)(x)} -log f(x)_y) h^(k-1)(x)^T
∇_{b^(k)} -log f(x)_y = ∇_{a^(k)(x)} -log f(x)_y
compute the gradient of the hidden layer below:
∇_{h^(k-1)(x)} -log f(x)_y = W^(k)^T (∇_{a^(k)(x)} -log f(x)_y)
compute the gradient of the hidden layer below (before activation):
∇_{a^(k-1)(x)} -log f(x)_y = (∇_{h^(k-1)(x)} -log f(x)_y) ⊙ [..., g'(a^(k-1)(x)_j), ...]
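A NumPy sketch of these equations for a network with one hidden layer, sigmoid activation and softmax output (the derivative g'(a) = g(a)(1 - g(a)) is taken from the next slide; all variable names are mine).

import numpy as np

def backprop_one_hidden(x, y, W1, b1, W2, b2):
    # forward pass
    a1 = b1 + W1 @ x
    h1 = 1.0 / (1.0 + np.exp(-a1))             # g(a) = sigm(a)
    a2 = b2 + W2 @ h1
    f = np.exp(a2 - a2.max()); f /= f.sum()    # softmax output f(x)

    # output gradient (before activation): -(e(y) - f(x))
    e_y = np.zeros_like(f); e_y[y] = 1.0
    grad_a2 = -(e_y - f)

    # parameter gradients for the output layer
    grad_W2 = np.outer(grad_a2, h1)
    grad_b2 = grad_a2

    # gradient of the hidden layer below, then before activation
    grad_h1 = W2.T @ grad_a2
    grad_a1 = grad_h1 * h1 * (1.0 - h1)        # elementwise, g'(a) = g(a)(1 - g(a))

    # parameter gradients for the hidden layer
    grad_W1 = np.outer(grad_a1, x)
    grad_b1 = grad_a1
    return grad_W1, grad_b1, grad_W2, grad_b2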
22
Topics: sigmoid activation function gradient
g(a) = sigm(a) = 1 / (1 + exp(-a))
g'(a) = g(a) (1 - g(a))
23
Topics: tanh activation function gradient
g(a) = tanh(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a)) = (exp(2a) - 1) / (exp(2a) + 1)
g'(a) = 1 - g(a)^2
24
Topics: rectified linear activation function gradient
g'(a) = 1_{a>0}
25
Topics: automatic differentiation
Each object also has a bprop method: it computes the gradient of the loss with respect to each parent.
fprop depends on the fprop of a box's parents, while bprop depends on the bprop of a box's children.
By calling bprop in the reverse order, we get backpropagation.
[Figure: flow graph with nodes for x, W^(1), b^(1), W^(2), b^(2) and the output f(x)]
26
Topics: L2 regularization
Only applied on the weights (not the biases):
Ω(θ) = Σ_k Σ_i Σ_j (W^(k)_{i,j})^2 = Σ_k ||W^(k)||_F^2
Gradient: ∇_{W^(k)} Ω(θ) = 2 W^(k)
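In code, the regularizer and its gradient are a few lines (a sketch under the notation above; the function name is mine):

def l2_penalty_and_grads(weights, lam):
    """weights: list of W^(k) arrays. Returns λΩ(θ) and λ∇_{W^(k)}Ω(θ) for each layer."""
    omega = sum((W ** 2).sum() for W in weights)   # Σ_k ||W^(k)||_F^2
    grads = [2.0 * lam * W for W in weights]       # λ ∇_{W^(k)} Ω(θ) = 2λ W^(k)
    return lam * omega, grads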
27
Topics: initialization
Sample W^(k)_{i,j} from U[-b, b] with b = √6 / √(H_k + H_{k-1}), where H_k is the size of h^(k)(x) (see Glorot & Bengio, 2010).
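A sketch of this initialization; the `sizes` argument listing the layer sizes H_0, ..., H_{L+1} is an assumed interface, and initializing the biases to 0 is a common default.

import numpy as np

def glorot_init(sizes, seed=0):
    """sizes = [H_0, H_1, ..., H_{L+1}]: input, hidden and output layer sizes."""
    rng = np.random.default_rng(seed)
    weights, biases = [], []
    for h_prev, h_next in zip(sizes[:-1], sizes[1:]):
        b = np.sqrt(6.0) / np.sqrt(h_prev + h_next)                 # b = √6 / √(H_k + H_{k-1})
        weights.append(rng.uniform(-b, b, size=(h_next, h_prev)))   # W^(k) ~ U[-b, b]
        biases.append(np.zeros(h_next))                             # biases initialized to 0
    return weights, biases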
28
Topics: grid search, random search
Try several hyper-parameter configurations (grid search: all combinations of a small set of values per hyper-parameter; random search: sample values independently) and use the validation set performance to select the best configuration.
29
Topics: early stopping
To choose the number of training epochs, stop training when the validation set error increases (with some look ahead).
[Figure: training and validation error curves vs. number of epochs, showing the underfitting and overfitting regimes]
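A sketch of early stopping with a fixed look-ahead (often called patience); train_one_epoch and validation_error are assumed callbacks supplied by the caller.

def train_with_early_stopping(train_one_epoch, validation_error,
                              patience=5, max_epochs=200):
    best_err, best_epoch, epochs_since_best = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        err = validation_error()
        if err < best_err:
            best_err, best_epoch, epochs_since_best = err, epoch, 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:   # validation error stopped improving
            break
    return best_epoch, best_err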
30
Topics: normalization of data, decaying learning rate
(i) start with a large learning rate (e.g. 0.1)
(ii) maintain it until the validation error stops improving
(iii) divide the learning rate by 2 and go back to (ii)
31
Topics: mini-batch, momentum
Momentum: use an exponential average of previous gradients:
∇̄^(t)_θ = ∇_θ l(f(x^(t)), y^(t)) + β ∇̄^(t-1)_θ
32
Topics: Adagrad, RMSProp, Adam
Adapt the learning rate per parameter, using accumulated squared gradients:
Adagrad: γ^(t) = γ^(t-1) + (∇_θ l(f(x^(t)), y^(t)))^2
RMSProp: γ^(t) = β γ^(t-1) + (1 - β) (∇_θ l(f(x^(t)), y^(t)))^2
In both cases the update uses ∇̄^(t)_θ = ∇_θ l(f(x^(t)), y^(t)) / (√γ^(t) + ε)
Adam combines RMSProp with momentum.
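A sketch of the momentum, Adagrad and RMSProp updates for a single parameter array theta with gradient g; the caller keeps the state variable (v or gamma) between steps, and all names are illustrative.

import numpy as np

def momentum_step(theta, g, v, alpha=0.01, beta=0.9):
    v = g + beta * v                              # exponential average of gradients
    return theta - alpha * v, v

def adagrad_step(theta, g, gamma, alpha=0.01, eps=1e-8):
    gamma = gamma + g ** 2                        # accumulate squared gradients
    return theta - alpha * g / (np.sqrt(gamma) + eps), gamma

def rmsprop_step(theta, g, gamma, alpha=0.01, beta=0.9, eps=1e-8):
    gamma = beta * gamma + (1 - beta) * g ** 2    # decaying average of squared gradients
    return theta - alpha * g / (np.sqrt(gamma) + eps), gamma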
33
Topics: finite difference approximation
To debug an implementation of fprop/bprop, compare it with a finite-difference approximation of the gradient:
∂f(x)/∂x ≈ (f(x + ε) - f(x - ε)) / (2ε)
where f(x) is the loss, x a single parameter (checked one at a time) and ε a small constant.
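A sketch of a gradient check based on this approximation; loss is any scalar function of a parameter array and grad is the backpropagation gradient being tested (names are mine).

import numpy as np

def gradient_check(loss, grad, theta, eps=1e-6):
    """Compare the analytic gradient grad(theta) with central finite differences."""
    analytic = grad(theta)
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus.flat[i] += eps
        theta_minus.flat[i] -= eps
        # (f(x + eps) - f(x - eps)) / (2 eps)
        numeric.flat[i] = (loss(theta_plus) - loss(theta_minus)) / (2 * eps)
    return np.max(np.abs(analytic - numeric))   # should be very small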
34
Topics: debugging on small dataset
Train on a small dataset (~50 examples) and check that the model can overfit it; any bug in the implementation is much easier to find in this setting.
Training deep feed-forward neural networks
36
Topics: inspiration from visual cortex
[Figure: hierarchy of increasingly abstract features: edges ... nose, mouth, eyes ... face]
37
Topics: theoretical justification
A deep architecture can represent certain functions (exponentially) more compactly.
Example: Boolean functions. A Boolean circuit is a sort of feed-forward network in which hidden units are logic gates (i.e. AND, OR or NOT functions of their arguments).
38
Topics: success story: speech recognition
39
Topics: success story: computer vision
40
Topics: why training is hard
First hypothesis: optimization is harder (underfitting)
vanishing gradients: saturated units block gradient propagation
this is a well known problem in recurrent neural networks
[Figure: deep feedforward network with inputs x_1, ..., x_j, ..., x_d, illustrating how gradients must flow back through many layers]
41
Topics: why training is hard
Second hypothesis: overfitting
[Figure: three regimes over the possible functions f: low variance / high bias, good trade-off, high variance / low bias]
42
Topics: why training is hard
Depending on the problem, one or the other situation will tend to dominate:
if underfitting (first hypothesis): use better optimization
if overfitting (second hypothesis): use better regularization (e.g. unsupervised pre-training, dropout)
44
Topics: unsupervised pre-training
[Figure: a character image next to a random image] Why is one a character and the other is not?
46
Topics: autoencoder, encoder, decoder, tied weights
A feed-forward neural network trained to reproduce its input at the output layer.
Encoder: h(x) = g(a(x)) = sigm(b + W x)
Decoder: x̂ = sigm(c + W* h(x))
with W* = W^T (tied weights).
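A sketch of the tied-weight autoencoder's forward pass, with a cross-entropy reconstruction loss for binary inputs (that loss choice and the function names are assumptions here).

import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_forward(x, W, b, c):
    """Encoder h(x) = sigm(b + W x); decoder with tied weights W* = W^T."""
    h = sigm(b + W @ x)            # hidden representation
    x_hat = sigm(c + W.T @ h)      # reconstruction
    return h, x_hat

def reconstruction_loss(x, x_hat, eps=1e-12):
    # cross-entropy between the binary input and its reconstruction
    return -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))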
47
Topics: unsupervised pre-training
[Figure: greedy layer-wise unsupervised pre-training, training one hidden layer at a time on the representation of inputs x_1, ..., x_j, ..., x_d produced by the previously trained layers]
48
Topics: fine-tuning
Once all layers are pre-trained, add an output layer and train the whole network with supervised learning, as in a regular feed-forward network (forward propagation, backpropagation and update).
This last phase is called fine-tuning: all parameters are tuned for the supervised task at hand.
[Figure: pre-trained network over inputs x_1, ..., x_j, ..., x_d with an added output layer, trained end to end]
49
Topics: impact of initialization
Unsupervised pre-training acts as a regularizer.
(Why Does Unsupervised Pre-training Help Deep Learning? Erhan, Bengio, Courville, Manzagol, Vincent and Bengio, 2010)
50
Topics: why training is hard
Depending on the problem, underfitting or overfitting will tend to dominate. If overfitting: use better regularization, such as dropout.
51
Topics: dropout
Idea: "cripple" the neural network by removing hidden units stochastically:
each hidden unit is set to 0 with probability 0.5
hidden units cannot co-adapt to other units
hidden units must be more generally useful
Could use a different dropout probability, but 0.5 usually works well.
[Figure: multilayer network h^(1)(x), h^(2)(x) with weights W^(1), W^(2), W^(3) and biases b^(1), b^(2), b^(3), where some hidden units are dropped]
52
Topics: dropout
Dropout uses random binary masks m^(k): the hidden layer activation becomes h^(k)(x) = g(a^(k)(x)) ⊙ m^(k) (with h^(0)(x) = x).
[Figure: multilayer network with inputs x_1, ..., x_j, ..., x_d, hidden layers h^(1)(x), h^(2)(x), weights W^(1), W^(2), W^(3) and biases b^(1), b^(2), b^(3), with masked hidden units]
53
Topics: dropout backpropagation
Backpropagation is as before, except that the gradient of the hidden layer below (before activation) now includes the mask m^(k-1):
∇_{a^(k-1)(x)} -log f(x)_y = (∇_{h^(k-1)(x)} -log f(x)_y) ⊙ [..., g'(a^(k-1)(x)_j), ...] ⊙ m^(k-1)
54
Topics: test time classification
At test time, replace each mask by its expectation (simply the constant 0.5 if the dropout probability is 0.5); this approximates a geometric average over all the neural networks, with all possible binary masks.
Hinton, Srivastava, Krizhevsky, Sutskever and Salakhutdinov, 2012.
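A sketch of dropout at training time (random binary mask) and the test-time expectation described above; the keep probability of 0.5 and all names are illustrative.

import numpy as np

def dropout_hidden(h, p_keep=0.5, train=True, rng=None):
    """Apply dropout to a hidden layer activation h^(k)(x)."""
    if train:
        rng = rng or np.random.default_rng()
        m = (rng.random(h.shape) < p_keep).astype(h.dtype)   # binary mask m^(k)
        return h * m
    # test time: replace the mask by its expectation
    return h * p_keep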
55
Topics: why training is hard
Depending on the problem, underfitting or overfitting will tend to dominate. If underfitting: use better optimization, such as batch normalization.
Batch normalization
56
Topics: batch normalization
Normalizing the inputs speeds up training (Lecun et al. 1998); could normalization also be useful at the level of the hidden layers?
Batch normalization is an attempt to do that (Ioffe and Szegedy, 2015).
57
Topics: batch normalization
Input: values of x over a mini-batch: B = {x_1...m}; parameters to be learned: γ, β
Output: {y_i = BN_γ,β(x_i)}
µ_B ← (1/m) Σ_{i=1}^m x_i   // mini-batch mean
σ²_B ← (1/m) Σ_{i=1}^m (x_i − µ_B)²   // mini-batch variance
x̂_i ← (x_i − µ_B) / √(σ²_B + ε)   // normalize
y_i ← γ x̂_i + β ≡ BN_γ,β(x_i)   // scale and shift
Learned linear transformation to adapt to the non-linear activation function (γ and β are trained).
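A NumPy sketch of this batch-normalization transform for a mini-batch of pre-activations (one row per example); names are illustrative.

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: array of shape (m, d), one row per example in the mini-batch."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift: BN_γ,β(x)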
58
Topics: online videos
For a more detailed description of neural networks, see the online course:
http://info.usherbrooke.ca/hlarochelle/neural_networks