Regression, layered neural networks

- Networks of continuous units
- Regression problems
- Gradient descent, backpropagation of error
- The role of the learning rate
- O…
Backpropagation of Error

convenient calculation of the gradient in multilayer networks (← chain rule)

example: continuous two-layer network with K hidden units
- inputs ξ ∈ ℝ^N
- weights w_k ∈ ℝ^N, k = 1, 2, …, K, and v_k ∈ ℝ
- hidden units σ_k(ξ) = g(w_k · ξ)
- output σ(ξ) = h( Σ_{j=1}^K v_j g(w_j · ξ) )

Exercise: derive ∇_{w_k} E and ∂E/∂v_k (a sketch follows below)

the weights w_k and v_k are used …
– downward for the calculation of hidden states and output
– upward for the calculation of the gradient
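A minimal numpy sketch of the exercise for a single example (ξ, τ), assuming g = tanh, h = identity, and quadratic per-example error e = ½(σ − τ)²; these choices and all names are illustrative, not fixed by the slides:

```python
import numpy as np

# Two-layer network: sigma(xi) = h( sum_j v_j g(w_j . xi) ) with g = tanh,
# h = identity (an assumption), and per-example error e = 0.5 * (sigma - tau)^2.

def forward(xi, W, v):
    """W: (K, N) input-to-hidden weights w_k, v: (K,) hidden-to-output weights."""
    s = W @ xi                 # local potentials s_k = w_k . xi
    sigma_k = np.tanh(s)       # hidden states sigma_k = g(w_k . xi)
    sigma = v @ sigma_k        # output (h = identity)
    return sigma, sigma_k, s

def gradients(xi, tau, W, v):
    """Backpropagation: the forward quantities are reused via the chain rule."""
    sigma, sigma_k, s = forward(xi, W, v)      # downward pass
    delta = sigma - tau                        # de/dsigma for quadratic error
    grad_v = delta * sigma_k                   # de/dv_k = delta * g(w_k . xi)
    gprime = 1.0 - sigma_k ** 2                # g'(s_k) for g = tanh
    grad_W = np.outer(delta * v * gprime, xi)  # de/dw_k = delta * v_k * g'(s_k) * xi
    return grad_W, grad_v                      # upward pass

# one gradient descent step with learning rate eta = 0.1
rng = np.random.default_rng(0)
N, K = 10, 3
W, v = rng.normal(size=(K, N)) / np.sqrt(N), rng.normal(size=K)
xi, tau = rng.normal(size=N), 1.0
gW, gv = gradients(xi, tau, W, v)
W, v = W - 0.1 * gW, v - 0.1 * gv
```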
The role of the learning rate

assume E has a (local) minimum at w*; Taylor expansion in the vicinity:

E(w) ≈ E(w*) + (w − w*)ᵀ ∇E|_{w*} + ½ (w − w*)ᵀ H* (w − w*) + …

the linear term vanishes (∇E|_{w*} = 0), so

E(w) ≈ E(w*) + ½ (w − w*)ᵀ H* (w − w*),   ∇E|_w ≈ H* (w − w*)

with the positive definite Hessian matrix of second derivatives H*_ij = ∂²E / (∂w_i ∂w_j)|_{w*};
H* has only positive eigenvalues λ_i > 0 and orthonormal eigenvectors u_i (all λ_i ≤ λ_max)

gradient descent in the vicinity of w*:   w_t − w* ≡ δ_t = δ_{t−1} − η ∇E|_{w_{t−1}}

δ_t ≈ [ I − η H* ] δ_{t−1} ≈ [ I − η H* ]ᵗ δ_0

expansion in { u_i }:   δ_0 = Σ_i a_i u_i   gives

δ_t ≈ Σ_i a_i [ I − η H* ]ᵗ u_i = Σ_i a_i [ 1 − η λ_i ]ᵗ u_i

with u_jᵀ u_k = δ_jk we obtain   |δ_t|² = Σ_i a_i² [ 1 − η λ_i ]²ᵗ
the iteration approaches the minimum, lim_{t→∞} |δ_t| = 0, iff | 1 − η λ_i | < 1 for all i

condition for (local) convergence:   η < η_max = 2 / λ_max

- η < 1 / λ_max   (1 − η λ_max > 0):   smooth convergence
- 1 / λ_max < η < 2 / λ_max   (−1 < 1 − η λ_max < 0):   oscillatory convergence
- η > η_max = 2 / λ_max   (1 − η λ_max < −1):   divergence
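A minimal numerical illustration of the three regimes, on an assumed two-dimensional quadratic E = ½ δᵀ H* δ with λ_max = 4 (so η_max = 0.5); the rates are chosen to fall into the smooth, oscillatory, and divergent regime:

```python
import numpy as np

# Quadratic bowl E = 0.5 * delta^T H delta near w*; gradient descent gives
# delta_t = [I - eta H] delta_{t-1}. Here lambda_max = 4, so eta_max = 0.5.
H = np.diag([1.0, 4.0])

for eta in (0.10, 0.45, 0.55):      # smooth / oscillatory / divergent
    delta = np.array([1.0, 1.0])    # delta_0 = w_0 - w*
    for t in range(50):
        delta = delta - eta * (H @ delta)
    print(f"eta = {eta:.2f}: |delta_50| = {np.linalg.norm(delta):.3g}")
```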
the above considerations hold only locally; potential problems:

– different local minima can have completely different characteristics (λ_max), which complicates, e.g., the choice of the learning rate far from a minimum
– gradient learning can slow down drastically due to, e.g., plateau states (see below)
some modifications:

momentum:   ∆w_{t+1} = −η ∇E + α ∆w_t   ("keep going"; a sketch follows below)

sophisticated optimization methods: line search procedures, conjugate gradient, second order methods, e.g. Newton's method (the "matrix update" employs H), …

individual learning rates for different weights, examples:
– heuristics: η ∝ 1/N for input-to-hidden, η ∝ 1/K for hidden-to-output weights
– simplified version of the "matrix update" (assume H is approximately diagonal): update each weight w_j with a learning rate η_j ∝ 1 / (∂²E/∂w_j²)
– such learning algorithms realize descent in E as long as ∆w · ∇E < 0

construction of alternative well-behaved cost functions, e.g.

E = Σ_µ { γ (σ − τ)²  if sign(σ) = sign(τ),   (σ − τ)²  if sign(σ) ≠ sign(τ) }

with γ increasing from 0 to 1.
small γ: emphasis on correct sign of the output; large γ: fine tuning of σ
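A sketch of the momentum modification; η, α, and the toy gradient are illustrative choices:

```python
import numpy as np

# Momentum: the new step keeps a fraction alpha of the previous one
# ("keep going"), damping oscillations and speeding up travel along
# shallow directions of E.

def momentum_step(w, dw_prev, grad_E, eta=0.05, alpha=0.9):
    dw = -eta * grad_E(w) + alpha * dw_prev  # Delta w_{t+1} = -eta grad E + alpha Delta w_t
    return w + dw, dw

# usage on a toy quadratic E(w) = 0.5 |w|^2, i.e. grad E = w
w, dw = np.ones(2), np.zeros(2)
for _ in range(100):
    w, dw = momentum_step(w, dw, lambda u: u)
print(np.linalg.norm(w))   # approaches 0
```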
Stochastic approximation (on-line gradient descent)

the cost function E = (1/P) Σ_{µ=1}^P e_µ ≡ ⟨e_µ⟩ is an empirical average over examples

→ simple approximation of ∇E by ∇e_µ, for one example only:

w_{t+1} = w_t + ∆w_t = w_t − η ∇e_µ|_{w_t}

– computationally cheap compared to off-line (batch) gradient descent
– intrinsic noise: fewer problems with local minima, flat regions, etc.

(when) does the procedure converge? behavior close to a (local) minimum w* of E?
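Before the convergence analysis, a minimal sketch of the procedure for an assumed linear toy model (in a network, ∇e_µ would come from backpropagation):

```python
import numpy as np

# On-line gradient descent: each step uses grad e_mu of one randomly drawn
# example instead of the full batch gradient grad E. Toy linear regression
# with e_mu = 0.5 * (w . xi_mu - tau_mu)^2; all settings are illustrative.

rng = np.random.default_rng(1)
P, N, eta = 200, 5, 0.05
XI = rng.normal(size=(P, N))       # inputs xi^mu
TAU = XI @ np.ones(N)              # targets of a (realizable) linear rule
w = np.zeros(N)

for t in range(2000):
    mu = rng.integers(P)                           # draw one example
    grad_e_mu = (w @ XI[mu] - TAU[mu]) * XI[mu]    # gradient of e_mu only
    w = w - eta * grad_e_mu                        # w_{t+1} = w_t - eta grad e_mu
print(np.linalg.norm(w - np.ones(N)))              # close to 0 (realizable case)
```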
averaged learning step:

⟨∆w⟩ = −η ⟨∇e_µ|_w⟩ = −(η/P) Σ_{µ=1}^P ∇e_µ|_w = −η ∇E|_w,   hence ⟨∆w⟩ = 0 for w → w*

averaged squared length of ∆w:   ⟨(∆w)²⟩ = η² ⟨( ∇e_µ|_* )²⟩ > 0   (zero only possible if all e_µ = 0)

for constant rate η > 0:   lim_{t→∞} ⟨(∆w_t)²⟩ > 0   (fluctuations remain non-zero)

convergence in the sense of ⟨(∆w)²⟩ → 0 requires η(t) → 0 for t → ∞,

but   Σ_t η(t) → ∞   together with   Σ_t η(t)² < ∞   is required

satisfied by, e.g., η(t) ∝ 1/t for large t → learning rate schedules, e.g.   η(t) = a / (b + t)
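A toy comparison of a constant rate with the schedule η(t) = a/(b + t) on an assumed noisy quadratic (a, b, and the noise model are arbitrary choices): with the schedule the fluctuations die out, with a constant rate they persist.

```python
import numpy as np

# Noisy toy gradient: grad e_mu = grad E + zero-mean noise, with E = 0.5 w^2.
# The schedule eta(t) = a/(b + t) satisfies sum eta = inf, sum eta^2 < inf.
rng = np.random.default_rng(2)
a, b = 1.0, 10.0

for schedule in ("constant eta = 0.1", "eta(t) = a/(b + t)"):
    w = 5.0
    for t in range(10000):
        eta = 0.1 if schedule.startswith("constant") else a / (b + t)
        w -= eta * (w + rng.normal())   # noisy single-example gradient
    print(schedule, "->", f"|w| = {abs(w):.3g}")
```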
Plateau states

frequent observation: training of multilayer networks is delayed by quasi-stationary plateaus

(S.J. Hanson, in: Y. Chauvin and D.E. Rumelhart (eds.), Backpropagation: Theory, Architectures, and Applications, 1995)
example: a two-layer network trained from reliable, perfectly realizable data by on-line gradient descent (here: matching complexity)

[learning curve: error vs. number of examples P/(KN)]

unspecialized hidden units with w_k ∼ w_o + noise have all obtained some (the same) information about the unknown rule

the network output is invariant under permutations of the hidden units; the perfectly symmetric state corresponds to a flat region (saddle) in E

successful learning requires specialization and can be delayed significantly (P. Riegler, C. Wöhler, …); an illustrative simulation follows below

analysed in depth in the statistical physics community (1990s); the problem was re-discovered in deep learning
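An illustrative simulation of the plateau. All settings are assumptions (g = tanh, matching teacher and student with K = 2, Gaussian inputs, nearly symmetric initialization), and the plateau length depends on them; typically the test error stays almost constant for a long stretch before the hidden units specialize and it drops.

```python
import numpy as np

# Soft-committee student sigma = sum_k g(w_k . xi / sqrt(N)) learns a matching
# teacher by on-line gradient descent on e = 0.5 (sigma - tau)^2. The student
# vectors start as w_0 + small noise (unspecialized), so the dynamics passes
# close to the permutation-symmetric saddle: a quasi-stationary plateau.

rng = np.random.default_rng(3)
N, K, eta = 100, 2, 0.5
B = rng.normal(size=(K, N))                # teacher vectors (the unknown rule)
w0 = 0.01 * rng.normal(size=N)
W = w0 + 1e-3 * rng.normal(size=(K, N))    # w_k ~ w_0 + noise

def output(V, X):                          # network output for a batch of inputs
    return np.tanh(X @ V.T / np.sqrt(N)).sum(axis=1)

X_test = rng.normal(size=(1000, N))        # fixed test set for the learning curve
tau_test = output(B, X_test)

for t in range(200000):
    xi = rng.normal(size=N)
    tau = output(B, xi[None, :])[0]
    s = W @ xi / np.sqrt(N)
    delta = np.tanh(s).sum() - tau
    W -= (eta / np.sqrt(N)) * delta * np.outer(1.0 - np.tanh(s) ** 2, xi)
    if t % 20000 == 0:                     # monitor the learning curve
        err = 0.5 * np.mean((output(W, X_test) - tau_test) ** 2)
        print(t, f"{err:.4f}")
```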