(C) Regression, layered neural networks


SLIDE 1

(C) Regression, layered neural networks

  • Networks of continuous units
  • Regression problems
  • Gradient descent, backpropagation of error
  • The role of the learning rate
  • Online learning, stochastic approximation

SLIDE 2

Neural Networks

Of Neurons and Networks

biological neurons (very brief)

  • single neurons
  • synapses and networks
  • synaptic plasticity and learning

simplified description

  • inspiration for artificial neural networks
  • architectures and types of networks:

recurrent attractor neural networks (associative memory); feed-forward neural networks (classification / regression)

SLIDE 3

[figure: neuron anatomy: soma, dendrites, axon, axon branches, pre-/post-synaptic terminals, synaptic cleft]

neurons: highly specialized cells

  • cell body (soma)
  • incoming dendrites
  • branched axon

many neurons: ≳ 10^12 in the human cortex; highly connected: ≳ 1000 neighbors

action potentials / spikes:

∙ cells generate electric pulses
∙ pulses travel along the axon

SLIDE 4

[figure: synapse: pre-synaptic terminal with vesicles, transmitter, synaptic cleft, post-synaptic receptors]

synapses:

∙ a pre-synaptic pulse arriving at an excitatory / inhibitory synapse triggers / hinders post-synaptic spike generation
∙ an incoming pulse at an excitatory synapse increases, at an inhibitory synapse decreases, the postsynaptic membrane potential
∙ all-or-nothing response: if the potential exceeds the threshold, the postsynaptic neuron fires; if the potential is sub-threshold, the postsynaptic neuron rests

SLIDE 5

simplified description of neural activity: firing rates (e.g. spikes / ms) as the mean activity of single spikes S(t)

[figure: spike train S(t) vs. time [ms] and the corresponding firing rate]

SLIDE 6

(mean) local potential at neuron i (with activity S_i): the weighted sum of incoming activities,

x_i = Σ_j w_ij S_j

synaptic weights:

w_ij > 0 : excitatory synapse
w_ij = 0 : no connection
w_ij < 0 : inhibitory synapse

SLIDE 7

Activation Function

S_i = h( Σ_j w_ij S_j ),   x_i = Σ_j w_ij S_j

non-linear response:
∙ maximal activity h(x → +∞) ≡ 1
∙ monotonic increase h′(x) > 0
∙ minimal activity h(x → −∞) ≡ 0

important class of functions: sigmoidal activation; just one example:

h(x_i) = ½ ( 1 + tanh[ γ (x_i − θ) ] )

with gain parameter γ and local threshold θ
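The sigmoidal activation above and its limiting behaviour can be sketched numerically; a minimal sketch (not from the slides, parameter values arbitrary):

```python
import numpy as np

def h(x, gamma=1.0, theta=0.0):
    """Sigmoidal activation on (0, 1): h(x) = (1 + tanh(gamma * (x - theta))) / 2."""
    return 0.5 * (1.0 + np.tanh(gamma * (x - theta)))

# limiting behaviour as on the slide: h(x -> -inf) = 0, h(x -> +inf) = 1,
# with a monotonic increase in between
xs = np.linspace(-3.0, 3.0, 601)
ys = h(xs)
assert ys[0] < 0.01 and ys[-1] > 0.99
assert np.all(np.diff(ys) > 0)   # h'(x) > 0
```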

SLIDE 8

Activation Function

another example of a sigmoidal activation, on [−1, +1]:

S_i = g( Σ_j w_ij S_j ),   x_i = Σ_j w_ij S_j

non-linear response:
∙ maximal activity g(x → +∞) ≡ +1
∙ monotonic increase g′(x) > 0
∙ minimal activity g(x → −∞) ≡ −1

g(x_i) = tanh[ γ (x_i − θ) ]

with gain parameter γ and local threshold θ

SLIDE 9

McCulloch-Pitts Neurons

an extreme case: infinite gain, γ → ∞

g(x_i) = tanh[ γ (x_i − θ) ] → sign[ x − θ ] = { +1 for x ≥ θ, −1 for x < θ }

McCulloch and Pitts [1943]: the model neuron is either quiescent or maximally active; a graded response is not considered

local threshold θ (don't confuse θ with the all-or-nothing threshold in spiking neurons)

SLIDE 10

Synaptic Plasticity

D. Hebb [1949], consider:

  • presynaptic neuron A
  • postsynaptic neuron B
  • an excitatory synapse w_BA

Hypothesis (Hebbian Learning): if A and B (frequently) fire at the same time, the excitatory synaptic strength increases → a memory effect that will favor joint activity in the future

change of synaptic strength: pre-synaptic × post-synaptic activity,

∆w_BA ∝ S_A S_B   with −1 ≤ S_A, S_B ≤ +1   (for symmetrized firing rates)

SLIDE 11

Artificial Neural Networks

in the following:

  • assembled from simple firing-rate neurons
  • connected by weights w_ij, real-valued synaptic strengths
  • various architectures and types of networks

e.g. attractor neural networks, recurrent networks (here: N = 5 neurons, partial connectivity, activities S_i(t), S_j(t))

dynamical systems, e.g. the Hopfield model: a network of McCulloch-Pitts neurons that can operate as an Associative Memory by learning of synaptic interactions

SLIDE 12

feed-forward networks

  • layered architecture (here: 6-3-4-1)
  • directed connections (here: only to the next layer)
  • input layer (external stimulus)
  • hidden units (internal representation)
  • output unit(s) (function of the input vector)

each unit receives input from the previous layer only:

S_i = g( Σ_j w_ij S_j )

SLIDE 13

the perceptron revisited

input units ξ_j ∈ ℝ, weights w_j ∈ ℝ (w ∈ ℝ^N), a single output unit:

S = sign( Σ_{j=1}^N w_j ξ_j − θ )

output = "linearly separable function" of the input variables, parameterized by the weight vector w and threshold θ

SLIDE 14

convergent two-layer architecture

input units ξ_j ∈ ℝ, ξ ∈ ℝ^N; input-to-hidden weights w_j^(k); hidden-layer units S_k; hidden-to-output weights v_k; a single output unit σ

S_k = g( Σ_j w_j^(k) ξ_j )

σ = g( Σ_{k=1}^K v_k S_k ) = g( Σ_k v_k g( Σ_j w_j^(k) ξ_j ) )

output = non-linear function of the input variables, parameterized by the set of all weights (and thresholds)
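The two equations above amount to a short forward pass; a minimal sketch with g = tanh for both layers and thresholds omitted (sizes, scaling, and seed are arbitrary):

```python
import numpy as np

def forward(xi, W, v, g=np.tanh):
    """Forward pass of the convergent two-layer network:
    S_k = g(w^(k) . xi),  sigma = g(sum_k v_k S_k)."""
    S = g(W @ xi)        # hidden activities, W has shape (K, N)
    return g(v @ S)      # scalar output sigma

rng = np.random.default_rng(0)
N, K = 6, 3
W = rng.normal(size=(K, N)) / np.sqrt(N)   # input-to-hidden weights w^(k)
v = rng.normal(size=K) / np.sqrt(K)        # hidden-to-output weights v_k
xi = rng.normal(size=N)

sigma = forward(xi, W, v)
assert -1.0 < sigma < 1.0                  # tanh output stays in (-1, 1)
```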

SLIDE 15

networks of continuous nodes

continuous activation functions, e.g. g(x) = tanh(γx), for all nodes in the network

given a network architecture, the weights (and thresholds) parameterize a function (input/output relation): ξ ∈ ℝ^N → σ(ξ) ∈ ℝ (here: a single output unit)

Learning as a regression problem: a set of examples with real-valued labels, { ξ^μ, τ^μ = τ(ξ^μ) }_{μ=1}^P

training: (approximately) implement σ(ξ^μ) = τ(ξ^μ) for all μ
generalization: application to novel data, σ(ξ) ≈ τ(ξ)

SLIDE 16

error measure and training

training strategy: employ an error measure for the comparison of student/teacher outputs; just one very popular and plausible choice, the quadratic deviation:

e(σ, τ) = ½ (σ − τ)²

cost function:

E = (1/P) Σ_{μ=1}^P e^μ = (1/P) Σ_{μ=1}^P ½ ( σ(ξ^μ) − τ(ξ^μ) )²

  • defined for a given set of example data
  • guides the training process
  • is a differentiable function of the weights and thresholds
  • training by gradient descent: minimization of E

SLIDE 17

a single unit

ξ_j ∈ ℝ, ξ ∈ ℝ^N, w ∈ ℝ^N

σ = g( Σ_{j=1}^N w_j ξ_j )

E(w) = (1/P) Σ_{μ=1}^P ½ ( g(w · ξ^μ) − τ^μ )²

∂E(w)/∂w_k = (1/P) Σ_{μ=1}^P ( g(w · ξ^μ) − τ^μ ) g′(w · ξ^μ) ξ_k^μ

∇_w E(w) = (1/P) Σ_{μ=1}^P ( g(w · ξ^μ) − τ^μ ) g′(w · ξ^μ) ξ^μ
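The gradient formula above can be checked against a finite-difference approximation; a sketch with g = tanh and synthetic data (sizes and seed arbitrary):

```python
import numpy as np

def grad_E(w, Xi, tau):
    """Gradient of E(w) = (1/P) sum_mu 0.5*(g(w.xi^mu) - tau^mu)^2, g = tanh."""
    x = Xi @ w                       # local potentials w . xi^mu, shape (P,)
    err = np.tanh(x) - tau           # g(w.xi^mu) - tau^mu
    gprime = 1.0 - np.tanh(x) ** 2   # g'(x) = 1 - tanh(x)^2
    return (err * gprime) @ Xi / len(tau)

rng = np.random.default_rng(1)
P, N = 20, 5
Xi = rng.normal(size=(P, N))
tau = rng.uniform(-0.9, 0.9, size=P)
w = rng.normal(size=N)

def E(w):
    return 0.5 * np.mean((np.tanh(Xi @ w) - tau) ** 2)

# central finite differences for each component dE/dw_k
eps = 1e-6
fd = np.array([(E(w + eps * np.eye(N)[k]) - E(w - eps * np.eye(N)[k])) / (2 * eps)
               for k in range(N)])
assert np.allclose(grad_E(w, Xi, tau), fd, atol=1e-8)
```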

SLIDE 18

Backpropagation of Error

convenient calculation of the gradient in multilayer networks (← chain rule)

example: continuous two-layer network with K hidden units

inputs ξ ∈ ℝ^N
weights w_k ∈ ℝ^N, k = 1, 2, …, K
hidden units σ_k(ξ) = g(w_k · ξ)
output σ(ξ) = h( Σ_{j=1}^K v_j g(w_j · ξ) )

Exercise: derive ∇_{w_k} E and ∂E/∂v_k

the weights w_k and v_k are used …
– downward, for the calculation of hidden states and output
– upward, for the calculation of the gradient
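A possible solution of the exercise, assuming g = h = tanh (the slide leaves h unspecified), with a finite-difference check on one component:

```python
import numpy as np

def backprop(xi, tau, W, v):
    """Gradients of e = 0.5*(sigma - tau)^2 for the two-layer net
    sigma = h(sum_k v_k g(w_k . xi)), assuming g = h = tanh."""
    S = np.tanh(W @ xi)                              # hidden states sigma_k
    sigma = np.tanh(v @ S)                           # network output
    delta = (sigma - tau) * (1 - sigma ** 2)         # error through h'
    grad_v = delta * S                               # dE/dv_k
    grad_W = np.outer(delta * v * (1 - S ** 2), xi)  # dE/dw_k via chain rule
    return grad_W, grad_v

rng = np.random.default_rng(2)
N, K = 4, 3
W, v = rng.normal(size=(K, N)), rng.normal(size=K)
xi, tau = rng.normal(size=N), 0.3

gW, gv = backprop(xi, tau, W, v)

# finite-difference check on dE/dv_0
def e(v_):
    return 0.5 * (np.tanh(v_ @ np.tanh(W @ xi)) - tau) ** 2

eps = 1e-6
dv0 = np.zeros(K); dv0[0] = eps
assert abs((e(v + dv0) - e(v - dv0)) / (2 * eps) - gv[0]) < 1e-8
```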

SLIDE 19

backpropagation

A.E. Bryson, Y.-C. Ho (1969). Applied Optimal Control: Optimization, Estimation and Control. Blaisdell Publishing, p. 481

P. Werbos (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University

D.E. Rumelhart, G.E. Hinton, R.J. Williams (1986). Learning representations by back-propagating errors. Nature 323 (6088): 533-536

SLIDE 20

backpropagation

[figure: textbook covers, 1987 and 1995]

SLIDE 21

the negative gradient gives the direction of steepest descent in E

simple gradient-based minimization of E: a sequence w_0 → w_1 → … → w_t → w_{t+1} → … with

w_{t+1} = w_t − η ∇E|_{w_t}

approaches some minimum of E (?)

learning rate η:
– controls the step size of the algorithm
– has to be small enough to ensure convergence
– should be as large as possible to facilitate fast learning

SLIDE 22

assume E has a (local) minimum in w*; Taylor expansion in the vicinity:

E(w) ≈ E(w*) + (w − w*)^T ∇E|_* + ½ (w − w*)^T H* (w − w*) + …   with ∇E|_* = 0

E(w) ≈ E(w*) + ½ (w − w*)^T H* (w − w*)

∇E|_w ≈ H* (w − w*)

with the positive definite Hesse matrix of second derivatives, H*_ij = ∂²E / (∂w_i ∂w_j)

H* has only positive eigenvalues λ_i > 0 and orthonormal eigenvectors u_i (all λ_i ≤ λ_max)

gradient descent in the vicinity of w*:

w_t − w* ≡ δ_t = δ_{t−1} − η ∇E|_{w_{t−1}}

SLIDE 23

gradient descent in the vicinity of w*: with ∇E|_w ≈ H* (w − w*), the update δ_t = δ_{t−1} − η ∇E|_{w_{t−1}} becomes

δ_t = δ_{t−1} − η H* δ_{t−1} = [ I − η H* ] δ_{t−1}

SLIDE 24

iterating the linearized update:

δ_t ≈ [ I − η H* ] δ_{t−1} ≈ [ I − η H* ]^t δ_0

expansion in the eigenvectors { u_i }: δ_0 = Σ_i a_i u_i

δ_t ≈ Σ_i a_i [ I − η H* ]^t u_i = Σ_i a_i [ 1 − η λ_i ]^t u_i

with u_j^T u_k = δ_jk we obtain

| δ_t |² = Σ_i a_i² [ 1 − η λ_i ]^{2t}

SLIDE 25

the iteration approaches the minimum, lim_{t→∞} |δ_t| = 0, only if |1 − η λ_i| < 1 for all i

condition for (local) convergence: η < η_max = 2/λ_max

  • smooth convergence: η < 1/λ_max   (1 − η λ_max > 0)
  • oscillations: 1/λ_max < η < 2/λ_max   (−1 < 1 − η λ_max < 0)
  • divergence: η > η_max = 2/λ_max   (1 − η λ_max < −1)
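The three regimes can be illustrated on a one-dimensional quadratic cost E(w) = ½ λ w², where λ plays the role of λ_max; the numerical values are arbitrary:

```python
# Gradient descent on E(w) = 0.5 * lam * w**2, where grad E = lam * w,
# illustrating the regimes around eta_max = 2 / lam.
lam = 2.0                          # single eigenvalue, so lambda_max = lam

def run(eta, w0=1.0, steps=50):
    w = w0
    for _ in range(steps):
        w = w - eta * lam * w      # w_{t+1} = (1 - eta*lam) * w_t
    return w

assert abs(run(eta=0.4)) < 1e-6    # eta < 1/lam: smooth convergence
assert abs(run(eta=0.9)) < 1e-3    # 1/lam < eta < 2/lam: oscillating decay
assert abs(run(eta=1.1)) > 1e3     # eta > 2/lam: divergence
```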

SLIDE 26

… the above considerations

  • are only valid close to the minimum

local minima can have completely different characteristics (λ_max)

  • do not concern global convergence properties

e.g. the choice of the learning rate far from a minimum

potential problems:

  • E can have (many) local minima far from global optimality
  • initial conditions determine which minimum will be approached
  • anisotropic curvatures can cause strong oscillations
  • E can have saddle points with ∇E = 0 and/or flat regions with ∇E ≈ 0

gradient learning can slow down drastically by, e.g., plateau states (see below)

SLIDE 27

some modifications:

  • improved gradient descent, e.g. a time-dependent learning rate η(t)
  • momentum: ∆w_{t+1} = −η ∇E + a ∆w_t   ("keep going")
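The momentum update above can be sketched on a toy anisotropic quadratic cost; the curvatures and parameter values are hypothetical:

```python
import numpy as np

# Momentum update  dw_{t+1} = -eta * grad E + a * dw_t  on a toy quadratic
# E(w) = 0.5 * w @ H @ w with strongly different curvatures (H diagonal).
H = np.diag([10.0, 1.0])
eta, a = 0.05, 0.8                 # learning rate and momentum parameter

w = np.array([1.0, 1.0])
dw = np.zeros(2)
for _ in range(200):
    grad = H @ w                   # grad E = H w for the quadratic cost
    dw = -eta * grad + a * dw
    w = w + dw

assert np.linalg.norm(w) < 1e-6    # converges to the minimum at w = 0
```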

SLIDE 28

some modifications:

  • sophisticated optimization methods: line search procedures, conjugate gradient, second-order methods, e.g. Newton's method ("matrix update" employs H), …

SLIDE 29

some modifications:

  • different learning rates for different weights, examples:

– heuristics: η ∝ 1/N for input-to-hidden, η ∝ 1/K for hidden-to-output weights
– simplified version of the "matrix update" (assume H is approximately diagonal): update each weight w_j with a learning rate η_j ∝ 1 / (∂²E/∂w_j²)
– learning algorithms realize descent in E as long as ∆w · ∇E < 0

SLIDE 30

some modifications:

  • construction of alternative well-behaved cost functions, one example:

E = Σ_μ { γ (σ − τ)²  if sign(σ) = sign(τ);  (σ − τ)²  if sign(σ) ≠ sign(τ) }

with γ increasing from 0 to 1
small γ: emphasis on the correct sign of the output
large γ: fine tuning of σ

SLIDE 31

stochastic gradient descent

stochastic approximation (on-line gradient descent)

the cost function E = (1/P) Σ_{μ=1}^P e^μ ≡ ⟨e^μ⟩ is an empirical average over the examples → simple approximation of ∇E by ∇e^μ, for one example only

  • select one μ ∈ {1, 2, …, P} with equal probability 1/P
  • single step: w_{t+1} = w_t + ∆w_t = w_t − η ∇e^μ|_{w_t}

– computationally cheap compared to off-line (batch) gradient descent
– intrinsic noise: fewer problems with local minima, flat regions etc.

SLIDE 32

stochastic gradient descent

(when) does the procedure converge? behavior close to a (local) minimum w* of E?
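The on-line single-step update w_{t+1} = w_t − η ∇e^μ|_{w_t} can be sketched for the single tanh unit from earlier; the data, seed, and learning rate are hypothetical, and the targets are noise-free (realizable):

```python
import numpy as np

# On-line (stochastic) gradient descent for a single tanh unit:
# pick one example mu at random, update w with the gradient of e^mu only.
rng = np.random.default_rng(3)
P, N = 100, 5
w_teacher = 0.5 * rng.normal(size=N)
Xi = rng.normal(size=(P, N))
tau = np.tanh(Xi @ w_teacher)          # realizable, noise-free targets

w = np.zeros(N)
eta = 0.5
for t in range(20000):
    mu = rng.integers(P)               # mu in {1,...,P} with probability 1/P
    x = Xi[mu] @ w
    w -= eta * (np.tanh(x) - tau[mu]) * (1 - np.tanh(x) ** 2) * Xi[mu]

E = 0.5 * np.mean((np.tanh(Xi @ w) - tau) ** 2)
assert E < 1e-2                        # the training error becomes small
```

Because the problem is realizable, all e^μ can reach zero, so even a constant learning rate leaves only small residual fluctuations here.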

SLIDE 33

averaged learning step:

⟨∆w⟩ = −η ⟨∇e^μ|_w⟩ = −(η/P) Σ_{μ=1}^P ∇e^μ|_w = −η ∇E|_w

⟨∆w⟩ = 0 for w → w*

SLIDE 34

averaged squared length of the step:

⟨(∆w)²⟩ = η² ⟨( ∇e^μ|_* )²⟩ > 0   (zero is possible only if all e^μ = 0)

for a constant rate η > 0: lim_{t→∞} ⟨(∆w_t)²⟩ > 0   (fluctuations remain non-zero)

SLIDE 35

convergence in the sense of ⟨(∆w)²⟩ → 0 only if η(t) → 0 for t → ∞

one can show: Σ_t η(t) → ∞ but Σ_t η(t)² < ∞ is required

satisfied by, e.g., η(t) ∝ 1/t for large t; learning rate schedules, e.g. η(t) = a / (b + t)

alternative: averages of w over recent (or all) gradient steps
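The two schedule conditions (divergent sum of η(t), finite sum of η(t)²) can be checked numerically for η(t) = a/(b + t); the values of a and b are arbitrary here:

```python
# Learning rate schedule eta(t) = a / (b + t): partial sums of eta(t)
# keep growing (harmonic-like), while the sum of eta(t)^2 stays bounded.
a, b = 1.0, 10.0
eta = [a / (b + t) for t in range(1, 10**6)]
s1 = sum(eta)
s2 = sum(e * e for e in eta)

assert s1 > 10.0          # grows logarithmically with t, without bound
assert s2 < a * a / b     # bounded: sum 1/(b+t)^2 < integral bound 1/b
```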

SLIDE 36

Plateau states

frequent observation: the training of multilayer networks is delayed by quasi-stationary plateaus

(S.J. Hanson, in: Y. Chauvin and D.E. Rumelhart, Backpropagation: Theory, Architectures, and Applications, 1995)

SLIDE 37

example: a two-layer network trained from reliable, perfectly realizable data by on-line gradient descent

[figure: generalization error ε_g vs. the number of examples P/(KN)]

  • fast initial decrease of ε_g
  • fast asymptotic decrease ε_g → 0 (here: matching complexity)
  • plateau state: unspecialized hidden units with w_k ∼ w_o + noise have all obtained some (the same) information about the unknown rule
  • the occurrence of plateaus relates to symmetries: the network output is invariant under permutations of the hidden units; the perfectly symmetric state corresponds to a flat region (saddle) in E; successful learning requires specialization and can be delayed significantly
  • math. analysis: D. Saad and S. Solla (1995); M. Biehl, P. Riegler, C. Wöhler (1996)

analysed in depth in the statistical physics community (1990s); the problem was re-discovered in deep learning

SLIDE 38

Shallow and deep networks

SLIDE 39

shallow and deep architectures

  • shallow networks

frequently used: input-hidden-output architectures, e.g. N-M-1

– often shown to be universal approximators / classifiers
– easy to implement; efficient, fast training, e.g. by backpropagation

examples: Committee/Parity Machine, Extreme Learning Machine, Radial Basis Function networks

special case: Reservoir Computing, which replaces the hidden layer by a dynamical network with intra-layer connections and/or internal dynamics

  • deep networks (at a glimpse): deep learning, convolutional neural networks

SLIDE 40

Extreme Learning Machine (ELM)

input: N-dim. feature vectors x

random input-to-hidden weights (fixed, non-adaptive): W = (w_1, w_2, …, w_M)^T ∈ ℝ^{M×N}

hidden layer: M units (e.g. M > N), e.g. sigmoidal: σ_j = g( w_j^T x ), σ = (σ_1, σ_2, …, σ_M)^T

adaptive hidden-to-output weights v = (v_1, v_2, …, v_M)^T, linear output: S = v^T σ

training (hidden-to-output only!) by regression w.r.t. given targets, e.g. the least squares solution obtained via the Moore-Penrose pseudoinverse
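A minimal numerical sketch of the ELM recipe above, with synthetic data and an arbitrary smooth target (all sizes, the seed, and the weight scaling are assumptions):

```python
import numpy as np

# ELM sketch: fixed random input-to-hidden weights W; only the
# hidden-to-output weights v are fitted, by least squares (pseudoinverse).
rng = np.random.default_rng(4)
N, M, P = 5, 100, 200                      # input dim, hidden units, examples

X = rng.normal(size=(P, N))                # feature vectors x^mu
targets = np.sin(X[:, 0]) + 0.5 * X[:, 1]  # some smooth target function

W = rng.normal(size=(M, N)) / np.sqrt(N)   # random, non-adaptive
Sigma = np.tanh(X @ W.T)                   # hidden states, shape (P, M)
v = np.linalg.pinv(Sigma) @ targets        # Moore-Penrose least squares fit

pred = Sigma @ v
mse = np.mean((pred - targets) ** 2)
assert mse < 1e-2                          # good fit on the training set
```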

SLIDE 41

Extreme Learning Machine (ELM)

  • Huang et al. (IJCNN 2004): concept and name; see the original (and later) publications provided in Nestor
  • triggered numerous publications, even specialized journals and ELM conferences
  • serious, on-going debate about the originality of the concept, see the Wikipedia entry and the Comment by Wang and Wan in IEEE TNN (2008); see also http://elmorigin.weebly.com. One example of an early paper with similar ideas: Schmidt, Kraaijveld, Duin, ICPR 1992
  • conceptual similarity to SVM is discussed in, e.g., Frenay and Verleysen: Using SVMs with randomised feature spaces: an extreme learning approach

SLIDE 42

Radial Basis Function (RBF) networks

input: N-dim. feature vectors x

hidden layer: M units; the activation (*) depends on the distance of x from a center c_i: σ_i = g(|x − c_i|)

(*) example:

σ_i = exp[ −β (x − c_i)² ]   (unnormalized, local)

σ_i = exp[ −β (x − c_i)² ] / Σ_{j=1}^M exp[ −β (x − c_j)² ]   (normalized, constant total activation)

e.g. a linear output unit: S = Σ_{j=1}^M v_j σ_j

adaptive: centers c_i (e.g. by unsupervised vector quantization) and weights v (e.g. by least squares regression for given centers)

beyond RBF:
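The normalized Gaussian activation above can be sketched as follows; the centers, β, and the query point are hypothetical:

```python
import numpy as np

def rbf_normalized(x, centers, beta=1.0):
    """Normalized Gaussian RBF activations: each sigma_i in (0, 1), summing to 1."""
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared distances (x - c_i)^2
    act = np.exp(-beta * d2)
    return act / act.sum()

centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])  # hypothetical c_i
sigma = rbf_normalized(np.array([0.1, 0.0]), centers, beta=2.0)

assert abs(sigma.sum() - 1.0) < 1e-12     # constant total activation
assert sigma.argmax() == 0                # the nearest center dominates
```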

slide-43
SLIDE 43

Neural Networks

RBF clas RBF classif ifier ier

..... ..... ..... .....

input: N-dim. feature vectors x

σi = g(|x − ci|)

hidden layer: M units, activation (*) depends on distance of x from center ci: adaptive hidden-to-output weights (C pseudo-regression problems )

  • r fixed, pre-wired function

assign input to class with maximum score very similar concept: Learning Vector Quantization

  • utput units represent C classes,

compute class-membership scores

... ... [RBF-networks: see book by Bishop for detailed discussion and references ]

SLIDE 44

Reservoir Computing

recurrent network as a reservoir: fixed random connections, represents inputs by different internal states

  • leaky integrator units (liquid state machine)
  • sparsely connected attractor net (echo-state networks)

input: enforce an (initial) state in the reservoir network (or a subset of units)

output: a linear unit with adaptive weights, a read-out of the reservoir state

regression training: comparison with the target output for a given set of input/output examples

SLIDE 45

Reservoir Computing

most prominent examples in the literature (see Nestor for original publications and review articles):

  • echo-state networks [Jaeger 2001]
  • liquid state machines [Natschlaeger et al. 2002]
  • decorrelation-backpropagation [Steil 2004]

see also: http://reservoir-computing.org

SLIDE 46

deep networks

feed-forward networks with a large (?) number of layers and units; a combination of several concepts / methods / tricks

training became feasible due to …

  • increased computational power (backpropagation of error)
  • sparse connectivity (e.g. convolutional networks)
  • weight sharing and pooling
  • availability of huge data sets
  • simplified transfer functions ("rectified linear units", g(x) = max{0, x})
  • efficient regularization techniques (e.g. "dropout")

main application areas with excellent performance: data with spatial / temporal structure, image (faces, digits, scenes) classification / recognition

Goodfellow, Bengio, Courville: Deep Learning, 2016

SLIDE 47

deep networks

"The fishermen in the north of Spain have been using Deep Networks for centuries. Their contribution should be recognized…" (Javier Movellan)

From a discussion about the origins of the term "Deep Networks" in the Connectionists mailing list: http://dove.ccs.fau.edu/dawei/ICM/connectionists.html