

SLIDE 1

Supervised Learning in Neural Networks

Keith L. Downing

The Norwegian University of Science and Technology (NTNU) Trondheim, Norway keithd@idi.ntnu.no

March 7, 2011

SLIDE 2

Supervised Learning

- Constant feedback from an instructor, indicating not only right/wrong, but also the correct answer for each training case.
- Many cases (i.e., input-output pairs) to be learned.
- Weights are modified by a complex procedure (back-propagation) based on output error.
- Feed-forward networks with back-propagation learning are the standard implementation; 99% of neural network applications use this.
- Typical usage: problems with (a) lots of input-output training data, and (b) the goal of a mapping (function) from inputs to outputs.
- Not biologically plausible, although the cerebellum appears to exhibit some aspects.
- But the result of backprop, a trained ANN that performs some function, can be very useful to neuroscientists as a sufficiency proof.

SLIDE 3

Backpropagation Overview

[Figure: encoder/decoder schematic. Training/test cases {(d1, r1), (d2, r2), (d3, r3), ...}; the network's output r3 is compared with the target r* to give the error E = r3 − r*, which drives the gradient dE/dW.]

Feed-Forward Phase - Inputs are sent through the ANN to compute outputs.

Feedback Phase - Error is passed back from the output to the input layers and used to update weights along the way.
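To make the two phases concrete, here is a minimal NumPy sketch (mine, not the author's code) of one training step for a single-hidden-layer sigmoid network; it anticipates the update rules derived on later slides, and the names (`W1`, `W2`, `eta`) are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, t, W1, W2, eta=0.1):
    # Feed-forward phase: inputs sent through the ANN to compute outputs.
    h = sigmoid(W1 @ x)                       # hidden-layer activations
    o = sigmoid(W2 @ h)                       # output-layer activations
    # Feedback phase: error passed back from output toward input,
    # updating weights along the way.
    delta_o = (t - o) * o * (1 - o)           # output error terms
    delta_h = (W2.T @ delta_o) * h * (1 - h)  # hidden error terms
    W2 += eta * np.outer(delta_o, h)
    W1 += eta * np.outer(delta_h, x)
    return o
```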

SLIDE 4

Training -vs- Testing

[Figure: training cases are presented to the neural net N times, with learning; test cases are presented 1 time, without learning.]

- Generalization: correctly handling test cases (that the ANN has not been trained on).
- Over-Training: weights become so fine-tuned to the training cases that generalization suffers: failure on many test cases.
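The slides do not prescribe a recipe for avoiding over-training, but a common one is early stopping: monitor error on the held-out test cases and stop once it starts rising. A sketch under that assumption, with hypothetical helpers `train_epoch` (one pass with learning) and `sse` (error evaluation without learning):

```python
def train_with_early_stopping(net, train_cases, test_cases, max_epochs=1000):
    """Stop when generalization (test-case error) starts to degrade."""
    best_error = float("inf")
    for epoch in range(max_epochs):
        train_epoch(net, train_cases)   # one pass over training cases, learning on
        error = sse(net, test_cases)    # evaluation pass, no learning
        if error > best_error:          # over-training detected
            break
        best_error = error
    return net
```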

SLIDE 5

Widrow-Hoff (a.k.a. Delta) Rule

[Figure: node N with inputs 1, 2, 3; input X reaches N through weight w; S is the summed input and Y the output.]

T = target output value; δ = error.

$$\delta = T - Y$$

$$\Delta w = \eta\,\delta\,X$$

- Delta (δ) = error; Eta (η) = learning rate.
- Goal: change w so as to reduce |δ|.
- Intuition: if δ > 0, we want to decrease it, so we must increase Y. Thus we must increase the sum of weighted inputs to N, which we do by increasing w if X is positive, or decreasing w if X is negative. The case δ < 0 is similar.
- Assumes the derivative of N's transfer function is everywhere non-negative.
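A direct transcription of the rule into Python (my sketch; names are illustrative):

```python
def delta_rule_update(w, x, y, t, eta=0.1):
    """Widrow-Hoff update for one weight on one training case."""
    delta = t - y               # δ = T - Y
    return w + eta * delta * x  # w + Δw, where Δw = η·δ·X
```

If δ > 0 and X > 0, the weight grows, which raises Y and shrinks |δ|, matching the intuition above.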

SLIDE 6

Gradient Descent

- Goal: minimize total error across all output nodes.
- Method: modify weights throughout the network (i.e., at all levels) to follow the route of steepest descent in error space.

[Figure: error E as a function of the weight vector W; gradient descent steps ΔE move toward the minimum.]

$$\Delta w_{ij} = -\eta\,\frac{\partial E_i}{\partial w_{ij}}$$
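As a toy illustration of steepest descent (mine, not from the slides): minimizing the one-weight error surface E(w) = (w - 2)^2 by repeatedly stepping against the gradient:

```python
eta, w = 0.1, 0.0
for step in range(50):
    grad = 2 * (w - 2)   # dE/dw for E(w) = (w - 2)^2
    w -= eta * grad      # Δw = -η · dE/dw
print(w)                 # close to the minimum at w = 2.0
```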

SLIDE 7

Computing ∂Ei/∂wij

[Figure: node i with inputs x_1d..x_nd, weights w_i1..w_in, summed input sum_id, transfer function f_T, output o_id, target t_id, and error E_id.]

Sum of Squared Errors (SSE):

$$E_i = \frac{1}{2} \sum_{d \in D} (t_{id} - o_{id})^2$$

$$\frac{\partial E_i}{\partial w_{ij}} = \frac{1}{2} \sum_{d \in D} 2\,(t_{id} - o_{id})\,\frac{\partial (t_{id} - o_{id})}{\partial w_{ij}} = \sum_{d \in D} (t_{id} - o_{id})\,\frac{\partial (-o_{id})}{\partial w_{ij}}$$

SLIDE 8

Computing ∂(−oid)/∂wij

[Figure: same node-i fragment as on the previous slide.]

Since output = f(sum of weighted inputs):

$$\frac{\partial E_i}{\partial w_{ij}} = \sum_{d \in D} (t_{id} - o_{id})\,\frac{\partial\,(-f_T(\text{sum}_{id}))}{\partial w_{ij}}, \quad \text{where} \quad \text{sum}_{id} = \sum_{k=1}^{n} w_{ik}\,x_{kd}$$

Using the Chain Rule:

$$\frac{\partial f(g(x))}{\partial x} = \frac{\partial f}{\partial g(x)} \times \frac{\partial g(x)}{\partial x}$$

$$\frac{\partial f_T(\text{sum}_{id})}{\partial w_{ij}} = \frac{\partial f_T(\text{sum}_{id})}{\partial \text{sum}_{id}} \times \frac{\partial \text{sum}_{id}}{\partial w_{ij}} = \frac{\partial f_T(\text{sum}_{id})}{\partial \text{sum}_{id}} \times x_{jd}$$

SLIDE 9

Computing ∂sumid/∂wij

Easy!!

$$\frac{\partial\,\text{sum}_{id}}{\partial w_{ij}} = \frac{\partial \left( \sum_{k=1}^{n} w_{ik}\,x_{kd} \right)}{\partial w_{ij}} = \frac{\partial\,(w_{i1}x_{1d} + w_{i2}x_{2d} + \dots + w_{ij}x_{jd} + \dots + w_{in}x_{nd})}{\partial w_{ij}}$$

$$= \frac{\partial(w_{i1}x_{1d})}{\partial w_{ij}} + \frac{\partial(w_{i2}x_{2d})}{\partial w_{ij}} + \dots + \frac{\partial(w_{ij}x_{jd})}{\partial w_{ij}} + \dots + \frac{\partial(w_{in}x_{nd})}{\partial w_{ij}} = 0 + 0 + \dots + x_{jd} + \dots + 0 = x_{jd}$$

SLIDE 10

Computing ∂fT(sumid)/∂sumid

Harder for some fT.

fT = Identity function: $f_T(\text{sum}_{id}) = \text{sum}_{id}$, so

$$\frac{\partial f_T(\text{sum}_{id})}{\partial \text{sum}_{id}} = 1 \quad \Rightarrow \quad \frac{\partial f_T(\text{sum}_{id})}{\partial w_{ij}} = \frac{\partial f_T(\text{sum}_{id})}{\partial \text{sum}_{id}} \times \frac{\partial \text{sum}_{id}}{\partial w_{ij}} = 1 \times x_{jd} = x_{jd}$$

fT = Sigmoid: $f_T(\text{sum}_{id}) = \frac{1}{1 + e^{-\text{sum}_{id}}}$, so

$$\frac{\partial f_T(\text{sum}_{id})}{\partial \text{sum}_{id}} = o_{id}(1 - o_{id}) \quad \Rightarrow \quad \frac{\partial f_T(\text{sum}_{id})}{\partial w_{ij}} = o_{id}(1 - o_{id}) \times x_{jd}$$
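A quick numerical check (my sketch) that the sigmoid's derivative really is o(1 − o), using a central finite difference:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

s, h = 0.5, 1e-6
o = sigmoid(s)
numeric = (sigmoid(s + h) - sigmoid(s - h)) / (2 * h)  # finite difference
assert abs(numeric - o * (1 - o)) < 1e-8               # matches o(1 - o)
```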

SLIDE 11

The only non-trivial calculation

$$\frac{\partial f_T(\text{sum}_{id})}{\partial \text{sum}_{id}} = \frac{\partial\,(1 + e^{-\text{sum}_{id}})^{-1}}{\partial \text{sum}_{id}} = (-1)(1 + e^{-\text{sum}_{id}})^{-2}\,\frac{\partial(1 + e^{-\text{sum}_{id}})}{\partial \text{sum}_{id}} = (-1)(-1)\,e^{-\text{sum}_{id}}(1 + e^{-\text{sum}_{id}})^{-2} = \frac{e^{-\text{sum}_{id}}}{(1 + e^{-\text{sum}_{id}})^2}$$

But notice that, since $1 - f_T(\text{sum}_{id}) = \frac{e^{-\text{sum}_{id}}}{1 + e^{-\text{sum}_{id}}}$:

$$\frac{e^{-\text{sum}_{id}}}{(1 + e^{-\text{sum}_{id}})^2} = f_T(\text{sum}_{id})\,(1 - f_T(\text{sum}_{id})) = o_{id}(1 - o_{id})$$

SLIDE 12

Putting it all together

$$\frac{\partial E_i}{\partial w_{ij}} = \sum_{d \in D} (t_{id} - o_{id})\,\frac{\partial(-f_T(\text{sum}_{id}))}{\partial w_{ij}} = -\sum_{d \in D} (t_{id} - o_{id})\,\frac{\partial f_T(\text{sum}_{id})}{\partial \text{sum}_{id}} \times \frac{\partial \text{sum}_{id}}{\partial w_{ij}}$$

So for fT = Identity:

$$\frac{\partial E_i}{\partial w_{ij}} = -\sum_{d \in D} (t_{id} - o_{id})\,x_{jd}$$

and for fT = Sigmoid:

$$\frac{\partial E_i}{\partial w_{ij}} = -\sum_{d \in D} (t_{id} - o_{id})\,o_{id}(1 - o_{id})\,x_{jd}$$
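A sketch (mine) of the sigmoid-case gradient over a whole dataset, vectorized with NumPy; `X` holds one training case per row:

```python
import numpy as np

def gradient_sse(W, X, t):
    """∂E_i/∂w_ij for a sigmoid node: rows of X are cases x_d, t holds t_id."""
    o = 1.0 / (1.0 + np.exp(-(X @ W)))   # o_id for every case d
    return -((t - o) * o * (1 - o)) @ X  # -Σ_d (t-o)·o·(1-o)·x_jd, per weight j
```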

SLIDE 13

Weight Updates (fT = Sigmoid)

Batch: update weights after each training epoch.

$$\Delta w_{ij} = -\eta\,\frac{\partial E_i}{\partial w_{ij}} = \eta \sum_{d \in D} (t_{id} - o_{id})\,o_{id}(1 - o_{id})\,x_{jd}$$

The weight changes are actually computed after each training case, but w_ij is not updated until the epoch's end.

Incremental: update weights after each training case.

$$\Delta w_{ij} = -\eta\,\frac{\partial E_{id}}{\partial w_{ij}} = \eta\,(t_{id} - o_{id})\,o_{id}(1 - o_{id})\,x_{jd}$$

A lower learning rate (η) is recommended here than for the batch method. Results can depend on case-presentation order, so randomly shuffle the cases after each epoch.
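A sketch (mine) contrasting the two schemes for a single sigmoid node; the default eta values follow the slide's advice of a lower rate for incremental updates:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def batch_epoch(W, X, t, eta=0.3):
    """Compute Δw per case, but apply the accumulated sum at epoch's end."""
    dW = np.zeros_like(W)
    for x_d, t_d in zip(X, t):
        o = sigmoid(W @ x_d)
        dW += eta * (t_d - o) * o * (1 - o) * x_d
    return W + dW

def incremental_epoch(W, X, t, eta=0.1):
    """Update W after each case, in randomly shuffled order."""
    for d in np.random.permutation(len(X)):
        o = sigmoid(W @ X[d])
        W = W + eta * (t[d] - o) * o * (1 - o) * X[d]
    return W
```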

SLIDE 14

Backpropagation in Multi-Layered Neural Networks

[Figure: node j feeding downstream nodes 1..n through weights w_1j..w_nj; the chains of partials ∂o_jd/∂sum_jd, ∂sum_kd/∂o_jd, and ∂E_d/∂sum_kd trace the influence of sum_jd on the error E_d.]

For each node j and each training case d, backpropagation computes an error term

$$\delta_{jd} = -\frac{\partial E_d}{\partial \text{sum}_{jd}}$$

by calculating the influence of sum_jd along each connection from node j to the next downstream layer.

SLIDE 15

Computing δjd

[Figure: same network fragment as on the previous slide.]

Along the upper path, the contribution to $\frac{\partial E_d}{\partial \text{sum}_{jd}}$ is:

$$\frac{\partial o_{jd}}{\partial \text{sum}_{jd}} \times \frac{\partial \text{sum}_{1d}}{\partial o_{jd}} \times \frac{\partial E_d}{\partial \text{sum}_{1d}}$$

So, summing along all paths:

$$\frac{\partial E_d}{\partial \text{sum}_{jd}} = \frac{\partial o_{jd}}{\partial \text{sum}_{jd}} \sum_{k=1}^{n} \frac{\partial \text{sum}_{kd}}{\partial o_{jd}}\,\frac{\partial E_d}{\partial \text{sum}_{kd}}$$

SLIDE 16

Computing δjd

Just as before, most terms in the derivative of the sum are 0, so:

$$\frac{\partial \text{sum}_{kd}}{\partial o_{jd}} = w_{kj}$$

Assuming fT is a sigmoid:

$$\frac{\partial o_{jd}}{\partial \text{sum}_{jd}} = \frac{\partial f_T(\text{sum}_{jd})}{\partial \text{sum}_{jd}} = o_{jd}(1 - o_{jd})$$

Thus:

$$\delta_{jd} = -\frac{\partial E_d}{\partial \text{sum}_{jd}} = -\frac{\partial o_{jd}}{\partial \text{sum}_{jd}} \sum_{k=1}^{n} \frac{\partial \text{sum}_{kd}}{\partial o_{jd}}\,\frac{\partial E_d}{\partial \text{sum}_{kd}} = -o_{jd}(1 - o_{jd}) \sum_{k=1}^{n} w_{kj}(-\delta_{kd}) = o_{jd}(1 - o_{jd}) \sum_{k=1}^{n} w_{kj}\,\delta_{kd}$$

SLIDE 17

Computing δjd

Note that δjd is defined recursively in terms of the δ values in the next downstream layer:

$$\delta_{jd} = o_{jd}(1 - o_{jd}) \sum_{k=1}^{n} w_{kj}\,\delta_{kd}$$

So all δ values in the network can be computed by moving backwards through the network, one layer at a time.
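A sketch (mine) of that backward sweep for a stack of sigmoid layers. It assumes `activations[l]` holds the outputs o of layer l (layer 0 = input) and `weights[l]` connects layer l to layer l+1:

```python
import numpy as np

def backward_deltas(activations, weights, targets):
    """Return δ vectors for layers 1..L, computed output layer first."""
    o_out = activations[-1]
    deltas = [(targets - o_out) * o_out * (1 - o_out)]   # output-layer δ
    for l in range(len(weights) - 1, 0, -1):
        o = activations[l]
        # δ_jd = o_jd(1 - o_jd) Σ_k w_kj δ_kd, with δ_kd one layer downstream
        deltas.insert(0, (weights[l].T @ deltas[0]) * o * (1 - o))
    return deltas
```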

SLIDE 18

Computing ∂Ed/∂wij from δjd: Easy!!

[Figure: node j feeding node i through weight w_ij; w_ij affects E_d only via sum_id.]

The only effect of w_ij upon the error is via its effect upon sum_id:

$$\frac{\partial \text{sum}_{id}}{\partial w_{ij}} = o_{jd}$$

So:

$$\frac{\partial E_d}{\partial w_{ij}} = \frac{\partial \text{sum}_{id}}{\partial w_{ij}} \times \frac{\partial E_d}{\partial \text{sum}_{id}} = \frac{\partial \text{sum}_{id}}{\partial w_{ij}} \times (-\delta_{id}) = -o_{jd}\,\delta_{id}$$

SLIDE 19

Computing ∆wij

Given an error term δ_id (for node i on training case d), the update of w_ij for all nodes j that feed into node i is:

$$\Delta w_{ij} = -\eta\,\frac{\partial E_d}{\partial w_{ij}} = -\eta\,(-o_{jd}\,\delta_{id}) = \eta\,\delta_{id}\,o_{jd}$$

So given δ_id, you can easily calculate Δw_ij for all incoming arcs.
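Combining this with the earlier backward sweep, a sketch (mine) that applies Δw_ij = η·δ_id·o_jd to every layer at once via outer products, under the same assumed data layout:

```python
import numpy as np

def apply_updates(weights, activations, deltas, eta=0.1):
    """Δw_ij = η · δ_id · o_jd for all incoming arcs of every node."""
    for l in range(len(weights)):
        # deltas[l] belongs to layer l+1; activations[l] to layer l
        weights[l] += eta * np.outer(deltas[l], activations[l])
    return weights
```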

SLIDE 20

Learning XOR

[Figure: three XOR training runs; sum-squared error (0.0 to 1.2) vs. epoch (0 to 1000), each run converging along a different trajectory.]

- Epoch = all 4 entries of the XOR truth table.
- 2 (inputs) x 2 (hidden) x 1 (output) network.
- Random initialization of all weights in [-1, 1].
- XOR is not linearly separable, so it takes a while!
- Each run is different due to the random weight initialization.
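Putting the whole derivation to work, a compact runnable sketch (my code, not the author's) that trains a 2x2x1 sigmoid network on XOR with incremental updates, weights initialized in [-1, 1], and a bias input fixed at 1 (some runs may need more epochs or a different seed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])
W1 = rng.uniform(-1, 1, (2, 3))   # hidden weights (last column: bias)
W2 = rng.uniform(-1, 1, 3)        # output weights (last entry: bias)
sig = lambda s: 1 / (1 + np.exp(-s))
eta = 0.5

for epoch in range(10000):        # one epoch = all 4 truth-table entries
    for d in rng.permutation(4):
        x = np.append(X[d], 1.0)              # bias node outputs 1
        h = np.append(sig(W1 @ x), 1.0)
        o = sig(W2 @ h)
        delta_o = (T[d] - o) * o * (1 - o)    # output δ
        delta_h = (W2[:2] * delta_o) * h[:2] * (1 - h[:2])  # hidden δ
        W2 += eta * delta_o * h
        W1 += eta * np.outer(delta_h, x)

for x in X:                       # outputs should approach 0, 1, 1, 0
    print(x, sig(W2 @ np.append(sig(W1 @ np.append(x, 1.0)), 1.0)))
```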

SLIDE 21

Learning to Classify Wines

Class | Properties
  1   | 14.23  1.71  2.43  15.6  127  2.8   ...
  1   | 13.2   1.78  2.14  11.2  100  2.65  ...
  2   | 13.11  1.01  1.7   15     78  2.98  ...
  3   | 13.17  2.59  2.37  20    120  1.65  ...
 ...

[Figure: network mapping the 13 wine properties through a hidden layer to a single wine-class output.]

SLIDE 22

Wine Runs

[Figure: four wine-classification training runs, sum-squared error vs. epoch (0 to 100): 13x5x1 (lrate = 0.3), 13x5x1 (lrate = 0.1), 13x10x1 (lrate = 0.3), and 13x25x1 (lrate = 0.3).]

SLIDE 23

Momentum: Combatting Local Minima

[Figure: the previous weight change Δw(t−1) adds momentum to the current change Δw(t).]

$$\Delta w_{ij}(t) = -\eta\,\frac{\partial E_i}{\partial w_{ij}} + \alpha\,\Delta w_{ij}(t-1)$$
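A minimal sketch (mine) of the momentum update; the α-weighted previous step is the only addition to plain gradient descent:

```python
def momentum_update(W, grad, prev_dW, eta=0.1, alpha=0.9):
    """Δw(t) = -η·∂E/∂w + α·Δw(t-1); returns updated weights and the step."""
    dW = -eta * grad + alpha * prev_dW
    return W + dW, dW
```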

SLIDE 24

Practical Tips

1. Only add as many hidden layers and hidden nodes as necessary. Too many → more weights to learn and an increased chance of over-specialization.

2. Scale all input values to the same range, typically [0, 1] or [-1, 1] (see the sketch after this list, which also covers tips 3 and 5).

3. Use target values of 0.1 (for 0) and 0.9 (for 1) to avoid the saturation effects of sigmoids.

4. Beware of tricky encodings of input (and decodings of output) values. Don't combine too much information into a single node's activation value (even though it's fun to try), since this can make proper weights difficult (or impossible) to learn.

5. For discrete (e.g., nominal) values, one (input or output) node per value is often most effective, e.g., car model and city of residence vs. income and education for assessing car-insurance risk.

6. All initial weights should be relatively small: [-0.1, 0.1].

7. Bias nodes can be helpful for complicated data sets.

8. Check that all your layer sizes, activation functions, activation ranges, weight ranges, learning rates, etc. make sense in terms of each other and your goals for the ANN. One improper choice can ruin the results.
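A sketch (mine) of tips 2, 3, and 5 as preprocessing helpers: scaling inputs to [0, 1], softening binary targets to 0.1/0.9, and one-hot encoding nominal values:

```python
import numpy as np

def scale_01(X):
    """Tip 2: rescale each input column to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def soften_targets(t):
    """Tip 3: map 0 -> 0.1 and 1 -> 0.9 to avoid sigmoid saturation."""
    return np.where(np.asarray(t) == 1, 0.9, 0.1)

def one_hot(values, categories):
    """Tip 5: one input/output node per discrete value."""
    return np.array([[1.0 if v == c else 0.0 for c in categories]
                     for v in values])
```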

SLIDE 25

Bias Nodes

[Figure: a bias node with constant output 1, connected by its own weights w to nodes in the input and output layers.]

- Constant output = 1.
- All outgoing weights are independent and modifiable by backprop.
- The negative of each such weight functions like a threshold for the downstream node.
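One common implementation (an assumption of mine, consistent with the XOR sketch earlier) treats the bias node as an extra input clamped to 1, so backprop trains its weight like any other:

```python
import numpy as np

def with_bias(x):
    """Append the bias node's constant output (1) to an input vector."""
    return np.append(x, 1.0)

# For node i: sum_i = w · x + w_bias · 1, so the node's net input is
# positive exactly when w · x exceeds -w_bias, i.e. -w_bias acts as a threshold.
```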

SLIDE 26

Supervised Learning in the Cerebellum

[Figure: cerebellar circuit. Mossy fibers carry sensory + cortical inputs (including an efference copy) to granular cells, whose parallel fibers contact Purkinje cells; Golgi cells provide feedback in the granular layer; climbing fibers from the inferior olive carry somatosensory (touch, pain, body position) + cortical inputs; Purkinje cells inhibit deep cerebellar neurons projecting to the cerebral cortex and spinal cord.]

- Granular cells detect contexts.
- Parallel fibers and Purkinje cells map contexts to actions.
- Climbing fibers from the inferior olive provide (supervisory) feedback signals for LTD on parallel-fiber-to-Purkinje synapses.
