XOR with intermediate ("hidden") units / Delta rule as gradient descent


XOR with intermediate (“hidden”) units

  • Intermediate units can re-represent input patterns as new patterns with altered similarities.
  • Targets that are not linearly separable in the input space can be linearly separable in the intermediate representational space (a concrete sketch follows below).
  • Intermediate units are called "hidden" because their activations are not determined directly by the training environment (inputs and targets).
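A minimal numpy sketch (not from the slides) of this idea: a hand-picked hidden re-representation, here an OR unit and an AND unit chosen purely for illustration, turns the XOR targets into a linearly separable problem.

```python
import numpy as np

# Hypothetical illustration: XOR targets are not linearly separable over the
# raw inputs, but they are over a hand-picked hidden re-representation
# (an OR unit and an AND unit).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 0])                      # XOR targets

# Hidden units: h1 = OR(x1, x2), h2 = AND(x1, x2)
H = np.column_stack([X.max(axis=1), X.min(axis=1)])

# In the hidden space a single linear threshold unit solves XOR:
# output = 1 if (h1 - h2) > 0.5
y = ((H[:, 0] - H[:, 1]) > 0.5).astype(int)
print(H)          # new patterns with altered similarities
print(y, t)       # y matches the XOR targets
```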


[Figure: network diagram with hidden and output layers]

  • Hidden-to-output weights can be trained with the Delta rule.
  • How can we train input-to-hidden weights?
  • Hidden units do not have targets (for determining error).
  • Trick: we don't need targets; we just need to know how hidden activations affect error (i.e., error derivatives).


Delta rule as gradient descent in error (sigmoid units)

Net input and activation (sigmoid units), and error:
$$n_j = \sum_i a_i w_{ij} \qquad a_j = \frac{1}{1 + \exp(-n_j)} \qquad E = \frac{1}{2} \sum_j (t_j - a_j)^2$$

Signal flow: $a_i \xrightarrow{\,w_{ij}\,} n_j \rightarrow a_j \rightarrow E$ (target $t_j$ enters at $E$)

Gradient descent:
$$\Delta w_{ij} = -\epsilon \frac{\partial E}{\partial w_{ij}}$$
$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_j}\,\frac{d a_j}{d n_j}\,\frac{\partial n_j}{\partial w_{ij}} = -(t_j - a_j)\, a_j (1 - a_j)\, a_i$$
$$\Delta w_{ij} = -\epsilon \frac{\partial E}{\partial w_{ij}} = \epsilon\, (t_j - a_j)\, a_j (1 - a_j)\, a_i$$
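A minimal numpy sketch of one delta-rule update for a layer of sigmoid output units, following the equations above; the function and variable names (`delta_rule_step`, `a_in`, `W`, `t`, `eps`) are my own, not from the slides.

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

# One delta-rule step for a single layer of sigmoid units.
# a_in: input activations, shape (n_in,); W: weights, shape (n_in, n_out);
# t: targets, shape (n_out,); eps: learning rate epsilon.
def delta_rule_step(a_in, W, t, eps=0.1):
    n = a_in @ W                      # n_j = sum_i a_i w_ij
    a = sigmoid(n)                    # a_j = 1 / (1 + exp(-n_j))
    E = 0.5 * np.sum((t - a) ** 2)    # E = 1/2 sum_j (t_j - a_j)^2
    # dE/dw_ij = -(t_j - a_j) a_j (1 - a_j) a_i, so the descent step is
    # Δw_ij = eps * (t_j - a_j) * a_j * (1 - a_j) * a_i
    delta = (t - a) * a * (1 - a)
    W += eps * np.outer(a_in, delta)
    return E
```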


Generalized Delta rule (“back-propagation”)

Net input, activation, and error as before:
$$n_j = \sum_i a_i w_{ij} \qquad a_j = \frac{1}{1 + \exp(-n_j)} \qquad E = \frac{1}{2} \sum_j (t_j - a_j)^2$$

Signal flow (hidden unit $i$, output unit $j$): $n_i \rightarrow a_i \xrightarrow{\,w_{ij}\,} n_j \rightarrow a_j \rightarrow E$ (target $t_j$ enters at $E$)

Intermediate notation ("input derivatives" in Lens):
$$\frac{\partial E}{\partial n_j} = \frac{\partial E}{\partial a_j}\,\frac{d a_j}{d n_j} = -(t_j - a_j)\, a_j (1 - a_j)$$

Gradient descent:
$$\Delta w_{ij} = -\epsilon \frac{\partial E}{\partial w_{ij}} \qquad \frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial n_j}\,\frac{\partial n_j}{\partial w_{ij}} = \frac{\partial E}{\partial n_j}\, a_i$$

Error passed back to the hidden activations:
$$\frac{\partial E}{\partial a_i} = \sum_j \frac{\partial E}{\partial n_j}\,\frac{\partial n_j}{\partial a_i} = \sum_j \frac{\partial E}{\partial n_j}\, w_{ij}$$
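A small sketch of these quantities in numpy, focused on the intermediate "input derivative" $\partial E/\partial n_j$ and the error sent back to the hidden units; the names (`backprop_hidden`, `a_h`, `W_ho`) are assumptions of mine, not from the slides or Lens.

```python
import numpy as np

# Generalized Delta rule for the hidden-to-output layer.
# a_h: hidden activations, shape (n_hidden,); W_ho: hidden-to-output weights,
# shape (n_hidden, n_out); t: targets, shape (n_out,).
def backprop_hidden(a_h, W_ho, t, eps=0.1):
    n_o = a_h @ W_ho
    a_o = 1.0 / (1.0 + np.exp(-n_o))

    # "Input derivatives" at the output units: dE/dn_j
    dE_dn_o = -(t - a_o) * a_o * (1 - a_o)

    # Weight gradients: dE/dw_ij = dE/dn_j * a_i
    dE_dW_ho = np.outer(a_h, dE_dn_o)

    # Error sent back to the hidden units: dE/da_i = sum_j dE/dn_j * w_ij
    dE_da_h = W_ho @ dE_dn_o

    W_ho -= eps * dE_dW_ho            # gradient descent on the weights
    return dE_da_h                    # used to train input-to-hidden weights
```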



Back-propagation

Forward pass (⇑):
$$a_i = \frac{1}{1 + \exp(-n_i)} \qquad n_j = \sum_i a_i w_{ij} \qquad a_j = \frac{1}{1 + \exp(-n_j)}$$

Backward pass (⇓):
$$\frac{\partial E}{\partial a_j} = -(t_j - a_j) \qquad \frac{\partial E}{\partial n_j} = \frac{\partial E}{\partial a_j}\, a_j (1 - a_j)$$
$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial n_j}\, a_i \qquad \frac{\partial E}{\partial a_i} = \sum_j \frac{\partial E}{\partial n_j}\, w_{ij}$$
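A compact sketch of the full forward and backward pass through an input–hidden–output network, mirroring the equations above; the shapes and names (`forward_backward`, `W_ih`, `W_ho`) are my own assumptions.

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

# One forward/backward pass: input x -> hidden -> output, targets t.
# W_ih: input-to-hidden weights; W_ho: hidden-to-output weights.
def forward_backward(x, W_ih, W_ho, t):
    # Forward pass
    a_h = sigmoid(x @ W_ih)                   # hidden activations
    a_o = sigmoid(a_h @ W_ho)                 # output activations

    # Backward pass
    dE_da_o = -(t - a_o)                      # dE/da_j
    dE_dn_o = dE_da_o * a_o * (1 - a_o)       # dE/dn_j
    dE_dW_ho = np.outer(a_h, dE_dn_o)         # dE/dw_ij = dE/dn_j * a_i

    dE_da_h = W_ho @ dE_dn_o                  # dE/da_i = sum_j dE/dn_j * w_ij
    dE_dn_h = dE_da_h * a_h * (1 - a_h)       # same rule, one layer down
    dE_dW_ih = np.outer(x, dE_dn_h)

    return dE_dW_ih, dE_dW_ho
```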


What do hidden representations learn?

  • Plaut and Shallice (1993): mapped orthography to semantics (unrelated similarities).
  • Compared similarities among hidden representations to those among orthographic and semantic representations (over settling).
  • Hidden representations "split the difference" between input and output similarity.


Accelerating learning: Momentum descent

$$\Delta w_{ij}[t] = -\epsilon \frac{\partial E}{\partial w_{ij}} + \alpha\, \Delta w_{ij}[t-1]$$
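A minimal sketch of the momentum update in numpy: the weight step blends the current gradient with the previous step, scaled by the momentum parameter $\alpha$. The function and variable names are mine, and the default values of `eps` and `alpha` are illustrative assumptions.

```python
import numpy as np

# Momentum descent: Δw[t] = -eps * dE/dw + alpha * Δw[t-1]
def momentum_step(W, grad, prev_step, eps=0.1, alpha=0.9):
    step = -eps * grad + alpha * prev_step    # integrated gradient direction
    W += step
    return step                               # keep for the next update
```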


“Auto-encoder” network (4–2–4)
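The 4–2–4 auto-encoder is the standard encoder task: four one-hot input patterns must be reproduced at the output through only two hidden units. A sketch of that setup trained with back-propagation follows; the learning rate, epoch count, random seed, and inclusion of bias weights are my assumptions, since the slide's exact parameters are not shown.

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

rng = np.random.default_rng(0)
X = np.eye(4)                                  # one-hot inputs double as targets
W_ih = rng.normal(scale=0.5, size=(4, 2)); b_h = np.zeros(2)
W_ho = rng.normal(scale=0.5, size=(2, 4)); b_o = np.zeros(4)

eps = 0.5
for epoch in range(10000):
    for x in X:
        a_h = sigmoid(x @ W_ih + b_h)                 # forward pass
        a_o = sigmoid(a_h @ W_ho + b_o)
        dE_dn_o = -(x - a_o) * a_o * (1 - a_o)        # backward pass
        dE_dn_h = (W_ho @ dE_dn_o) * a_h * (1 - a_h)
        W_ho -= eps * np.outer(a_h, dE_dn_o); b_o -= eps * dE_dn_o
        W_ih -= eps * np.outer(x, dE_dn_h);   b_h -= eps * dE_dn_h

# The hidden layer learns a 2-unit code for the four patterns; the outputs
# should approximate the identity mapping (results vary with the seed).
print(np.round(sigmoid(sigmoid(X @ W_ih + b_h) @ W_ho + b_o), 2))
```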



Projections of error surface in weight space

  • Asterisk: error of the current set of weights.
  • Tick mark: error of the next set of weights.
  • Solid curve (0): gradient direction.
  • Solid curve (21): integrated gradient direction (including momentum). This is the actual direction of the weight step (the tick mark lies on this curve); the number is its angle with the gradient direction.
  • Dotted curves: random directions (each labeled by its angle with the gradient direction).


Epochs 1-2


Epochs 3-4


Epochs 5-6


Epochs 7-8


Epochs 9-10


Epochs 25-50


Epochs 75-end



High momentum (epochs 1-2)


High momentum (epochs 3-4)


High learning rate (epochs 1-2)


High learning rate (epochs 3-4)
