Neural Networks: Hopfield Nets and Boltzmann Machines (Fall 2017)


slide-1
SLIDE 1

Neural Networks

Hopfield Nets and Boltzmann Machines Fall 2017

1

slide-2
SLIDE 2
Recap: Hopfield network

  • Symmetric loopy network
  • Each neuron is a perceptron with a +1/βˆ’1 output
  • Every neuron receives input from every other neuron
  • Every neuron outputs signals to every other neuron

$y_i = \Theta\Big(\sum_{j\neq i} w_{ji}\, y_j + b_i\Big), \qquad \Theta(z) = \begin{cases} +1 & \text{if } z > 0 \\ -1 & \text{if } z \le 0 \end{cases}$

2

slide-3
SLIDE 3

Recap: Hopfield network

  • At each time each neuron receives a β€œfield” $\sum_{j\neq i} w_{ji}\, y_j + b_i$
  • If the sign of the field matches its own sign, it does not respond
  • If the sign of the field opposes its own sign, it β€œflips” to match the sign of the field

$y_i = \Theta\Big(\sum_{j\neq i} w_{ji}\, y_j + b_i\Big), \qquad \Theta(z) = \begin{cases} +1 & \text{if } z > 0 \\ -1 & \text{if } z \le 0 \end{cases}$

3

slide-4
SLIDE 4

Recap: Energy of a Hopfield Network

$E = -\sum_{i,\, j<i} w_{ij}\, y_i y_j$

  • The system will evolve until the energy hits a local minimum
  • Update rule, not assuming node bias: $y_i = \Theta\Big(\sum_{j\neq i} w_{ji}\, y_j\Big)$, with $\Theta$ as defined above
  • In vector form, including a bias term (not used in Hopfield nets): $E = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

4

slide-5
SLIDE 5

Recap: Evolution

  • The network will evolve until it arrives at a

local minimum in the energy contour

[Figure: energy vs. state]

$E = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}$

5

slide-6
SLIDE 6

Recap: Content-addressable memory

  • Each of the minima is a β€œstored” pattern

– If the network is initialized close to a stored pattern, it will inevitably evolve to the pattern

  • This is a content-addressable memory

– Recall memory content from partial or corrupt values

  • Also called associative memory

[Figure: energy vs. state]

6

slide-7
SLIDE 7

Recap – Analogy: Spin Glasses

  • Magnetic dipoles
  • Each dipole tries to align itself to the local field

– In doing so it may flip

  • This will change fields at other dipoles

– Which may flip

  • Which changes the field at the current dipole…

7

slide-8
SLIDE 8

Recap – Analogy: Spin Glasses

  • The total potential energy of the system:

$E = C - \frac{1}{2}\sum_i x_i f(p_i) = C - \sum_i \sum_{j>i} \frac{r\, x_i x_j}{|p_i - p_j|^2} - \sum_i b_i x_i$

  • The system evolves to minimize the PE
– Dipoles stop flipping if any flips result in an increase of PE

Total field at current dipole:

$f(p_i) = \sum_{j\neq i} \frac{r\, x_j}{|p_i - p_j|^2} + b_i$

Response of current dipole:

$x_i = \begin{cases} x_i & \text{if } \operatorname{sign}\!\big(x_i f(p_i)\big) = 1 \\ -x_i & \text{otherwise} \end{cases}$

8

slide-9
SLIDE 9

Recap: Spin Glasses

  • The system stops at one of its stable configurations

– Where PE is a local minimum

  • Any small jitter from this stable configuration returns it to the stable

configuration

– I.e. the system remembers its stable state and returns to it

[Figure: energy vs. state]

9

slide-10
SLIDE 10

Recap: Hopfield net computation

  • Very simple
  • Updates can be done sequentially, or all at once
  • Convergence

$E = -\sum_i \sum_{j>i} w_{ji}\, y_j y_i$

does not change significantly any more

  • 1. Initialize network with initial pattern: $y_i(0) = x_i, \quad 0 \le i \le N-1$
  • 2. Iterate until convergence: $y_i(t+1) = \Theta\Big(\sum_{j\neq i} w_{ji}\, y_j(t)\Big), \quad 0 \le i \le N-1$

10
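A minimal sketch of this loop in NumPy (illustrative only: the function name evolve, the asynchronous sweep order, and the convergence test are my own choices, not part of the slides):

import numpy as np

def evolve(W, y, max_iters=100):
    # Asynchronous Hopfield updates: sweep the neurons in random order
    # until a full sweep produces no flips (convergence).
    y = y.copy()
    for _ in range(max_iters):
        changed = False
        for i in np.random.permutation(len(y)):
            field = W[i] @ y - W[i, i] * y[i]   # exclude the self-term
            y_new = 1.0 if field > 0 else -1.0  # Theta(z): +1 if z > 0, else -1
            if y_new != y[i]:
                y[i] = y_new
                changed = True
        if not changed:
            break
    return y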

slide-11
SLIDE 11

Examples: Content addressable memory

  • http://staff.itee.uq.edu.au/janetw/cmc/chapters/Hopfield/

11

slide-12
SLIDE 12

β€œTraining” the network

  • How do we make the network store a specific

pattern or set of patterns?

– Hebbian learning
– Geometric approach
– Optimization

  • Secondary question

– How many patterns can we store?

12

slide-13
SLIDE 13

Recap: Hebbian Learning to Store a Specific Pattern

  • For a single stored pattern, Hebbian learning

results in a network for which the target pattern is a global minimum

HEBBIAN LEARNING: π‘₯

π‘˜π‘— = π‘§π‘˜π‘§π‘—

1

  • 1
  • 1
  • 1

1 13

𝐗 = π³π‘žπ³π‘ž

π‘ˆ βˆ’ I

slide-14
SLIDE 14

Hebbian learning: Storing a 4-bit pattern

  • Left: Pattern stored. Right: Energy map
  • Stored pattern has lowest energy
  • Gradation of energy ensures stored pattern (or its ghost) is recalled

from everywhere

14

slide-15
SLIDE 15
Recap: Hebbian Learning to Store Multiple Patterns

  • $\{\mathbf{y}_p\}$ is the set of patterns to store
– Superscript $p$ represents the specific pattern
  • $N_p$ is the number of patterns to store

$w_{ji} = \sum_{p\in\{p\}} y_i^p y_j^p$

$\mathbf{W} = \sum_p \big(\mathbf{y}_p \mathbf{y}_p^T - \mathbf{I}\big) = \mathbf{Y}\mathbf{Y}^T - N_p \mathbf{I}$

[Figure: example 4-bit patterns]

15
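A short sketch of this rule in NumPy (hedged: the helper name and the patterns-by-N array layout are my own):

import numpy as np

def hebbian_weights(patterns):
    # patterns: (N_p, N) array of +/-1 patterns.
    # Returns W = Y Y^T - N_p I, which zeroes the self-connections.
    Y = np.asarray(patterns, dtype=float)
    n_patterns, n_units = Y.shape
    return Y.T @ Y - n_patterns * np.eye(n_units)

For example, hebbian_weights([[1, -1, 1, -1], [1, 1, -1, -1]]) stores two 4-bit patterns.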

slide-16
SLIDE 16

How many patterns can we store?

  • Hopfield: a network of $N$ neurons can store up to $0.14N$ patterns

16

slide-17
SLIDE 17
  • Consider that the network is in any stored state π‘§π‘žβ€²
  • At any node 𝑙 the field we obtain is

β„Žπ‘™

π‘žβ€² = ෍ π‘˜

𝑧𝑙

π‘žβ€² π‘§π‘˜ π‘žβ€²π‘§π‘˜ π‘žβ€² + ෍ π‘žβ‰ π‘žβ€²

෍

π‘˜

𝑧𝑙

π‘žπ‘§π‘˜ π‘ž π‘§π‘˜ π‘žβ€² = (𝑂 βˆ’ 1)𝑧𝑙 π‘žβ€² + ෍ π‘žβ‰ π‘žβ€²

෍

π‘˜

𝑧𝑙

π‘žπ‘§π‘˜ π‘žπ‘§π‘˜ π‘žβ€²

  • If the second β€œcrosstalk” term sums to less than 𝑂 βˆ’ 1, the symbol will not

flip

1

  • 1
  • 1
  • 1

1 17

π‘₯

π‘˜π‘— = ෍ π‘žβˆˆ{π‘ž}

𝑧𝑗

π‘žπ‘§π‘˜ π‘ž

Recap: Hebbian Learning to Store a Specific Pattern

slide-18
SLIDE 18

β„Žπ‘™

π‘žβ€² = ෍ π‘˜

𝑧𝑙

π‘žβ€² π‘§π‘˜ π‘žβ€²π‘§π‘˜ π‘žβ€² + ෍ π‘žβ‰ π‘žβ€²

෍

π‘˜

𝑧𝑙

π‘žπ‘§π‘˜ π‘ž π‘§π‘˜ π‘žβ€² = (𝑂 βˆ’ 1)𝑧𝑙 π‘žβ€² + ෍ π‘žβ‰ π‘žβ€²

෍

π‘˜

𝑧𝑙

π‘žπ‘§π‘˜ π‘žπ‘§π‘˜ π‘žβ€²

  • If 𝑧𝑙

π‘žβ€² Οƒπ‘žβ‰ π‘žβ€² Οƒπ‘˜ 𝑧𝑙 π‘žπ‘§π‘˜ π‘žπ‘§π‘˜ π‘žβ€² is positive, then Οƒπ‘žβ‰ π‘žβ€² Οƒπ‘˜ 𝑧𝑙 π‘žπ‘§π‘˜ π‘ž π‘§π‘˜ π‘žβ€² is the same

sign as 𝑧𝑙

π‘žβ€², and it will not flip

  • If we choose 𝑄 patterns at random, what is the probability that

𝑧𝑙

π‘žβ€² Οƒπ‘žβ‰ π‘žβ€² Οƒπ‘˜ 𝑧𝑙 π‘žπ‘§π‘˜ π‘žπ‘§π‘˜ π‘žβ€² will be positive for all symbols for all 𝑄 of them?

1

  • 1
  • 1
  • 1

1 18

π‘₯

π‘˜π‘— = ෍ π‘žβˆˆ{π‘ž}

𝑧𝑗

π‘žπ‘§π‘˜ π‘ž

Recap: Hebbian Learning to Store a Specific Pattern
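An empirical version of this stability question, as a sketch (function and parameter names are mine; 0.14N is the capacity figure quoted above):

import numpy as np

def fraction_stable_bits(N=100, K=14, trials=20, seed=0):
    # For K random patterns stored with W = Y Y^T - K I, measure how
    # often a stored bit's field keeps the bit's own sign (no flip).
    rng = np.random.default_rng(seed)
    stable, total = 0, 0
    for _ in range(trials):
        Y = rng.choice([-1.0, 1.0], size=(K, N))
        W = Y.T @ Y - K * np.eye(N)
        fields = Y @ W                      # field at every node, every pattern
        stable += np.sum(np.sign(fields) == Y)
        total += Y.size
    return stable / total

# fraction_stable_bits(N=100, K=14) is typically close to 1;
# it degrades as K grows past ~0.14 N.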

slide-19
SLIDE 19

How many patterns can we store?

  • Hopfield: a network of $N$ neurons can store up to $0.14N$ patterns
  • What does this really mean?
– Let’s look at some examples

19

slide-20
SLIDE 20

Hebbian learning: One 4-bit pattern

  • Left: Pattern stored. Right: Energy map
  • Note: Pattern is an energy well, but there are other local minima
– Where?
– Also note the β€œshadow” pattern

20

slide-21
SLIDE 21

Storing multiple patterns: Orthogonality

  • The maximum Hamming distance between two $N$-bit patterns is $N/2$
– Because any pattern $\mathbf{y}$ and its complement $-\mathbf{y}$ are equivalent for our purpose
  • Two patterns $\mathbf{y}_1$ and $\mathbf{y}_2$ that differ in $N/2$ bits are orthogonal
– Because $\mathbf{y}_1^T \mathbf{y}_2 = 0$
  • For $N = 2^M R$, where $R$ is an odd number, there are at most $2^M$ orthogonal binary patterns
– Others may be almost orthogonal

21
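A quick numeric illustration of the orthogonality claim (the values are my own choice):

import numpy as np

# Two 4-bit patterns that differ in exactly N/2 = 2 positions:
y1 = np.array([1,  1, -1, -1])
y2 = np.array([1, -1,  1, -1])   # differs from y1 in two of four bits
print(int(y1 @ y2))              # 0 -> orthogonal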

slide-22
SLIDE 22

Two orthogonal 4-bit patterns

  • Patterns are local minima (stationary and stable)

– No other local minima exist
– But patterns perfectly confusable for recall

22

slide-23
SLIDE 23

Two non-orthogonal 4-bit patterns

  • Patterns are local minima (stationary and stable)

– No other local minima exist
– Actual wells for patterns

  • Patterns may be perfectly recalled!
– Note $K > 0.14N$

23

slide-24
SLIDE 24

Three orthogonal 4-bit patterns

  • All patterns are local minima (stationary and

stable)

– But recall from perturbed patterns is random

24

slide-25
SLIDE 25

Three non-orthogonal 4-bit patterns

  • All patterns are local minima and recalled
– Note $K > 0.14N$
– Note some β€œghosts” ended up in the β€œwell” of other patterns

  • So one of the patterns has stronger recall than the other two

25

slide-26
SLIDE 26

Four orthogonal 4-bit patterns

  • All patterns are stationary, but none are stable

– Total wipe out

26

slide-27
SLIDE 27

Four nonorthogonal 4-bit patterns

  • Believe it or not, all patterns are stored for K = N!

– Only β€œcollisions” when the ghost of one pattern occurs next to another

  • [1 1 1 1] and its ghost are strong attractors (why?)

27

slide-28
SLIDE 28

How many patterns can we store?

  • Hopfield: a network of $N$ neurons can store up to $0.14N$ patterns

  • Apparently a fuzzy statement

– What does it really mean to say β€œstores” 0.14N patterns?

  • Stationary? Stable? No other local minima?
  • N=4 may not be a good case (N too small)

28

slide-29
SLIDE 29

A 6-bit pattern

  • Perfectly stationary and stable
  • But many spurious local minima..

– Which are β€œfake” memories

29

slide-30
SLIDE 30

Two orthogonal 6-bit patterns

  • Perfectly stationary and stable
  • Several spurious β€œfake-memory” local minima..

– Figure overstates the problem: it is actually a 3-D Karnaugh map

30

slide-31
SLIDE 31

Two non-orthogonal 6-bit patterns

31

  • Perfectly stationary and stable
  • Some spurious β€œfake-memory” local minima..

– But every stored pattern has a β€œbowl”
– Fewer spurious minima than for the orthogonal case

slide-32
SLIDE 32

Three non-orthogonal 6-bit patterns

32

  • Note: Cannot have 3 or more orthogonal 6-bit patterns..
  • Patterns are perfectly stationary and stable (K > 0.14N)
  • Some spurious β€œfake-memory” local minima..

– But every stored pattern has a β€œbowl”
– Fewer spurious minima than for the orthogonal 2-pattern case

slide-33
SLIDE 33

Four non-orthogonal 6-bit patterns

33

  • Patterns are perfectly stationary and stable for K > 0.14N
  • Fewer spurious minima than for the orthogonal 2-pattern

case

– Most fake-looking memories are in fact ghosts..

slide-34
SLIDE 34

Six non-orthogonal 6-bit patterns

34

  • Breakdown largely due to interference from β€œghosts”
  • But patterns are stationary, and often stable

– For K >> 0.14N

slide-35
SLIDE 35

More visualization..

  • Let’s inspect a few 8-bit patterns

– Keeping in mind that the Karnaugh map is now a 4-dimensional tesseract

35

slide-36
SLIDE 36

One 8-bit pattern

36

  • It’s actually cleanly stored, but there are a few spurious minima

slide-37
SLIDE 37

Two orthogonal 8-bit patterns

37

  • Both have regions of attraction
  • Some spurious minima
slide-38
SLIDE 38

Two non-orthogonal 8-bit patterns

38

  • Actually have fewer spurious minima

– Not obvious from visualization..

slide-39
SLIDE 39

Four orthogonal 8-bit patterns

39

  • Successfully stored
slide-40
SLIDE 40

Four non-orthogonal 8-bit patterns

40

  • Stored with interference from ghosts..
slide-41
SLIDE 41

Eight orthogonal 8-bit patterns

41

  • Wipeout
slide-42
SLIDE 42

Eight non-orthogonal 8-bit patterns

42

  • Nothing stored

– Neither stationary nor stable

slide-43
SLIDE 43

Making sense of the behavior

  • It seems possible to store $K > 0.14N$ patterns
– i.e., obtain a weight matrix $\mathbf{W}$ such that $K > 0.14N$ patterns are stationary
– Possible to make more than $0.14N$ patterns at least 1-bit stable

  • So what was Hopfield talking about?
  • Non-orthogonal patterns are easier to remember
– I.e., patterns that are closer are easier to remember than patterns that are farther!!

  • Can we attempt to get greater control on the process than Hebbian learning gives us?

43

slide-44
SLIDE 44

Bold Claim

  • I can always store (up to) $N$ orthogonal patterns such that they are stationary!

– Although not necessarily stable

  • Why?

44

slide-45
SLIDE 45

β€œTraining” the network

  • How do we make the network store a specific

pattern or set of patterns?

– Hebbian learning
– Geometric approach
– Optimization

  • Secondary question

– How many patterns can we store?

45

slide-46
SLIDE 46

A minor adjustment

  • Note: the behavior of $E(\mathbf{y}) = \mathbf{y}^T\mathbf{W}\mathbf{y}$ with

$\mathbf{W} = \mathbf{Y}\mathbf{Y}^T - N_p \mathbf{I}$

  • is identical to the behavior with

$\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$

  • Since

$\mathbf{y}^T\big(\mathbf{Y}\mathbf{Y}^T - N_p \mathbf{I}\big)\mathbf{y} = \mathbf{y}^T\mathbf{Y}\mathbf{Y}^T\mathbf{y} - N N_p$

  • But $\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$ is easier to analyze. Hence in the following slides we will use $\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$

46

The energy landscape only differs by an additive constant; gradients and locations of minima remain the same. Both matrices have the same eigenvectors. NOTE: $\mathbf{Y}\mathbf{Y}^T$ is a positive semidefinite matrix.
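A small numeric check of this equivalence (a sketch; the sizes and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
N, K = 8, 3
Y = rng.choice([-1.0, 1.0], size=(N, K))    # K patterns as columns
W1 = Y @ Y.T - K * np.eye(N)                # Hebbian form
W2 = Y @ Y.T                                # simplified form

y = rng.choice([-1.0, 1.0], size=N)         # any +/-1 state
e1 = -0.5 * y @ W1 @ y
e2 = -0.5 * y @ W2 @ y
print(e1 - e2)                              # always 0.5*K*N = 12.0, independent of y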

slide-49
SLIDE 49

Consider the energy function

  • Reinstating the bias term for completeness’ sake
– Remember that we don’t actually use it in a Hopfield net

$E = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

This is a quadratic! For Hebbian learning, $\mathbf{W}$ is positive semidefinite, and $E$ is convex.

49


slide-51
SLIDE 51

The energy function

  • 𝐹 is a convex quadratic

$E = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

51

slide-52
SLIDE 52

The energy function

  • 𝐹 is a convex quadratic

– Shown from above (assuming 0 bias)

  • But components of $\mathbf{y}$ can only take values $\pm 1$
– I.e., $\mathbf{y}$ lies on the corners of the unit hypercube

$E = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

52


slide-54
SLIDE 54

The energy function

  • The stored values of $\mathbf{y}$ are the ones where all adjacent corners are higher on the quadratic
– Hebbian learning attempts to make the quadratic steep in the vicinity of stored patterns

$E = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

[Figure: quadratic energy surface with stored patterns marked]

54

slide-55
SLIDE 55

Patterns you can store

  • Ideally must be maximally separated on the hypercube
– The number of patterns we can store depends on the actual distance between the patterns

[Figure: hypercube with stored patterns and their ghosts (negations) marked]

55

slide-56
SLIDE 56

Storing patterns

  • A pattern $\mathbf{y}_p$ is stored if:
– $\operatorname{sign}(\mathbf{W}\mathbf{y}_p) = \mathbf{y}_p$ for all target patterns

  • Note: for binary vectors $\operatorname{sign}(\mathbf{y})$ is a projection
– Projects $\mathbf{y}$ onto the nearest corner of the hypercube
– It β€œquantizes” the space into orthants

56

slide-57
SLIDE 57

Storing patterns

  • A pattern $\mathbf{y}_p$ is stored if:
– $\operatorname{sign}(\mathbf{W}\mathbf{y}_p) = \mathbf{y}_p$ for all target patterns

  • Training: Design $\mathbf{W}$ such that this holds
  • Simple solution: $\mathbf{y}_p$ is an eigenvector of $\mathbf{W}$
– And the corresponding eigenvalue is positive: $\mathbf{W}\mathbf{y}_p = \lambda\mathbf{y}_p$
– More generally, $\operatorname{orthant}(\mathbf{W}\mathbf{y}_p) = \operatorname{orthant}(\mathbf{y}_p)$

  • How many such $\mathbf{y}_p$ can we have?

57

slide-58
SLIDE 58

Only N patterns?

  • Patterns that differ in $N/2$ bits are orthogonal
  • You can have no more than $N$ orthogonal vectors in an $N$-dimensional space

59

[Figure: 2-D example with orthogonal patterns (1,1) and (1,βˆ’1)]

slide-59
SLIDE 59

Another random fact that should interest you

  • The eigenvectors of any symmetric matrix $\mathbf{W}$ are orthogonal
  • The eigenvalues may be positive or negative

60

slide-60
SLIDE 60

Storing more than one pattern

  • Requirement: Given $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_P$
– Design $\mathbf{W}$ such that
  • $\operatorname{sign}(\mathbf{W}\mathbf{y}_p) = \mathbf{y}_p$ for all target patterns
  • There are no other binary vectors for which this holds

  • What is the largest number of patterns that can be stored?

61

slide-61
SLIDE 61

Storing $K$ orthogonal patterns

  • Simple solution: Design $\mathbf{W}$ such that $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_K$ are the eigenvectors of $\mathbf{W}$
– Let $\mathbf{Y} = [\mathbf{y}_1\; \mathbf{y}_2\; \ldots\; \mathbf{y}_K]$ and $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
– $\lambda_1, \ldots, \lambda_K$ are positive
– For $\lambda_1 = \lambda_2 = \cdots = \lambda_K = 1$ this is exactly the Hebbian rule

  • The patterns are provably stationary

62
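A sketch of this construction (the function name is mine; with unit eigenvalues it reduces to the Hebbian outer-product rule):

import numpy as np

def eigen_store(Y, eigenvalues=None):
    # Y: (N, K) matrix whose columns are orthogonal +/-1 patterns.
    # Returns W = Y diag(lambda) Y^T, with the patterns as eigenvectors.
    Y = np.asarray(Y, dtype=float)
    lam = np.ones(Y.shape[1]) if eigenvalues is None else np.asarray(eigenvalues)
    return Y @ np.diag(lam) @ Y.T

# Stationarity check: sign(W y_p) == y_p for each stored column.
Y = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
W = eigen_store(Y)
print(np.all(np.sign(W @ Y) == Y))   # True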

slide-62
SLIDE 62

Hebbian rule

  • In reality
– Let $\mathbf{Y} = [\mathbf{y}_1\; \ldots\; \mathbf{y}_K\; \mathbf{r}_{K+1}\; \ldots\; \mathbf{r}_N]$ and $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
– $\mathbf{r}_{K+1}, \ldots, \mathbf{r}_N$ are orthogonal to $\mathbf{y}_1, \ldots, \mathbf{y}_K$
– $\lambda_1 = \lambda_2 = \cdots = \lambda_K = 1$
– $\lambda_{K+1}, \ldots, \lambda_N = 0$

  • All patterns orthogonal to $\mathbf{y}_1, \ldots, \mathbf{y}_K$ are also stationary
– Although not stable

63

slide-63
SLIDE 63

Storing $N$ orthogonal patterns

  • When we have $N$ orthogonal (or near-orthogonal) patterns $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N$
– $\mathbf{Y} = [\mathbf{y}_1\; \mathbf{y}_2\; \ldots\; \mathbf{y}_N]$ and $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
– $\lambda_1 = \lambda_2 = \cdots = \lambda_N = 1$

  • The eigenvectors of $\mathbf{W}$ span the space
  • Also, for any $\mathbf{y}_l$: $\mathbf{W}\mathbf{y}_l = \mathbf{y}_l$

64

slide-64
SLIDE 64

Storing 𝑢 orthogonal patterns

  • The $N$ orthogonal patterns $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N$ span the space
  • Any pattern $\mathbf{y}$ can be written as

$\mathbf{y} = a_1\mathbf{y}_1 + a_2\mathbf{y}_2 + \cdots + a_N\mathbf{y}_N$

$\mathbf{W}\mathbf{y} = a_1\mathbf{W}\mathbf{y}_1 + a_2\mathbf{W}\mathbf{y}_2 + \cdots + a_N\mathbf{W}\mathbf{y}_N = a_1\mathbf{y}_1 + a_2\mathbf{y}_2 + \cdots + a_N\mathbf{y}_N = \mathbf{y}$

  • All patterns are stable
– Remembers everything
– Completely useless network

65

slide-65
SLIDE 65

Storing K orthogonal patterns

  • Even if we store fewer than $N$ patterns
– Let $\mathbf{Y} = [\mathbf{y}_1\; \ldots\; \mathbf{y}_K\; \mathbf{r}_{K+1}\; \ldots\; \mathbf{r}_N]$ and $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
– $\mathbf{r}_{K+1}, \ldots, \mathbf{r}_N$ are orthogonal to $\mathbf{y}_1, \ldots, \mathbf{y}_K$
– $\lambda_1 = \lambda_2 = \cdots = \lambda_K = 1$
– $\lambda_{K+1}, \ldots, \lambda_N = 0$

  • All patterns orthogonal to $\mathbf{y}_1, \ldots, \mathbf{y}_K$ are stationary
  • Any pattern that is entirely in the subspace spanned by $\mathbf{y}_1, \ldots, \mathbf{y}_K$ is also stable (same logic as earlier)
  • Only patterns that are partially in the subspace spanned by $\mathbf{y}_1, \ldots, \mathbf{y}_K$ are unstable
– They get projected onto the subspace spanned by $\mathbf{y}_1, \ldots, \mathbf{y}_K$

66

slide-66
SLIDE 66

Problem with Hebbian Rule

  • Even if we store fewer than $N$ patterns
– Let $\mathbf{Y} = [\mathbf{y}_1\; \ldots\; \mathbf{y}_K\; \mathbf{r}_{K+1}\; \ldots\; \mathbf{r}_N]$ and $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
– $\mathbf{r}_{K+1}, \ldots, \mathbf{r}_N$ are orthogonal to $\mathbf{y}_1, \ldots, \mathbf{y}_K$
– $\lambda_1 = \lambda_2 = \cdots = \lambda_K = 1$

  • Problems arise because the eigenvalues are all 1.0
– This ensures stationarity of vectors in the subspace
– What if we get rid of this requirement?

67

slide-67
SLIDE 67

Hebbian rule and general (non-orthogonal) vectors

$w_{ji} = \sum_{p\in\{p\}} y_i^p y_j^p$

  • What happens when the patterns are not orthogonal?
  • What happens when the patterns are presented more than once?
– Different patterns presented different numbers of times
– Equivalent to having unequal eigenvalues…

  • Can we predict the evolution of any vector $\mathbf{y}$?
– Hint: Lanczos iterations

  • Can write $\mathbf{Y}_P = \mathbf{Y}_{\text{ortho}}\mathbf{B}$, so $\mathbf{W} = \mathbf{Y}_{\text{ortho}}\mathbf{B}\Lambda\mathbf{B}^T\mathbf{Y}_{\text{ortho}}^T$

68

slide-68
SLIDE 68

The bottom line

  • With a network of $N$ units (i.e., $N$-bit patterns)
  • The maximum number of stable patterns is actually exponential in $N$
– McEliece and Posner, 1984
– E.g., when we had the Hebbian net with $N$ orthogonal base patterns, all patterns are stable

  • For a specific set of $K$ patterns, we can always build a network for which all $K$ patterns are stable provided $K \le N$
– Abu-Mostafa and St. Jacques, 1985

  • For large $N$, the upper bound on $K$ is actually $N / 4\log N$
– McEliece et al., 1987
– But this may come with many β€œparasitic” memories

69

Can we do something about this? How do we find this network?


slide-71
SLIDE 71

A different tack

  • How do we make the network store a specific

pattern or set of patterns?

– Hebbian learning
– Geometric approach
– Optimization

  • Secondary question

– How many patterns can we store?

72

slide-72
SLIDE 72

Consider the energy function

  • This must be maximally low for target patterns
  • Must be maximally high for all other patterns
– So that they are unstable and evolve into one of the target patterns

$E = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$
73

slide-73
SLIDE 73

Alternate Approach to Estimating the Network

  • Estimate $\mathbf{W}$ (and $\mathbf{b}$) such that
– $E$ is minimized for $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_P$
– $E$ is maximized for all other $\mathbf{y}$

  • Caveat: It is unrealistic to expect to store more than $N$ patterns, but can we make those $N$ patterns memorable?

$E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

74

slide-74
SLIDE 74

Optimizing W (and b)

  • Minimize the total energy of target patterns
– Problem with this?

$E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}$

$\widehat{\mathbf{W}} = \underset{\mathbf{W}}{\operatorname{argmin}} \sum_{\mathbf{y}\in\mathbf{Y}_P} E(\mathbf{y})$

The bias can be captured by another fixed-value component

75

slide-75
SLIDE 75

Optimizing W

  • Minimize the total energy of target patterns
  • Maximize the total energy of all non-target patterns

$E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}$

$\widehat{\mathbf{W}} = \underset{\mathbf{W}}{\operatorname{argmin}} \sum_{\mathbf{y}\in\mathbf{Y}_P} E(\mathbf{y}) - \sum_{\mathbf{y}\notin\mathbf{Y}_P} E(\mathbf{y})$

76

slide-76
SLIDE 76

Optimizing W

  • Simple gradient descent:

$E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}$

$\widehat{\mathbf{W}} = \underset{\mathbf{W}}{\operatorname{argmin}} \sum_{\mathbf{y}\in\mathbf{Y}_P} E(\mathbf{y}) - \sum_{\mathbf{y}\notin\mathbf{Y}_P} E(\mathbf{y})$

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$

77

slide-77
SLIDE 77

Optimizing W

  • Can β€œemphasize” the importance of a pattern by repeating it
– More repetitions β†’ greater emphasis

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$

78

slide-78
SLIDE 78

Optimizing W

  • Can β€œemphasize” the importance of a pattern by repeating it
– More repetitions β†’ greater emphasis

  • How many of these?
– Do we need to include all of them?
– Are all equally important?

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$

79

slide-79
SLIDE 79

The training again..

  • Note the energy contour of a Hopfield network for any weight matrix $\mathbf{W}$

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$

80

[Figure: energy vs. state; the bowls will all actually be quadratic]

slide-80
SLIDE 80

The training again

  • The first term tries to minimize the energy at target patterns
– Make them local minima
– Emphasize more β€œimportant” memories by repeating them more frequently

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$

81

[Figure: energy vs. state, with target patterns marked]

slide-81
SLIDE 81

The negative class

  • The second term tries to β€œraise” all non-target patterns
– Do we need to raise everything?

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$

82

[Figure: energy vs. state]

slide-82
SLIDE 82

Option 1: Focus on the valleys

  • Focus on raising the valleys
– If you raise every valley, eventually they’ll all move up above the target patterns, and many will even vanish

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P,\ \mathbf{y}\in\text{valleys}} \mathbf{y}\mathbf{y}^T\Big)$

83

[Figure: energy vs. state]

slide-83
SLIDE 83

Identifying the valleys..

  • Problem: How do you identify the valleys for the current $\mathbf{W}$?

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P,\ \mathbf{y}\in\text{valleys}} \mathbf{y}\mathbf{y}^T\Big)$

84

[Figure: energy vs. state]

slide-84
SLIDE 84

Identifying the valleys..

85

[Figure: energy vs. state]

  • Initialize the network randomly and let it evolve

– It will settle in a valley

slide-85
SLIDE 85

Training the Hopfield network

  • Initialize $\mathbf{W}$
  • Compute the total outer product of all target patterns
– More important patterns are presented more frequently

  • Randomly initialize the network several times and let it evolve
– And settle at a valley

  • Compute the total outer product of the valley patterns
  • Update weights:

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P,\ \mathbf{y}\in\text{valleys}} \mathbf{y}\mathbf{y}^T\Big)$

86

slide-86
SLIDE 86

Training the Hopfield network: SGD version

  • Initialize $\mathbf{W}$
  • Do until convergence, satisfaction, or death from boredom:
– Sample a target pattern $\mathbf{y}_p$
  • Sampling frequency of a pattern must reflect its importance
– Randomly initialize the network and let it evolve
  • And settle at a valley $\mathbf{y}_v$
– Update weights:
  • $\mathbf{W} = \mathbf{W} + \eta\big(\mathbf{y}_p\mathbf{y}_p^T - \mathbf{y}_v\mathbf{y}_v^T\big)$

87

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P,\ \mathbf{y}\in\text{valleys}} \mathbf{y}\mathbf{y}^T\Big)$


slide-88
SLIDE 88

Which valleys?

89

[Figure: energy vs. state]

  • Should we randomly sample valleys?

– Are all valleys equally important?

slide-89
SLIDE 89

Which valleys?

90

[Figure: energy vs. state]

  • Should we randomly sample valleys?

– Are all valleys equally important?

  • Major requirement: memories must be stable

– They must be broad valleys

  • Spurious valleys in the neighborhood of

memories are more important to eliminate

slide-90
SLIDE 90

Identifying the valleys..

91

[Figure: energy vs. state]

  • Initialize the network at valid memories and let it evolve

– It will settle in a valley. If this is not the target pattern, raise it

slide-91
SLIDE 91

Training the Hopfield network

  • Initialize $\mathbf{W}$
  • Compute the total outer product of all target patterns
– More important patterns are presented more frequently

  • Initialize the network with each target pattern and let it evolve
– And settle at a valley

  • Compute the total outer product of the valley patterns
  • Update weights:

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P,\ \mathbf{y}\in\text{valleys}} \mathbf{y}\mathbf{y}^T\Big)$

92

slide-92
SLIDE 92

Training the Hopfield network: SGD version

  • Initialize $\mathbf{W}$
  • Do until convergence, satisfaction, or death from boredom:
– Sample a target pattern $\mathbf{y}_p$
  • Sampling frequency of a pattern must reflect its importance
– Initialize the network at $\mathbf{y}_p$ and let it evolve
  • And settle at a valley $\mathbf{y}_v$
– Update weights:
  • $\mathbf{W} = \mathbf{W} + \eta\big(\mathbf{y}_p\mathbf{y}_p^T - \mathbf{y}_v\mathbf{y}_v^T\big)$

93

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P,\ \mathbf{y}\in\text{valleys}} \mathbf{y}\mathbf{y}^T\Big)$

slide-93
SLIDE 93

A possible problem

94

[Figure: energy vs. state]

  • What if there’s another target pattern down-valley?
– Raising it will destroy a better-represented or stored pattern!

slide-94
SLIDE 94

A related issue

  • Really no need to raise the entire surface, or even every valley

95

[Figure: energy vs. state]

slide-95
SLIDE 95

A related issue

  • Really no need to raise the entire surface, or even every valley
  • Raise the neighborhood of each target memory
– Sufficient to make the memory a valley
– The broader the neighborhood considered, the broader the valley

96

[Figure: energy vs. state]

slide-96
SLIDE 96

Raising the neighborhood

97

[Figure: energy vs. state]

  • Starting from a target pattern, let the network

evolve only a few steps

– Try to raise the resultant location

  • Will raise the neighborhood of targets
  • Will avoid problem of down-valley targets
slide-97
SLIDE 97

Training the Hopfield network: SGD version

  • Initialize $\mathbf{W}$
  • Do until convergence, satisfaction, or death from boredom:
– Sample a target pattern $\mathbf{y}_p$
  • Sampling frequency of a pattern must reflect its importance
– Initialize the network at $\mathbf{y}_p$ and let it evolve only a few steps (2-4)
  • And arrive at a down-valley position $\mathbf{y}_d$
– Update weights:
  • $\mathbf{W} = \mathbf{W} + \eta\big(\mathbf{y}_p\mathbf{y}_p^T - \mathbf{y}_d\mathbf{y}_d^T\big)$

98

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P,\ \mathbf{y}\in\text{valleys}} \mathbf{y}\mathbf{y}^T\Big)$

slide-98
SLIDE 98

A probabilistic interpretation

  • For continuous $\mathbf{y}$, the energy of a pattern is a perfect analog to the negative log likelihood of a Gaussian density
  • For binary $\mathbf{y}$, it is the analog of the negative log likelihood of a Boltzmann distribution
– Minimizing energy maximizes log likelihood

$E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} \qquad P(\mathbf{y}) = C\exp\Big(\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}\Big)$

99

slide-99
SLIDE 99

The Boltzmann Distribution

  • $k$ is the Boltzmann constant
  • $T$ is the temperature of the system
  • The energy terms are like the log likelihood of a Boltzmann distribution at $T = 1$
– Derivation of this probability is in fact quite trivial…

100

$E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y} \qquad P(\mathbf{y}) = C\exp\Big(\frac{-E(\mathbf{y})}{kT}\Big) \qquad C = \frac{1}{\sum_{\mathbf{y}} \exp\big(-E(\mathbf{y})/kT\big)}$
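For completeness, the one-line derivation the slide alludes to, written out (a sketch):

$\log P(\mathbf{y}) = -\frac{E(\mathbf{y})}{kT} + \log C \quad\Longrightarrow\quad \arg\max_{\mathbf{y}} \log P(\mathbf{y}) = \arg\min_{\mathbf{y}} E(\mathbf{y})$

so at $T = 1$ (and $k = 1$) the states the network prefers are exactly the high-probability states of the Boltzmann distribution.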

slide-100
SLIDE 100

Continuing the Boltzmann analogy

  • The system probabilistically selects states with lower energy
– With infinitesimally slow cooling, at $T = 0$, it arrives at the global minimal state

101

$E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y} \qquad P(\mathbf{y}) = C\exp\Big(\frac{-E(\mathbf{y})}{kT}\Big) \qquad C = \frac{1}{\sum_{\mathbf{y}} \exp\big(-E(\mathbf{y})/kT\big)}$

slide-101
SLIDE 101

Spin glasses and Hopfield nets

  • Selecting the next state is akin to drawing a sample from the Boltzmann distribution at $T = 1$, in a universe where $k = 1$

102

[Figure: energy vs. state]

slide-102
SLIDE 102

Lookahead..

  • The Boltzmann analogy
  • Adding capacity to a Hopfield network

103

slide-103
SLIDE 103

Storing more than N patterns

  • How do we increase the capacity of the network?
– Store more patterns

104

slide-104
SLIDE 104

Expanding the network

  • Add a large number of neurons whose actual values you don’t care about!

[Figure: N visible neurons plus K additional neurons]

105

slide-105
SLIDE 105

Expanded Network

  • New capacity: ~$(N+K)$ patterns
– Although we only care about the pattern of the first $N$ neurons
– We’re interested in $N$-bit patterns

[Figure: N visible neurons plus K additional neurons]

106

slide-106
SLIDE 106

Introducing…

  • The Boltzmann machine…
  • Friday please…

[Figure: N visible neurons plus K additional neurons]

107