Neural Networks: Hopfield Nets and Boltzmann Machines (Fall 2017)


slide-1
SLIDE 1

Neural Networks

Hopfield Nets and Boltzmann Machines Fall 2017

1

slide-2
SLIDE 2
Recap: Hopfield network

  • Symmetric loopy network
  • Each neuron is a perceptron with a +1/βˆ’1 output
  • Every neuron receives input from every other neuron
  • Every neuron outputs signals to every other neuron

$y_i = \Theta\Big(\sum_{j\neq i} w_{ji}\, y_j + b_i\Big), \qquad \Theta(z) = \begin{cases} +1 & \text{if } z > 0 \\ -1 & \text{if } z \le 0 \end{cases}$

2

slide-3
SLIDE 3

Recap: Hopfield network

  • At each time each neuron receives a β€œfield” $\sum_{j\neq i} w_{ji}\, y_j + b_i$
  • If the sign of the field matches its own sign, it does not respond
  • If the sign of the field opposes its own sign, it β€œflips” to match the sign of the field

$y_i = \Theta\Big(\sum_{j\neq i} w_{ji}\, y_j + b_i\Big), \qquad \Theta(z) = \begin{cases} +1 & \text{if } z > 0 \\ -1 & \text{if } z \le 0 \end{cases}$

3

slide-4
SLIDE 4

Recap: Energy of a Hopfield Network

$E = -\sum_{i,\, j<i} w_{ij}\, y_i y_j$

  • The system will evolve until the energy hits a local minimum
  • Update rule, not assuming node bias: $y_i = \Theta\Big(\sum_{j\neq i} w_{ji}\, y_j\Big)$, with $\Theta$ as defined above
  • In vector form, including a bias term (not used in Hopfield nets): $E = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

4

slide-5
SLIDE 5

Recap: Evolution

  • The network will evolve until it arrives at a

local minimum in the energy contour

[Figure: energy vs. state]

$E = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}$

5

slide-6
SLIDE 6

Recap: Content-addressable memory

  • Each of the minima is a β€œstored” pattern

– If the network is initialized close to a stored pattern, it will inevitably evolve to the pattern

  • This is a content-addressable memory

– Recall memory content from partial or corrupt values

  • Also called associative memory

[Figure: energy vs. state]

6

slide-7
SLIDE 7

Recap – Analogy: Spin Glasses

  • Magnetic dipoles
  • Each dipole tries to align itself to the local field

– In doing so it may flip

  • This will change fields at other dipoles

– Which may flip

  • Which changes the field at the current dipole…

7

slide-8
SLIDE 8

Recap – Analogy: Spin Glasses

  • The total potential energy of the system:

$E = C - \frac{1}{2}\sum_i x_i f(p_i) = C - \sum_i \sum_{j>i} \frac{r\, x_i x_j}{|p_i - p_j|^2} - \sum_i b_i x_i$

  • The system evolves to minimize the PE
– Dipoles stop flipping if any flips result in an increase of PE

Total field at current dipole:

$f(p_i) = \sum_{j\neq i} \frac{r\, x_j}{|p_i - p_j|^2} + b_i$

Response of current dipole:

$x_i = \begin{cases} x_i & \text{if } \operatorname{sign}\!\big(x_i f(p_i)\big) = 1 \\ -x_i & \text{otherwise} \end{cases}$

8

slide-9
SLIDE 9

Recap: Spin Glasses

  • The system stops at one of its stable configurations

– Where PE is a local minimum

  • Any small jitter from this stable configuration returns it to the stable

configuration

– I.e. the system remembers its stable state and returns to it

[Figure: energy vs. state]

9

slide-10
SLIDE 10

Recap: Hopfield net computation

  • Very simple
  • Updates can be done sequentially, or all at once
  • Convergence

$E = -\sum_i \sum_{j>i} w_{ji}\, y_j y_i$

does not change significantly any more

  • 1. Initialize network with initial pattern: $y_i(0) = x_i, \quad 0 \le i \le N-1$
  • 2. Iterate until convergence: $y_i(t+1) = \Theta\Big(\sum_{j\neq i} w_{ji}\, y_j(t)\Big), \quad 0 \le i \le N-1$

10
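A minimal sketch of this loop in NumPy (illustrative only: the function name evolve, the asynchronous sweep order, and the convergence test are my own choices, not part of the slides):

import numpy as np

def evolve(W, y, max_iters=100):
    # Asynchronous Hopfield updates: sweep the neurons in random order
    # until a full sweep produces no flips (convergence).
    y = y.copy()
    for _ in range(max_iters):
        changed = False
        for i in np.random.permutation(len(y)):
            field = W[i] @ y - W[i, i] * y[i]   # exclude the self-term
            y_new = 1.0 if field > 0 else -1.0  # Theta(z): +1 if z > 0, else -1
            if y_new != y[i]:
                y[i] = y_new
                changed = True
        if not changed:
            break
    return y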

slide-11
SLIDE 11

Examples: Content addressable memory

  • http://staff.itee.uq.edu.au/janetw/cmc/chapters/Hopfield/

11

slide-12
SLIDE 12

β€œTraining” the network

  • How do we make the network store a specific

pattern or set of patterns?

– Hebbian learning
– Geometric approach
– Optimization

  • Secondary question

– How many patterns can we store?

12

slide-13
SLIDE 13

Recap: Hebbian Learning to Store a Specific Pattern

  • For a single stored pattern, Hebbian learning

results in a network for which the target pattern is a global minimum

HEBBIAN LEARNING: π‘₯

π‘˜π‘— = π‘§π‘˜π‘§π‘—

1

  • 1
  • 1
  • 1

1 13

𝐗 = π³π‘žπ³π‘ž

π‘ˆ βˆ’ I

slide-14
SLIDE 14

Hebbian learning: Storing a 4-bit pattern

  • Left: Pattern stored. Right: Energy map
  • Stored pattern has lowest energy
  • Gradation of energy ensures stored pattern (or its ghost) is recalled

from everywhere

14

slide-15
SLIDE 15
Recap: Hebbian Learning to Store Multiple Patterns

  • $\{\mathbf{y}_p\}$ is the set of patterns to store
– Superscript $p$ represents the specific pattern
  • $N_p$ is the number of patterns to store

$w_{ji} = \sum_{p\in\{p\}} y_i^p y_j^p$

$\mathbf{W} = \sum_p \big(\mathbf{y}_p \mathbf{y}_p^T - \mathbf{I}\big) = \mathbf{Y}\mathbf{Y}^T - N_p \mathbf{I}$

[Figure: example 4-bit patterns]

15
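A short sketch of this rule in NumPy (hedged: the helper name and the patterns-by-N array layout are my own):

import numpy as np

def hebbian_weights(patterns):
    # patterns: (N_p, N) array of +/-1 patterns.
    # Returns W = Y Y^T - N_p I, which zeroes the self-connections.
    Y = np.asarray(patterns, dtype=float)
    n_patterns, n_units = Y.shape
    return Y.T @ Y - n_patterns * np.eye(n_units)

For example, hebbian_weights([[1, -1, 1, -1], [1, 1, -1, -1]]) stores two 4-bit patterns.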

slide-16
SLIDE 16

How many patterns can we store?

  • Hopfield: a network of $N$ neurons can store up to $0.14N$ patterns

16

slide-17
SLIDE 17
  • Consider that the network is in any stored state π‘§π‘žβ€²
  • At any node 𝑙 the field we obtain is

β„Žπ‘™

π‘žβ€² = ෍ π‘˜

𝑧𝑙

π‘žβ€² π‘§π‘˜ π‘žβ€²π‘§π‘˜ π‘žβ€² + ෍ π‘žβ‰ π‘žβ€²

෍

π‘˜

𝑧𝑙

π‘žπ‘§π‘˜ π‘ž π‘§π‘˜ π‘žβ€² = (𝑂 βˆ’ 1)𝑧𝑙 π‘žβ€² + ෍ π‘žβ‰ π‘žβ€²

෍

π‘˜

𝑧𝑙

π‘žπ‘§π‘˜ π‘žπ‘§π‘˜ π‘žβ€²

  • If the second β€œcrosstalk” term sums to less than 𝑂 βˆ’ 1, the symbol will not

flip

1

  • 1
  • 1
  • 1

1 17

π‘₯

π‘˜π‘— = ෍ π‘žβˆˆ{π‘ž}

𝑧𝑗

π‘žπ‘§π‘˜ π‘ž

Recap: Hebbian Learning to Store a Specific Pattern

slide-18
SLIDE 18

β„Žπ‘™

π‘žβ€² = ෍ π‘˜

𝑧𝑙

π‘žβ€² π‘§π‘˜ π‘žβ€²π‘§π‘˜ π‘žβ€² + ෍ π‘žβ‰ π‘žβ€²

෍

π‘˜

𝑧𝑙

π‘žπ‘§π‘˜ π‘ž π‘§π‘˜ π‘žβ€² = (𝑂 βˆ’ 1)𝑧𝑙 π‘žβ€² + ෍ π‘žβ‰ π‘žβ€²

෍

π‘˜

𝑧𝑙

π‘žπ‘§π‘˜ π‘žπ‘§π‘˜ π‘žβ€²

  • If 𝑧𝑙

π‘žβ€² Οƒπ‘žβ‰ π‘žβ€² Οƒπ‘˜ 𝑧𝑙 π‘žπ‘§π‘˜ π‘žπ‘§π‘˜ π‘žβ€² is positive, then Οƒπ‘žβ‰ π‘žβ€² Οƒπ‘˜ 𝑧𝑙 π‘žπ‘§π‘˜ π‘ž π‘§π‘˜ π‘žβ€² is the same

sign as 𝑧𝑙

π‘žβ€², and it will not flip

  • If we choose 𝑄 patterns at random, what is the probability that

𝑧𝑙

π‘žβ€² Οƒπ‘žβ‰ π‘žβ€² Οƒπ‘˜ 𝑧𝑙 π‘žπ‘§π‘˜ π‘žπ‘§π‘˜ π‘žβ€² will be positive for all symbols for all 𝑄 of them?

1

  • 1
  • 1
  • 1

1 18

π‘₯

π‘˜π‘— = ෍ π‘žβˆˆ{π‘ž}

𝑧𝑗

π‘žπ‘§π‘˜ π‘ž

Recap: Hebbian Learning to Store a Specific Pattern
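An empirical version of this stability question, as a sketch (function and parameter names are mine; 0.14N is the capacity figure quoted above):

import numpy as np

def fraction_stable_bits(N=100, K=14, trials=20, seed=0):
    # For K random patterns stored with W = Y Y^T - K I, measure how
    # often a stored bit's field keeps the bit's own sign (no flip).
    rng = np.random.default_rng(seed)
    stable, total = 0, 0
    for _ in range(trials):
        Y = rng.choice([-1.0, 1.0], size=(K, N))
        W = Y.T @ Y - K * np.eye(N)
        fields = Y @ W                      # field at every node, every pattern
        stable += np.sum(np.sign(fields) == Y)
        total += Y.size
    return stable / total

# fraction_stable_bits(N=100, K=14) is typically close to 1;
# it degrades as K grows past ~0.14 N.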

slide-19
SLIDE 19

How many patterns can we store?

  • Hopfield: a network of $N$ neurons can store up to $0.14N$ patterns
  • What does this really mean?
– Let’s look at some examples

19

slide-20
SLIDE 20

Hebbian learning: One 4-bit pattern

  • Left: Pattern stored. Right: Energy map
  • Note: Pattern is an energy well, but there are other local minima
– Where?
– Also note the β€œshadow” pattern

20

slide-21
SLIDE 21

Storing multiple patterns: Orthogonality

  • The maximum Hamming distance between two $N$-bit patterns is $N/2$
– Because any pattern $\mathbf{y}$ and its complement $-\mathbf{y}$ are equivalent for our purpose
  • Two patterns $\mathbf{y}_1$ and $\mathbf{y}_2$ that differ in $N/2$ bits are orthogonal
– Because $\mathbf{y}_1^T \mathbf{y}_2 = 0$
  • For $N = 2^M R$, where $R$ is an odd number, there are at most $2^M$ orthogonal binary patterns
– Others may be almost orthogonal

21
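A quick numeric illustration of the orthogonality claim (the values are my own choice):

import numpy as np

# Two 4-bit patterns that differ in exactly N/2 = 2 positions:
y1 = np.array([1,  1, -1, -1])
y2 = np.array([1, -1,  1, -1])   # differs from y1 in two of four bits
print(int(y1 @ y2))              # 0 -> orthogonal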

slide-22
SLIDE 22

Two orthogonal 4-bit patterns

  • Patterns are local minima (stationary and stable)

– No other local minima exist
– But patterns perfectly confusable for recall

22

slide-23
SLIDE 23

Two non-orthogonal 4-bit patterns

  • Patterns are local minima (stationary and stable)

– No other local minima exist
– Actual wells for patterns

  • Patterns may be perfectly recalled!
– Note $K > 0.14N$

23

slide-24
SLIDE 24

Three orthogonal 4-bit patterns

  • All patterns are local minima (stationary and

stable)

– But recall from perturbed patterns is random

24

slide-25
SLIDE 25

Three non-orthogonal 4-bit patterns

  • All patterns are local minima and recalled
– Note $K > 0.14N$
– Note some β€œghosts” ended up in the β€œwell” of other patterns

  • So one of the patterns has stronger recall than the other two

25

slide-26
SLIDE 26

Four orthogonal 4-bit patterns

  • All patterns are stationary, but none are stable

– Total wipe out

26

slide-27
SLIDE 27

Four nonorthogonal 4-bit patterns

  • Believe it or not, all patterns are stored for K = N!

– Only β€œcollisions” when the ghost of one pattern occurs next to another

  • [1 1 1 1] and its ghost are strong attractors (why?)

27

slide-28
SLIDE 28

How many patterns can we store?

  • Hopfield: a network of $N$ neurons can store up to $0.14N$ patterns

  • Apparently a fuzzy statement

– What does it really mean to say β€œstores” 0.14N patterns?

  • Stationary? Stable? No other local minima?
  • N=4 may not be a good case (N too small)

28

slide-29
SLIDE 29

A 6-bit pattern

  • Perfectly stationary and stable
  • But many spurious local minima..

– Which are β€œfake” memories

29

slide-30
SLIDE 30

Two orthogonal 6-bit patterns

  • Perfectly stationary and stable
  • Several spurious β€œfake-memory” local minima..

– Figure overstates the problem: it is actually a 3-D Karnaugh map

30

slide-31
SLIDE 31

Two non-orthogonal 6-bit patterns

31

  • Perfectly stationary and stable
  • Some spurious β€œfake-memory” local minima..

– But every stored pattern has a β€œbowl”
– Fewer spurious minima than for the orthogonal case

slide-32
SLIDE 32

Three non-orthogonal 6-bit patterns

32

  • Note: Cannot have 3 or more orthogonal 6-bit patterns..
  • Patterns are perfectly stationary and stable (K > 0.14N)
  • Some spurious β€œfake-memory” local minima..

– But every stored pattern has a β€œbowl”
– Fewer spurious minima than for the orthogonal 2-pattern case

slide-33
SLIDE 33

Four non-orthogonal 6-bit patterns

33

  • Patterns are perfectly stationary and stable for K > 0.14N
  • Fewer spurious minima than for the orthogonal 2-pattern

case

– Most fake-looking memories are in fact ghosts..

slide-34
SLIDE 34

Six non-orthogonal 6-bit patterns

34

  • Breakdown largely due to interference from β€œghosts”
  • But patterns are stationary, and often stable

– For K >> 0.14N

slide-35
SLIDE 35

More visualization..

  • Let’s inspect a few 8-bit patterns

– Keeping in mind that the Karnaugh map is now a 4-dimensional tesseract

35

slide-36
SLIDE 36

One 8-bit pattern

36

  • It’s actually cleanly stored, but there are a few spurious minima

slide-37
SLIDE 37

Two orthogonal 8-bit patterns

37

  • Both have regions of attraction
  • Some spurious minima
slide-38
SLIDE 38

Two non-orthogonal 8-bit patterns

38

  • Actually have fewer spurious minima

– Not obvious from visualization..

slide-39
SLIDE 39

Four orthogonal 8-bit patterns

39

  • Successfully stored
slide-40
SLIDE 40

Four non-orthogonal 8-bit patterns

40

  • Stored with interference from ghosts..
slide-41
SLIDE 41

Eight orthogonal 8-bit patterns

41

  • Wipeout
slide-42
SLIDE 42

Eight non-orthogonal 8-bit patterns

42

  • Nothing stored

– Neither stationary nor stable

slide-43
SLIDE 43

Making sense of the behavior

  • It seems possible to store $K > 0.14N$ patterns
– i.e., obtain a weight matrix $\mathbf{W}$ such that $K > 0.14N$ patterns are stationary
– Possible to make more than $0.14N$ patterns at least 1-bit stable

  • So what was Hopfield talking about?
  • Non-orthogonal patterns are easier to remember
– I.e., patterns that are closer are easier to remember than patterns that are farther!!

  • Can we attempt to get greater control on the process than Hebbian learning gives us?

43

slide-44
SLIDE 44

Bold Claim

  • I can always store (up to) $N$ orthogonal patterns such that they are stationary!

– Although not necessarily stable

  • Why?

44

slide-45
SLIDE 45

β€œTraining” the network

  • How do we make the network store a specific

pattern or set of patterns?

– Hebbian learning
– Geometric approach
– Optimization

  • Secondary question

– How many patterns can we store?

45

slide-46
SLIDE 46

A minor adjustment

  • Note: the behavior of $E(\mathbf{y}) = \mathbf{y}^T\mathbf{W}\mathbf{y}$ with

$\mathbf{W} = \mathbf{Y}\mathbf{Y}^T - N_p \mathbf{I}$

  • is identical to the behavior with

$\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$

  • Since

$\mathbf{y}^T\big(\mathbf{Y}\mathbf{Y}^T - N_p \mathbf{I}\big)\mathbf{y} = \mathbf{y}^T\mathbf{Y}\mathbf{Y}^T\mathbf{y} - N N_p$

  • But $\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$ is easier to analyze. Hence in the following slides we will use $\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$

46

The energy landscape only differs by an additive constant; gradients and locations of minima remain the same. Both matrices have the same eigenvectors. NOTE: $\mathbf{Y}\mathbf{Y}^T$ is a positive semidefinite matrix.
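A small numeric check of this equivalence (a sketch; the sizes and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
N, K = 8, 3
Y = rng.choice([-1.0, 1.0], size=(N, K))    # K patterns as columns
W1 = Y @ Y.T - K * np.eye(N)                # Hebbian form
W2 = Y @ Y.T                                # simplified form

y = rng.choice([-1.0, 1.0], size=N)         # any +/-1 state
e1 = -0.5 * y @ W1 @ y
e2 = -0.5 * y @ W2 @ y
print(e1 - e2)                              # always 0.5*K*N = 12.0, independent of y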

slide-49
SLIDE 49

Consider the energy function

  • Reinstating the bias term for completeness’ sake
– Remember that we don’t actually use it in a Hopfield net

$E = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

This is a quadratic! For Hebbian learning, $\mathbf{W}$ is positive semidefinite, and $E$ is convex.

49


slide-51
SLIDE 51

The energy function

  • 𝐹 is a convex quadratic

$E = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

51

slide-52
SLIDE 52

The energy function

  • 𝐹 is a convex quadratic

– Shown from above (assuming 0 bias)

  • But components of $\mathbf{y}$ can only take values $\pm 1$
– I.e., $\mathbf{y}$ lies on the corners of the unit hypercube

$E = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

52


slide-54
SLIDE 54

The energy function

  • The stored values of $\mathbf{y}$ are the ones where all adjacent corners are higher on the quadratic
– Hebbian learning attempts to make the quadratic steep in the vicinity of stored patterns

$E = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

[Figure: quadratic energy surface with stored patterns marked]

54

slide-55
SLIDE 55

Patterns you can store

  • Ideally must be maximally separated on the hypercube
– The number of patterns we can store depends on the actual distance between the patterns

[Figure: hypercube with stored patterns and their ghosts (negations) marked]

55

slide-56
SLIDE 56

Storing patterns

  • A pattern $\mathbf{y}_p$ is stored if:
– $\operatorname{sign}(\mathbf{W}\mathbf{y}_p) = \mathbf{y}_p$ for all target patterns

  • Note: for binary vectors $\operatorname{sign}(\mathbf{y})$ is a projection
– Projects $\mathbf{y}$ onto the nearest corner of the hypercube
– It β€œquantizes” the space into orthants

56

slide-57
SLIDE 57

Storing patterns

  • A pattern $\mathbf{y}_p$ is stored if:
– $\operatorname{sign}(\mathbf{W}\mathbf{y}_p) = \mathbf{y}_p$ for all target patterns

  • Training: Design $\mathbf{W}$ such that this holds
  • Simple solution: $\mathbf{y}_p$ is an eigenvector of $\mathbf{W}$
– And the corresponding eigenvalue is positive: $\mathbf{W}\mathbf{y}_p = \lambda\mathbf{y}_p$
– More generally, $\operatorname{orthant}(\mathbf{W}\mathbf{y}_p) = \operatorname{orthant}(\mathbf{y}_p)$

  • How many such $\mathbf{y}_p$ can we have?

57

slide-58
SLIDE 58

Only N patterns?

  • Patterns that differ in $N/2$ bits are orthogonal
  • You can have no more than $N$ orthogonal vectors in an $N$-dimensional space

59

[Figure: 2-D example with orthogonal patterns (1,1) and (1,βˆ’1)]

slide-59
SLIDE 59

Another random fact that should interest you

  • The eigenvectors of any symmetric matrix $\mathbf{W}$ are orthogonal
  • The eigenvalues may be positive or negative

60

slide-60
SLIDE 60

Storing more than one pattern

  • Requirement: Given $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_P$
– Design $\mathbf{W}$ such that
  • $\operatorname{sign}(\mathbf{W}\mathbf{y}_p) = \mathbf{y}_p$ for all target patterns
  • There are no other binary vectors for which this holds

  • What is the largest number of patterns that can be stored?

61

slide-61
SLIDE 61

Storing $K$ orthogonal patterns

  • Simple solution: Design $\mathbf{W}$ such that $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_K$ are the eigenvectors of $\mathbf{W}$
– Let $\mathbf{Y} = [\mathbf{y}_1\; \mathbf{y}_2\; \ldots\; \mathbf{y}_K]$ and $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
– $\lambda_1, \ldots, \lambda_K$ are positive
– For $\lambda_1 = \lambda_2 = \cdots = \lambda_K = 1$ this is exactly the Hebbian rule

  • The patterns are provably stationary

62
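A sketch of this construction (the function name is mine; with unit eigenvalues it reduces to the Hebbian outer-product rule):

import numpy as np

def eigen_store(Y, eigenvalues=None):
    # Y: (N, K) matrix whose columns are orthogonal +/-1 patterns.
    # Returns W = Y diag(lambda) Y^T, with the patterns as eigenvectors.
    Y = np.asarray(Y, dtype=float)
    lam = np.ones(Y.shape[1]) if eigenvalues is None else np.asarray(eigenvalues)
    return Y @ np.diag(lam) @ Y.T

# Stationarity check: sign(W y_p) == y_p for each stored column.
Y = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
W = eigen_store(Y)
print(np.all(np.sign(W @ Y) == Y))   # True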

slide-62
SLIDE 62

Hebbian rule

  • In reality
– Let $\mathbf{Y} = [\mathbf{y}_1\; \ldots\; \mathbf{y}_K\; \mathbf{r}_{K+1}\; \ldots\; \mathbf{r}_N]$ and $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
– $\mathbf{r}_{K+1}, \ldots, \mathbf{r}_N$ are orthogonal to $\mathbf{y}_1, \ldots, \mathbf{y}_K$
– $\lambda_1 = \lambda_2 = \cdots = \lambda_K = 1$
– $\lambda_{K+1}, \ldots, \lambda_N = 0$

  • All patterns orthogonal to $\mathbf{y}_1, \ldots, \mathbf{y}_K$ are also stationary
– Although not stable

63

slide-63
SLIDE 63

Storing $N$ orthogonal patterns

  • When we have $N$ orthogonal (or near-orthogonal) patterns $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N$
– $\mathbf{Y} = [\mathbf{y}_1\; \mathbf{y}_2\; \ldots\; \mathbf{y}_N]$ and $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
– $\lambda_1 = \lambda_2 = \cdots = \lambda_N = 1$

  • The eigenvectors of $\mathbf{W}$ span the space
  • Also, for any $\mathbf{y}_l$: $\mathbf{W}\mathbf{y}_l = \mathbf{y}_l$

64

slide-64
SLIDE 64

Storing 𝑢 orthogonal patterns

  • The $N$ orthogonal patterns $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N$ span the space
  • Any pattern $\mathbf{y}$ can be written as

$\mathbf{y} = a_1\mathbf{y}_1 + a_2\mathbf{y}_2 + \cdots + a_N\mathbf{y}_N$

$\mathbf{W}\mathbf{y} = a_1\mathbf{W}\mathbf{y}_1 + a_2\mathbf{W}\mathbf{y}_2 + \cdots + a_N\mathbf{W}\mathbf{y}_N = a_1\mathbf{y}_1 + a_2\mathbf{y}_2 + \cdots + a_N\mathbf{y}_N = \mathbf{y}$

  • All patterns are stable
– Remembers everything
– Completely useless network

65

slide-65
SLIDE 65

Storing K orthogonal patterns

  • Even if we store fewer than $N$ patterns
– Let $\mathbf{Y} = [\mathbf{y}_1\; \ldots\; \mathbf{y}_K\; \mathbf{r}_{K+1}\; \ldots\; \mathbf{r}_N]$ and $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
– $\mathbf{r}_{K+1}, \ldots, \mathbf{r}_N$ are orthogonal to $\mathbf{y}_1, \ldots, \mathbf{y}_K$
– $\lambda_1 = \lambda_2 = \cdots = \lambda_K = 1$
– $\lambda_{K+1}, \ldots, \lambda_N = 0$

  • All patterns orthogonal to $\mathbf{y}_1, \ldots, \mathbf{y}_K$ are stationary
  • Any pattern that is entirely in the subspace spanned by $\mathbf{y}_1, \ldots, \mathbf{y}_K$ is also stable (same logic as earlier)
  • Only patterns that are partially in the subspace spanned by $\mathbf{y}_1, \ldots, \mathbf{y}_K$ are unstable
– They get projected onto the subspace spanned by $\mathbf{y}_1, \ldots, \mathbf{y}_K$

66

slide-66
SLIDE 66

Problem with Hebbian Rule

  • Even if we store fewer than $N$ patterns
– Let $\mathbf{Y} = [\mathbf{y}_1\; \ldots\; \mathbf{y}_K\; \mathbf{r}_{K+1}\; \ldots\; \mathbf{r}_N]$ and $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
– $\mathbf{r}_{K+1}, \ldots, \mathbf{r}_N$ are orthogonal to $\mathbf{y}_1, \ldots, \mathbf{y}_K$
– $\lambda_1 = \lambda_2 = \cdots = \lambda_K = 1$

  • Problems arise because the eigenvalues are all 1.0
– This ensures stationarity of vectors in the subspace
– What if we get rid of this requirement?

67

slide-67
SLIDE 67

Hebbian rule and general (non-orthogonal) vectors

$w_{ji} = \sum_{p\in\{p\}} y_i^p y_j^p$

  • What happens when the patterns are not orthogonal?
  • What happens when the patterns are presented more than once?
– Different patterns presented different numbers of times
– Equivalent to having unequal eigenvalues…

  • Can we predict the evolution of any vector $\mathbf{y}$?
– Hint: Lanczos iterations

  • Can write $\mathbf{Y}_P = \mathbf{Y}_{\text{ortho}}\mathbf{B}$, so $\mathbf{W} = \mathbf{Y}_{\text{ortho}}\mathbf{B}\Lambda\mathbf{B}^T\mathbf{Y}_{\text{ortho}}^T$

68

slide-68
SLIDE 68

The bottom line

  • With a network of $N$ units (i.e., $N$-bit patterns)
  • The maximum number of stable patterns is actually exponential in $N$
– McEliece and Posner, 1984
– E.g., when we had the Hebbian net with $N$ orthogonal base patterns, all patterns are stable

  • For a specific set of $K$ patterns, we can always build a network for which all $K$ patterns are stable provided $K \le N$
– Abu-Mostafa and St. Jacques, 1985

  • For large $N$, the upper bound on $K$ is actually $N / 4\log N$
– McEliece et al., 1987
– But this may come with many β€œparasitic” memories

69

Can we do something about this? How do we find this network?


slide-71
SLIDE 71

A different tack

  • How do we make the network store a specific

pattern or set of patterns?

– Hebbian learning
– Geometric approach
– Optimization

  • Secondary question

– How many patterns can we store?

72

slide-72
SLIDE 72

Consider the energy function

  • This must be maximally low for target patterns
  • Must be maximally high for all other patterns
– So that they are unstable and evolve into one of the target patterns

$E = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$
73

slide-73
SLIDE 73

Alternate Approach to Estimating the Network

  • Estimate $\mathbf{W}$ (and $\mathbf{b}$) such that
– $E$ is minimized for $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_P$
– $E$ is maximized for all other $\mathbf{y}$

  • Caveat: It is unrealistic to expect to store more than $N$ patterns, but can we make those $N$ patterns memorable?

$E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

74

slide-74
SLIDE 74

Optimizing W (and b)

  • Minimize the total energy of target patterns
– Problem with this?

$E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}$

$\widehat{\mathbf{W}} = \underset{\mathbf{W}}{\operatorname{argmin}} \sum_{\mathbf{y}\in\mathbf{Y}_P} E(\mathbf{y})$

The bias can be captured by another fixed-value component

75

slide-75
SLIDE 75

Optimizing W

  • Minimize the total energy of target patterns
  • Maximize the total energy of all non-target patterns

$E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}$

$\widehat{\mathbf{W}} = \underset{\mathbf{W}}{\operatorname{argmin}} \sum_{\mathbf{y}\in\mathbf{Y}_P} E(\mathbf{y}) - \sum_{\mathbf{y}\notin\mathbf{Y}_P} E(\mathbf{y})$

76

slide-76
SLIDE 76

Optimizing W

  • Simple gradient descent:

$E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}$

$\widehat{\mathbf{W}} = \underset{\mathbf{W}}{\operatorname{argmin}} \sum_{\mathbf{y}\in\mathbf{Y}_P} E(\mathbf{y}) - \sum_{\mathbf{y}\notin\mathbf{Y}_P} E(\mathbf{y})$

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$

77

slide-77
SLIDE 77

Optimizing W

  • Can β€œemphasize” the importance of a pattern by repeating it
– More repetitions β†’ greater emphasis

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$

78

slide-78
SLIDE 78

Optimizing W

  • Can β€œemphasize” the importance of a pattern by repeating it
– More repetitions β†’ greater emphasis

  • How many of these?
– Do we need to include all of them?
– Are all equally important?

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$

79

slide-79
SLIDE 79

The training again..

  • Note the energy contour of a Hopfield network for any weight matrix $\mathbf{W}$

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$

80

[Figure: energy vs. state; the bowls will all actually be quadratic]

slide-80
SLIDE 80

The training again

  • The first term tries to minimize the energy at target patterns
– Make them local minima
– Emphasize more β€œimportant” memories by repeating them more frequently

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$

81

[Figure: energy vs. state, with target patterns marked]

slide-81
SLIDE 81

The negative class

  • The second term tries to β€œraise” all non-target patterns
– Do we need to raise everything?

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$

82

[Figure: energy vs. state]

slide-82
SLIDE 82

Option 1: Focus on the valleys

  • Focus on raising the valleys
– If you raise every valley, eventually they’ll all move up above the target patterns, and many will even vanish

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P,\ \mathbf{y}\in\text{valleys}} \mathbf{y}\mathbf{y}^T\Big)$

83

[Figure: energy vs. state]

slide-83
SLIDE 83

Identifying the valleys..

  • Problem: How do you identify the valleys for the current $\mathbf{W}$?

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P,\ \mathbf{y}\in\text{valleys}} \mathbf{y}\mathbf{y}^T\Big)$

84

[Figure: energy vs. state]

slide-84
SLIDE 84

Identifying the valleys..

85

[Figure: energy vs. state]

  • Initialize the network randomly and let it evolve

– It will settle in a valley

slide-85
SLIDE 85

Training the Hopfield network

  • Initialize $\mathbf{W}$
  • Compute the total outer product of all target patterns
– More important patterns are presented more frequently

  • Randomly initialize the network several times and let it evolve
– And settle at a valley

  • Compute the total outer product of the valley patterns
  • Update weights:

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P,\ \mathbf{y}\in\text{valleys}} \mathbf{y}\mathbf{y}^T\Big)$

86

slide-86
SLIDE 86

Training the Hopfield network: SGD version

  • Initialize $\mathbf{W}$
  • Do until convergence, satisfaction, or death from boredom:
– Sample a target pattern $\mathbf{y}_p$
  • Sampling frequency of a pattern must reflect its importance
– Randomly initialize the network and let it evolve
  • And settle at a valley $\mathbf{y}_v$
– Update weights:
  • $\mathbf{W} = \mathbf{W} + \eta\big(\mathbf{y}_p\mathbf{y}_p^T - \mathbf{y}_v\mathbf{y}_v^T\big)$

87

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P,\ \mathbf{y}\in\text{valleys}} \mathbf{y}\mathbf{y}^T\Big)$


slide-88
SLIDE 88

Which valleys?

89

[Figure: energy vs. state]

  • Should we randomly sample valleys?

– Are all valleys equally important?

slide-89
SLIDE 89

Which valleys?

90

[Figure: energy vs. state]

  • Should we randomly sample valleys?

– Are all valleys equally important?

  • Major requirement: memories must be stable

– They must be broad valleys

  • Spurious valleys in the neighborhood of

memories are more important to eliminate

slide-90
SLIDE 90

Identifying the valleys..

91

[Figure: energy vs. state]

  • Initialize the network at valid memories and let it evolve

– It will settle in a valley. If this is not the target pattern, raise it

slide-91
SLIDE 91

Training the Hopfield network

  • Initialize $\mathbf{W}$
  • Compute the total outer product of all target patterns
– More important patterns are presented more frequently

  • Initialize the network with each target pattern and let it evolve
– And settle at a valley

  • Compute the total outer product of the valley patterns
  • Update weights:

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P,\ \mathbf{y}\in\text{valleys}} \mathbf{y}\mathbf{y}^T\Big)$

92

slide-92
SLIDE 92

Training the Hopfield network: SGD version

  • Initialize $\mathbf{W}$
  • Do until convergence, satisfaction, or death from boredom:
– Sample a target pattern $\mathbf{y}_p$
  • Sampling frequency of a pattern must reflect its importance
– Initialize the network at $\mathbf{y}_p$ and let it evolve
  • And settle at a valley $\mathbf{y}_v$
– Update weights:
  • $\mathbf{W} = \mathbf{W} + \eta\big(\mathbf{y}_p\mathbf{y}_p^T - \mathbf{y}_v\mathbf{y}_v^T\big)$

93

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P,\ \mathbf{y}\in\text{valleys}} \mathbf{y}\mathbf{y}^T\Big)$

slide-93
SLIDE 93

A possible problem

94

[Figure: energy vs. state]

  • What if there’s another target pattern down-valley?
– Raising it will destroy a better-represented or stored pattern!

slide-94
SLIDE 94

A related issue

  • Really no need to raise the entire surface, or even every valley

95

[Figure: energy vs. state]

slide-95
SLIDE 95

A related issue

  • Really no need to raise the entire surface, or even every valley
  • Raise the neighborhood of each target memory
– Sufficient to make the memory a valley
– The broader the neighborhood considered, the broader the valley

96

[Figure: energy vs. state]

slide-96
SLIDE 96

Raising the neighborhood

97

[Figure: energy vs. state]

  • Starting from a target pattern, let the network

evolve only a few steps

– Try to raise the resultant location

  • Will raise the neighborhood of targets
  • Will avoid problem of down-valley targets
slide-97
SLIDE 97

Training the Hopfield network: SGD version

  • Initialize $\mathbf{W}$
  • Do until convergence, satisfaction, or death from boredom:
– Sample a target pattern $\mathbf{y}_p$
  • Sampling frequency of a pattern must reflect its importance
– Initialize the network at $\mathbf{y}_p$ and let it evolve only a few steps (2-4)
  • And arrive at a down-valley position $\mathbf{y}_d$
– Update weights:
  • $\mathbf{W} = \mathbf{W} + \eta\big(\mathbf{y}_p\mathbf{y}_p^T - \mathbf{y}_d\mathbf{y}_d^T\big)$

98

$\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P,\ \mathbf{y}\in\text{valleys}} \mathbf{y}\mathbf{y}^T\Big)$

slide-98
SLIDE 98

A probabilistic interpretation

  • For continuous $\mathbf{y}$, the energy of a pattern is a perfect analog to the negative log likelihood of a Gaussian density
  • For binary $\mathbf{y}$, it is the analog of the negative log likelihood of a Boltzmann distribution
– Minimizing energy maximizes log likelihood

$E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} \qquad P(\mathbf{y}) = C\exp\Big(\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}\Big)$

99

slide-99
SLIDE 99

The Boltzmann Distribution

  • $k$ is the Boltzmann constant
  • $T$ is the temperature of the system
  • The energy terms are like the log likelihood of a Boltzmann distribution at $T = 1$
– Derivation of this probability is in fact quite trivial…

100

$E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y} \qquad P(\mathbf{y}) = C\exp\Big(\frac{-E(\mathbf{y})}{kT}\Big) \qquad C = \frac{1}{\sum_{\mathbf{y}} \exp\big(-E(\mathbf{y})/kT\big)}$
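For completeness, the one-line derivation the slide alludes to, written out (a sketch):

$\log P(\mathbf{y}) = -\frac{E(\mathbf{y})}{kT} + \log C \quad\Longrightarrow\quad \arg\max_{\mathbf{y}} \log P(\mathbf{y}) = \arg\min_{\mathbf{y}} E(\mathbf{y})$

so at $T = 1$ (and $k = 1$) the states the network prefers are exactly the high-probability states of the Boltzmann distribution.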

slide-100
SLIDE 100

Continuing the Boltzmann analogy

  • The system probabilistically selects states with lower energy
– With infinitesimally slow cooling, at $T = 0$, it arrives at the global minimal state

101

$E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y} \qquad P(\mathbf{y}) = C\exp\Big(\frac{-E(\mathbf{y})}{kT}\Big) \qquad C = \frac{1}{\sum_{\mathbf{y}} \exp\big(-E(\mathbf{y})/kT\big)}$

slide-101
SLIDE 101

Spin glasses and Hopfield nets

  • Selecting the next state is akin to drawing a sample from the Boltzmann distribution at $T = 1$, in a universe where $k = 1$

102

[Figure: energy vs. state]

slide-102
SLIDE 102

Lookahead..

  • The Boltzmann analogy
  • Adding capacity to a Hopfield network

103

slide-103
SLIDE 103

Storing more than N patterns

  • How do we increase the capacity of the network?
– Store more patterns

104

slide-104
SLIDE 104

Expanding the network

  • Add a large number of neurons whose actual values you don’t care about!

[Figure: N visible neurons plus K additional neurons]

105

slide-105
SLIDE 105

Expanded Network

  • New capacity: ~$(N+K)$ patterns
– Although we only care about the pattern of the first $N$ neurons
– We’re interested in $N$-bit patterns

[Figure: N visible neurons plus K additional neurons]

106

slide-106
SLIDE 106

Introducing…

  • The Boltzmann machine…
  • Friday please…

[Figure: N visible neurons plus K additional neurons]

107