
Neural Networks: Hopfield Nets and Boltzmann Machines (Spring 2018)

Recap: A Hopfield network is a symmetric, loopy network of ±1 units; each unit takes the value +1 if its weighted input sum is greater than 0, and -1 otherwise.


  1. Only N patterns? • Patterns that differ in N/2 bits are orthogonal (e.g., (1,1) and (1,-1)) • You can have at most N orthogonal vectors in an N-dimensional space

  2. Another random fact that should interest you • The eigenvectors of any symmetric matrix W are orthogonal • The eigenvalues may be positive or negative

  3. Storing more than one pattern • Requirement: given y_1, y_2, …, y_P, design W such that – sign(W y_p) = y_p for all target patterns – There are no other binary vectors for which this holds • What is the largest number of patterns that can be stored?

  4. Storing K orthogonal patterns • Simple solution: design W such that y_1, y_2, …, y_K are the eigenvectors of W – Let Y = [y_1 y_2 … y_K] and W = Y Λ Yᵀ – λ_1, …, λ_K are positive – For λ_1 = λ_2 = … = λ_K = 1 this is exactly the Hebbian rule • The patterns are provably stationary (see the sketch below)
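
A minimal NumPy sketch (added for illustration, not from the deck) of storing orthogonal ±1 patterns as unit-eigenvalue eigenvectors of W; the patterns and sizes are hypothetical.

```python
import numpy as np

# Store K mutually orthogonal +/-1 patterns as eigenvectors of W with eigenvalue 1,
# i.e. W = Y Lambda Y^T with Lambda = I on the stored directions. With unit
# eigenvalues this is the Hebbian outer-product rule up to a 1/N scaling.

Y = np.array([[ 1,  1,  1,  1],
              [ 1, -1,  1, -1]], dtype=float).T   # shape (N, K); columns are orthogonal
N, K = Y.shape

W = (Y @ Y.T) / N                                 # each stored column has squared norm N

for k in range(K):
    y = Y[:, k]
    assert np.allclose(W @ y, y)                  # eigenvector with eigenvalue 1
    assert np.array_equal(np.sign(W @ y), y)      # hence stationary under sign(W y)
```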

  5. Hebbian rule • In reality – Let Y = [y_1 y_2 … y_K r_{K+1} r_{K+2} … r_N] and W = Y Λ Yᵀ – r_{K+1}, r_{K+2}, …, r_N are orthogonal to y_1, y_2, …, y_K – λ_1 = λ_2 = … = λ_K = 1 – λ_{K+1}, …, λ_N = 0 • All patterns orthogonal to y_1, y_2, …, y_K are also stationary – Although not stable

  6. Storing N orthogonal patterns • When we have N orthogonal (or near-orthogonal) patterns y_1, y_2, …, y_N – Y = [y_1 y_2 … y_N] and W = Y Λ Yᵀ – λ_1 = λ_2 = … = λ_N = 1 • The eigenvectors of W span the space • Also, for any y_k, W y_k = y_k

  7. Storing N orthogonal patterns • The N orthogonal patterns y_1, y_2, …, y_N span the space • Any pattern y can be written as y = a_1 y_1 + a_2 y_2 + … + a_N y_N, so W y = a_1 W y_1 + a_2 W y_2 + … + a_N W y_N = a_1 y_1 + a_2 y_2 + … + a_N y_N = y • All patterns are stable – Remembers everything – Completely useless network

  8. Storing K orthogonal patterns • Even if we store fewer than N patterns – Let Y = [y_1 y_2 … y_K r_{K+1} r_{K+2} … r_N] and W = Y Λ Yᵀ – r_{K+1}, …, r_N are orthogonal to y_1, …, y_K – λ_1 = λ_2 = … = λ_K = 1 – λ_{K+1}, …, λ_N = 0 • All patterns orthogonal to y_1, …, y_K are stationary • Any pattern that lies entirely in the subspace spanned by y_1, …, y_K is also stable (same logic as earlier) • Only patterns that are partially in the subspace spanned by y_1, …, y_K are unstable – They get projected onto the subspace spanned by y_1, …, y_K

  9. Problem with the Hebbian rule • Even if we store fewer than N patterns – Let Y = [y_1 y_2 … y_K r_{K+1} r_{K+2} … r_N] and W = Y Λ Yᵀ – r_{K+1}, …, r_N are orthogonal to y_1, …, y_K – λ_1 = λ_2 = … = λ_K = 1 • Problems arise because the eigenvalues are all 1.0 – This ensures stationarity of vectors in the subspace – What if we get rid of this requirement?

  10. Hebbian rule and general (non-orthogonal) vectors: w_ji = Σ_{p∈{p}} y_j^p y_i^p • What happens when the patterns are not orthogonal? • What happens when the patterns are presented more than once? – Different patterns presented different numbers of times – Equivalent to having unequal eigenvalues • Can we predict the evolution of any vector y? – Hint: Lanczos iterations – Can write Y_P = Y_ortho B, so W = Y_ortho B Λ Bᵀ Y_orthoᵀ (a code sketch of the rule follows)
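
Below is a minimal NumPy sketch (added, not from the deck) of this outer-product rule for arbitrary, possibly repeated ±1 patterns; the patterns themselves are hypothetical.

```python
import numpy as np

# Hebbian outer-product rule w_ji = sum_p y_j^p y_i^p for +/-1 patterns.
# Repeating a pattern weights it more heavily (unequal effective eigenvalues).

def hebbian_weights(patterns):
    """patterns: array of shape (P, N) with +/-1 entries."""
    P, N = patterns.shape
    W = patterns.T @ patterns / N       # sum of outer products, scaled by 1/N
    np.fill_diagonal(W, 0.0)            # no self-connections
    return W

patterns = np.array([[ 1, -1,  1, -1,  1],
                     [ 1,  1, -1, -1,  1],
                     [ 1,  1, -1, -1,  1]])   # second pattern presented twice
W = hebbian_weights(patterns)
print(np.sign(W @ patterns[0]))         # recall may fail when patterns are non-orthogonal
```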

  11. The bottom line • With a network of N units (i.e., N-bit patterns): – The maximum number of stable patterns is actually exponential in N (McEliece and Posner, '84) • E.g., with the Hebbian net built from N orthogonal base patterns, all patterns are stable – For a specific set of K patterns, we can always build a network for which all K patterns are stable, provided K ≤ N (Abu-Mostafa and St. Jacques, '85) – For large N, the upper bound on K is actually N/(4 log N) (McEliece et al., '87) • But this may come with many "parasitic" memories

  12. The bottom line (same content as item 11, with an overlaid question) • How do we find this network?

  13. The bottom line (same content as item 11, with overlaid questions) • How do we find this network? • Can we do something about this?

  14. A different tack • How do we make the network store a specific pattern or set of patterns? – Hebbian learning – Geometric approach – Optimization • Secondary question: How many patterns can we store?

  15. Consider the energy function E(y) = -½ yᵀWy - bᵀy (see the sketch below) • This must be maximally low for target patterns • It must be maximally high for all other patterns – So that they are unstable and evolve into one of the target patterns
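
A minimal helper (added, not from the deck) for computing this energy; the weights and state are hypothetical.

```python
import numpy as np

# Hopfield energy E(y) = -1/2 y^T W y - b^T y for a +/-1 state vector y.

def energy(W, b, y):
    return -0.5 * y @ W @ y - b @ y

rng = np.random.default_rng(0)
N = 6
W = rng.standard_normal((N, N))
W = (W + W.T) / 2                      # symmetric weights
np.fill_diagonal(W, 0.0)               # no self-connections
b = np.zeros(N)
y = rng.choice([-1.0, 1.0], size=N)
print(energy(W, b, y))
```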

  16. Alternate approach to estimating the network: E(y) = -½ yᵀWy - bᵀy • Estimate W (and b) such that – E is minimized for y_1, y_2, …, y_P – E is maximized for all other y • Caveat: it is unrealistic to expect to store more than N patterns, but can we make those N patterns memorable?

  17. Optimizing W (and b): E(y) = -½ yᵀWy, Ŵ = argmin_W Σ_{y∈Y_P} E(y) (the bias can be captured by another fixed-value component) • Minimize the total energy of the target patterns – Problem with this?

  18. Optimizing W: E(y) = -½ yᵀWy, Ŵ = argmin_W [ Σ_{y∈Y_P} E(y) - Σ_{y∉Y_P} E(y) ] • Minimize the total energy of the target patterns • Maximize the total energy of all non-target patterns

  19. Optimizing W: E(y) = -½ yᵀWy, Ŵ = argmin_W [ Σ_{y∈Y_P} E(y) - Σ_{y∉Y_P} E(y) ] • Simple gradient descent: W = W + η ( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P} y yᵀ ) (see the sketch below)
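
A literal-minded sketch of this update (added, not from the deck); it enumerates every non-target state, which is only feasible for tiny N and motivates the sampled versions on the later slides.

```python
import numpy as np
from itertools import product

# One gradient step W <- W + eta * (sum_{targets} y y^T - sum_{non-targets} y y^T).
# Enumerating all 2^N non-target states is exponential; later slides replace this
# sum with sampled "valleys".

def gradient_step(W, targets, eta=0.01):
    N = W.shape[0]
    target_set = {tuple(t) for t in targets}
    pos = sum(np.outer(y, y) for y in targets)
    neg = sum(np.outer(np.array(s), np.array(s))
              for s in product([-1, 1], repeat=N)
              if s not in target_set)
    W = W + eta * (pos - neg)
    np.fill_diagonal(W, 0.0)
    return W
```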

  20. Optimizing W: W = W + η ( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P} y yᵀ ) • We can "emphasize" the importance of a pattern by repeating it – More repetitions → greater emphasis

  21. Optimizing W: W = W + η ( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P} y yᵀ ) • We can "emphasize" the importance of a pattern by repeating it – More repetitions → greater emphasis • How many of these non-target patterns do we need? – Do we need to include all of them? – Are they all equally important?

  22. The training again… W = W + η ( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P} y yᵀ ) • Note the energy contour of a Hopfield network for any weight matrix W – the bowls will all actually be quadratic. (Figure: energy vs. state landscape)

  23. The training again: W = W + η ( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P} y yᵀ ) • The first term tries to minimize the energy at target patterns – Make them local minima – Emphasize more "important" memories by repeating them more frequently. (Figure: target patterns marked on the energy-vs-state landscape)

  24. The negative class: W = W + η ( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P} y yᵀ ) • The second term tries to "raise" all non-target patterns – Do we need to raise everything?

  25. Option 1: Focus on the valleys: W = W + η ( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P & y is a valley} y yᵀ ) • Focus on raising the valleys – If you raise every valley, eventually they'll all move up above the target patterns, and many will even vanish

  26. Identifying the valleys: W = W + η ( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P & y is a valley} y yᵀ ) • Problem: how do you identify the valleys for the current W?

  27. Identifying the valleys • Initialize the network randomly and let it evolve – It will settle in a valley

  28. Training the Hopfield network: W = W + η ( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P & y is a valley} y yᵀ ) • Initialize W • Compute the total outer product of all target patterns – More important patterns presented more frequently • Randomly initialize the network several times and let it evolve – And settle at a valley • Compute the total outer product of the valley patterns • Update the weights

  29. Training the Hopfield network, SGD version: W = W + η ( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P & y is a valley} y yᵀ ) • Initialize W • Do until convergence, satisfaction, or death from boredom: – Sample a target pattern y_p • The sampling frequency of a pattern must reflect its importance – Randomly initialize the network and let it evolve • And settle at a valley y_v – Update the weights: W = W + η ( y_p y_pᵀ - y_v y_vᵀ ) (see the sketch below)
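
A minimal NumPy sketch of this loop (added, not from the deck), assuming ±1 units, zero bias, and asynchronous updates until no unit flips; all sizes and hyperparameters are hypothetical.

```python
import numpy as np

def evolve(W, y, max_sweeps=100, rng=None):
    """Asynchronous Hopfield dynamics until no unit flips (a 'valley')."""
    rng = rng or np.random.default_rng()
    y = y.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(y)):
            new = 1.0 if W[i] @ y > 0 else -1.0
            if new != y[i]:
                y[i], changed = new, True
        if not changed:
            break
    return y

def train_sgd(targets, epochs=200, eta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    P, N = targets.shape
    W = np.zeros((N, N))
    for _ in range(epochs):
        y_p = targets[rng.integers(P)]                             # sample a target pattern
        y_v = evolve(W, rng.choice([-1.0, 1.0], size=N), rng=rng)  # settle at a valley
        W += eta * (np.outer(y_p, y_p) - np.outer(y_v, y_v))       # deepen target, raise valley
        np.fill_diagonal(W, 0.0)
    return W
```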

  30. Training the Hopfield network, SGD version (verbatim repeat of item 29)

  31. Which valleys? • Should we randomly sample valleys? – Are all valleys equally important?

  32. Which valleys? • Should we randomly sample valleys? – Are all valleys equally important? • Major requirement: memories must be stable – They must be broad valleys • Spurious valleys in the neighborhood of memories are the most important ones to eliminate

  33. Identifying the valleys • Initialize the network at valid memories and let it evolve – It will settle in a valley. If this is not the target pattern, raise it

  34. Training the Hopfield network: W = W + η ( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P & y is a valley} y yᵀ ) • Initialize W • Compute the total outer product of all target patterns – More important patterns presented more frequently • Initialize the network with each target pattern and let it evolve – And settle at a valley • Compute the total outer product of the valley patterns • Update the weights

  35. Training the Hopfield network, SGD version: W = W + η ( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P & y is a valley} y yᵀ ) • Initialize W • Do until convergence, satisfaction, or death from boredom: – Sample a target pattern y_p • The sampling frequency of a pattern must reflect its importance – Initialize the network at y_p and let it evolve • And settle at a valley y_v – Update the weights: W = W + η ( y_p y_pᵀ - y_v y_vᵀ )

  36. A possible problem • What if there's another target pattern down-valley? – Raising it will destroy a better-represented or stored pattern!

  37. A related issue • There is really no need to raise the entire surface, or even every valley

  38. A related issue • Really no need to raise the entire surface, or even every valley • Raise the neighborhood of each target memory – Sufficient to make the memory a valley – The broader the neighborhood considered, the broader the valley

  39. Raising the neighborhood • Starting from a target pattern, let the network evolve only a few steps – Try to raise the resultant location • This will raise the neighborhood of the targets • And will avoid the problem of down-valley targets

  40. Training the Hopfield network, SGD version: W = W + η ( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P & y is a valley} y yᵀ ) • Initialize W • Do until convergence, satisfaction, or death from boredom: – Sample a target pattern y_p • The sampling frequency of a pattern must reflect its importance – Initialize the network at y_p and let it evolve a few steps (2-4) • And arrive at a down-valley position y_d – Update the weights: W = W + η ( y_p y_pᵀ - y_d y_dᵀ ) (see the sketch below)
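
A sketch of the few-step variant (added, not from the deck; it resembles contrastive divergence), reusing the dynamics from the earlier sketch but stopping after a handful of sweeps.

```python
import numpy as np

def evolve_steps(W, y, n_sweeps, rng):
    """Run only a few asynchronous sweeps from y (no convergence required)."""
    y = y.copy()
    for _ in range(n_sweeps):
        for i in rng.permutation(len(y)):
            y[i] = 1.0 if W[i] @ y > 0 else -1.0
    return y

def train_sgd_fewstep(targets, epochs=200, eta=0.01, n_sweeps=3, seed=0):
    rng = np.random.default_rng(seed)
    P, N = targets.shape
    W = np.zeros((N, N))
    for _ in range(epochs):
        y_p = targets[rng.integers(P)]
        y_d = evolve_steps(W, y_p.astype(float), n_sweeps, rng)   # nearby down-valley position
        W += eta * (np.outer(y_p, y_p) - np.outer(y_d, y_d))
        np.fill_diagonal(W, 0.0)
    return W
```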

  41. A probabilistic interpretation: E(y) = -½ yᵀWy, P(y) = C exp(-E(y)) = C exp(½ yᵀWy) • For continuous y, the energy of a pattern is a perfect analog of the negative log likelihood of a Gaussian density • For binary y, it is the analog of the negative log likelihood of a Boltzmann distribution – Minimizing energy maximizes log likelihood

  42. The Boltzmann distribution: E(y) = -½ yᵀWy - bᵀy, P(y) = C exp(-E(y)/kT), 1/C = Σ_y exp(-E(y)/kT) (see the sketch below) • k is the Boltzmann constant • T is the temperature of the system • The energy terms are like the log likelihood of a Boltzmann distribution at T = 1 – The derivation of this probability is in fact quite trivial
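
For a small network this distribution can be computed exactly; a minimal sketch (added, not from the deck, with k = 1 and hypothetical weights):

```python
import numpy as np
from itertools import product

# Enumerate all +/-1 states and compute P(y) = exp(-E(y)/T) / sum_y' exp(-E(y')/T).

def boltzmann_distribution(W, b, T=1.0):
    N = W.shape[0]
    states = np.array(list(product([-1.0, 1.0], repeat=N)))
    energies = np.array([-0.5 * y @ W @ y - b @ y for y in states])
    logits = -energies / T
    probs = np.exp(logits - logits.max())     # subtract max for numerical stability
    return states, probs / probs.sum()

rng = np.random.default_rng(1)
N = 4
W = rng.standard_normal((N, N))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
states, probs = boltzmann_distribution(W, np.zeros(N))
print(states[np.argmax(probs)], probs.max())  # the lowest-energy state is the most probable
```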

  43. Continuing the Boltzmann analogy: E(y) = -½ yᵀWy - bᵀy, P(y) = C exp(-E(y)/kT), 1/C = Σ_y exp(-E(y)/kT) • The system probabilistically selects states with lower energy – With infinitesimally slow cooling, at T = 0, it arrives at the globally minimal state

  44. Spin glasses and Hopfield nets • Selecting a next state is akin to drawing a sample from the Boltzmann distribution at T = 1, in a universe where k = 1. (Figure: energy vs. state landscape)

  45. Optimizing W: E(y) = -½ yᵀWy, Ŵ = argmin_W [ Σ_{y∈Y_P} E(y) - Σ_{y∉Y_P} E(y) ] • Simple gradient descent: W = W + η ( Σ_{y∈Y_P} α_y y yᵀ - Σ_{y∉Y_P} β_{E(y)} y yᵀ ) – α_y: more importance to more frequently presented memories – β_{E(y)}: more importance to more attractive spurious memories

  46. Optimizing W: E(y) = -½ yᵀWy, Ŵ = argmin_W [ Σ_{y∈Y_P} E(y) - Σ_{y∉Y_P} E(y) ] • Simple gradient descent: W = W + η ( Σ_{y∈Y_P} α_y y yᵀ - Σ_{y∉Y_P} β_{E(y)} y yᵀ ) – α_y: more importance to more frequently presented memories – β_{E(y)}: more importance to more attractive spurious memories • THIS LOOKS LIKE AN EXPECTATION!

  47. Optimizing W: E(y) = -½ yᵀWy, Ŵ = argmin_W [ Σ_{y∈Y_P} E(y) - Σ_{y∉Y_P} E(y) ] • Update rule: W = W + η ( Σ_{y∈Y_P} α_y y yᵀ - Σ_{y∉Y_P} β_{E(y)} y yᵀ ), i.e., W = W + η ( E_{y∼Y_P}[y yᵀ] - E_{y∼Y}[y yᵀ] ) • The natural distribution for the variables: the Boltzmann distribution

  48. Continuing on… • The Hopfield net as a Boltzmann distribution • Adding capacity to a Hopfield network – The Boltzmann machine

  49. Continuing on… (verbatim repeat of item 48)

  50. Storing more than N patterns • The memory capacity of an N-bit network is at most N stable patterns (not necessarily even stationary) – Abu-Mostafa and St. Jacques, 1985 • Although the "information capacity" is O(N³) • How do we increase the capacity of the network so that it can store more patterns?

  51. Expanding the network (N original neurons plus K additional neurons) • Add a large number of neurons whose actual values you don't care about!

  52. Expanded network (N original neurons plus K additional neurons) • New capacity: ~(N + K) patterns – Although we only care about the pattern of the first N neurons – We're interested in N-bit patterns

  53. Terminology (visible neurons, hidden neurons) • The neurons that store the actual patterns of interest: visible neurons • The neurons that only serve to increase the capacity, but whose actual values are not important: hidden neurons – These can be set to anything in order to store a visible pattern

  54. Training the network • For a given pattern of the visible neurons, there are any number of hidden patterns (2^K) • Which of these do we choose? – Ideally, the one that results in the lowest energy – But that's an exponential search space! • Solution: combinatorial optimization – Simulated annealing

  55. The patterns • In fact we could have multiple hidden patterns coupled with any visible pattern – These would be multiple stored patterns that all give the same visible output – How many do we permit? • Do we need to specify one or more particular hidden patterns? – How about all of them? – What do I mean by this bizarre statement?

  56. But first… • The Hopfield net as a distribution

  57. Revisiting thermodynamic phenomena • Is the system actually in a specific state at any time? • No – the state is actually continuously changing – Based on the temperature of the system • At higher temperatures, the state changes more rapidly • What is actually being characterized is the probability of the state – And the expected value of the state. (Figure: potential energy vs. state)

  58. The Helmholtz free energy of a system • A thermodynamic system at temperature T can exist in one of many states – Potentially infinitely many states – At any time, the probability of finding the system in state s at temperature T is P_T(s) • In each state s it has a potential energy E_s • The internal energy of the system, representing its capacity to do work, is the average: U_T = Σ_s P_T(s) E_s

  59. The Helmholtz free energy of a system • The capacity to do work is counteracted by the internal disorder of the system, i.e., its entropy: H_T = -Σ_s P_T(s) log P_T(s) • The Helmholtz free energy of the system measures the useful work derivable from it and combines the two terms: F_T = U_T - kT H_T = Σ_s P_T(s) E_s + kT Σ_s P_T(s) log P_T(s)

  60. The Helmholtz free energy of a system: F_T = Σ_s P_T(s) E_s + kT Σ_s P_T(s) log P_T(s) • A system held at a specific temperature anneals by varying the rate at which it visits the various states, so as to reduce the free energy, until a minimum free-energy state is achieved • The probability distribution of the states at steady state is known as the Boltzmann distribution

  61. The Helmholtz free energy of a system: F_T = Σ_s P_T(s) E_s + kT Σ_s P_T(s) log P_T(s) • Minimizing this w.r.t. P_T(s), we get P_T(s) = (1/Z) exp(-E_s / kT) – Also known as the Gibbs distribution – Z is a normalizing constant – Note the dependence on T – At T = 0, the system always remains in the lowest-energy configuration with probability 1 (a derivation sketch follows)
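
The slides leave the minimization implicit; here is a brief sketch (added, not from the deck), using a Lagrange multiplier for the constraint Σ_s P_T(s) = 1:

```latex
\begin{aligned}
\mathcal{L} &= \sum_s P_T(s)\,E_s \;+\; kT \sum_s P_T(s)\log P_T(s)
             \;+\; \lambda\Big(\sum_s P_T(s) - 1\Big)\\
0 = \frac{\partial \mathcal{L}}{\partial P_T(s)}
  &= E_s + kT\big(\log P_T(s) + 1\big) + \lambda
\;\;\Rightarrow\;\;
P_T(s) = \frac{1}{Z}\exp\!\Big(\!-\frac{E_s}{kT}\Big),
\quad Z = \sum_{s'} \exp\!\Big(\!-\frac{E_{s'}}{kT}\Big).
\end{aligned}
```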

  62. The energy of the network (visible neurons): E(S) = -Σ_{i<j} w_ij s_i s_j - Σ_i b_i s_i, P(S) = exp(-E(S)) / Σ_{S'} exp(-E(S')) • We can define the energy of the system as before • Since the neurons are stochastic, there is disorder or entropy (with T = 1) • The equilibrium probability distribution over states is the Boltzmann distribution at T = 1 – This is the probability of the different states that the network will wander over at equilibrium

  63. The Hopfield net is a distribution (visible neurons): E(S) = -Σ_{i<j} w_ij s_i s_j - Σ_i b_i s_i, P(S) = exp(-E(S)) / Σ_{S'} exp(-E(S')) • The stochastic Hopfield network models a probability distribution over states – Where a state is a binary string – Specifically, it models a Boltzmann distribution – The parameters of the model are the weights of the network • The probability that (at equilibrium) the network will be in any given state is P(S) – It is a generative model: it generates states according to P(S)

  64. The field at a single node • Let S and S′ be otherwise identical states that differ only in the i-th bit – S has the i-th bit = +1 and S′ has the i-th bit = -1 • P(S) = P(s_i = 1 | s_{j≠i}) P(s_{j≠i}) and P(S′) = P(s_i = -1 | s_{j≠i}) P(s_{j≠i}) • log P(S) - log P(S′) = log P(s_i = 1 | s_{j≠i}) - log P(s_i = -1 | s_{j≠i}) = log [ P(s_i = 1 | s_{j≠i}) / (1 - P(s_i = 1 | s_{j≠i})) ]

  65. The field at a single node • Let S and S′ be the states with the i-th bit in the +1 and -1 states, and log P(S) = -E(S) + C • E(S) = E_{not i} - ½ ( Σ_{j≠i} w_ji s_j + b_i ) and E(S′) = E_{not i} + ½ ( Σ_{j≠i} w_ji s_j + b_i ) • log P(S) - log P(S′) = E(S′) - E(S) = Σ_{j≠i} w_ji s_j + b_i

  66. The field at a single node • log [ P(s_i = 1 | s_{j≠i}) / (1 - P(s_i = 1 | s_{j≠i})) ] = Σ_{j≠i} w_ji s_j + b_i • Giving us P(s_i = 1 | s_{j≠i}) = 1 / (1 + exp( -(Σ_{j≠i} w_ji s_j + b_i) )) • The probability of any node taking the value 1, given the other node values, is a logistic (see the sketch below)
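
A one-function sketch of this conditional (added, not from the deck), assuming ±1 units and a weight matrix with zero diagonal:

```python
import numpy as np

# P(s_i = 1 | s_{j != i}) is a logistic of the local field z_i = sum_{j != i} w_ji s_j + b_i.

def p_unit_on(W, b, s, i):
    z_i = W[:, i] @ s - W[i, i] * s[i] + b[i]   # local field, excluding any self-term
    return 1.0 / (1.0 + np.exp(-z_i))
```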

  67. Redefining the network (visible neurons): z_i = Σ_j w_ji s_j + b_i, P(s_i = 1 | s_{j≠i}) = 1 / (1 + e^{-z_i}) • First try: redefine a regular Hopfield net as a stochastic system • Each neuron is now a stochastic unit with a binary state s_i, which can take the value 0 or 1 with a probability that depends on the local field – Note the slight change from Hopfield nets – Not actually necessary; only a matter of convenience

  68. The Hopfield net is a distribution (visible neurons): z_i = Σ_j w_ji s_j + b_i, P(s_i = 1 | s_{j≠i}) = 1 / (1 + e^{-z_i}) • The Hopfield net is a probability distribution over binary sequences – The Boltzmann distribution • The conditional distribution of individual bits in the sequence is a logistic

  69. Running the network (visible neurons): z_i = Σ_j w_ji s_j + b_i, P(s_i = 1 | s_{j≠i}) = 1 / (1 + e^{-z_i}) • Initialize the neurons • Cycle through the neurons and randomly set each neuron to 1 or -1 according to the probability given above – Gibbs sampling: fix N-1 variables and sample the remaining variable – As opposed to the energy-based update (mean-field approximation): run the test z_i > 0? • After many, many iterations (until "convergence"), sample the individual neurons (see the sketch below)
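
A minimal Gibbs-sampling loop for the stochastic network (added, not from the deck), with ±1 units and hypothetical hyperparameters:

```python
import numpy as np

def gibbs_sample(W, b, n_sweeps=2000, burn_in=1000, seed=0):
    """Cycle through units, resampling each from its logistic conditional."""
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    s = rng.choice([-1.0, 1.0], size=N)                  # random initialization
    samples = []
    for sweep in range(n_sweeps):
        for i in rng.permutation(N):
            z_i = W[:, i] @ s - W[i, i] * s[i] + b[i]    # local field without self-term
            p_on = 1.0 / (1.0 + np.exp(-z_i))
            s[i] = 1.0 if rng.random() < p_on else -1.0  # Gibbs step (not the z_i > 0 test)
        if sweep >= burn_in:                             # collect samples only after burn-in
            samples.append(s.copy())
    return np.array(samples)
```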
