
Neural Networks: Hopfield Nets and Boltzmann Machines (Spring 2018)

Recap: the Hopfield network is a symmetric, loopy network of Β±1 units. Each unit updates to $+1$ when its incoming field $\sum_{j\neq i} w_{ji} y_j + b_i$ is greater than 0, and to $-1$ otherwise.


  1. Only N patterns?
  β€’ Patterns that differ in $N/2$ bits are orthogonal, e.g. $(1,1)$ and $(1,-1)$
  β€’ You can have at most $N$ orthogonal vectors in an $N$-dimensional space

  2. Another random fact that should interest you
  β€’ The eigenvectors of any symmetric matrix $\mathbf{W}$ are orthogonal
  β€’ The eigenvalues may be positive or negative

  3. Storing more than one pattern
  β€’ Requirement: given $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_P$, design $\mathbf{W}$ such that
    – $\mathrm{sign}(\mathbf{W}\mathbf{y}_p) = \mathbf{y}_p$ for all target patterns
    – There are no other binary vectors for which this holds
  β€’ What is the largest number of patterns that can be stored?

  4. Storing K orthogonal patterns
  β€’ Simple solution: design $\mathbf{W}$ such that $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_K$ are the eigenvectors of $\mathbf{W}$
    – Let $\mathbf{Y} = [\mathbf{y}_1\ \mathbf{y}_2 \cdots \mathbf{y}_K]$ and $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
    – $\lambda_1, \ldots, \lambda_K$ are positive
    – For $\lambda_1 = \lambda_2 = \cdots = \lambda_K = 1$ this is exactly the Hebbian rule
  β€’ The patterns are provably stationary (see the sketch below)
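A minimal NumPy sketch of this construction, assuming Β±1 patterns stored in the columns of Y; the unit-norm scaling and the toy 4-bit patterns are my own choices for illustration, not from the slides:

```python
import numpy as np

def build_weights(Y, lam=None):
    """Build W = Y_n diag(lam) Y_n^T from the +/-1 patterns in the columns of Y.

    Y   : N x K array whose columns are mutually orthogonal +/-1 patterns.
    lam : K eigenvalues; all ones recovers the Hebbian construction on the slide.
    """
    N, K = Y.shape
    if lam is None:
        lam = np.ones(K)
    Yn = Y / np.sqrt(N)              # unit-norm columns, so the y_k are true eigenvectors
    return Yn @ np.diag(lam) @ Yn.T

# Tiny check with two orthogonal 4-bit patterns (toy data)
Y = np.array([[1,  1],
              [1, -1],
              [1,  1],
              [1, -1]], dtype=float)
W = build_weights(Y)
for k in range(Y.shape[1]):
    assert np.array_equal(np.sign(W @ Y[:, k]), Y[:, k])   # stored patterns are stationary
```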

  5. Hebbian rule
  β€’ In reality
    – Let $\mathbf{Y} = [\mathbf{y}_1\ \mathbf{y}_2 \cdots \mathbf{y}_K\ \mathbf{r}_{K+1}\ \mathbf{r}_{K+2} \cdots \mathbf{r}_N]$ and $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
    – $\mathbf{r}_{K+1}, \ldots, \mathbf{r}_N$ are orthogonal to $\mathbf{y}_1, \ldots, \mathbf{y}_K$
    – $\lambda_1 = \lambda_2 = \cdots = \lambda_K = 1$
    – $\lambda_{K+1}, \ldots, \lambda_N = 0$
  β€’ All patterns orthogonal to $\mathbf{y}_1, \ldots, \mathbf{y}_K$ are also stationary
    – Although not stable

  6. Storing N orthogonal patterns
  β€’ When we have $N$ orthogonal (or near-orthogonal) patterns $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N$
    – $\mathbf{Y} = [\mathbf{y}_1\ \mathbf{y}_2 \cdots \mathbf{y}_N]$ and $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
    – $\lambda_1 = \lambda_2 = \cdots = \lambda_N = 1$
  β€’ The eigenvectors of $\mathbf{W}$ span the space
  β€’ Also, for any $\mathbf{y}_k$: $\mathbf{W}\mathbf{y}_k = \mathbf{y}_k$

  7. Storing N orthogonal patterns
  β€’ The $N$ orthogonal patterns $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N$ span the space
  β€’ Any pattern $\mathbf{y}$ can be written as $\mathbf{y} = a_1\mathbf{y}_1 + a_2\mathbf{y}_2 + \cdots + a_N\mathbf{y}_N$, so
    $\mathbf{W}\mathbf{y} = a_1\mathbf{W}\mathbf{y}_1 + a_2\mathbf{W}\mathbf{y}_2 + \cdots + a_N\mathbf{W}\mathbf{y}_N = a_1\mathbf{y}_1 + a_2\mathbf{y}_2 + \cdots + a_N\mathbf{y}_N = \mathbf{y}$
  β€’ All patterns are stable
    – Remembers everything
    – Completely useless network

  8. Storing K orthogonal patterns
  β€’ Even if we store fewer than $N$ patterns
    – Let $\mathbf{Y} = [\mathbf{y}_1 \cdots \mathbf{y}_K\ \mathbf{r}_{K+1} \cdots \mathbf{r}_N]$ and $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
    – $\mathbf{r}_{K+1}, \ldots, \mathbf{r}_N$ are orthogonal to $\mathbf{y}_1, \ldots, \mathbf{y}_K$
    – $\lambda_1 = \cdots = \lambda_K = 1$ and $\lambda_{K+1} = \cdots = \lambda_N = 0$
  β€’ All patterns orthogonal to $\mathbf{y}_1, \ldots, \mathbf{y}_K$ are stationary
  β€’ Any pattern that lies entirely in the subspace spanned by $\mathbf{y}_1, \ldots, \mathbf{y}_K$ is also stable (same logic as earlier)
  β€’ Only patterns that are partially in that subspace are unstable
    – They get projected onto the subspace spanned by $\mathbf{y}_1, \ldots, \mathbf{y}_K$

  9. Problem with Hebbian Rule
  β€’ Even if we store fewer than $N$ patterns
    – Let $\mathbf{Y} = [\mathbf{y}_1 \cdots \mathbf{y}_K\ \mathbf{r}_{K+1} \cdots \mathbf{r}_N]$ and $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
    – $\mathbf{r}_{K+1}, \ldots, \mathbf{r}_N$ are orthogonal to $\mathbf{y}_1, \ldots, \mathbf{y}_K$
    – $\lambda_1 = \lambda_2 = \cdots = \lambda_K = 1$
  β€’ Problems arise because the eigenvalues are all 1.0
    – This ensures stationarity of vectors in the subspace
    – What if we get rid of this requirement?

  10. Hebbian rule and general (non-orthogonal) vectors
  $w_{ji} = \sum_{p \in \{p\}} y_j^p\, y_i^p$  (a small code sketch of this rule follows)
  β€’ What happens when the patterns are not orthogonal?
  β€’ What happens when the patterns are presented more than once?
    – Different patterns presented different numbers of times
    – Equivalent to having unequal eigenvalues..
  β€’ Can we predict the evolution of any vector $\mathbf{y}$?
    – Hint: Lanczos iterations
    – Can write $\mathbf{Y}_P = \mathbf{Y}_{ortho}\mathbf{B}$, so $\mathbf{W} = \mathbf{Y}_{ortho}\mathbf{B}\Lambda\mathbf{B}^T\mathbf{Y}_{ortho}^T$
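A one-function sketch of the Hebbian outer-product rule above, assuming Β±1 patterns as NumPy rows; the 1/N scaling and the zeroed diagonal are common conventions rather than something stated on the slide:

```python
import numpy as np

def hebbian_weights(patterns):
    """Hebbian outer-product rule: w_ji = sum over patterns of y_j * y_i.

    patterns : P x N array of +/-1 patterns.  Listing a pattern several times
    weights it more heavily, which the slide notes is equivalent to giving it a
    larger eigenvalue.
    """
    P, N = patterns.shape
    W = patterns.T @ patterns / N      # 1/N scaling: a common convention, not from the slide
    np.fill_diagonal(W, 0)             # zeroing self-connections is also a conventional choice
    return W
```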

  11. The bottom line
  β€’ With a network of $N$ units (i.e. $N$-bit patterns)
  β€’ The maximum number of stable patterns is actually exponential in $N$
    – McEliece and Posner, 1984
    – E.g. with the Hebbian net built from $N$ orthogonal base patterns, all patterns are stable
  β€’ For a specific set of $K$ patterns, we can always build a network for which all $K$ patterns are stable provided $K \leq N$
    – Abu-Mostafa and St. Jacques, 1985
  β€’ For large $N$, the upper bound on $K$ is actually $N / (4\log N)$
    – McEliece et al., 1987
    – But this may come with many β€œparasitic” memories
  (A worked instance of the bound follows.)
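For a rough sense of scale, a back-of-the-envelope instance of that bound (my arithmetic; the McEliece et al. bound is usually stated with a natural logarithm):

\[
K_{\max} \approx \frac{N}{4\ln N}, \qquad N = 1000:\;\; K_{\max} \approx \frac{1000}{4 \times 6.91} \approx 36.
\]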

  12. The bottom line (continued)
  β€’ Same bounds as the previous slide, with the question: how do we find this network?

  13. The bottom line (continued)
  β€’ And a further question: can we do something about this?

  14. A different tack
  β€’ How do we make the network store a specific pattern or set of patterns?
    – Hebbian learning
    – Geometric approach
    – Optimization
  β€’ Secondary question
    – How many patterns can we store?

  15. Consider the energy function
  $E = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$
  β€’ This must be maximally low for target patterns
  β€’ Must be maximally high for all other patterns
    – So that they are unstable and evolve into one of the target patterns
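A one-function sketch of this energy, assuming NumPy arrays for y, W and b; the zero default for the bias is my choice:

```python
import numpy as np

def energy(y, W, b=None):
    """Hopfield energy E(y) = -1/2 y^T W y - b^T y for a +/-1 state vector y."""
    if b is None:
        b = np.zeros(len(y))
    return -0.5 * (y @ W @ y) - b @ y
```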

  16. Alternate Approach to Estimating the Network
  $E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$
  β€’ Estimate $\mathbf{W}$ (and $\mathbf{b}$) such that
    – $E$ is minimized for $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_P$
    – $E$ is maximized for all other $\mathbf{y}$
  β€’ Caveat: it is unrealistic to expect to store more than $N$ patterns, but can we make those $N$ patterns memorable?

  17. Optimizing W (and b)
  $E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}, \qquad \hat{\mathbf{W}} = \arg\min_{\mathbf{W}} \sum_{\mathbf{y}\in \mathbf{Y}_P} E(\mathbf{y})$
  (The bias can be captured by another fixed-value component)
  β€’ Minimize the total energy of target patterns
    – Problem with this?

  18. Optimizing W
  $E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}, \qquad \hat{\mathbf{W}} = \arg\min_{\mathbf{W}} \sum_{\mathbf{y}\in \mathbf{Y}_P} E(\mathbf{y}) - \sum_{\mathbf{y}\notin \mathbf{Y}_P} E(\mathbf{y})$
  β€’ Minimize the total energy of target patterns
  β€’ Maximize the total energy of all non-target patterns

  19. Optimizing W
  $E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}, \qquad \hat{\mathbf{W}} = \arg\min_{\mathbf{W}} \sum_{\mathbf{y}\in \mathbf{Y}_P} E(\mathbf{y}) - \sum_{\mathbf{y}\notin \mathbf{Y}_P} E(\mathbf{y})$
  β€’ Simple gradient descent:
    $\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in \mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin \mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$
  (A code sketch of this update follows.)
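A minimal sketch of one such gradient step, assuming the patterns are Β±1 NumPy vectors; the learning rate eta is an assumed value:

```python
import numpy as np

def gradient_step(W, targets, non_targets, eta=0.01):
    """One step of W <- W + eta * (sum over targets of yy^T - sum over non-targets of yy^T)."""
    pos = sum(np.outer(y, y) for y in targets)       # pull target patterns down in energy
    neg = sum(np.outer(y, y) for y in non_targets)   # push non-target patterns up in energy
    return W + eta * (pos - neg)
```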

  20. Optimizing W
  $\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in \mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin \mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$
  β€’ Can β€œemphasize” the importance of a pattern by repeating it
    – More repetitions β†’ greater emphasis

  21. Optimizing W
  $\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in \mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin \mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$
  β€’ Can β€œemphasize” the importance of a pattern by repeating it
    – More repetitions β†’ greater emphasis
  β€’ How many of these (non-target patterns)?
    – Do we need to include all of them?
    – Are all equally important?

  22. The training again..
  $\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in \mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin \mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$
  β€’ Note the energy contour of a Hopfield network for any weight $\mathbf{W}$
    – The bowls will all actually be quadratic
  (Figure: energy plotted against state.)

  23. The training again
  $\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in \mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin \mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$
  β€’ The first term tries to minimize the energy at target patterns
    – Make them local minima
    – Emphasize more β€œimportant” memories by repeating them more frequently
  (Figure: target patterns marked on the energy surface.)

  24. The negative class
  $\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in \mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin \mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\Big)$
  β€’ The second term tries to β€œraise” all non-target patterns
    – Do we need to raise everything?

  25. Option 1: Focus on the valleys
  $\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in \mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin \mathbf{Y}_P \,\&\, \mathbf{y}=\text{valley}} \mathbf{y}\mathbf{y}^T\Big)$
  β€’ Focus on raising the valleys
    – If you raise every valley, eventually they'll all move up above the target patterns, and many will even vanish

  26. Identifying the valleys..
  $\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in \mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin \mathbf{Y}_P \,\&\, \mathbf{y}=\text{valley}} \mathbf{y}\mathbf{y}^T\Big)$
  β€’ Problem: how do you identify the valleys for the current $\mathbf{W}$?

  27. Identifying the valleys..
  β€’ Initialize the network randomly and let it evolve
    – It will settle in a valley

  28. Training the Hopfield network
  $\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in \mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin \mathbf{Y}_P \,\&\, \mathbf{y}=\text{valley}} \mathbf{y}\mathbf{y}^T\Big)$
  β€’ Initialize $\mathbf{W}$
  β€’ Compute the total outer product of all target patterns
    – More important patterns presented more frequently
  β€’ Randomly initialize the network several times and let it evolve
    – And settle at a valley
  β€’ Compute the total outer product of the valley patterns
  β€’ Update the weights

  29. Training the Hopfield network: SGD version
  β€’ Initialize $\mathbf{W}$
  β€’ Do until convergence, satisfaction, or death from boredom:
    – Sample a target pattern $\mathbf{y}_p$
      β€’ Sampling frequency of a pattern must reflect its importance
    – Randomly initialize the network and let it evolve
      β€’ And settle at a valley $\mathbf{y}_v$
    – Update the weights: $\mathbf{W} = \mathbf{W} + \eta\big(\mathbf{y}_p\mathbf{y}_p^T - \mathbf{y}_v\mathbf{y}_v^T\big)$
  (A code sketch of this loop follows.)
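A rough, runnable sketch of this loop, assuming Β±1 NumPy patterns; the evolve() helper, the epoch count, and the learning rate are my own choices, not specified on the slides:

```python
import numpy as np

def evolve(W, y, max_steps=100):
    """Run asynchronous sign updates until no unit flips, i.e. settle into a valley."""
    y = y.copy()
    for _ in range(max_steps):
        changed = False
        for i in np.random.permutation(len(y)):
            field = W[i] @ y - W[i, i] * y[i]      # exclude the self term
            s = 1.0 if field >= 0 else -1.0
            if s != y[i]:
                y[i], changed = s, True
        if not changed:
            break
    return y

def train_sgd(targets, epochs=500, eta=0.01, seed=0):
    """SGD version from the slide: lower a sampled target, raise a randomly found valley."""
    rng = np.random.default_rng(seed)
    N = len(targets[0])
    W = np.zeros((N, N))
    for _ in range(epochs):
        y_p = targets[rng.integers(len(targets))]           # sample a target pattern
        y_v = evolve(W, rng.choice([-1.0, 1.0], size=N))    # random init -> settle at a valley
        W += eta * (np.outer(y_p, y_p) - np.outer(y_v, y_v))
    return W
```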

  30. Training the Hopfield network: SGD version (identical to the previous slide)

  31. Which valleys?
  β€’ Should we randomly sample valleys?
    – Are all valleys equally important?

  32. Which valleys?
  β€’ Should we randomly sample valleys?
    – Are all valleys equally important?
  β€’ Major requirement: memories must be stable
    – They must be broad valleys
  β€’ Spurious valleys in the neighborhood of memories are more important to eliminate

  33. Identifying the valleys..
  β€’ Initialize the network at valid memories and let it evolve
    – It will settle in a valley. If this is not the target pattern, raise it

  34. Training the Hopfield network
  β€’ Initialize $\mathbf{W}$
  β€’ Compute the total outer product of all target patterns
    – More important patterns presented more frequently
  β€’ Initialize the network with each target pattern and let it evolve
    – And settle at a valley
  β€’ Compute the total outer product of the valley patterns
  β€’ Update the weights

  35. Training the Hopfield network: SGD version
  β€’ Initialize $\mathbf{W}$
  β€’ Do until convergence, satisfaction, or death from boredom:
    – Sample a target pattern $\mathbf{y}_p$
      β€’ Sampling frequency of a pattern must reflect its importance
    – Initialize the network at $\mathbf{y}_p$ and let it evolve
      β€’ And settle at a valley $\mathbf{y}_v$
    – Update the weights: $\mathbf{W} = \mathbf{W} + \eta\big(\mathbf{y}_p\mathbf{y}_p^T - \mathbf{y}_v\mathbf{y}_v^T\big)$

  36. A possible problem
  β€’ What if there's another target pattern down-valley?
    – Raising it will destroy a better-represented or stored pattern!

  37. A related issue
  β€’ Really no need to raise the entire surface, or even every valley

  38. A related issue
  β€’ Really no need to raise the entire surface, or even every valley
  β€’ Raise the neighborhood of each target memory
    – Sufficient to make the memory a valley
    – The broader the neighborhood considered, the broader the valley

  39. Raising the neighborhood
  β€’ Starting from a target pattern, let the network evolve only a few steps
    – Try to raise the resultant location
  β€’ Will raise the neighborhood of targets
  β€’ Will avoid the problem of down-valley targets

  40. Training the Hopfield network: SGD version
  β€’ Initialize $\mathbf{W}$
  β€’ Do until convergence, satisfaction, or death from boredom:
    – Sample a target pattern $\mathbf{y}_p$
      β€’ Sampling frequency of a pattern must reflect its importance
    – Initialize the network at $\mathbf{y}_p$ and let it evolve a few steps (2–4)
      β€’ And arrive at a down-valley position $\mathbf{y}_d$
    – Update the weights: $\mathbf{W} = \mathbf{W} + \eta\big(\mathbf{y}_p\mathbf{y}_p^T - \mathbf{y}_d\mathbf{y}_d^T\big)$
  (A code sketch of this variant follows.)
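A sketch of this few-step variant under the same assumptions as the earlier loop (Β±1 NumPy patterns, assumed epoch count and learning rate); the comparison to a contrastive-divergence-style update is my gloss, not the slide's wording:

```python
import numpy as np

def train_sgd_few_steps(targets, epochs=500, eta=0.01, k=3, seed=0):
    """Start at a sampled target, evolve only a few steps, and raise the point reached."""
    rng = np.random.default_rng(seed)
    N = len(targets[0])
    W = np.zeros((N, N))
    for _ in range(epochs):
        y_p = targets[rng.integers(len(targets))]   # sample a target pattern
        y_d = y_p.copy()
        for _ in range(k):                          # evolve only k steps (2-4 on the slide)
            i = rng.integers(N)                     # one asynchronous unit update per step
            field = W[i] @ y_d - W[i, i] * y_d[i]   # exclude the self term
            y_d[i] = 1.0 if field >= 0 else -1.0
        W += eta * (np.outer(y_p, y_p) - np.outer(y_d, y_d))
    return W
```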

  41. A probabilistic interpretation
  $P(\mathbf{y}) = C\exp\!\big(\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}\big), \qquad E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}$
  β€’ For continuous $\mathbf{y}$, the energy of a pattern is a perfect analog to the negative log likelihood of a Gaussian density
  β€’ For binary $\mathbf{y}$ it is the analog of the negative log likelihood of a Boltzmann distribution
    – Minimizing energy maximizes log likelihood

  42. The Boltzmann Distribution
  $E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}, \qquad P(\mathbf{y}) = C\exp\!\Big(\frac{-E(\mathbf{y})}{kT}\Big), \qquad C = \frac{1}{\sum_{\mathbf{y}} \exp\!\big(\frac{-E(\mathbf{y})}{kT}\big)}$
  β€’ $k$ is the Boltzmann constant
  β€’ $T$ is the temperature of the system
  β€’ The energy terms are like the negative log likelihood of a Boltzmann distribution at $T = 1$
    – Derivation of this probability is in fact quite trivial..

  43. Continuing the Boltzmann analogy
  $E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}, \qquad P(\mathbf{y}) = C\exp\!\Big(\frac{-E(\mathbf{y})}{kT}\Big)$
  β€’ The system probabilistically selects states with lower energy
    – With infinitesimally slow cooling, at $T = 0$, it arrives at the global minimal state

  44. Spin glasses and Hopfield nets
  (Figure: energy plotted against state.)
  β€’ Selecting a next state is akin to drawing a sample from the Boltzmann distribution at $T = 1$, in a universe where $k = 1$

  45. Optimizing W
  $E(\mathbf{y}) = -\tfrac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}, \qquad \hat{\mathbf{W}} = \arg\min_{\mathbf{W}} \sum_{\mathbf{y}\in \mathbf{Y}_P} E(\mathbf{y}) - \sum_{\mathbf{y}\notin \mathbf{Y}_P} E(\mathbf{y})$
  β€’ Simple gradient descent:
    $\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in \mathbf{Y}_P} \alpha_{\mathbf{y}}\, \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin \mathbf{Y}_P} \beta_{E(\mathbf{y})}\, \mathbf{y}\mathbf{y}^T\Big)$
    – $\alpha_{\mathbf{y}}$: more importance to more frequently presented memories
    – $\beta_{E(\mathbf{y})}$: more importance to more attractive spurious memories

  46. Optimizing W
  $\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in \mathbf{Y}_P} \alpha_{\mathbf{y}}\, \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin \mathbf{Y}_P} \beta_{E(\mathbf{y})}\, \mathbf{y}\mathbf{y}^T\Big)$
  β€’ More importance to more frequently presented memories; more importance to more attractive spurious memories
  β€’ THIS LOOKS LIKE AN EXPECTATION!

  47. Optimizing W
  β€’ Update rule:
    $\mathbf{W} = \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in \mathbf{Y}_P} \alpha_{\mathbf{y}}\, \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin \mathbf{Y}_P} \beta_{E(\mathbf{y})}\, \mathbf{y}\mathbf{y}^T\Big)$
    $\mathbf{W} = \mathbf{W} + \eta\big(E_{\mathbf{y}\sim \mathbf{Y}_P}[\mathbf{y}\mathbf{y}^T] - E_{\mathbf{y}\sim Y}[\mathbf{y}\mathbf{y}^T]\big)$
  β€’ The natural distribution for the variables: the Boltzmann distribution

  48. Continuing on..
  β€’ The Hopfield net as a Boltzmann distribution
  β€’ Adding capacity to a Hopfield network
    – The Boltzmann machine

  49. Continuing on.. (identical to the previous slide)

  50. Storing more than N patterns
  β€’ The memory capacity of an $N$-bit network is at most $N$
    – Stable patterns (not necessarily even stationary)
    – Abu-Mostafa and St. Jacques, 1985
  β€’ Although the β€œinformation capacity” is $\mathcal{O}(N^3)$
  β€’ How do we increase the capacity of the network?
    – Store more patterns

  51. Expanding the network
  (Figure: the original N neurons augmented with K additional neurons.)
  β€’ Add a large number of neurons whose actual values you don't care about!

  52. Expanded Network
  (Figure: N original neurons plus K added neurons.)
  β€’ New capacity: ~$(N + K)$ patterns
    – Although we only care about the pattern of the first $N$ neurons
    – We're interested in $N$-bit patterns

  53. Terminology
  (Figure: visible and hidden neurons.)
  β€’ Terminology:
    – The neurons that store the actual patterns of interest: visible neurons
    – The neurons that only serve to increase the capacity, but whose actual values are not important: hidden neurons
    – These can be set to anything in order to store a visible pattern

  54. Training the network
  β€’ For a given pattern of visible neurons, there are any number of hidden patterns ($2^K$)
  β€’ Which of these do we choose?
    – Ideally choose the one that results in the lowest energy
    – But that's an exponential search space!
  β€’ Solution: combinatorial optimization
    – Simulated annealing

  55. The patterns
  β€’ In fact we could have multiple hidden patterns coupled with any visible pattern
    – These would be multiple stored patterns that all give the same visible output
    – How many do we permit?
  β€’ Do we need to specify one or more particular hidden patterns?
    – How about all of them?
    – What do I mean by this bizarre statement?

  56. But first..
  β€’ The Hopfield net as a distribution..

  57. Revisiting Thermodynamic Phenomena
  (Figure: potential energy plotted against state.)
  β€’ Is the system actually in a specific state at any time?
  β€’ No – the state is actually continuously changing
    – Based on the temperature of the system
    – At higher temperatures, the state changes more rapidly
  β€’ What is actually being characterized is the probability of the state
    – And the expected value of the state

  58. The Helmholtz Free Energy of a System
  β€’ A thermodynamic system at temperature $T$ can exist in one of many states
    – Potentially infinitely many states
    – At any time, the probability of finding the system in state $s$ at temperature $T$ is $P_T(s)$
  β€’ At each state $s$ it has a potential energy $E_s$
  β€’ The internal energy of the system, representing its capacity to do work, is the average:
    $U_T = \sum_s P_T(s)\, E_s$

  59. The Helmholtz Free Energy of a System
  β€’ The capacity to do work is counteracted by the internal disorder of the system, i.e. its entropy:
    $H_T = -\sum_s P_T(s) \log P_T(s)$
  β€’ The Helmholtz free energy of the system measures the useful work derivable from it and combines the two terms:
    $F_T = U_T - kT\, H_T = \sum_s P_T(s)\, E_s + kT \sum_s P_T(s) \log P_T(s)$

  60. The Helmholtz Free Energy of a System
  $F_T = \sum_s P_T(s)\, E_s + kT \sum_s P_T(s) \log P_T(s)$
  β€’ A system held at a specific temperature anneals by varying the rate at which it visits the various states, reducing its free energy, until a minimum free-energy state is achieved
  β€’ The probability distribution of the states at steady state is known as the Boltzmann distribution

  61. The Helmholtz Free Energy of a System
  $F_T = \sum_s P_T(s)\, E_s + kT \sum_s P_T(s) \log P_T(s)$
  β€’ Minimizing this w.r.t. $P_T(s)$, we get
    $P_T(s) = \frac{1}{z} \exp\!\Big(\frac{-E_s}{kT}\Big)$
    – Also known as the Gibbs distribution
    – $z$ is a normalizing constant
    – Note the dependence on $T$
    – At $T = 0$, the system will always remain at the lowest-energy configuration with probability 1
  (A sketch of the minimization follows.)
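Filling in the minimization step the slide alludes to (my derivation, using a Lagrange multiplier $\mu$ to enforce normalization of $P_T$):

\[
\mathcal{L} = \sum_s P_T(s)\, E_s + kT \sum_s P_T(s)\log P_T(s) + \mu\Big(\sum_s P_T(s) - 1\Big)
\]
\[
\frac{\partial \mathcal{L}}{\partial P_T(s)} = E_s + kT\big(\log P_T(s) + 1\big) + \mu = 0
\;\;\Rightarrow\;\;
P_T(s) = \frac{1}{z}\exp\!\Big(\frac{-E_s}{kT}\Big),
\]

where the constants are absorbed into the normalizer $z$.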

  62. The Energy of the Network
  (Figure: a network of visible neurons.)
  $E(S) = -\sum_{i<j} w_{ij} s_i s_j - \sum_i b_i s_i, \qquad P(S) = \frac{\exp(-E(S))}{\sum_{S'} \exp(-E(S'))}$
  β€’ We can define the energy of the system as before
  β€’ Since the neurons are stochastic, there is disorder or entropy (with $T = 1$)
  β€’ The equilibrium probability distribution over states is the Boltzmann distribution at $T = 1$
    – This is the probability of the different states that the network will wander over at equilibrium

  63. The Hopfield net is a distribution
  $E(S) = -\sum_{i<j} w_{ij} s_i s_j - \sum_i b_i s_i, \qquad P(S) = \frac{\exp(-E(S))}{\sum_{S'} \exp(-E(S'))}$
  β€’ The stochastic Hopfield network models a probability distribution over states
    – Where a state is a binary string
    – Specifically, it models a Boltzmann distribution
    – The parameters of the model are the weights of the network
  β€’ The probability that (at equilibrium) the network will be in any state $S$ is $P(S)$
    – It is a generative model: it generates states according to $P(S)$

  64. The field at a single node
  β€’ Let $S$ and $S'$ be otherwise identical states that differ only in the $i$-th bit
    – $S$ has the $i$-th bit $= +1$ and $S'$ has the $i$-th bit $= -1$
  $P(S) = P(s_i = 1 \mid s_{j\neq i})\, P(s_{j\neq i}), \qquad P(S') = P(s_i = -1 \mid s_{j\neq i})\, P(s_{j\neq i})$
  $\log P(S) - \log P(S') = \log P(s_i = 1 \mid s_{j\neq i}) - \log P(s_i = -1 \mid s_{j\neq i}) = \log \frac{P(s_i = 1 \mid s_{j\neq i})}{1 - P(s_i = 1 \mid s_{j\neq i})}$

  65. The field at a single node
  β€’ Let $S$ and $S'$ be the states with the $i$-th bit in the $+1$ and $-1$ states
  $\log P(S) = -E(S) + C$
  $E(S) = -\tfrac{1}{2}\Big(E_{\text{not } i} + \sum_{j\neq i} w_{ij} s_j + b_i\Big), \qquad E(S') = -\tfrac{1}{2}\Big(E_{\text{not } i} - \sum_{j\neq i} w_{ij} s_j - b_i\Big)$
  β€’ $\log P(S) - \log P(S') = E(S') - E(S) = \sum_{j\neq i} w_{ij} s_j + b_i$

  66. The field at a single node
  $\log \frac{P(s_i = 1 \mid s_{j\neq i})}{1 - P(s_i = 1 \mid s_{j\neq i})} = \sum_{j\neq i} w_{ij} s_j + b_i$
  β€’ Giving us
    $P(s_i = 1 \mid s_{j\neq i}) = \frac{1}{1 + \exp\!\big(-\sum_{j\neq i} w_{ij} s_j - b_i\big)}$
  β€’ The probability of any node taking value 1 given the other node values is a logistic
  (A code sketch of this conditional follows.)
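A direct sketch of this conditional, assuming W, b, and the state s are NumPy arrays; excluding the self term mirrors the $j \neq i$ sum above:

```python
import numpy as np

def p_on(W, b, s, i):
    """P(s_i = 1 | all other units): a logistic of the local field at node i."""
    z_i = W[i] @ s - W[i, i] * s[i] + b[i]     # local field, excluding the self term
    return 1.0 / (1.0 + np.exp(-z_i))
```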

  67. Redefining the network
  $z_i = \sum_j w_{ji} s_j + b_i, \qquad P(s_i = 1 \mid s_{j\neq i}) = \frac{1}{1 + \exp(-z_i)}$
  β€’ First try: redefine a regular Hopfield net as a stochastic system
  β€’ Each neuron is now a stochastic unit with a binary state $s_i$, which can take value 0 or 1 with a probability that depends on the local field
    – Note the slight change from Hopfield nets
    – Not actually necessary; only a matter of convenience

  68. The Hopfield net is a distribution
  $z_i = \sum_j w_{ji} s_j + b_i, \qquad P(s_i = 1 \mid s_{j\neq i}) = \frac{1}{1 + \exp(-z_i)}$
  β€’ The Hopfield net is a probability distribution over binary sequences
    – The Boltzmann distribution
  β€’ The conditional distribution of individual bits in the sequence is a logistic

  69. Running the network
  $z_i = \sum_j w_{ji} s_j + b_i, \qquad P(s_i = 1 \mid s_{j\neq i}) = \frac{1}{1 + \exp(-z_i)}$
  β€’ Initialize the neurons
  β€’ Cycle through the neurons and randomly set the neuron to 1 or βˆ’1 according to the probability given above
    – Gibbs sampling: fix $N-1$ variables and sample the remaining variable
    – As opposed to the energy-based update (mean-field approximation): run the test $z_i > 0$?
  β€’ After many, many iterations (until β€œconvergence”), sample the individual neurons
  (A code sketch of this sampling loop follows.)
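A rough sketch of this sampling loop; I use the 0/1 convention from the "Redefining the network" slide, and the sweep count, random initialization, and argument names are assumptions for illustration:

```python
import numpy as np

def gibbs_run(W, b, n_sweeps=1000, seed=0):
    """Run the stochastic (T = 1) network by Gibbs sampling over its units."""
    rng = np.random.default_rng(seed)
    N = len(b)
    s = rng.integers(0, 2, size=N).astype(float)      # random initial 0/1 state
    for _ in range(n_sweeps):
        for i in rng.permutation(N):                  # fix the other N-1 units, resample unit i
            z_i = W[i] @ s - W[i, i] * s[i] + b[i]    # local field, excluding the self term
            s[i] = float(rng.random() < 1.0 / (1.0 + np.exp(-z_i)))
    return s                                          # one sample after "convergence"
```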
