Simulated Annealing

1. Simulated Annealing

input: (x_1, t_1), ..., (x_N, t_N) ∈ R^d × {−1, +1}; T_start, T_stop ∈ R
output: w
begin
   Randomly initialize w
   T ← T_start
   repeat
      ŵ ← N(w)   // neighbor of w, e.g. obtained by adding Gaussian noise N(0, σ)
      if E(ŵ) < E(w) then
         w ← ŵ
      else if exp( −(E(ŵ) − E(w)) / T ) > rand[0, 1) then
         w ← ŵ
      decrease(T)
   until T < T_stop
   return w
end
– p. 156
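A minimal Python sketch of this procedure, assuming the caller supplies the error function E; the geometric cooling factor and the noise scale sigma are illustrative choices, not taken from the slide.

import numpy as np

def simulated_annealing(E, dim, T_start=10.0, T_stop=1e-3, sigma=0.1, cool=0.95, rng=None):
    # Anneal a weight vector w so that the error E(w) is (approximately) minimized.
    rng = np.random.default_rng() if rng is None else rng
    w = rng.normal(size=dim)                           # randomly initialize w
    T = T_start
    while T >= T_stop:
        w_new = w + rng.normal(0.0, sigma, size=dim)   # neighbor of w via Gaussian noise
        dE = E(w_new) - E(w)
        # accept improvements always, worsenings with probability exp(-dE / T)
        if dE < 0 or np.exp(-dE / T) > rng.random():
            w = w_new
        T *= cool                                      # decrease(T): geometric cooling (an assumption)
    return w

# usage: minimize a simple quadratic error as a stand-in for the training error
# w_best = simulated_annealing(lambda w: np.sum(w**2), dim=5)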

2. Continuous Hopfield Network

Let us consider our previously defined Hopfield network (identical architecture and learning rule), but with the following activity rule:

   S_i = tanh( (1/T) Σ_j w_ij S_j )

Start with a large (temperature) value of T and decrease it by some magnitude whenever a unit is updated (deterministic simulated annealing). This type of Hopfield network can approximate the probability distribution

   P(x|W) = (1/Z(W)) exp[ −E(x) ] = (1/Z(W)) exp( (1/2) x^T W x )
– p. 157
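The annealed activity rule can be sketched in Python roughly as follows; the symmetric example weight matrix, the cooling factor, and the update schedule are illustrative assumptions.

import numpy as np

def anneal_hopfield(W, s0, T_start=5.0, T_stop=0.05, cool=0.9, rng=None):
    # Deterministic simulated annealing of a continuous Hopfield network:
    # S_i = tanh( (1/T) * sum_j w_ij S_j ), with T lowered whenever a unit is updated.
    rng = np.random.default_rng() if rng is None else rng
    s = np.array(s0, dtype=float)
    T = T_start
    while T >= T_stop:
        i = rng.integers(len(s))          # pick one unit to update
        s[i] = np.tanh(W[i] @ s / T)      # activity rule from the slide
        T *= cool                         # decrease T after each unit update
    return s

# usage with a small symmetric weight matrix (zero diagonal)
# W = np.array([[0., 1., -1.], [1., 0., 1.], [-1., 1., 0.]])
# s = anneal_hopfield(W, s0=[0.1, -0.2, 0.3])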

3. Continuous Hopfield Network

   Z(W) = Σ_{x'} exp( −E(x') )   (sum over all possible states)

is the partition function and ensures that P(x|W) is a probability distribution.

Idea: construct a stochastic Hopfield network that implements the probability distribution P(x|W).
• Learn a model that is capable of generating patterns from that unknown distribution.
• Quantify (classify) seen and unseen patterns by means of probabilities.
• If needed, we can generate more patterns (generative model).
– p. 158
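For very small networks the partition function can be evaluated exactly by enumerating all 2^d states; a brief sketch, assuming E(x) = −(1/2) x^T W x as defined above.

import itertools
import numpy as np

def partition_function(W):
    # Z(W) = sum over all x' in {-1,+1}^d of exp(-E(x')), with E(x) = -0.5 * x^T W x
    d = W.shape[0]
    Z = 0.0
    for bits in itertools.product([-1.0, 1.0], repeat=d):
        x = np.array(bits)
        Z += np.exp(0.5 * x @ W @ x)
    return Z

def p_x_given_W(x, W):
    # normalized probability of one state under the model
    x = np.asarray(x, dtype=float)
    return np.exp(0.5 * x @ W @ x) / partition_function(W)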

4. Boltzmann Machines

Given patterns {x^(n)}_1^N, we want to learn the weights such that the generative model

   P(x|W) = (1/Z(W)) exp( (1/2) x^T W x )

is well matched to those patterns. The states are updated according to the stochastic rule:
• set x_i = +1 with probability 1 / (1 + exp( −2 Σ_j w_ij x_j ))
• else set x_i = −1.

Posterior probability of the weights given the data (Bayes' theorem):

   P(W | {x^(n)}_1^N) = [ Π_{n=1}^N P(x^(n)|W) ] P(W) / P({x^(n)}_1^N)
– p. 159
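A sketch of the stochastic update rule (Gibbs sampling), assuming states in {−1, +1} and a symmetric W with zero diagonal; the number of sweeps is an arbitrary choice.

import numpy as np

def gibbs_sample(W, sweeps=50, rng=None):
    # Draw an approximate sample x ~ P(x|W) by repeatedly applying the stochastic rule:
    # x_i = +1 with probability 1 / (1 + exp(-2 * sum_j w_ij x_j)), else x_i = -1.
    rng = np.random.default_rng() if rng is None else rng
    d = W.shape[0]
    x = rng.choice([-1.0, 1.0], size=d)
    for _ in range(sweeps):
        for i in range(d):
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * W[i] @ x))
            x[i] = 1.0 if rng.random() < p_plus else -1.0
    return x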

5. Boltzmann Machines

Apply the maximum likelihood method to the first term in the numerator:

   ln Π_{n=1}^N P(x^(n)|W) = Σ_{n=1}^N [ (1/2) x^(n)T W x^(n) − ln Z(W) ]

Taking the derivative of the log likelihood gives (note that W is symmetric, w_ij = w_ji):

   ∂/∂w_ij (1/2) x^(n)T W x^(n) = x_i^(n) x_j^(n)

and

   ∂/∂w_ij ln Z(W) = (1/Z(W)) Σ_x exp( (1/2) x^T W x ) · x_i x_j
                   = Σ_x x_i x_j P(x|W)
                   = ⟨ x_i x_j ⟩_{P(x|W)}
– p. 160
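The identity ∂ ln Z(W)/∂w_ij = ⟨x_i x_j⟩ can be checked numerically on a tiny network by comparing a finite difference of ln Z with the exact model correlation; the example weight matrix and the perturbation of the tied pair (w_ij, w_ji) are illustrative assumptions.

import itertools
import numpy as np

def ln_Z(W):
    # ln of the partition function, by enumerating all states in {-1,+1}^d
    d = W.shape[0]
    states = [np.array(b) for b in itertools.product([-1.0, 1.0], repeat=d)]
    return np.log(sum(np.exp(0.5 * x @ W @ x) for x in states))

def model_correlation(W, i, j):
    # <x_i x_j> under P(x|W), by exact enumeration
    d = W.shape[0]
    states = [np.array(b) for b in itertools.product([-1.0, 1.0], repeat=d)]
    p = np.array([np.exp(0.5 * x @ W @ x) for x in states])
    p /= p.sum()
    return sum(pk * x[i] * x[j] for pk, x in zip(p, states))

# finite-difference check: perturb the tied pair (w_ij, w_ji) and compare the two values
W = np.array([[0.0, 0.3, -0.2], [0.3, 0.0, 0.5], [-0.2, 0.5, 0.0]])
i, j, eps = 0, 1, 1e-5
Wp, Wm = W.copy(), W.copy()
Wp[i, j] = Wp[j, i] = W[i, j] + eps
Wm[i, j] = Wm[j, i] = W[i, j] - eps
print((ln_Z(Wp) - ln_Z(Wm)) / (2 * eps), model_correlation(W, i, j))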

6. Boltzmann Machines (cont.)

   ∂/∂w_ij ln P({x^(n)}_1^N | W) = Σ_{n=1}^N [ x_i^(n) x_j^(n) − ⟨x_i x_j⟩_{P(x|W)} ]
                                 = N [ ⟨x_i x_j⟩_Data − ⟨x_i x_j⟩_{P(x|W)} ]

Empirical correlation between x_i and x_j:

   ⟨x_i x_j⟩_Data ≡ (1/N) Σ_{n=1}^N x_i^(n) x_j^(n)

Correlation between x_i and x_j under the current model:

   ⟨x_i x_j⟩_{P(x|W)} ≡ Σ_x x_i x_j P(x|W)
– p. 161
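Putting the two correlation terms together gives a simple learning loop for a Boltzmann machine without hidden units; this sketch computes the model correlation exactly by enumeration, so it only scales to a handful of units, and the learning rate and iteration count are arbitrary choices.

import itertools
import numpy as np

def fit_boltzmann(X, iters=200, eta=0.1):
    # X: array of shape (N, d) with entries in {-1, +1}
    # Gradient ascent on ln P({x^(n)} | W) using <x_i x_j>_Data - <x_i x_j>_Model.
    X = np.asarray(X, dtype=float)
    N, d = X.shape
    data_corr = X.T @ X / N                                # <x_i x_j>_Data
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
    W = np.zeros((d, d))
    for _ in range(iters):
        p = np.exp(0.5 * np.einsum('si,ij,sj->s', states, W, states))
        p /= p.sum()                                       # P(x|W) over all 2^d states
        model_corr = states.T @ (states * p[:, None])      # <x_i x_j>_Model
        grad = data_corr - model_corr
        np.fill_diagonal(grad, 0.0)                        # no self-connections
        W += eta * grad                                    # per-pattern gradient (factor N dropped)
    return W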

7. Interpretation of Boltzmann Machine Learning

Illustrative description (MacKay's book, p. 523):
• Awake state: measure the correlation between x_i and x_j in the real world, and increase the weights in proportion to the measured correlations.
• Sleep state: dream about the world using the generative model P(x|W), measure the correlation between x_i and x_j in the model world, and use these correlations to determine a proportional decrease in the weights.
If the correlations in the dream world and the real world match, the two terms balance and the weights do not change.
– p. 162

8. Boltzmann Machines with Hidden Units

To model higher-order correlations, hidden units are required.
• x: states of the visible units,
• h: states of the hidden units,
• the generic state of a unit (either visible or hidden) is denoted y_i, with y ≡ (x, h),
• the state of the network when the visible units are clamped in state x^(n) is y^(n) ≡ (x^(n), h).

The probability of a single pattern x^(n) given W is

   P(x^(n)|W) = Σ_h P(x^(n), h | W) = Σ_h (1/Z(W)) exp( (1/2) y^(n)T W y^(n) )

where

   Z(W) = Σ_{x,h} exp( (1/2) y^T W y )
– p. 163

9. Boltzmann Machines with Hidden Units (cont.)

Applying the maximum likelihood method as before, one obtains

   ∂/∂w_ij ln P({x^(n)}_1^N | W) = Σ_n [ ⟨y_i y_j⟩_{P(h|x^(n),W)} − ⟨y_i y_j⟩_{P(x,h|W)} ]
                                         (clamped to x^(n))        (free)

The term ⟨y_i y_j⟩_{P(h|x^(n),W)} is the correlation between y_i and y_j when the Boltzmann machine is simulated with the visible variables clamped to x^(n) and the hidden variables sampling freely from their conditional distribution.

The term ⟨y_i y_j⟩_{P(x,h|W)} is the correlation between y_i and y_j when the Boltzmann machine generates samples from its model distribution.
– p. 164
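A sketch of the two correlation terms for a machine with hidden units, computed by exact enumeration over the hidden (and, for the free term, also the visible) states; it is only practical for very small networks, and placing the visible units in the first components of y is an illustrative assumption.

import itertools
import numpy as np

def clamped_and_free_correlations(W, x_n, n_hidden):
    # y = (x, h); W acts on the full state y, visible units assumed to come first.
    # clamped term: <y_i y_j> with visible units fixed to x_n and h ~ P(h | x_n, W)
    # free term:    <y_i y_j> with (x, h) ~ P(x, h | W)
    x_n = np.asarray(x_n, dtype=float)
    d = W.shape[0]
    h_states = [np.array(b) for b in itertools.product([-1.0, 1.0], repeat=n_hidden)]
    v_states = [np.array(b) for b in itertools.product([-1.0, 1.0], repeat=d - n_hidden)]

    def corr(ys):
        # weighted correlation matrix under the (conditional or joint) Boltzmann distribution
        w = np.array([np.exp(0.5 * y @ W @ y) for y in ys])
        w /= w.sum()
        return sum(wk * np.outer(y, y) for wk, y in zip(w, ys))

    clamped = corr([np.concatenate([x_n, h]) for h in h_states])
    free = corr([np.concatenate([x, h]) for x in v_states for h in h_states])
    return clamped, free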

10. Boltzmann Machines with Input-Hidden-Output

The Boltzmann machine considered so far is a powerful stochastic Hopfield network with no ability to perform classification. Let us introduce visible input and output units:
• x ≡ (x_i, x_o)
Note that a pattern x^(n) consists of an input and an output part, that is, x^(n) ≡ (x_i^(n), x_o^(n)).

The gradient then contains the difference

   Σ_n [ ⟨y_i y_j⟩_{clamped to (x_i^(n), x_o^(n))} − ⟨y_i y_j⟩_{clamped to x_i^(n)} ]
– p. 165

11. Boltzmann Machine Weight Updates

Combine gradient descent and simulated annealing to update the weights:

   Δw_ij = (η/T) [ ⟨y_i y_j⟩_{clamped to (x_i^(n), x_o^(n))} − ⟨y_i y_j⟩_{clamped to x_i^(n)} ]

High computational complexity:
• present each pattern several times
• anneal several times

Mean-field version of Boltzmann learning:
• calculate approximations [y_i y_j] of the correlations entering the gradient
– p. 166

12. Deterministic Boltzmann Learning

input: {x^(n)}_1^N; η, T_start, T_stop ∈ R
output: W
begin
   T ← T_start
   repeat
      randomly select a pattern from the sample {x^(n)}_1^N
      randomize states
      anneal the network with input and output clamped
      at final, low T, calculate [y_i y_j]_{x_i, x_o clamped}
      randomize states
      anneal the network with input clamped but output free
      at final, low T, calculate [y_i y_j]_{x_i clamped}
      w_ij ← w_ij + (η/T) ( [y_i y_j]_{x_i, x_o clamped} − [y_i y_j]_{x_i clamped} )
      decrease(T)
   until T < T_stop
   return W
end
– p. 167
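A mean-field sketch of this procedure in Python: the stochastic annealing is replaced by deterministic updates s_i = tanh((1/T) Σ_j w_ij s_j), and the correlations are approximated as [y_i y_j] ≈ s_i s_j. The clamping mask mechanism, the function names, and all schedule constants are assumptions for illustration, not the lecture's exact recipe.

import numpy as np

def mean_field_anneal(W, clamp_mask, clamp_vals, T_start=5.0, T_final=0.5, steps=60, rng=None):
    # Deterministic annealing: unclamped units follow s_i = tanh((1/T) sum_j w_ij s_j),
    # clamped units stay at their clamped values; T is lowered towards T_final.
    rng = np.random.default_rng() if rng is None else rng
    s = rng.uniform(-0.1, 0.1, size=W.shape[0])      # randomize states
    s = np.where(clamp_mask, clamp_vals, s)
    for T in np.linspace(T_start, T_final, steps):
        s = np.where(clamp_mask, clamp_vals, np.tanh(W @ s / T))
    return np.outer(s, s)                            # mean-field estimate of [y_i y_j]

def deterministic_boltzmann_step(W, x_in, x_out, in_idx, out_idx, eta=0.1, T_final=0.5):
    # One weight update for a single pattern (input part x_in, output part x_out):
    # w_ij += (eta / T) * ( [y_i y_j]_{x_i, x_o clamped} - [y_i y_j]_{x_i clamped} )
    d = W.shape[0]
    vals = np.zeros(d)
    vals[in_idx] = x_in
    vals[out_idx] = x_out
    mask_io = np.zeros(d, dtype=bool); mask_io[in_idx] = True; mask_io[out_idx] = True
    mask_i = np.zeros(d, dtype=bool);  mask_i[in_idx] = True
    corr_clamped = mean_field_anneal(W, mask_io, vals, T_final=T_final)   # input and output clamped
    corr_free = mean_field_anneal(W, mask_i, vals, T_final=T_final)       # input clamped, output free
    W += (eta / T_final) * (corr_clamped - corr_free)
    np.fill_diagonal(W, 0.0)
    return W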
