Neural Networks
Hopfield Nets and Boltzmann Machines Fall 2017
Recap: Hopfield network

At each time each neuron receives a field $z_i = \sum_{j \ne i} w_{ji} y_j + b_i$ and sets its output to

$y_i = \Theta(z_i) = \begin{cases} +1 & \text{if } z_i > 0 \\ -1 & \text{if } z_i \le 0 \end{cases}$
– Bias term may be viewed as an extra input pegged to 1.0
Evolution of the network:

1. Initialize $y_i(0) = x_i$, $0 \le i \le N-1$
2. Iterate: $y_i(t+1) = \Theta\Big(\sum_{j \ne i} w_{ji} y_j(t) + b_i\Big)$
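The evolution above can be sketched in a few lines of numpy (a minimal illustration; the function and variable names, the single stored pattern, and the asynchronous sweep order are my choices, not from the slides):

```python
import numpy as np

def hopfield_recall(W, b, x, max_iters=100):
    """Evolve a Hopfield net from initial state x until no unit flips.

    W: symmetric weight matrix with zero diagonal, b: bias vector,
    x: initial +1/-1 state. Units are updated asynchronously.
    """
    y = x.copy()
    for _ in range(max_iters):
        changed = False
        for i in range(len(y)):
            z = W[i] @ y + b[i]           # field at unit i (W[i, i] == 0)
            y_new = 1 if z > 0 else -1    # threshold activation
            if y_new != y[i]:
                y[i] = y_new
                changed = True
        if not changed:                   # converged to a fixed point
            break
    return y

# Store one pattern Hebbian-style, then recall it from a corrupted copy
p = np.array([1, -1, 1, -1, 1])
W = np.outer(p, p).astype(float)
np.fill_diagonal(W, 0)
noisy = p.copy()
noisy[0] = -noisy[0]                      # flip one bit
recovered = hopfield_recall(W, np.zeros(5), noisy)
```

Starting one bit-flip away from the stored pattern, the network settles back into the stored pattern's energy well.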
[Figure: potential energy (PE) vs. network state]
– E.g. the pictures in the previous example
– But can store an exponential number of unwanted “parasitic” memories along with the target patterns
– The energy of target patterns is minimized, so that they sit in energy wells
– The energy of other, potentially parasitic patterns is maximized, so that they do not become parasitic memories
[Figure: energy vs. state — minimize the energy of target patterns, maximize the energy of all other patterns]

$\widehat{W} = \underset{W}{\mathrm{argmin}} \sum_{y \in \mathcal{P}} E(y) - \sum_{y \notin \mathcal{P}} E(y)$

Gradient descent on this objective gives the update

$W \leftarrow W + \eta\Big(\sum_{y \in \mathcal{P}} y\,y^T - \sum_{y \notin \mathcal{P}} y\,y^T\Big)$
Minimize energy of target patterns Maximize energy of all other patterns
Restricting the second sum to sampled states (rather than all non-target states):

$W \leftarrow W + \eta\Big(\sum_{y \in \mathcal{P}} y\,y^T - \sum_{y \notin \mathcal{P},\ y\ \mathrm{sampled}} y\,y^T\Big)$
– It will settle in a valley. If this is not the target pattern, raise it
Per pattern, this becomes

$W \leftarrow W + \eta\big(y_p\,y_p^T - y_v\,y_v^T\big)$

where $y_p$ is the target pattern and $y_v$ is the valley state the network settles into.
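This raise-the-valley training loop might be sketched as follows (a sketch under my own naming and hyperparameters: `settle` runs deterministic Hopfield updates to find the valley, the noisy initialization and learning rate are arbitrary choices):

```python
import numpy as np

def settle(W, y, iters=20):
    """Run deterministic Hopfield updates until the state stops changing."""
    y = y.copy()
    for _ in range(iters):
        prev = y.copy()
        for i in range(len(y)):
            y[i] = 1 if W[i] @ y > 0 else -1
        if (y == prev).all():
            break
    return y

def train(W, targets, eta=0.01, epochs=100, rng=np.random.default_rng(0)):
    """Lower the energy of target patterns; raise the valleys the
    network actually falls into (the 'unlearning' step)."""
    N = W.shape[0]
    for _ in range(epochs):
        for y_p in targets:
            # start near the target and see where the network settles
            start = np.where(rng.random(N) < 0.9, y_p, -y_p)
            y_v = settle(W, start)
            W += eta * (np.outer(y_p, y_p) - np.outer(y_v, y_v))
            np.fill_diagonal(W, 0)   # no self-connections
    return W
```

Each update is symmetric, so the weight matrix stays symmetric with a zero diagonal; once the valley found near a target is the target itself, the update for that pattern vanishes.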
[Figure: energy vs. state — parasitic memories occupy spurious energy minima]
The field quantifies the energy difference obtained by flipping the current unit
If the difference is not large, the probability of flipping approaches 0.5
– T is a "temperature" parameter: increasing it moves the probability of the bits towards 0.5
– At T = 1.0 we get the traditional definition of field and energy
– At T = 0 we get deterministic Hopfield behavior
The stochastic update:

$y_i(0) = x_i,\quad 0 \le i \le N-1$

$z_i = \frac{1}{T} \sum_{j \ne i} w_{ji} y_j$

$y_i(t+1) \sim \mathrm{Binomial}\big(\sigma(z_i)\big), \quad \text{i.e. } P(y_i = 1) = \frac{1}{1 + e^{-z_i}}$
Assuming T = 1
The complete procedure:

1. Initialize $y_i(0) = x_i$, $0 \le i \le N-1$
2. For iter $1 \ldots M$:
   a) For $0 \le i \le N-1$:
      $P(y_i = 1) = \dfrac{1}{1 + \exp\big(-\sum_{j \ne i} w_{ji} y_j\big)}$
      $y_i(t+1) \sim \mathrm{Binomial}\big(P(y_i = 1)\big)$
3. Average each bit over the final $L$ iterations: $\bar{y}_i = \frac{1}{L} \sum_t y_i(t)$. Is $\bar{y}_i > 0.5$?
   – This estimates the probability that the bit is 1.0
   – If it is greater than 0.5, set the bit to 1.0
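Assuming T = 1, the procedure might look like this in code (a sketch; the function name, iteration counts, and burn-in window are my own choices):

```python
import numpy as np

def stochastic_hopfield(W, x, n_iters=200, burn_in=100,
                        rng=np.random.default_rng(0)):
    """Run a stochastic Hopfield net at T = 1 and threshold the average state.

    Each unit is resampled from P(y_i = 1) = sigmoid(z_i); a bit's final
    value is +1 if it was +1 more than half the time after burn-in, else -1.
    """
    y = x.copy().astype(float)
    N = len(y)
    tally = np.zeros(N)
    for t in range(n_iters):
        for i in range(N):
            z = W[i] @ y                       # field at unit i
            p = 1.0 / (1.0 + np.exp(-z))       # P(y_i = 1)
            y[i] = 1.0 if rng.random() < p else -1.0
        if t >= burn_in:
            tally += (y == 1.0)                # count visits to +1
    avg = tally / (n_iters - burn_in)          # fraction of time bit was 1
    return np.where(avg > 0.5, 1, -1)
```

With strong weights the fields are large, the flip probabilities saturate near 0 or 1, and the stochastic net behaves almost deterministically.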
– The Boltzmann distribution: $E(y) = -\frac{1}{2} y^T W y$, $\quad P(y) = C \exp\big(-E(y)\big)$
– The parameter of the distribution is the weights matrix $W$
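For a handful of units this distribution can be computed exactly by enumerating all $2^N$ states (a toy illustration; the 3-unit weight matrix below is invented):

```python
import itertools
import numpy as np

def boltzmann_distribution(W):
    """Exact Boltzmann distribution over all {-1, +1}^N states.

    E(y) = -1/2 y^T W y,  P(y) = C exp(-E(y)), where the constant C
    normalizes over all 2^N states (feasible only for small N).
    """
    N = W.shape[0]
    states = [np.array(s) for s in itertools.product([-1, 1], repeat=N)]
    energies = np.array([-0.5 * y @ W @ y for y in states])
    unnorm = np.exp(-energies)
    probs = unnorm / unnorm.sum()    # C = 1 / sum of unnormalized weights
    return states, probs

W = np.array([[0., 2., -1.],
              [2., 0., 1.],
              [-1., 1., 0.]])
states, probs = boltzmann_distribution(W)
```

Low-energy states receive the most probability mass, which is exactly the sense in which the network "likes" the patterns in its energy wells.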
– "State" == binary pattern of all the neurons
– We want to assign a probability distribution to states (vectors $y$, which we will now call $S$ because I'm too lazy to normalize the notation)
– This should assign more probability to patterns we "like" (or try to memorize) and less to patterns we "dislike"
– Assign higher probability to patterns seen more frequently
– Assign lower probability to patterns that are not seen at all
$\log P(S) = \sum_{i<j} w_{ij} s_i s_j - \log \sum_{S'} \exp\Big(\sum_{i<j} w_{ij} s'_i s'_j\Big)$

Averaged over the training set $\mathbf{S}$ of $N$ patterns:

$\langle \log P(S) \rangle = \frac{1}{N} \sum_{S \in \mathbf{S}} \log P(S) = \frac{1}{N} \sum_{S \in \mathbf{S}} \Bigg(\sum_{i<j} w_{ij} s_i s_j - \log \sum_{S'} \exp\Big(\sum_{i<j} w_{ij} s'_i s'_j\Big)\Bigg)$
Differentiating the sample estimate with respect to $w_{ij}$:

$\langle \log P(S) \rangle \approx \frac{1}{N} \sum_{S} \Bigg(\sum_{i<j} w_{ij} s_i s_j - \log \sum_{S'} \exp\Big(\sum_{i<j} w_{ij} s'_i s'_j\Big)\Bigg)$

$\frac{\partial \langle \log P(S) \rangle}{\partial w_{ij}} \approx \frac{1}{N} \sum_{S} s_i s_j \;-\; ???$
The remaining term is the derivative of the log-partition:

$\frac{\partial}{\partial w_{ij}} \log \sum_{S'} \exp\Big(\sum_{k<l} w_{kl} s'_k s'_l\Big) = \sum_{S'} \frac{\exp\big(\sum_{k<l} w_{kl} s'_k s'_l\big)}{\sum_{S''} \exp\big(\sum_{k<l} w_{kl} s''_k s''_l\big)}\, s'_i s'_j = \sum_{S'} P(S')\, s'_i s'_j$
– By probabilistically selecting state values according to our model

$\sum_{S'} P(S')\, s'_i s'_j \approx \frac{1}{M} \sum_{S' \in \mathrm{Sample}} s'_i s'_j$
Putting the pieces together:

$\frac{\partial \langle \log P(S) \rangle}{\partial w_{ij}} = \frac{1}{N} \sum_{S} s_i s_j - \frac{1}{M} \sum_{S' \in \mathrm{Sample}} s'_i s'_j$
Empirical estimate
Note the similarity to the update rule for the Hopfield network
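The gradient estimate — data correlations minus model correlations — can be sketched with Gibbs sampling standing in for "probabilistically selecting state values according to our model" (all names and hyperparameters here are my own):

```python
import numpy as np

def gibbs_sample(W, n_samples, n_steps=50, rng=np.random.default_rng(0)):
    """Draw approximate model samples by repeatedly resampling each unit
    from P(s_i = 1) = sigmoid(sum_j w_ij s_j)."""
    N = W.shape[0]
    samples = []
    s = rng.choice([-1.0, 1.0], size=N)
    for _ in range(n_samples):
        for _ in range(n_steps):
            for i in range(N):
                p = 1.0 / (1.0 + np.exp(-(W[i] @ s)))
                s[i] = 1.0 if rng.random() < p else -1.0
        samples.append(s.copy())
    return np.array(samples)

def log_likelihood_gradient(W, data, n_model_samples=20):
    """d<log P>/dw_ij ~= (1/N) sum_data s_i s_j - (1/M) sum_model s'_i s'_j."""
    pos = data.T @ data / len(data)              # data correlations
    samples = gibbs_sample(W, n_model_samples)
    neg = samples.T @ samples / len(samples)     # model correlations
    grad = pos - neg
    np.fill_diagonal(grad, 0)                    # no self-connections
    return grad
```

Each term is an average of outer products of ±1 states, so every gradient entry lies in [−2, 2] and the matrix stays symmetric with a zero diagonal — the same shape as the Hopfield update it resembles.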
– We want to learn to represent the visible bits
– The hidden bits are the "latent" representation learned by the network
– $V$ = visible bits
– $H$ = hidden bits
With hidden neurons the state is $S = (V, H)$ and the same gradient applies; in the first term the visible units are anchored to the training instances while the hidden units are sampled ($K$ hidden samples per instance):

$\frac{\partial \langle \log P \rangle}{\partial w_{ij}} \approx \frac{1}{NK} \sum_{V} \sum_{H\ \mathrm{sampled}} s_i s_j - \frac{1}{M} \sum_{S' \in \mathrm{Sample}} s'_i s'_j$
– Which could be repeated to represent relative probabilities
– [f1, f2, f3, …. , class]
– Features can have binarized or continuous valued representations
– Classes have "one hot" representation
– Given the features, anchor them and estimate the a posteriori probability distribution over classes
[Figure: Boltzmann machine with VISIBLE and HIDDEN units]
Conditional distributions in the restricted Boltzmann machine:

$z_i = \sum_j w_{ij} v_j + b_i, \qquad P(h_i = 1 \mid v) = \frac{1}{1 + e^{-z_i}}$

$z_j = \sum_i w_{ij} h_i + b_j, \qquad P(v_j = 1 \mid h) = \frac{1}{1 + e^{-z_j}}$
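These two conditionals are all that is needed to sample the machine layer by layer (a sketch using 0/1 units, a common RBM convention; shapes and names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_hidden(W, b_h, v, rng):
    """P(h_i = 1 | v) = sigmoid(sum_j w_ij v_j + b_i); returns 0/1 sample and probs."""
    p = sigmoid(W @ v + b_h)
    return (rng.random(p.shape) < p).astype(float), p

def sample_visible(W, b_v, h, rng):
    """P(v_j = 1 | h) = sigmoid(sum_i w_ij h_i + b_j); returns 0/1 sample and probs."""
    p = sigmoid(W.T @ h + b_v)
    return (rng.random(p.shape) < p).astype(float), p
```

Because there are no visible-visible or hidden-hidden connections, the units within a layer are conditionally independent given the other layer, so each layer can be sampled in a single vectorized step.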
" (visible) to training instance value
i j i i i j j j
77
state Energy
1
j i j i ij
i j i j
79
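Combining the anchoring step with a single round of reconstruction gives one-step contrastive divergence, Hinton's standard practical shortcut in which the $\langle \cdot \rangle_\infty$ term is approximated after one Gibbs step rather than at equilibrium (a sketch; names and the learning rate are my own):

```python
import numpy as np

def cd1_update(W, b_v, b_h, v0, eta=0.1, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) step on a single training vector v0."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    # positive phase: visible units anchored to the data
    ph0 = sig(W @ v0 + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # negative phase: reconstruct v, then recompute hidden probabilities
    pv1 = sig(W.T @ h0 + b_v)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sig(W @ v1 + b_h)
    # <v_i h_j>_0 - <v_i h_j>_1, using probabilities where possible
    W += eta * (np.outer(ph0, v0) - np.outer(ph1, v1))
    b_v += eta * (v0 - v1)
    b_h += eta * (ph0 - ph1)
    return W, b_v, b_h
```

Using the hidden probabilities instead of sampled bits in the outer products is a common variance-reduction choice; only the sampled `h0` drives the reconstruction itself.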
– Hidden units may also be continuous-valued