Sampling Techniques for Probabilistic and Deterministic Graphical Models
ICS 276, Spring 2017. Bozhena Bidyuk, Rina Dechter.
Reading: Darwiche, Chapter 15, and related papers.
Overview: 1. Probabilistic Reasoning / Graphical Models; 2. Importance Sampling; …
A Markov chain is a sequence of states with the property that the next state depends only on the current state (the Markov property):
$$P(x^{t+1} \mid x^t, x^{t-1}, \dots, x^1) = P(x^{t+1} \mid x^t)$$
If the transition probability $P(x^{t+1} \mid x^t)$ is the same for all $t$ (the chain is homogeneous) and the state space is finite, then the chain can be described by a transition matrix.
Example: a chain over states $\{1, 2, 3\}$ with transition matrix $P(X)$.
Example: a weather chain over states {rain, sun} with transition matrix $P(X)$; a sampled trajectory: rain, rain, rain, rain, sun.
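As an illustration, here is a minimal sketch of simulating such a homogeneous chain; the transition probabilities below are invented for illustration, not taken from the slides.

```python
import random

# Hypothetical transition matrix P(next | current) for the rain/sun chain.
P = {
    "rain": {"rain": 0.7, "sun": 0.3},
    "sun":  {"rain": 0.4, "sun": 0.6},
}

def step(state):
    """Sample the next state from P(. | current state)."""
    r, cum = random.random(), 0.0
    for nxt, p in P[state].items():
        cum += p
        if r < cum:
            return nxt
    return nxt  # guard against floating-point rounding

state = "rain"
trajectory = [state]
for _ in range(4):
    state = step(state)
    trajectory.append(state)
print(trajectory)  # e.g., ['rain', 'rain', 'rain', 'rain', 'sun']
```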
The state of the chain is a full assignment $x^t = (x_1^t, x_2^t, \dots, x_n^t)$. A new state $x^{t+1}$ is generated by resampling the variables one at a time, e.g., for three variables:
$$x_1^{t+1} \sim P(x_1 \mid x_2^t, x_3^t)$$
$$x_2^{t+1} \sim P(x_2 \mid x_1^{t+1}, x_3^t)$$
$$x_3^{t+1} \sim P(x_3 \mid x_1^{t+1}, x_2^{t+1})$$
In general, $x_i^{t+1} \sim P(x_i \mid x_1^{t+1}, \dots, x_{i-1}^{t+1}, x_{i+1}^t, \dots, x_n^t)$, yielding the next state $x^{t+1} = (x_1^{t+1}, x_2^{t+1}, x_3^{t+1})$.
The chain has a unique stationary distribution if:
– The chain is irreducible: any state $x_j \in D(X)$ can be reached from any state $x_i \in D(X)$
– All of its states are positive recurrent
(Liu, Ch. 12, p. 249, Def. 12.1.1)
The recurrent states in a finite-state chain are positive recurrent.
$$\lim_{n \to \infty} P^{(n)}(x \mid x^0) = \pi(x)$$
i.e., the $n$-step transition probabilities converge to the stationary distribution $\pi$ regardless of the starting state.
Once the chain mixes, the visited states $x^0, \dots, x^n$ can be viewed as "samples" from the stationary distribution $\pi$, so expectations can be estimated by the ergodic average
$$\hat{E}[f] = \frac{1}{T} \sum_{t=1}^{T} f(x^t)$$
$$x_i^{t+1} \sim P(x_i \mid x_1^{t+1}, \dots, x_{i-1}^{t+1}, x_{i+1}^t, \dots, x_n^t)$$
The process of Gibbs sampling can be understood as a random walk in the space of all instantiations $X = x$ (recall the drunkard's walk): in one step we can reach only instantiations that differ from the current one in the value of at most one variable (assuming a randomized choice of the variable $X_i$).
$$x_1^{t+1} \sim P(x_1 \mid x_2^t, x_3^t, \dots, x_N^t)$$
$$x_2^{t+1} \sim P(x_2 \mid x_1^{t+1}, x_3^t, \dots, x_N^t)$$
$$x_3^{t+1} \sim P(x_3 \mid x_1^{t+1}, x_2^{t+1}, x_4^t, \dots, x_N^t)$$
$$\vdots$$
$$x_N^{t+1} \sim P(x_N \mid x_1^{t+1}, x_2^{t+1}, \dots, x_{N-1}^{t+1})$$
In general, $x_i^{t+1} \sim P(x_i \mid x_1^{t+1}, \dots, x_{i-1}^{t+1}, x_{i+1}^t, \dots, x_N^t)$.
Process All Variables In Some Order
Markov blanket:
$$P(x_i \mid x^t \setminus x_i) = P(x_i \mid markov_i^t)$$
$$P(x_i \mid x^t \setminus x_i) \propto P(x_i \mid pa_i) \prod_{X_j \in ch_i} P(x_j \mid pa_j)$$
where $markov_i = pa_i \cup ch_i \cup \bigcup_{X_j \in ch_i} pa_j$.
Given its Markov blanket (parents, children, and the children's parents), $X_i$ is independent of all other nodes.
Computation is linear in the size of the Markov blanket!
1. For t = 1 to T (generate samples)
2.   For i = 1 to N (loop through variables)
3.     $x_i^{t+1} \leftarrow$ sampled from $P(X_i \mid markov_i^t)$
4.   End For
5. End For
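A minimal runnable sketch of this loop for a toy network A → B ← C; the CPTs and helper names below are invented for illustration, and the Markov-blanket conditional is computed exactly as in the formula above.

```python
import random

# Toy Bayesian network A -> B <- C with binary variables (CPTs invented
# for illustration). Each CPT maps a parent assignment to P(var = 1).
cpt = {
    "A": {(): 0.3},
    "C": {(): 0.6},
    "B": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
}
parents = {"A": (), "C": (), "B": ("A", "C")}
children = {"A": ("B",), "C": ("B",), "B": ()}

def local_prob(var, val, assign):
    """P(var = val | parents(var)), read off the CPT."""
    p1 = cpt[var][tuple(assign[p] for p in parents[var])]
    return p1 if val == 1 else 1.0 - p1

def markov_blanket_prob(var, assign):
    """P(var = 1 | Markov blanket): proportional to
    P(var | pa) * product over children of P(child | its parents)."""
    score = {}
    for val in (0, 1):
        a = dict(assign, **{var: val})
        s = local_prob(var, val, a)
        for ch in children[var]:
            s *= local_prob(ch, a[ch], a)
        score[val] = s
    return score[1] / (score[0] + score[1])

def gibbs(evidence, T=5000):
    """Gibbs sampling: resample each unobserved variable in turn."""
    assign = {v: evidence.get(v, random.randint(0, 1)) for v in cpt}
    free = [v for v in cpt if v not in evidence]
    samples = []
    for _ in range(T):
        for v in free:
            assign[v] = 1 if random.random() < markov_blanket_prob(v, assign) else 0
        samples.append(dict(assign))
    return samples

# Estimate P(A = 1 | B = 1) from the chain.
samples = gibbs({"B": 1})
print(sum(s["A"] for s in samples) / len(samples))
```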
[Figure: example Bayesian network over $X_1, \dots, X_9$.]
$X = \{X_1, X_2, \dots, X_9\}$, $E = \{X_9\}$.
A sample is an assignment to all unobserved variables: $X_1 = x_1$, $X_2 = x_2$, $X_3 = x_3$, $X_4 = x_4$, $X_5 = x_5$, $X_6 = x_6$, $X_7 = x_7$, $X_8 = x_8$.
Sample new values in order, conditioned on the evidence $x_9$:
$$x_1^{t+1} \sim P(x_1 \mid x_2^t, \dots, x_8^t, x_9)$$
$$x_2^{t+1} \sim P(x_2 \mid x_1^{t+1}, x_3^t, \dots, x_8^t, x_9)$$
and so on through $x_8^{t+1}$.
Two estimators for $P(x_i \mid e)$:
$$\hat{P}(x_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} \delta(x_i, x_i^t) \qquad \text{(histogram estimator; } \delta \text{ is the Dirac delta function)}$$
$$\hat{P}(x_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} P(x_i \mid markov_i^t) \qquad \text{(mixture estimator; gives better estimates for the unobserved values of } X_i\text{; prove via the Rao-Blackwell theorem)}$$
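Continuing the toy sketch above (same invented network and helpers), the two estimators differ only in the quantity accumulated per sample:

```python
# Histogram estimator: average of 0/1 indicators (the Dirac delta).
hist = sum(s["A"] for s in samples) / len(samples)

# Mixture estimator: average the exact conditional P(A = 1 | Markov blanket)
# at each sample; by Rao-Blackwell this has lower variance.
mix = sum(markov_blanket_prob("A", s) for s in samples) / len(samples)

print(hist, mix)
```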
Rao-Blackwell Theorem: Let the random variable set X be composed of two groups of variables, R and L. Then, for the joint distribution of (R, L) and a function g, the following holds:
$$Var\{E[g(R) \mid L]\} \leq Var\{g(R)\}$$
for any function of interest g, e.g., the mean or covariance (Casella & Robert, 1996; Liu et al., 1995).
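A small synthetic check of the inequality (numbers invented; the conditional expectation is available in closed form for this Gaussian toy case):

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=100_000)      # the conditioning variables
R = L + rng.normal(size=100_000)  # g(R) = R, so E[g(R) | L] = L

print(L.var(), "<=", R.var())     # Var{E[g(R)|L]} <= Var{g(R)}: ~1.0 <= ~2.0
```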
$$\hat{P} = \frac{\sum_{t=1}^{T} w^t g(x^t)}{\sum_{t=1}^{T} w^t}$$
where $w^t$ is the weight of sample $t$.
– if $X_i$ and $X_j$ are strongly correlated, e.g., $X_i = 0 \Leftrightarrow X_j = 0$, then we cannot explore states with $X_i = 1$ and $X_j = 1$
Blocking: given variables $X$, $Y$, $Z$, each with domain size 2, group $Y$ and $Z$ together to form a variable $W = \{Y, Z\}$ with domain size 4. Then, given sample $(x^t, y^t, z^t)$, compute the next sample as shown in the equations and sketch below.
+ Can improve convergence greatly when two variables are strongly correlated!
– The domain size of the blocked variable grows exponentially with the number of variables in a block!
$$x^{t+1} \sim P(x \mid y^t, z^t)$$
$$(y^{t+1}, z^{t+1}) \sim P(y, z \mid x^{t+1})$$
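A sketch of the blocked draw under these assumptions; the table for P(Y, Z | X) is hypothetical:

```python
import random

# Hypothetical joint conditional P(Y, Z | X) for binary Y, Z,
# indexed by the value of X; each row sums to 1.
p_w_given_x = {
    0: {(0, 0): 0.6, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.1},
    1: {(0, 0): 0.1, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.6},
}

def sample_block(x):
    """One blocked Gibbs step: draw (y, z) jointly from P(Y, Z | X = x)."""
    r, cum = random.random(), 0.0
    for w, p in p_w_given_x[x].items():
        cum += p
        if r < cum:
            return w
    return w

y, z = sample_block(x=1)  # resample the block W = {Y, Z} in one draw
```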
Averaging over $M$ independent chains of $K$ samples each:
$$\hat{P}(x_i \mid e) = \frac{1}{M} \sum_{m=1}^{M} \hat{P}_m(x_i \mid e), \qquad \hat{P}_m(x_i \mid e) = \frac{1}{K} \sum_{t=1}^{K} P(x_i \mid markov_i^t)$$
which converges to $P(x_i \mid e)$ (Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994).
– Blocking
– Rao-Blackwellisation
To speed up convergence of the estimate $\hat{P}(X \mid e)$ to $P(X \mid e)$:
– Reduce dependence between samples
– Reduce variance
Example: collapse $X = \{X_1, X_2, X_3, X_4\}$ onto the subset $\{X_1, X_2\}$, reducing the sampling space from $|D(X)| = 64$ to $|D(\{X_1, X_2\})| = 16$.
[Figure: the cumulative distribution of $P(X_1, X_2, X_3, X_4)$ vs. that of the collapsed $P(X_1, X_2)$ over the joint values 00, 01, 10, 11.]
$$MSE_Q[\hat{P}] = Var_Q[\hat{P}] + BIAS^2$$
$$MSE_Q[\hat{P}] = Var_Q[\hat{P}] + \left( E_Q[\hat{P}] - P \right)^2$$
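A quick numeric check of this decomposition on a synthetic biased estimator (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(1)
P = 0.3                                        # true value
P_hat = P + 0.05 + rng.normal(0, 0.02, 10**6)  # biased, noisy estimates

mse = np.mean((P_hat - P) ** 2)
decomp = P_hat.var() + (P_hat.mean() - P) ** 2  # Var + BIAS^2
print(mse, decomp)                              # agree up to sampling noise
```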
)} ( ~ { ]} | ) ( [ { )} ( { )} ( ˆ { ]} | ) ( [ { )} ( { ]} | ) ( {var[ ]} | ) ( [ { )} ( { ]} | ) ( [ ] | ) ( [ { 1 ) ( ~ )} ( ) ( { 1 ) ( ˆ
1 1
x g Var T l x h E Var T x h Var x g Var l x g E Var x g Var l x g E l x g E Var x g Var l x h E l x h E T x g x h x h T x g L R X
T T
Liu, Ch.2.3
– autocovariances are lower (less correlation between samples)
– if $X_i$ and $X_j$ are strongly correlated, e.g., $X_i = 0 \Leftrightarrow X_j = 0$, collapsing avoids getting trapped in such states
$$Var_Q\left\{ \frac{P(R, L)}{Q(R, L)} \right\} \geq Var_Q\left\{ \frac{P(R)}{Q(R)} \right\}$$
(Liu, Ch. 2.5.5) "Carry out analytical computation as much as possible." – Liu
Three sampling schemes for variables $X$, $Y$, $Z$:
(1) Standard Gibbs: sample from $P(x \mid y, z)$, $P(y \mid x, z)$, $P(z \mid x, y)$
(2) Blocking: sample from $P(x \mid y, z)$, $P(y, z \mid x)$
(3) Collapsed: sample from $P(x \mid y)$, $P(y \mid x)$
Moving from (1) to (3) gives faster convergence.
Generating Samples
$$c_1^{t+1} \sim P(c_1 \mid c_2^t, c_3^t, \dots, c_K^t, e)$$
$$c_2^{t+1} \sim P(c_2 \mid c_1^{t+1}, c_3^t, \dots, c_K^t, e)$$
$$\vdots$$
$$c_K^{t+1} \sim P(c_K \mid c_1^{t+1}, c_2^{t+1}, \dots, c_{K-1}^{t+1}, e)$$
1. For t = 1 to T (generate samples)
2.   For i = 1 to K (loop through cutset variables)
3.     $c_i^{t+1} \leftarrow$ sampled from $P(C_i \mid c^t \setminus c_i, e)$
4.   End For
5. End For
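A minimal sketch of this loop; exact_joint(c, e), standing in for bucket (tree) elimination, is assumed to return P(c, e) for a full cutset assignment, and the cutset variables are assumed binary.

```python
import random

def resample(i, c, e, exact_joint):
    """Draw a new value for cutset variable i from P(C_i | c \\ c_i, e),
    obtained by normalizing the joint P(c, e) over the two values of C_i."""
    w = [exact_joint(dict(c, **{i: v}), e) for v in (0, 1)]
    return 1 if random.random() < w[1] / (w[0] + w[1]) else 0

def cutset_gibbs(cutset_vars, e, exact_joint, T=1000):
    c = {i: random.randint(0, 1) for i in cutset_vars}
    samples = []
    for _ in range(T):
        for i in cutset_vars:  # loop through cutset variables only
            c[i] = resample(i, c, e, exact_joint)
        samples.append(dict(c))
    return samples

# Demo with a dummy joint (a real implementation would call bucket elimination).
demo_joint = lambda c, e: 0.25 + 0.5 * (c["X2"] == c["X5"])
samples = cutset_gibbs(["X2", "X5"], {"X9": 1}, demo_joint, T=100)
```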
Trade-off:
– plain Gibbs generates more samples, but with higher covariance between them
– cutset sampling generates fewer samples, but with lower covariance
Example: cutset $C = \{X_2, X_5\}$, evidence $E = \{X_9\}$.
[Figure: the network over $X_1, \dots, X_9$, and the network remaining after conditioning on the cutset variables $X_2$, $X_5$.]
$P(x_2, x_5, x_9)$ can be computed using Bucket Elimination; the computation complexity is O(N).
To sample a cutset variable, e.g. $X_2$, run Bucket Elimination once for each of its values, BE: $P(x_2, x_3, x_9)$ for every $x_2 \in D(X_2)$, and normalize:
$$P(x_2 \mid x_3, x_9) = \frac{P(x_2, x_3, x_9)}{\sum_{x_2'} P(x_2', x_3, x_9)}$$
For a cutset variable $C_i$:
$$\hat{P}(c_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} P(c_i \mid c^t \setminus c_i, e)$$
computed while generating sample $t$ using bucket tree elimination. For a non-cutset variable $X_i$:
$$\hat{P}(x_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} P(x_i \mid c^t, e)$$
computed after generating sample $t$ using bucket tree elimination.
$$\hat{P}(c_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} P(c_i \mid c^t \setminus c_i, e), \qquad \forall c_i \in D(C_i)$$
Example with three samples $c^1, c^2, c^3$ of the cutset:
$$\hat{P}(x_2 \mid x_9) = \frac{1}{3} \left[ P(x_2 \mid x_5^1, x_9) + P(x_2 \mid x_5^2, x_9) + P(x_2 \mid x_5^3, x_9) \right]$$
For a non-cutset variable, e.g. $X_3$:
$$c^1 = \{x_2^1, x_5^1\}: \quad P(x_3 \mid x_2^1, x_5^1, x_9)$$
$$c^2 = \{x_2^2, x_5^2\}: \quad P(x_3 \mid x_2^2, x_5^2, x_9)$$
$$c^3 = \{x_2^3, x_5^3\}: \quad P(x_3 \mid x_2^3, x_5^3, x_9)$$
$$\hat{P}(x_3 \mid x_9) = \frac{1}{3} \left[ P(x_3 \mid x_2^1, x_5^1, x_9) + P(x_3 \mid x_2^2, x_5^2, x_9) + P(x_3 \mid x_2^3, x_5^3, x_9) \right]$$
CPCS54 results: MSE vs. #samples (left) and time (right). Ergodic network, |X| = 54, D(Xi) = 2, |C| = 15, |E| = 3. Exact time = 30 sec using cutset conditioning.
[Plots: CPCS54, n=54, |C|=15, |E|=3; MSE vs. # samples and MSE vs. time (sec), comparing Cutset and Gibbs.]
CPCS179 results: MSE vs. #samples (left) and time (right). Non-ergodic (one deterministic CPT entry), |X| = 179, |C| = 8, 2 ≤ D(Xi) ≤ 4, |E| = 35. Exact time = 122 sec using cutset conditioning.
[Plots: CPCS179, n=179, |C|=8, |E|=35; MSE vs. # samples and MSE vs. time (sec), comparing Cutset and Gibbs.]
CPCS360b results: MSE vs. #samples (left) and time (right). Ergodic, |X| = 360, D(Xi) = 2, |C| = 21, |E| = 36. Exact time > 60 min using cutset conditioning; exact values obtained via Bucket Elimination.
[Plots: CPCS360b, n=360, |C|=21, |E|=36; MSE vs. # samples and MSE vs. time (sec), comparing Cutset and Gibbs.]
Random networks: MSE vs. #samples (left) and time (right). |X| = 100, D(Xi) = 2, |C| = 13, |E| = 15-20. Exact time = 30 sec using cutset conditioning.
[Plots: RANDOM, n=100, |C|=13, |E|=15-20; MSE vs. # samples and MSE vs. time (sec), comparing Cutset and Gibbs.]
Cutset Transforms Non-Ergodic Chain to Ergodic
Coding networks: MSE vs. time. Non-ergodic, |X| = 100, D(Xi) = 2, |C| = 13-16, |E| = 50. Sample the ergodic subspace U = {U1, U2, …, Uk}. Exact time = 50 sec using cutset conditioning.
[Figure: coding network structure over inputs u1-u4, parity bits p1-p4, and transmitted bits x1-x4, y1-y4.]
[Plot: Coding networks, n=100, |C|=12-14; MSE vs. time (sec), comparing IBP, Gibbs, and Cutset.]
HailFinder: MSE vs. #samples (left) and time (right). Non-ergodic, |X| = 56, |C| = 5, 2 ≤ D(Xi) ≤ 11. Exact time = 2 sec using loop-cutset conditioning.
[Plots: HailFinder, n=56, |C|=5, |E|=1; MSE vs. time (sec) and MSE vs. # samples, comparing Cutset and Gibbs.]
cpcs360b w-cutset results: MSE vs. time. Ergodic, |X| = 360, D(Xi) = 2, |E| = 20-34, w* = 20. Exact time = 50 min using BTE.
[Plot: cpcs360b, N=360; MSE vs. time (sec), comparing Gibbs, IBP, and w-cutset sampling with |C|=26, fw=3 and |C|=48, fw=2.]
$$\hat{P}(e) = \frac{1}{T} \sum_{t=1}^{T} \frac{P(c^t, e)}{Q(c^t)} = \frac{1}{T} \sum_{t=1}^{T} w^t$$
$$\hat{P}(c_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} \delta(c_i, c_i^t) \, w^t$$
$$\hat{P}(x_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} P(x_i \mid c^t, e) \, w^t$$
where $P(c^t, e)$ is computed using Bucket Elimination.
(Gogate & Dechter, 2005) and (Bidyuk & Dechter, 2006)
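A sketch of assembling these estimates from sampled weights; the arrays below are hypothetical placeholders for quantities that bucket elimination would produce.

```python
import numpy as np

# Hypothetical per-sample quantities from bucket elimination:
w = np.array([0.8, 1.3, 0.5, 1.1])         # w^t = P(c^t, e) / Q(c^t)
p_xi = np.array([0.42, 0.40, 0.47, 0.39])  # P(x_i | c^t, e)

P_e = w.mean()                             # (1/T) sum_t w^t  estimates P(e)
# Dividing by sum(w) instead of T gives the self-normalized estimate of P(x_i | e).
P_xi_e = (w * p_xi).sum() / w.sum()
print(P_e, P_xi_e)
```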
For i = 1 to K do:
  If $Z_i \in E$: set $z_i^t \leftarrow$ the observed value of $Z_i$
  Else: sample $z_i^t \sim P(Z_i \mid z_1^t, \dots, z_{i-1}^t)$
  End If
End For
The conditional $P(Z_i \mid z_1^t, \dots, z_{i-1}^t)$ is computed while generating sample $t$ using bucket tree elimination; the number of instances K is chosen based on the memory available.
KL[P(C|e), Q(C)] ≤ KL[P(X|e), Q(X)]
In the presence of constraints (determinism):
– importance sampling generates zero-weight samples
– convergence is slower
Importance Sampling vs. Gibbs Sampling
cpcs360b, N = 360, |LC| = 26, w* = 21, |E| = 15.
[Plot: MSE vs. time (sec), comparing LW, AIS-BN, Gibbs, LCS, and IBP.]
LW = likelihood weighting; LCS = likelihood weighting on a cutset.
cpcs422b, N = 422, |LC| = 47, w* = 22, |E| = 28.
[Plot: MSE vs. time (sec), comparing LW, AIS-BN, Gibbs, LCS, and IBP.]
LW = likelihood weighting; LCS = likelihood weighting on a cutset.
coding, N = 200, P = 3, |LC| = 26, w* = 21.
[Plot: MSE vs. time (sec), comparing LW, AIS-BN, Gibbs, LCS, and IBP.]
LW = likelihood weighting; LCS = likelihood weighting on a cutset.