Overview
- 1. Probabilistic Reasoning/Graphical models
- 2. Importance Sampling
- 3. Markov Chain Monte Carlo: Gibbs Sampling
- 4. Sampling in presence of Determinism
- 5. Rao-Blackwellisation
- 6. AND/OR importance sampling
– Bayesian networks, constraint networks, mixed networks
– using inference, search, and hybrids
– tree-width, cycle-cutset, w-cutset
P(e) = Σ_{X\E} Π_{i=1}^{n} P(x_i | pa_i)|_{E=e}

P(x_i | e) = P(x_i, e) / P(e) = [ Σ_{X\(X_i ∪ E)} Π_{j=1}^{n} P(x_j | pa_j)|_{E=e} ] / [ Σ_{X\E} Π_{j=1}^{n} P(x_j | pa_j)|_{E=e} ]
If the graph is dense (high treewidth), then exact inference becomes infeasible and we resort to approximation by sampling.
A sample is an assignment to all the variables: x^t = (x_1^t, x_2^t, …, x_n^t)
To sample a binary variable X with P(X=0) = 0.3:
– draw a random number r ∈ [0, 1]
– if r < 0.3, set X = 0
– else set X = 1
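A minimal sketch of this step in Python (the 0.3 threshold follows the bullets above; the function name is ours):

```python
import random

def sample_binary(p0=0.3):
    """Sample X with P(X=0)=p0, P(X=1)=1-p0 by inverting the CDF."""
    r = random.random()        # r uniform in [0, 1)
    return 0 if r < p0 else 1
```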
P(X) = P(X_1) P(X_2 | X_1) ⋯ P(X_n | X_1, …, X_{n−1})
– Likelihood Sampling
– Choosing a Proposal Distribution
– Metropolis-Hastings
– Gibbs sampling
Input: Bayesian network X = {X1, …, XN}, N - #nodes, T - #samples
Output: T samples
Process nodes in topological order – first process the ancestors of a node, then the node itself:
1. For t = 1 to T
2.   For i = 1 to N
3.     x_i^t ← sampled from P(X_i | pa_i)
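A runnable sketch of this loop, assuming a simple CPT representation (each CPT maps a tuple of parent values to a {value: probability} table; all names are ours):

```python
import random

def sample_from(dist):
    """Draw a value from a {value: probability} table by inverting the CDF."""
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r < acc:
            return value
    return value  # guard against floating-point round-off

def logic_sample(order, parents, cpt):
    """One forward sample; `order` must be topological."""
    x = {}
    for X in order:
        pa_vals = tuple(x[U] for U in parents[X])  # ancestors already sampled
        x[X] = sample_from(cpt[X][pa_vals])
    return x

# Toy network X1 -> X2
parents = {"X1": [], "X2": ["X1"]}
cpt = {"X1": {(): {0: 0.3, 1: 0.7}},
       "X2": {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.4, 1: 0.6}}}
samples = [logic_sample(["X1", "X2"], parents, cpt) for t in range(1000)]
```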
Example: P(X1, X2, X3, X4) = P(X1) P(X2|X1) P(X3|X1) P(X4|X2,X3)

// No evidence: generate sample k
1. Sample x1 from P(X1)
2. Sample x2 from P(X2 | X1 = x1)
3. Sample x3 from P(X3 | X1 = x1)
4. Sample x4 from P(X4 | X2 = x2, X3 = x3)
Input: Bayesian network X = {X1, …, XN}, N - #nodes, E - evidence, T - #samples
Output: T samples consistent with E
1. For t = 1 to T
2.   For i = 1 to N
3.     x_i^t ← sampled from P(X_i | pa_i)
4.     If X_i ∈ E and x_i^t ≠ e_i, reject sample:
5.       Goto Step 1
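A sketch of the rejection step, reusing `sample_from` and the CPT representation from the logic-sampling sketch above (`evidence` is an assumed dict mapping variables to observed values):

```python
def rejection_sample(order, parents, cpt, evidence, T):
    """Generate T forward samples consistent with `evidence`."""
    accepted = []
    while len(accepted) < T:
        x, ok = {}, True
        for X in order:
            pa_vals = tuple(x[U] for U in parents[X])
            x[X] = sample_from(cpt[X][pa_vals])
            if X in evidence and x[X] != evidence[X]:
                ok = False  # disagrees with evidence: reject and
                break       # go back to step 1 with a fresh sample
        if ok:
            accepted.append(x)
    return accepted
```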
Evidence: X3 = x3
// generate sample k
1. Sample x1 from P(X1)
2. Sample x2 from P(X2 | x1)
3. Sample x3 from P(X3 | x1)
4. If the sampled x3 ≠ the evidence value, reject the sample and start from 1
5. Sample x4 from P(X4 | x2, x3)
Expected value: Given a probability distribution P(X) and a function g(X) defined over a set of variables X = {X1, X2, …, Xn}, the expected value of g w.r.t. P is
E_P[g(X)] = Σ_x g(x) P(x)
Variance: The variance of g w.r.t. P is:
Var_P[g(X)] = Σ_x (g(x) − E_P[g(X)])² P(x)
– An estimator is a function of the samples.
– It produces an estimate of the unknown parameter of the sampling distribution.
ĝ_T = (1/T) Σ_{t=1}^{T} g(x^t), where x^1, x^2, …, x^T are samples drawn from P
– A distribution P(X) = (0.3, 0.7).
– g(X) = 40 if X = 0; 50 if X = 1.
ĝ = [ 40 · #(samples X=0) + 50 · #(samples X=1) ] / #samples = (40·4 + 50·6) / 10 = 46
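The same arithmetic as a quick Python check (the counts 4 and 6 will vary from run to run; the exact expectation is 0.3·40 + 0.7·50 = 47):

```python
import random

P0, g = 0.3, {0: 40, 1: 50}
samples = [0 if random.random() < P0 else 1 for t in range(10)]
g_hat = sum(g[x] for x in samples) / len(samples)
# e.g. with 4 samples of X=0 and 6 of X=1: (40*4 + 50*6)/10 = 46
```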
Let Z = X\E. Let Q(Z) be a (proposal) distribution satisfying
P(z, e) > 0 ⇒ Q(z) > 0
Then we can rewrite P(e) as:
P(e) = Σ_z P(z, e) = Σ_z [ P(z, e) / Q(z) ] Q(z) = E_Q[ P(z, e) / Q(z) ] = E_Q[ w(Z) ], where w(z) = P(z, e) / Q(z)
Monte Carlo estimate:
P̂(e) = (1/T) Σ_{t=1}^{T} w(z^t), z^t ~ Q(Z)
P̂(e) → P(e) a.s. as T → ∞
Var_Q[ P̂(e) ] = Var_Q[ (1/T) Σ_{t=1}^{T} w(z^t) ] = (1/T) Var_Q[ w(Z) ]
Var_Q[ w(Z) ] = E_Q[ w(Z)² ] − ( E_Q[ w(Z) ] )²
This quantity enclosed in the brackets is zero because the expected value of the estimator equals the expected value of g(x)
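A generic sketch of the estimator (the three callables are assumptions: a sampler for Q, its probability function, and the joint P(z, e)):

```python
import random

def importance_sample_Pe(T, sample_Q, Q_prob, P_joint_e):
    """Estimate P(e) = E_Q[P(z,e)/Q(z)] from T samples z ~ Q."""
    total = 0.0
    for t in range(T):
        z = sample_Q()                     # z ~ Q
        total += P_joint_e(z) / Q_prob(z)  # weight w(z) = P(z,e)/Q(z)
    return total / T

# Toy check: Z in {0,1}, P(z,e) = (0.06, 0.14), so P(e) = 0.2; Q uniform.
Pze = {0: 0.06, 1: 0.14}
est = importance_sample_Pe(10000,
                           lambda: random.randint(0, 1),
                           lambda z: 0.5,
                           lambda z: Pze[z])   # est ≈ 0.2
```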
Idea: estimate numerator and denominator by IS.
P(x_i | e) = P(x_i, e) / P(e) = Σ_z δ_{x_i}(z) P(z, e) / Σ_z P(z, e) = E_Q[ δ_{x_i}(Z) w(Z) ] / E_Q[ w(Z) ]
Ratio estimate:
P̂(x_i | e) = P̂(x_i, e) / P̂(e) = Σ_{k=1}^{T} δ_{x_i}(z^k) w(z^k) / Σ_{k=1}^{T} w(z^k)
where δ_{x_i}(z) is a delta function which is 1 if z contains x_i and 0 otherwise.
Estimate is biased: E[ P̂(x_i | e) ] ≠ P(x_i | e)
– Harder to analyze
– Liu suggests a measure called “Effective sample size”
P̂(x_i | e) → P(x_i | e) as T → ∞
lim_{T→∞} E[ P̂(x_i | e) ] = P(x_i | e)
– Q(Z) = Q(Z1) × Q(Z2|Z1) × … × Q(Zn|Z1,…,Zn−1)
– Z1 ~ Q(Z1) = (0.2, 0.8)
– Z2 ~ Q(Z2|Z1) = (0.1, 0.9, 0.2, 0.8)
– Z3 ~ Q(Z3|Z1,Z2) = Q(Z3) = (0.5, 0.5)
(Fung and Chang, 1990; Shachter and Peot, 1990)
Works well for likely evidence!
“Clamping” evidence + logic sampling + weighing samples by evidence likelihood is an instance of importance sampling!
Sample in topological order over X! Clamp the evidence; sample x_i ~ P(X_i | pa_i); P(X_i | pa_i) is a look-up in the CPT!
Proposal: Q(X\E) = Π_{X_i ∈ X\E} P(X_i | pa_i, e)
Weights: given a sample x = (x_1, …, x_n),
w(x, e) = P(x, e) / Q(x) = [ Π_{X_i ∈ X\E} P(x_i | pa_i, e) · Π_{E_j ∈ E} P(e_j | pa_j) ] / Π_{X_i ∈ X\E} P(x_i | pa_i, e) = Π_{E_j ∈ E} P(e_j | pa_j)
Example: Given a Bayesian network P(X1, X2, X3) = P(X1) P(X2|X1) P(X3|X1,X2) and evidence X2 = x2:
Q(X1, X3) = P(X1) P(X3|X1, x2). Given a sample (x1, x3), w = P(x2 | x1).
Notice: Q is another Bayesian network
Estimate P(e):
P̂(e) = (1/T) Σ_{t=1}^{T} w^t
Estimate posterior marginals (ratio estimate):
P̂(x_i | e) = P̂(x_i, e) / P̂(e) = Σ_{t=1}^{T} w^t δ_{x_i}(x^t) / Σ_{t=1}^{T} w^t
where δ_{x_i}(x^t) is 1 if x^t contains x_i and 0 otherwise
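A sketch of likelihood weighting with both estimates, reusing `sample_from` and the CPT representation from the logic-sampling sketch (names are ours):

```python
def likelihood_weighting(order, parents, cpt, evidence, T):
    """Return T (sample, weight) pairs; evidence is clamped, not sampled."""
    out = []
    for t in range(T):
        x, w = {}, 1.0
        for X in order:
            pa_vals = tuple(x[U] for U in parents[X])
            if X in evidence:
                x[X] = evidence[X]
                w *= cpt[X][pa_vals][x[X]]  # multiply in P(e_j | pa_j)
            else:
                x[X] = sample_from(cpt[X][pa_vals])
        out.append((x, w))
    return out

def marginal(weighted, X, value):
    """Ratio estimate of P(X=value | e)."""
    num = sum(w for x, w in weighted if x[X] == value)
    return num / sum(w for x, w in weighted)
```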
Var_Q[ P̂(e) ] = (1/T) Var_Q[ w(Z) ] = (1/T) [ Σ_z P(z, e)² / Q(z) − P(e)² ]
To have a zero-variance estimator, set Q(z) = P(z, e) / P(e) = P(z | e)
– Run Bucket Elimination on the problem along an ordering.
– Sample along the reverse ordering: (X1, …, XN).
– At each variable Xi, recover the probability P(Xi | x1, …, xi−1) by referring to its bucket.
Query: P(a | e) ∝ P(a, e)
P(a, e) = Σ_{c,b,d} P(a) P(b|a) P(c|a) P(e|b,c) P(d|a,b)
        = P(a) Σ_c P(c|a) Σ_b P(b|a) P(e|b,c) Σ_d P(d|a,b)
Elimination order: d, e, b, c

Bucket D: P(d|a,b)                    →  f_D(a,b) = Σ_d P(d|a,b)
Bucket E: P(e|b,c)                    →  f_E(b,c) = P(e|b,c)  (evidence e fixed)
Bucket B: P(b|a), f_D(a,b), f_E(b,c)  →  f_B(a,c) = Σ_b P(b|a) f_D(a,b) f_E(b,c)
Bucket C: P(c|a), f_B(a,c)            →  f_C(a) = Σ_c P(c|a) f_B(a,c)
Bucket A: P(a), f_C(a)                →  P(a, e) = P(a) f_C(a)

Bucket Tree: clusters (D,A,B), (E,B,C), (B,A,C), (C,A), (A) connected by the messages f_D(a,b), f_E(b,c), f_B(a,c), f_C(a)
Original functions + messages; time and space exp(w*)
Algorithm elim-bel (Dechter 1996)
Elimination operator: Σ (sum out the bucket’s variable)
bucket B: P(B|A), P(D|B,A), P(e|B,C)  →  h^B(A,D,C,e)
bucket C: P(C|A), h^B(A,D,C,e)        →  h^C(A,D,e)
bucket D: h^C(A,D,e)                  →  h^D(A,e)
bucket E: h^D(A,e)                    →  h^E(A)
bucket A: P(A), h^E(A)                →  P(e)
(Dechter 2002)
bucket B: P(B|A), P(D|B,A), P(e|B,C)
bucket C: P(C|A), h^B(A,D,C,e)
bucket D: h^C(A,D,e)
bucket E: h^D(A,e)   (evidence bucket: ignore when sampling)
bucket A: P(A), h^E(A)

Sample top-down, one bucket at a time:
Q(A) ∝ P(A) h^E(A): sample A = a from Q(A)
Set A = a in the bucket: sample D = d from Q(D | a, e) ∝ h^C(a, D, e)
Set A = a, D = d in the bucket: sample C = c from Q(C | a, d, e) ∝ P(C|a) h^B(a, d, C, e)
Set A = a, D = d, C = c in the bucket: sample B = b from Q(B | a, d, c, e) ∝ P(B|a) P(d|B, a) P(e|B, c)
Mini-buckets: space and time constraints require the maximum scope size of any new function to be bounded by 2. BE would generate a function of scope size 3, so it cannot be used. Bucket B is split into two mini-buckets, each summed over B separately:
bucket B: { P(B|A), P(D|B,A) } ‖ { P(e|B,C) }  →  h^B(A,D), h^B(C,e)
bucket C: P(C|A), h^B(C,e)  →  h^C(A,e)
bucket D: h^B(A,D)          →  h^D(A)
bucket E: h^C(A,e)          →  h^E(A)
bucket A: P(A), h^D(A), h^E(A)  →  approximation of P(e)
Same mini-bucket partitioning as above (bucket B split into { P(B|A), P(D|B,A) } and { P(e|B,C) }, producing h^B(A,D) and h^B(C,e), then h^C(A,e) and h^D(A)). Sampling is the same as in BE-sampling, except that now we construct Q from a randomly selected “mini-bucket”.
– A Generalized Belief Propagation scheme (Yedidia et al., 2002)
– (Dechter, Kask and Mateescu, 2002)
– Mini-buckets
– IJGP
– Both
– Some generated assignments are non-solutions
Initial proposal:
Q^0(Z) = Q(Z_1) Q(Z_2 | pa(Z_2)) ⋯ Q(Z_n | pa(Z_n))
For i = 1 to k do
  Generate samples z^1, …, z^N from Q^{i−1}(Z)
  P̂(E = e) = (1/N) Σ_{j=1}^{N} w(z^j)
  Update Q: Q^i ← Q', re-estimated from the current weighted samples
End For
Return P̂(E = e)
A Markov chain is a sequence of states with the property that the next state depends only on the current state (Markov property):
x^1 → x^2 → x^3 → x^4 → ⋯
P(x^{t+1} | x^t, x^{t−1}, …, x^1) = P(x^{t+1} | x^t)
If the transition probability P(x^{t+1} | x^t) is the same for all t (the chain is homogeneous) and the state space is finite, then the chain is characterized by a transition matrix.
[Figure: example chains — one with states {1, 2, 3} and a rain/sun weather chain — each with its transition matrix P(X).]
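A minimal simulation of a homogeneous two-state chain (the transition probabilities are made up); the empirical state frequencies approach the stationary distribution, here (0.5, 0.5) by symmetry:

```python
import random

trans = {"rain": {"rain": 0.7, "sun": 0.3},   # rows of the transition matrix
         "sun":  {"rain": 0.3, "sun": 0.7}}

def step(state):
    r, acc = random.random(), 0.0
    for nxt, p in trans[state].items():
        acc += p
        if r < acc:
            return nxt
    return nxt

state, counts = "sun", {"rain": 0, "sun": 0}
for t in range(100000):
    state = step(state)
    counts[state] += 1
# counts["rain"]/100000 ≈ 0.5, regardless of the start state
```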
A state of the chain is an assignment to all the variables: x^t = (x_1^t, x_2^t, …, x_n^t); a transition x^t → x^{t+1} moves to the next full assignment x^{t+1} = (x_1^{t+1}, x_2^{t+1}, …, x_n^{t+1}).
[Figure: variables X1, X2, X3 at times t and t+1.]
– The chain is irreducible: every state x_j ∈ D(X) is reachable from every state x_i
– All of its states are positive recurrent
(Liu, Ch. 12, pp. 249, Def. 12.1.1)
The recurrent states in a finite-state chain are positive recurrent.
π(x) = lim_{n→∞} P^{(n)}(x)
(the limit is independent of the starting state)
visited states x^0, …, x^n can be viewed as “samples” from the stationary distribution π
ĝ = (1/T) Σ_{t=1}^{T} g(x^t)
x_i^{t+1} ← P(X_i | x_1^{t+1}, …, x_{i−1}^{t+1}, x_{i+1}^t, …, x_n^t) = P(X_i | x^t \ x_i)
The process of Gibbs sampling can be understood as a random walk in the space of all instantiations of X=x (remember drunkard’s walk): In one step we can reach instantiations that differ from current one by value assignment to at most one variable (assume randomized choice of variables Xi).
Process all variables in some order:
x_1^{t+1} ← P(X_1 | x_2^t, x_3^t, …, x_N^t)
x_2^{t+1} ← P(X_2 | x_1^{t+1}, x_3^t, …, x_N^t)
x_3^{t+1} ← P(X_3 | x_1^{t+1}, x_2^{t+1}, x_4^t, …, x_N^t)
⋯
x_N^{t+1} ← P(X_N | x_1^{t+1}, …, x_{N−1}^{t+1})
In short: x_i^{t+1} ← P(X_i | x^t \ x_i)
Markov blanket:
markov_i = pa_i ∪ ch_i ∪ { pa_j : X_j ∈ ch_i }
P(X_i | x^t \ x_i) = P(X_i | markov_i^t):
P(x_i | x^t \ x_i) ∝ P(x_i | pa_i) Π_{X_j ∈ ch_i} P(x_j | pa_j)
Given Markov blanket (parents, children, and their parents), Xi is independent of all other nodes
Computation is linear in the size of the Markov blanket!
1. For t = 1 to T (compute samples)
2.   For i = 1 to N (loop through variables)
3.     x_i^{t+1} ← sampled from P(X_i | markov_i^t)
4.   End For
5. End For
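A sketch of the ordered Gibbs sampler, reusing `sample_from` and the CPT representation from the earlier sketches; `children[X]` (a list of X’s children) and `domain[X]` are assumed inputs:

```python
def gibbs_conditional(X, x, parents, children, cpt, domain):
    """P(X | markov_X): P(X|pa_X) times the product over X's children,
    normalized over X's values (all other variables in x held fixed)."""
    scores = {}
    for v in domain[X]:
        x[X] = v
        s = cpt[X][tuple(x[U] for U in parents[X])][v]
        for C in children[X]:
            s *= cpt[C][tuple(x[U] for U in parents[C])][x[C]]
        scores[v] = s
    z = sum(scores.values())
    return {v: s / z for v, s in scores.items()}

def gibbs(x0, free_vars, parents, children, cpt, domain, T):
    """Ordered Gibbs; x0 must include the (fixed) evidence values."""
    x, chain = dict(x0), []
    for t in range(T):
        for X in free_vars:  # evidence variables are not resampled
            x[X] = sample_from(gibbs_conditional(X, x, parents, children, cpt, domain))
        chain.append(dict(x))
    return chain
```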
[Figure: network over X1, …, X9]
X = {X1, X2, …, X9}, E = {X9}
Initialize: X1 = x1, X2 = x2, X3 = x3, X4 = x4, X5 = x5, X6 = x6, X7 = x7, X8 = x8
[Figure: network over X1, …, X9]
X = {X1, X2, …, X9}, E = {X9}
x_1^{t+1} ← P(X1 | x_2^t, …, x_8^t, x_9)
x_2^{t+1} ← P(X2 | x_1^{t+1}, x_3^t, …, x_8^t, x_9)
Histogram estimator: P̂(x_i | e) = (1/T) Σ_{t=1}^{T} δ_{x_i}(x^t)   (Dirac delta f-n)
Mixture estimator: P̂(x_i | e) = (1/T) Σ_{t=1}^{T} P(x_i | markov_i^t)
(the mixture estimator gives smoother estimates for the unobserved values of Xi; prove via the Rao-Blackwell theorem)
Rao-Blackwell Theorem: Let random variable set X be composed of two groups of variables, R and L. Then, for the joint distribution π(R, L) and a function g, the following result applies: Var{ E[ g(X) | L ] } ≤ Var{ g(X) }.
for a function of interest g, e.g., the mean or covariance (Casella&Robert,1996, Liu et. al. 1995).
ĝ = Σ_{t=1}^{T} w^t E[ g(X) | l^t ] / Σ_{t=1}^{T} w^t
– if Xi and Xj are strongly correlated, Xi = 0 ⇔ Xj = 0, then we cannot explore states with Xi = 1 and Xj = 1
Example: given binary variables X, Y, Z (each of domain size 2), group Y and Z together to form a variable W = {Y, Z} with domain size 4. Then, given sample (x^t, y^t, z^t), compute the next sample:
x^{t+1} ← P(X | y^t, z^t) = P(X | w^t)
w^{t+1} = (y^{t+1}, z^{t+1}) ← P(Y, Z | x^{t+1})
+ Can improve convergence greatly when two variables are strongly correlated!
− The domain size grows exponentially with the #variables in a block!
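A toy sketch of the blocked updates over a made-up joint table P(X, Y, Z) (reusing `sample_from`; all numbers are ours). The point is that Y and Z are drawn jointly:

```python
import itertools

def conditional(P, fixed, free):
    """Normalize the joint table over the `free` variables, holding `fixed`."""
    scores = {}
    for vals in itertools.product([0, 1], repeat=len(free)):
        a = dict(fixed, **dict(zip(free, vals)))
        scores[vals] = P[(a["X"], a["Y"], a["Z"])]
    z = sum(scores.values())
    return {k: v / z for k, v in scores.items()}

# Made-up joint over three binary variables (sums to 1)
P = dict(zip(itertools.product([0, 1], repeat=3),
             [0.05, 0.05, 0.1, 0.1, 0.1, 0.1, 0.2, 0.3]))

x, y, z = 0, 0, 0
for t in range(1000):
    dx = conditional(P, {"Y": y, "Z": z}, ["X"])
    x = sample_from({v[0]: p for v, p in dx.items()})           # x ~ P(X | y, z)
    (y, z) = sample_from(conditional(P, {"X": x}, ["Y", "Z"]))  # w = (y,z) ~ P(Y,Z | x)
```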
Run M chains of K samples each; average within and across chains:
P̂_m(x_i | e) = (1/K) Σ_{t=1}^{K} P(x_i | markov_i^t)   (chain m)
P̂(x_i | e) = (1/M) Σ_{m=1}^{M} P̂_m(x_i | e)
(Gelfand and Smith, 1990, Smith and Roberts, 1993, Tierney, 1994)
which converges to the exact posterior P(x_i | e).
– Blocking
– Rao-Blackwellised
– Reduce dependence between samples
– Reduce variance
[Figure: sampling by inverting the cumulative distribution. The full joint P(X1, X2, X3, X4) partitions [0, 1] into many small intervals (0-0.1, 0.1-0.2, 0.2-0.26, …), while the collapsed P(X1, X2) partitions it into fewer, larger ones.]
MSE_Q[ P̂ ] = E_Q[ (P̂ − P)² ] = Bias²_Q[ P̂ ] + Var_Q[ P̂ ]
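For completeness, the standard decomposition behind this identity (a worked derivation, not from the slides):

```latex
\begin{align*}
\mathrm{MSE}_Q[\hat{P}]
  &= E_Q\big[(\hat{P}-P)^2\big]
   = E_Q\big[(\hat{P}-E_Q[\hat{P}] + E_Q[\hat{P}]-P)^2\big] \\
  &= \underbrace{E_Q\big[(\hat{P}-E_Q[\hat{P}])^2\big]}_{\mathrm{Var}_Q[\hat{P}]}
   + \underbrace{\big(E_Q[\hat{P}]-P\big)^2}_{\mathrm{Bias}^2}
   + 2\big(E_Q[\hat{P}]-P\big)\underbrace{E_Q\big[\hat{P}-E_Q[\hat{P}]\big]}_{=0}.
\end{align*}
```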
Let X = (R, L) and compare the two estimators
ĝ = (1/T) Σ_{t=1}^{T} g(x^t)   and   g̃ = (1/T) Σ_{t=1}^{T} E[ g(x) | l^t ]
Since Var{ g(x) } = Var{ E[g(x) | l] } + E{ var[g(x) | l] } ≥ Var{ E[g(x) | l] },
we get Var{ g̃ } = (1/T) Var{ E[g(x) | l] } ≤ (1/T) Var{ g(x) } = Var{ ĝ }
(Liu, Ch. 2.3)
– autocovariances are lower (less correlation between samples)
– if Xi and Xj are strongly correlated, Xi = 0 ⇔ Xj = 0, sampling in the lower-dimensional space is less likely to get trapped
Var_Q{ P(R, L) / Q(R, L) } ≥ Var_Q{ P(R) / Q(R) }
(Liu, Ch. 2.5.5) “Carry out analytical computation as much as possible” - Liu
Faster Convergence
Generating Samples
c_1^{t+1} ← P(C_1 | c_2^t, c_3^t, …, c_K^t, e)
c_2^{t+1} ← P(C_2 | c_1^{t+1}, c_3^t, …, c_K^t, e)
⋯
c_K^{t+1} ← P(C_K | c_1^{t+1}, …, c_{K−1}^{t+1}, e)

1. For t = 1 to T (compute samples)
2.   For i = 1 to K (loop through cutset variables)
3.     c_i^{t+1} ← sampled from P(C_i | c^t \ c_i, e)
4.   End For
5. End For
– plain Gibbs generates more samples, but with higher covariance between them
– cutset sampling generates fewer samples, but with lower covariance
c = {x2, x5}, E = {X9}
[Figure: network over X1, …, X9; instantiating the cutset {X2, X5} together with the evidence X9 cuts all the loops]
P(x2, x5, x9) can be computed using Bucket Elimination; the computation complexity is O(N).
[Figure: network over X1, …, X9]
Run BE once for each value of X2:
BE: P(x2, x3, x9)
BE: P(x2^1, x3, x9)
and normalize: P(x2 | x3, x9) ∝ P(x2, x3, x9), e.g.
P(x2^1 | x3, x9) = P(x2^1, x3, x9) / Σ_{x2} P(x2, x3, x9)
P̂(c_i | e) = (1/T) Σ_{t=1}^{T} P(c_i | c^t \ c_i, e)   — computed while generating sample t using bucket tree elimination
P̂(x_i | e) = (1/T) Σ_{t=1}^{T} P(x_i | c^t, e)   — computed after generating sample t using bucket tree elimination
P̂(c_i | e) = (1/T) Σ_{t=1}^{T} P(c_i | c^t \ c_i, e),   ∀ c_i ∈ D(C_i)
[Figure: network over X1, …, X9]
Samples 1, 2, 3 instantiate the other cutset variable, giving x_5^1, x_5^2, x_5^3; then
P̂(x2 | x9) = (1/3) [ P(x2 | x_5^1, x9) + P(x2 | x_5^2, x9) + P(x2 | x_5^3, x9) ]
c^1 = {x_2^1, x_5^1}: P(x3 | x_2^1, x_5^1, x9)
c^2 = {x_2^2, x_5^2}: P(x3 | x_2^2, x_5^2, x9)
c^3 = {x_2^3, x_5^3}: P(x3 | x_2^3, x_5^3, x9)
[Figure: network over X1, …, X9]
P̂(x3 | x9) = (1/3) [ P(x3 | x_2^1, x_5^1, x9) + P(x3 | x_2^2, x_5^2, x9) + P(x3 | x_2^3, x_5^3, x9) ]
MSE vs. #samples (left) and time (right) Ergodic, |X|=54, D(Xi)=2, |C|=15, |E|=3 Exact Time = 30 sec using Cutset Conditioning
[Plots: CPCS54, n=54, |C|=15, |E|=3 — MSE vs. #samples and MSE vs. time (sec), comparing Cutset and Gibbs.]
MSE vs. #samples (left) and time (right) Non-Ergodic (1 deterministic CPT entry) |X| = 179, |C| = 8, 2<= D(Xi)<=4, |E| = 35 Exact Time = 122 sec using Cutset Conditioning
[Plots: CPCS179, n=179, |C|=8, |E|=35 — MSE vs. #samples and MSE vs. time (sec), comparing Cutset and Gibbs.]
MSE vs. #samples (left) and time (right) Ergodic, |X| = 360, D(Xi)=2, |C| = 21, |E| = 36 Exact Time > 60 min using Cutset Conditioning Exact Values obtained via Bucket Elimination
[Plots: CPCS360b, n=360, |C|=21, |E|=36 — MSE vs. #samples and MSE vs. time (sec), comparing Cutset and Gibbs.]
MSE vs. #samples (left) and time (right) |X| = 100, D(Xi) =2,|C| = 13, |E| = 15-20 Exact Time = 30 sec using Cutset Conditioning
[Plots: RANDOM, n=100, |C|=13, |E|=15-20 — MSE vs. #samples and MSE vs. time (sec), comparing Cutset and Gibbs.]
Cutset Transforms Non-Ergodic Chain to Ergodic
MSE vs. time (right) Non-Ergodic, |X| = 100, D(Xi)=2, |C| = 13-16, |E| = 50 Sample Ergodic Subspace U={U1, U2,…Uk} Exact Time = 50 sec using Cutset Conditioning
[Figure: coding network with layers u, x, p, y. Plot: Coding Networks, n=100, |C|=12-14 — MSE vs. time (sec), comparing IBP, Gibbs, and Cutset.]
MSE vs. #samples (left) and time (right) Non-Ergodic, |X| = 56, |C| = 5, 2 <=D(Xi) <=11, |E| = 0 Exact Time = 2 sec using Loop-Cutset Conditioning
[Plots: HailFinder, n=56, |C|=5, |E|=1 — MSE vs. time (sec) and MSE vs. #samples, comparing Cutset and Gibbs.]
[Plot: cpcs360b, N=360, |E|=[20-34], w*=20 — MSE vs. time (sec), comparing Gibbs, IBP, and cutset sampling with |C|=26 (fw=3) and |C|=48 (fw=2).]
MSE vs. Time Ergodic, |X| = 360, |C| = 26, D(Xi)=2 Exact Time = 50 min using BTE
P̂(e) = (1/T) Σ_{t=1}^{T} P(c^t, e) / Q(c^t) = (1/T) Σ_{t=1}^{T} w^t
P̂(c_i | e) ∝ (1/T) Σ_{t=1}^{T} δ(c_i, c^t) w^t
P̂(x_i | e) ∝ (1/T) Σ_{t=1}^{T} P(x_i | c^t, e) w^t
where P(ct,e) is computed using Bucket Elimination
(Gogate & Dechter, 2005) and (Bidyuk & Dechter, 2006)
For each Z_i ∈ Z do:
  If Z_i ∈ E, set z_i^t = e_i
  Else sample z_i^t ← P(Z_i | z_1^t, …, z_{i−1}^t)
  End If
End For
where P(Z_i | z_1^t, …, z_{i−1}^t) is computed while generating sample t using bucket tree elimination
number of instances K (based on the memory available)
KL[ P(C|e), Q(C) ] ≤ KL[ P(X|e), Q(X) ]
In the presence of determinism (constraints):
– Importance Sampling generates many zero-weight samples
– Gibbs Sampling converges more slowly (and the chain may not be ergodic)
[Plot: cpcs360b, N=360, |LC|=26, w*=21, |E|=15 — MSE vs. time (sec), comparing LW, AIS-BN, Gibbs, LCS, IBP.]
LW – likelihood weighting; LCS – likelihood weighting on a cutset
[Plot: cpcs422b, N=422, |LC|=47, w*=22, |E|=28 — MSE vs. time (sec), comparing LW, AIS-BN, Gibbs, LCS, IBP.]
LW – likelihood weighting; LCS – likelihood weighting on a cutset
[Plot: coding, N=200, P=3, |LC|=26, w*=21 — MSE vs. time (sec), comparing LW, AIS-BN, Gibbs, LCS, IBP.]
LW – likelihood weighting; LCS – likelihood weighting on a cutset