Algorithms for Reasoning with Graphical Models
Slides Set 11 (part a): Sampling Techniques for Probabilistic and Deterministic Graphical Models
Rina Dechter
(Reading: Darwiche chapter 15, related papers)
slides11a 828X 2019
Queries: Sum-Inference, Max-Inference, Mixed-Inference (increasingly harder)
– Anytime: very fast & very approximate → slower & more accurate
– Able to sample from the target distribution p(x)?
– Able to evaluate p(x) explicitly, or only up to a constant?
– Unbiased estimator: E[Û] = U
– The variance of the estimator decreases with the number of samples m
– The distribution of Û is asymptotically Gaussian (central limit theorem)
– If u(x) or its variance is bounded, the probability concentrates rapidly around the expectation
[Figure: sampling distribution of the estimate for m = 1, 5, 15 samples]
A sample is a full assignment x^t = (x_1^t, x_2^t, ..., x_n^t).

P(X) = P(X_1) · P(X_2 | X_1) · ... · P(X_n | X_1, ..., X_{n-1})
Forward (ancestral) sampling [e.g., Henrion 1988]:
– Follow a variable ordering defined by the parent relation
– Starting from the root(s), sample downward
– When sampling each variable, condition on the values of its parents: sample x_i^t from P(X_i | pa_i)
[Network over A, B, C, D]
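The steps above can be sketched in code. A minimal Python sketch of ancestral sampling, assuming binary variables and a small hand-made chain network A → B → C (the CPT numbers are illustrative, not from the slides):

```python
import random

# Illustrative CPTs for a chain A -> B -> C; each entry maps a tuple of
# parent values to [P(value 0), P(value 1)].
cpts = {
    "A": {(): [0.6, 0.4]},                      # P(A)
    "B": {(0,): [0.7, 0.3], (1,): [0.2, 0.8]},  # P(B | A)
    "C": {(0,): [0.9, 0.1], (1,): [0.5, 0.5]},  # P(C | B)
}
parents = {"A": [], "B": ["A"], "C": ["B"]}
order = ["A", "B", "C"]  # topological order: parents before children

def forward_sample(rng=random):
    """Draw one full assignment by sampling each variable given its parents."""
    sample = {}
    for var in order:
        pa_vals = tuple(sample[p] for p in parents[var])
        p1 = cpts[var][pa_vals][1]          # probability of value 1
        sample[var] = 1 if rng.random() < p1 else 0
    return sample

rng = random.Random(0)
samples = [forward_sample(rng) for _ in range(10_000)]
print(sum(s["A"] for s in samples) / len(samples))  # ≈ P(A=1) = 0.4
```

Averaging indicator functions over such samples estimates any marginal of the network.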
Example: P(X_1, X_2, X_3, X_4) = P(X_1) P(X_2 | X_1) P(X_3 | X_1) P(X_4 | X_2, X_3)

No evidence; to generate sample k:
1. Sample x_1 from P(X_1)
2. Sample x_2 from P(X_2 | X_1 = x_1)
3. Sample x_3 from P(X_3 | X_1 = x_1)
4. Sample x_4 from P(X_4 | X_2 = x_2, X_3 = x_3)
With evidence X_3 = 0 (rejection); to generate sample k:
1. Sample x_1 from P(X_1)
2. Sample x_2 from P(X_2 | X_1 = x_1)
3. Sample x_3 from P(X_3 | X_1 = x_1)
4. If x_3 ≠ 0, reject the sample and start again from step 1
5. Sample x_4 from P(X_4 | X_2 = x_2, X_3 = x_3)
Many queries can be phrased as computing the expectation of some function:

E_P[g(X)] = Σ_x g(x) P(x),  estimated by  ĝ = (1/T) Σ_{t=1}^T g(x^t),  with x^t ~ P(X)
Example:
– A distribution P(X) = (0.3, 0.7)
– g(X) = 40 if X = 0; g(X) = 50 if X = 1

ĝ = [40 · #(X=0 samples) + 50 · #(X=1 samples)] / #samples = (40 · 4 + 50 · 6) / 10 = 46
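The arithmetic in this example is easy to reproduce. A short Python sketch of the same plain Monte Carlo estimate (the exact answer is 0.3·40 + 0.7·50 = 47):

```python
import random

# Estimate E[g(X)] for P(X) = (0.3, 0.7), g(0) = 40, g(1) = 50
# by averaging g over samples drawn from P.
g = {0: 40, 1: 50}

def mc_estimate(num_samples, rng):
    samples = [1 if rng.random() < 0.7 else 0 for _ in range(num_samples)]
    return sum(g[x] for x in samples) / num_samples

rng = random.Random(0)
print(mc_estimate(10_000, rng))   # close to 0.3*40 + 0.7*50 = 47
```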
– Draw x ~ p(x), but discard if E ≠ e
– The resulting samples are from p(x | E=e); use as before
– Problem: keeps only a P[E=e] fraction of the samples!
– Performs poorly when the evidence probability is small

– Two estimates (numerator & denominator)
– Good finite sample bounds require low relative error!
– Again, performs poorly when the evidence probability is small
– What bounds can we get?
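The bullets above can be sketched directly; note how the fraction of kept samples itself estimates P[E=e]. A minimal Python sketch on a hypothetical two-variable network A → B with illustrative CPTs and evidence B = 1:

```python
import random

def sample_prior(rng):
    """Forward-sample the full joint (no evidence)."""
    a = 1 if rng.random() < 0.4 else 0        # P(A=1) = 0.4
    p_b1 = 0.8 if a == 1 else 0.2             # P(B=1 | A)
    b = 1 if rng.random() < p_b1 else 0
    return {"A": a, "B": b}

def rejection_sample(num_draws, rng):
    """Keep only the samples consistent with the evidence B = 1."""
    draws = [sample_prior(rng) for _ in range(num_draws)]
    return [s for s in draws if s["B"] == 1]  # discard if E != e

rng = random.Random(0)
kept = rejection_sample(20_000, rng)
print(len(kept) / 20_000)                     # ≈ P(B=1) = 0.4*0.8 + 0.6*0.2 = 0.44
print(sum(s["A"] for s in kept) / len(kept))  # ≈ P(A=1 | B=1) = 0.32/0.44
```

If the evidence were rare (say P[B=1] = 1e-6), almost every draw would be discarded, which is exactly the failure mode the slide describes.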
– Absolute error bounds [e.g., Hoeffding]: finite sample bounds for u(x) ∈ [0,1]
– Relative error bounds [Dagum & Luby 1997]
What if the evidence is unlikely? With P[E=e] = 1e-6 we could estimate Û = 0!
So if U, the probability of the evidence, is very small, we would need very many samples that are not rejected.
“importance weights”
Let Z = X \ E, and let Q(Z) be a (proposal) distribution satisfying P(z,e) > 0 ⟹ Q(z) > 0.
Then we can rewrite P(e) as:

P(e) = Σ_z P(z,e) = Σ_z Q(z) · [P(z,e) / Q(z)] = E_Q[w(z)],  where w(z) = P(z,e) / Q(z)

Monte Carlo estimate: P̂(e) = (1/T) Σ_{t=1}^T w(z^t),  z^t ~ Q(Z)
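A tiny runnable sketch of this estimator, with a made-up two-value target so the answer is checkable by hand (P(0,e) = 0.03, P(1,e) = 0.07, hence P(e) = 0.10; the uniform proposal is also an assumption):

```python
import random

# Target un-normalized values P(z, e) and a proposal Q with Q(z) > 0
# wherever P(z, e) > 0.
p_joint = {0: 0.03, 1: 0.07}
q = {0: 0.5, 1: 0.5}

def is_estimate(num_samples, rng):
    """P-hat(e) = (1/T) * sum of importance weights w(z) = P(z,e)/Q(z)."""
    total = 0.0
    for _ in range(num_samples):
        z = 1 if rng.random() < q[1] else 0
        total += p_joint[z] / q[z]        # importance weight w(z)
    return total / num_samples

rng = random.Random(0)
print(is_estimate(50_000, rng))  # close to P(e) = 0.10
```

The same pattern works for any discrete model: only the evaluation of P(z,e) and the sampler for Q change.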
P̂(e) = (1/T) Σ_{i=1}^T w(z^i) → P(e) a.s. as T → ∞

Var_Q[P̂(e)] = Var_Q[(1/T) Σ_{i=1}^T w(z^i)] = Var_Q[w(z)] / T
Var_Q[w(z)] = E_Q[w(z)²] − (E_Q[w(z)])²

The quantity enclosed in the brackets is zero because the expected value of the estimator equals the expected value of g(x).
P(x_i | e) = P(x_i, e) / P(e)

Idea: estimate numerator and denominator by IS:

P(x_i, e) = Σ_z δ_{x_i}(z) P(z,e) = E_Q[δ_{x_i}(z) w(z)]

Ratio estimate: P̂(x_i | e) = [Σ_{k=1}^T δ_{x_i}(z^k) w(z^k)] / [Σ_{k=1}^T w(z^k)]

The estimate is biased: E[P̂(x_i | e)] ≠ P(x_i | e)

Here δ_{x_i}(z) is a delta function, which is 1 if z contains x_i and 0 otherwise.
P̂(x_i | e) → P(x_i | e) as T → ∞, and lim_{T→∞} E[P̂(x_i | e)] = P(x_i | e)
Given samples from P(z|e), we could estimate P(x_i|e) using the ideal estimator:

P̂(x_i | e) = (1/T) Σ_{t=1}^T g(z^t),  where g(z) = δ_{x_i}(z),  z^t ~ P(Z|e)

Define ESS(Q,T) = T / (1 + Var_Q[w(z)]).

Thus T samples from Q are worth ESS(Q,T) samples from P; ESS measures how much the IS estimator deviates from the ideal one. Therefore the variance of the weights must be as small as possible.
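The ESS diagnostic is easy to compute from the raw weights. A sketch, assuming the common empirical variant ESS = 1 / Σ_t w̄_t² over normalized weights w̄_t (similar in spirit to the slide's T / (1 + Var_Q[w]) definition, but not identical to it):

```python
def effective_sample_size(weights):
    """Empirical ESS: equals T when all weights are equal, and drops
    toward 1 when a few weights dominate."""
    total = sum(weights)
    norm = [w / total for w in weights]             # normalize the weights
    return 1.0 / sum(w * w for w in norm)

print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))       # 4.0: all weights equal
print(effective_sample_size([100.0, 1e-3, 1e-3, 1e-3]))  # ≈ 1: one weight dominates
```

A low ESS relative to T is a warning that the proposal Q is poorly matched to the target.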
– Ex: MRF, or BN with evidence
– Unbiased; only requires evaluating the unnormalized function f(x)
– E.g., marginal probabilities, etc.
– Only asymptotically unbiased… estimate numerator and denominator separately
– Ex: q(x) puts more probability mass where u(x) is large
– Optimal: q(x) ∝ |u(x) p(x)|
– If q(x) << u(x) p(x): rare but very high weights!
– Then the empirical variance is also unreliable!
– For guarantees, we need to analytically bound the weights / variance…
How to get a good proposal?
(Fung and Chang, 1990; Shachter and Peot, 1990)
Works well for likely evidence!
Proposal: Q(X \ E) = Π_{X_i ∈ X\E} P(X_i | pa_i), with the evidence variables assigned E = e.

Weights: w(x) = P(x,e) / Q(x) = [Π_{X_i ∈ X\E} P(x_i | pa_i) · Π_{E_j ∈ E} P(e_j | pa_j)] / Π_{X_i ∈ X\E} P(x_i | pa_i) = Π_{E_j ∈ E} P(e_j | pa_j)

Example: Given a Bayesian network P(X_1, X_2, X_3) = P(X_1) P(X_2 | X_1) P(X_3 | X_1, X_2) and evidence X_2 = x_2: Q(X_1, X_3) = P(X_1) P(X_3 | X_1, x_2).
Notice: Q is another Bayesian network
P̂(e) = (1/T) Σ_{t=1}^T w^t

P̂(x_i | e) = P̂(x_i, e) / P̂(e) = [Σ_{t=1}^T w^t δ_{x_i}(x^t)] / [Σ_{t=1}^T w^t],  where δ_{x_i}(x^t) is 1 if x^t contains x_i and 0 otherwise.
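A runnable sketch of likelihood weighting on a chain like the example, with illustrative CPT numbers (an assumption, not from the slides) and evidence X2 = 1; the marginal uses the ratio estimate:

```python
import random

# Illustrative CPTs for P(X1) P(X2|X1) P(X3|X1,X2); evidence X2 = 1.
p_x1 = [0.6, 0.4]                                      # P(X1)
p_x2_given_x1 = {0: [0.7, 0.3], 1: [0.2, 0.8]}         # P(X2 | X1)
p_x3_given = {(0, 1): [0.9, 0.1], (1, 1): [0.4, 0.6]}  # P(X3 | X1, X2=1)

def lw_sample(rng):
    """Sample non-evidence variables from their CPTs; weight by P(e_j | pa_j)."""
    x1 = 1 if rng.random() < p_x1[1] else 0
    x2 = 1                                   # clamp the evidence X2 = 1
    w = p_x2_given_x1[x1][1]                 # weight: P(X2=1 | x1)
    x3 = 1 if rng.random() < p_x3_given[(x1, x2)][1] else 0
    return (x1, x3), w

def lw_marginal_x1(num_samples, rng):
    """Ratio estimate of P(X1=1 | X2=1)."""
    num = den = 0.0
    for _ in range(num_samples):
        (x1, _), w = lw_sample(rng)
        num += w * x1                        # sum of w * delta(x1 = 1)
        den += w
    return num / den

rng = random.Random(0)
print(lw_marginal_x1(50_000, rng))  # close to 0.32 / 0.50 = 0.64
```

No sample is ever rejected: unlikely evidence shows up as small weights instead of discarded draws.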
Var_Q[P̂(e)] = (1/T) Var_Q[w(z)]

To have a zero-variance estimator, choose Q(z) = P(z,e) / P(e) = P(z | e): then w(z) = P(z,e) / Q(z) = P(e) for every z.
number of variables
– The model defines an un-normalized p(A,…,E=e)
– Build an (oriented) tree decomposition & sample
– The downward message normalizes the bucket; the ratio is a conditional distribution
Work: O(exp(w)) to build the sampling distribution, O(n·d) to draw each sample
Query: P(a | e) ∝ P(a, e)

P(a,e) = Σ_{b,c,d} P(a) P(b|a) P(c|a) P(d|a,b) P(e|b,c)

Elimination order: d, e, b, c
Bucket elimination (order d, e, b, c):
D: f_D(a,b) = Σ_d P(d | a,b)
E: f_E(b,c) = P(e | b,c)
B: f_B(a,c) = Σ_b P(b | a) f_D(a,b) f_E(b,c)
C: f_C(a) = Σ_c P(c | a) f_B(a,c)
A: P(a,e) = P(a) f_C(a)

Bucket tree: clusters {D,A,B}, {E,B,C}, {B,A,C}, {C,A}, {A} connected by the messages f_D(a,b), f_E(b,c), f_B(a,c), f_C(a).
Original functions and messages; time and space exp(w*)
Elimination operator: Σ_b

bucket B: P(B|A) P(D|B,A) P(e|B,C)
bucket C: P(C|A) h_B(A,D,C,e)
bucket D: h_C(A,D,e)
bucket E: h_D(A,e)
bucket A: P(A) h_E(A)
(Dechter 2002)

bucket B: P(B|A) P(D|B,A) P(e|B,C)
bucket C: P(C|A) h_B(A,D,C,e)
bucket D: h_C(A,D,e)
bucket E: h_D(A,e)   (evidence bucket: ignore during sampling)
bucket A: P(A) h_E(A)

Sampling proceeds top-down, in the reverse order:
Sample A = a from Q(A) ∝ P(A) h_E(A)
Set A = a in bucket D; sample D = d from Q(D | a, e) ∝ h_C(a, D, e)
Set A = a, D = d in bucket C; sample C = c from Q(C | a, d, e) ∝ P(C | a) h_B(a, d, C, e)
Set A = a, D = d, C = c in bucket B; sample B = b from Q(B | a, d, c, e) ∝ P(B | a) P(d | B, a) P(e | B, c)
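Each sampling step draws one variable from a distribution proportional to a product of bucket functions with the earlier variables fixed. A small Python sketch of that single step (the factor tables are made up for illustration, mirroring the shape of Q(B | a, d, c, e) ∝ P(B|a) P(d|B,a) P(e|B,c)):

```python
import random

def sample_from_bucket(factors, domain, rng):
    """Draw v in domain with probability proportional to the product of
    all factor values at v (the bucket's unnormalized conditional)."""
    scores = [1.0] * len(domain)
    for i, v in enumerate(domain):
        for f in factors:
            scores[i] *= f(v)
    total = sum(scores)
    r = rng.random() * total
    acc = 0.0
    for v, s in zip(domain, scores):
        acc += s
        if r <= acc:
            return v
    return domain[-1]  # numeric safety fallback

# Hypothetical tables, indexed by the value of B, with A=a, D=d, C=c fixed:
p_b_given_a = [0.3, 0.7]   # P(B | a)
p_d_given_b = [0.9, 0.5]   # P(d | B, a) as a function of B
p_e_given_b = [0.2, 0.6]   # P(e | B, c) as a function of B

rng = random.Random(0)
draws = [sample_from_bucket(
            [lambda v: p_b_given_a[v],
             lambda v: p_d_given_b[v],
             lambda v: p_e_given_b[v]],
            [0, 1], rng)
         for _ in range(10_000)]
# Exact conditional here: P(B=1 | ...) = 0.21 / (0.054 + 0.21) ≈ 0.795
print(sum(draws) / len(draws))
```

Normalizing within a bucket is cheap because only one variable is free at a time; the expensive part is building the bucket functions themselves.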
Mini-buckets (bounded scope, i = 2):

bucket B: [P(B|A) P(D|B,A)] [P(e|B,C)]   (two mini-buckets, each processed by Σ_B)
bucket C: P(C|A) h_B(C,e)
bucket D: h_B(A,D)
bucket E: h_C(A,e)
bucket A: P(A) h_D(A) h_E(A)

Approximation of P(e). Space and time constraints: the maximum scope size of any new function generated should be bounded by 2; BE generates a function of scope size 3, so it cannot be used.
bucket B: [P(B|A) P(D|B,A)] [P(e|B,C)]
bucket C: P(C|A) h_B(C,e)
bucket D: h_B(A,D)
bucket E: h_C(A,e)
bucket A: P(A) h_D(A) h_E(A)

Sampling is the same as in BE-sampling, except that now we construct Q from a randomly selected mini-bucket.
(Gogate and Dechter, 2005)
[Figure: sampling order A, C, E, D, B; approximate #solutions with i = 2; clusters CD, AD, BCD, BE, DE, ABE, E]
– Mini-buckets
– IJGP
– Both
Initial proposal: Q(Z) = Q(Z_1) Q(Z_2 | pa(Z_2)) ... Q(Z_n | pa(Z_n))
For i = 1 to k do:
    Generate samples z^1, ..., z^N from Q
    P̂_i(E=e) = (1/N) Σ_{j=1}^N w(z^j)
    Update Q: Q' = ...
End For
Return P̂(E=e) = (1/k) Σ_{i=1}^k P̂_i(E=e)
E: C: D: B: A:
mini‐buckets
U = upper bound
Weighted mixture: use mini-bucket 1 with probability w1.
Key insight: provides bounded importance weights! [Liu, Fisher, Ihler 2015]
[Figure: bounds vs. sample size m on benchmarks BN_6 and BN_11]
“Empirical Bernstein” bounds
[Liu, Fisher, Ihler 2015]
– BP-based proposal [Changhe & Druzdzel 2003]
– Join-graph BP proposal [Gogate & Dechter 2005]
– Mean field proposal [Wexler & Geiger 2007]

Join graph: clusters {B|A,C} {B|D,E} {C|A,E} {D|A,E} {E|A} {A}; separators {B} {D,E} {A} {A} {A,E} {A,C}
– Use already-drawn samples to update q(x)
– Rates ν_t and η_t adapt the estimates and the proposal
– Ex: [Cheng & Druzdzel 2000] [Lapeyre & Boyd 2010] …
– Lose "iid"-ness of samples
Derived from Chebyshev's bound: P(|Û − U| ≥ ε) ≤ σ² / (N ε²), so N ≥ σ² / (ε² δ) samples suffice for confidence 1 − δ.
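The slide attributes the bound to Chebyshev; assuming the standard Chebyshev-style form P(|Û − U| ≥ ε) ≤ σ²/(N ε²), the required sample size is a one-line computation:

```python
import math

# Smallest N driving the Chebyshev failure probability sigma^2/(N*eps^2)
# below delta. (The bound form is reconstructed, not quoted from the slides.)
def chebyshev_sample_size(sigma2, eps, delta):
    return math.ceil(sigma2 / (eps ** 2 * delta))

print(chebyshev_sample_size(0.25, 0.05, 0.05))  # 2000
```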
[Figure: buckets D, C, B, E, A with original functions F(A,B), F(A,D), F(A,E), F(B,C), F(B,D), F(B,E), F(C,D), F(D,E) and generated functions F'(B,D,E), F'(C,D,E), F'(D,E), F'(E); sampling direction along the order]
Complexity: exp(3) per bucket, or n·d³ overall