Slides Set 11 (part a): Sampling Techniques for Probabilistic and Deterministic Graphical Models


SLIDE 1

Algorithms for Reasoning with Graphical Models

Slides Set 11 (part a): Sampling Techniques for Probabilistic and Deterministic Graphical Models

Rina Dechter

(Reading: Darwiche chapter 15, related papers)

SLIDE 2

Sampling Techniques for Probabilistic and Deterministic Graphical Models

ICS 276, Spring 2018, Bozhena Bidyuk

Reading: Darwiche chapter 15, related papers

SLIDE 3

Overview

  • 1. Basics of sampling
  • 2. Importance Sampling
  • 3. Markov Chain Monte Carlo: Gibbs Sampling
  • 4. Sampling in presence of Determinism
  • 5. Rao‐Blackwellisation, cutset sampling


SLIDE 5

Types of queries: Sum-Inference, Max-Inference, Mixed-Inference (in increasing order of hardness)

  • NP-hard: exponentially many terms
  • We will focus on approximation algorithms
    – Anytime: very fast & very approximate → slower & more accurate

SLIDE 6

Monte Carlo estimators

  • Most basic form: empirical estimate of probability
  • Relevant considerations
    – Able to sample from the target distribution p(x)?
    – Able to evaluate p(x) explicitly, or only up to a constant?
  • “Any-time” properties
    – Unbiased estimator, or asymptotically unbiased
    – Variance of the estimator decreases with m

SLIDE 7

Monte Carlo estimators

  • Most basic form: empirical estimate of probability
  • Central limit theorem
    – p(U) is asymptotically Gaussian
  • Finite sample confidence intervals
    – If u(x) or its variance are bounded, the probability concentrates rapidly around the expectation

[Figure: sampling distribution of the estimate for m = 1, 5, 15; it concentrates around the expectation as m grows]

SLIDE 10

A Sample

  • Given a set of variables X = {X1,...,Xn}, a sample, denoted by S^t, is an instantiation of all variables:

$S^t = (x_1^t, x_2^t, \ldots, x_n^t)$

SLIDE 11

How to Draw a Sample? Univariate Distribution

  • Example: Given random variable X having domain {0, 1} and a distribution P(X) = (0.3, 0.7).
  • Task: Generate samples of X from P.
  • How?
    – Draw a random number r ∈ [0, 1]
    – If r < 0.3 then set X = 0
    – Else set X = 1
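The same recipe extends to any finite domain: partition [0, 1] into intervals whose lengths equal the probabilities, and return the interval that r falls into. A minimal Python sketch (the distribution is the slide's P(X) = (0.3, 0.7); the function name is ours):

```python
import random

def sample_discrete(probs, rng=random):
    """Draw one value from a finite distribution by inverting its CDF."""
    r = rng.random()          # r uniform in [0, 1)
    cum = 0.0
    for value, p in enumerate(probs):
        cum += p
        if r < cum:           # r fell into this value's interval
            return value
    return len(probs) - 1     # guard against floating-point round-off

# P(X) = (0.3, 0.7): about 30% of draws should be 0
samples = [sample_discrete([0.3, 0.7]) for _ in range(10000)]
print(sum(s == 0 for s in samples) / len(samples))  # ~0.3
```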

SLIDE 12

How to Draw a Sample? Multivariate Distribution

  • Let X = {X1,..,Xn} be a set of variables
  • Express the distribution in product form:

$P(X) = P(X_1) \cdot P(X_2 \mid X_1) \cdots P(X_n \mid X_1, \ldots, X_{n-1})$

  • Sample variables one by one from left to right, along the ordering dictated by the product form.
  • In the Bayesian network literature this is called logic sampling or forward sampling.

SLIDE 13

Sampling in Bayes nets (Forward Sampling)

  • No evidence: “causal” form makes sampling easy
    – Follow the variable ordering defined by parents
    – Starting from root(s), sample downward
    – When sampling each variable, condition on the values of its parents

[Figure: a four-node network A, B, C, D; a sample is drawn node by node]

[e.g., Henrion 1988]

SLIDE 14

Forward Sampling: No Evidence (Henrion 1988)

Input: Bayesian network over X = {X1,…,XN}, N = #nodes, T = #samples
Output: T samples
Process nodes in topological order – first process the ancestors of a node, then the node itself:
1. For t = 1 to T
2.   For i = 1 to N
3.     Xi ← sample xi^t from P(xi | pa_i)
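A minimal sketch of this loop in Python, on a hypothetical two-node network A → B (the network, CPT numbers, and function names are illustrative, not from the slides):

```python
import random

def sample_discrete(probs):
    r, cum = random.random(), 0.0
    for value, p in enumerate(probs):
        cum += p
        if r < cum:
            return value
    return len(probs) - 1

# Hypothetical network A -> B with binary variables.
P_A = [0.3, 0.7]                    # P(A)
P_B_given_A = {0: [0.9, 0.1],       # P(B | A=0)
               1: [0.2, 0.8]}       # P(B | A=1)

def forward_sample():
    """One forward sample: visit nodes in topological order (A before B)."""
    a = sample_discrete(P_A)                # root: sample from its prior
    b = sample_discrete(P_B_given_A[a])     # child: condition on the sampled parent
    return {"A": a, "B": b}

print([forward_sample() for _ in range(5)])
```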

SLIDE 15

Forward Sampling (example)

Network: X1 → X2, X1 → X3, and X2, X3 → X4, so that

$P(X_1, X_2, X_3, X_4) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1)\, P(X_4 \mid X_2, X_3)$

No evidence; to generate sample k:
1. Sample x1 from P(X1)
2. Sample x2 from P(X2 | X1 = x1)
3. Sample x3 from P(X3 | X1 = x1)
4. Sample x4 from P(X4 | X2 = x2, X3 = x3)

SLIDE 16

Forward Sampling w/ Evidence

Input: Bayesian network over X = {X1,…,XN}, N = #nodes, E = evidence, T = #samples
Output: T samples consistent with E
1. For t = 1 to T
2.   For i = 1 to N
3.     Xi ← sample xi^t from P(xi | pa_i)
4.     If Xi ∈ E and the sampled value disagrees with the evidence, reject the sample:
5.       Go to Step 1
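A runnable sketch of this rejection loop, reusing the hypothetical A → B network from the previous sketch, with evidence B = 1 (all numbers illustrative):

```python
import random

P_A = [0.3, 0.7]
P_B_given_A = {0: [0.9, 0.1], 1: [0.2, 0.8]}

def forward_sample():
    a = 0 if random.random() < P_A[0] else 1
    b = 0 if random.random() < P_B_given_A[a][0] else 1
    return {"A": a, "B": b}

def rejection_sample(evidence_b=1, max_tries=100_000):
    """Forward-sample until the draw agrees with the evidence B = evidence_b."""
    for _ in range(max_tries):
        s = forward_sample()
        if s["B"] == evidence_b:     # keep only samples consistent with evidence
            return s
    raise RuntimeError("evidence too unlikely; everything was rejected")

kept = [rejection_sample() for _ in range(1000)]
# kept are draws from P(A | B=1); exact P(A=1 | B=1) = 0.56/0.59 ~ 0.949
print(sum(s["A"] == 1 for s in kept) / len(kept))
```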

SLIDE 17

Forward Sampling (example)

Same network: $P(X_1, X_2, X_3, X_4) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1)\, P(X_4 \mid X_2, X_3)$.

Evidence: X3 = x3'. To generate sample k:
1. Sample x1 from P(x1)
2. Sample x2 from P(x2 | x1)
3. Sample x3 from P(x3 | x1)
4. If x3 ≠ x3', reject the sample and start again from step 1; otherwise
5. Sample x4 from P(x4 | x2, x3)

SLIDE 18

How to answer queries with sampling? Expected value and Variance

Expected value: Given a probability distribution P(X) and a function g(X) defined over a set of variables X = {X1, X2, …, Xn}, the expected value of g w.r.t. P is

$E_P[g(x)] = \sum_x g(x)\, P(x)$

Variance: The variance of g w.r.t. P is

$Var_P[g(x)] = \sum_x \big(g(x) - E_P[g(x)]\big)^2\, P(x)$

Many queries can be phrased as computing the expectation of some function.

SLIDE 19

Monte Carlo Estimate

  • Estimator:
    – An estimator is a function of the samples.
    – It produces an estimate of the unknown parameter of the sampling distribution.
  • Given i.i.d. samples S^1, S^2, …, S^T drawn from P, the Monte Carlo estimate of $E_P[g(x)]$ is given by:

$\hat{g} = \frac{1}{T} \sum_{t=1}^{T} g(S^t)$

SLIDE 20

Example: Monte Carlo estimate

  • Given:
    – A distribution P(X) = (0.3, 0.7).
    – g(X) = 40 if X equals 0; 50 if X equals 1.
  • Exact: E_P[g(x)] = 40×0.3 + 50×0.7 = 47.
  • Generate k = 10 samples from P: 0,1,1,1,0,1,1,0,1,0

$\hat{g} = \frac{40 \cdot \#(X=0) + 50 \cdot \#(X=1)}{\#\text{samples}} = \frac{40 \cdot 4 + 50 \cdot 6}{10} = 46$
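A quick sketch reproducing this computation, using the slide's P, g, and sample sequence:

```python
P = (0.3, 0.7)
g = {0: 40, 1: 50}

exact = sum(g[x] * p for x, p in enumerate(P))    # 40*0.3 + 50*0.7 = 47
samples = [0, 1, 1, 1, 0, 1, 1, 0, 1, 0]          # the 10 draws from the slide
estimate = sum(g[x] for x in samples) / len(samples)

print(exact, estimate)   # 47.0 46.0
```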

SLIDE 21

Bayes Nets with Evidence

  • Estimating posterior probabilities, P[A = a | E = e]?
  • Rejection sampling
    – Draw x ~ p(x), but discard it if E ≠ e
    – The remaining samples are from p(x | E = e); use as before
    – Problem: keeps only a P[E = e] fraction of the samples!
    – Performs poorly when the evidence probability is small
  • Estimate the ratio: P[A = a, E = e] / P[E = e]
    – Two estimates (numerator & denominator)
    – Good finite sample bounds require low relative error!
    – Again, performs poorly when the evidence probability is small
    – What bounds can we get?

SLIDE 26

Bayes Nets With Evidence

  • Estimating the probability of evidence, P[E = e]:
    – Finite sample bounds for u(x) ∈ [0,1] [e.g., Hoeffding]
    – Relative error bounds [Dagum & Luby 1997]

What if the evidence is unlikely? If P[E = e] = 1e-6, we could easily estimate Û = 0!

So if U, the probability of evidence, is very small, we would need very many samples that are not rejected.

SLIDE 27

Overview

  • 1. Basics of sampling
  • 2. Importance Sampling
  • 3. Markov Chain Monte Carlo: Gibbs Sampling
  • 4. Sampling in presence of Determinism
  • 5. Rao‐Blackwellisation, cutset sampling


SLIDE 28

Importance Sampling: Main Idea

  • Express the query as the expected value of a random variable w.r.t. a distribution Q.
  • Generate random samples from Q.
  • Estimate the expected value from the generated samples using a Monte Carlo estimator (average).

SLIDE 29

Importance Sampling

  • Basic empirical estimate of probability:
  • Importance sampling:


SLIDE 30

Importance Sampling

  • Basic empirical estimate of probability:
  • Importance sampling:

“importance weights”

SLIDE 31

Estimating P(E) and P(X|e)

SLIDE 32

Importance Sampling for P(e)

Let Z = X \ E. Let Q(Z) be a (proposal) distribution satisfying $P(z, e) > 0 \Rightarrow Q(z) > 0$.

Then we can rewrite P(e) as:

$P(e) = \sum_z P(z, e) = \sum_z \frac{P(z, e)}{Q(z)}\, Q(z) = E_Q\!\left[\frac{P(z, e)}{Q(z)}\right] = E_Q[w(z)]$

Monte Carlo estimate:

$\hat{P}(e) = \frac{1}{T} \sum_{t=1}^{T} w(z^t), \quad z^t \sim Q(Z)$

where $w(z) = \frac{P(z, e)}{Q(z)}$ are the importance weights.
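A runnable sketch of this estimator on the hypothetical A → B network used earlier, with evidence B = 1 and a uniform proposal over A (proposal and numbers are illustrative):

```python
import random

# Same hypothetical CPTs as before: A -> B, binary.
P_A = [0.3, 0.7]
P_B_given_A = {0: [0.9, 0.1], 1: [0.2, 0.8]}
e_b = 1                                  # evidence B = 1

Q_A = [0.5, 0.5]                         # proposal Q(A): uniform (illustrative)

def is_estimate_pe(T=100_000):
    total = 0.0
    for _ in range(T):
        a = 0 if random.random() < Q_A[0] else 1       # z ~ Q
        w = (P_A[a] * P_B_given_A[a][e_b]) / Q_A[a]    # w(z) = P(z, e) / Q(z)
        total += w
    return total / T

# Exact P(B=1) = 0.3*0.1 + 0.7*0.8 = 0.59
print(is_estimate_pe())   # ~0.59
```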

SLIDE 33

Properties of the IS Estimate of P(e)

  • Convergence: by the law of large numbers,

$\hat{P}(e) = \frac{1}{T}\sum_{i=1}^{T} w(z^i) \xrightarrow{a.s.} P(e) \quad \text{as } T \to \infty$

  • Unbiased: $E_Q[\hat{P}(e)] = P(e)$
  • Variance:

$Var_Q[\hat{P}(e)] = Var_Q\!\left[\frac{1}{T}\sum_{i=1}^{T} w(z^i)\right] = \frac{Var_Q[w(z)]}{T}$

SLIDE 34

Properties of the IS Estimate of P(e)

  • Mean squared error of the estimator:

$MSE_Q[\hat{P}(e)] = E_Q\big[(\hat{P}(e) - P(e))^2\big] = \big(E_Q[\hat{P}(e)] - P(e)\big)^2 + Var_Q[\hat{P}(e)] = \frac{Var_Q[w(x)]}{T}$

The squared bias term is zero because the estimator is unbiased: its expected value equals P(e).

SLIDE 35

Estimating P(E) and P(X|e)

SLIDE 36

Estimating P(Xi|e)

Let $\delta_{x_i}(z)$ be a Dirac delta function, which is 1 if z contains $x_i$ and 0 otherwise. Then:

$P(x_i \mid e) = \frac{P(x_i, e)}{P(e)}, \qquad P(x_i, e) = \sum_z \delta_{x_i}(z)\, P(z, e) = \sum_z \delta_{x_i}(z)\, \frac{P(z, e)}{Q(z)}\, Q(z) = E_Q[\delta_{x_i}(z)\, w(z)]$

Idea: estimate numerator and denominator by IS. Ratio estimate:

$\hat{P}(x_i \mid e) = \frac{\hat{P}(x_i, e)}{\hat{P}(e)} = \frac{\sum_{k=1}^{T} \delta_{x_i}(z^k)\, w(z^k)}{\sum_{k=1}^{T} w(z^k)}$

The estimate is biased: $E[\hat{P}(x_i \mid e)] \neq P(x_i \mid e)$.

SLIDE 37

Properties of the IS estimator for P(Xi|e)

  • Convergence: by the weak law of large numbers,

$\hat{P}(x_i \mid e) \to P(x_i \mid e) \quad \text{as } T \to \infty$

  • Asymptotically unbiased:

$\lim_{T \to \infty} E[\hat{P}(x_i \mid e)] = P(x_i \mid e)$

  • Variance
    – Harder to analyze
    – Liu suggests a measure called the “effective sample size”

SLIDE 38

Effective Sample Size

Given samples from P(z|e) itself, we could estimate P(x_i|e) using the ideal estimator

$\hat{P}(x_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} \delta_{x_i}(z^t)$

Define:

$ESS(Q, T) = \frac{T}{1 + Var_Q[w(z)]}$

It measures how much the IS estimator deviates from the ideal one. Thus T samples from Q are worth ESS(Q, T) samples from P. Therefore the variance of the weights must be as small as possible.
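A small sketch of the weight-based ESS computation. Liu's T/(1 + Var[w̃]) formula, applied to empirically normalized weights, reduces to the common computational form (Σw)²/Σw²; the weights can come from any IS run such as the P(e) sketch above:

```python
def effective_sample_size(weights):
    """ESS for a list of importance weights: (sum w)^2 / sum(w^2)."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2

print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))    # 4.0: equal weights are ideal
print(effective_sample_size([100.0, 0.1, 0.1, 0.1]))  # ~1.0: one weight dominates
```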

SLIDE 39

Generating Samples From Q

  • No restrictions on “how to”
  • Typically, express Q in product form:
    – Q(Z) = Q(Z1) × Q(Z2|Z1) × … × Q(Zn|Z1,..,Zn−1)
  • Sample along the order Z1,..,Zn (see the sketch below)
  • Example:
    – Z1 ~ Q(Z1) = (0.2, 0.8)
    – Z2 ~ Q(Z2|Z1) = (0.1, 0.9; 0.2, 0.8)
    – Z3 ~ Q(Z3|Z1,Z2) = Q(Z3) = (0.5, 0.5)
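A sketch of sampling the slide's three-variable example along the order Z1, Z2, Z3, interpreting the four numbers of Q(Z2|Z1) as the rows for Z1 = 0 and Z1 = 1 (an assumption about the slide's notation):

```python
import random

def draw(probs):
    r, cum = random.random(), 0.0
    for v, p in enumerate(probs):
        cum += p
        if r < cum:
            return v
    return len(probs) - 1

def sample_from_Q():
    z1 = draw([0.2, 0.8])                       # Z1 ~ Q(Z1)
    z2 = draw([[0.1, 0.9], [0.2, 0.8]][z1])     # Z2 ~ Q(Z2 | Z1)
    z3 = draw([0.5, 0.5])                       # Z3 ~ Q(Z3), independent here
    return (z1, z2, z3)

print([sample_from_Q() for _ in range(3)])
```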

SLIDE 40

Summary: IS for Common Queries

  • Partition function
    – Ex: MRF, or BN with evidence
    – Unbiased; only requires evaluating the unnormalized function f(x)
  • General expectations w.r.t. p(x) ∝ f(x)?
    – E.g., marginal probabilities, etc.
    – Estimate numerator and denominator separately; only asymptotically unbiased…

SLIDE 41

More on Properties of IS

  • IS is unbiased and fast if q(·) is easy to sample from
  • IS can have lower variance if q(·) is chosen well
    – Ex: q(x) puts more probability mass where u(x) is large
    – Optimal: q(x) ∝ |u(x) p(x)|
  • IS can also give poor performance
    – If q(x) << u(x) p(x): rare but very high weights!
    – Then the empirical variance is also unreliable!
    – For guarantees, we need to analytically bound the weights / variance…

How to get a good proposal?

SLIDE 42

Outline

  • Definitions and Background on Statistics
  • Theory of importance sampling
  • Likelihood weighting
  • State-of-the-art importance sampling techniques

SLIDE 43

Likelihood Weighting

(Fung and Chang, 1990; Shachter and Peot, 1990)

“Clamping” evidence + logic sampling + weighting samples by the evidence likelihood. It is an instance of importance sampling!

Works well for likely evidence!

SLIDE 44

Likelihood Weighting: Sampling

[Figure: network with evidence nodes marked “e”]

Sample in topological order over X! Clamp the evidence, and sample xi ~ P(Xi|pa_i); P(Xi|pa_i) is a look-up in the CPT!

SLIDE 45

Likelihood Weighting: Proposal Distribution

$Q(X \setminus E) = \prod_{X_i \in X \setminus E} P(X_i \mid pa_i, e)$

Weights: for a sample x = (x1,..,xn),

$w(x) = \frac{P(x, e)}{Q(x)} = \frac{\prod_{X_i \in X \setminus E} P(x_i \mid pa_i, e)\, \prod_{E_j \in E} P(e_j \mid pa_j)}{\prod_{X_i \in X \setminus E} P(x_i \mid pa_i, e)} = \prod_{E_j \in E} P(e_j \mid pa_j)$

Example: Given a Bayesian network P(X1, X2, X3) = P(X1) P(X2|X1) P(X3|X1, X2) and evidence X2 = x2, the proposal is Q(X1, X3) = P(X1) P(X3|X1, x2).

Notice: Q is another Bayesian network.

SLIDE 46

Likelihood Weighting: Estimates

Estimate P(e):

$\hat{P}(e) = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$

Estimate posterior marginals:

$\hat{P}(x_i \mid e) = \frac{\hat{P}(x_i, e)}{\hat{P}(e)} = \frac{\sum_{t=1}^{T} w^{(t)}\, g_{x_i}(x^{(t)})}{\sum_{t=1}^{T} w^{(t)}}$

where $g_{x_i}(x)$ equals 1 if x contains $x_i$ and zero otherwise.
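A runnable sketch of likelihood weighting on the hypothetical A → B network from earlier, with evidence B = 1 clamped: only A is sampled, and each sample is weighted by P(B=1|A=a), the likelihood of the clamped evidence (numbers illustrative):

```python
import random

P_A = [0.3, 0.7]
P_B_given_A = {0: [0.9, 0.1], 1: [0.2, 0.8]}
e_b = 1

def likelihood_weighting(T=100_000):
    sum_w = 0.0
    sum_w_a1 = 0.0
    for _ in range(T):
        a = 0 if random.random() < P_A[0] else 1   # sample non-evidence node from its CPT
        w = P_B_given_A[a][e_b]                    # weight = likelihood of clamped evidence
        sum_w += w
        if a == 1:
            sum_w_a1 += w
    return sum_w / T, sum_w_a1 / sum_w             # estimates of P(e), P(A=1 | e)

# Exact: P(B=1) = 0.59, P(A=1 | B=1) = 0.56/0.59 ~ 0.949
print(likelihood_weighting())
```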

SLIDE 47

Properties of Likelihood Weighting

  • Converges to the exact posterior marginals
  • Generates samples fast
  • Sampling distribution is close to the prior (especially if E ⊆ leaf nodes)
  • Increasing sampling variance
    – Convergence may be slow
    – Many samples with P(x^(t)) = 0 are rejected

SLIDE 48

Outline

  • Definitions and Background on Statistics
  • Theory of importance sampling
  • Likelihood weighting
  • State-of-the-art importance sampling techniques

SLIDE 49

Proposal selection

  • One should try to select a proposal that is as close as possible to the posterior distribution.

$Var_Q[\hat{P}(e)] = \frac{1}{T}\, Var_Q[w(z)]$

To have a zero-variance estimator, we need $w(z) = \frac{P(z, e)}{Q(z)} = P(e)$ for all z, i.e.

$Q(z) = \frac{P(z, e)}{P(e)} = P(z \mid e)$
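A quick sketch illustrating the zero-variance property on the running A → B example: when Q(A) is exactly the posterior P(A|B=1), every importance weight equals P(e) = 0.59 (same illustrative numbers as before):

```python
P_A = [0.3, 0.7]
P_B_given_A = {0: [0.9, 0.1], 1: [0.2, 0.8]}
e_b, Pe = 1, 0.59

# Posterior proposal Q(A) = P(A | B=1) = P(A, B=1) / P(B=1)
Q = [P_A[a] * P_B_given_A[a][e_b] / Pe for a in (0, 1)]

# Weight of each possible sample z = a:
for a in (0, 1):
    w = P_A[a] * P_B_given_A[a][e_b] / Q[a]
    print(a, w)   # both weights equal 0.59 = P(e): zero variance
```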

SLIDE 50

Proposal Distributions used in the Literature

  • AIS-BN (adaptive proposal) [Cheng and Druzdzel, 2000]
  • Iterative Belief Propagation [Changhe and Druzdzel, 2003]
  • Iterative Join Graph Propagation (IJGP) and variable ordering [Gogate and Dechter, 2005]

SLIDE 51

Perfect Sampling using Bucket Elimination

  • Algorithm:
    – Run bucket elimination on the problem along an ordering o = (XN,..,X1).
    – Sample along the reverse ordering (X1,..,XN).
    – At each variable Xi, recover the probability P(Xi | x1,...,xi−1) by referring to its bucket.

SLIDE 52

Exact Sampling using Bucket Elimination

  • Algorithm:
    – Run bucket elimination on the problem along an ordering o = (X1,..,XN).
    – Sample along the reverse ordering.
    – At each branch point, recover the edge probabilities by performing a constant-time table lookup!
    – Complexity: O(bucket elimination) + O(M·n), where M is the number of solution samples and n is the number of variables.

SLIDE 53

How to sample from a Markov network? Exact sampling via inference

  • Draw samples from P[A | E=e] directly?
    – The model defines an unnormalized p(A,…,E=e)
    – Build an (oriented) tree decomposition & sample

The downward message normalizes its bucket; the ratio is a conditional distribution.

[Figure: buckets E, C, D, B, A with messages along the bucket tree]

Work: O(exp(w)) to build the distribution, O(n·d) to draw each sample.

SLIDE 54

Bucket Elimination

Query: $P(a \mid e) = \alpha\, P(a, e)$

$P(a, e) = \sum_{c, b, d} P(a)\, P(c \mid a)\, P(b \mid a)\, P(d \mid b, a)\, P(e \mid b, c)$

Elimination order: d, e, b, c.

bucket D: $P(d \mid b, a) \;\Rightarrow\; f_D(a, b) = \sum_d P(d \mid b, a)$
bucket E: $P(e \mid b, c) \;\Rightarrow\; f_E(b, c) = P(e \mid b, c)$ (evidence e is fixed)
bucket B: $P(b \mid a),\ f_D(a, b),\ f_E(b, c) \;\Rightarrow\; f_B(a, c) = \sum_b P(b \mid a)\, f_D(a, b)\, f_E(b, c)$
bucket C: $P(c \mid a),\ f_B(a, c) \;\Rightarrow\; f_C(a) = \sum_c P(c \mid a)\, f_B(a, c)$
bucket A: $P(a),\ f_C(a) \;\Rightarrow\; P(a, e) = P(a)\, f_C(a)$

Bucket tree (cluster scopes): D: {D,A,B}, E: {E,B,C}, B: {B,A,C}, C: {C,A}, A: {A}, with messages $f_D(a,b)$, $f_E(b,c)$, $f_B(a,c)$, $f_C(a)$.

Time and space exp(w*).

SLIDE 55

Bucket elimination (BE)

Computes P(e). Buckets and messages (elimination operator $\sum_B$, etc.):

bucket B: P(B|A), P(D|B,A), P(e|B,C) → h^B(A,D,C,e)
bucket C: P(C|A), h^B(A,D,C,e) → h^C(A,D,e)
bucket D: h^C(A,D,e) → h^D(A,e)
bucket E: h^D(A,e) → h^E(A)
bucket A: P(A), h^E(A) → P(e)

SLIDE 56

Sampling from the output of BE (Dechter 2002)

Using the buckets from the previous slide (bucket B: P(B|A), P(D|B,A), P(e|B,C); messages h^B(A,D,C,e), h^C(A,D,e), h^D(A,e), h^E(A)):

$Q(A) \propto P(A)\, h^E(A)$; sample A = a.
Evidence bucket: ignore.
Set A = a in bucket D; sample D = d from $Q(D \mid a, e) \propto h^C(a, D, e)$.
Set A = a, D = d in bucket C; sample C = c from $Q(C \mid a, e, d) \propto P(C \mid a)\, h^B(a, d, C, e)$.
Set A = a, D = d, C = c in bucket B; sample B = b from $Q(B \mid a, e, d, c) \propto P(B \mid a)\, P(d \mid B, a)\, P(e \mid B, c)$.
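A compact sketch of the same two-pass idea on a hypothetical three-variable chain X1 → X2 → X3 with evidence X3 = 1: an elimination pass records the bucket messages, then one pass of table lookups samples from the exact posterior (CPT numbers are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chain X1 -> X2 -> X3, all binary.
P1 = np.array([0.6, 0.4])                 # P(X1)
P2 = np.array([[0.7, 0.3], [0.2, 0.8]])   # P(X2 | X1), row = x1
P3 = np.array([[0.9, 0.1], [0.4, 0.6]])   # P(X3 | X2), row = x2
e3 = 1                                    # evidence X3 = 1

# Elimination pass (eliminate X3, then X2), recording the messages:
h3 = P3[:, e3]        # bucket X3: h3(x2) = P(X3=e3 | x2)
h2 = P2 @ h3          # bucket X2: h2(x1) = sum_x2 P(x2|x1) h3(x2)
Pe = P1 @ h2          # bucket X1: P(e)

# Sampling pass along the reverse order, via table lookups:
def sample_posterior():
    x1 = rng.choice(2, p=P1 * h2 / Pe)            # Q(X1) = P(X1 | e)
    x2 = rng.choice(2, p=P2[x1] * h3 / h2[x1])    # Q(X2 | x1, e)
    return (x1, x2, e3)

print(Pe, [sample_posterior() for _ in range(3)])
```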

SLIDE 57

Mini-Bucket Elimination

Space and time constraints: the maximum scope size of a newly generated function must be bounded by 2. BE would generate a function of scope size 3, so it cannot be used. Instead, split bucket B into mini-buckets:

bucket B (two mini-buckets, each eliminated by $\sum_B$): { P(B|A), P(D|B,A) } → h^B(A,D); { P(e|B,C) } → h^B(C,e)
bucket C: P(C|A), h^B(C,e) → h^C(A,e)
bucket D: h^B(A,D) → h^D(A)
bucket E: h^C(A,e) → h^E(A)
bucket A: P(A), h^D(A), h^E(A) → approximation of P(e)

SLIDE 58

Sampling from the output of MBE

Buckets as on the previous slide (bucket B split into mini-buckets, with messages h^B(A,D), h^B(C,e), h^C(A,e), h^D(A), h^E(A)). Sampling is the same as in BE-sampling, except that now we construct Q from a randomly selected mini-bucket.

SLIDE 59

IJGP-Sampling

(Gogate and Dechter, 2005)

  • Iterative Join Graph Propagation (IJGP)
    – A generalized belief propagation scheme (Yedidia et al., 2002)
  • IJGP yields better approximations of P(X|E) than MBE (Dechter, Kask and Mateescu, 2002)
  • The output of IJGP has the same form as mini-bucket “clusters”
  • Currently one of the best performing IS schemes!

SLIDE 60

Example: IJGP-Sampling

  • Run IJGP

[Figure: join graph with clusters CD, AD, BCD, BE, DE, ABE (i-bound i = 2), a sampling order over A, B, C, D, E, and approximate solution counts]

SLIDE 61

Current Research Question

  • Given a Bayesian network with evidence, or a Markov network, representing a function P, generate another Bayesian network representing a function Q (from a family of distributions restricted by structure) such that Q is closest to P.
  • Current approaches
    – Mini-buckets
    – IJGP
    – Both
  • Evaluated experimentally, but they need to be justified theoretically.

SLIDE 62

Algorithm: Approximate Sampling

1) Run IJGP or MBE
2) At each branch point, compute the edge probabilities by consulting the output of IJGP or MBE

  • Rejection problem:
    – Some generated assignments are non-solutions

SLIDE 63

 

Adaptive Importance Sampling

Initial proposal: $Q^1(Z) = Q(Z_1)\, Q(Z_2 \mid pa(Z_2)) \cdots Q(Z_n \mid pa(Z_n))$

For i = 1 to k do:
  Generate samples $z^1, \ldots, z^N$ from $Q^i$
  $\hat{P}_i(E = e) = \frac{1}{N} \sum_{j=1}^{N} w(z^j)$
  Update the proposal: $Q^{i+1} \leftarrow Q'$ (estimated from the samples)
End For
Return $\hat{P}(E = e) = \frac{1}{k} \sum_{i=1}^{k} \hat{P}_i(E = e)$

SLIDE 64

Adaptive Importance Sampling

  • General case
  • Given k proposal distributions
  • Take N samples out of each distribution
  • Approximate P(e):

$\hat{P}(e) = \frac{1}{k} \sum_{j=1}^{k} \big(\text{average weight under the } j\text{th proposal}\big)$

SLIDE 65

Estimating Q'(z)

$Q'(Z) = Q'(Z_1)\, Q'(Z_2 \mid pa(Z_2)) \cdots Q'(Z_n \mid pa(Z_n))$

where each $Q'(Z_i \mid Z_1, \ldots, Z_{i-1})$ is estimated by importance sampling.

SLIDE 66

Choosing a proposal (WMB-IS)

  • Can use the WMB upper bound U to define a proposal q(x):

[Figure: buckets A–E with mini-buckets]

Weighted mixture: use mini-bucket 1 with probability w1, or mini-bucket 2 with probability w2 = 1 − w1.

Key insight: this provides bounded importance weights! [Liu, Fisher, Ihler 2015]

SLIDE 67

WMB-IS Bounds

  • Finite sample bounds on the average (“empirical Bernstein” bounds)
  • Compare to forward sampling

[Figure: log-probability bounds vs. sample size m (10^1–10^5) on networks BN_6 and BN_11; the bounds tighten around the true value as m grows]

[Liu, Fisher, Ihler 2015]

SLIDE 68

Other Choices of Proposals

  • Belief propagation
    – BP-based proposal [Changhe & Druzdzel 2003]
    – Join-graph BP proposal [Gogate & Dechter 2005]
    – Mean field proposal [Wexler & Geiger 2007]

[Figure: buckets A–E and the corresponding join graph with clusters {B|A,C}, {B|D,E}, {C|A,E}, {D|A,E}, {E|A}, {A} and separators {B}, {D,E}, {A}, {A,E}, {A,C}]

SLIDE 69

Other Choices of Proposals

  • Belief propagation
    – BP-based proposal [Changhe & Druzdzel 2003]
    – Join-graph BP proposal [Gogate & Dechter 2005]
    – Mean field proposal [Wexler & Geiger 2007]
  • Adaptive importance sampling
    – Use already-drawn samples to update q(x)
    – Rates ν_t and η_t adapt the estimates and the proposal
    – Ex: [Cheng & Druzdzel 2000] [Lapeyre & Boyd 2010] …
    – Loses the “iid”-ness of the samples

SLIDE 70

Overview

  • 1. Probabilistic Reasoning/Graphical models
  • 2. Importance Sampling
  • 3. Markov Chain Monte Carlo: Gibbs Sampling
  • 4. Sampling in presence of Determinism
  • 5. Rao‐Blackwellisation
  • 6. AND/OR importance sampling


SLIDE 71

Outline

  • Definitions and Background on Statistics
  • Theory of importance sampling
  • Likelihood weighting
  • Error estimation
  • State-of-the-art importance sampling techniques

SLIDE 74

Logic Sampling – How Many Samples?

Theorem: Let ŝ(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most ε with probability at least 1−δ, it is enough to have

$T \geq \frac{c}{P(y)\,\varepsilon^2}$

where the constant c depends on the confidence level 1−δ. Derived from Chebyshev's bound. An absolute-error version follows from a Hoeffding-type bound:

$\Pr\big(\hat{P}(y) \in [P(y) - \varepsilon,\; P(y) + \varepsilon]\big) \geq 1 - 2e^{-2T\varepsilon^2}$
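A tiny sketch of what the Chebyshev-style bound implies numerically. The slide does not fix the constant c; taking c = 1/δ (an assumed, Chebyshev-style choice) gives a feel for the numbers:

```python
def samples_needed(p_y, eps, delta):
    """T >= c / (P(y) * eps^2), with c = 1/delta (assumed form of the constant)."""
    return (1.0 / delta) / (p_y * eps * eps)

# Rare event P(y) = 1e-3, 10% relative error, 95% confidence:
print(samples_needed(1e-3, 0.1, 0.05))   # 2,000,000 samples
```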

SLIDE 75

Logic Sampling: Performance

Advantages:
  • P(xi | pa(xi)) is readily available
  • Samples are independent!

Drawbacks:
  • If evidence E is rare (P(e) is low), then we will reject most of the samples!
  • Since the P(y) in the bound on T is unknown, we must estimate P(y) from the samples themselves!
  • If P(e) is small, T will become very big!

SLIDE 78

Bucket Elimination Overview

[Figure: bucket-elimination trace over variables A, C, E, D, B with pairwise functions F(A,B), F(A,D), F(A,E), F(B,C), F(B,D), F(B,E), F(C,D), F(D,E); processing buckets D, C, B, E, A creates intermediate functions F'(B,D,E), F'(C,D,E), F'(D,E), F'(E), and sampling proceeds in the reverse direction]

Complexity: exp(3) or n³.