SLIDE 1

Sampling Techniques for Probabilistic and Deterministic Graphical models

ICS 276, Spring 2017. Bozhena Bidyuk and Rina Dechter.

Reading: Darwiche, Chapter 15, and related papers

SLIDE 2

Overview

  • 1. Probabilistic Reasoning/Graphical models
  • 2. Importance Sampling
  • 3. Markov Chain Monte Carlo: Gibbs Sampling
  • 4. Sampling in presence of Determinism
  • 5. Rao-Blackwellisation
  • 6. AND/OR importance sampling
SLIDE 3

Overview

  • 1. Probabilistic Reasoning/Graphical models
  • 2. Importance Sampling
  • 3. Markov Chain Monte Carlo: Gibbs Sampling
  • 4. Sampling in presence of Determinism
  • 5. Cutset-based Variance Reduction
  • 6. AND/OR importance sampling
SLIDE 4

Probabilistic Reasoning: Graphical models

  • Graphical models:

– Bayesian networks, constraint networks, mixed networks

  • Queries
  • Exact algorithms:

– inference, search, and hybrids

  • Graph parameters:

– tree-width, cycle-cutset, w-cutset

SLIDE 5

Queries

  • Probability of evidence (or partition function):

$$P(e) = \sum_{X \setminus E} \prod_{i=1}^{n} P(x_i \mid pa_i)\Big|_{E=e}, \qquad Z = \sum_{X} \prod_{i} C_i$$

  • Posterior marginals (beliefs):

$$P(x_i \mid e) = \frac{P(x_i, e)}{P(e)} = \frac{\sum_{X \setminus \{X_i\} \setminus E} \prod_{j=1}^{n} P(x_j \mid pa_j)\big|_{E=e}}{\sum_{X \setminus E} \prod_{j=1}^{n} P(x_j \mid pa_j)\big|_{E=e}}$$

  • Most Probable Explanation:

$$x^* = \arg\max_{x} P(x, e)$$

SLIDE 6


Approximation

  • Since inference, search, and hybrids are too expensive when the graph is dense (high treewidth):

  • Bounding inference (week 8):
    – mini-bucket and mini-clustering
    – belief propagation
  • Bounding search (week 7)
  • Sampling
  • Goal: an anytime scheme
SLIDE 7

Overview

  • 1. Probabilistic Reasoning/Graphical models
  • 2. Importance Sampling
  • 3. Markov Chain Monte Carlo: Gibbs Sampling
  • 4. Sampling in presence of Determinism
  • 5. Rao-Blackwellisation
  • 6. AND/OR importance sampling
SLIDE 8

Outline

  • Definitions and Background on Statistics
  • Theory of importance sampling
  • Likelihood weighting
  • State-of-the-art importance sampling techniques

SLIDE 9

A sample

  • Given a set of variables X = {X1, ..., Xn}, a sample, denoted by $S^t$, is an instantiation of all the variables:

$$S^t = (x_1^t, x_2^t, \ldots, x_n^t)$$

SLIDE 10

How to draw a sample? Univariate distribution

  • Example: Given a random variable X with domain {0, 1} and a distribution P(X) = (0.3, 0.7).
  • Task: Generate samples of X from P.
  • How?

– Draw a random number r ∈ [0, 1]
– If r < 0.3, set X = 0
– Else set X = 1
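To make the procedure concrete, here is a minimal Python sketch of this inverse-CDF draw; the function name and data layout are illustrative, not from the slides:

```python
import random

def sample_univariate(probs):
    """Draw one sample from a discrete distribution given as a list of
    probabilities over domain values 0, 1, ..., len(probs)-1."""
    r = random.random()          # r uniform in [0, 1)
    cumulative = 0.0
    for value, p in enumerate(probs):
        cumulative += p
        if r < cumulative:       # first value whose cumulative mass exceeds r
            return value
    return len(probs) - 1        # guard against floating-point round-off

# Example from the slide: P(X) = (0.3, 0.7) over domain {0, 1}
samples = [sample_univariate([0.3, 0.7]) for _ in range(10)]
```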

SLIDE 11

How to draw a sample? Multi-variate distribution

  • Let X={X1,..,Xn} be a set of variables
  • Express the distribution in product form
  • Sample the variables one by one, left to right, along the ordering dictated by the product form.

  • Bayesian network literature: Logic sampling

$$P(X) = P(X_1) \, P(X_2 \mid X_1) \cdots P(X_n \mid X_1, \ldots, X_{n-1})$$

SLIDE 12

Sampling for Probabilistic Inference: Outline

  • Logic Sampling
  • Importance Sampling

– Likelihood Sampling
– Choosing a Proposal Distribution

  • Markov Chain Monte Carlo (MCMC)

– Metropolis-Hastings
– Gibbs sampling

  • Variance Reduction
SLIDE 13

Logic Sampling: No Evidence (Henrion 1988)

Input: Bayesian network over X = {X1, …, XN}, N = #nodes, T = #samples
Output: T samples
Process nodes in topological order – first process the ancestors of a node, then the node itself:

1. For t = 1 to T
2.   For i = 1 to N
3.     Xi ← sample xi^t from P(xi | pai)
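A hedged Python sketch of the algorithm, assuming the network is given as (name, parents, CPT) triples in topological order; this data layout is an illustrative assumption:

```python
import random

def logic_sample(nodes):
    """One forward (logic) sample over a Bayesian network.
    `nodes` is a list of (name, parents, cpt) triples in topological order;
    cpt maps a tuple of parent values to a probability list over the domain."""
    assignment = {}
    for name, parents, cpt in nodes:
        probs = cpt[tuple(assignment[p] for p in parents)]
        # draw the node's value given its already-sampled parents
        assignment[name] = random.choices(range(len(probs)), weights=probs)[0]
    return assignment

# Toy network: X1 -> X2
nodes = [
    ("X1", [], {(): [0.3, 0.7]}),
    ("X2", ["X1"], {(0,): [0.6, 0.4], (1,): [0.2, 0.8]}),
]
samples = [logic_sample(nodes) for _ in range(1000)]
```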

SLIDE 14

Logic sampling (example)

Network: X1 → X2, X1 → X3, {X2, X3} → X4, with CPTs P(X1), P(X2 | X1), P(X3 | X1), P(X4 | X2, X3).

$$P(X_1, X_2, X_3, X_4) = P(X_1) \, P(X_2 \mid X_1) \, P(X_3 \mid X_1) \, P(X_4 \mid X_2, X_3)$$

// No evidence: generate sample k
1. Sample x1 from P(X1)
2. Sample x2 from P(X2 | X1 = x1)
3. Sample x3 from P(X3 | X1 = x1)
4. Sample x4 from P(X4 | X2 = x2, X3 = x3)

SLIDE 15

Logic Sampling w/ Evidence

Input: Bayesian network over X = {X1, …, XN}, N = #nodes, E = evidence, T = #samples
Output: T samples consistent with E

1. For t = 1 to T
2.   For i = 1 to N
3.     Xi ← sample xi^t from P(xi | pai)
4.     If Xi ∈ E and xi ≠ ei, reject the sample:
5.       Go to step 1.
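The rejection step can be sketched on top of the logic_sample function from the previous block; the max_tries cap is an illustrative safeguard, not part of the algorithm:

```python
def logic_sample_with_evidence(nodes, evidence, max_tries=100000):
    """Rejection version of logic sampling: keep drawing full samples and
    discard any that disagree with the evidence (a dict var -> value)."""
    for _ in range(max_tries):
        assignment = logic_sample(nodes)  # from the previous sketch
        if all(assignment[v] == val for v, val in evidence.items()):
            return assignment             # consistent with E: accept
    raise RuntimeError("all samples rejected; evidence may be too unlikely")

# e.g. condition the toy network above on X2 = 1:
# s = logic_sample_with_evidence(nodes, {"X2": 1})
```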

SLIDE 16

Logic Sampling (example)

Network: X1 → X2, X1 → X3, {X2, X3} → X4, with CPTs P(x1), P(x2 | x1), P(x3 | x1), P(x4 | x2, x3).

// Evidence X3 = 0: generate sample k
1. Sample x1 from P(x1)
2. Sample x2 from P(x2 | x1)
3. Sample x3 from P(x3 | x1)
4. If x3 ≠ 0, reject the sample and start from step 1; otherwise
5. Sample x4 from P(x4 | x2, x3)

SLIDE 17

Expected value and Variance

Expected value: Given a probability distribution P(X) and a function g(X) defined over a set of variables X = {X1, X2, ..., Xn}, the expected value of g w.r.t. P is

$$E_P[g(x)] = \sum_{x} g(x) \, P(x)$$

Variance: The variance of g w.r.t. P is

$$Var_P[g(x)] = \sum_{x} \big( g(x) - E_P[g(x)] \big)^2 \, P(x)$$

SLIDE 18

Monte Carlo Estimate

  • Estimator:

– An estimator is a function of the samples.
– It produces an estimate of the unknown parameter of the sampling distribution.

Given i.i.d. samples $S^1, S^2, \ldots, S^T$ drawn from P, the Monte Carlo estimate of $E_P[g(x)]$ is given by:

$$\hat{g} = \frac{1}{T} \sum_{t=1}^{T} g(S^t)$$

SLIDE 19

Example: Monte Carlo estimate

  • Given:

– A distribution P(X) = (0.3, 0.7).
– g(X) = 40 if X = 0; g(X) = 50 if X = 1.

  • Exact value: $E_P[g(x)] = 40 \cdot 0.3 + 50 \cdot 0.7 = 47$.
  • Generate T = 10 samples from P: 0, 1, 1, 1, 0, 1, 1, 0, 1, 0.
  • Monte Carlo estimate:

$$\hat{g} = \frac{40 \cdot \#(X{=}0) + 50 \cdot \#(X{=}1)}{\#\text{samples}} = \frac{40 \cdot 4 + 50 \cdot 6}{10} = 46$$
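The same arithmetic as a short Python check, with the sample values copied from the slide:

```python
samples = [0, 1, 1, 1, 0, 1, 1, 0, 1, 0]   # the 10 samples drawn from P
g = {0: 40, 1: 50}

estimate = sum(g[x] for x in samples) / len(samples)
print(estimate)   # (40*4 + 50*6) / 10 = 46.0, vs. the exact value 47
```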

SLIDE 20

Outline

  • Definitions and Background on Statistics
  • Theory of importance sampling
  • Likelihood weighting
  • State-of-the-art importance sampling techniques

SLIDE 21

Importance sampling: Main idea

  • Express the query as the expected value of a random variable w.r.t. a distribution Q.
  • Generate random samples from Q.
  • Estimate the expected value from the generated samples using a Monte Carlo estimator (average).

SLIDE 22

Importance sampling for P(e)

Let Z = X \ E. Let Q(Z) be a (proposal) distribution satisfying $P(z, e) > 0 \Rightarrow Q(z) > 0$. Then we can rewrite P(e) as:

$$P(e) = \sum_{z} P(z, e) = \sum_{z} \frac{P(z, e)}{Q(z)} \, Q(z) = E_Q\!\left[ \frac{P(z, e)}{Q(z)} \right]$$

Monte Carlo estimate:

$$\hat{P}(e) = \frac{1}{T} \sum_{t=1}^{T} w(z^t), \qquad \text{where } w(z) = \frac{P(z, e)}{Q(z)}$$
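A minimal sketch of this estimator in Python, assuming callables for the unnormalized target P(z, e) and the proposal Q; all names here are illustrative:

```python
def importance_sampling_estimate(joint, q_sample, q_density, T):
    """Estimate P(e) = E_Q[P(z,e)/Q(z)] by averaging importance weights.
    joint(z)     -> P(z, e), the unnormalized target
    q_sample()   -> one sample z drawn from Q
    q_density(z) -> Q(z), which must be > 0 wherever P(z, e) > 0
    """
    weights = []
    for _ in range(T):
        z = q_sample()
        weights.append(joint(z) / q_density(z))   # w(z) = P(z,e)/Q(z)
    return sum(weights) / T                       # unbiased estimate of P(e)
```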

SLIDE 23

Properties of IS estimate of P(e)

  • Convergence (by the law of large numbers):

$$\hat{P}_T(e) = \frac{1}{T} \sum_{i=1}^{T} w(z^i) \xrightarrow{a.s.} P(e) \quad \text{as } T \to \infty$$

  • Unbiased: $E_Q[\hat{P}(e)] = P(e)$.
  • Variance:

$$Var_Q[\hat{P}(e)] = Var_Q\!\left[ \frac{1}{T} \sum_{i=1}^{T} w(z^i) \right] = \frac{Var_Q[w(z)]}{T}$$

SLIDE 24

Properties of IS estimate of P(e)

  • Mean squared error of the estimator:

$$MSE_Q[\hat{P}(e)] = E_Q\!\left[ \big( \hat{P}(e) - P(e) \big)^2 \right] = \big( E_Q[\hat{P}(e)] - P(e) \big)^2 + Var_Q[\hat{P}(e)] = Var_Q[\hat{P}(e)] = \frac{Var_Q[w(x)]}{T}$$

The squared-bias term in the middle is zero because the expected value of the estimator equals P(e): the estimator is unbiased.

SLIDE 25

Estimating P(Xi|e)

Let $\delta_{x_i}(z)$ be a Dirac delta function, which is 1 if z contains $x_i$ and 0 otherwise.

$$P(x_i \mid e) = \frac{P(x_i, e)}{P(e)}, \qquad P(x_i, e) = \sum_{z} \delta_{x_i}(z) \, P(z, e) = \sum_{z} \delta_{x_i}(z) \frac{P(z, e)}{Q(z)} \, Q(z) = E_Q\!\left[ \delta_{x_i}(z) \frac{P(z, e)}{Q(z)} \right]$$

Idea: estimate the numerator and denominator by IS. Ratio estimate:

$$\hat{P}(x_i \mid e) = \frac{\hat{P}(x_i, e)}{\hat{P}(e)} = \frac{\sum_{k=1}^{T} \delta_{x_i}(z^k) \, w(z^k)}{\sum_{k=1}^{T} w(z^k)}$$

The estimate is biased: $E[\hat{P}(x_i \mid e)] \neq P(x_i \mid e)$.

SLIDE 26

Properties of the IS estimator for P(Xi|e)

  • Convergence (by the weak law of large numbers):

$$\hat{P}(x_i \mid e) \to P(x_i \mid e) \quad \text{as } T \to \infty$$

  • Asymptotically unbiased:

$$\lim_{T \to \infty} E_P[\hat{P}(x_i \mid e)] = P(x_i \mid e)$$

  • Variance:

– Harder to analyze
– Liu suggests a measure called the “effective sample size” (see the sketch below)
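A common estimate of the effective sample size from the importance weights is $(\sum_t w_t)^2 / \sum_t w_t^2$; this textbook formula is not spelled out on the slide, so treat the sketch as an aside:

```python
def effective_sample_size(weights):
    """Liu-style effective-sample-size estimate: T samples with very uneven
    weights carry about as much information as ESS << T equally weighted ones."""
    s = sum(weights)
    return s * s / sum(w * w for w in weights)
```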

SLIDE 27

Generating samples from Q

  • No restrictions on “how to”
  • Typically, express Q in product form:

– Q(Z) = Q(Z1) × Q(Z2 | Z1) × ... × Q(Zn | Z1, ..., Zn-1)

  • Sample along the order Z1,..,Zn
  • Example:

– Z1Q(Z1)=(0.2,0.8) – Z2 Q(Z2|Z1)=(0.1,0.9,0.2,0.8) – Z3 Q(Z3|Z1,Z2)=Q(Z3)=(0.5,0.5)

SLIDE 28

Outline

  • Definitions and Background on Statistics
  • Theory of importance sampling
  • Likelihood weighting
  • State-of-the-art importance sampling techniques

SLIDE 29

Likelihood Weighting

(Fung and Chang, 1990; Shachter and Peot, 1990)

“Clamping” evidence + logic sampling + weighting samples by the evidence likelihood: an instance of importance sampling!

Works well for likely evidence!

SLIDE 30

Likelihood Weighting: Sampling

[Figure: a Bayesian network with several evidence nodes e]

Sample in topological order over X! Clamp the evidence; sample xi ∼ P(Xi | pai). P(Xi | pai) is a look-up in the CPT!

SLIDE 31

Likelihood Weighting: Proposal Distribution

Example: Given a Bayesian network P(X1, X2, X3) = P(X1) P(X2 | X1) P(X3 | X1, X2) and evidence X2 = x2, the proposal is Q(X1, X3) = P(X1) P(X3 | X1, x2).

In general:

$$Q(X \setminus E) = \prod_{X_i \in X \setminus E} P(X_i \mid pa_i)\Big|_{E=e}$$

Weights: given a sample $x = (x_1, \ldots, x_n)$,

$$w(x) = \frac{P(x, e)}{Q(x)} = \frac{\prod_{X_i \in X \setminus E} P(x_i \mid pa_i, e) \; \prod_{E_j \in E} P(e_j \mid pa_j)}{\prod_{X_i \in X \setminus E} P(x_i \mid pa_i, e)} = \prod_{E_j \in E} P(e_j \mid pa_j)$$

Notice: Q is another Bayesian network.

SLIDE 32

Likelihood Weighting: Estimates

Estimate P(e):

$$\hat{P}(e) = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$$

Estimate posterior marginals (see the sketch below):

$$\hat{P}(x_i \mid e) = \frac{\hat{P}(x_i, e)}{\hat{P}(e)} = \frac{\sum_{t=1}^{T} w^{(t)} \, g_{x_i}(x^{(t)})}{\sum_{t=1}^{T} w^{(t)}}, \qquad g_{x_i}(x^{(t)}) = \begin{cases} 1 & \text{if } x_i \in x^{(t)} \\ 0 & \text{otherwise} \end{cases}$$
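A hedged end-to-end sketch of likelihood weighting over the same assumed (name, parents, cpt) layout as the earlier logic-sampling code; evidence variables are clamped rather than sampled, and each sample is weighted by the likelihood of its clamped values:

```python
import random

def likelihood_weighting(nodes, evidence, T):
    """Likelihood weighting. Returns (P(e) estimate, weighted samples)."""
    weighted = []
    for _ in range(T):
        assignment, w = {}, 1.0
        for name, parents, cpt in nodes:              # topological order
            probs = cpt[tuple(assignment[p] for p in parents)]
            if name in evidence:                      # clamp evidence node ...
                assignment[name] = evidence[name]
                w *= probs[evidence[name]]            # ... weight by P(e_j | pa_j)
            else:
                assignment[name] = random.choices(range(len(probs)), weights=probs)[0]
        weighted.append((assignment, w))
    p_e = sum(w for _, w in weighted) / T             # estimate of P(e)
    return p_e, weighted

def posterior_marginal(weighted, var, value):
    """Ratio estimate of P(var = value | e) from the weighted samples."""
    den = sum(w for _, w in weighted)
    num = sum(w for a, w in weighted if a[var] == value)
    return num / den
```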

SLIDE 33

Likelihood Weighting

  • Converges to exact posterior marginals
  • Generates samples fast
  • Sampling distribution is close to the prior (especially if E ⊆ leaf nodes)
  • Increasing sampling variance:

– Convergence may be slow
– Many samples with P(x^(t)) = 0 are rejected

SLIDE 34

Outline

  • Definitions and Background on Statistics
  • Theory of importance sampling
  • Likelihood weighting
  • Error estimation
  • State-of-the-art importance sampling techniques

SLIDE 35
SLIDE 36
SLIDE 37
SLIDE 38

[Figure-only slides: error-estimation plots]

SLIDE 39

Outline

  • Definitions and Background on Statistics
  • Theory of importance sampling
  • Likelihood weighting
  • State-of-the-art importance sampling techniques

SLIDE 40

Proposal selection

  • One should try to select a proposal that is as close as possible to the posterior distribution.
  • For the estimator to have zero variance:

$$Var_Q[\hat{P}(e)] = \frac{Var_Q[w(z)]}{T} = 0 \iff \frac{P(z, e)}{Q(z)} = P(e) \;\; \forall z \iff Q(z) = \frac{P(z, e)}{P(e)} = P(z \mid e)$$

SLIDE 41

Perfect sampling using Bucket Elimination

  • Algorithm:

– Run bucket elimination on the problem along an ordering o = (XN, ..., X1).
– Sample along the reverse ordering (X1, ..., XN).
– At each variable Xi, recover the probability P(Xi | x1, ..., xi-1) by referring to the bucket.

SLIDE 42

Bucket Elimination

Query: $P(a \mid e) = \alpha \, P(a, e)$. Elimination order: d, e, b, c.

$$P(a, e) = \sum_{c, b, e, d} P(a) \, P(c \mid a) \, P(b \mid a) \, P(d \mid a, b) \, P(e \mid b, c) = P(a) \sum_{c} P(c \mid a) \sum_{b} P(b \mid a) \sum_{e} P(e \mid b, c) \sum_{d} P(d \mid a, b)$$

Buckets and messages:

D: P(d | a, b) → f_D(a, b) = Σ_d P(d | a, b)
E: P(e | b, c) → f_E(b, c) = P(e | b, c) (evidence E = e)
B: P(b | a), f_D(a, b), f_E(b, c) → f_B(a, c) = Σ_b P(b | a) f_D(a, b) f_E(b, c)
C: P(c | a), f_B(a, c) → f_C(a) = Σ_c P(c | a) f_B(a, c)
A: P(a), f_C(a) → P(a, e) = P(a) f_C(a)

[Figure: bucket tree over A, B, C, D, E showing original functions and messages]

Time and space exp(w*).

SLIDE 43

Bucket Elimination (BE)

Algorithm elim-bel (Dechter 1996); $\sum_b$ is the elimination operator.

bucket B: P(B | A), P(D | B, A), P(e | B, C)
bucket C: P(C | A), h_B(A, D, C, e)
bucket D: h_C(A, D, e)
bucket E: h_D(A, e)
bucket A: P(A), h_E(A)

Result: P(e).

SLIDE 44

Sampling from the output of BE (Dechter 2002)

bucket B: P(B | A), P(D | B, A), P(e | B, C) → h_B(A, D, C, e)
bucket C: P(C | A), h_B(A, D, C, e) → h_C(A, D, e)
bucket D: h_C(A, D, e) → h_D(A, e)
bucket E: h_D(A, e) (evidence bucket: ignore) → h_E(A)
bucket A: P(A), h_E(A)

Sample top-down:
1. Sample A = a: Q(A) ∝ P(A) · h_E(A)
2. Evidence bucket: ignore.
3. Set A = a in bucket D; sample D = d: Q(D | a, e) ∝ h_C(a, D, e)
4. Set A = a, D = d in bucket C; sample C = c: Q(C | a, d, e) ∝ P(C | a) · h_B(a, d, C, e)
5. Set A = a, D = d, C = c in bucket B; sample B = b: Q(B | a, d, c, e) ∝ P(B | a) · P(d | B, a) · P(e | B, c)

SLIDE 45

Mini-buckets: “local inference”

  • Computation in a bucket is time and space exponential in the number of variables involved.
  • Therefore, partition the functions in a bucket into “mini-buckets”, each over a smaller number of variables.
  • The size of each “mini-bucket” can be controlled, yielding polynomial complexity.

SLIDE 46

Mini-Bucket Elimination

Space and time constraints: the maximum scope size of any newly generated function must be bounded (here by 2). BE would generate a function of scope size 3, so it cannot be used.

bucket B: split into mini-buckets {P(B | A), P(D | B, A)} and {P(e | B, C)}; apply $\sum_B$ to each separately → h_B(A, D) and h_B(C, e)
bucket C: P(C | A), h_B(C, e) → h_C(A, e)
bucket D: h_B(A, D) → h_D(A)
bucket E: h_C(A, e) → h_E(A)
bucket A: P(A), h_D(A), h_E(A) → approximation of P(e)

SLIDE 47

Sampling from the output of MBE

Sampling is the same as in BE-sampling, except that Q is now constructed from a randomly selected “mini-bucket”.

SLIDE 48

IJGP-Sampling (Gogate and Dechter, 2005)

  • Iterative Join Graph Propagation (IJGP)

– A Generalized Belief Propagation scheme (Yedidia et al., 2002)

  • IJGP yields better approximations of P(X|E) than MBE (Dechter, Kask and Mateescu, 2002)
  • The output of IJGP is the same as the mini-bucket “clusters”

  • Currently the best performing IS scheme!
SLIDE 49

Current Research question

  • Given a Bayesian network with evidence, or a Markov network, representing a function P, generate another Bayesian network representing a function Q (from a family of distributions restricted by structure) such that Q is closest to P.
  • Current approaches:

– Mini-buckets
– IJGP
– Both

  • Evaluated experimentally, but still in need of theoretical justification.

SLIDE 50

Algorithm: Approximate Sampling

1) Run IJGP or MBE
2) At each branch point, compute the edge probabilities by consulting the output of IJGP or MBE

  • Rejection problem:

– Some of the generated assignments are non-solutions

SLIDE 51

Adaptive Importance Sampling

Initial proposal: $Q^1(Z) = Q(Z_1) \, Q(Z_2 \mid pa(Z_2)) \cdots Q(Z_n \mid pa(Z_n))$

For i = 1 to k do:
– Generate samples $z^1, \ldots, z^N$ from $Q^i$
– Estimate $\hat{P}^i(E = e) = \frac{1}{N} \sum_{j=1}^{N} w(z^j)$
– Update the proposal: build $Q^{i+1}$ from $Q^i$ and the estimated $Q'$ (next slides)
End for
Return $\hat{P}^k(E = e)$

SLIDE 52

Adaptive Importance Sampling

  • General case: given k proposal distributions
  • Take N samples from each distribution
  • Approximate P(e) by averaging the per-proposal average weights (sketched below):

$$\hat{P}(e) = \frac{1}{k} \sum_{j=1}^{k} \big( \text{avg weight of the } j\text{th proposal} \big)$$
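A loose sketch of the adaptive loop; the proposal object's sample/density/update interface is an illustrative assumption, since the slides leave the update rule abstract:

```python
def adaptive_importance_sampling(joint, q, k, N):
    """k rounds of importance sampling with an adapting proposal.
    `joint(z)` returns P(z, e); `q` is assumed to expose .sample(),
    .density(z), and .update(weighted_samples) -> new proposal.
    This interface is illustrative, not fixed by the slides."""
    round_estimates = []
    for _ in range(k):
        weighted = []
        for _ in range(N):
            z = q.sample()
            weighted.append((z, joint(z) / q.density(z)))  # w = P(z,e)/Q(z)
        round_estimates.append(sum(w for _, w in weighted) / N)
        q = q.update(weighted)          # refit Q toward the posterior
    # Slide 52: average the per-proposal average weights
    return sum(round_estimates) / k
```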

SLIDE 53

Estimating Q'(z)

$$Q'(Z) = Q'(Z_1) \, Q'(Z_2 \mid pa(Z_2)) \cdots Q'(Z_n \mid pa(Z_n))$$

where each $Q'(Z_i \mid Z_1, \ldots, Z_{i-1})$ is estimated by importance sampling.

SLIDE 54

Overview

  • 1. Probabilistic Reasoning/Graphical models
  • 2. Importance Sampling
  • 3. Markov Chain Monte Carlo: Gibbs Sampling
  • 4. Sampling in presence of Determinism
  • 5. Rao-Blackwellisation
  • 6. AND/OR importance sampling