

slide-1
SLIDE 1

Inference in Bayesian Networks

CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2019

Soleymani

Slides are based on Klein and Abbeel, CS188, UC Berkeley.

slide-2
SLIDE 2

Bayes’ Nets

} Representation } Conditional Independences } Probabilistic Inference

} Enumeration (exact, exponential complexity) } Variable elimination (exact, worst-case

exponential complexity, often better)

} Probabilistic inference is NP-complete } Sampling (approximate)

} Learning Bayes’ Nets from Data

2

slide-3
SLIDE 3

Recap: Bayes’ Net Representation

} A directed, acyclic graph, one node per random

variable

} A conditional probability table (CPT) for each node

}

A collection of distributions over X, one for each combination of parents’ values

} Bayes’ nets implicitly encode joint distributions

}

As a product of local conditional distributions

}

To see what probability a BN gives to a full assignment, multiply all the relevant conditionals together:

3

slide-4
SLIDE 4

Example: Alarm Network

Nodes: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls (Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls)

B     P(B)
+b    0.001
-b    0.999

E     P(E)
+e    0.002
-e    0.998

B   E   A     P(A|B,E)
+b  +e  +a    0.95
+b  +e  -a    0.05
+b  -e  +a    0.94
+b  -e  -a    0.06
-b  +e  +a    0.29
-b  +e  -a    0.71
-b  -e  +a    0.001
-b  -e  -a    0.999

A   J     P(J|A)
+a  +j    0.9
+a  -j    0.1
-a  +j    0.05
-a  -j    0.95

A   M     P(M|A)
+a  +m    0.7
+a  -m    0.3
-a  +m    0.01
-a  -m    0.99

[Demo: BN Applet]

4

slide-5
SLIDE 5

Video of Demo BN Applet

5

slide-6
SLIDE 6

Example: Alarm Network

(Same network: B and E are parents of A; A is the parent of J and M. All CPTs as on Slide 4: P(B), P(E), P(A|B,E), P(J|A), P(M|A).)

B E A M J

6

slide-7
SLIDE 7

Example: Alarm Network

(Same network and CPTs as on Slide 4: P(B), P(E), P(A|B,E), P(J|A), P(M|A).)

B E A M J

7

slide-8
SLIDE 8

Bayes’ Nets

}

Representation

}

Conditional Independences

}

Probabilistic Inference

}

Enumeration (exact, exponential complexity)

}

Variable elimination (exact, worst-case exponential complexity, often better)

}

Inference is NP-complete

}

Sampling (approximate)

}

Learning Bayes’ Nets from Data

8

slide-9
SLIDE 9

Inference

} Inference: calculating some useful quantity from a joint probability distribution

§ Examples:
§ Posterior probability
§ Most likely explanation:

9

slide-10
SLIDE 10

Inference by Enumeration

}

General case:

}

Evidence variables:

}

Query* variable:

}

Hidden variables:

All variables

* Works fine with multiple query variables, too

§ We want:
§ Step 1: Select the entries consistent with the evidence
§ Step 2: Sum out H to get joint of Query and evidence
§ Step 3: Normalize (multiply by 1/Z)

10

slide-11
SLIDE 11

Inference by Enumeration in Bayes’ Net

} Given unlimited time, inference in BNs is easy
} Reminder of inference by enumeration by example:

(Alarm network: B, E, A, J, M)

P(B | +j, +m) ∝_B P(B, +j, +m)
= ∑_{e,a} P(B, e, a, +j, +m)
= ∑_{e,a} P(B) P(e) P(a|B,e) P(+j|a) P(+m|a)
= P(B)P(+e)P(+a|B,+e)P(+j|+a)P(+m|+a) + P(B)P(+e)P(−a|B,+e)P(+j|−a)P(+m|−a)
  + P(B)P(−e)P(+a|B,−e)P(+j|+a)P(+m|+a) + P(B)P(−e)P(−a|B,−e)P(+j|−a)P(+m|−a)
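Below is a minimal Python sketch of exactly this enumeration, using the CPT numbers from the alarm-network slides. The dictionary representation and the function name joint are illustrative choices, not from the slides.

```python
# Inference by enumeration for P(B | +j, +m) in the alarm network.
# True stands for '+', False for '-'.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(+a | B, E)
P_J = {True: 0.9, False: 0.05}                        # P(+j | A)
P_M = {True: 0.7, False: 0.01}                        # P(+m | A)

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) as a product of local conditional distributions."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# Sum out the hidden variables e and a, then normalize over B.
unnorm = {b: sum(joint(b, e, a, True, True)
                 for e in (True, False) for a in (True, False))
          for b in (True, False)}
Z = sum(unnorm.values())
posterior = {b: p / Z for b, p in unnorm.items()}
print(posterior)   # P(+b | +j, +m) is roughly 0.284
```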

11

slide-12
SLIDE 12

Burglary example: full joint probability

12

P(b | j, ¬m) = P(j, ¬m, b) / P(j, ¬m)

= [∑_e ∑_a P(j, ¬m, b, a, e)] / [∑_b ∑_e ∑_a P(j, ¬m, b, a, e)]

= [∑_e ∑_a P(j|a) P(¬m|a) P(a|b,e) P(b) P(e)] / [∑_b ∑_e ∑_a P(j|a) P(¬m|a) P(a|b,e) P(b) P(e)]

Short-hands: j: JohnCalls = True, ¬b: Burglary = False, …

slide-13
SLIDE 13

Inference by Enumeration?

P(Antilock|observed variables) = ?

13

slide-14
SLIDE 14

Factor Zoo

14

slide-15
SLIDE 15

Factor Zoo I

} Joint distribution: P(X,Y)

}

Entries P(x,y) for all x, y

}

Sums to 1

} Selected joint: P(x,Y)

}

A slice of the joint distribution

}

Entries P(x,y) for fixed x, all y

}

Sums to P(x)

} Number of capitals = dimensionality of the table

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

T     W     P
cold  sun   0.2
cold  rain  0.3

15

slide-16
SLIDE 16

Factor Zoo II

} Single conditional: P(Y | x)

}

Entries P(y | x) for fixed x, all y

}

Sums to 1

} Family of conditionals:

P(X |Y)

}

Multiple conditionals

}

Entries P(x | y) for all x, y

}

Sums to |Y|

T     W     P
hot   sun   0.8
hot   rain  0.2
cold  sun   0.4
cold  rain  0.6

T     W     P
cold  sun   0.4
cold  rain  0.6

16

slide-17
SLIDE 17

Factor Zoo III

} Specified family: P( y | X )

}

Entries P(y | x) for fixed y, but for all x

}

Sums to … who knows!

T     W     P
hot   rain  0.2
cold  rain  0.6

17

slide-18
SLIDE 18

Factor Zoo Summary

§ In general, when we write P(Y1 … YN | X1 … XM)

§ It is a “factor,” a multi-dimensional array § Its values are P(y1 … yN | x1 … xM) § Any assigned (=lower-case) X or Y is a dimension missing (selected) from the array

18

slide-19
SLIDE 19

Example: Traffic Domain

} Random Variables
} R: Raining
} T: Traffic
} L: Late for class!

(Network: R → T → L)

R     P(R)
+r    0.1
-r    0.9

R   T     P(T|R)
+r  +t    0.8
+r  -t    0.2
-r  +t    0.1
-r  -t    0.9

T   L     P(L|T)
+t  +l    0.3
+t  -l    0.7
-t  +l    0.1
-t  -l    0.9

P(L) = ?
= ∑_{r,t} P(r, t, L)
= ∑_{r,t} P(r) P(t|r) P(L|t)

19

slide-20
SLIDE 20

Inference by Enumeration: Procedural Outline

} Track objects called factors
} Initial factors are local CPTs (one per node)
} Any known values are selected
} E.g. if we know L = +l, the initial factors are the CPTs with that evidence selected
} Procedure: Join all factors, then eliminate all hidden variables

Initial factors (no evidence):

R     P(R)
+r    0.1
-r    0.9

R   T     P(T|R)
+r  +t    0.8
+r  -t    0.2
-r  +t    0.1
-r  -t    0.9

T   L     P(L|T)
+t  +l    0.3
+t  -l    0.7
-t  +l    0.1
-t  -l    0.9

With evidence L = +l selected:

R     P(R)
+r    0.1
-r    0.9

R   T     P(T|R)
+r  +t    0.8
+r  -t    0.2
-r  +t    0.1
-r  -t    0.9

T   L     P(+l|T)
+t  +l    0.3
-t  +l    0.1

20

slide-21
SLIDE 21

Operation 1: Join Factors

} First basic operation: joining factors } Combining factors:

}

Just like a database join

}

Get all factors over the joining variable

}

Build a new factor over the union of the variables involved

} Example: Join on R

}

Computation for each entry: pointwise products

R     P(R)
+r    0.1
-r    0.9

R   T     P(T|R)
+r  +t    0.8
+r  -t    0.2
-r  +t    0.1
-r  -t    0.9

R   T     P(R,T)
+r  +t    0.08
+r  -t    0.02
-r  +t    0.09
-r  -t    0.81

(Factors P(R) and P(T|R) are joined into the factor P(R,T).)
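A small Python sketch of the join just shown on the two traffic factors; the tuple-keyed dictionary representation is an assumption for illustration only.

```python
# Join P(R) and P(T | R) into a single factor over (R, T) by pointwise product.
P_R = {'+r': 0.1, '-r': 0.9}
P_T_given_R = {('+r', '+t'): 0.8, ('+r', '-t'): 0.2,
               ('-r', '+t'): 0.1, ('-r', '-t'): 0.9}

# Each joined entry is P(r, t) = P(r) * P(t | r).
P_RT = {(r, t): P_R[r] * P_T_given_R[(r, t)] for (r, t) in P_T_given_R}
print(P_RT)   # entries 0.08, 0.02, 0.09, 0.81 (up to float rounding)
```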

21

slide-22
SLIDE 22

Example: Multiple Joins

22

slide-23
SLIDE 23

Example: Multiple Joins

Join R: P(R) × P(T|R) → P(R,T)

R     P(R)
+r    0.1
-r    0.9

R   T     P(T|R)
+r  +t    0.8
+r  -t    0.2
-r  +t    0.1
-r  -t    0.9

T   L     P(L|T)
+t  +l    0.3
+t  -l    0.7
-t  +l    0.1
-t  -l    0.9

R   T     P(R,T)
+r  +t    0.08
+r  -t    0.02
-r  +t    0.09
-r  -t    0.81

Join T: P(R,T) × P(L|T) → P(R,T,L)

R   T   L     P(R,T,L)
+r  +t  +l    0.024
+r  +t  -l    0.056
+r  -t  +l    0.002
+r  -t  -l    0.018
-r  +t  +l    0.027
-r  +t  -l    0.063
-r  -t  +l    0.081
-r  -t  -l    0.729

23

slide-24
SLIDE 24

Operation 2: Eliminate

} Second basic operation: marginalization } Take a factor and sum out a variable

}

Shrinks a factor to a smaller one

}

A projection operation

} Example: summing out R from P(R,T)

R   T     P(R,T)
+r  +t    0.08
+r  -t    0.02
-r  +t    0.09
-r  -t    0.81

T     P(T)
+t    0.17
-t    0.83
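A matching Python sketch of this elimination step; again the dictionary factor representation is only illustrative.

```python
# Sum out R from the joined factor P(R, T) to get P(T).
P_RT = {('+r', '+t'): 0.08, ('+r', '-t'): 0.02,
        ('-r', '+t'): 0.09, ('-r', '-t'): 0.81}

P_T = {}
for (r, t), p in P_RT.items():
    P_T[t] = P_T.get(t, 0.0) + p   # marginalize: add up entries that agree on T
print(P_T)   # approximately {'+t': 0.17, '-t': 0.83}
```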

24

slide-25
SLIDE 25

Multiple Elimination

Sum out R: P(R,T,L) → P(T,L)
Sum out T: P(T,L) → P(L)

R   T   L     P(R,T,L)
+r  +t  +l    0.024
+r  +t  -l    0.056
+r  -t  +l    0.002
+r  -t  -l    0.018
-r  +t  +l    0.027
-r  +t  -l    0.063
-r  -t  +l    0.081
-r  -t  -l    0.729

T   L     P(T,L)
+t  +l    0.051
+t  -l    0.119
-t  +l    0.083
-t  -l    0.747

L     P(L)
+l    0.134
-l    0.866

25

slide-26
SLIDE 26

Thus Far: Multiple Join, Multiple Eliminate (= Inference by Enumeration)

26

slide-27
SLIDE 27

Inference by Enumeration vs. Variable Elimination

} Why is inference by enumeration so slow?

}

You join up the whole joint distribution before you sum out the hidden variables

§ Idea: interleave joining and marginalizing!

§ Called “Variable Elimination” § Still NP-hard, but usually much faster than inference by enumeration § First we’ll need some new notation: factors

27

slide-28
SLIDE 28

Traffic Domain

} Inference by Enumeration

(Network: R → T → L; query P(L) = ?)

P(L) = ∑_t ∑_r P(L|t) P(r) P(t|r)
  [join on r, join on t, then eliminate r, eliminate t]

§ Variable Elimination

P(L) = ∑_t P(L|t) ∑_r P(r) P(t|r)
  [join on r, eliminate r, then join on t, eliminate t]

28

slide-29
SLIDE 29

Marginalizing Early (= Variable Elimination)

29

slide-30
SLIDE 30

Marginalizing Early! (aka VE)

Start (network R → T → L):

R     P(R)
+r    0.1
-r    0.9

R   T     P(T|R)
+r  +t    0.8
+r  -t    0.2
-r  +t    0.1
-r  -t    0.9

T   L     P(L|T)
+t  +l    0.3
+t  -l    0.7
-t  +l    0.1
-t  -l    0.9

Join R:

R   T     P(R,T)
+r  +t    0.08
+r  -t    0.02
-r  +t    0.09
-r  -t    0.81

(P(L|T) unchanged)

Sum out R:

T     P(T)
+t    0.17
-t    0.83

(P(L|T) unchanged)

Join T:

T   L     P(T,L)
+t  +l    0.051
+t  -l    0.119
-t  +l    0.083
-t  -l    0.747

Sum out T:

L     P(L)
+l    0.134
-l    0.866

30

slide-31
SLIDE 31

Evidence

} If evidence, start with factors that select that evidence
} No evidence uses these initial factors:

R     P(R)
+r    0.1
-r    0.9

R   T     P(T|R)
+r  +t    0.8
+r  -t    0.2
-r  +t    0.1
-r  -t    0.9

T   L     P(L|T)
+t  +l    0.3
+t  -l    0.7
-t  +l    0.1
-t  -l    0.9

} Computing P(L | +r), the initial factors become:

R     P(+r)
+r    0.1

R   T     P(T|+r)
+r  +t    0.8
+r  -t    0.2

T   L     P(L|T)
+t  +l    0.3
+t  -l    0.7
-t  +l    0.1
-t  -l    0.9

} We eliminate all vars other than query + evidence

31

slide-32
SLIDE 32

Evidence II

} Result will be a selected joint of query and evidence
} E.g. for P(L | +r), we would end up with:

R   L     P(+r, L)
+r  +l    0.026
+r  -l    0.074

Normalize:

L     P(L | +r)
+l    0.26
-l    0.74

} To get our answer, just normalize this!
} That's it!

32

slide-33
SLIDE 33

General Variable Elimination

} Query: } Start with initial factors:

}

Local CPTs (but instantiated by evidence)

} While there are still hidden variables (not Q or evidence):

}

Pick a hidden variable H

}

Join all factors mentioning H

}

Eliminate (sum out) H

} Join all remaining factors and normalize
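Below is a minimal Python sketch of this loop, run on the traffic network from the earlier slides. The (vars, table) factor representation and the names join, sum_out, and variable_elimination are illustrative choices, not from the slides.

```python
from itertools import product

# A factor is a pair (vars, table): vars is a tuple of variable names and
# table maps a tuple of values (one '+'/'-' per variable) to a number.
def join(f1, f2):
    """Pointwise product over the union of the two factors' variables."""
    (v1, t1), (v2, t2) = f1, f2
    vs = v1 + tuple(x for x in v2 if x not in v1)
    table = {}
    for vals in product('+-', repeat=len(vs)):
        a = dict(zip(vs, vals))
        table[vals] = (t1[tuple(a[x] for x in v1)] *
                       t2[tuple(a[x] for x in v2)])
    return vs, table

def sum_out(h, factor):
    """Eliminate (sum out) variable h from a factor."""
    vs, t = factor
    keep = tuple(x for x in vs if x != h)
    out = {}
    for vals, p in t.items():
        key = tuple(v for x, v in zip(vs, vals) if x != h)
        out[key] = out.get(key, 0.0) + p
    return keep, out

def variable_elimination(factors, hidden):
    """For each hidden variable: join all factors mentioning it, then sum it out."""
    factors = list(factors)
    for h in hidden:
        mentioning = [f for f in factors if h in f[0]]
        if not mentioning:
            continue
        rest = [f for f in factors if h not in f[0]]
        joined = mentioning[0]
        for f in mentioning[1:]:
            joined = join(joined, f)
        factors = rest + [sum_out(h, joined)]
    result = factors[0]
    for f in factors[1:]:
        result = join(result, f)
    return result   # un-normalized; divide by its sum to get P(Q | evidence)

# Traffic example, query P(L): eliminate R, then T.
P_R = (('R',), {('+',): 0.1, ('-',): 0.9})
P_T = (('R', 'T'), {('+', '+'): 0.8, ('+', '-'): 0.2, ('-', '+'): 0.1, ('-', '-'): 0.9})
P_L = (('T', 'L'), {('+', '+'): 0.3, ('+', '-'): 0.7, ('-', '+'): 0.1, ('-', '-'): 0.9})
print(variable_elimination([P_R, P_T, P_L], ['R', 'T']))   # ~ {('+',): 0.134, ('-',): 0.866}
```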

33

slide-34
SLIDE 34

Distribution of products on sums

34

} Exploiting the factorization properties to allow sums and

products to be interchanged

} a×b + a×c needs three operations, while a×(b + c) requires only two

slide-35
SLIDE 35

Variable elimination: example

35

P(b, j) = ∑_e ∑_a ∑_m P(b) P(e) P(a|b,e) P(j|a) P(m|a)

= P(b) ∑_e P(e) ∑_a P(a|b,e) P(j|a) ∑_m P(m|a)

P(b | j) ∝ P(b, j)

Intermediate results are probability distributions

slide-36
SLIDE 36

Variable elimination: example

36

P(B, j) = ∑_e ∑_a ∑_m P(B) P(e) P(a|B,e) P(j|a) P(m|a)

= P(B) ∑_e P(e) ∑_a P(a|B,e) P(j|a) ∑_m P(m|a)

Initial factors: g1(B) = P(B), g2(e) = P(e), g3(a,B,e) = P(a|B,e), g5(a,m) = P(m|a), g6(a) = P(j|a)

g4(a) = ∑_m g5(a,m)   (the all-ones vector [1, 1])
g7(B,e) = ∑_a g3(a,B,e) × g4(a) × g6(a)
g8(B) = ∑_e g2(e) × g7(B,e)

P(B | j) ∝ P(B, j)

Intermediate results are probability distributions

slide-37
SLIDE 37

Variable elimination: Order of summations

37

} An inefficient order:

P(B, j) = ∑_m ∑_e ∑_a P(B) P(e) P(a|B,e) P(j|a) P(m|a)

= P(B) ∑_m ∑_e P(e) ∑_a P(a|B,e) P(j|a) P(m|a)

The product inside the innermost sum is now a factor g(a, B, e, m) over four variables.

slide-38
SLIDE 38

Variable elimination: Pruning irrelevant variables

38

} Any variable that is not an ancestor of a query variable or

evidence variable is irrelevant to the query.

} Prune all non-ancestors of query or evidence variables, e.g., for P(b, j)

(Figure: Burglary, Earthquake, Alarm, JohnCalls = True, MaryCalls, plus additional nodes X, Y, Z.)

slide-39
SLIDE 39

Variable elimination algorithm

39

} Given: a BN, evidence e, and a query P(Q | e)
} Prune non-ancestors of the query and evidence variables
} Choose an ordering on the variables, e.g., Y1, …, Yn
} For i = 1 to n, if Yi is not a query or evidence variable:
  } Collect the factors g1, …, gk that include Yi
  } Generate a new factor by eliminating Yi from these factors:
    h = ∑_{yi} ∏_{j=1}^{k} gj
  (After this summation, Yi is eliminated)
} Multiply all remaining factors
} Normalize P(Q, e) to obtain P(Q | e)

slide-40
SLIDE 40

Variable elimination algorithm

40

  • Evaluating expressions in a proper order
  • Storing intermediate results
  • Summation only for those portions of the expression that depend on that variable

} Given: a BN, evidence e, and a query P(Q | e)
} Prune non-ancestors of the query and evidence variables
} Choose an ordering on the variables, e.g., Y1, …, Yn
} For i = 1 to n, if Yi is not a query or evidence variable:
  } Collect the factors g1, …, gk that include Yi
  } Generate a new factor by eliminating Yi from these factors:
    h = ∑_{yi} ∏_{j=1}^{k} gj
} Normalize P(Q, e) to obtain P(Q | e)

slide-41
SLIDE 41

Variable elimination

41

} Eliminates non-observed, non-query variables one by one by summation, distributing the sum over the product
} Complexity is determined by the size of the largest factor
} Variable elimination can lead to significant cost savings, but its efficiency depends on the network structure
} There are still cases in which this algorithm leads to exponential time

slide-42
SLIDE 42

Improvement reasons

42

} Computing an expression of the form (sum-product inference):

  ∑_Z ∏_{φ∈Φ} φ

} We used the structure of the BN to factorize the joint distribution, so the scope of the resulting factors is limited.

} Distributive law: if X ∉ Scope(φ1), then ∑_X φ1·φ2 = φ1·∑_X φ2

} Perform the summations over the product of only a subset of factors

} We find sub-expressions that can be computed once, and then save and reuse them in later computations
  } Instead of computing them exponentially many times

Φ: the set of factors

slide-43
SLIDE 43

Example

Choose A

43

slide-44
SLIDE 44

Example (Cont.)

Choose E Finish with B Normalize

44

slide-45
SLIDE 45

Same Example in Equations

} marginal can be obtained from joint by summing out
} use Bayes' net joint distribution expression
} use x*(y+z) = xy + xz; joining on a, and then summing out, gives f1
} use x*(y+z) = xy + xz; joining on e, and then summing out, gives f2

45

slide-46
SLIDE 46

Inference on a chain

46

P(d) = ∑_a ∑_b ∑_c P(a, b, c, d)

P(d) = ∑_a ∑_b ∑_c P(a) P(b|a) P(c|b) P(d|c)

} A naïve summation needs to enumerate over an exponential number of terms

(Chain: A → B → C → D)

slide-47
SLIDE 47

Inference on a chain: marginalization and elimination

47

P(d) = ∑_a ∑_b ∑_c P(a) P(b|a) P(c|b) P(d|c)

= ∑_c ∑_b ∑_a P(a) P(b|a) P(c|b) P(d|c)

= ∑_c P(d|c) ∑_b P(c|b) ∑_a P(a) P(b|a)

} In a chain of n nodes each having k values, this costs O(nk²) instead of O(kⁿ)

(Intermediate factors g(b) = ∑_a P(a) P(b|a) and g(c) = ∑_b P(c|b) g(b); chain A → B → C → D)
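A small Python sketch of this message-passing view of chain elimination. The Dirichlet-sampled CPTs are placeholders just to make it runnable; the chain A → B → C → D matches the slide.

```python
import numpy as np

# Chain A -> B -> C -> D, each variable with k values.
# prior[i] = P(A=i); cpts[t][i, j] = P(next = j | current = i).
k, rng = 3, np.random.default_rng(0)
prior = rng.dirichlet(np.ones(k))
cpts = [rng.dirichlet(np.ones(k), size=k) for _ in range(3)]

# Forward elimination: repeatedly sum out the earliest remaining variable.
g = prior
for cpt in cpts:
    g = g @ cpt          # g[j] = sum_i g[i] * P(next=j | current=i): an O(k^2) step
print(g, g.sum())        # marginal of the last node; sums to 1
```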

slide-48
SLIDE 48

Wumpus example

48

evidence = ¬b1,1 ∧ b1,2 ∧ b2,1 ∧ ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1

P(P1,3 | evidence) = ?

slide-49
SLIDE 49

Wumpus example

49

Possible worlds with P1,3 = true

Possible worlds with P1,3 = false

P(P1,3 = true | evidence) ∝ 0.2 × (0.2×0.2 + 0.2×0.8 + 0.8×0.2)

P(P1,3 = false | evidence) ∝ 0.8 × (0.2×0.2 + 0.2×0.8)

⇒ P(P1,3 = true | evidence) = 0.31

slide-50
SLIDE 50

Another Variable Elimination Example

Computational complexity critically depends on the largest factor being generated in this process. Size of factor = number of entries in table. In example above (assuming binary) all factors generated are of size 2 --- as they all only have one variable (Z, Z, and X3 respectively). 50

slide-51
SLIDE 51

Variable Elimination Ordering

} For the query P(Xn | y1, …, yn), work through the following two different orderings as done in the previous slide: Z, X1, …, Xn-1 and X1, …, Xn-1, Z. What is the size of the maximum factor generated for each of the orderings?

} Answer: 2^(n+1) versus 2^2 (assuming binary variables)

} In general: the ordering can greatly affect efficiency.

51

slide-52
SLIDE 52

VE: Computational and Space Complexity

} The computational and space complexity of variable elimination is determined by the largest factor
} The elimination ordering can greatly affect the size of the largest factor.
  } E.g., previous slide's example: 2^(n+1) vs. 2^2
} Does there always exist an ordering that only results in small factors?
  } No!

52

slide-53
SLIDE 53

Variable elimination algorithm

53

} Sum out each variable one at a time
  } All factors containing that variable are (removed from the set of factors and) multiplied to generate a product factor
  } The variable is summed out from the generated product factor and a new factor is obtained
  } The new factor is added to the set of the available factors

The resulting factor does not necessarily correspond to any probability or conditional probability in the network

slide-54
SLIDE 54

54

Procedure Sum-Product-VE(Z, G)
// Z: the variables to be eliminated
Φ ← all factors of G
Select an elimination order Z1, …, ZL for Z
for j = 1, …, L:
  Φ ← Sum-Product-Elim-Var(Φ, Zj)
φ* ← ∏_{φ∈Φ} φ
Return φ*

Procedure Sum-Product-Elim-Var(Φ, Z)
Φ′ ← {φ ∈ Φ : Z ∈ Scope(φ)}
Φ″ ← Φ − Φ′
ψ ← ∑_Z ∏_{φ∈Φ′} φ
return Φ″ ∪ {ψ}

  • Move all factors irrelevant to the variable that must be eliminated now outside of the summation
  • Perform the sum, getting a new term
  • Insert the new term into the product

It does not need normalization when we have no evidence

slide-55
SLIDE 55

55

Procedure Cond-Prob-VE(
  𝒦,      // the network over X
  Y,      // set of query variables
  E = e   // evidence
)
Φ ← the factors parametrizing 𝒦
Replace each φ ∈ Φ by φ[E = e]
Select an elimination order Z1, …, ZL for Z = X − Y − E
for j = 1, …, L:
  Φ ← Sum-Product-Elim-Var(Φ, Zj)
φ* ← ∏_{φ∈Φ} φ
α ← ∑_{y∈Val(Y)} φ*(y)
Return α, φ*

slide-56
SLIDE 56

Complexity of variable elimination algorithm

56

} In each elimination step, the following computations are required:
  } g(y, y1, …, yk) = ∏_{i=1}^{N} hi(y, 𝒚i), where each 𝒚i ⊆ {y1, …, yk}
  } ∑_y g(y, y1, …, yk)

} We need:
  } (N − 1) × Val(Y) × ∏_{i=1}^{k} Val(Yi) multiplications
    } For each tuple (y, y1, …, yk), we need N − 1 multiplications
  } Val(Y) × ∏_{i=1}^{k} Val(Yi) additions
    } For each tuple (y1, …, yk), we need Val(Y) additions

Complexity is exponential in the number of variables in the intermediate factor.
The size of the created factors is the dominant quantity in the complexity of VE.

slide-57
SLIDE 57

Example

57

} Query: P(Y2 | Y7 = ȳ7)
} P(Y2 | ȳ7) ∝ P(Y2, ȳ7)

P(y2, ȳ7) = ∑_{y1} ∑_{y3} ∑_{y4} ∑_{y5} ∑_{y6} ∑_{y8} P(y1, y2, y3, y4, y5, y6, ȳ7, y8)

Consider the elimination order Y1, Y3, Y4, Y5, Y6, Y8

P(y2, ȳ7) = ∑_{y8} ∑_{y6} ∑_{y5} ∑_{y4} ∑_{y3} ∑_{y1} P(y1) P(y2) P(y3|y1,y2) P(y4|y3) P(y5|y2) P(y6|y3,ȳ7) P(ȳ7|y4,y5) P(y8|ȳ7)

(Network over Y1, …, Y8.)

slide-58
SLIDE 58

58

P(y2, ȳ7) = ∑_{y8} ∑_{y6} ∑_{y5} ∑_{y4} ∑_{y3} P(y2) P(y4|y3) P(y5|y2) P(y6|y3,ȳ7) P(ȳ7|y4,y5) P(y8|ȳ7) ∑_{y1} P(y1) P(y3|y1,y2)

= ∑_{y8} ∑_{y6} ∑_{y5} ∑_{y4} ∑_{y3} P(y2) P(y4|y3) P(y5|y2) P(y6|y3,ȳ7) P(ȳ7|y4,y5) P(y8|ȳ7) m1(y2, y3)

= ∑_{y8} ∑_{y6} ∑_{y5} ∑_{y4} P(y2) P(y5|y2) P(ȳ7|y4,y5) P(y8|ȳ7) ∑_{y3} P(y4|y3) P(y6|y3,ȳ7) m1(y2, y3)

= ∑_{y8} ∑_{y6} ∑_{y5} ∑_{y4} P(y2) P(y5|y2) P(ȳ7|y4,y5) P(y8|ȳ7) m3(y2, y4, y6)

= ∑_{y8} ∑_{y6} ∑_{y5} P(y2) P(y5|y2) P(y8|ȳ7) ∑_{y4} P(ȳ7|y4,y5) m3(y2, y4, y6)

= ∑_{y8} ∑_{y6} ∑_{y5} P(y2) P(y5|y2) P(y8|ȳ7) m4(y2, y5, y6)

= ∑_{y8} ∑_{y6} P(y2) P(y8|ȳ7) ∑_{y5} P(y5|y2) m4(y2, y5, y6)

= ∑_{y8} ∑_{y6} P(y2) P(y8|ȳ7) m5(y2, y6)

= ∑_{y8} P(y2) P(y8|ȳ7) ∑_{y6} m5(y2, y6)

= ∑_{y8} P(y2) P(y8|ȳ7) m6(y2)

= m8(y2) m6(y2)

slide-59
SLIDE 59

Conditional probability

59

P(y2 | ȳ7) = m8(y2) m6(y2) / ∑_{y2} m8(y2) m6(y2)

slide-60
SLIDE 60

Graph elimination

60

} Graph elimination is a simple unified treatment of inference

algorithms

} Moralize the graph

} All parents of a node are connected to each other

} Graph-theoretic property: the factors resulted during variable

elimination are captured by recording the elimination clique

} The computational complexity of the Eliminate algorithm can

be reduced to purely graph-theoretic considerations

slide-61
SLIDE 61

Graph elimination

61

} Begin with the moralized BN } Choose an elimination ordering (query nodes should be last) } Eliminate a node from the graph and add edges (called fill

edges) between all pairs of its neighbors

} Iterate until all non-query nodes are eliminated

slide-62
SLIDE 62

Graph elimination

62

(Figure: the moralized graph over Y1, …, Y8, followed by the sequence of graphs obtained by removing each eliminated node and connecting its remaining neighbors; the added edges are fill edges.)

Summation ⇔ elimination
Intermediate term ⇔ elimination clique

slide-63
SLIDE 63

Graph elimination: elimination cliques

63

} Induced dependency during marginalization is captured in

elimination cliques

} A correspondence between maximal cliques in the induced

graph and maximal factors generated inVE algorithm

} The complexity depends on the number of variables in the largest

elimination clique

} The size of the maximal elimination clique in the induced

graph depends on the elimination ordering

slide-64
SLIDE 64

Elimination order

64

} Finding the best elimination ordering is NP-hard

} Equivalent to finding the tree-width of the graph, which is NP-hard

} Tree-width: one less than the smallest achievable size of the largest elimination clique, ranging over all possible elimination orderings

} Good elimination orderings lead to small cliques and

thus reduce complexity

} What is the optimal order for trees?

slide-65
SLIDE 65

Polytrees

} A polytree is a directed graph with no undirected cycles } For poly-trees you can always find an ordering that is efficient

}

Try it!!

} Cut-set conditioning for Bayes’ net inference

}

Choose set of variables such that if removed only a polytree remains

}

Exercise: Think about how the specifics would work out!

65

slide-66
SLIDE 66

Worst Case Complexity?

}

CSP:

} If we can answer whether P(z) is equal to zero or not, we have answered whether the 3-SAT problem has a solution.

} Hence inference in Bayes' nets is NP-hard. No known efficient probabilistic inference in general.

66

slide-67
SLIDE 67

Variable Elimination: summary

} Interleave joining and marginalizing
} d^k entries computed for a factor over k variables with domain sizes d

} Ordering of elimination of variables

can affect size of factors generated

} Worst case: running time exponential

in the size of the Bayes’ net

… … 67

slide-68
SLIDE 68

Bayes’ Nets

}

Representation

}

Conditional Independences

}

Probabilistic Inference

}

Enumeration (exact, exponential complexity)

}

Variable elimination (exact, worst-case exponential complexity, often better)

}

Inference is NP-complete

}

Sampling (approximate)

}

Learning Bayes’ Nets from Data

68

slide-69
SLIDE 69

Approximate Inference: Sampling

69

slide-70
SLIDE 70

Sampling

} Sampling is a lot like repeated simulation

}

Predicting the weather, basketball games, …

} Basic idea

}

Draw N samples from a sampling distribution S

}

Compute an approximate posterior probability

}

Show this converges to the true probability P

} Why sample?
  § Learning: get samples from a distribution you don't know
  § Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination)

70

slide-71
SLIDE 71

Sampling

} Sampling from given distribution

}

Step 1: Get sample u from uniform distribution over [0, 1)

}

E.g. random() in python

}

Step 2: Convert this sample u into an outcome for the given distribution by having each outcome associated with a sub-interval of [0, 1), with sub-interval size equal to the probability of the outcome

§ If random() returns u = 0.83, then our sample is C = blue
§ E.g., after sampling 8 times:

C      P(C)
red    0.6
green  0.1
blue   0.3
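A small Python sketch of this two-step procedure for the color distribution shown; the function name sample_from is illustrative.

```python
import random

def sample_from(dist):
    """Draw one outcome: pick u in [0,1) and find the sub-interval it falls in."""
    u = random.random()                 # Step 1: uniform sample in [0, 1)
    cumulative = 0.0
    for outcome, p in dist.items():     # Step 2: map u to an outcome
        cumulative += p
        if u < cumulative:
            return outcome
    return outcome                      # guard against float round-off

P_C = {'red': 0.6, 'green': 0.1, 'blue': 0.3}
print([sample_from(P_C) for _ in range(8)])   # e.g. 8 samples of C
```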

71

slide-72
SLIDE 72

Sampling in Bayes’ Nets

} Prior Sampling } Rejection Sampling } Likelihood Weighting } Gibbs Sampling

72

slide-73
SLIDE 73

Prior Sampling

73

slide-74
SLIDE 74

Prior Sampling

(Network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler and Rain → WetGrass)

C     P(C)
+c    0.5
-c    0.5

C   S     P(S|C)
+c  +s    0.1
+c  -s    0.9
-c  +s    0.5
-c  -s    0.5

C   R     P(R|C)
+c  +r    0.8
+c  -r    0.2
-c  +r    0.2
-c  -r    0.8

S   R   W     P(W|S,R)
+s  +r  +w    0.99
+s  +r  -w    0.01
+s  -r  +w    0.90
+s  -r  -w    0.10
-s  +r  +w    0.90
-s  +r  -w    0.10
-s  -r  +w    0.01
-s  -r  -w    0.99

Samples:
+c, -s, +r, +w
-c, +s, -r, +w

74

slide-75
SLIDE 75

Prior Sampling

} For i=1, 2, …, n

} Sample xi from P(Xi | Parents(Xi))

} Return (x1, x2, …, xn)
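A Python sketch of prior sampling for the cloudy / sprinkler / rain / wet-grass network on the previous slide; True/False stand for +/- and the helper names are illustrative.

```python
import random

def bernoulli(p):
    """Return True with probability p (a '+' outcome)."""
    return random.random() < p

def prior_sample():
    """Sample every variable in topological order from P(X_i | Parents(X_i))."""
    c = bernoulli(0.5)
    s = bernoulli(0.1 if c else 0.5)                 # P(+s | C)
    r = bernoulli(0.8 if c else 0.2)                 # P(+r | C)
    w = bernoulli({(True, True): 0.99, (True, False): 0.90,
                   (False, True): 0.90, (False, False): 0.01}[(s, r)])   # P(+w | S, R)
    return c, s, r, w

samples = [prior_sample() for _ in range(10000)]
# Consistent estimate of P(+w): fraction of samples with w = True.
print(sum(w for _, _, _, w in samples) / len(samples))
```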

75

slide-76
SLIDE 76

Prior Sampling

} This process generates samples with probability:

…i.e. the BN’s joint probability

} Let the number of samples of an event be } Then } I.e., the sampling procedure is consistent

76

slide-77
SLIDE 77

Example

} We’ll get a bunch of samples from the BN:

+c, -s, +r, +w
+c, +s, +r, +w
-c, +s, +r, -w
+c, -s, +r, +w
-c, -s, -r, +w

} If we want to know P(W)
} We have counts <+w:4, -w:1>
} Normalize to get P(W) = <+w:0.8, -w:0.2>
} This will get closer to the true distribution with more samples
} Can estimate anything else, too
} What about P(C | +w)? P(C | +r, +w)? P(C | -r, -w)?
} Fast: can use fewer samples if less time (what's the drawback?)

S R W C

77

slide-78
SLIDE 78

Rejection Sampling

78

slide-79
SLIDE 79

+c, -s, +r, +w
+c, +s, +r, +w
-c, +s, +r, -w
+c, -s, +r, +w
-c, -s, -r, +w

Rejection Sampling

} Let’s say we want P(C)

} No point keeping all samples around } Just tally counts of C as we go

} Let’s say we want P(C| +s)

} Same thing: tally C outcomes, but ignore

(reject) samples which don’t have S=+s

} This is called rejection sampling } It is also consistent for conditional probabilities

(i.e., correct in the limit) S R W C

79

slide-80
SLIDE 80

Rejection Sampling

} IN: evidence instantiation
} For i = 1, 2, …, n
  } Sample xi from P(Xi | Parents(Xi))
  } If xi not consistent with evidence
    } Reject: return, and no sample is generated in this cycle
} Return (x1, x2, …, xn)
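A minimal sketch of this loop in Python for the query P(C | +s) on the cloudy network of the earlier slides; only C and S are sampled here because nothing downstream affects this query, and the helper names are illustrative.

```python
import random

def bernoulli(p):
    return random.random() < p

def rejection_sample_C_given_s(n):
    """Estimate P(+c | +s): sample from the prior, reject samples with -s."""
    kept = []
    for _ in range(n):
        c = bernoulli(0.5)                   # P(+c)
        s = bernoulli(0.1 if c else 0.5)     # P(+s | C)
        if not s:
            continue                         # inconsistent with evidence: reject
        kept.append(c)
    return sum(kept) / len(kept)             # fraction of kept samples with +c

print(rejection_sample_C_given_s(100000))    # roughly 0.167 = 0.05 / 0.30
```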

80

slide-81
SLIDE 81

Likelihood Weighting

81

slide-82
SLIDE 82

§ Idea: fix evidence variables and sample the rest

§ Problem: sample distribution not consistent! § Solution: weight by probability of evidence given parents

Likelihood Weighting

} Problem with rejection sampling:

}

If evidence is unlikely, rejects lots of samples

}

Evidence not exploited as you sample

}

Consider P(Shape|blue)

Shape Color Shape Color

pyramid, green pyramid, red sphere, blue cube, red sphere, green pyramid, blue pyramid, blue sphere, blue cube, blue sphere, blue

82

slide-83
SLIDE 83

Likelihood Weighting

(Same cloudy / sprinkler / rain / wet-grass network and CPTs as on the prior-sampling slide.)

Samples: +c, +s, +r, +w …

83

slide-84
SLIDE 84

Likelihood Weighting

} IN: evidence instantiation
} w = 1.0
} for i = 1, 2, …, n
  } if Xi is an evidence variable
    } Xi = observation xi for Xi
    } Set w = w * P(xi | Parents(Xi))
  } else
    } Sample xi from P(Xi | Parents(Xi))
} return (x1, x2, …, xn), w
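A Python sketch of this procedure for evidence +s, +w on the cloudy network, with the CPT numbers from the earlier slides; variable names and the sample count are illustrative assumptions.

```python
import random

def bernoulli(p):
    return random.random() < p

def weighted_sample_given_s_w():
    """One likelihood-weighted sample for evidence +s, +w on the cloudy network."""
    w = 1.0
    c = bernoulli(0.5)                       # not evidence: sample P(+c)
    s = True                                 # evidence: fix and weight
    w *= 0.1 if c else 0.5                   # P(+s | c)
    r = bernoulli(0.8 if c else 0.2)         # not evidence: sample P(+r | c)
    wet = True                               # evidence: fix and weight
    w *= {(True, True): 0.99, (True, False): 0.90,
          (False, True): 0.90, (False, False): 0.01}[(s, r)]   # P(+w | s, r)
    return (c, s, r, wet), w

# Estimate P(+c | +s, +w) as a weight-normalized count.
samples = [weighted_sample_given_s_w() for _ in range(100000)]
num = sum(w for (c, *_), w in samples if c)
den = sum(w for _, w in samples)
print(num / den)
```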

84

slide-85
SLIDE 85

Likelihood Weighting

} Sampling distribution if z sampled and e fixed evidence } Now, samples have weights } Together, weighted sampling distribution is consistent

Cloudy R C S W

85

slide-86
SLIDE 86

Likelihood Weighting

} Likelihood weighting is good

}

We have taken evidence into account as we generate the sample

}

E.g. here, W’s value will get picked based on the evidence values of S, R

}

More of our samples will reflect the state of the world suggested by the evidence

} Likelihood weighting doesn’t solve all our problems

}

Evidence influences the choice of downstream variables, but not upstream ones (C isn’t more likely to get a value matching the evidence)

} We would like to consider evidence when we sample every variable

à Gibbs sampling

86

slide-87
SLIDE 87

Gibbs Sampling

87

slide-88
SLIDE 88

Gibbs Sampling

} Procedure: keep track of a full instantiation x1, x2, …, xn.

}

Start with an arbitrary instantiation consistent with the evidence.

}

Sample one variable at a time, conditioned on all the rest, but keep evidence fixed.

}

Keep repeating this for a long time.

} Property: in the limit of repeating this infinitely many times the resulting sample is

coming from the correct distribution

} Rationale: both upstream and downstream variables condition on evidence. } In contrast: likelihood weighting only conditions on upstream evidence, and hence

weights obtained in likelihood weighting can sometimes be very small.

}

Sum of weights over all samples is indicative of how many “effective” samples were obtained, so want high weight.

88

slide-89
SLIDE 89

Gibbs Sampling Example: P( S | +r)

} Step 1: Fix evidence

}

R = +r

} Step 2: Initialize other variables

}

Randomly

} Steps 3: Repeat

}

Choose a non-evidence variable X

}

Resample X from P( X | all other variables)

(Figure: the network over C, S, R = +r, W, repeated for successive resampling steps.)
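A runnable Python sketch of these three steps for the query P(S | +r) on the cloudy network; the CPT values come from the earlier prior-sampling slide, while everything else (names, iteration count, initialization, no burn-in) is an illustrative assumption.

```python
import random

# CPTs for the cloudy / sprinkler / rain / wet-grass network (True = '+').
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                       # P(+s | C)
P_R = {True: 0.8, False: 0.2}                       # P(+r | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}   # P(+w | S, R)

def bern(p):
    return random.random() < p

def gibbs_estimate_S_given_r(iters=50000):
    """Estimate P(+s | +r): fix R = +r, resample the other variables in turn."""
    r = True                                   # evidence stays fixed
    c, s, w = bern(0.5), bern(0.5), bern(0.5)  # arbitrary initialization
    count = 0
    for _ in range(iters):
        # Resample C from P(C | s, +r) ∝ P(C) P(s|C) P(+r|C)  (Markov blanket of C)
        pc = {}
        for cv in (True, False):
            prior = P_C if cv else 1 - P_C
            ps = P_S[cv] if s else 1 - P_S[cv]
            pc[cv] = prior * ps * P_R[cv]
        c = bern(pc[True] / (pc[True] + pc[False]))
        # Resample S from P(S | c, +r, w) ∝ P(S|c) P(w|S, +r)
        ps = {}
        for sv in (True, False):
            p1 = P_S[c] if sv else 1 - P_S[c]
            p2 = P_W[(sv, r)] if w else 1 - P_W[(sv, r)]
            ps[sv] = p1 * p2
        s = bern(ps[True] / (ps[True] + ps[False]))
        # Resample W from P(W | s, +r) directly
        w = bern(P_W[(s, r)])
        count += s
    return count / iters   # a real implementation would discard a burn-in period

print(gibbs_estimate_S_given_r())
```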

89

slide-90
SLIDE 90

Gibbs Sampling

} How is this better than sampling from the full joint?

} In a Bayes’ Net, sampling a variable given all the other variables

(e.g. P(R|S,C,W)) is usually much easier than sampling from the full joint distribution

} Only requires a join on the variable to be sampled (in this case, a join on R) } The resulting factor only depends on the variable’s parents, its children, and its children’s

parents (this is often referred to as its Markov blanket)

90

slide-91
SLIDE 91

Efficient Resampling of One Variable

}

Sample from P(S | +c, +r, -w)

} Many things cancel out – only CPTs with S remain!
} More generally: only CPTs that mention the resampled variable need to be considered, and joined together

S +r W C
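Spelled out for this case (an illustrative expansion, not from the slides): P(S | +c, +r, -w) = P(S | +c) P(-w | S, +r) / ∑_s P(s | +c) P(-w | s, +r); the CPTs for C and R cancel because they do not mention S.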

91

slide-92
SLIDE 92

Bayes’ Net Sampling Summary

§ Prior Sampling: P(Q)
§ Likelihood Weighting: P(Q | e)
§ Rejection Sampling: P(Q | e)
§ Gibbs Sampling: P(Q | e)

92

slide-93
SLIDE 93

Further Reading on Gibbs Sampling*

} Gibbs sampling produces sample from the query distribution P(Q|e)

in limit of re-sampling infinitely often

} Gibbs sampling is a special case of more general methods called

Markov chain Monte Carlo (MCMC) methods

} Metropolis-Hastings is one of the more famous MCMC methods (in fact, Gibbs

sampling is a special case of Metropolis-Hastings)

} You may read about Monte Carlo methods – they’re just sampling

93