Inference in Bayesian Networks
CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2019
Soleymani
Slides are based on Klein and Abbeel, CS188, UC Berkeley.
Bayes Nets
} Representation
} Conditional Independences
} Probabilistic Inference
  } Enumeration (exact, exponential complexity)
  } Variable elimination (exact, worst-case exponential complexity, often better)
  } Probabilistic inference is NP-complete
  } Sampling (approximate)
} Learning Bayes' Nets from Data
} A directed, acyclic graph, one node per random variable
} A conditional probability table (CPT) for each node
  } A collection of distributions over X, one for each combination of parents' values
} Bayes' nets implicitly encode joint distributions
  } As a product of local conditional distributions
  } To see what probability a BN gives to a full assignment, multiply all the relevant conditionals together:

  P(x₁, x₂, …, xₙ) = ∏ᵢ P(xᵢ | parents(Xᵢ))
} Example: the burglary network
  Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls

  B    P(B)           E    P(E)
  +b   0.001          +e   0.002
  -b   0.999          -e   0.998

  B   E   A   P(A|B,E)
  +b  +e  +a  0.95
  +b  +e  -a  0.05
  +b  -e  +a  0.94
  +b  -e  -a  0.06
  -b  +e  +a  0.29
  -b  +e  -a  0.71
  -b  -e  +a  0.001
  -b  -e  -a  0.999

  A   J   P(J|A)        A   M   P(M|A)
  +a  +j  0.9           +a  +m  0.7
  +a  -j  0.1           +a  -m  0.3
  -a  +j  0.05          -a  +m  0.01
  -a  -j  0.95          -a  -m  0.99
[Demo: BN Applet]
} Representation
} Conditional Independences
} Probabilistic Inference
  } Enumeration (exact, exponential complexity)
  } Variable elimination (exact, worst-case exponential complexity, often better)
  } Inference is NP-complete
  } Sampling (approximate)
} Learning Bayes' Nets from Data
} Inference: calculating some useful quantity from a joint probability distribution
} Examples:
  } Posterior probability: P(Q | E₁ = e₁, …, Eₖ = eₖ)
  } Most likely explanation: argmax_q P(Q = q | e₁, …, eₖ)
} General case:
  } Evidence variables: E₁, …, Eₖ = e₁, …, eₖ
  } Query* variable: Q
  } Hidden variables: H₁, …, Hᵣ
  (together: all variables X₁, …, Xₙ)

  * Works fine with multiple query variables, too

} We want: P(Q | e₁, …, eₖ)
} Step 1: Select the entries consistent with the evidence
} Step 2: Sum out H to get the joint of the query and evidence:
  P(Q, e₁, …, eₖ) = Σ_{h₁ … hᵣ} P(Q, h₁, …, hᵣ, e₁, …, eₖ)
} Step 3: Normalize: P(Q | e₁, …, eₖ) = P(Q, e₁, …, eₖ) / Σ_q P(q, e₁, …, eₖ)
} Given unlimited time, inference in BNs is easy
} Reminder of inference by enumeration, by example (burglary network):

P(B | +j, +m) ∝ P(B, +j, +m)
  = Σ_{e,a} P(B, e, a, +j, +m)
  = Σ_{e,a} P(B) P(e) P(a|B, e) P(+j|a) P(+m|a)
  = P(B)P(+e)P(+a|B,+e)P(+j|+a)P(+m|+a) + P(B)P(+e)P(−a|B,+e)P(+j|−a)P(+m|−a)
  + P(B)P(−e)P(+a|B,−e)P(+j|+a)P(+m|+a) + P(B)P(−e)P(−a|B,−e)P(+j|−a)P(+m|−a)
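As a concrete illustration, here is a minimal Python sketch of this enumeration, using the CPT numbers from the tables above (the encoding and function names are our own, not part of the original slides):

```python
import itertools

# CPTs from the tables above, stored as P(var = True | parents).
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

def bern(p_true, value):
    """P(value) for a Boolean variable given P(True)."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """The joint as a product of local conditionals."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))

def query_B(j, m):
    """P(B | j, m) by summing out the hidden variables E and A."""
    scores = {b: sum(joint(b, e, a, j, m)
                     for e, a in itertools.product((True, False), repeat=2))
              for b in (True, False)}
    z = sum(scores.values())
    return {b: s / z for b, s in scores.items()}

print(query_B(j=True, m=True)[True])  # P(+b | +j, +m) ≈ 0.284
```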
} The same computation written out for the evidence j, ¬m:

P(b | j, ¬m) = P(j, ¬m, b) / P(j, ¬m)
  = Σ_A Σ_E P(j, ¬m, b, A, E) / Σ_B Σ_A Σ_E P(j, ¬m, B, A, E)
  = Σ_A Σ_E P(j|A) P(¬m|A) P(A|b, E) P(b) P(E) / Σ_B Σ_A Σ_E P(j|A) P(¬m|A) P(A|B, E) P(B) P(E)

Short-hands: ¬b denotes Burglary = False, etc.
} Joint distribution: P(X, Y)
  } Entries P(x, y) for all x, y
  } Sums to 1

  T     W     P
  hot   sun   0.4
  hot   rain  0.1
  cold  sun   0.2
  cold  rain  0.3

} Selected joint: P(x, Y)
  } A slice of the joint distribution
  } Entries P(x, y) for fixed x, all y
  } Sums to P(x)

  T     W     P
  cold  sun   0.2
  cold  rain  0.3

} Number of capitals = dimensionality of the table
} Single conditional: P(Y | x)
  } Entries P(y | x) for fixed x, all y
  } Sums to 1

  T     W     P
  cold  sun   0.4
  cold  rain  0.6

} Family of conditionals: P(X | Y)
  } Multiple conditionals
  } Entries P(x | y) for all x, y
  } Sums to |Y|

  T     W     P
  hot   sun   0.8
  hot   rain  0.2
  cold  sun   0.4
  cold  rain  0.6
} Specified family: P(y | X)
  } Entries P(y | x) for fixed y, but for all x
  } Sums to … who knows!

  T     W     P
  hot   rain  0.2
  cold  rain  0.6
} It is a "factor," a multi-dimensional array
} Its values are P(y₁ … y_N | x₁ … x_M)
} Any assigned (= lower-case) X or Y is a dimension missing (selected) from the array
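A hedged sketch of this idea in NumPy: a factor is just an array, and assigning (lower-casing) a variable selects a sub-array with one fewer dimension. The array layout below is our own choice, using the P(W | T) numbers from the tables above:

```python
import numpy as np

# P(W | T) as a 2-D factor: axis 0 indexes T (hot, cold), axis 1 indexes W (sun, rain).
P_W_given_T = np.array([[0.8, 0.2],
                        [0.4, 0.6]])

print(P_W_given_T[1])     # single conditional P(W | cold): [0.4, 0.6], sums to 1
print(P_W_given_T[:, 1])  # specified family P(rain | T): [0.2, 0.6], sums to "who knows"
print(P_W_given_T.sum())  # family of conditionals sums to |T| = 2
```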
} Random variables:
  } R: Raining
  } T: Traffic
  } L: Late for class!

  R → T → L

  R    P(R)        R   T    P(T|R)      T   L    P(L|T)
  +r   0.1         +r  +t   0.8         +t  +l   0.3
  -r   0.9         +r  -t   0.2         +t  -l   0.7
                   -r  +t   0.1         -t  +l   0.1
                   -r  -t   0.9         -t  -l   0.9

P(L) = Σ_{r,t} P(r, t, L) = Σ_{r,t} P(r) P(t|r) P(L|t)
} Track objects called factors
} Initial factors are local CPTs (one per node):

  R    P(R)        R   T    P(T|R)      T   L    P(L|T)
  +r   0.1         +r  +t   0.8         +t  +l   0.3
  -r   0.9         +r  -t   0.2         +t  -l   0.7
                   -r  +t   0.1         -t  +l   0.1
                   -r  -t   0.9         -t  -l   0.9

} Any known values are selected
} E.g. if we know L = +l, the initial factors are:

  R    P(R)        R   T    P(T|R)      T   P(+l|T)
  +r   0.1         +r  +t   0.8         +t  0.3
  -r   0.9         +r  -t   0.2         -t  0.1
                   -r  +t   0.1
                   -r  -t   0.9

} Procedure: Join all factors, then eliminate all hidden variables
} First basic operation: joining factors
} Combining factors:
  } Just like a database join
  } Get all factors over the joining variable
  } Build a new factor over the union of the variables involved
} Example: Join on R, i.e. P(R) × P(T|R) → P(R, T)
  } Computation for each entry: pointwise products

  R    P(R)     R   T    P(T|R)          R   T    P(R,T)
  +r   0.1      +r  +t   0.8             +r  +t   0.08
  -r   0.9      +r  -t   0.2       →     +r  -t   0.02
                -r  +t   0.1             -r  +t   0.09
                -r  -t   0.9             -r  -t   0.81
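A possible implementation of the join operation, assuming factors are represented as dictionaries from assignment tuples to values (this representation is our own, not from the slides):

```python
from itertools import product

def join(f1, vars1, f2, vars2):
    """Pointwise product over the union of the two factors' variables."""
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    out = {}
    # Boolean domains are assumed here for simplicity.
    for assignment in product((True, False), repeat=len(out_vars)):
        env = dict(zip(out_vars, assignment))
        key1 = tuple(env[v] for v in vars1)
        key2 = tuple(env[v] for v in vars2)
        out[assignment] = f1[key1] * f2[key2]
    return out, out_vars

# P(R) and P(T | R) from the tables above (True stands for +r / +t).
P_R = {(True,): 0.1, (False,): 0.9}
P_T_given_R = {(True, True): 0.8, (True, False): 0.2,
               (False, True): 0.1, (False, False): 0.9}

P_RT, rt_vars = join(P_R, ["R"], P_T_given_R, ["R", "T"])
print(P_RT[(True, True)])  # P(+r, +t) = 0.1 * 0.8 = 0.08
```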
} Joining multiple factors in sequence:

Join R: P(R) × P(T|R) → P(R, T)

  R   T    P(R,T)       T   L    P(L|T)
  +r  +t   0.08         +t  +l   0.3
  +r  -t   0.02         +t  -l   0.7
  -r  +t   0.09         -t  +l   0.1
  -r  -t   0.81         -t  -l   0.9

Join T: P(R, T) × P(L|T) → P(R, T, L)

  R   T   L    P(R,T,L)
  +r  +t  +l   0.024
  +r  +t  -l   0.056
  +r  -t  +l   0.002
  +r  -t  -l   0.018
  -r  +t  +l   0.027
  -r  +t  -l   0.063
  -r  -t  +l   0.081
  -r  -t  -l   0.729
} Second basic operation: marginalization
} Take a factor and sum out a variable
  } Shrinks a factor to a smaller one
  } A projection operation
} Example: summing R out of P(R, T) gives P(T)

  R   T    P(R,T)            T    P(T)
  +r  +t   0.08              +t   0.17
  +r  -t   0.02        →     -t   0.83
  -r  +t   0.09
  -r  -t   0.81
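Marginalization fits the same dictionary representation; a sketch that sums one variable out of the factor produced by join() above:

```python
def sum_out(f, vars_, var):
    """Sum a variable out of a factor, shrinking it by one dimension."""
    idx = vars_.index(var)
    out_vars = vars_[:idx] + vars_[idx + 1:]
    out = {}
    for assignment, p in f.items():
        key = assignment[:idx] + assignment[idx + 1:]
        out[key] = out.get(key, 0.0) + p
    return out, out_vars

P_T, t_vars = sum_out(P_RT, rt_vars, "R")
print(P_T[(True,)])  # P(+t) = 0.08 + 0.09 = 0.17
```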
} Multiple eliminations in sequence:

Sum out R: P(R, T, L) → P(T, L)

  R   T   L    P(R,T,L)           T   L    P(T,L)
  +r  +t  +l   0.024              +t  +l   0.051
  +r  +t  -l   0.056              +t  -l   0.119
  +r  -t  +l   0.002        →     -t  +l   0.083
  +r  -t  -l   0.018              -t  -l   0.747
  -r  +t  +l   0.027
  -r  +t  -l   0.063
  -r  -t  +l   0.081
  -r  -t  -l   0.729

Sum out T: P(T, L) → P(L)

  L    P(L)
  +l   0.134
  -l   0.866
} Why is inference by enumeration so slow?
  } You join up the whole joint distribution before you sum out the hidden variables
} Idea: interleave joining and marginalizing!
  } Called "Variable Elimination"
  } Still NP-hard, but usually much faster than inference by enumeration
  } First we'll need some new notation: factors
} Inference by Enumeration (join on r, join on t, then eliminate r, eliminate t):

  P(L) = Σ_t Σ_r P(L|t) P(r) P(t|r)

} Variable Elimination (join on r, eliminate r, join on t, eliminate t):

  P(L) = Σ_t P(L|t) Σ_r P(r) P(t|r)
} Variable elimination in the traffic domain, query P(L):

Start with:  P(R), P(T|R), P(L|T)

  R    P(R)        R   T    P(T|R)      T   L    P(L|T)
  +r   0.1         +r  +t   0.8         +t  +l   0.3
  -r   0.9         +r  -t   0.2         +t  -l   0.7
                   -r  +t   0.1         -t  +l   0.1
                   -r  -t   0.9         -t  -l   0.9

Join R:      P(R, T), P(L|T)
Sum out R:   P(T), P(L|T)

  T    P(T)
  +t   0.17
  -t   0.83

Join T:      P(T, L)

  T   L    P(T,L)
  +t  +l   0.051
  +t  -l   0.119
  -t  +l   0.083
  -t  -l   0.747

Sum out T:   P(L)

  L    P(L)
  +l   0.134
  -l   0.866
} If evidence, start with factors that select that evidence
} No evidence uses these initial factors:

  R    P(R)        R   T    P(T|R)      T   L    P(L|T)
  +r   0.1         +r  +t   0.8         +t  +l   0.3
  -r   0.9         +r  -t   0.2         +t  -l   0.7
                   -r  +t   0.1         -t  +l   0.1
                   -r  -t   0.9         -t  -l   0.9

} Computing P(L | +r), the initial factors become:

  R    P(+r)       R   T    P(T|+r)     T   L    P(L|T)
  +r   0.1         +r  +t   0.8         +t  +l   0.3
                   +r  -t   0.2         +t  -l   0.7
                                        -t  +l   0.1
                                        -t  -l   0.9

} We eliminate all vars other than query + evidence
} Result will be a selected joint of query and evidence
  } E.g. for P(L | +r), we would end up with:

  R   L    P(+r, L)                L    P(L | +r)
  +r  +l   0.026            →      +l   0.26
  +r  -l   0.074       Normalize   -l   0.74

} To get our answer, just normalize this!
} That's it!
} Query: P(Q | E₁ = e₁, …, Eₖ = eₖ)
} Start with initial factors:
  } Local CPTs (but instantiated by evidence)
} While there are still hidden variables (not Q or evidence):
  } Pick a hidden variable H
  } Join all factors mentioning H
  } Eliminate (sum out) H
} Join all remaining factors and normalize
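Putting the pieces together, here is a sketch of the full loop, reusing the join() and sum_out() helpers defined earlier (the factor-list representation and names are ours):

```python
def variable_elimination(factors, hidden_order):
    """factors: list of (table, vars) pairs with evidence already selected."""
    for h in hidden_order:
        touching = [fv for fv in factors if h in fv[1]]
        factors = [fv for fv in factors if h not in fv[1]]
        f, v = touching[0]
        for g, w in touching[1:]:            # join all factors mentioning h
            f, v = join(f, v, g, w)
        factors.append(sum_out(f, v, h))     # then eliminate (sum out) h
    f, v = factors[0]
    for g, w in factors[1:]:                 # join all remaining factors
        f, v = join(f, v, g, w)
    z = sum(f.values())
    return {k: p / z for k, p in f.items()}, v   # normalize

P_L_given_T = {(True, True): 0.3, (True, False): 0.7,
               (False, True): 0.1, (False, False): 0.9}
posterior, _ = variable_elimination(
    [(P_R, ["R"]), (P_T_given_R, ["R", "T"]), (P_L_given_T, ["T", "L"])],
    hidden_order=["R", "T"])
print(posterior[(True,)])  # P(+l) = 0.134, matching the tables above
```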
} Exploiting the factorization properties to allow sums and products to be interchanged
} b×c + b×d needs three operations, while b×(c + d) requires two
} Variable elimination example on the burglary network, query P(B | j):
  Initial factors: g₁(B) = P(B), g₂(E) = P(E), g₃(A, B, E) = P(A|B, E), g₄(A) = P(j|A), g₅(A, M) = P(M|A)

  Eliminate M:  g₆(A) = Σ_M g₅(A, M)   (= 1 for all A)
  Eliminate A:  g₇(B, E) = Σ_A g₃(A, B, E) × g₄(A) × g₆(A)
  Eliminate E:  g₈(B) = Σ_E g₂(E) × g₇(B, E)

  Result: P(B, j) = g₁(B) × g₈(B)

} Intermediate results are probability distributions
} An inefficient order: summing the variables out in a different order can require joining all the factors first, generating an intermediate factor g(A, B, E, M) over four variables
} Any variable that is not an ancestor of a query variable or an evidence variable is irrelevant to the query
} Prune all non-ancestors of query or evidence variables
} Example: for P(b, j) in the burglary network (JohnCalls = true), MaryCalls is not an ancestor of the query or evidence and can be pruned
} Given: BN, evidence e, a query P(Z | e)
} Prune non-ancestors of {Z, E}
} Choose an ordering on the variables, e.g., Y₁, …, Yₙ
} For i = 1 to n, if Yᵢ ∉ {Z, E}:
  } Collect the factors g₁, …, gₖ that include Yᵢ
  } Generate a new factor by eliminating Yᵢ from these factors:
      g = Σ_{yᵢ} ∏_{j=1}^{k} gⱼ
    (after this summation, Yᵢ is eliminated)
} Multiply all remaining factors
} Normalize P(Z, e) to obtain P(Z | e)
} Eliminates by summation the non-observed, non-query variables one by one
} Complexity is determined by the size of the largest factor
} Variable elimination can lead to significant cost savings, but there are still cases in which this algorithm will take exponential time
} Computing an expression of the form (sum-product inference):

  Σ_Z ∏_{φ ∈ Χ} φ        (Χ: the set of factors; Z: the variables to be summed out)

} We used the structure of the BN to factorize the joint distribution, and then reorder sums and products
} Distributive law: if Y ∉ Scope(φ₁), then Σ_Y φ₁ φ₂ = φ₁ Σ_Y φ₂
} Performing the summations over the product of only a subset of factors
} We find sub-expressions that can be computed once and then reused
  } Instead of computing them exponentially many times
} Example trace (burglary network, query P(B | +j, +m)): choose A (join and sum out), then choose E, finish with B, and normalize
} Step-by-step annotations for that derivation:
  } marginal can be obtained from joint by summing out
  } use the Bayes' net joint distribution expression
  } use x×(y + z) = xy + xz
  } joining on a, and then summing out, gives f₁
  } joining on e, and then summing out, gives f₂
} All we are doing is exploiting uwy + uwz + uxy + uxz + vwy + vwz + vxy + vxz = (u+v)(w+x)(y+z) to improve computational efficiency!
} Chain example: A → B → C → D, query P(d)

P(d) = Σ_C Σ_B Σ_A P(A) P(B|A) P(C|B) P(d|C)
     = Σ_C P(d|C) Σ_B P(C|B) Σ_A P(A) P(B|A)
     = Σ_C P(d|C) Σ_B P(C|B) g(B)
     = Σ_C P(d|C) g(C)

} In a chain of n nodes each having k values: O(nk²) operations instead of O(kⁿ)
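For chains this cost bound is easy to see in code: each elimination is one vector-matrix product. A small NumPy sketch with random CPTs (all names and numbers here are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10, 4                                  # chain X1 -> ... -> Xn, k values each
prior = rng.dirichlet(np.ones(k))             # P(X1)
cpts = [rng.dirichlet(np.ones(k), size=k)     # row i of each table is P(X_{j+1} | X_j = i)
        for _ in range(n - 1)]

g = prior
for cpt in cpts:          # eliminate X1, X2, ... in order: O(k^2) work per step
    g = g @ cpt           # g(x_{j+1}) = sum_{x_j} g(x_j) P(x_{j+1} | x_j)
print(g, g.sum())         # marginal P(Xn); sums to 1
```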
} Example (pit/breeze world, pit prior 0.2):

  evidence = ¬b₁,₁ ∧ b₁,₂ ∧ b₂,₁ ∧ ¬p₁,₁ ∧ ¬p₁,₂ ∧ ¬p₂,₁
  P(P₁,₃ | evidence) = ?

} Enumerate the possible worlds with P₁,₃ = true and with P₁,₃ = false:

  P(P₁,₃ = true | evidence) ∝ 0.2 × (0.2×0.2 + 0.2×0.8 + 0.8×0.2)
  P(P₁,₃ = false | evidence) ∝ 0.8 × (0.2×0.2 + 0.2×0.8)
  ⇒ P(P₁,₃ = true | evidence) ≈ 0.31
Computational complexity critically depends on the largest factor being generated in this process. Size of a factor = number of entries in its table. In the example above (assuming binary variables), all factors generated are of size 2, as they each have only one variable (Z, Z, and X₃ respectively).
} For the query P(Xₙ | y₁, …, yₙ), work through two different elimination orderings of the hidden variables
} What is the size of the maximum factor generated for each of the orderings?
  } Answer: 2ⁿ⁺¹ versus 2² (assuming binary variables)
} In general: the ordering can greatly affect efficiency
} The computational and space complexity of variable elimination is determined by the largest factor
} The elimination ordering can greatly affect the size of the largest factor
  } E.g., previous slide's example: 2ⁿ vs. 2
} Does there always exist an ordering that only results in small factors?
  } No!
} Sum out each variable one at a time:
  } All factors containing that variable are removed from the set and multiplied together
  } The variable is summed out from the generated product factor
  } The new factor is added to the set of the available factors
} The resulting factor does not necessarily correspond to any probability or conditional probability in the network
Procedure Sum-Product-VE(Z, G)
// Z: the variables to be eliminated
  Χ ← all factors of G
  Select an elimination order z₁, …, z_L for Z
  for j = 1, …, L:
    Χ ← Sum-Product-Elim-Var(Χ, zⱼ)
  φ* ← ∏_{φ ∈ Χ} φ
  Return φ*

Procedure Sum-Product-Elim-Var(Χ, z)
  Χ′ ← {φ ∈ Χ : z ∈ Scope(φ)}
  Χ″ ← Χ − Χ′
  τ ← Σ_z ∏_{φ ∈ Χ′} φ    (factors that do not mention the variable being eliminated now stay outside of the summation)
  Return Χ″ ∪ {τ}

} It does not need normalization when we have no evidence
Procedure Cond-Prob-VE(G,      // the network over Y
                       Z,      // set of query variables
                       E = e)  // evidence
  Χ ← the factors parametrizing G
  Replace each φ ∈ Χ by φ[E = e]
  Select an elimination order z₁, …, z_L for Z′ = Y − Z − E
  for j = 1, …, L:
    Χ ← Sum-Product-Elim-Var(Χ, zⱼ)
  φ* ← ∏_{φ ∈ Χ} φ
  β ← Σ_{z ∈ Val(Z)} φ*(z)
  Return β, φ*      // P(Z | e) = φ*(Z) / β
} In each elimination step, the following computations are performed:
  g(y, y₁, …, yₖ) = ∏_{i=1}^{N} gᵢ
  Σ_y g(y, y₁, …, yₖ)
} We need:
  } (N − 1) × |Val(Y)| × ∏_{i=1}^{k} |Val(Yᵢ)| multiplications
    } For each tuple y, y₁, …, yₖ, we need N − 1 multiplications
  } |Val(Y)| × ∏_{i=1}^{k} |Val(Yᵢ)| additions
    } For each tuple y₁, …, yₖ, we need |Val(Y)| additions

Complexity is exponential in the number of variables in the intermediate factor. The size of the created factors is the dominant quantity in the complexity of VE.
} Query: P(Y₂ | Y₇ = ȳ₇), using P(Y₂ | ȳ₇) ∝ P(Y₂, ȳ₇)
  (network over Y₁, …, Y₈ with parents: Y₃ ← Y₁, Y₂; Y₄ ← Y₃; Y₅ ← Y₂; Y₆ ← Y₃, Y₇; Y₇ ← Y₄, Y₅; Y₈ ← Y₇)

P(y₂, ȳ₇) = Σ_{y₁} Σ_{y₃} Σ_{y₄} Σ_{y₅} Σ_{y₆} Σ_{y₈} P(y₁, y₂, y₃, y₄, y₅, y₆, ȳ₇, y₈)

Consider the elimination order Y₁, Y₃, Y₄, Y₅, Y₆, Y₈:

P(y₂, ȳ₇) = Σ_{y₈} Σ_{y₆} Σ_{y₅} Σ_{y₄} Σ_{y₃} Σ_{y₁} P(y₁) P(y₂) P(y₃|y₁, y₂) P(y₄|y₃) P(y₅|y₂) P(y₆|y₃, ȳ₇) P(ȳ₇|y₄, y₅) P(y₈|ȳ₇)

= Σ_{y₈} Σ_{y₆} Σ_{y₅} Σ_{y₄} Σ_{y₃} P(y₂) P(y₄|y₃) P(y₅|y₂) P(y₆|y₃, ȳ₇) P(ȳ₇|y₄, y₅) P(y₈|ȳ₇) [Σ_{y₁} P(y₁) P(y₃|y₁, y₂)]
= Σ_{y₈} Σ_{y₆} Σ_{y₅} Σ_{y₄} Σ_{y₃} P(y₂) P(y₄|y₃) P(y₅|y₂) P(y₆|y₃, ȳ₇) P(ȳ₇|y₄, y₅) P(y₈|ȳ₇) m₁(y₂, y₃)
= Σ_{y₈} Σ_{y₆} Σ_{y₅} Σ_{y₄} P(y₂) P(y₅|y₂) P(ȳ₇|y₄, y₅) P(y₈|ȳ₇) [Σ_{y₃} P(y₄|y₃) P(y₆|y₃, ȳ₇) m₁(y₂, y₃)]
= Σ_{y₈} Σ_{y₆} Σ_{y₅} Σ_{y₄} P(y₂) P(y₅|y₂) P(ȳ₇|y₄, y₅) P(y₈|ȳ₇) m₃(y₂, y₆, y₄)
= Σ_{y₈} Σ_{y₆} Σ_{y₅} P(y₂) P(y₅|y₂) P(y₈|ȳ₇) [Σ_{y₄} P(ȳ₇|y₄, y₅) m₃(y₂, y₆, y₄)]
= Σ_{y₈} Σ_{y₆} Σ_{y₅} P(y₂) P(y₅|y₂) P(y₈|ȳ₇) m₄(y₂, y₅, y₆)
= Σ_{y₈} Σ_{y₆} P(y₂) P(y₈|ȳ₇) [Σ_{y₅} P(y₅|y₂) m₄(y₂, y₅, y₆)]
= Σ_{y₈} Σ_{y₆} P(y₂) P(y₈|ȳ₇) m₅(y₂, y₆)
= Σ_{y₈} P(y₂) P(y₈|ȳ₇) [Σ_{y₆} m₅(y₂, y₆)]
= Σ_{y₈} P(y₂) P(y₈|ȳ₇) m₆(y₂)
= P(y₂) m₆(y₂) [Σ_{y₈} P(y₈|ȳ₇)]
= P(y₂) m₈(y₂) m₆(y₂)
} Graph elimination gives a simple, unified treatment of inference complexity
} Moralize the graph:
  } All parents of a node are connected to each other, and all edges are made undirected
} Graph-theoretic property: the factors that result during variable elimination correspond to cliques created during graph elimination
} The computational complexity of the Eliminate algorithm can therefore be read off the graph
} Begin with the moralized BN
} Choose an elimination ordering (query nodes should be last)
} Eliminate a node from the graph and add edges (called fill edges) between its remaining neighbors
} Iterate until all non-query nodes are eliminated

[Figure: the moralized graph over Y₁, …, Y₈, eliminated one node at a time; removing a node from the graph connects its remaining neighbors with fill edges]

  Summation ⇔ elimination
  Intermediate term ⇔ elimination clique
} Induced dependency during marginalization is captured in the induced graph (via fill edges)
} There is a correspondence between maximal cliques in the induced graph and the factors generated during elimination
} The complexity depends on the number of variables in the largest elimination clique
} The size of the maximal elimination clique in the induced graph governs the overall cost
} Finding the best elimination ordering is NP-hard
  } Equivalent to finding the tree-width of the graph, which is NP-hard
  } Tree-width: one less than the smallest achievable size of the largest elimination clique, over all orderings
} Good elimination orderings lead to small cliques and hence fast elimination
} What is the optimal order for trees?
} A polytree is a directed graph with no undirected cycles
} For poly-trees you can always find an ordering that is efficient
  } Try it!!
} Cut-set conditioning for Bayes' net inference
  } Choose a set of variables such that, if removed, only a polytree remains
  } Exercise: Think about how the specifics would work out!
} Reduction from CSP / 3-SAT:
  } If we can answer whether P(z) is equal to zero or not, we have answered whether the 3-SAT problem has a solution
  } Hence inference in Bayes' nets is NP-hard; there is no known efficient probabilistic inference procedure in general
} Interleave joining and marginalizing
} dᵏ entries computed for a factor over k variables with domain sizes d
} Ordering of elimination of variables can affect the size of factors generated
} Worst case: running time exponential in the size of the Bayes' net
} Representation
} Conditional Independences
} Probabilistic Inference
  } Enumeration (exact, exponential complexity)
  } Variable elimination (exact, worst-case exponential complexity, often better)
  } Inference is NP-complete
  } Sampling (approximate)
} Learning Bayes' Nets from Data
} Sampling is a lot like repeated simulation
  } Predicting the weather, basketball games, …
} Basic idea:
  } Draw N samples from a sampling distribution S
  } Compute an approximate posterior probability
  } Show this converges to the true probability P
} Why sample?
  } Learning: get samples from a distribution you don't know
  } Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination)
} Sampling from a given distribution:
  } Step 1: Get sample u from the uniform distribution over [0, 1)
    } E.g. random() in python
  } Step 2: Convert this sample u into an outcome for the given distribution by associating each outcome x with a sub-interval of [0, 1) of size P(x)

  C      P(C)
  red    0.6
  green  0.1
  blue   0.3

} If random() returns u = 0.83, then our sample is C = blue
} E.g. after sampling 8 times, we get a sequence of such outcomes
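A minimal sketch of this two-step recipe for the table above (the helper name is our own):

```python
import random

P_C = [("red", 0.6), ("green", 0.1), ("blue", 0.3)]

def sample(dist):
    """Map u ~ Uniform[0, 1) to the outcome whose sub-interval contains it."""
    u, cumulative = random.random(), 0.0
    for outcome, p in dist:
        cumulative += p
        if u < cumulative:
            return outcome
    return dist[-1][0]  # guard against floating-point round-off

print([sample(P_C) for _ in range(8)])  # e.g. ['blue', 'red', 'red', ...]
```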
} Prior Sampling
} Rejection Sampling
} Likelihood Weighting
} Gibbs Sampling
} Example network: Cloudy → Sprinkler, Cloudy → Rain, {Sprinkler, Rain} → WetGrass

  C    P(C)
  +c   0.5
  -c   0.5

  C   S    P(S|C)        C   R    P(R|C)
  +c  +s   0.1           +c  +r   0.8
  +c  -s   0.9           +c  -r   0.2
  -c  +s   0.5           -c  +r   0.2
  -c  -s   0.5           -c  -r   0.8

  S   R   W    P(W|S,R)
  +s  +r  +w   0.99
  +s  +r  -w   0.01
  +s  -r  +w   0.90
  +s  -r  -w   0.10
  -s  +r  +w   0.90
  -s  +r  -w   0.10
  -s  -r  +w   0.01
  -s  -r  -w   0.99

Samples: +c, -s, +r, +w
…
} For i = 1, 2, …, n (in topological order):
  } Sample xᵢ from P(Xᵢ | Parents(Xᵢ))
} Return (x₁, x₂, …, xₙ)
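A sketch of prior sampling on the Cloudy/Sprinkler/Rain/WetGrass network, hard-coding the CPT numbers from the tables above (function names are ours):

```python
import random

def flip(p):
    """Return True with probability p."""
    return random.random() < p

P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}   # P(+w | s, r)

def prior_sample():
    c = flip(0.5)
    s = flip(0.1 if c else 0.5)   # P(+s | c)
    r = flip(0.8 if c else 0.2)   # P(+r | c)
    w = flip(P_W[(s, r)])
    return c, s, r, w

samples = [prior_sample() for _ in range(100000)]
print(sum(s[3] for s in samples) / len(samples))  # estimate of P(+w)
```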
} This process generates samples with probability:
  S_PS(x₁, …, xₙ) = ∏ᵢ P(xᵢ | Parents(Xᵢ)) = P(x₁, …, xₙ)
  …i.e. the BN's joint probability
} Let the number of samples of an event be N_PS(x₁, …, xₙ)
} Then lim_{N→∞} N_PS(x₁, …, xₙ) / N = P(x₁, …, xₙ)
} I.e., the sampling procedure is consistent
} We'll get a bunch of samples from the BN:
  +c, -s, +r, +w
  +c, +s, +r, +w
  +c, -s, +r, +w
  …
} If we want to know P(W):
  } We have counts <+w : 4, -w : 1>
  } Normalize to get P(W) = <+w : 0.8, -w : 0.2>
  } This will get closer to the true distribution with more samples
  } Can estimate anything else, too
  } What about P(C | +w)? P(C | +r, +w)? P(C | -r, -w)?
  } Fast: can use fewer samples if less time (what's the drawback?)
} Samples, as before:
  +c, -s, +r, +w
  +c, +s, +r, +w
  +c, -s, +r, +w
  …
} Let's say we want P(C):
  } No point keeping all samples around
  } Just tally counts of C as we go
} Let's say we want P(C | +s):
  } Same thing: tally C outcomes, but ignore (reject) samples which don't have S = +s
} This is called rejection sampling
} It is also consistent for conditional probabilities (i.e., correct in the limit)
} IN: evidence instantiation
} For i = 1, 2, …, n:
  } Sample xᵢ from P(Xᵢ | Parents(Xᵢ))
  } If xᵢ not consistent with evidence:
    } Reject: return, and no sample is generated in this cycle
} Return (x₁, x₂, …, xₙ)
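Rejection sampling needs only a filter on top of prior_sample() from above; a sketch for the query P(+c | +s):

```python
def rejection_estimate(n=100000):
    """Estimate P(+c | +s): keep only samples consistent with S = +s."""
    kept = [c for c, s, r, w in (prior_sample() for _ in range(n)) if s]
    return sum(kept) / len(kept)   # assumes at least one sample survived

print(rejection_estimate())
```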
} Problem with rejection sampling:
  } If evidence is unlikely, rejects lots of samples
  } Evidence not exploited as you sample
  } Consider P(Shape | blue):
      pyramid, green; pyramid, red; sphere, blue; cube, red; sphere, green
} Idea: fix evidence variables and sample the rest
  } Problem: sample distribution not consistent!
  } Solution: weight by probability of evidence given parents
      pyramid, blue; pyramid, blue; sphere, blue; cube, blue; sphere, blue
} Likelihood weighting example (CPTs as in the prior-sampling slide), with evidence S = +s, W = +w:

  Samples: +c, +s, +r, +w with weight w = P(+s | +c) × P(+w | +s, +r) = 0.1 × 0.99 = 0.099
  …
} IN: evidence instantiation
} w = 1.0
} for i = 1, 2, …, n:
  } if Xᵢ is an evidence variable:
    } Xᵢ = observation xᵢ for Xᵢ
    } Set w = w × P(xᵢ | Parents(Xᵢ))
  } else:
    } Sample xᵢ from P(Xᵢ | Parents(Xᵢ))
} return (x₁, x₂, …, xₙ), w
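A sketch of this pseudocode specialized to the query P(C | +s, +w) on the same network, reusing flip() and P_W from the prior-sampling sketch (the evidence choice follows the example sample above):

```python
def weighted_sample():
    """One likelihood-weighted sample of C with evidence S = +s, W = +w."""
    weight = 1.0
    c = flip(0.5)
    s = True                        # evidence: set, don't sample
    weight *= 0.1 if c else 0.5     # w *= P(+s | c)
    r = flip(0.8 if c else 0.2)
    w = True                        # evidence: set, don't sample
    weight *= P_W[(s, r)]           # w *= P(+w | +s, r)
    return c, weight

pairs = [weighted_sample() for _ in range(100000)]
print(sum(wt for c, wt in pairs if c) / sum(wt for _, wt in pairs))  # ≈ P(+c | +s, +w)
```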
} Sampling distribution if z is sampled and e is fixed evidence:
  S_WS(z, e) = ∏ᵢ P(zᵢ | Parents(Zᵢ))
} Now, samples have weights:
  w(z, e) = ∏ᵢ P(eᵢ | Parents(Eᵢ))
} Together, the weighted sampling distribution is consistent:
  S_WS(z, e) × w(z, e) = P(z, e)
} Likelihood weighting is good:
  } We have taken evidence into account as we generate the sample
  } E.g. here, W's value will get picked based on the evidence values of S, R
  } More of our samples will reflect the state of the world suggested by the evidence
} Likelihood weighting doesn't solve all our problems:
  } Evidence influences the choice of downstream variables, but not upstream ones (C isn't more likely to get a value matching the evidence)
} We would like to consider evidence when we sample every variable → Gibbs sampling
} Procedure: keep track of a full instantiation x₁, x₂, …, xₙ
  } Start with an arbitrary instantiation consistent with the evidence
  } Sample one variable at a time, conditioned on all the rest, but keep evidence fixed
  } Keep repeating this for a long time
} Property: in the limit of repeating this infinitely many times, the resulting sample comes from the correct distribution
} Rationale: both upstream and downstream variables condition on evidence
} In contrast: likelihood weighting only conditions on upstream evidence, and hence the weights obtained in likelihood weighting can sometimes be very small
  } The sum of weights over all samples indicates how many "effective" samples were obtained, so we want high weight
} Step 1: Fix evidence
  } R = +r
} Step 2: Initialize other variables
  } Randomly
} Step 3: Repeat
  } Choose a non-evidence variable X
  } Resample X from P(X | all other variables)
} How is this better than sampling from the full joint?
  } In a Bayes' Net, resampling a variable given all the other variables is cheap
  } Only requires a join on the variable to be sampled (in this case, a join on R)
  } The resulting factor only depends on the variable's parents, its children, and its children's parents (this is often referred to as its Markov blanket)
} Example: sample from P(S | +c, +r, -w)
  } Many things cancel out – only CPTs with S remain!
  } More generally: only CPTs that contain the resampled variable need to be considered, and joined together
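A sketch of Gibbs sampling for P(C | +r) on the same network: each resampling step uses only the CPTs that mention the variable, i.e. its Markov blanket. It reuses flip() and P_W from the prior-sampling sketch; all helper names are ours:

```python
def p_s_given_c(s, c):
    p = 0.1 if c else 0.5            # P(+s | c) from the CPT
    return p if s else 1.0 - p

def p_r_given_c(r, c):
    p = 0.8 if c else 0.2            # P(+r | c) from the CPT
    return p if r else 1.0 - p

def gibbs_estimate(iters=50000, burn_in=1000):
    """Estimate P(+c | +r) by resampling C, S, W in turn with R fixed."""
    r = True                                    # evidence R = +r stays fixed
    c, s, w = flip(0.5), flip(0.5), flip(0.5)   # arbitrary initialization
    hits = 0
    for t in range(iters):
        # P(C | s, r) ∝ P(C) P(s|C) P(r|C): only CPTs mentioning C survive
        a = 0.5 * p_s_given_c(s, True) * p_r_given_c(r, True)
        b = 0.5 * p_s_given_c(s, False) * p_r_given_c(r, False)
        c = flip(a / (a + b))
        # P(S | c, r, w) ∝ P(S|c) P(w|S, r)
        a = p_s_given_c(True, c) * (P_W[(True, r)] if w else 1 - P_W[(True, r)])
        b = p_s_given_c(False, c) * (P_W[(False, r)] if w else 1 - P_W[(False, r)])
        s = flip(a / (a + b))
        # W has no children: P(W | s, r) is just its CPT entry
        w = flip(P_W[(s, r)])
        if t >= burn_in:
            hits += c
    return hits / (iters - burn_in)

print(gibbs_estimate())
```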
} Prior Sampling: P
} Likelihood Weighting: P(Q | e)
} Rejection Sampling: P(Q | e)
} Gibbs Sampling: P(Q | e)
} Gibbs sampling produces a sample from the query distribution P(Q | e) in the limit of re-sampling infinitely often
} Gibbs sampling is a special case of more general methods called Markov chain Monte Carlo (MCMC) methods
  } Metropolis-Hastings is one of the more famous MCMC methods (in fact, Gibbs sampling is a special case of Metropolis-Hastings)
} You may read about Monte Carlo methods – they're just sampling