Probabilistic Graphical Models: Bayesian Networks Li Xiong
Slide credits: Page (Wisconsin) CS760, Zhu (Wisconsin) KDD ’12 tutorial
Outline:
– Graphical models
– Bayesian networks: definition
– Bayesian networks: inference
– Bayesian networks: learning
Overview
There are two envelopes: one holds a red ball (worth $100) and a black ball; the other holds two black balls. You randomly picked an envelope and randomly took out a ball; it was black. At this point, you are given the option to switch envelopes. Should you?
Random variables: E ∈ {1, 0} (which envelope you hold), B ∈ {r, b} (the color of the ball you drew).
P(E = 1) = P(E = 0) = 1/2
P(B = r | E = 1) = 1/2, P(B = r | E = 0) = 0
We ask: P(E = 1 | B = b).

P(E = 1 | B = b) = P(B = b | E = 1) P(E = 1) / P(B = b) = (1/2 × 1/2) / (3/4) = 1/3

So you should switch: the other envelope holds the red ball with probability 2/3.
The graphical model: E → B
Overview
A graphical model represents a joint distribution over random variables, e.g. (x1, …, xn−1) a feature vector and xn ≡ y the class label. Typical tasks:
– Inference: compute p(XQ | XE) where XQ ∪ XE ⊆ {x1, …, xn}
– e.g. classification: Q = {n}, E = {1, …, n − 1}; by the definition of conditional probability,
p(xn | x1, …, xn−1) = p(x1, …, xn−1, xn) / Σ_v p(x1, …, xn−1, xn = v)
– Learning: estimate the model from i.i.d. samples X(1), …, X(m), where X(i) = (x1(i), …, xn(i))
Overview
– Naive storage of the joint distribution is exponential (2^n entries for n binary r.v.s) and hard to interpret (conditional independences are not explicit)
– We often can't afford to do inference by brute force
Definitions
A graphical model is a probabilistic model whose independence structure is expressed by a graph. Just because a diagram has nodes and edges doesn't mean it's a graphical model. These are not graphical models: neural networks, decision trees, network flow diagrams, HMM templates.
– Bayesian networks: directed graphs
– Markov networks: undirected graphs
Bayesian Networks: Definition
– Nodes are random variables
– Directed edges between nodes reflect dependence
[Figures: two small example networks over {Understood Material, Assignment Grade, Exam Grade} and {Fire, Smoking At Sensor, Alarm}]
A Bayesian network consists of a directed acyclic graph and a set of conditional probability distributions (CPDs), one per node, representing P(X | Parents(X)):
– each node denotes a random variable
– each edge from X to Y represents that X directly influences Y
– formally: each variable X is independent of its non-descendants given its parents
Definitions: Directed Graphical Models
– Two r.v.s A, B are independent if P(A, B) = P(A)P(B), or equivalently P(A | B) = P(A)
– Two r.v.s A, B are conditionally independent given C if P(A, B | C) = P(A | C)P(B | C), or equivalently P(A | B, C) = P(A | C)
Any joint probability distribution can be expressed via the chain rule:

P(X1, …, Xn) = P(X1) ∏_{i=2}^{n} P(Xi | X1, …, Xi−1)

Using the conditional independences encoded by the graph, the joint can be expressed as

P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Parents(Xi))
Example: the alarm network.
– B = a burglary occurs at your house
– E = an earthquake occurs at your house
– A = the alarm goes off
– J = John calls to report the alarm
– M = Mary calls to report the alarm
A typical query: P(B | M, J)?
[Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]
P(A | B, E):
B E | P(A = t) P(A = f)
t t | 0.95  0.05
t f | 0.94  0.06
f t | 0.29  0.71
f f | 0.001 0.999

P(B): t 0.001, f 0.999
P(E): t 0.001, f 0.999

P(J | A):
A | P(J = t) P(J = f)
t | 0.9   0.1
f | 0.05  0.95

P(M | A):
A | P(M = t) P(M = f)
t | 0.7   0.3
f | 0.01  0.99
P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)
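To make the factorization concrete, here is a minimal Python sketch (mine, not from the slides) that stores the CPTs above and evaluates the joint probability of one complete assignment; any of the 2^5 = 32 joint entries can be computed on demand this way.

    # CPTs from the tables above: probability that each variable is true,
    # keyed by its parents' values.
    P_B, P_E = 0.001, 0.001
    P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
    P_J = {1: 0.90, 0: 0.05}
    P_M = {1: 0.70, 0: 0.01}

    def bern(p_true, v):
        """P(X = v) for binary v when P(X = 1) = p_true."""
        return p_true if v else 1.0 - p_true

    def joint(b, e, a, j, m):
        """P(B,E,A,J,M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A)."""
        return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
                * bern(P_J[a], j) * bern(P_M[a], m))

    print(joint(1, 0, 1, 1, 1))  # P(b,¬e,a,j,m) = 0.001*0.999*0.94*0.9*0.7 ≈ 0.00059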
Advantages of the Bayesian network representation:
– Compactness: the full joint for the 5-variable alarm example has 2^5 = 32 parameters. For a larger example network of 10 binary variables (figure not shown): how many parameters does a full representation of the joint distribution have? 2^10 = 1024. How many does the graph-structured representation have? 2 + 4 + 4 + 4 + 4 + 4 + 4 + 4 + 8 + 4 = 42, one CPT per node.
– A Bayes net represents dependencies among variables only where they exist.
– The compact representation can greatly reduce the complexity of inference.
Bayesian Networks: Inference
The inference task. Given: values for some variables in the network (evidence) and a set of query variables. Do: compute the posterior distribution over the query variables.
– Variables that are neither evidence nor query are hidden (other) variables.
– Any set can be the evidence variables and any set can be the query variables.
Special case: naive Bayes. Derive the maximum posterior under the independence assumption (a simplified network in which the class C is the only parent of each attribute xk):

P(Ci | X) = P(X | Ci) P(Ci) / P(X)

P(X | Ci) = ∏_{k=1}^{n} P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)
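A small illustration of this computation (a sketch with made-up class priors and per-attribute likelihood tables, not values from the slides): the posterior is the prior times the per-attribute likelihoods, normalized by P(X).

    # Hypothetical two-class, two-attribute example:
    # posterior(Ci) ∝ P(Ci) * prod_k P(xk | Ci).
    priors = {"C1": 0.6, "C2": 0.4}
    likelihoods = {  # one table P(xk = value | class) per attribute
        "C1": [{"a": 0.7, "b": 0.3}, {"a": 0.2, "b": 0.8}],
        "C2": [{"a": 0.4, "b": 0.6}, {"a": 0.5, "b": 0.5}],
    }

    def naive_bayes_posterior(x):
        scores = {}
        for c, prior in priors.items():
            score = prior
            for k, value in enumerate(x):
                score *= likelihoods[c][k][value]
            scores[c] = score
        z = sum(scores.values())  # the normalizer P(X)
        return {c: s / z for c, s in scores.items()}

    print(naive_bayes_posterior(["a", "b"]))  # {'C1': ~0.81, 'C2': ~0.19}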
Inference: Exact Inference
Let X = (XQ, XE, XO) for query, evidence, and other variables, and infer P(XQ | XE). By definition,

P(XQ | XE) = P(XQ, XE) / P(XE) = Σ_{XO} P(XQ, XE, XO) / Σ_{XQ, XO} P(XQ, XE, XO)
Example: “probability the house is being burglarized given that John and Mary both called,” i.e. P(b | j, m). The numerator is

P(b, j, m) = P(b) Σ_e Σ_a P(e) P(a | b, e) P(j | a) P(m | a)

summing over the possible values for the E and A variables: (e, a), (e, ¬a), (¬e, a), (¬e, ¬a). With the CPTs above:

P(b, j, m) = 0.001 × (0.001 × 0.95 × 0.9 × 0.7 + 0.001 × 0.05 × 0.05 × 0.01 + 0.999 × 0.94 × 0.9 × 0.7 + 0.999 × 0.06 × 0.05 × 0.01) ≈ 0.00059
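The same computation in a short Python sketch (mine, not the slides'): enumerate the hidden variables E and A, sum the joint, and normalize over B to get the posterior.

    from itertools import product

    # CPTs and joint() exactly as in the earlier factorization sketch.
    P_B, P_E = 0.001, 0.001
    P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
    P_J = {1: 0.90, 0: 0.05}
    P_M = {1: 0.70, 0: 0.01}

    def bern(p_true, v):
        return p_true if v else 1.0 - p_true

    def joint(b, e, a, j, m):
        return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
                * bern(P_J[a], j) * bern(P_M[a], m))

    def p_b_jm(b):
        """P(B = b, j, m): sum out the hidden variables E and A."""
        return sum(joint(b, e, a, 1, 1) for e, a in product((0, 1), repeat=2))

    num, den = p_b_jm(1), p_b_jm(1) + p_b_jm(0)
    print(num)        # ≈ 0.00059, matching the hand computation above
    print(num / den)  # posterior P(b | j, m) ≈ 0.31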
Computational issue: exact inference requires summing an exponential number of terms; with k variables in XO, each taking r values, there are r^k terms in the sum.
Bayesian Networks: Approximate Inference
Inference: Markov Chain Monte Carlo

Network parameters for this example (from the figure): P(B) = 0.001, P(E) = 0.002; P(A | B, E) = 0.95, P(A | B, ¬E) = 0.94, P(A | ¬B, E) = 0.29, P(A | ¬B, ¬E) = 0.001; P(J | A) = 0.9, P(J | ¬A) = 0.05; P(M | A) = 0.7, P(M | ¬A) = 0.01
To generate a sample X = (B, E, A, J, M):
– Sample B ∼ Ber(0.001): draw r ∼ U(0, 1); if r < 0.001 then B = 1, else B = 0
– Sample E ∼ Ber(0.002)
– If B = 1 and E = 1, sample A ∼ Ber(0.95), and so on for the other parent configurations
– If A = 1, sample J ∼ Ber(0.9), else J ∼ Ber(0.05)
– If A = 1, sample M ∼ Ber(0.7), else M ∼ Ber(0.01)
Given the inference task P(B = 1 | E = 1, M = 1): throw away all samples except those with (E = 1, M = 1). Then

P(B = 1 | E = 1, M = 1) ≈ (1/m) Σ_{i=1}^{m} 1(B(i) = 1)

where m is the number of surviving samples.
Issue: this can be highly inefficient (note that P(E = 1) is tiny); few samples agree with the evidence.
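A runnable sketch of both steps (my code, not the tutorial's): forward sampling in topological order, then rejection against the evidence. With P(E = 1) = 0.002, the rejection rate makes the inefficiency obvious.

    import random

    # CPTs used in the MCMC slides (note P(E) = 0.002 here).
    P_B, P_E = 0.001, 0.002
    P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
    P_J = {1: 0.90, 0: 0.05}
    P_M = {1: 0.70, 0: 0.01}

    def flip(p):
        """Bernoulli(p) draw via a uniform r ~ U(0,1), as described above."""
        return 1 if random.random() < p else 0

    def forward_sample():
        """Sample (B, E, A, J, M) parents-first."""
        b, e = flip(P_B), flip(P_E)
        a = flip(P_A[(b, e)])
        return b, e, a, flip(P_J[a]), flip(P_M[a])

    # Estimate P(B=1 | E=1, M=1): keep only samples matching the evidence.
    kept = hits = 0
    for _ in range(1_000_000):
        b, e, a, j, m = forward_sample()
        if e == 1 and m == 1:
            kept += 1
            hits += b
    print(kept, "surviving samples; estimate:", hits / kept if kept else None)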
Inference: MCMC with Gibbs sampling
Initialization: fix the evidence variables (E = 1, M = 1); randomly set the other variables, e.g. X(0) = (B = 0, E = 1, A = 0, J = 0, M = 1).
At each step, pick a non-evidence variable xi and resample it: xi ∼ P(xi | X−i), which is equivalent to xi ∼ P(xi | MarkovBlanket(xi)), where

P(xi | MarkovBlanket(xi)) ∝ P(xi | Pa(xi)) ∏_{y ∈ C(xi)} P(y | Pa(y))

and Pa(x) are the parents of x, C(x) the children of x.

Example: B ∼ P(B | E = 1, A = 0, J = 0, M = 1) ∝ P(B | E = 1, A = 0) ∝ P(B) P(A = 0 | B, E = 1)
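Here is the B update from the example as a minimal Python sketch (an illustration of mine, not the tutorial's code): B's Markov blanket is {E, A} (its child A and A's other parent E), so we resample B in proportion to P(B) P(a | B, e).

    import random

    P_B = 0.001
    P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

    def bern(p_true, v):
        return p_true if v else 1.0 - p_true

    def resample_B(e, a):
        """Gibbs update B ~ P(B | e, a) ∝ P(B) P(a | B, e)."""
        w1 = P_B * bern(P_A[(1, e)], a)        # unnormalized weight for B = 1
        w0 = (1 - P_B) * bern(P_A[(0, e)], a)  # unnormalized weight for B = 0
        return 1 if random.random() < w1 / (w0 + w1) else 0

    # Slide example: the current state has E = 1, A = 0, so
    # P(B=1 | MB) = 0.001*0.05 / (0.001*0.05 + 0.999*0.71) ≈ 7e-5.
    print(resample_B(e=1, a=0))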
Suppose the resampled value is B = 1; the new state is X(1) = (B = 1, E = 1, A = 0, J = 0, M = 1). Continue resampling the non-evidence variables in turn, and estimate P(B = 1 | E = 1, M = 1) as the fraction of samples with B = 1.
Why this works: the sampler simulates a Markov chain whose stationary distribution π satisfies π = Tπ. Each Gibbs update has transition kernel Ti((X−i, x′i) | (X−i, xi)) = p(x′i | X−i), and the chain's stationary distribution is the desired p(XQ | XE).
Bayesian Networks: Learning
The parameter learning task. Given: a set of training instances and the graph structure of a BN. Do: infer the parameters of the CPDs.

B E A J M
f f f t f
f t f f f
f f t f t
…

The learning task with partial data. Given: a set of training instances with some unobservable data and the graph structure of a BN. Do: infer the missing data values (and the parameters of the CPDs too).

B E A J M
f f ? t f
f t ? f f
f f ? f t
…

The structure learning task. Given: a set of training instances. Do: infer the graph structure of a BN (and perhaps the parameters of the CPDs too).
Bayesian Networks: Parameter Learning
Maximum likelihood estimation (MLE):
– given a model structure (e.g. a Bayes net graph) and a set of data D
– set the model parameters θ to maximize P(D | θ)
i.e. make the data D look as likely as possible under the model P(D | θ).
Example: consider trying to estimate the parameter θ (probability of heads) of a biased coin from a sequence of flips x1, …, xn, where xi = 1 for heads and xi = 0 for tails. The likelihood function for θ is given by

L(θ) = P(x1, …, xn | θ) = ∏_{i=1}^{n} θ^{xi} (1 − θ)^{1−xi} = θ^{Σ xi} (1 − θ)^{n − Σ xi}

For h heads in n flips, the MLE is θ̂ = h/n.
MLE in a Bayes net: consider estimating the CPD parameters for B and J in the alarm network given the following data set:

B E A J M
f f f t f
f t f f f
f f f t t
t f f f t
f f t t f
f f t f t
f f t t t
f f t t t

P(b) = 1/8 = 0.125, P(¬b) = 7/8 = 0.875
P(j | a) = 3/4 = 0.75, P(¬j | a) = 1/4 = 0.25
P(j | ¬a) = 2/4 = 0.5, P(¬j | ¬a) = 2/4 = 0.5
Suppose instead our data set was this one, with no burglaries observed:

B E A J M
f f f t f
f t f f f
f f f t t
f f f f t
f f t t f
f f t f t
f f t t t
f f t t t

Now P(b) = 0/8 = 0 and P(¬b) = 8/8 = 1. Do we really want to set this to 0?
Laplace estimates: instead of estimating parameters strictly from the data, we could start with some prior belief for each value v:

P(X = x) = (nx + 1) / Σ_{v ∈ Values(X)} (nv + 1)

where nv is the number of occurrences of value v; the added 1s act as pseudocounts.

A more general form, m-estimates:

P(X = x) = (nx + px m) / ((Σ_{v ∈ Values(X)} nv) + m)

where px is the prior probability of value x and m is the number of “virtual” instances.
Using the same data set (which has no instances with B = t), let's estimate the parameters for B using m = 4 and pb = 0.25:

P(b) = (0 + 0.25 × 4) / (8 + 4) = 1/12 ≈ 0.08
P(¬b) = (8 + 0.75 × 4) / (8 + 4) = 11/12 ≈ 0.92
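The m-estimate is easy to express in code; a minimal sketch (the function name is mine):

    def m_estimate(counts, m, priors):
        """P(X = x) = (n_x + p_x * m) / (sum_v n_v + m), per value x."""
        n = sum(counts.values())
        return {x: (counts[x] + priors[x] * m) / (n + m) for x in counts}

    # The slide's example: 8 instances, none with B = t, m = 4, p_b = 0.25.
    print(m_estimate({"t": 0, "f": 8}, m=4, priors={"t": 0.25, "f": 0.75}))
    # -> {'t': 0.0833..., 'f': 0.9166...}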
Parameter Learning and Inference with Partial Data (EM)
Given: a set of training instances with some unobservable data (here, A is never observed) and the graph structure of a BN. Do: infer the missing data values (and the parameters of the CPDs).

B E A J M
f f ? t f
f t ? f f
f f ? f t
…
The Expectation-Maximization (EM) approach.
Given: the network structure and initial estimated probability parameters.
Repeat until convergence:
– E-step: compute the expectation over the missing values, using the current parameters
– M-step: compute/update the maximum likelihood (MLE) or maximum posterior probability (MAP, maximum a posteriori) parameters, using the expected values
Example: suppose we're given the following initial BN and training set (A is unobserved).

Initial parameters:
P(B) = 0.1, P(E) = 0.2
P(A | B, E): B E = t t → 0.9; t f → 0.6; f t → 0.3; f f → 0.2
P(J | A): A = t → 0.9; A = f → 0.2
P(M | A): A = t → 0.8; A = f → 0.1

B E A J M
f f ? f f
f f ? t f
t f ? t t
f f ? f t
f t ? t f
f f ? f t
t t ? t t
f f ? f f
f f ? t f
f f ? f t
E-step: for each instance, compute P(a | b, e, j, m) under the current parameters. For example:

P(a | ¬b, ¬e, ¬j, ¬m) = P(¬b, ¬e, a, ¬j, ¬m) / [P(¬b, ¬e, a, ¬j, ¬m) + P(¬b, ¬e, ¬a, ¬j, ¬m)]
= (0.9 × 0.8 × 0.2 × 0.1 × 0.2) / (0.9 × 0.8 × 0.2 × 0.1 × 0.2 + 0.9 × 0.8 × 0.8 × 0.8 × 0.9)
= 0.00288 / 0.4176 = 0.0069

P(a | ¬b, ¬e, j, ¬m) = (0.9 × 0.8 × 0.2 × 0.9 × 0.2) / 0.1296 = 0.02592 / 0.1296 = 0.2

P(a | b, ¬e, j, m) = (0.1 × 0.8 × 0.6 × 0.9 × 0.8) / 0.0352 = 0.03456 / 0.0352 = 0.98
Filling in the expected values for A:

B E  A                      J M
f f  t: 0.0069, f: 0.9931   f f
f f  t: 0.2,    f: 0.8      t f
t f  t: 0.98,   f: 0.02     t t
f f  t: 0.2,    f: 0.8      f t
f t  t: 0.3,    f: 0.7      t f
f f  t: 0.2,    f: 0.8      f t
t t  t: 0.997,  f: 0.003    t t
f f  t: 0.0069, f: 0.9931   f f
f f  t: 0.2,    f: 0.8      t f
f f  t: 0.2,    f: 0.8      f t
M-step: re-estimate the probabilities using expected counts, e.g.

P(a | b, e) = E[#(a, b, e)] / E[#(b, e)]

P(a | b, e) = 0.997 / 1 = 0.997
P(a | b, ¬e) = 0.98 / 1 = 0.98
P(a | ¬b, e) = 0.3 / 1 = 0.3
P(a | ¬b, ¬e) = (0.0069 + 0.2 + 0.2 + 0.2 + 0.0069 + 0.2 + 0.2) / 7 ≈ 0.145

Updated CPT:
B E | P(A)
t t | 0.997
t f | 0.98
f t | 0.3
f f | 0.145
Re-estimate the probabilities for P(J | A) and P(M | A) in the same way.
Alternate E-steps (expectation over missing values) and M-steps (MLE or MAP parameter updates), starting from the initial estimated probability parameters, until the parameters converge.
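A compact sketch of one EM iteration's key pieces for this example (my code; the data and initial parameters are the ones above): the E-step posterior for A, and the expected-count ratio the M-step uses for P(a | ¬b, ¬e).

    # One E-step for the alarm network with A unobserved.
    P_B, P_E = 0.1, 0.2
    P_A = {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.3, (0, 0): 0.2}
    P_J = {1: 0.9, 0: 0.2}
    P_M = {1: 0.8, 0: 0.1}

    def bern(p_true, v):
        return p_true if v else 1.0 - p_true

    def posterior_a(b, e, j, m):
        """P(A = 1 | b, e, j, m) under the current parameters."""
        w = [bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
             * bern(P_J[a], j) * bern(P_M[a], m) for a in (0, 1)]
        return w[1] / (w[0] + w[1])

    data = [(0,0,0,0), (0,0,1,0), (1,0,1,1), (0,0,0,1), (0,1,1,0),
            (0,0,0,1), (1,1,1,1), (0,0,0,0), (0,0,1,0), (0,0,0,1)]  # (B,E,J,M)

    # Expected counts: P(a | ¬b, ¬e) = E[#(a, ¬b, ¬e)] / E[#(¬b, ¬e)]
    num = sum(posterior_a(b, e, j, m) for b, e, j, m in data if (b, e) == (0, 0))
    den = sum(1 for b, e, j, m in data if (b, e) == (0, 0))
    print(round(posterior_a(0, 0, 0, 0), 4))  # 0.0069, as in the worked example
    print(round(num / den, 3))                # ≈ 0.145, as in the M-step above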
Bayesian Networks: Structure Learning
How can we learn the graph structure? Two common strategies, detailed below:
– search a very restricted space of possible structures (e.g. networks with tree DAGs)
– use heuristic search (e.g. sparse candidate)
The Chow-Liu algorithm (Chow & Liu, 1968) learns a BN with a tree structure that maximizes the likelihood of the training data:
1. compute the weight I(Xi; Xj) of each possible edge (Xi, Xj) as the mutual information under the joint empirical distribution,
I(X; Y) = Σ_{x ∈ values(X)} Σ_{y ∈ values(Y)} P(x, y) log2 [P(x, y) / (P(x) P(y))]
2. find a maximum-weight spanning tree (MST) over these edge weights
3. assign edge directions in the MST (pick any root and point edges away from it)
A spanning tree is an acyclic subgraph (a tree) that connects all vertices in a graph; a maximum-weight spanning tree is one whose summed edge weights are maximal.
[Figure: example graph over vertices A–G with fractional edge weights (1/5, 1/6, 1/7, 1/8, 1/9, 1/11, 1/15, …)]
Prim's algorithm:
given: graph with vertices V and edges E
Vnew ← { v }, where v is an arbitrary vertex from V
Enew ← { }
repeat until Vnew = V {
    choose an edge (u, v) in E with max weight, where u is in Vnew and v is not
    add v to Vnew and (u, v) to Enew
}
return Vnew and Enew, which represent an MST
Kruskal's algorithm:
given: graph with vertices V and edges E
Enew ← { }
for each (u, v) in E, ordered by weight (from high to low) {
    remove (u, v) from E
    if adding (u, v) to Enew does not create a cycle
        add (u, v) to Enew
}
return V and Enew, which represent an MST
[Figure: the resulting maximum-weight spanning tree on the example graph]
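Putting the pieces together, a small Python sketch of Chow-Liu (my illustration, not the slides' code): empirical mutual information as edge weights, then Prim's algorithm for the maximum-weight spanning tree.

    import numpy as np
    from itertools import combinations

    def mutual_information(x, y):
        """Empirical I(X;Y) in bits for two 0/1 arrays."""
        mi = 0.0
        for a in (0, 1):
            for b in (0, 1):
                p_xy = np.mean((x == a) & (y == b))
                p_x, p_y = np.mean(x == a), np.mean(y == b)
                if p_xy > 0:
                    mi += p_xy * np.log2(p_xy / (p_x * p_y))
        return mi

    def chow_liu_edges(data):
        """Edge weights = pairwise MI; max-weight spanning tree via Prim's."""
        n_vars = data.shape[1]
        w = {(i, j): mutual_information(data[:, i], data[:, j])
             for i, j in combinations(range(n_vars), 2)}
        in_tree, edges = {0}, []
        while len(in_tree) < n_vars:
            u, v = max((e for e in w if (e[0] in in_tree) != (e[1] in in_tree)),
                       key=w.get)
            edges.append((u, v))
            in_tree.update((u, v))
        return edges  # undirected; point edges away from any chosen root

    rng = np.random.default_rng(0)
    x0 = rng.integers(0, 2, 500)
    x1 = np.where(rng.random(500) < 0.9, x0, 1 - x0)  # strongly tied to x0
    x2 = rng.integers(0, 2, 500)                      # independent noise
    print(chow_liu_edges(np.column_stack([x0, x1, x2])))  # (0, 1) plus one weak edge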
Why does Chow-Liu find the tree that maximizes the data likelihood?
– Why can we represent the data likelihood as a sum of I(X;Y) terms?
– Why can we pick any direction for the edges in the tree?

The log-likelihood of the data D given a directed graph G with best-fit parameters θG is

log2 P(D | G, θG) = Σ_{d ∈ D} Σ_i log2 P(xi(d) | Pa(Xi))
= |D| Σ_i Σ_{values(Xi, Pa(Xi))} P(Xi, Pa(Xi)) log2 P(Xi | Pa(Xi))

(since summing over all examples is equivalent to computing an average under the empirical distribution, times |D|). Writing P(Xi | Pa(Xi)) = P(Xi, Pa(Xi)) / P(Pa(Xi)) and adding and subtracting log2 P(Xi) gives

log2 P(D | G, θG) = |D| Σ_i (I(Xi; Pa(Xi)) − H(Xi))

We're interested in finding the graph G that maximizes this. The entropy terms H(Xi) don't depend on the graph structure, so it suffices to maximize Σ_i I(Xi; Pa(Xi)). If we assume a tree, one node has no parents and all others have exactly one, so the sum becomes Σ_{(Xi, Xj) ∈ edges} I(Xi; Xj): exactly the total weight of the spanning tree. Edge directions don't matter for the likelihood, because mutual information is symmetric.
Beyond trees: instead of restricting the space to tree DAGs, we can search the general space of structures with heuristic search (e.g. the sparse candidate algorithm).
Heuristic search over network structures: each state in the search space is a candidate net structure. We need
– state transition operators
– a scoring function for states
– a search algorithm (how to move through the state space)
Given the current network at some stage of the search, we can apply one of three operators (illustrated in the slides on a four-node network over A, B, C, D), subject to keeping the graph acyclic:
– add an edge
– delete an edge
– reverse an edge
If the score is the data log-likelihood, then the score can be decomposed as a sum of per-node terms (and so can some other scores we'll see later):

score(G : D) = Σ_i score(Xi, Pa(Xi) : D)

This decomposability lets us
– score a network by summing terms over the nodes in the network
– efficiently score changes in a local search procedure (only nodes whose parent sets changed need rescoring)
Greedy hill-climbing structure search:
given: data set D, initial network B0
i = 0
Bbest ← B0
while stopping criteria not met {
    for each possible operator application a {
        Bnew ← apply(a, Bi)
        if score(Bnew) > score(Bbest)
            Bbest ← Bnew
    }
    ++i
    Bi ← Bbest
}
return Bi
The sparse candidate algorithm [Friedman et al., UAI 1999]:
given: data set D, initial network B0, parameter k
i = 0
repeat {
    ++i
    // restrict step
    select for each variable Xj a set Cj(i) (|Cj(i)| ≤ k) of candidate parents
    // maximize step
    find network Bi maximizing score among networks where ∀Xj, Parents(Xj) ⊆ Cj(i)
} until convergence
return Bi
One way to perform the restrict step: compute the mutual information between pairs of variables and select the top k candidate parents for each variable, where

I(X; Y) = Σ_{x,y} P(x, y) log2 [P(x, y) / (P(x) P(y))]
The maximize step is itself a structure search (e.g. greedy hill-climbing with the add-edge, delete-edge, and reverse-edge operators), restricted to the candidate parent sets.

Scoring structures: can we find the highest-scoring network simply by maximizing the likelihood of the data, arg max_G P(D | G, θG)? In general no: adding an edge never decreases the maximum likelihood, so fully connected networks always win and we overfit. If we restrict the space of structures (e.g. to a tree), then maybe.
Instead, score candidate structures with the log-likelihood minus a complexity penalty:
– Akaike Information Criterion (AIC): score_AIC(G : D) = log P(D | θ̂G, G) − |θG|
– Bayesian Information Criterion (BIC): score_BIC(G : D) = log P(D | θ̂G, G) − (log m / 2) |θG|
where θ̂G are the maximum likelihood parameters, |θG| is the number of free parameters of G, and m is the number of training instances.
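As a sketch of how such a penalized, decomposable score might be computed for binary variables (my code; it uses the BIC penalty of (log m)/2 per free parameter):

    import numpy as np
    from collections import defaultdict

    def bic_score(data, parents):
        """Decomposable BIC for binary variables: one term per node.

        data: (m, n) 0/1 array; parents: dict node -> tuple of parent columns.
        """
        m = data.shape[0]
        score = 0.0
        for x, pa in parents.items():
            groups = defaultdict(list)  # instances grouped by parent configuration
            for row in data:
                groups[tuple(row[list(pa)])].append(row[x])
            for vals in groups.values():
                for c in (sum(vals), len(vals) - sum(vals)):  # counts of 1s, 0s
                    if c > 0:
                        score += c * np.log(c / len(vals))  # log-likelihood at MLE
            score -= (np.log(m) / 2) * 2 ** len(pa)  # one free param per config
        return score

    # Adding the edge 0 -> 1 pays off on correlated data, despite the penalty.
    rng = np.random.default_rng(1)
    x0 = rng.integers(0, 2, 200)
    x1 = np.where(rng.random(200) < 0.9, x0, 1 - x0)
    data = np.column_stack([x0, x1])
    print(bic_score(data, {0: (), 1: ()}))    # independent structure
    print(bic_score(data, {0: (), 1: (0,)}))  # with edge 0 -> 1 (higher)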