CS 6355: Structured Prediction
Graphical Models
So far, we discussed sequence labeling tasks:
– HMM: Hidden Markov Models
– MEMM: Maximum Entropy Markov Models
– CRF: Conditional Random Fields
All of these models use a linear chain structure.
[Figure: the linear-chain structures of the HMM, MEMM, and CRF, each drawn over labels y_{t-1}, y_t and input x_t]
Graphical models give us:
– Directed or undirected graphs
– Algorithms for computing marginal and conditional probabilities
– An “inference engine”
– A way to introduce prior probability distributions
Example from Russell and Norvig
Joint probability (with B = Burglary, E = Earthquake, A = Alarm, J = JohnCalls, M = MaryCalls):

P(B, E, A, J, M) = P(B) · P(E) · P(A ∣ B, E) · P(J ∣ A) · P(M ∣ A)

The network and its parameters are a compact representation of the joint probability distribution.
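As a concrete sketch, here is that factorization as code. The CPT numbers below are the ones commonly quoted for this textbook example; treat them as illustrative.

# Minimal sketch of the Russell & Norvig alarm network.
P_B = {True: 0.001, False: 0.999}                # P(Burglary)
P_E = {True: 0.002, False: 0.998}                # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,  # P(Alarm=true | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                  # P(JohnCalls=true | A)
P_M = {True: 0.70, False: 0.01}                  # P(MaryCalls=true | A)

def bernoulli(p_true, value):
    """Probability that a boolean variable takes `value`."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)."""
    return (P_B[b] * P_E[e]
            * bernoulli(P_A[(b, e)], a)
            * bernoulli(P_J[a], j)
            * bernoulli(P_M[a], m))

# e.g. alarm went off and both neighbors called, with no burglary
# and no earthquake:
print(joint(False, False, True, True, True))  # ~ 0.00063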
We can ask questions like:
– “What is the probability that there was an earthquake?”
– “Given that both John and Mary called, what is the probability that there was a burglary?”
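A brute-force way to answer the burglary query, continuing the sketch above (it reuses the joint() function defined there): condition on the calls and marginalize out Alarm and Earthquake.

from itertools import product

def prob_burglary_given_calls():
    """P(Burglary | JohnCalls = true, MaryCalls = true) by enumeration."""
    p = {True: 0.0, False: 0.0}
    for b, e, a in product([True, False], repeat=3):
        p[b] += joint(b, e, a, True, True)   # joint() from the sketch above
    return p[True] / (p[True] + p[False])

print(prob_burglary_given_calls())  # ~ 0.284, the textbook answer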
Example from Daphne Koller
If X, Y, Z are random variables, we write X ⊥ Y ∣ Z to say that X is conditionally independent of Y given Z. In this network:
– Flu ⊥ Hayfever ∣ Season
– Congestion ⊥ Season ∣ Flu, Hayfever
Parents of a node shield it from the influence of its ancestors and other non-descendants, but information about descendants can still influence beliefs about a node.
Y ⊥ (all other variables) ∣ MarkovBlanket(Y)
If we know these variables, Hayfever is independent of MusclePain.
The Markov blanket of a node shields it from the influence of any other node:
– A node is conditionally independent of its non-descendants given its parents.
– A node is conditionally independent of all other nodes given its parents, children, and children’s parents, that is, given its Markov blanket.
(X_i ⊥ NonDescendants(X_i) ∣ Parents(X_i))
(X_i ⊥ X_j ∣ MB(X_i)) for all j ≠ i
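As a sanity check that the factorization really implies the first statement, here is a small enumeration over the Season/Flu/Hayfever fragment of the network above; the CPT numbers are made up for illustration.

from itertools import product

P_season = {"dry": 0.6, "wet": 0.4}
P_flu = {"dry": 0.1, "wet": 0.3}   # P(Flu = true | Season), made up
P_hay = {"dry": 0.4, "wet": 0.1}   # P(Hayfever = true | Season), made up

def joint3(season, flu, hay):
    pf = P_flu[season] if flu else 1 - P_flu[season]
    ph = P_hay[season] if hay else 1 - P_hay[season]
    return P_season[season] * pf * ph

# Flu ⊥ Hayfever | Season means P(Flu | Season, Hayfever) = P(Flu | Season):
for s, h in product(P_season, [True, False]):
    p_cond = joint3(s, True, h) / (joint3(s, True, h) + joint3(s, False, h))
    assert abs(p_cond - P_flu[s]) < 1e-12
print("Flu ⊥ Hayfever | Season holds under the factorization")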
Where do the independence assumptions come from? Domain knowledge.
Bad news: Inference in a Bayesian network is #P-hard (i.e., as hard as counting the number of satisfying solutions of a CNF formula).
More bad news: Even approximate inference in a Bayesian network is NP-hard!
Good news: Efficient algorithms exist for networks with special structures.
Two problems with a directed model here: the edges have no natural direction, and unintended dependencies show up. In the grid example, X8 is independent of everything else given its Markov blanket (the other circled nodes in the figure).
Example from Kevin Murphy
Markov Random Fields:
– Nodes are random variables
– Edges (hyper-edges) define dependencies
Cliques: {AB}, {BC}, {CD}, {AD}
The joint probability decouples over the cliques, with a potential function f(x_c) for each clique c:

P(x) = (1/Z) ∏_{c ∈ Cliques} f(x_c)

For the square graph above:

P(A, B, C, D) = (1/Z) f1(A, B) f2(B, C) f3(C, D) f4(A, D)

This is a Gibbs distribution if all factors are positive.
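A minimal sketch of this Gibbs distribution for the square graph, computing Z by brute force; the potential values are made up (agreeing neighbors score 2, disagreeing score 1).

from itertools import product

def f(x, y):
    # Shared pairwise potential for the square MRF A-B-C-D-A, made up.
    return 2.0 if x == y else 1.0

states = list(product([0, 1], repeat=4))            # s = (A, B, C, D)
score = {s: f(s[0], s[1]) * f(s[1], s[2]) * f(s[2], s[3]) * f(s[3], s[0])
         for s in states}
Z = sum(score.values())                             # the partition function
P = {s: v / Z for s, v in score.items()}            # Gibbs distribution

print(Z)                # 82.0 for these potentials
print(P[(0, 0, 0, 0)])  # all-agree states are the most likely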
Again: where do the independence assumptions come from? Domain knowledge.
Normalize: P(x) = (1/Z) ∏_{c ∈ Cliques} f(x_c), where Z = ∑_x ∏_{c ∈ Cliques} f(x_c). Z is called the partition function: a sum over all assignments to the random variables.

The factor f(x_c, µ) is often written as exp(µ^T x_c): a log-linear model.

Which cliques? A factor graph makes the factorization explicit, with factors instead of cliques.

[Figure: a factor graph over variables 1–5, with factor nodes connecting the variables each factor depends on]

Z: Zustandssumme, “sum over states”, more commonly called the partition function.
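A sketch of a single log-linear factor; the feature map and the weights µ below are made up, and the factor is written over explicit pair features rather than the raw x_c of the slide.

import math

mu = [1.2, -0.3]   # made-up weights

def phi(x, y):
    # Two features of a binary pair: "values agree" and "both are 1".
    return [1.0 if x == y else 0.0, 1.0 if (x and y) else 0.0]

def factor(x, y):
    # f(x_c) = exp(mu . phi(x_c)), the log-linear form.
    return math.exp(sum(m * p for m, p in zip(mu, phi(x, y))))

print(factor(1, 1))  # exp(1.2 - 0.3) = exp(0.9)
print(factor(0, 1))  # exp(0) = 1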
Each factor can be written as f(x_c) = exp(−E(x_c)), where E(x_c) is the energy of clique c existing in state x_c; lower-energy states are more probable.
Both kinds of graphical model give us:
– A set of conditional independence relations
– i.e., a skeleton that shows how a joint probability distribution is factorized

Converting between them:
– A BN can be converted into an MRF with a normalization constant of one
– An MRF can also be converted into a BN, but this may lead to a very large network
See the chapter on undirected graphical models in Koller and Friedman’s book
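As an illustration of the BN-to-MRF direction, here is a sketch of moralization, the standard conversion; the dict-of-parents graph representation is my own choice here.

from itertools import combinations

def moralize(parents):
    """parents: {node: list of its parents} for a DAG.
    Returns the undirected edge set of the moral graph: keep every
    parent-child edge and "marry" all co-parents of each node."""
    edges = set()
    for child, pars in parents.items():
        for p in pars:
            edges.add(frozenset((p, child)))
        for p, q in combinations(pars, 2):
            edges.add(frozenset((p, q)))
    return edges

# The alarm network from earlier: Alarm has co-parents B and E, so
# moralization adds a B-E edge.
bn = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
print(sorted(tuple(sorted(e)) for e in moralize(bn)))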
Inference (more on this in future lectures):

[Figure: a chain graphical model over variables 1–5]

– Message passing: each node sends its neighbors messages about what it thinks the neighbor’s state should be.
– What makes an ordering good? Inference is NP-hard in general, but works for simple graphs.
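A sketch of sum-product message passing on a chain like the one in the figure; the potentials are made-up numbers, and whether this matches the exact algorithm the slides present is my assumption. On a chain (a tree) this computes exact marginals.

import numpy as np

# Chain of 5 binary variables: 1 - 2 - 3 - 4 - 5.
n = 5
pair = np.array([[2.0, 1.0], [1.0, 2.0]])  # neighbors prefer to agree (made up)
unary = [np.ones(2) for _ in range(n)]
unary[0] = np.array([3.0, 1.0])            # nudge variable 1 toward state 0

# Forward pass: message arriving at node i from the left.
fwd = [np.ones(2) for _ in range(n)]
for i in range(1, n):
    fwd[i] = pair.T @ (fwd[i - 1] * unary[i - 1])

# Backward pass: message arriving at node i from the right.
bwd = [np.ones(2) for _ in range(n)]
for i in range(n - 2, -1, -1):
    bwd[i] = pair @ (bwd[i + 1] * unary[i + 1])

for i in range(n):
    b = unary[i] * fwd[i] * bwd[i]
    print(i + 1, b / b.sum())   # exact marginal of variable i+1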