Probabilistic Graphical Models
CMSC 678 UMBC
Probabilistic Graphical Models
A graph G that represents a probability distribution over random variables $X_1, \ldots, X_N$
Graph G = (vertices V, edges E); Distribution $p(X_1, \ldots, X_N)$
Vertices ↔ random variables
Edges show dependencies among random variables
Two main flavors: directed graphical models and undirected graphical models
Outline
Directed Graphical Models Naïve Bayes Undirected Graphical Models Factor Graphs Ising Model Message Passing: Graphical Model Inference
Directed Graphical Models
A directed (acyclic) graph G=(V,E) that represents a probability distribution over random variables $X_1, \ldots, X_N$
Joint probability factorizes into factors of $x_i$ conditioned on the parents of $x_i$
Benefit: the independence properties are transparent (they can be read directly off the graph)
A graph/joint distribution that follows this is a Bayesian network
Bayesian Networks: Directed Acyclic Graphs
(example DAG over $x_1, x_2, x_3, x_4, x_5$)
$p(x_1, x_2, x_3, \ldots, x_N) = \prod_i p(x_i \mid \pi(x_i))$
$\pi$ = "parents of"; factor according to a topological sort
$p(x_1, x_2, x_3, x_4, x_5) = {?}{?}{?}$
$p(x_1, x_2, x_3, x_4, x_5) = p(x_1)\, p(x_3)\, p(x_2 \mid x_1, x_3)\, p(x_4 \mid x_2, x_3)\, p(x_5 \mid x_2, x_4)$
exact inference in general DAGs is NP-hard; inference in trees can be exact
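To make the factorization concrete, here is a minimal Python sketch that evaluates the example joint $p(x_1, \ldots, x_5)$ from hypothetical conditional probability tables for binary variables; the table values are invented for illustration and are not part of the slides.

```python
from itertools import product

# Hypothetical CPTs for binary variables x1..x5, following the example DAG:
# x1 and x3 have no parents; x2 depends on (x1, x3); x4 on (x2, x3); x5 on (x2, x4).
p_x1 = {1: 0.6, 0: 0.4}
p_x3 = {1: 0.7, 0: 0.3}
p_x2_given = {(a, c): 0.9 if a == c else 0.2 for a in (0, 1) for c in (0, 1)}   # P(x2=1 | x1, x3)
p_x4_given = {(b, c): 0.8 if b else 0.1 for b in (0, 1) for c in (0, 1)}        # P(x4=1 | x2, x3)
p_x5_given = {(b, d): 0.5 + 0.4 * d for b in (0, 1) for d in (0, 1)}            # P(x5=1 | x2, x4)

def bernoulli(p_one, value):
    return p_one if value == 1 else 1.0 - p_one

def joint(x1, x2, x3, x4, x5):
    # p(x1) p(x3) p(x2|x1,x3) p(x4|x2,x3) p(x5|x2,x4)
    return (bernoulli(p_x1[1], x1)
            * bernoulli(p_x3[1], x3)
            * bernoulli(p_x2_given[(x1, x3)], x2)
            * bernoulli(p_x4_given[(x2, x3)], x4)
            * bernoulli(p_x5_given[(x2, x4)], x5))

# Sanity check: the joint sums to 1 over all 2^5 assignments.
print(sum(joint(*assign) for assign in product((0, 1), repeat=5)))  # -> 1.0
```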
Directed Graphical Model Notation
(example DAG over $x_1, \ldots, x_5$)
Shaded nodes are observed R.V.s
Unshaded nodes are unobserved (latent) R.V.s
D-Separation: Testing for Conditional Independence
Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z
d-separation
X & Y are d-separated if for all paths P, one of the following is true:
P has a chain with an observed middle node (X → Z → Y): observing Z blocks the path from X to Y
P has a fork with an observed parent node (X ← Z → Y): observing Z blocks the path from X to Y
P includes a "v-structure" or "collider" (X → Z ← Y) with all unobserved descendants: not observing Z blocks the path from X to Y
Collider example: $p(x, y, z) = p(x)\, p(y)\, p(z \mid x, y)$, so $p(x, y) = \sum_z p(x)\, p(y)\, p(z \mid x, y) = p(x)\, p(y)$
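As a quick numerical check of the collider case, the sketch below uses a made-up CPT for $p(z \mid x, y)$ over binary variables and verifies that X and Y are independent marginally but become dependent once Z is observed ("explaining away"); all numbers are illustrative.

```python
from itertools import product

p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.7, 1: 0.3}
# Hypothetical collider CPT: z is likely 1 when either parent is 1. Values are [p(z=0|x,y), p(z=1|x,y)].
p_z_given = {(x, y): [0.9, 0.1] if (x == 0 and y == 0) else [0.2, 0.8]
             for x, y in product((0, 1), repeat=2)}

def joint(x, y, z):
    return p_x[x] * p_y[y] * p_z_given[(x, y)][z]

# Marginal p(x, y): summing out z recovers p(x) * p(y), so X and Y are independent.
for x, y in product((0, 1), repeat=2):
    marg = sum(joint(x, y, z) for z in (0, 1))
    print(x, y, round(marg, 3), round(p_x[x] * p_y[y], 3))

# Conditioning on z = 1: p(x, y | z=1) no longer factorizes, so X and Y are dependent.
pz1 = sum(joint(x, y, 1) for x, y in product((0, 1), repeat=2))
p_xy_given_z1 = {(x, y): joint(x, y, 1) / pz1 for x, y in product((0, 1), repeat=2)}
p_x_given_z1 = {x: sum(p_xy_given_z1[(x, y)] for y in (0, 1)) for x in (0, 1)}
p_y_given_z1 = {y: sum(p_xy_given_z1[(x, y)] for x in (0, 1)) for y in (0, 1)}
print(p_xy_given_z1[(1, 1)], p_x_given_z1[1] * p_y_given_z1[1])  # not equal -> dependent
```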
Markov Blanket
The Markov blanket of a node x is its parents, children, and children's parents
$p(x_i \mid x_{-i}) = \frac{p(x_1, \ldots, x_N)}{\int p(x_1, \ldots, x_N)\, dx_i} = \frac{\prod_j p(x_j \mid \pi(x_j))}{\int \prod_j p(x_j \mid \pi(x_j))\, dx_i}$ (factorization)
Factor out terms not dependent on $x_i$:
$= \frac{\prod_{j:\, j=i \text{ or } i \in \pi(x_j)} p(x_j \mid \pi(x_j))}{\int \prod_{j:\, j=i \text{ or } i \in \pi(x_j)} p(x_j \mid \pi(x_j))\, dx_i}$
The Markov blanket is the set of nodes needed to form the complete conditional for a variable $x_i$
(in this example, shading highlights the Markov blanket, not observed variables)
Outline
Directed Graphical Models Naïve Bayes Undirected Graphical Models Factor Graphs Ising Model Message Passing: Graphical Model Inference
Naïve Bayes
$\arg\max_y p(y \mid x)$
Apply Bayes rule and take logs:
$\arg\max_y \log p(x \mid y) + \log p(y)$ (likelihood + prior)
Represent X as a D-dimensional vector (of features): $x = (x_1, x_2, x_3, \ldots, x_D)$
Naively generate each "feature":
$\arg\max_y \sum_{j=1}^{D} \log p(x_j \mid y) + \log p(y)$
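A minimal sketch of this decision rule in Python, assuming we already have class priors and per-feature likelihoods; the tables below are placeholders, not estimates from real data.

```python
import math

# Hypothetical model for two classes and D = 3 binary features x = (x_1, ..., x_D).
log_prior = {"pos": math.log(0.6), "neg": math.log(0.4)}
# p(x_j = 1 | y); p(x_j = 0 | y) = 1 - p(x_j = 1 | y)
p_feature_one = {
    "pos": [0.8, 0.3, 0.5],
    "neg": [0.2, 0.6, 0.5],
}

def predict(x):
    """Return argmax_y of sum_j log p(x_j | y) + log p(y), plus all class scores."""
    scores = {}
    for y, log_py in log_prior.items():
        score = log_py
        for j, xj in enumerate(x):
            p_one = p_feature_one[y][j]
            score += math.log(p_one if xj == 1 else 1.0 - p_one)
        scores[y] = score
    return max(scores, key=scores.get), scores

print(predict([1, 0, 1]))
```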
The Bag of Words Representation
(figure: a document reduced to word counts, e.g., seen 2, sweet 1, whimsical 1, recommend 1, happy 1, ..., which are fed to the classifier)
Adapted from Jurafsky & Martin (draft)
Naïve Bayes: A Generative Story
(plate diagram: label $y_i$ with feature children $x_{i1}, \ldots, x_{i5}$)
Generative Story:
$\lambda$ = distribution over L labels (global parameters)
for label k = 1 to L: $\theta_k$ = generate parameters
for item i = 1 to N:
  $y_i \sim \mathrm{Cat}(\lambda)$ (choose the label)
  for each feature j: $x_{ij} \sim F_j(\theta_{y_i})$ (local variables: generate each feature based on the label)
Each $x_{ij}$ is conditionally independent of the others (given the label), so the model scores an item with
$\sum_{j=1}^{D} \log p(x_{ij} \mid y_i = k) + \log p(y_i = k)$, i.e., $p(x_{ij} \mid y = k)\, p(y = k)$ terms only
Maximize Log-likelihood:
$\ell(\theta, \lambda) = \sum_i \sum_j \log F_j(x_{ij}; \theta_{y_i}) + \sum_i \log \lambda_{y_i}$
s.t. $\sum_k \lambda_k = 1$, $\lambda_k \geq 0$, and each $\theta_k$ is valid for $F$
Multinomial Naïve Bayes: A Generative Story
Generative Story:
$\lambda$ = distribution over L labels
for label k = 1 to L: $\theta_k$ = distribution over J feature values
for item i = 1 to N:
  $y_i \sim \mathrm{Cat}(\lambda)$
  for each feature j: $x_{ij} \sim \mathrm{Cat}(\theta_{y_i})$
Maximize Log-likelihood:
$\ell(\theta, \lambda) = \sum_i \sum_j \log \theta_{y_i, x_{ij}} + \sum_i \log \lambda_{y_i}$
s.t. $\sum_k \lambda_k = 1$, $\sum_j \theta_{kj} = 1$, $\forall k\; \theta_{kj} \geq 0$, $\lambda_k \geq 0$
Maximize Log-likelihood via Lagrange multipliers ($\geq 0$ constraints not shown):
$\ell(\theta, \lambda) = \sum_i \sum_j \log \theta_{y_i, x_{ij}} + \sum_i \log \lambda_{y_i} - \alpha \Big( \sum_k \lambda_k - 1 \Big) - \sum_k \beta_k \Big( \sum_j \theta_{kj} - 1 \Big)$
Multinomial Naïve Bayes: Learning
Calculate class priors. For each k:
  items_k = all items with class = k
  $p(k) = \frac{|\mathrm{items}_k|}{\#\ \mathrm{items}}$
Calculate feature generation terms. For each k, and for each feature j:
  $n_{kj}$ = # of occurrences of feature j in items_k
  $p(j \mid k) = \frac{n_{kj}}{\sum_{j'} n_{kj'}}$
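The counting recipe above translates directly into code; here is a small Python sketch over toy token lists, with optional add-one smoothing (the smoothing is an addition for robustness, not something the slide includes).

```python
from collections import Counter, defaultdict

# Toy labeled data: each item is (label, list of feature occurrences / tokens). Made up for illustration.
data = [
    ("pos", ["sweet", "happy", "recommend", "happy"]),
    ("pos", ["sweet", "whimsical"]),
    ("neg", ["boring", "boring", "slow"]),
]

label_counts = Counter(label for label, _ in data)
feature_counts = defaultdict(Counter)
for label, tokens in data:
    feature_counts[label].update(tokens)

vocab = {tok for _, tokens in data for tok in tokens}

def prior(k):
    # p(k) = |items_k| / (# items)
    return label_counts[k] / sum(label_counts.values())

def likelihood(j, k, smoothing=1.0):
    # p(j | k) = (n_kj + smoothing) / (sum_j' n_kj' + smoothing * |V|)
    total = sum(feature_counts[k].values())
    return (feature_counts[k][j] + smoothing) / (total + smoothing * len(vocab))

print(prior("pos"), likelihood("happy", "pos"))
```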
Brill and Banko (2001) With enough data, the classifier may not matter
Adapted from Jurafsky & Martin (draft)
Summary: Naïve Bayes is Not So Naïve, but not without issue
Pro:
Very fast, low storage requirements
Robust to irrelevant features
Very good in domains with many equally important features
Optimal if the independence assumptions hold
Dependable baseline for text classification (but often not the best)
Con:
Model the posterior in one go? (e.g., use conditional maxent)
Are the features really uncorrelated?
Are plain counts always appropriate?
Are there "better" (automated, more principled) ways of handling missing/noisy data?
Adapted from Jurafsky & Martin (draft)
Outline
Directed Graphical Models Naïve Bayes Undirected Graphical Models Factor Graphs Ising Model Message Passing: Graphical Model Inference
Undirected Graphical Models
An undirected graph G=(V,E) that represents a probability distribution over random variables $X_1, \ldots, X_N$
Joint probability factorizes based on cliques in the graph
Common name: Markov Random Fields
Undirected graphs can have an alternative formulation as Factor Graphs
Markov Random Fields: Undirected Graphs
clique: subset of nodes, where nodes are pairwise connected
maximal clique: a clique that cannot add a node and remain a clique
$p(x_1, x_2, x_3, \ldots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$
$C$ ranges over the maximal cliques; $x_C$ are the variables participating in clique $C$; $Z$ is a global normalization; $\psi_C$ is a potential function (not necessarily a probability!)
Q: What restrictions should we place on the potentials $\psi_C$?
A: $\psi_C \geq 0$ (or $\psi_C > 0$)
Terminology: Potential Functions
$\psi_C(x_C) = \exp\{-E(x_C)\}$
$E$ is the energy function (for clique C); this exponential form is a Boltzmann distribution
(get the total energy of a configuration by summing the individual energy functions)
$p(x_1, x_2, x_3, \ldots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$
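A tiny Python sketch of the energy/potential relationship: made-up per-clique energies are exponentiated into potentials, multiplied, and normalized by Z (brute force over a small state space). The clique structure and numbers are illustrative only.

```python
import math
from itertools import product

# Hypothetical pairwise energies E_C(x_C) for two cliques over binary variables:
# clique A = (x1, x2), clique B = (x2, x3). Lower energy = more preferred configuration.
energy_A = {(a, b): 0.0 if a == b else 1.5 for a, b in product((0, 1), repeat=2)}
energy_B = {(b, c): 0.0 if b == c else 0.5 for b, c in product((0, 1), repeat=2)}

def unnormalized(x1, x2, x3):
    # psi_C(x_C) = exp(-E_C(x_C)); total energy adds, so potentials multiply.
    return math.exp(-energy_A[(x1, x2)]) * math.exp(-energy_B[(x2, x3)])

Z = sum(unnormalized(*x) for x in product((0, 1), repeat=3))
p = {x: unnormalized(*x) / Z for x in product((0, 1), repeat=3)}
print(Z, p[(0, 0, 0)], p[(0, 1, 0)])  # agreeing configurations get higher probability
```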
Ambiguity in Undirected Model Notation
(fully connected graph over X, Y, Z)
$p(x, y, z) \propto \psi(x, y, z)$   or   $p(x, y, z) \propto \psi_1(x, y)\, \psi_2(y, z)\, \psi_3(x, z)$
Outline
Directed Graphical Models Naïve Bayes Undirected Graphical Models Factor Graphs Ising Model Message Passing: Graphical Model Inference
MRFs as Factor Graphs
Undirected graphs: G=(V,E) that represents $p(X_1, \ldots, X_N)$
Factor graph of p: bipartite graph of evidence nodes X, factor nodes F, and edges T
Evidence nodes X are the random variables
Factor nodes F take values associated with the potential functions
Edges show what variables are used in which factors
(figures: the undirected graph over X, Y, Z and a corresponding factor graph)
Different Factor Graph Notation for the Same Graph
(figures: several different factor graphs consistent with the same undirected graph over X, Y, Z)
Directed vs. Undirected Models: Moralization
(directed graph: x1, x2, x3 are all parents of x4)
$p(x_1, \ldots, x_4) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3)$
(moralized undirected graph: x1, x2, x3, x4 fully connected)
parents of nodes in a directed graph must be connected in an undirected graph
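A short Python sketch of moralization under this rule: connect ("marry") all parents of each node, then drop edge directions. The example DAG is the one from the slide; the dictionary encoding is just one convenient representation.

```python
from itertools import combinations

# Directed graph as {child: set of parents}; the slide's example DAG.
parents = {"x1": set(), "x2": set(), "x3": set(), "x4": {"x1", "x2", "x3"}}

def moralize(parents):
    undirected = set()
    for child, pars in parents.items():
        # Drop direction on every parent -> child edge.
        for p in pars:
            undirected.add(frozenset((p, child)))
        # "Marry" the parents: connect every pair of co-parents.
        for p, q in combinations(sorted(pars), 2):
            undirected.add(frozenset((p, q)))
    return undirected

print(sorted(tuple(sorted(e)) for e in moralize(parents)))
# [('x1', 'x2'), ('x1', 'x3'), ('x1', 'x4'), ('x2', 'x3'), ('x2', 'x4'), ('x3', 'x4')]
```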
Example: Linear Chain
Directed (e.g., hidden Markov model [HMM]; generative): chain of states z1 → z2 → z3 → z4, each state z_i emitting an observed word w_i
Directed (e.g., maximum entropy Markov model [MEMM]; conditional): chain z1 → z2 → z3 → z4, with each observed word w_i feeding into its state z_i
Undirected (e.g., conditional random field [CRF]): z1 — z2 — z3 — z4, with each z_i linked to w_i
Undirected as factor graph (e.g., conditional random field [CRF]): factor nodes between adjacent states z_i, z_{i+1} and between each z_i and the observations
Example: Linear Chain Conditional Random Field
Widely used in applications like part-of-speech tagging and named entity recognition
z1 z2 z3 z4
President Obama told Congress …
POS tags: Noun-Mod Noun Verb Noun
NER tags: Person Person Other Org.
Linear Chain CRFs for Part of Speech Tagging
A linear chain CRF is a conditional probabilistic model of the sequence of tags $z_1, z_2, \ldots, z_N$ conditioned on the entire input sequence $x_{1:N}$
Linear Chain CRFs for Part of Speech Tagging
(factor graph: tag variables $z_1, z_2, z_3, z_4$, with a solo-tag factor $f_i$ attached to each $z_i$ and an inter-tag factor $g_i$ between each $z_i$ and $z_{i+1}$)
$p(z_1, z_2, \ldots, z_N \mid x_{1:N}) \propto \prod_{i=1}^{N} \exp\big( \theta_f \cdot f_i(z_i) + \theta_g \cdot g_i(z_i, z_{i+1}) \big)$
$g_i$: inter-tag features (can depend on any/all input words $x_{1:N}$)
$f_i$: solo tag features (can depend on any/all input words $x_{1:N}$)
Feature design, just like in maxent models!
Example: $g_{j,\,N \to V}(z_j, z_{j+1}) = 1$ if ($z_j$ == N & $z_{j+1}$ == V) else 0
$g_{j,\,\text{told},\,N \to V}(z_j, z_{j+1}) = 1$ if ($z_j$ == N & $z_{j+1}$ == V & $x_j$ == told) else 0
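To make the scoring concrete, here is a small Python sketch that computes the (unnormalized) log-score of a tag sequence under a linear chain CRF with indicator features like the ones above; the feature functions and weights are invented for illustration, not taken from the slides.

```python
words = ["President", "Obama", "told", "Congress"]
tags = ["N", "N", "V", "N"]

# Hypothetical features: each maps (tags, words, position) -> 0/1, paired with a weight.
def f_capitalized_noun(z, x, i):          # solo-tag feature on z_i
    return 1.0 if z[i] == "N" and x[i][0].isupper() else 0.0

def g_noun_to_verb(z, x, i):              # inter-tag feature on (z_i, z_{i+1})
    return 1.0 if i + 1 < len(z) and z[i] == "N" and z[i + 1] == "V" else 0.0

def g_told_noun_to_verb(z, x, i):
    return 1.0 if i + 1 < len(z) and z[i] == "N" and z[i + 1] == "V" and x[i] == "told" else 0.0

weighted_features = [(1.2, f_capitalized_noun), (0.8, g_noun_to_verb), (0.5, g_told_noun_to_verb)]

def log_score(z, x):
    """log of prod_i exp(weights . features) = sum over positions and features of weight * value."""
    return sum(w * feat(z, x, i) for i in range(len(z)) for (w, feat) in weighted_features)

print(log_score(tags, words))
# Normalizing over all tag sequences (or running forward/Viterbi) would turn scores into probabilities.
```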
Outline
Directed Graphical Models Naïve Bayes Undirected Graphical Models Factor Graphs Ising Model Message Passing: Graphical Model Inference
Example: Ising Model
Image denoising (Bishop, 2006; Fig 8.30)
x: original pixel/state; y: observed (noisy) pixel/state
Q: What are the cliques?
(each pair $\{x_i, y_i\}$ and each pair of neighboring pixels $\{x_i, x_j\}$ — exactly the pairs the energy below sums over)
$E(x, y) = h \sum_i x_i \;-\; \beta \sum_{\{i,j\}} x_i x_j \;-\; \eta \sum_i x_i y_i$
allow for a bias; neighboring pixels should be similar; $x_i$ and $y_i$ should be correlated
Q: Why subtract β and η?
A: Better states ↔ lower energy (higher potential), since $\psi_C(x_C) = \exp\{-E(x_C)\}$
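A minimal sketch of using this energy for denoising with iterated conditional modes (ICM), the simple coordinate-wise scheme used alongside this model in Bishop 8.3.3; the "image" here is just a tiny random binary array and the coefficients are illustrative.

```python
import random

random.seed(0)
h, beta, eta = 0.0, 1.0, 2.1
H, W = 8, 8

# Noisy observation y with pixels in {-1, +1}; x is the latent clean image, initialized to y.
y = [[random.choice([-1, 1]) for _ in range(W)] for _ in range(H)]
x = [row[:] for row in y]

def local_energy(x, y, i, j, value):
    """Energy terms that involve pixel (i, j) when x_ij = value."""
    neighbors = [(i + di, j + dj) for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                 if 0 <= i + di < H and 0 <= j + dj < W]
    e = h * value
    e -= beta * sum(value * x[a][b] for a, b in neighbors)
    e -= eta * value * y[i][j]
    return e

# ICM: repeatedly set each x_ij to whichever value gives lower (local) energy.
for _ in range(5):
    for i in range(H):
        for j in range(W):
            x[i][j] = min((-1, 1), key=lambda v: local_energy(x, y, i, j, v))

print(sum(x[i][j] == y[i][j] for i in range(H) for j in range(W)), "of", H * W, "pixels unchanged")
```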
Markov Random Fields with Factor Graph Notation
x: original pixel/state; y: observed (noisy) pixel/state
factor nodes are added according to maximal cliques
unary factor: attached to a single variable
binary factor: attached to a pair of variables
factor graphs are bipartite
Outline
Directed Graphical Models Naïve Bayes Undirected Graphical Models Factor Graphs Ising Model Message Passing: Graphical Model Inference
Two Problems for Undirected Models
$p(x_1, x_2, x_3, \ldots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$
Finding the normalizer: $Z = \sum_x \prod_m f_m(x_m)$
Computing the marginals: $Z_n(w) = \sum_{x:\, x_n = w} \prod_m f_m(x_m)$
(sum over all variable combinations, with the $x_n$ coordinate fixed)
Example: 3 variables, fix the 2nd dimension: $Z_2(w) = \sum_{x_1} \sum_{x_3} \prod_m f_m(x = (x_1, w, x_3))$
Q: Why are these difficult?
A: Many different combinations (the number of terms grows exponentially with the number of variables)
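Here is a brute-force Python sketch of both quantities for a tiny chain of binary variables, to show exactly what the sums range over (and why they blow up: the loop is over all $2^N$ assignments); the pairwise factors are made up.

```python
from itertools import product

N = 3
# Made-up pairwise factors f_m over adjacent variables (x_m = (x_i, x_{i+1})), values in {0, 1}.
def f(a, b):
    return 2.0 if a == b else 0.5

def unnormalized(x):
    return f(x[0], x[1]) * f(x[1], x[2])

# Normalizer: sum over every joint assignment.
Z = sum(unnormalized(x) for x in product((0, 1), repeat=N))

# Unnormalized marginal for x_2 (index 1), fixing that coordinate to w.
def Z_n(n, w):
    return sum(unnormalized(x) for x in product((0, 1), repeat=N) if x[n] == w)

print(Z, [Z_n(1, w) / Z for w in (0, 1)])  # marginal p(x_2 = w)
```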
Message Passing: Count the Soldiers
If you are the front soldier in the line, say the number "one" to the soldier behind you.
If you are the rearmost soldier in the line, say the number "one" to the soldier in front of you.
If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side.
ITILA, Ch 16
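The counting rule is itself a tiny message-passing algorithm; this Python sketch simulates it for a line of soldiers, where every soldier combines the numbers arriving from the front and the rear (plus one for themselves) to get the total.

```python
def count_line(n_soldiers):
    ahead = [0] * n_soldiers   # ahead[i]: number said to soldier i from the front = soldiers ahead of i
    behind = [0] * n_soldiers  # behind[i]: number said to soldier i from the rear = soldiers behind i
    for i in range(1, n_soldiers):            # front soldier says "one" backwards; each adds one and passes it on
        ahead[i] = ahead[i - 1] + 1
    for i in range(n_soldiers - 2, -1, -1):   # rearmost soldier says "one" forwards; same rule
        behind[i] = behind[i + 1] + 1
    # Every soldier can now compute the total: ahead + behind + themselves.
    return [ahead[i] + behind[i] + 1 for i in range(n_soldiers)]

print(count_line(5))  # -> [5, 5, 5, 5, 5]
```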
Sum-Product Algorithm
Main idea: message passing. An exact inference algorithm for tree-like graphs. Belief propagation (forward-backward for HMMs) is a special case.
Sum-Product
$p(x_n = w) = \sum_{x:\, x_n = w} p(x_1, x_2, \ldots, x_n, \ldots, x_N)$ (definition of marginal)
Main idea: use the bipartite nature of the factor graph to efficiently compute the marginals; the factor nodes can act as filters
(messages $\mu_{m \to n}$ flow along the edges between factor nodes m and variable nodes n)
$p(x_n = w) = \prod_{m \in M(n)} \mu_{m \to x_n}(x_n)$ (alternative marginal computation)
Sum-Product
From variables to factors:
$\mu_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} \mu_{m' \to n}(x_n)$
$M(n)$ = set of factors in which variable n participates; default value of 1 if the product is empty
From factors to variables:
$\mu_{m \to n}(x_n) = \sum_{x_m \setminus n} f_m(x_m) \prod_{n' \in N(m) \setminus n} \mu_{n' \to m}(x_{n'})$
$N(m)$ = set of variables that the m-th factor depends on; sum over configurations of variables for the m-th factor, with variable n fixed
Example
(factor graph: variables $x_1, x_2, x_3, x_4$; factor $f_A$ connects $x_1, x_2$; factor $f_B$ connects $x_2, x_3$; factor $f_C$ connects $x_2, x_4$)
Q: What are the variables? A: $x_1, x_2, x_3, x_4$
Q: What are the factors? A: $f_A(x_1, x_2)$, $f_B(x_2, x_3)$, $f_C(x_2, x_4)$
Q: What is the distribution we're modeling?
A: $p(x_1, x_2, x_3, x_4) = f_A(x_1, x_2)\, f_B(x_2, x_3)\, f_C(x_2, x_4)$
Example: Step 1 (pass messages from the leaves toward the root; here the root is $x_3$)
$\mu_{x_1 \to f_A}(x_1) = 1$
$\mu_{x_4 \to f_C}(x_4) = 1$
$\mu_{f_A \to x_2}(x_2) = \sum_i f_A(x_1 = i, x_2)$
$\mu_{f_C \to x_2}(x_2) = \sum_i f_C(x_2, x_4 = i)$
$\mu_{x_2 \to f_B}(x_2) = \mu_{f_A \to x_2}(x_2)\; \mu_{f_C \to x_2}(x_2)$
$\mu_{f_B \to x_3}(x_3) = \sum_i \mu_{x_2 \to f_B}(i)\; f_B(x_2 = i, x_3)$
Example: Step 2 (pass messages from the root back toward the leaves)
$\mu_{x_3 \to f_B}(x_3) = 1$
$\mu_{f_B \to x_2}(x_2) = \sum_i f_B(x_2, x_3 = i)$
$\mu_{x_2 \to f_A}(x_2) = \mu_{f_B \to x_2}(x_2)\; \mu_{f_C \to x_2}(x_2)$
(we just computed $\mu_{f_B \to x_2}$; Q: Where did we compute $\mu_{f_C \to x_2}$? A: In step 1 (leaves → root))
$\mu_{x_2 \to f_C}(x_2) = \mu_{f_A \to x_2}(x_2)\; \mu_{f_B \to x_2}(x_2)$
$\mu_{f_C \to x_4}(x_4) = \sum_i \mu_{x_2 \to f_C}(i)\; f_C(x_2 = i, x_4)$
$\mu_{f_A \to x_1}(x_1) = \sum_i \mu_{x_2 \to f_A}(i)\; f_A(x_1, x_2 = i)$
Example: probabilities
$p(x_n) = \prod_{m \in M(n)} \mu_{m \to n}(x_n)$
$p(x_1) = \mu_{f_A \to x_1}(x_1)$
$p(x_2) = \mu_{f_A \to x_2}(x_2)\; \mu_{f_B \to x_2}(x_2)\; \mu_{f_C \to x_2}(x_2)$
$p(x_3) = \mu_{f_B \to x_3}(x_3)$
$p(x_4) = \mu_{f_C \to x_4}(x_4)$
Example (recap)
1. Pass messages from the leaves to the root: $\mu_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} \mu_{m' \to n}(x_n)$
2. Pass messages from the root back to the leaves: $\mu_{m \to n}(x_n) = \sum_{x_m \setminus n} f_m(x_m) \prod_{n' \in N(m) \setminus n} \mu_{n' \to m}(x_{n'})$
Read the probabilities off the converged result, or… [Loopy BP]
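Putting the whole example together, here is a self-contained Python sketch that runs the two message-passing sweeps on the $f_A, f_B, f_C$ factor graph with made-up factor tables over binary variables, and checks the resulting marginal against brute-force enumeration.

```python
from itertools import product

VALS = (0, 1)
# Made-up factor tables for the example graph: f_A(x1, x2), f_B(x2, x3), f_C(x2, x4).
f_A = {(a, b): 1.0 + a + 2 * b for a, b in product(VALS, repeat=2)}
f_B = {(b, c): 2.0 if b == c else 0.5 for b, c in product(VALS, repeat=2)}
f_C = {(b, d): 1.0 + b * d for b, d in product(VALS, repeat=2)}

def joint(x1, x2, x3, x4):
    return f_A[(x1, x2)] * f_B[(x2, x3)] * f_C[(x2, x4)]

# --- Step 1: leaves -> root (root = x3) ---
mu_x1_fA = {v: 1.0 for v in VALS}
mu_x4_fC = {v: 1.0 for v in VALS}
mu_fA_x2 = {x2: sum(f_A[(i, x2)] * mu_x1_fA[i] for i in VALS) for x2 in VALS}
mu_fC_x2 = {x2: sum(f_C[(x2, i)] * mu_x4_fC[i] for i in VALS) for x2 in VALS}
mu_x2_fB = {x2: mu_fA_x2[x2] * mu_fC_x2[x2] for x2 in VALS}
mu_fB_x3 = {x3: sum(f_B[(i, x3)] * mu_x2_fB[i] for i in VALS) for x3 in VALS}

# --- Step 2: root -> leaves ---
mu_x3_fB = {v: 1.0 for v in VALS}
mu_fB_x2 = {x2: sum(f_B[(x2, i)] * mu_x3_fB[i] for i in VALS) for x2 in VALS}
mu_x2_fA = {x2: mu_fB_x2[x2] * mu_fC_x2[x2] for x2 in VALS}
mu_x2_fC = {x2: mu_fA_x2[x2] * mu_fB_x2[x2] for x2 in VALS}
mu_fA_x1 = {x1: sum(f_A[(x1, i)] * mu_x2_fA[i] for i in VALS) for x1 in VALS}
mu_fC_x4 = {x4: sum(f_C[(i, x4)] * mu_x2_fC[i] for i in VALS) for x4 in VALS}

# Marginals are products of incoming factor->variable messages (normalized at the end).
def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

marg_x2 = normalize({v: mu_fA_x2[v] * mu_fB_x2[v] * mu_fC_x2[v] for v in VALS})

# Brute-force check against direct enumeration of the joint.
Z = sum(joint(*x) for x in product(VALS, repeat=4))
brute_x2 = {w: sum(joint(*x) for x in product(VALS, repeat=4) if x[1] == w) / Z for w in VALS}
print(marg_x2, brute_x2)  # the two should agree
```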
Max-Product (Max-Sum)
Problem: how to find the most likely (best) setting of the latent variables
Replace the sum (+) with a max in the factor→variable computations:
$\mu_{m \to n}(x_n) = \max_{x_m \setminus n} f_m(x_m) \prod_{n' \in N(m) \setminus n} \mu_{n' \to m}(x_{n'})$
(why "max-sum"? computationally, implement with logs: products of messages become sums of log-messages)
Loopy Belief Propagation
The sum-product algorithm is not exact for general graphs. Loopy Belief Propagation (Loopy BP): run the sum-product algorithm anyway and hope for the best. Requires a message passing schedule.
Outline
Directed Graphical Models Naïve Bayes Undirected Graphical Models Factor Graphs Ising Model Message Passing: Graphical Model Inference