Probabilistic Graphical Models
CMSC 691, UMBC
Two Problems for Graphical Models

Given a distribution that factorizes over cliques,

p(x_1, x_2, x_3, \dots, x_n) = \frac{1}{Z} \prod_C \psi_C(x_C)

1. Finding the normalizer:

Z = \sum_x \prod_C \psi_C(x_C)

2. Computing the marginals:

\mu_n(w) = \sum_{x \,:\, x_n = w} \prod_C \psi_C(x_C)

i.e., sum over all variable combinations, with the nth coordinate fixed. Example with 3 variables, fixing the 2nd dimension:

\mu_2(w) = \sum_{x_1} \sum_{x_3} \prod_C \psi_C(x = (x_1, w, x_3))

Q: Why are these difficult?
A: The sums range over many different combinations of variable values (exponentially many in n).
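The two computations above can be made concrete by brute force on a toy model. A minimal sketch, assuming a hypothetical three-variable binary chain x_1 – x_2 – x_3 with made-up pairwise potentials:

```python
from itertools import product

# Pairwise clique potentials, as dicts mapping (value_i, value_j) -> weight.
# These numbers are arbitrary; any nonnegative values define a valid MRF.
psi_12 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
psi_23 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 1.0}

def unnormalized(x1, x2, x3):
    """Product of clique potentials for one configuration."""
    return psi_12[(x1, x2)] * psi_23[(x2, x3)]

# Z = sum over all configurations of the product of potentials
Z = sum(unnormalized(*x) for x in product([0, 1], repeat=3))

def mu_2(w):
    """mu_2(w): sum over configurations with the 2nd coordinate fixed to w."""
    return sum(unnormalized(x1, w, x3) for x1, x3 in product([0, 1], repeat=2))

# p(x2 = w) = mu_2(w) / Z; the two marginals must sum to 1
print(Z, mu_2(0) / Z, mu_2(1) / Z)
```

Brute-force enumeration is exponential in the number of variables, which is exactly why these two problems are hard in general.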
Probabilistic Graphical Models

A graph G that represents a probability distribution over random variables x_1, \dots, x_n.

Graph G = (vertices V, edges E); distribution p(x_1, \dots, x_n).
Vertices ↔ random variables; edges show dependencies among the random variables.

Two main flavors: directed graphical models and undirected graphical models.
Outline
Directed Graphical Models Undirected Graphical Models Factor Graphs
Directed Graphical Models

A directed (acyclic) graph G = (V, E) that represents a probability distribution over random variables x_1, \dots, x_n. The joint probability factorizes into factors of x_i conditioned on the parents of x_i.

Benefit: the independence properties can be read directly off the graph.

A graph/joint distribution that follows this is a Bayesian network.
Bayesian Networks: Directed Acyclic Graphs

[Example DAG over x_1, \dots, x_5: x_1 → x_2, x_3 → x_2, x_2 → x_4, x_3 → x_4, x_2 → x_5, x_4 → x_5]

p(x_1, x_2, x_3, \dots, x_n) = \prod_i p(x_i \mid \pi(x_i))

where \pi(x_i) denotes the parents of x_i, and the product can be taken in a topological sort of the graph.

For the example graph:

p(x_1, x_2, x_3, x_4, x_5) = p(x_1) \, p(x_3) \, p(x_2 \mid x_1, x_3) \, p(x_4 \mid x_2, x_3) \, p(x_5 \mid x_2, x_4)

Exact inference in general DAGs is NP-hard; inference in trees can be exact.
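The factorization can be checked numerically. A minimal sketch, assuming binary variables and made-up conditional probability tables (CPTs) for the example graph above:

```python
from itertools import product

# Marginal CPTs for the root nodes (all numbers are illustrative)
p_x1 = {0: 0.6, 1: 0.4}
p_x3 = {0: 0.7, 1: 0.3}
# Conditional CPTs, keyed by parent values, giving p(child = 1 | parents)
p_x2 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # p(x2=1 | x1, x3)
p_x4 = {(0, 0): 0.2, (0, 1): 0.6, (1, 0): 0.3, (1, 1): 0.8}  # p(x4=1 | x2, x3)
p_x5 = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.9, (1, 1): 0.7}  # p(x5=1 | x2, x4)

def bern(p1, v):
    """Probability that a Bernoulli(p1) variable takes value v."""
    return p1 if v == 1 else 1.0 - p1

def joint(x1, x2, x3, x4, x5):
    """p(x1..x5) = p(x1) p(x3) p(x2|x1,x3) p(x4|x2,x3) p(x5|x2,x4)."""
    return (p_x1[x1] * p_x3[x3]
            * bern(p_x2[(x1, x3)], x2)
            * bern(p_x4[(x2, x3)], x4)
            * bern(p_x5[(x2, x4)], x5))

# The product of locally normalized CPTs is automatically a valid joint
total = sum(joint(*x) for x in product([0, 1], repeat=5))
print(total)  # sums to 1
```

Note that no global normalizer is needed: each factor is a locally normalized conditional, so the product already sums to one.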
Directed Graphical Model Notation

[Same example DAG over x_1, \dots, x_5]

Shaded nodes are observed random variables; unshaded nodes are unobserved (latent) random variables.
D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z.

X & Y are d-separated if, for every path P, one of the following is true:
1. P has a chain X → Z → Y with an observed middle node (observing Z blocks the path from X to Y)
2. P has a fork X ← Z → Y with an observed parent node (observing Z blocks the path from X to Y)
3. P includes a "v-structure" or "collider" X → Z ← Y where Z and all of its descendants are unobserved (not observing Z blocks the path from X to Y)

The collider case follows from the factorization: p(x, y, z) = p(x) \, p(y) \, p(z \mid x, y), so

p(x, y) = \sum_z p(x) \, p(y) \, p(z \mid x, y) = p(x) \, p(y)
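The collider case can be verified by enumeration. A sketch with made-up distributions (all numbers are illustrative): marginally x and y are independent, but conditioning on z couples them ("explaining away").

```python
# Collider x -> z <- y over binary variables
p_x = {0: 0.3, 1: 0.7}
p_y = {0: 0.8, 1: 0.2}
p_z = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.95}  # p(z=1 | x, y)

def joint(x, y, z):
    pz1 = p_z[(x, y)]
    return p_x[x] * p_y[y] * (pz1 if z == 1 else 1 - pz1)

# Marginalize z out: p(x, y) = sum_z p(x) p(y) p(z | x, y) = p(x) p(y)
for x in (0, 1):
    for y in (0, 1):
        p_xy = sum(joint(x, y, z) for z in (0, 1))
        assert abs(p_xy - p_x[x] * p_y[y]) < 1e-12  # marginal independence

# Conditioned on z = 1, x and y are no longer independent:
pz1_total = sum(joint(x, y, 1) for x in (0, 1) for y in (0, 1))
p_x1_given_z1 = sum(joint(1, y, 1) for y in (0, 1)) / pz1_total
p_x1_given_z1_y1 = joint(1, 1, 1) / sum(joint(x, 1, 1) for x in (0, 1))
print(p_x1_given_z1, p_x1_given_z1_y1)  # these differ: observing z unblocks the path
```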
Markov Blanket

The Markov blanket of a node x_i is its parents, its children, and its children's other parents (co-parents): the set of nodes needed to form the complete conditional for x_i.

p(x_i \mid x_{-i}) = \frac{p(x_1, \dots, x_n)}{\int p(x_1, \dots, x_n) \, dx_i} = \frac{\prod_j p(x_j \mid \pi(x_j))}{\int \prod_j p(x_j \mid \pi(x_j)) \, dx_i}    (factorization of the graph)

= \frac{\prod_{j : j = i \text{ or } i \in \pi(x_j)} p(x_j \mid \pi(x_j))}{\int \prod_{j : j = i \text{ or } i \in \pi(x_j)} p(x_j \mid \pi(x_j)) \, dx_i}    (factor out terms not dependent on x_i)

(In this example, shading does not indicate observed/latent.)
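The cancellation above can be checked on a toy chain. A sketch, assuming a hypothetical x_1 → x_2 → x_3 network with made-up CPTs, where the Markov blanket of x_2 is {x_1, x_3}:

```python
# CPTs, keyed by (parent value, own value); all numbers are illustrative
p_x1 = {0: 0.5, 1: 0.5}
p_x2_given_x1 = {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.7}
p_x3_given_x2 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.6}

def joint(x1, x2, x3):
    return p_x1[x1] * p_x2_given_x1[(x1, x2)] * p_x3_given_x2[(x2, x3)]

def conditional_via_joint(x2, x1, x3):
    """p(x2 | x1, x3) computed from the full joint."""
    return joint(x1, x2, x3) / sum(joint(x1, v, x3) for v in (0, 1))

def conditional_via_blanket(x2, x1, x3):
    """Same quantity using only the factors that mention x2; p(x1) cancels."""
    num = p_x2_given_x1[(x1, x2)] * p_x3_given_x2[(x2, x3)]
    den = sum(p_x2_given_x1[(x1, v)] * p_x3_given_x2[(v, x3)] for v in (0, 1))
    return num / den

for x1 in (0, 1):
    for x3 in (0, 1):
        for x2 in (0, 1):
            assert abs(conditional_via_joint(x2, x1, x3)
                       - conditional_via_blanket(x2, x1, x3)) < 1e-12
print("complete conditional depends only on Markov blanket factors")
```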
Outline
Directed Graphical Models Undirected Graphical Models Factor Graphs
Undirected Graphical Models

An undirected graph G = (V, E) that represents a probability distribution over random variables x_1, \dots, x_n. The joint probability factorizes based on cliques in the graph.

Common name: Markov Random Fields.

Undirected graphs have an alternative formulation as factor graphs.
Markov Random Fields: Undirected Graphs

clique: a subset of nodes that are pairwise connected
maximal clique: a clique to which no node can be added while it remains a clique

p(x_1, x_2, x_3, \dots, x_n) = \frac{1}{Z} \prod_C \psi_C(x_C)

- the product ranges over the maximal cliques C
- x_C: the variables that are part of clique C
- \psi_C: potential function (not necessarily a probability!)
- Z: global normalization

Q: What restrictions should we place on the potentials \psi_C?
A: \psi_C \ge 0 (or \psi_C > 0).
Terminology: Potential Functions

\psi_C(x_C) = \exp(-E(x_C))

where E is the energy function for clique C. Substituting into the joint gives a Boltzmann distribution:

p(x_1, x_2, x_3, \dots, x_n) = \frac{1}{Z} \prod_C \psi_C(x_C) = \frac{1}{Z} \exp\left(-\sum_C E(x_C)\right)

(Get the total energy of a configuration by summing the individual clique energy functions.)
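The equivalence between multiplying potentials and summing energies can be sketched directly. The clique energies below are made up; any real values work, since exp(-E) is always positive.

```python
import math

# Energies for two pairwise cliques over binary variables (illustrative values)
E_12 = {(0, 0): 0.0, (0, 1): 1.5, (1, 0): 1.5, (1, 1): 0.5}
E_23 = {(0, 0): 0.2, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.2}

def product_of_potentials(x1, x2, x3):
    """prod_C psi_C(x_C) with psi_C = exp(-E_C)."""
    return math.exp(-E_12[(x1, x2)]) * math.exp(-E_23[(x2, x3)])

def exp_of_total_energy(x1, x2, x3):
    """exp(-(sum of clique energies)) -- the Boltzmann form."""
    return math.exp(-(E_12[(x1, x2)] + E_23[(x2, x3)]))

x = (1, 0, 1)
print(product_of_potentials(*x), exp_of_total_energy(*x))  # identical
```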
Ambiguity in Undirected Model Notation

[Fully connected triangle over X, Y, Z]

The same graph is consistent with either factorization:
p(x, y, z) \propto \psi(x, y, z)
p(x, y, z) \propto \psi_1(x, y) \, \psi_2(y, z) \, \psi_3(x, z)
Outline
Directed Graphical Models Undirected Graphical Models Factor Graphs
MRFs as Factor Graphs

Undirected graph: G = (V, E) that represents p(x_1, \dots, x_n).
Factor graph of p: a bipartite graph of evidence nodes X, factor nodes F, and edges T:
- Evidence nodes X are the random variables
- Factor nodes F take values associated with the potential functions
- Edges show which variables are used in which factors

[Example: the triangle over X, Y, Z redrawn as a factor graph]
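One way to represent a factor graph in code is to store each factor node with its variable scope; the bipartite edge set is implied by those scopes. A sketch with arbitrary example potentials over X, Y, Z:

```python
from itertools import product

# Each factor node: (scope, potential table). Tables here are made-up examples.
factors = [
    (("X", "Y"), {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 1.0}),
    (("Y", "Z"), {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}),
    (("X", "Z"), {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}),
]
variables = ("X", "Y", "Z")

def score(assignment):
    """Unnormalized score: product over factor nodes of their potential."""
    s = 1.0
    for scope, table in factors:
        key = tuple(assignment[v] for v in scope)  # restrict to the factor's scope
        s *= table[key]
    return s

# Normalizer by brute-force enumeration over all binary assignments
Z = sum(score(dict(zip(variables, vals)))
        for vals in product([0, 1], repeat=len(variables)))
print(Z)
```

This representation makes the factorization explicit in a way the bare undirected graph cannot, which is exactly the ambiguity the previous slide illustrates.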
Different Factor Graph Notation for the Same Graph

[Three different factor graphs, all over the same variables X, Y, Z]
Directed vs. Undirected Models: Moralization

[Directed graph: x_1, x_2, x_3 each point to x_4; the undirected (moralized) graph additionally connects x_1, x_2, x_3 pairwise]

p(x_1, \dots, x_4) = p(x_1) \, p(x_2) \, p(x_3) \, p(x_4 \mid x_1, x_2, x_3)

Parents of a node in a directed graph must be connected ("married") in the corresponding undirected graph.
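Moralization itself is mechanical. A sketch, assuming the DAG is given as a dict of parent lists (the structure below is the slide's four-node example):

```python
from itertools import combinations

parents = {"x1": [], "x2": [], "x3": [], "x4": ["x1", "x2", "x3"]}

def moralize(parents):
    """Return the undirected edge set of the moral graph."""
    edges = set()
    for child, ps in parents.items():
        # keep an undirected version of every directed edge
        for p in ps:
            edges.add(frozenset((p, child)))
        # "marry" the parents: connect every pair of co-parents
        for a, b in combinations(ps, 2):
            edges.add(frozenset((a, b)))
    return edges

moral = moralize(parents)
print(sorted(tuple(sorted(e)) for e in moral))  # 6 edges: the complete graph on 4 nodes
```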
Example: Linear Chain

Directed, generative (e.g., hidden Markov model [HMM]):
[z_1 → z_2 → z_3 → z_4, with each z_i → w_i]

Directed, conditional (e.g., maximum entropy Markov model [MEMM]):
[z_1 → z_2 → z_3 → z_4, with each w_i → z_i]

Undirected, drawn as a factor graph (e.g., conditional random field [CRF]):
[z_1 – z_2 – z_3 – z_4 connected through factor nodes, with factors also linking each z_i to the observed words]
Example: Linear Chain Conditional Random Field

Widely used in applications like part-of-speech tagging and named entity recognition.

Part-of-speech tagging (tags z_1, \dots, z_4 over the words):
President/Noun-Mod  Obama/Noun  told/Verb  Congress/Noun  …

Named entity recognition:
President/Person  Obama/Person  told/Other  Congress/Org.  …
Linear Chain CRFs for Part of Speech Tagging

A linear chain CRF is a conditional probabilistic model of the sequence of tags z_1, z_2, \dots, z_N conditioned on the entire input sequence x_{1:N}:

p(z_1, z_2, \dots, z_N \mid x_{1:N})
Linear Chain CRFs for Part of Speech Tagging

[Factor graph: tags z_1, \dots, z_4 over words x_1, \dots, x_4, with solo-tag factors on each z_i and inter-tag factors between adjacent tags]

p(z_1, z_2, \dots, z_N \mid x_{1:N}) \propto \prod_{i=1}^{N} \exp\left( \langle \theta_s, f_s(z_i) \rangle + \langle \theta_t, f_t(z_i, z_{i+1}) \rangle \right)

f_t: inter-tag features (can depend on any/all input words x_{1:N})
f_s: solo tag features (can depend on any/all input words x_{1:N})

Feature design, just like in maxent models!

Example features:
f_{t, N \to V}(z_j, z_{j+1}) = 1 if z_j == N and z_{j+1} == V, else 0
f_{t, \text{told}, N \to V}(z_j, z_{j+1}) = 1 if z_j == N and z_{j+1} == V and x_j == told, else 0
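The indicator features above can be sketched in code. A toy example with a hypothetical two-tag set {N, V} and made-up weights; a real CRF would also need the normalizer (computed with the forward algorithm), while this only evaluates the unnormalized score of one tag sequence.

```python
import math

words = ["President", "Obama", "told", "Congress"]
tags = ["N", "N", "V", "N"]  # toy tagging over the two-tag set {N, V}

def f_t_NV(z_j, z_j1, x, j):
    """Inter-tag indicator: tag j is N and tag j+1 is V."""
    return 1.0 if z_j == "N" and z_j1 == "V" else 0.0

def f_t_told_NV(z_j, z_j1, x, j):
    """Inter-tag indicator that also looks at the input word at position j."""
    return 1.0 if z_j == "N" and z_j1 == "V" and x[j] == "told" else 0.0

def f_s_cap_N(z_j, x, j):
    """Solo-tag indicator: capitalized word tagged N."""
    return 1.0 if z_j == "N" and x[j][0].isupper() else 0.0

# Hypothetical learned weights theta_t (inter-tag) and theta_s (solo tag)
theta_t = {f_t_NV: 1.2, f_t_told_NV: -0.5}
theta_s = {f_s_cap_N: 2.0}

def unnormalized_score(z, x):
    """exp of the summed weighted features over the whole sequence."""
    total = 0.0
    for j in range(len(z)):
        total += sum(w * f(z[j], x, j) for f, w in theta_s.items())
        if j + 1 < len(z):
            total += sum(w * f(z[j], z[j + 1], x, j) for f, w in theta_t.items())
    return math.exp(total)

print(unnormalized_score(tags, words))
```

Decoding the best tag sequence would maximize this score over all tag sequences, typically with the Viterbi algorithm rather than enumeration.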