Undirected Probabilistic Graphical Models CMSC 678 UMBC
Announcement 1: Progress Report on Project
Due Monday April 16th, 11:59 AM
Build on the proposal:
- Update to address comments
- Discuss the progress you've made
- Discuss what remains to be done
- Discuss any new blocks you've experienced (or anticipate experiencing)
Any questions?
Announcement 2: Assignment 4
Due Monday May 14th, 11:59 AM Topic: probabilistic & graphical modeling
Recap from last time…
Hidden Markov Model Representation
$$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\,p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\,p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\,p(z_i \mid z_{i-1})$$

$p(w_i \mid z_i)$: emission probabilities/parameters; $p(z_i \mid z_{i-1})$: transition probabilities/parameters
[Graph: hidden-state chain z1 → z2 → z3 → z4 → …, each state zi emitting its observation wi]
represent the probabilities and independence assumptions in a graph
Viterbi Algorithm
v = double[N+2][K*]
b = int[N+2][K*]
v[*][*] = 0
v[0][START] = 1
for (i = 1; i ≤ N+1; ++i) {
  for (state = 0; state < K*; ++state) {
    pobs = pemission(obs[i] | state)
    for (old = 0; old < K*; ++old) {
      pmove = ptransition(state | old)
      if (v[i-1][old] * pobs * pmove > v[i][state]) {
        v[i][state] = v[i-1][old] * pobs * pmove
        b[i][state] = old
      }
    }
  }
}
backpointers / book-keeping
computing v at time i-1 will correctly incorporate (maximize over) paths through time i-2: we correctly obey the Markov property
v(i, s) is the maximum probability of any path to state s from the beginning (that also emits the observations up through step i)
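A minimal runnable sketch of the same idea in Python with NumPy, assuming the START state is folded into an initial distribution pi and the explicit END state is omitted; the argument names (obs, pi, A, B) and the table layout are illustrative choices, not the slide's notation.

import numpy as np

def viterbi(obs, pi, A, B):
    # obs: length-N list of observation indices
    # pi:  length-K initial state distribution, p(z_1 = s)
    # A:   K x K transitions, A[s_prev, s] = p(s | s_prev)
    # B:   K x V emissions,   B[s, o]      = p(o | s)
    N, K = len(obs), len(pi)
    v = np.zeros((N, K))                    # v[i, s] = best probability of any path ending in s at step i
    back = np.zeros((N, K), dtype=int)      # backpointers / book-keeping
    v[0] = pi * B[:, obs[0]]
    for i in range(1, N):
        for s in range(K):
            scores = v[i - 1] * A[:, s] * B[s, obs[i]]
            back[i, s] = int(np.argmax(scores))
            v[i, s] = scores[back[i, s]]
    path = [int(np.argmax(v[-1]))]          # follow backpointers from the best final state
    for i in range(N - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return list(reversed(path)), float(v[-1].max())

In practice the products would be replaced by sums of log probabilities to avoid underflow, the same max-sum trick mentioned at the end of these slides.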
Marginal Probability (via the Forward Algorithm)
α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i
$$\alpha(i, s) = \sum_{s'} \alpha(i - 1, s') \cdot p(s \mid s') \cdot p(\text{obs at } i \mid s)$$
how likely is it to get into state s this way?
what are the immediate ways to get into state s?
what's the total probability up until now?
Q: What do we return? (How do we return the likelihood of the sequence?)
A: α[N+1][end]
There’s an analogous backwards algorithm
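A sketch of both passes under the same assumed (pi, A, B) layout as the Viterbi sketch above: α sums where Viterbi maximized, and the backward pass β is its mirror image.

import numpy as np

def forward(obs, pi, A, B):
    # alpha[i, s] = total probability of all paths that end in state s at step i
    #               and emit obs[0..i]
    N, K = len(obs), len(pi)
    alpha = np.zeros((N, K))
    alpha[0] = pi * B[:, obs[0]]
    for i in range(1, N):
        alpha[i] = (alpha[i - 1] @ A) * B[:, obs[i]]   # sum over the previous state
    return alpha, float(alpha[-1].sum())               # second value = likelihood of the sequence

def backward(obs, A, B):
    # beta[i, s] = total probability of emitting obs[i+1..N-1] given state s at step i
    N, K = len(obs), A.shape[0]
    beta = np.ones((N, K))
    for i in range(N - 2, -1, -1):
        beta[i] = A @ (B[:, obs[i + 1]] * beta[i + 1])
    return beta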
With Both Forward and Backward Values
α(i, s) · p(s' | s) · p(obs at i+1 | s') · β(i+1, s') = total probability of paths through the s → s' arc (at time i)
α(i, s) · β(i, s) = total probability of paths through state s at step i
$$p(z_i = s \mid w_1, \ldots, w_N) = \frac{\alpha(i, s)\,\beta(i, s)}{\alpha(N + 1, \text{END})}$$
$$p(z_i = s, z_{i+1} = s' \mid w_1, \ldots, w_N) = \frac{\alpha(i, s)\, p(s' \mid s)\, p(w_{i+1} \mid s')\, \beta(i + 1, s')}{\alpha(N + 1, \text{END})}$$
EM For HMMs (Baum-Welch Algorithm)
α = computeForwards()
β = computeBackwards()
L = α[N+1][END]
for (i = N; i ≥ 0; --i) {
  for (next = 0; next < K*; ++next) {
    cobs(obs[i+1] | next) += α[i+1][next] * β[i+1][next] / L
    for (state = 0; state < K*; ++state) {
      u = pobs(obs[i+1] | next) * ptrans(next | state)
      ctrans(next | state) += α[i][state] * u * β[i+1][next] / L
    }
  }
}
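One way to finish the EM iteration, written against the forward/backward sketches above rather than the slide's index-by-index loop: accumulate the posterior counts (E-step) and renormalize them into new parameters (M-step). The vectorized form and the variable names are illustrative assumptions.

import numpy as np

def baum_welch_step(obs, pi, A, B):
    N, K = len(obs), len(pi)
    alpha, L = forward(obs, pi, A, B)
    beta = backward(obs, A, B)
    gamma = alpha * beta / L                       # p(z_i = s | obs): expected state occupancy
    xi = np.zeros((K, K))                          # expected transition counts
    for i in range(N - 1):
        xi += np.outer(alpha[i], B[:, obs[i + 1]] * beta[i + 1]) * A / L
    new_pi = gamma[0]
    new_A = xi / xi.sum(axis=1, keepdims=True)     # M-step: normalize counts into probabilities
    new_B = np.zeros_like(B)
    for i, o in enumerate(obs):
        new_B[:, o] += gamma[i]                    # expected emission counts
    new_B /= new_B.sum(axis=1, keepdims=True)
    return new_pi, new_A, new_B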
Bayesian Networks: Directed Acyclic Graphs

$$p(x_1, x_2, x_3, \ldots, x_N) = \prod_j p\big(x_j \mid \mathrm{pa}(x_j)\big)$$

$\mathrm{pa}(x_j)$: the parents of $x_j$; the factors can be multiplied in any topological-sort order
exact inference in general DAGs is NP-hard; inference in trees can be exact
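A tiny illustration of the factorization: multiply each node's conditional probability given its parents, visiting nodes in a topological order. The three-node DAG and its CPT values below are hypothetical.

parents = {"x1": [], "x2": [], "x3": ["x1", "x2"]}     # hypothetical DAG: x1 -> x3 <- x2
cpt = {
    "x1": {(): {0: 0.6, 1: 0.4}},
    "x2": {(): {0: 0.7, 1: 0.3}},
    "x3": {(a, b): {0: 0.9 - 0.2 * (a + b), 1: 0.1 + 0.2 * (a + b)}
           for a in (0, 1) for b in (0, 1)},
}

def joint(assignment):
    # p(x) = product over nodes j of p(x_j | parents(x_j))
    p = 1.0
    for node in ("x1", "x2", "x3"):                    # a topological order
        key = tuple(assignment[q] for q in parents[node])
        p *= cpt[node][key][assignment[node]]
    return p

print(joint({"x1": 1, "x2": 0, "x3": 1}))              # 0.4 * 0.7 * 0.3 = 0.084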
D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z

d-separation: X & Y are d-separated if, for every path P between them, one of the following is true:
1. P has a chain X → Z → Y with an observed middle node (observing Z blocks the path from X to Y)
2. P has a fork X ← Z → Y with an observed parent node (observing Z blocks the path from X to Y)
3. P includes a "v-structure" or "collider" X → Z ← Y in which neither Z nor any of its descendants is observed (not observing Z blocks the path from X to Y)

For the v-structure, marginalizing out the unobserved collider leaves X and Y independent:
$$p(x, y, z) = p(x)\,p(y)\,p(z \mid x, y), \qquad p(x, y) = \sum_z p(x)\,p(y)\,p(z \mid x, y) = p(x)\,p(y)$$
Markov Blanket
The Markov blanket of a node x is its parents, children, and children's parents: the set of nodes needed to form the complete conditional for a variable $x_j$.

$$p(x_j \mid x_{k \ne j}) = \frac{p(x_1, \ldots, x_N)}{\int p(x_1, \ldots, x_N)\, dx_j} = \frac{\prod_l p(x_l \mid \mathrm{pa}(x_l))}{\int \prod_l p(x_l \mid \mathrm{pa}(x_l))\, dx_j} \qquad \text{(factorization of the graph)}$$

factor out the terms that do not depend on $x_j$:

$$= \frac{\prod_{l\,:\,l = j \text{ or } x_j \in \mathrm{pa}(x_l)} p(x_l \mid \mathrm{pa}(x_l))}{\int \prod_{l\,:\,l = j \text{ or } x_j \in \mathrm{pa}(x_l)} p(x_l \mid \mathrm{pa}(x_l))\, dx_j}$$
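A small sketch of reading the Markov blanket off a DAG stored as a parent map (the five-node graph here is hypothetical): take the node's parents, its children, and its children's other parents.

parents = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": ["c", "b"]}

def markov_blanket(node):
    children = [n for n, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:
        blanket |= set(parents[child])     # the children's other parents (co-parents)
    blanket.discard(node)
    return blanket

print(markov_blanket("c"))                 # {'a', 'b', 'd', 'e'}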
Markov Random Fields: Undirected Graphs

$$p(x_1, x_2, x_3, \ldots, x_N) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C)$$

- $Z$: global normalization
- the product runs over the maximal cliques $C$ of the graph
- $\psi_C$: potential function (not necessarily a probability!)
- $\mathbf{x}_C$: the variables that are part of the clique $C$

clique: subset of nodes, where nodes are pairwise connected
maximal clique: a clique that cannot add a node and remain a clique

Q: What restrictions should we place on the potentials $\psi_C$?
A: $\psi_C \ge 0$ (or $\psi_C > 0$)
Terminology: Potential Functions
$$\psi_C(\mathbf{x}_C) = \exp\!\big(-E(\mathbf{x}_C)\big)$$

$E$: energy function (for clique $C$); a distribution of this exponential form is a Boltzmann distribution
(get the total energy of a configuration by summing the individual energy functions)

$$p(x_1, x_2, x_3, \ldots, x_N) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C)$$
Ambiguity in Undirected Model Notation
[Graph: three mutually connected variables X, Y, Z]

The same fully connected undirected graph is consistent with either factorization:

$$p(x, y, z) \propto \psi(x, y, z) \qquad \text{or} \qquad p(x, y, z) \propto \psi_1(x, y)\,\psi_2(y, z)\,\psi_3(x, z)$$
Example: Ising Model

x: original pixel/state
y: observed (noisy) pixel/state
[Image denoising (Bishop, 2006; Fig 8.30): the original image, a version with 10% noise, and two recovered solutions]

Q: What are the cliques?

$$E(\mathbf{x}, \mathbf{y}) = h \sum_i x_i \;-\; \beta \sum_{\{i,j\}} x_i x_j \;-\; \eta \sum_i x_i y_i$$

- $\eta \sum_i x_i y_i$: $x_i$ and $y_i$ should be correlated
- $\beta \sum_{\{i,j\}} x_i x_j$: neighboring pixels should be similar
- $h \sum_i x_i$: allow for a bias

Q: Why subtract β and η?
A: Better states get lower energy, and therefore higher potential: $\psi_C(\mathbf{x}_C) = \exp(-E(\mathbf{x}_C))$
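One simple way to search for a low-energy x given the noisy y is iterated conditional modes (ICM): sweep over pixels and set each one to whichever value in {-1, +1} gives lower local energy. This is a rough sketch of that idea, not the slides' exact procedure, and the parameter values are illustrative (Bishop also shows a graph-cut solution).

import numpy as np

def icm_denoise(y, h=0.0, beta=1.0, eta=2.1, sweeps=10):
    # y: 2-D array of noisy pixels in {-1, +1}; returns a denoised x
    x = y.copy()
    H, W = y.shape
    for _ in range(sweeps):
        for i in range(H):
            for j in range(W):
                # neighbors of pixel (i, j) that lie inside the image
                nbr = sum(x[a, b] for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                          if 0 <= a < H and 0 <= b < W)
                # energy contribution of setting x[i, j] = v, holding everything else fixed
                def local_energy(v):
                    return h * v - beta * v * nbr - eta * v * y[i, j]
                x[i, j] = min((-1, +1), key=local_energy)
    return x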
Markov Random Fields with Factor Graph Notation
x: original pixel/state y: observed (noisy) pixel/state
factor nodes are added according to maximal cliques
[Figure: factor-graph version of the model, with variable nodes plus unary and binary factor nodes]
factor graphs are bipartite
Different Factor Graph Notation for the Same Graph
[Figure: three different factor graphs, each a valid notation for the same graph over X, Y, Z]
Example: Linear Chain

Directed, generative (e.g., hidden Markov model [HMM]):
[Graph: z1 → z2 → z3 → z4, each state zi emitting its observation wi]

Directed, conditional (e.g., maximum entropy Markov model [MEMM]):
[Graph: z1 → z2 → z3 → z4, with each observation wi conditioning its state zi]

Undirected (e.g., conditional random field [CRF]):
[Graph: z1 - z2 - z3 - z4, each state zi linked to its observation wi]

The same undirected model drawn as a factor graph:
[Graph: factor nodes between consecutive states zi, zi+1 and between each zi and its observation]
Directed vs. Undirected Models: Moralization

[Directed graph: x1, x2, x3 are all parents of x4]

$$p(x_1, \ldots, x_4) = p(x_1)\,p(x_2)\,p(x_3)\,p(x_4 \mid x_1, x_2, x_3)$$

[Undirected (moralized) graph: x1, x2, x3, x4 pairwise connected]

parents of nodes in a directed graph must be connected in the corresponding undirected graph
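A sketch of the construction in plain Python dictionaries (no graph library assumed): keep every original edge without its direction and "marry" the parents of each node. The parent map matches the four-node example above.

from itertools import combinations

parents = {"x1": [], "x2": [], "x3": [], "x4": ["x1", "x2", "x3"]}

undirected = set()
for node, ps in parents.items():
    for p in ps:
        undirected.add(frozenset((p, node)))      # original edge, direction dropped
    for p, q in combinations(ps, 2):
        undirected.add(frozenset((p, q)))         # marry the parents
print(sorted(tuple(sorted(e)) for e in undirected))
# x1-x2, x1-x3, x1-x4, x2-x3, x2-x4, x3-x4: the moralized graph is fully connected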
Two Problems for Undirected Models

1. Finding the normalizer:
$$Z = \sum_{\mathbf{x}} \prod_C \psi_C(\mathbf{x}_C)$$

2. Computing the marginals:
$$Z_n(v) = \sum_{\mathbf{x}\,:\,x_n = v} \prod_C \psi_C(\mathbf{x}_C)$$
(sum over all variable combinations, with the $x_n$ coordinate fixed)

Example: 3 variables, fix the 2nd dimension:
$$Z_2(v) = \sum_{x_1} \sum_{x_3} \prod_C \psi_C\big(\mathbf{x} = (x_1, v, x_3)\big)$$

Q: Why are these difficult?
A: Many different combinations: the sums range over every joint setting of the variables (see the sketch below)
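A brute-force sketch of both quantities for a tiny binary chain with two made-up pairwise potentials; the enumeration over all settings is exactly the exponential cost that makes Z and the marginals hard in general.

from itertools import product

def psi(a, b):
    return 2.0 if a == b else 0.5                  # hypothetical potential, nonnegative

def unnormalized(x):                               # cliques {x1, x2} and {x2, x3}
    return psi(x[0], x[1]) * psi(x[1], x[2])

Z = sum(unnormalized(x) for x in product((0, 1), repeat=3))

# Z_2(v): the same sum with the second coordinate fixed to v
Z2 = {v: sum(unnormalized((x1, v, x3)) for x1 in (0, 1) for x3 in (0, 1)) for v in (0, 1)}

print(Z, {v: Z2[v] / Z for v in Z2})               # p(x_2 = v) = Z_2(v) / Z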
Sum-Product Algorithm
Main idea: message passing
An exact inference algorithm for tree-like graphs
Belief propagation (forward-backward for HMMs) is a special case
Sum-Product

definition of marginal:
$$p(x_n = v) = \sum_{\mathbf{x}\,:\,x_n = v} p(x_1, x_2, \ldots, x_n, \ldots, x_N)$$

main idea: use the bipartite nature of the factor graph to efficiently compute the marginals, by passing messages $\mu_{m \to n}(x_n)$ from factors $m$ to variables $n$ (and messages back from variables to factors)

alternative marginal computation:
$$p(x_n) = \prod_{m \in M(n)} \mu_{m \to n}(x_n)$$
where $M(n)$ is the set of factors in which variable $n$ participates
Sum-Product

From variables to factors:
$$\nu_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} \mu_{m' \to n}(x_n)$$
$M(n)$: the set of factors in which variable $n$ participates; the message defaults to 1 if the product is empty

From factors to variables:
$$\mu_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus x_n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} \nu_{n' \to m}(x_{n'})$$
$N(m)$: the set of variables that the $m$th factor depends on; the sum is over configurations of the $m$th factor's variables, with variable $n$ fixed; the product again defaults to 1 if empty
Example
[Figure: a small factor graph over x1, x2, x3, x4 with factors f, g, h]
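A concrete sum-product computation on a small chain factor graph x1 -- f -- x2 -- g -- x3 (a hypothetical example chosen for brevity, not necessarily the graph drawn on the slide), with a brute-force check that the message product really is the marginal.

import numpy as np
from itertools import product

f = np.array([[2.0, 1.0], [1.0, 3.0]])        # f(x1, x2), an arbitrary nonnegative table
g = np.array([[1.0, 2.0], [4.0, 1.0]])        # g(x2, x3)

nu_x1_to_f = np.ones(2)                       # leaf variables send the empty product = 1
nu_x3_to_g = np.ones(2)
mu_f_to_x2 = f.T @ nu_x1_to_f                 # sum_{x1} f(x1, x2) * nu(x1)
mu_g_to_x2 = g @ nu_x3_to_g                   # sum_{x3} g(x2, x3) * nu(x3)

p_x2 = mu_f_to_x2 * mu_g_to_x2                # product of incoming factor messages
p_x2 /= p_x2.sum()                            # normalize

brute = np.zeros(2)                           # brute-force marginal for comparison
for x1, x2, x3 in product((0, 1), repeat=3):
    brute[x2] += f[x1, x2] * g[x2, x3]
brute /= brute.sum()
print(p_x2, brute)                            # the two should match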
Max-Product (Max-Sum)
Problem: how to find the most likely (best) setting of the latent variables
Replace the sum (+) with a max in the factor-to-variable computations:

$$\mu_{m \to n}(x_n) = \max_{\mathbf{x}_m \setminus x_n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} \nu_{n' \to m}(x_{n'})$$

(why "max-sum"? computationally, implement with logs, so that products of probabilities become sums of log-probabilities)