Directed Probabilistic Graphical Models
CMSC 678 UMBC
Announcement 1: Assignment 3
Due Wednesday April 11th, 11:59 AM Any questions?
Announcement 2: Progress Report on Project
Due Monday April 16th, 11:59 AM
Build on the proposal:
Update to address comments
Discuss the progress you've made
Discuss what remains to be done
Discuss any new blocks you've experienced (or anticipate experiencing)
Any questions?
Outline
Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
Expectation Maximization (EM): E-step
Two step, iterative algorithm
parameters
uncertain counts
count(z_i, w_i) · p(z_i)        p^(t)(z) → p^(t+1)(z)
estimated counts
EM Math
E-step: count under uncertainty
M-step: maximize log-likelihood
𝒟(θ) = log-likelihood of complete data (X, Y)
𝒬(θ) = posterior log-likelihood of incomplete data Y
ℳ(θ) = marginal log-likelihood of observed data X
ℳ(θ) = 𝔼_{Y∼θ^(t)}[𝒟(θ) | X] − 𝔼_{Y∼θ^(t)}[𝒬(θ) | X]
(expectations are taken under the posterior distribution at the current parameters θ^(t))
EM does not decrease the marginal log-likelihood
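One standard way to see this, stated in the notation above (a sketch of the usual argument, not taken from the slides): comparing successive iterates,

\mathcal{M}(\theta^{(t+1)}) - \mathcal{M}(\theta^{(t)})
  = \underbrace{\mathbb{E}_{Y\sim\theta^{(t)}}\!\left[\mathcal{D}(\theta^{(t+1)}) - \mathcal{D}(\theta^{(t)}) \mid X\right]}_{\ge 0 \text{ by the M-step}}
  + \underbrace{\mathrm{KL}\!\left(p_{\theta^{(t)}}(Y \mid X)\,\|\,p_{\theta^{(t+1)}}(Y \mid X)\right)}_{\ge 0}

so every iteration either increases the marginal log-likelihood or leaves it unchanged.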
Outline
Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
Assume an original optimization problem
Lagrange multipliers
Assume an original optimization problem We convert it to a new optimization problem:
Lagrange multipliers
Lagrange multipliers: an equivalent problem?
Lagrange multipliers: an equivalent problem?
Lagrange multipliers: an equivalent problem?
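As a concrete sketch of the construction these slides set up (the generic recipe; f, g, and λ are placeholder names, not from the slides): to maximize f(θ) subject to an equality constraint g(θ) = 0, form the Lagrangian and look for its stationary points,

F(\theta, \lambda) = f(\theta) - \lambda\, g(\theta), \qquad
\frac{\partial F}{\partial \theta} = \nabla f(\theta) - \lambda \nabla g(\theta) = 0, \qquad
\frac{\partial F}{\partial \lambda} = -g(\theta) = 0 .

Setting ∂F/∂λ = 0 just re-imposes the constraint, and ∂F/∂θ = 0 says the gradients of the objective and the constraint must be parallel at the solution; this is exactly the recipe applied to the die-rolling likelihood below.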
Outline
Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
Probabilistic Estimation of Rolling a Die
p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
N different (independent) rolls
w_1 = 1   w_2 = 5   w_3 = 4   ⋯
Generative Story: for roll i = 1 to N: w_i ~ Cat(θ)
θ = a probability distribution over the 6 sides of the die:   Σ_{k=1}^{6} θ_k = 1,   0 ≤ θ_k ≤ 1 ∀k
Probabilistic Estimation of Rolling a Die
p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
N different (independent) rolls
w_1 = 1   w_2 = 5   w_3 = 4   ⋯
Generative Story: for roll i = 1 to N: w_i ~ Cat(θ)
ℒ(θ) = Σ_i log p_θ(w_i) = Σ_i log θ_{w_i}
Maximize Log-likelihood
Probabilistic Estimation of Rolling a Die
p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
N different (independent) rolls
Generative Story: for roll i = 1 to N: w_i ~ Cat(θ)
ℒ(θ) = Σ_i log θ_{w_i}
Maximize Log-likelihood Q: What’s an easy way to maximize this, as written exactly (even without calculus)?
Probabilistic Estimation of Rolling a Die
p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
N different (independent) rolls
Generative Story: for roll i = 1 to N: w_i ~ Cat(θ)
ℒ(θ) = Σ_i log θ_{w_i}
Maximize Log-likelihood
Q: What's an easy way to maximize this, as written exactly (even without calculus)?
A: Just keep increasing each θ_k (we know θ must be a distribution, but the objective as written doesn't say so)
Probabilistic Estimation of Rolling a Die
p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
N different (independent) rolls
ℒ(θ) = Σ_i log θ_{w_i}   s.t.   Σ_{k=1}^{6} θ_k = 1
Maximize Log-likelihood (with distribution constraints)
(we can include the inequality constraints 0 ≤ θ_k, but it complicates the problem and, right now, is not needed)
solve using Lagrange multipliers
Probabilistic Estimation of Rolling a Die
p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
N different (independent) rolls
Maximize Log-likelihood (with distribution constraints)
F(θ, λ) = Σ_i log θ_{w_i} − λ (Σ_{k=1}^{6} θ_k − 1)
(we can include the inequality constraints 0 ≤ θ_k, but it complicates the problem and, right now, is not needed)
∂F/∂θ_k = Σ_{i: w_i = k} 1/θ_{w_i} − λ        ∂F/∂λ = −Σ_{k=1}^{6} θ_k + 1
Probabilistic Estimation of Rolling a Die
p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
N different (independent) rolls
Maximize Log-likelihood (with distribution constraints)
F(θ, λ) = Σ_i log θ_{w_i} − λ (Σ_{k=1}^{6} θ_k − 1)
(we can include the inequality constraints 0 ≤ θ_k, but it complicates the problem and, right now, is not needed)
Setting the derivatives to zero:   θ_k = (Σ_{i: w_i = k} 1) / λ,   subject to   Σ_{k=1}^{6} θ_k = 1
Probabilistic Estimation of Rolling a Die
p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
N different (independent) rolls
Maximize Log-likelihood (with distribution constraints)
F(θ, λ) = Σ_i log θ_{w_i} − λ (Σ_{k=1}^{6} θ_k − 1)
(we can include the inequality constraints 0 ≤ θ_k, but it complicates the problem and, right now, is not needed)
Solving with the constraint Σ_{k=1}^{6} θ_k = 1 gives λ = N, so
θ_k = (Σ_{i: w_i = k} 1) / (Σ_{k'} Σ_{i: w_i = k'} 1) = N_k / N
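As a quick sanity check of that closed form, a minimal Python sketch (the variable names and simulated data are illustrative, not from the slides):

import random
from collections import Counter

random.seed(0)
true_theta = [0.1, 0.1, 0.2, 0.2, 0.1, 0.3]            # a non-uniform die
rolls = random.choices(range(1, 7), weights=true_theta, k=10_000)

counts = Counter(rolls)                                  # N_k for each face k
N = len(rolls)
theta_hat = {k: counts[k] / N for k in range(1, 7)}      # MLE: theta_k = N_k / N
print(theta_hat)

With enough rolls, theta_hat lands close to true_theta.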
Example: Conditionally Rolling a Die
p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1 | z_1) ⋯ p(z_N) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i)
add complexity to better explain what we see
w_1 = 1   w_2 = 5   ⋯      z_1 = H   z_2 = T
penny: p(heads) = λ, p(tails) = 1 − λ      dollar coin: p(heads) = γ, p(tails) = 1 − γ      dime: p(heads) = ψ, p(tails) = 1 − ψ
Example: Conditionally Rolling a Die
p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1 | z_1) ⋯ p(z_N) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i)
add complexity to better explain what we see
penny: p(heads) = λ, p(tails) = 1 − λ      dollar coin: p(heads) = γ, p(tails) = 1 − γ      dime: p(heads) = ψ, p(tails) = 1 − ψ
for item i = 1 to N: z_i ~ Bernoulli(λ)
Generative Story
λ = distribution over the penny      γ = distribution over the dollar coin      ψ = distribution over the dime
if z_i = H: w_i ~ Bernoulli(γ)      else: w_i ~ Bernoulli(ψ)
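A small Python sketch of this generative story (the parameter values are arbitrary choices for illustration):

import random

lam, gamma, psi = 0.5, 0.9, 0.3      # penny, dollar coin, and dime head-probabilities
N = 5
samples = []
for i in range(N):
    z = 'H' if random.random() < lam else 'T'        # flip the penny: which coin comes next?
    p_heads = gamma if z == 'H' else psi             # dollar coin if heads, dime if tails
    w = 'H' if random.random() < p_heads else 'T'    # flip the chosen coin (this is what we observe)
    samples.append((z, w))
print(samples)

Only the w values are observed; the z values are the latent structure EM will later have to reason about.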
Outline
Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
Classify with Bayes Rule
argmax_Y p(Y | X) = argmax_Y [ log p(X | Y) + log p(Y) ]
(likelihood: p(X | Y);   prior: p(Y))
The Bag of Words Representation
Adapted from Jurafsky & Martin (draft)
The Bag of Words Representation
Adapted from Jurafsky & Martin (draft)
The Bag of Words Representation
Adapted from Jurafsky & Martin (draft)
Bag of Words Representation
seen 2, sweet 1, whimsical 1, recommend 1, happy 1, …
classifier
Adapted from Jurafsky & Martin (draft)
Naïve Bayes: A Generative Story
Generative Story
π = distribution over K labels
for label k = 1 to K:
global parameters
θ_k = generate parameters
Naïve Bayes: A Generative Story
for item i = 1 to N: y_i ~ Cat(π)
Generative Story
π = distribution over K labels
y
for label k = 1 to K: θ_k = generate parameters
Naïve Bayes: A Generative Story
for item i = 1 to N: y_i ~ Cat(π)
Generative Story
π = distribution over K labels
for each feature j: x_ij ~ F_j(θ_{y_i})
x_i1   x_i2   x_i3   x_i4   x_i5
y
for label k = 1 to K: θ_k = generate parameters
local variables
Naïve Bayes: A Generative Story
for item i = 1 to N: y_i ~ Cat(π)
Generative Story
π = distribution over K labels
x_i1   x_i2   x_i3   x_i4   x_i5
y
for label k = 1 to K:
each xij conditionally independent of one another (given the label)
θ_k = generate parameters; for each feature j: x_ij ~ F_j(θ_{y_i})
Naïve Bayes: A Generative Story
for item i = 1 to N: y_i ~ Cat(π)
Generative Story
π = distribution over K labels
for label k = 1 to K: θ_k = generate parameters
for each feature j: x_ij ~ F_j(θ_{y_i})
x_i1   x_i2   x_i3   x_i4   x_i5
y
Maximize Log-likelihood
ℒ(θ, π) = Σ_i Σ_j log F_j(x_ij; θ_{y_i}) + Σ_i log π_{y_i}
s.t.   Σ_k π_k = 1,   and each θ_k is valid for F
Multinomial Naïve Bayes: A Generative Story
for item i = 1 to N: y_i ~ Cat(π)
Generative Story
π = distribution over K labels
for label k = 1 to K: θ_k = distribution over J feature values
for each feature j: x_ij ~ Cat(θ_{y_i})
x_i1   x_i2   x_i3   x_i4   x_i5
y
Maximize Log-likelihood
ℒ(θ, π) = Σ_i Σ_j log θ_{y_i, x_ij} + Σ_i log π_{y_i}
s.t.   Σ_k π_k = 1   and   Σ_j θ_kj = 1 ∀k
Multinomial Naïve Bayes: A Generative Story
for item i = 1 to N: y_i ~ Cat(π)
Generative Story
π = distribution over K labels
for label k = 1 to K: θ_k = distribution over J feature values
for each feature j: x_ij ~ Cat(θ_{y_i})
x_i1   x_i2   x_i3   x_i4   x_i5
y
Maximize Log-likelihood via Lagrange Multipliers
ℒ(θ, π, μ, λ) = Σ_i Σ_j log θ_{y_i, x_ij} + Σ_i log π_{y_i} − μ (Σ_k π_k − 1) − Σ_k λ_k (Σ_j θ_kj − 1)
Multinomial Naïve Bayes: Learning
Calculate class priors
For each class k:
    items_k = all items with class = k
    p(k) = |items_k| / (# items)
Calculate feature generation terms
For each class k:
    obs_k = items labeled as k
    For each feature j:
        n_kj = # of occurrences of j in obs_k
        p(j | k) = n_kj / Σ_j' n_kj'
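A compact Python sketch of exactly these counts (function and variable names are mine; smoothing and log-space scoring, which one would normally add, are omitted):

from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (label, tokens). Returns p(k) and p(j | k) as plain dicts."""
    n_docs = len(docs)
    class_counts = Counter(label for label, _ in docs)
    token_counts = defaultdict(Counter)                  # token_counts[k][j] = n_kj
    for label, tokens in docs:
        token_counts[label].update(tokens)
    priors = {k: c / n_docs for k, c in class_counts.items()}
    likelihoods = {k: {j: n / sum(cnt.values()) for j, n in cnt.items()}
                   for k, cnt in token_counts.items()}
    return priors, likelihoods

priors, likelihoods = train_multinomial_nb([
    ("pos", ["sweet", "whimsical", "recommend"]),
    ("neg", ["boring", "slow"]),
])
print(priors["pos"], likelihoods["pos"]["sweet"])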
Brill and Banko (2001) With enough data, the classifier may not matter
Adapted from Jurafsky & Martin (draft)
Summary: Naïve Bayes is Not So Naïve, but not without issue
Pro
Very fast, low storage requirements
Robust to irrelevant features
Very good in domains with many equally important features
Optimal if the independence assumptions hold
Dependable baseline for text classification (but often not the best)
Con
Model the posterior in one go? (e.g., use conditional maxent)
Are the features really uncorrelated?
Are plain counts always appropriate?
Are there "better" ways of handling missing/noisy data?
(automated, more principled)
Adapted from Jurafsky & Martin (draft)
Outline
Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
Hidden Markov Models
p(British Left Waffles on Falkland Islands)
(i): Adjective Noun Verb Prep Noun Noun      (ii): Noun Verb Noun Prep Noun Noun
Class-based model Bigram model
Model all class sequences
Σ_{z_1,…,z_N} p(z_1, w_1, z_2, w_2, …, z_N, w_N) = Σ_{z_1,…,z_N} ∏_i p(w_i | z_i) p(z_i | z_{i−1})
Hidden Markov Model
Goal: maximize (log-)likelihood In practice: we don’t actually observe these z values; we just see the words w
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})
Hidden Markov Model
Goal: maximize (log-)likelihood In practice: we don’t actually observe these z values; we just see the words w
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})
if we knew the probability parameters then we could estimate z and evaluate likelihood… but we don’t! :( if we did observe z, estimating the probability parameters would be easy… but we don’t! :(
Hidden Markov Model Terminology
Each zi can take the value of one of K latent states
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})
Hidden Markov Model Terminology
Each zi can take the value of one of K latent states
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})
transition probabilities/parameters
Hidden Markov Model Terminology
Each zi can take the value of one of K latent states
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})
emission probabilities/parameters transition probabilities/parameters
Hidden Markov Model Terminology
Each zi can take the value of one of K latent states Transition and emission distributions do not change
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})
emission probabilities/parameters transition probabilities/parameters
Hidden Markov Model Terminology
Each zi can take the value of one of K latent states Transition and emission distributions do not change
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})
emission probabilities/parameters transition probabilities/parameters
Q: How many different probability values are there with K states and V vocab items?
Hidden Markov Model Terminology
Each zi can take the value of one of K latent states Transition and emission distributions do not change
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})
emission probabilities/parameters transition probabilities/parameters
Q: How many different probability values are there with K states and V vocab items?
A: V·K emission values and K² transition values
Hidden Markov Model Representation
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})
emission probabilities/parameters transition probabilities/parameters
z1
w1
…
w2 w3 w4
z2 z3 z4
represent the probabilities and independence assumptions in a graph
Hidden Markov Model Representation
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})
emission probabilities/parameters transition probabilities/parameters
z1
w1
…
w2 w3 w4
z2 z3 z4
p(w1 | z1)   p(w2 | z2)   p(w3 | z3)   p(w4 | z4)
p(z1 | z0)   p(z2 | z1)   p(z3 | z2)   p(z4 | z3)      p(z1 | z0) is the initial starting distribution
Each zi can take the value of one of K latent states Transition and emission distributions do not change
Example: 2-state Hidden Markov Model as a Lattice
z1 = N
w1
…
w2 w3 w4
z2 = N z3 = N z4 = N z1 = V z2 = V z3 = V z4 = V
…
Transition probabilities p(z' | z):
           N      V      end
  start   .7     .2     .1
  N       .15    .8     .05
  V       .6     .35    .05
Emission probabilities p(w | z):
           w1     w2     w3     w4
  N       .7     .2     .05    .05
  V       .2     .6     .1     .1
Example: 2-state Hidden Markov Model as a Lattice
z1 = N
w1
…
w2 w3 w4
p(w1 | N)   p(w2 | N)   p(w3 | N)   p(w4 | N)
z2 = N z3 = N z4 = N z1 = V z2 = V z4 = V
…
p(w1 | V)   p(w2 | V)   p(w3 | V)   p(w4 | V)
z3 = V
(transition and emission tables as above)
Example: 2-state Hidden Markov Model as a Lattice
z1 = N
w1
…
w2 w3 w4
p(w1 | N)   p(w2 | N)   p(w3 | N)   p(w4 | N)
p(N | start)   z2 = N z3 = N z4 = N z1 = V z2 = V z3 = V z4 = V   p(V | V)  p(V | V)  p(V | V)  p(V | start)
…
p(w1 | V)   p(w2 | V)   p(w3 | V)   p(w4 | V)
p(N | N)   p(N | N)   p(N | N)
(transition and emission tables as above)
Example: 2-state Hidden Markov Model as a Lattice
z1 = N
w1
…
w2 w3 w4
p(w1 | N)   p(w2 | N)   p(w3 | N)   p(w4 | N)
p(N | start)   z2 = N z3 = N z4 = N z1 = V z2 = V z3 = V z4 = V   p(V | V)  p(V | V)  p(V | V)  p(V | start)
…
p(V | N)  p(V | N)  p(V | N)   p(N | V)  p(N | V)  p(N | V)
p(w1 | V)   p(w2 | V)   p(w3 | V)   p(w4 | V)
p(N | N)   p(N | N)   p(N | N)
(transition and emission tables as above)
A Latent Sequence is a Path through the Graph
z1 = N
w1 w2 w3 w4
p(w1 | N)   p(w4 | N)
p(N | start)   z2 = N z3 = N z4 = N z1 = V z2 = V z3 = V z4 = V   p(V | V)   p(V | N)   p(N | V)
p(w2 | V)   p(w3 | V)
Q: What's the probability of the path (N, w1), (V, w2), (V, w3), (N, w4)?
A: (.7 * .7) * (.8 * .6) * (.35 * .1) * (.6 * .05) ≈ 0.000247
(transition and emission tables as above)
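A small Python check of that arithmetic, with the transition and emission tables written out as dictionaries (the data structures are mine; the numbers come from the tables above):

trans = {('start', 'N'): .7, ('start', 'V'): .2,
         ('N', 'N'): .15, ('N', 'V'): .8,
         ('V', 'N'): .6,  ('V', 'V'): .35}
emit = {('N', 'w1'): .7, ('N', 'w2'): .2, ('N', 'w3'): .05, ('N', 'w4'): .05,
        ('V', 'w1'): .2, ('V', 'w2'): .6, ('V', 'w3'): .1,  ('V', 'w4'): .1}

def path_prob(states, words):
    # p(z_1, w_1, ..., z_N, w_N) = prod_i p(w_i | z_i) p(z_i | z_{i-1})
    p, prev = 1.0, 'start'
    for z, w in zip(states, words):
        p *= trans[(prev, z)] * emit[(z, w)]
        prev = z
    return p

print(path_prob(['N', 'V', 'V', 'N'], ['w1', 'w2', 'w3', 'w4']))   # ~0.000247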
Outline
Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
Message Passing: Count the Soldiers
If you are the front soldier in the line, say the number 'one' to the soldier behind you. If you are the rearmost soldier in the line, say the number 'one' to the soldier in front of you. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side
ITILA, Ch 16
Message Passing: Count the Soldiers
If you are the front soldier in the line, say the number ‘one’ to the soldier behind you. If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side
ITILA, Ch 16
Message Passing: Count the Soldiers
If you are the front soldier in the line, say the number ‘one’ to the soldier behind you. If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side
ITILA, Ch 16
Message Passing: Count the Soldiers
If you are the front soldier in the line, say the number ‘one’ to the soldier behind you. If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side
ITILA, Ch 16
What's the Maximum Weighted Path?
(figure: a small graph with edge weights 9, 6, 7, 3, 32, 1, and 4; passing messages forward accumulates the running best totals, e.g. +3 gives 10 and 7, then +10 gives 19, 16, 42, and 11)
Outline
Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
What’s the Maximum Value?
consider "any path ending in state B" (A→B, B→B, or C→B): maximize across the previous hidden state values
v(i, B) = max_{s'} v(i−1, s') * p(B | s') * p(obs at i | B)
v(i, B) is the maximum probability of any path to state B from the beginning (emitting the observations up through step i)
zi-1 = C zi-1 = B zi-1 = A zi = C zi = B zi = A
What’s the Maximum Value?
consider "any path ending in state B" (A→B, B→B, or C→B): maximize across the previous hidden state values
zi-1 = C zi-1 = B zi-1 = A zi = C zi = B zi = A zi-2 = C zi-2 = B zi-2 = A
v(i, B) = max_{s'} v(i−1, s') * p(B | s') * p(obs at i | B)
v(i, B) is the maximum probability of any path to state B from the beginning (emitting the observations up through step i)
What’s the Maximum Value?
consider "any path ending in state B" (A→B, B→B, or C→B): maximize across the previous hidden state values
zi-1 = C zi-1 = B zi-1 = A zi = C zi = B zi = A zi-2 = C zi-2 = B zi-2 = A
computing v at time i-1 will correctly incorporate (maximize over) paths through time i-2: we correctly obey the Markov property
v(i, B) = max_{s'} v(i−1, s') * p(B | s') * p(obs at i | B)
v(i, B) is the maximum probability of any path to state B from the beginning (emitting the observations up through step i)
Viterbi algorithm
Viterbi Algorithm
v = double[N+2][K*]
b = int[N+2][K*]
v[*][*] = 0
v[0][START] = 1
for(i = 1; i ≤ N+1; ++i) {
  for(state = 0; state < K*; ++state) {
    pobs = pemission(obs_i | state)
    for(old = 0; old < K*; ++old) {
      pmove = ptransition(state | old)
      if(v[i-1][old] * pobs * pmove > v[i][state]) {
        v[i][state] = v[i-1][old] * pobs * pmove
        b[i][state] = old
      }
    }
  }
}
backpointers/ book-keeping
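Once v and b are filled, the best state sequence is read out by walking the backpointers from the END state; a small Python-style sketch of that backtrace (b, N, START, and END refer to the quantities in the pseudocode above):

def backtrace(b, N, END):
    # follow backpointers from END at step N+1 back to START at step 0
    path = [END]
    for i in range(N + 1, 0, -1):
        path.append(b[i][path[-1]])      # b[i][s] = best predecessor of state s at step i
    return list(reversed(path))          # [START, z_1, ..., z_N, END]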
Outline
Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
Forward Probability
α(i, B) is the total probability of all paths to that state B from the beginning
zi-1 = C zi-1 = B zi-1 = A zi = C zi = B zi = A zi-2 = C zi-2 = B zi-2 = A
Forward Probability
marginalize across the previous hidden state values
α(i, B) = Σ_{s'} α(i−1, s') * p(B | s') * p(obs at i | B)
computing α at time i-1 will correctly incorporate paths through time i-2: we correctly obey the Markov property
α(i, B) is the total probability of all paths to that state B from the beginning
zi-1 = C zi-1 = B zi-1 = A zi = C zi = B zi = A zi-2 = C zi-2 = B zi-2 = A
Forward Probability
α(i, s) is the total probability of all paths:
α(i, s) = Σ_{s'} α(i−1, s') * p(s | s') * p(obs at i | s)
Forward Probability
α(i, s) is the total probability of all paths:
α(i, s) = Σ_{s'} α(i−1, s') * p(s | s') * p(obs at i | s)
how likely is it to get into state s this way? what are the immediate ways to get into state s? what’s the total probability up until now?
Forward Probability
α(i, s) is the total probability of all paths:
α(i, s) = Σ_{s'} α(i−1, s') * p(s | s') * p(obs at i | s)
how likely is it to get into state s this way? what are the immediate ways to get into state s? what’s the total probability up until now?
Q: What do we return? (How do we return the likelihood of the sequence?)
A: α[N+1][end]
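A runnable Python sketch of this forward recursion, using dictionaries instead of the array layout of the Viterbi pseudocode (treating 'start' and 'end' as explicit boundary states is my choice of encoding):

def forward(words, states, trans, emit):
    # alpha[i][s] = total probability of all paths that end in state s after emitting words[:i]
    alpha = [{s: 0.0 for s in states} for _ in range(len(words) + 1)]
    for s in states:                                     # first step: out of 'start'
        alpha[1][s] = trans[('start', s)] * emit[(s, words[0])]
    for i in range(2, len(words) + 1):                   # remaining steps
        for s in states:
            alpha[i][s] = sum(alpha[i - 1][sp] * trans[(sp, s)] for sp in states) * emit[(s, words[i - 1])]
    return sum(alpha[len(words)][s] * trans[(s, 'end')] for s in states)   # total likelihood of the word sequence

# with the N/V tables from the lattice example:
trans = {('start', 'N'): .7, ('start', 'V'): .2, ('N', 'end'): .05, ('V', 'end'): .05,
         ('N', 'N'): .15, ('N', 'V'): .8, ('V', 'N'): .6, ('V', 'V'): .35}
emit = {('N', 'w1'): .7, ('N', 'w2'): .2, ('N', 'w3'): .05, ('N', 'w4'): .05,
        ('V', 'w1'): .2, ('V', 'w2'): .6, ('V', 'w3'): .1,  ('V', 'w4'): .1}
print(forward(['w1', 'w2', 'w3', 'w4'], ['N', 'V'], trans, emit))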
Outline
Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
Forward & Backward Message Passing
zi-1 = C zi-1 = B zi-1 = A zi = C zi = B zi = A zi+1 = C zi+1 = B zi+1 = A
α(i, s) is the total probability of all paths: 1. that start from the beginning 2. that end (currently) in s at step i 3. that emit the observation obs at i β(i, s) is the total probability of all paths: 1. that start at step i at state s 2. that terminate at the end
3. (that emit the observation obs at i+1)
Forward & Backward Message Passing
zi-1 = C zi-1 = B zi-1 = A zi = C zi = B zi = A zi+1 = C zi+1 = B zi+1 = A
α(i, s) is the total probability of all paths: 1. that start from the beginning 2. that end (currently) in s at step i 3. that emit the observation obs at i β(i, s) is the total probability of all paths: 1. that start at step i at state s 2. that terminate at the end
3. (that emit the observation obs at i+1)
α(i, B) β(i, B)
Forward & Backward Message Passing
zi-1 = C zi-1 = B zi-1 = A zi = C zi = B zi = A zi+1 = C zi+1 = B zi+1 = A
α(i, s) is the total probability of all paths: 1. that start from the beginning 2. that end (currently) in s at step i 3. that emit the observation obs at i β(i, s) is the total probability of all paths: 1. that start at step i at state s 2. that terminate at the end
3. (that emit the observation obs at i+1)
α(i, B) β(i, B) α(i, B) * β(i, B) = total probability of paths through state B at step i
Forward & Backward Message Passing
zi-1 = C zi-1 = B zi-1 = A zi = C zi = B zi = A
α(i, s) is the total probability of all paths: 1. that start from the beginning 2. that end (currently) in s at step i 3. that emit the observation obs at i β(i, s) is the total probability of all paths: 1. that start at step i at state s 2. that terminate at the end
3. (that emit the observation obs at i+1)
α(i, B) β(i+1, s)
zi+1 = C zi+1 = B zi+1 = A
Forward & Backward Message Passing
zi-1 = C zi-1 = B zi-1 = A zi = C zi = B zi = A
α(i, B) β(i+1, s’)
zi+1 = C zi+1 = B zi+1 = A
α(i, s) is the total probability of all paths: 1. that start from the beginning 2. that end (currently) in s at step i 3. that emit the observation obs at i β(i, s) is the total probability of all paths: 1. that start at step i at state s 2. that terminate at the end
3. (that emit the observation obs at i+1)
α(i, B) * p(s' | B) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the B→s' arc (at time i)
With Both Forward and Backward Values
α(i, s) * p(s' | s) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the s→s' arc (at time i)
α(i, s) * β(i, s) = total probability of paths through state s at step i
p(z_i = s | w_1, ⋯, w_N) = α(i, s) * β(i, s) / α(N+1, END)
p(z_i = s, z_{i+1} = s' | w_1, ⋯, w_N) = α(i, s) * p(s' | s) * p(obs at i+1 | s') * β(i+1, s') / α(N+1, END)
Expectation Maximization (EM)
Two step, iterative algorithm
parameters
uncertain counts
estimated counts
Expectation Maximization (EM)
Two step, iterative algorithm
parameters
uncertain counts
estimated counts
pobs(w | s) ptrans(s’ | s)
Expectation Maximization (EM)
Two step, iterative algorithm
parameters
uncertain counts
estimated counts
pobs(w | s) ptrans(s’ | s)
p*(z_i = s | w_1, ⋯, w_N) = α(i, s) * β(i, s) / α(N+1, END)
p*(z_i = s, z_{i+1} = s' | w_1, ⋯, w_N) = α(i, s) * p(s' | s) * p(obs at i+1 | s') * β(i+1, s') / α(N+1, END)
M-Step
“maximize log-likelihood, assuming these uncertain counts”
p_new(s' | s) = c(s → s') / Σ_{s''} c(s → s'')
if we observed the hidden transitions…
M-Step
“maximize log-likelihood, assuming these uncertain counts”
p_new(s' | s) = E[c(s → s')] / Σ_{s''} E[c(s → s'')]
we don't observe the hidden transitions, but we can approximately count them
M-Step
“maximize log-likelihood, assuming these uncertain counts”
p_new(s' | s) = E[c(s → s')] / Σ_{s''} E[c(s → s'')]
we don't observe the hidden transitions, but we can approximately count them
we compute these in the E-step, with the forward (α) and backward (β) probabilities
EM For HMMs (Baum-Welch Algorithm)
α = computeForwards()
β = computeBackwards()
L = α[N+1][END]
for(i = N; i ≥ 0; --i) {
  for(next = 0; next < K*; ++next) {
    cobs(obs_{i+1} | next) += α[i+1][next] * β[i+1][next] / L
    for(state = 0; state < K*; ++state) {
      u = pobs(obs_{i+1} | next) * ptrans(next | state)
      ctrans(next | state) += α[i][state] * u * β[i+1][next] / L
    }
  }
}
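The matching M-step renormalizes these expected counts into new parameters, exactly as on the previous slides; a minimal Python sketch for the transitions (assuming the ctrans accumulator above is stored as a dict keyed by (state, next)):

from collections import defaultdict

def m_step_transitions(ctrans):
    # p_new(next | state) = E[c(state -> next)] / sum over next' of E[c(state -> next')]
    totals = defaultdict(float)
    for (state, _), c in ctrans.items():
        totals[state] += c
    return {(state, nxt): c / totals[state] for (state, nxt), c in ctrans.items()}

print(m_step_transitions({('N', 'N'): 1.5, ('N', 'V'): 4.5}))   # {('N','N'): 0.25, ('N','V'): 0.75}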
Bayesian Networks: Directed Acyclic Graphs
p(x_1, x_2, x_3, …, x_N) = ∏_i p(x_i | π(x_i))
π(x_i) = the "parents of" x_i; the factors follow a topological sort of the DAG
Bayesian Networks: Directed Acyclic Graphs
p(x_1, x_2, x_3, …, x_N) = ∏_i p(x_i | π(x_i))
exact inference in general DAGs is NP-hard inference in trees can be exact
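A toy Python sketch of reading a joint probability off a DAG factorization (the three-variable network a → b with (a, b) → c and its probability tables are invented for illustration):

# p(a, b, c) = p(a) p(b | a) p(c | a, b): one factor per node, conditioned on its parents
p_a = {True: 0.3, False: 0.7}
p_b_given_a = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}
p_c_given_ab = {(True, True): {True: 0.5, False: 0.5}, (True, False): {True: 0.4, False: 0.6},
                (False, True): {True: 0.7, False: 0.3}, (False, False): {True: 0.1, False: 0.9}}

def joint(a, b, c):
    return p_a[a] * p_b_given_a[a][b] * p_c_given_ab[(a, b)][c]

vals = (True, False)
print(joint(True, False, True))
print(sum(joint(a, b, c) for a in vals for b in vals for c in vals))   # sums to 1.0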
D-Separation: Testing for Conditional Independence
Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z
d-separation
X & Y are d-separated if for all paths P, one of the following is true:
1. P has a chain with an observed middle node
2. P has a fork with an observed parent node
3. P includes a "v-structure" or "collider" with all unobserved descendants
D-Separation: Testing for Conditional Independence
Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z
d-separation
X & Y are d-separated if for all paths P, one of the following is true:
1. P has a chain with an observed middle node: observing Z blocks the path from X to Y
2. P has a fork with an observed parent node: observing Z blocks the path from X to Y
3. P includes a "v-structure" or "collider" with all unobserved descendants: not observing Z blocks the path from X to Y
D-Separation: Testing for Conditional Independence
Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z
d-separation
X & Y are d-separated if for all paths P, one of the following is true:
1. P has a chain with an observed middle node: observing Z blocks the path from X to Y
2. P has a fork with an observed parent node: observing Z blocks the path from X to Y
3. P includes a "v-structure" or "collider" with all unobserved descendants: not observing Z blocks the path from X to Y
For the collider: p(x, y, z) = p(x) p(y) p(z | x, y), so p(x, y) = Σ_z p(x) p(y) p(z | x, y) = p(x) p(y)
Markov Blanket
The Markov blanket of a node x is its parents, children, and children's parents
p(x_i | x_{j≠i}) = p(x_1, …, x_N) / ∫ p(x_1, …, x_N) dx_i = ∏_k p(x_k | π(x_k)) / ∫ ∏_k p(x_k | π(x_k)) dx_i
factor out terms not dependent on xi
factorization
= ∏_{k: k=i or i∈π(x_k)} p(x_k | π(x_k)) / ∫ ∏_{k: k=i or i∈π(x_k)} p(x_k | π(x_k)) dx_i
the set of nodes needed to form the complete conditional for a variable xi