Probabilistic Modeling and Expectation Maximization (CMSC 678, UMBC)
Course Overview (so far)
Basics of Probability
Requirements to be a distribution (“proportional to”, ∝) Definitions of conditional probability, joint probability, and independence Bayes rule, (probability) chain rule Expectation (of a random variable & function)
Empirical Risk Minimization
Gradient Descent Loss Functions: what is it, what does it measure, and what are some computational difficulties with them? Regularization: what is it, how does it work, and why might you want it?
Tasks (High Level)
Data set splits: training vs. dev vs. test Classification: posterior decoding/MAP classifier Classification evaluations: accuracy, precision, recall, and F scores Regression (vs. classification) Comparing supervised vs. unsupervised learning and their tradeoffs: why might you want to use one vs. the other, and what are some potential issues? Clustering: high-level goal/task, K-means as an example Tradeoffs among clustering evaluations
Linear Models
Basic form of a linear model (classification or regression) Perceptron (simple vs. other variants, like averaged or voted) When you should use perceptron (what are its assumptions?) Perceptron as SGD
Maximum Entropy Models
Meanings of feature functions and weights How to learn the weights: gradient descent Meaning of the maxent gradient
Neural Networks
Relation to linear models and maxent Types (feedforward, CNN, RNN) Learning representations (e.g., "feature maps”) What is a convolution (e.g., 1D vs 2D, high-level notions of why you might want to change padding or the width) How to learn: gradient descent, backprop Common activation functions Neural network regularization
Dimensionality Reduction
What is the basic task & goal in dimensionality reduction? Dimensionality reduction tradeoffs: why might you want to, and what are some potential issues? Linear Discriminant Analysis vs. Principal Component Analysis: what are they trying to do, how are they similar, how do they differ?
Kernel Methods & SVMs
Feature expansion and kernels Two views: maximizing a separating hyperplane margin vs. loss optimization (norm minimization)
Non-separability & slack Sub-gradients
Remember from the first day: A Terminology Buffet
Classification Regression Clustering Fully-supervised Semi-supervised Un-supervised
Probabilistic Generative Conditional Spectral Neural Memory- based Exemplar …
the task: what kind of problem are you solving?
the data: amount of human input / number of labeled examples
the approach: how any data are being used
what we've currently sampled… and what we'll be sampling next…
Outline
Latent and probabilistic modeling Generative Modeling Example 1: A Model of Rolling a Die Example 2: A Model of Conditional Die Rolls EM (Expectation Maximization) Basic idea Three coins example Why EM works
What is (Generative) Probabilistic Modeling?
So far, we've (mostly) had labeled data pairs (x, y) and built classifiers p(y | x).
What if we want to model both x and y together? p(x, y)
Or what if we only have data but no labels? p(x)  (like A3 Q1; Piazza Q68)
Q: Where have we used p(x, y)? A: Linear Discriminant Analysis
Generative Stories
Generative stories are most often used with joint models p(x, y), but despite their name, generative stories are applicable to both generative and conditional models.
“A useful way to develop probabilistic models is to tell a generative story. This is a fictional story that explains how you believe your training data came into existence.” --- CIML Ch 9.5
p(x, y) vs. p(y | x): Models of our Data
p(x, y) is the joint distribution. Two main options for estimating it:
1. Directly
2. Using Bayes rule: p(x, y) = p(x | y) p(y)
Using Bayes rule transparently provides a generative story for how our data x and labels y are generated.
p(y | x) is the conditional distribution. Two main options for estimating it:
1. Directly: used when you only care about making the right prediction (examples: perceptron, logistic regression, neural networks, which we've covered)
2. Estimate the joint p(x, y), then obtain the conditional as p(y | x) = p(x, y) / p(x)
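To make the two estimation routes concrete, here is a small sketch (not from the slides; the example observations and labels are invented for illustration) that estimates p(x, y) via the generative story p(x | y) p(y) from labeled pairs, then recovers p(y | x) from the joint:

```python
from collections import Counter

# Toy labeled data: (x, y) pairs. These examples are invented for illustration.
data = [("rainy", "stay_in"), ("sunny", "go_out"), ("sunny", "go_out"),
        ("rainy", "stay_in"), ("sunny", "stay_in"), ("rainy", "go_out")]

n = len(data)
joint_counts = Counter(data)                  # counts of (x, y) pairs
label_counts = Counter(y for _, y in data)    # counts of y

# Option 2: estimate the joint via the generative story p(x, y) = p(x | y) p(y)
p_y = {y: c / n for y, c in label_counts.items()}
p_x_given_y = {(x, y): joint_counts[(x, y)] / label_counts[y] for (x, y) in joint_counts}
p_xy = {(x, y): p_x_given_y[(x, y)] * p_y[y] for (x, y) in joint_counts}

# From the joint we can always recover the conditional: p(y | x) = p(x, y) / p(x)
p_x = Counter()
for (x, y), p in p_xy.items():
    p_x[x] += p
p_y_given_x = {(x, y): p / p_x[x] for (x, y), p in p_xy.items()}

print(p_xy)
print(p_y_given_x)
```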
Outline
Latent and probabilistic modeling Generative Modeling Example 1: A Model of Rolling a Die Example 2: A Model of Conditional Die Rolls EM (Expectation Maximization) Basic idea Three coins example Why EM works
Example: Rolling a Die
p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
N different (independent) rolls: w_1 = 1, w_2 = 5, w_3 = 4, ⋯
Generative Story for Rolling a Die
p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
N different (independent) rolls: w_1 = 1, w_2 = 5, w_3 = 4, ⋯
Generative story: for roll i = 1 to N: w_i ∼ Cat(θ)
θ is a probability distribution over the 6 sides of the die: Σ_{k=1}^6 θ_k = 1 and 0 ≤ θ_k ≤ 1 for all k
The "for each" loop becomes a product; calculate p(w_i) according to the provided distribution θ.
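A minimal sketch of this generative story in Python (the particular θ values and N below are assumptions for illustration, not from the slides):

```python
import math
import random

theta = [0.1, 0.1, 0.2, 0.3, 0.2, 0.1]   # assumed die probabilities; must sum to 1
N = 10                                    # assumed number of independent rolls

# Generative story: for roll i = 1 to N, draw w_i ~ Cat(theta)
rolls = [random.choices(range(1, 7), weights=theta)[0] for _ in range(N)]

# The "for each" loop becomes a product: log p(w_1, ..., w_N) = sum_i log theta_{w_i}
log_likelihood = sum(math.log(theta[w - 1]) for w in rolls)
print(rolls, log_likelihood)
```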
Learning Parameters for the Die Model
p(w_1, w_2, …, w_N) = ∏_i p(w_i)
Maximize the (log-)likelihood to learn the probability parameters.
Q: Why is maximizing log-likelihood a reasonable thing to do? A: To develop a good model of what we observe.
Q: (for discrete observations) What loss function do we minimize to maximize log-likelihood? A: Cross-entropy.
Learning Parameters for the Die Model: Maximum Likelihood (Intuition)
p(w_1, w_2, …, w_N) = ∏_i p(w_i); maximize the (log-)likelihood to learn the probability parameters.
If you observe these 9 rolls (two 1s, one 2, one 3, three 4s, one 5, one 6)… what are "reasonable" estimates for p(w)?
Maximum likelihood estimates: p(1) = 2/9, p(2) = 1/9, p(3) = 1/9, p(4) = 3/9, p(5) = 1/9, p(6) = 1/9
Learning Parameters for the Die Model: Maximum Likelihood (Math)
p(w_1, w_2, …, w_N) = ∏_i p(w_i)   (N different, independent rolls)
Generative story: for roll i = 1 to N: w_i ∼ Cat(θ)
Maximize the log-likelihood: ℒ(θ) = Σ_i log p_θ(w_i) = Σ_i log θ_{w_i}
Q: What's an easy way to maximize this, as written exactly (even without calculus)?
A: Just keep increasing each θ_k; we know θ must be a distribution, but that constraint is not specified in the objective as written.
Now maximize the log-likelihood with the distribution constraint:
ℒ(θ) = Σ_i log θ_{w_i}   s.t.   Σ_{k=1}^6 θ_k = 1
(We could also include the inequality constraints 0 ≤ θ_k, but that complicates the problem and, right now, is not needed.)
Solve using Lagrange multipliers.
The Lagrangian is
ℱ(θ, λ) = Σ_i log θ_{w_i} − λ (Σ_{k=1}^6 θ_k − 1)
with partial derivatives
∂ℱ/∂θ_k = Σ_{i: w_i = k} 1/θ_{w_i} − λ
∂ℱ/∂λ = −Σ_{k=1}^6 θ_k + 1
Setting the derivatives to zero gives θ_k = (Σ_{i: w_i = k} 1) / λ, with the optimal λ chosen so that Σ_{k=1}^6 θ_k = 1.
Substituting back in:
θ_k = (Σ_{i: w_i = k} 1) / (Σ_k Σ_{i: w_i = k} 1) = N_k / N
i.e., the maximum likelihood estimate of θ_k is simply the fraction of the N rolls that came up k.
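A sketch of this closed-form estimate in Python. The nine rolls below are an assumption chosen to be consistent with the counts on the earlier intuition slide (two 1s, one 2, one 3, three 4s, one 5, one 6); the original slide showed the rolls only as pictures.

```python
from collections import Counter

rolls = [1, 1, 2, 3, 4, 4, 4, 5, 6]        # assumed data matching the intuition slide's counts
N = len(rolls)
counts = Counter(rolls)

# Maximum likelihood estimate: theta_k = N_k / N
theta_mle = {k: counts[k] / N for k in range(1, 7)}
print(theta_mle)   # approximately {1: 2/9, 2: 1/9, 3: 1/9, 4: 3/9, 5: 1/9, 6: 1/9}
```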
Outline
Latent and probabilistic modeling Generative Modeling Example 1: A Model of Rolling a Die Example 2: A Model of Conditional Die Rolls EM (Expectation Maximization) Basic idea Three coins example Why EM works
Example: Conditionally Rolling a Die
Add complexity to better explain what we see:
p(w_1, w_2, …, w_N) = ∏_i p(w_i)
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1 | z_1) ⋯ p(z_N) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i)
First flip a coin (z_1 = T, z_2 = H, ⋯), then roll a different die depending on the coin flip (w_1 = 1, w_2 = 5, ⋯).
Learning in Conditional Die Roll Model: Maximize (Log-)Likelihood
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i)
If you observe the z_i values, this is easy!
First: write the generative story.
λ = distribution over the coin (z); γ^(H) = distribution for the die used when the coin comes up heads; γ^(T) = distribution for the die used when the coin comes up tails
for item i = 1 to N: z_i ∼ Bernoulli(λ), then w_i ∼ Cat(γ^(z_i))
Second: turn the generative story into an objective, with Lagrange multiplier constraints:
ℱ = Σ_i (log λ_{z_i} + log γ^(z_i)_{w_i}) − η (Σ_{k=1}^2 λ_k − 1) − Σ_{k=1}^2 δ_k (Σ_{v=1}^6 γ^(k)_v − 1)
But if you don't observe the z_i values, this is not easy!
Example: Conditionally Rolling a Die
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1 | z_1) ⋯ p(z_N) p(w_N | z_N)
We don't actually observe the z values; we just see the items w.
Goal: maximize the marginalized (log-)likelihood
p(w_1, w_2, …, w_N) = (Σ_{z_1} p(z_1, w_1)) (Σ_{z_2} p(z_2, w_2)) ⋯ (Σ_{z_N} p(z_N, w_N))
If we did observe z, estimating the probability parameters would be easy… but we don't! :(
If we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don't! :(
Expectation Maximization: gives model estimation the needed "spark"
Outline
Latent and probabilistic modeling Generative Modeling Example 1: A Model of Rolling a Die Example 2: A Model of Conditional Die Rolls EM (Expectation Maximization) Basic idea Three coins example Why EM works
Expectation Maximization (EM)
0. Assume some value for your parameters
Two-step, iterative algorithm:
1. E-step: count under uncertainty (compute expectations)
2. M-step: maximize log-likelihood, assuming these uncertain counts
Expectation Maximization (EM): E-step
1. E-step: count under uncertainty, assuming the current parameters: compute expected counts of (z_i, w_i), weighting each possible z_i by the model's current distribution over it.
We've already seen this type of counting, when computing the gradient in maxent models.
Expectation Maximization (EM): M-step
2. M-step: maximize log-likelihood, assuming these uncertain (estimated) counts: use them to re-estimate the parameters, moving from p^(t)(z) to p^(t+1)(z).
EM Math
max_θ 𝔼_{z ∼ p_{θ^(t)}(·|w)} [log p_θ(z, w)]
Maximize the average log-likelihood of our complete data (z, w), averaged across all z according to how likely our current model thinks each z is.
Here θ^(t) are the current parameters, which define the posterior distribution over z; θ are the new parameters being chosen.
E-step: count under uncertainty. M-step: maximize log-likelihood.
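A schematic version of this loop (a sketch only: `run_em`, `e_step`, and `m_step` are illustrative names, not from the slides; the E-step and M-step functions would be written for a specific model, such as the coin-and-die example):

```python
def run_em(data, theta0, e_step, m_step, n_iters=50):
    """Generic EM skeleton: alternate expected counts (E-step) and re-estimation (M-step).

    e_step(data, theta): returns expected counts under the posterior p_theta(z | w)
    m_step(expected_counts): returns new parameters maximizing the expected
                             complete-data log-likelihood given those counts
    Both are supplied by the caller; this is only the outer loop.
    """
    theta = theta0
    for _ in range(n_iters):
        expected_counts = e_step(data, theta)   # E-step: count under uncertainty
        theta = m_step(expected_counts)         # M-step: maximize log-likelihood
    return theta
```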
Why EM? Un-Supervised Learning
NO labeled data (human annotated; relatively small/few examples); only unlabeled data (raw, not annotated; plentiful).
EM/generative models in this case can be seen as a type of clustering.
Why EM? Semi-Supervised Learning
A little labeled data (human annotated; relatively small/few examples) plus a lot of unlabeled data (raw, not annotated; plentiful): EM lets a generative model learn from both.
Outline
Latent and probabilistic modeling Generative Modeling Example 1: A Model of Rolling a Die Example 2: A Model of Conditional Die Rolls EM (Expectation Maximization) Basic idea Three coins example Why EM works
Three Coins Example
Imagine three coins. Flip the 1st coin (a penny). If heads: flip the 2nd coin (a dollar coin); if tails: flip the 3rd coin (a dime).
We only observe the outcome of the dollar or dime flip (record heads vs. tails); we don't observe the penny flip.
An analogy: observed: a, b, e, etc.; "We run the code" vs. "The run failed". Unobserved: part of speech? genre?
Three Coins Example
Penny: p(heads) = λ, p(tails) = 1 − λ. Dollar coin: p(heads) = γ, p(tails) = 1 − γ. Dime: p(heads) = ψ, p(tails) = 1 − ψ.
Three parameters to estimate: λ, γ, and ψ.
Generative Story for Three Coins
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i)
λ = distribution over the penny (z); γ = distribution for the dollar coin; ψ = distribution for the dime
Generative story: for item i = 1 to N:
z_i ∼ Bernoulli(λ)
if z_i = H: w_i ∼ Bernoulli(γ); else: w_i ∼ Bernoulli(ψ)
Three Coins Example
If all flips were observed: H H T H T H H T H T T T
Maximum likelihood estimates: penny p(heads) = 4/6, p(tails) = 2/6; dollar coin p(heads) = 1/4, p(tails) = 3/4; dime p(heads) = 1/2, p(tails) = 1/2.
Three Coins Example
But not all flips are observed, so set (current) parameter values: penny p(heads) = λ = .6, p(tails) = .4; dollar p(H) = .8, p(T) = .2; dime p(H) = .6, p(T) = .4.
H H T H T H H T H T T T
Use these values to compute posteriors, rewriting the joint using Bayes rule:
p(heads | observed item H) = p(H | heads) p(heads) / p(H), where p(H | heads) = .8 and p(T | heads) = .2
and the marginal likelihood is p(H) = p(H | heads) p(heads) + p(H | tails) p(tails) = .8 · .6 + .6 · .4
Three Coins Example
p(heads | obs. H) = p(H | heads) p(heads) / p(H) = (.8 · .6) / (.8 · .6 + .6 · .4) ≈ 0.667
p(heads | obs. T) = p(T | heads) p(heads) / p(T) = (.2 · .6) / (.2 · .6 + .6 · .4) ≈ 0.334
Q: Is p(heads | obs. H) + p(heads | obs. T) = 1? A: No.
Three Coins Example
Use the posteriors to update the parameters.
Fully observed setting: p(heads) = (# heads from penny) / (# total flips of penny).
Our setting is partially observed, so use expected counts:
p^(t+1)(heads) = 𝔼_{p^(t)}[# heads from penny] / (# total flips of penny) = (2 · p(heads | obs. H) + 4 · p(heads | obs. T)) / 6 ≈ 0.444
(In general, p(heads | obs. H) and p(heads | obs. T) do NOT sum to 1.)
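A numeric sketch of one full EM iteration for the three-coins model. The starting parameter values and the observed counts below are assumptions for illustration (they are not the slide's values), and the slide only shows the λ update; the γ and ψ updates follow the same expected-count pattern.

```python
# Assumed current parameters (illustrative values)
lam, gamma, psi = 0.5, 0.9, 0.2      # P(penny=H), P(obs=H | penny=H), P(obs=H | penny=T)
n_H, n_T = 2, 4                      # assumed observed outcomes: 2 heads, 4 tails

# E-step: posterior that the hidden penny was heads, given each kind of observation
post_H = (gamma * lam) / (gamma * lam + psi * (1 - lam))                    # p(heads | obs. H)
post_T = ((1 - gamma) * lam) / ((1 - gamma) * lam + (1 - psi) * (1 - lam))  # p(heads | obs. T)

# M-step: expected counts divided by totals
exp_heads = n_H * post_H + n_T * post_T
lam_new = exp_heads / (n_H + n_T)
# dollar coin: expected H outcomes among flips attributed to penny = heads
gamma_new = (n_H * post_H) / exp_heads
# dime: expected H outcomes among flips attributed to penny = tails
psi_new = (n_H * (1 - post_H)) / ((n_H + n_T) - exp_heads)
print(post_H, post_T, lam_new, gamma_new, psi_new)
```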
Expectation Maximization (EM)
0. Assume some value for your parameters
Two-step, iterative algorithm:
1. E-step: count under uncertainty (compute expectations)
2. M-step: maximize log-likelihood, assuming these uncertain counts
Outline
Latent and probabilistic modeling Generative Modeling Example 1: A Model of Rolling a Die Example 2: A Model of Conditional Die Rolls EM (Expectation Maximization) Basic idea Three coins example Why EM works
Why does EM work?
X: observed data; Y: unobserved data
𝒞(θ) = log-likelihood of the complete data (X, Y); 𝒫(θ) = posterior log-likelihood of the incomplete data Y; ℒ(θ) = marginal log-likelihood of the observed data X.
What do 𝒞, ℒ, and 𝒫 look like?
𝒞(θ) = Σ_i log p(x_i, y_i)
ℒ(θ) = Σ_i log p(x_i) = Σ_i log Σ_k p(x_i, y = k)
𝒫(θ) = Σ_i log p(y_i | x_i)
Why does EM work?
By the definition of conditional probability, p_θ(Y | X) = p_θ(X, Y) / p_θ(X), so (algebra) p_θ(X) = p_θ(X, Y) / p_θ(Y | X).
Taking logs and summing over the data: ℒ(θ) = 𝒞(θ) − 𝒫(θ)
Why does EM work?
Take a conditional expectation over Y ∼ p_{θ^(t)}(· | X) (why this expectation? we'll cover it more in variational inference):
𝔼_{Y∼θ^(t)}[ℒ(θ) | X] = 𝔼_{Y∼θ^(t)}[𝒞(θ) | X] − 𝔼_{Y∼θ^(t)}[𝒫(θ) | X]
ℒ(θ) already sums out the unobserved Y, so the expectation leaves it unchanged:
ℒ(θ) = 𝔼_{Y∼θ^(t)}[𝒞(θ) | X] − 𝔼_{Y∼θ^(t)}[𝒫(θ) | X]
Why does EM work?
The first term is the expected complete-data log-likelihood:
𝔼_{Y∼θ^(t)}[𝒞(θ) | X] = Σ_i Σ_k p_{θ^(t)}(y = k | x_i) log p_θ(x_i, y = k)
Why does EM work?
Write ℒ(θ) = Q(θ, θ^(t)) − R(θ, θ^(t)), where Q(θ, θ^(t)) = 𝔼_{Y∼θ^(t)}[𝒞(θ) | X] and R(θ, θ^(t)) = 𝔼_{Y∼θ^(t)}[𝒫(θ) | X].
Let θ* be the value that maximizes Q(θ, θ^(t)). Then
ℒ(θ*) − ℒ(θ^(t)) = (Q(θ*, θ^(t)) − Q(θ^(t), θ^(t))) − (R(θ*, θ^(t)) − R(θ^(t), θ^(t)))
The first difference is ≥ 0, since θ* maximizes Q; the second is ≤ 0 (we'll see why with Jensen's inequality, in variational inference).
Therefore ℒ(θ*) − ℒ(θ^(t)) ≥ 0: EM does not decrease the marginal log-likelihood.
Generalized EM
Partial M-step: find a θ that simply increases, rather than maximizes, Q.
Partial E-step: only consider some of the variables (an online learning algorithm).
EM has its pitfalls
The objective is not convex, so EM may converge to a bad local optimum.
Computing expectations can be hard: the E-step could require clever algorithms.
How well does log-likelihood correlate with an end task?
A Maximization-Maximization Procedure
F(θ, q) = 𝔼_q[𝒞(θ)] − 𝔼_q[log q(y)]
where 𝒞(θ) is the complete-data log-likelihood and q is any distribution over the unobserved variables; F lower-bounds the observed-data log-likelihood.
We'll see this again with variational inference.
Latent and probabilistic modeling Generative Modeling Example 1: A Model of Rolling a Die Example 2: A Model of Conditional Die Rolls EM (Expectation Maximization) Basic idea Three coins example Why EM works