

SLIDE 1

Probabilistic Modeling and Expectation Maximization

CMSC 678 UMBC

SLIDE 2

Basics of Probability

Requirements to be a distribution (“proportional to”, ∝) Definitions of conditional probability, joint probability, and independence Bayes rule, (probability) chain rule Expectation (of a random variable & function)

Empirical Risk Minimization

Gradient Descent Loss Functions: what is it, what does it measure, and what are some computational difficulties with them? Regularization: what is it, how does it work, and why might you want it?

Tasks (High Level)

Data set splits: training vs. dev vs. test Classification: posterior decoding/MAP classifier Classification evaluations: accuracy, precision, recall, and F scores Regression (vs. classification) Comparing supervised vs. unsupervised learning and their tradeoffs: why might you want to use one vs. the other, and what are some potential issues? Clustering: high-level goal/task, K-means as an example Tradeoffs among clustering evaluations

Linear Models

Basic form of a linear model (classification or regression) Perceptron (simple vs. other variants, like averaged or voted) When you should use perceptron (what are its assumptions?) Perceptron as SGD

Maximum Entropy Models

Meanings of feature functions and weights How to learn the weights: gradient descent Meaning of the maxent gradient

Neural Networks

Relation to linear models and maxent Types (feedforward, CNN, RNN) Learning representations (e.g., "feature maps”) What is a convolution (e.g., 1D vs 2D, high-level notions of why you might want to change padding or the width) How to learn: gradient descent, backprop Common activation functions Neural network regularization

Dimensionality Reduction

What is the basic task & goal in dimensionality reduction? Dimensionality reduction tradeoffs: why might you want to, and what are some potential issues? Linear Discriminant Analysis vs. Principal Component Analysis: what are they trying to do, how are they similar, how do they differ?

Kernel Methods & SVMs

Feature expansion and kernels
Two views: maximizing a separating hyperplane margin vs. loss optimization (norm minimization)
Non-separability & slack
Sub-gradients

Course Overview (so far)

SLIDE 3

Remember from the first day: A Terminology Buffet

Classification Regression Clustering Fully-supervised Semi-supervised Un-supervised

Probabilistic Generative Conditional Spectral Neural Memory- based Exemplar …

the data: amount of human input / number of labeled examples
the approach: how the data are being used
the task: what kind of problem are you solving?
what we’ve currently sampled…

SLIDE 4

Remember from the first day: A Terminology Buffet

Classification Regression Clustering Fully-supervised Semi-supervised Un-supervised

Probabilistic Generative Conditional Spectral Neural Memory- based Exemplar …

the data: amount of human input / number of labeled examples
the approach: how the data are being used
the task: what kind of problem are you solving?
what we’ve currently sampled… what we’ll be sampling next…

SLIDE 5

Outline

Latent and probabilistic modeling
Generative Modeling
  Example 1: A Model of Rolling a Die
  Example 2: A Model of Conditional Die Rolls
EM (Expectation Maximization)
  Basic idea
  Three coins example
  Why EM works

SLIDE 6

What is (Generative) Probabilistic Modeling?

So far, we’ve (mostly) had labeled data pairs (x, y), and built classifiers p(y | x)

SLIDE 7

What is (Generative) Probabilistic Modeling?

So far, we’ve (mostly) had labeled data pairs (x, y), and built classifiers p(y | x)
What if we want to model both x and y together? p(x, y)

SLIDE 8

What is (Generative) Probabilistic Modeling?

So far, we’ve (mostly) had labeled data pairs (x, y), and built classifiers p(y | x)
What if we want to model both x and y together? p(x, y)
Q: Where have we used p(x, y)?

SLIDE 9

What is (Generative) Probabilistic Modeling?

So far, we’ve (mostly) had labeled data pairs (x, y), and built classifiers p(y | x)
What if we want to model both x and y together? p(x, y)
Q: Where have we used p(x, y)? A: Linear Discriminant Analysis

SLIDE 10

What is (Generative) Probabilistic Modeling?

So far, we’ve (mostly) had labeled data pairs (x, y), and built classifiers p(y | x)
What if we want to model both x and y together? p(x, y)
Or what if we only have data but no labels? p(x)
  • Like A3, Q1
  • Piazza Q68
Q: Where have we used p(x, y)? A: Linear Discriminant Analysis

SLIDE 11

Generative Stories

“A useful way to develop probabilistic models is to tell a generative story. This is a fictional story that explains how you believe your training data came into existence.” --- CIML Ch 9.5

SLIDE 12

Generative Stories

Generative stories are most often used with joint models p(x, y)… but despite their name, generative stories are applicable to both generative and conditional models

“A useful way to develop probabilistic models is to tell a generative story. This is a fictional story that explains how you believe your training data came into existence.” --- CIML Ch 9.5

SLIDE 13

p(x, y) vs. p(y | x): Models of our Data

p(x, y) is the joint distribution
Two main options for estimating:
1. Directly
2. …

SLIDE 14

p(x, y) vs. p(y | x): Models of our Data

p(x, y) is the joint distribution
Two main options for estimating:
1. Directly
2. Using Bayes rule: p(x, y) = p(x | y) p(y)
Using Bayes rule transparently provides a generative story for how our data x and labels y are generated

SLIDE 15

p(x,y) vs. p(y | x): Models of our Data

p(x, y) is the joint distribution
Two main options for estimating:
1. Directly
2. Using Bayes rule: p(x, y) = p(x | y) p(y)
Using Bayes rule transparently provides a generative story for how our data x and labels y are generated
p(y | x) is the conditional distribution
Two main options for estimating:
1. Directly: used when you only care about making the right prediction
   Examples: perceptron, logistic regression, neural networks (we’ve covered)
2. …

SLIDE 16

p(x,y) vs. p(y | x): Models of our Data

p(x, y) is the joint distribution
Two main options for estimating:
1. Directly
2. Using Bayes rule: p(x, y) = p(x | y) p(y)
Using Bayes rule transparently provides a generative story for how our data x and labels y are generated
p(y | x) is the conditional distribution
Two main options for estimating:
1. Directly: used when you only care about making the right prediction
   Examples: perceptron, logistic regression, neural networks (we’ve covered)
2. Estimate the joint

SLIDE 17

Outline

Latent and probabilistic modeling
Generative Modeling
  Example 1: A Model of Rolling a Die
  Example 2: A Model of Conditional Die Rolls
EM (Expectation Maximization)
  Basic idea
  Three coins example
  Why EM works

SLIDE 18

Example: Rolling a Die

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

SLIDE 19

Example: Rolling a Die

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

x_1 = 1, x_2 = 5, x_3 = 4, ⋯

SLIDE 20

Generative Story for Rolling a Die

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

x_1 = 1, x_2 = 5, x_3 = 4, ⋯

for roll j = 1 to N:

Generative Story

SLIDE 21

Generative Story for Rolling a Die

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

x_1 = 1, x_2 = 5, x_3 = 4, ⋯

for roll j = 1 to N: x_j ∼ Cat(θ)

Generative Story

SLIDE 22

Generative Story for Rolling a Die

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

x_1 = 1, x_2 = 5, x_3 = 4, ⋯

for roll j = 1 to N: x_j ∼ Cat(θ)

Generative Story: the “for each” loop becomes a product; calculate p(x_j) according to the provided distribution

SLIDE 23

Generative Story for Rolling a Die

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

x_1 = 1, x_2 = 5, x_3 = 4, ⋯

for roll j = 1 to N: x_j ∼ Cat(θ)

Generative Story: θ is a probability distribution over the 6 sides of the die:
∑_{k=1}^{6} θ_k = 1,   0 ≤ θ_k ≤ 1 ∀k
the “for each” loop becomes a product; calculate p(x_j) according to the provided distribution
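As a concrete sketch of this story in code (Python with NumPy; the particular θ and N below are illustrative values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.full(6, 1 / 6)  # distribution over the 6 sides: sums to 1, entries in [0, 1]
N = 9                      # number of independent rolls

# the generative story: for roll j = 1 to N, draw x_j ~ Cat(theta)
rolls = rng.choice(np.arange(1, 7), size=N, p=theta)
print(rolls)
```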

SLIDE 24

Learning Parameters for the Die Model

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

maximize (log-) likelihood to learn the probability parameters

Q: Why is maximizing log-likelihood a reasonable thing to do?

SLIDE 25

Learning Parameters for the Die Model

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

maximize (log-) likelihood to learn the probability parameters

Q: Why is maximizing log-likelihood a reasonable thing to do? A: Develop a good model for what we observe

SLIDE 26

Learning Parameters for the Die Model

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

maximize (log-) likelihood to learn the probability parameters

Q: Why is maximizing log-likelihood a reasonable thing to do? A: Develop a good model for what we observe
Q: (for discrete observations) What loss function do we minimize to maximize log-likelihood?

SLIDE 27

Learning Parameters for the Die Model

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

maximize (log-) likelihood to learn the probability parameters

Q: Why is maximizing log-likelihood a reasonable thing to do? A: Develop a good model for what we observe
Q: (for discrete observations) What loss function do we minimize to maximize log-likelihood? A: Cross-entropy
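To spell out that connection (a standard identity, not written out on the slides): writing p̂ for the empirical distribution of the observed rolls,

```latex
\frac{1}{N}\sum_{j=1}^{N} \log p_\theta(x_j)
  \;=\; \sum_{k=1}^{6} \hat{p}(k)\,\log \theta_k
  \;=\; -\,H(\hat{p}, \theta),
```

so maximizing the average log-likelihood is exactly minimizing the cross-entropy between the empirical distribution and the model.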

SLIDE 28

Learning Parameters for the Die Model: Maximum Likelihood (Intuition)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

maximize (log-) likelihood to learn the probability parameters

p(1) = ?  p(2) = ?  p(3) = ?  p(4) = ?  p(5) = ?  p(6) = ?

If you observe these 9 rolls… what are “reasonable” estimates for p(x)?

SLIDE 29

Learning Parameters for the Die Model: Maximum Likelihood (Intuition)

p(1) = 2/9  p(2) = 1/9  p(3) = 1/9  p(4) = 3/9  p(5) = 1/9  p(6) = 1/9
maximum likelihood estimates

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

maximize (log-) likelihood to learn the probability parameters

If you observe these 9 rolls… what are “reasonable” estimates for p(x)?

SLIDE 30

Learning Parameters for the Die Model: Maximum Likelihood (Math)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

x_1 = 1, x_2 = 5, x_3 = 4, ⋯

for roll j = 1 to N: x_j ∼ Cat(θ)

Generative Story

ℒ(θ) = ∑_j log p_θ(x_j) = ∑_j log θ_{x_j}

Maximize Log-likelihood

SLIDE 31

Learning Parameters for the Die Model: Maximum Likelihood (Math)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

for roll j = 1 to N: x_j ∼ Cat(θ)

Generative Story

ℒ(θ) = ∑_j log θ_{x_j}

Maximize Log-likelihood. Q: What’s an easy way to maximize this, as written exactly (even without calculus)?

SLIDE 32

Learning Parameters for the Die Model: Maximum Likelihood (Math)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

for roll j = 1 to N: x_j ∼ Cat(θ)

Generative Story

ℒ(θ) = ∑_j log θ_{x_j}

Maximize Log-likelihood. Q: What’s an easy way to maximize this, as written exactly (even without calculus)? A: Just keep increasing θ_k (we know θ must be a distribution, but that constraint isn’t part of the objective as written)

SLIDE 33

Learning Parameters for the Die Model: Maximum Likelihood (Math)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

ℒ(θ) = ∑_j log θ_{x_j}   s.t.   ∑_{k=1}^{6} θ_k = 1

Maximize Log-likelihood (with distribution constraints)

(we can include the inequality constraints 0 ≤ θ_k, but it complicates the problem and, right now, is not needed)

solve using Lagrange multipliers

SLIDE 34

Learning Parameters for the Die Model: Maximum Likelihood (Math)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

ℱ(θ) = ∑_j log θ_{x_j} − λ ( ∑_{k=1}^{6} θ_k − 1 )

Maximize Log-likelihood (with distribution constraints)

(we can include the inequality constraints 0 ≤ θ_k, but it complicates the problem and, right now, is not needed)

∂ℱ(θ)/∂θ_k = ∑_{j: x_j = k} 1/θ_{x_j} − λ        ∂ℱ(θ)/∂λ = − ∑_{k=1}^{6} θ_k + 1

SLIDE 35

Learning Parameters for the Die Model: Maximum Likelihood (Math)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

ℱ(θ) = ∑_j log θ_{x_j} − λ ( ∑_{k=1}^{6} θ_k − 1 )

Maximize Log-likelihood (with distribution constraints)

(we can include the inequality constraints 0 ≤ θ_k, but it complicates the problem and, right now, is not needed)

θ_k = ( ∑_{j: x_j = k} 1 ) / λ
optimal λ when ∑_{k=1}^{6} θ_k = 1

SLIDE 36

Learning Parameters for the Die Model: Maximum Likelihood (Math)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

ℱ(θ) = ∑_j log θ_{x_j} − λ ( ∑_{k=1}^{6} θ_k − 1 )

Maximize Log-likelihood (with distribution constraints)

(we can include the inequality constraints 0 ≤ θ_k, but it complicates the problem and, right now, is not needed)

θ_k = ( ∑_{j: x_j = k} 1 ) / ( ∑_k ∑_{j: x_j = k} 1 ) = N_k / N
optimal λ when ∑_{k=1}^{6} θ_k = 1
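The closed form above is just counting. A minimal sketch (Python with NumPy), using the nine rolls from the earlier intuition slide:

```python
import numpy as np

rolls = np.array([1, 1, 2, 3, 4, 4, 4, 5, 6])  # the 9 observed rolls
N_k = np.bincount(rolls, minlength=7)[1:]      # N_k: how often each side k = 1..6 came up
theta_hat = N_k / N_k.sum()                    # theta_k = N_k / N
print(theta_hat)   # [2/9, 1/9, 1/9, 3/9, 1/9, 1/9], matching the intuition slide
```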

SLIDE 37

Outline

Latent and probabilistic modeling
Generative Modeling
  Example 1: A Model of Rolling a Die
  Example 2: A Model of Conditional Die Rolls
EM (Expectation Maximization)
  Basic idea
  Three coins example
  Why EM works

SLIDE 38

Example: Conditionally Rolling a Die

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)
p(z_1, x_1, z_2, x_2, …, z_N, x_N) = p(z_1) p(x_1 | z_1) ⋯ p(z_N) p(x_N | z_N) = ∏_j p(x_j | z_j) p(z_j)

add complexity to better explain what we see

SLIDE 39

Example: Conditionally Rolling a Die

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)
p(z_1, x_1, z_2, x_2, …, z_N, x_N) = p(z_1) p(x_1 | z_1) ⋯ p(z_N) p(x_N | z_N) = ∏_j p(x_j | z_j) p(z_j)

z_1 = T, z_2 = H, ⋯

First flip a coin…

add complexity to better explain what we see

SLIDE 40

Example: Conditionally Rolling a Die

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)
p(z_1, x_1, z_2, x_2, …, z_N, x_N) = p(z_1) p(x_1 | z_1) ⋯ p(z_N) p(x_N | z_N) = ∏_j p(x_j | z_j) p(z_j)

add complexity to better explain what we see

x_1 = 1, x_2 = 5, ⋯    z_1 = T, z_2 = H

First flip a coin… …then roll a different die depending on the coin flip

SLIDE 41

Learning in Conditional Die Roll Model: Maximize (Log-)Likelihood

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)
p(z_1, x_1, z_2, x_2, …, z_N, x_N) = p(z_1) p(x_1 | z_1) ⋯ p(z_N) p(x_N | z_N) = ∏_j p(x_j | z_j) p(z_j)

add complexity to better explain what we see

If you observe the z_j values, this is easy!

SLIDE 42

Learning in Conditional Die Roll Model: Maximize (Log-)Likelihood

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_j p(x_j | z_j) p(z_j)

If you observe the z_j values, this is easy!

First: Write the Generative Story

λ = distribution over the coin (z)
γ^(H) = distribution for the die when the coin comes up heads
γ^(T) = distribution for the die when the coin comes up tails
for item j = 1 to N:
  z_j ∼ Bernoulli(λ)
  x_j ∼ Cat(γ^(z_j))
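A sketch of this story in code (Python with NumPy; λ and the two die distributions are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

lam = 0.6                                            # p(coin = heads); illustrative
gamma = {"H": np.full(6, 1 / 6),                     # die used when the coin is heads
         "T": np.array([.5, .1, .1, .1, .1, .1])}    # die used when the coin is tails

N = 5
for j in range(N):
    z_j = "H" if rng.random() < lam else "T"         # z_j ~ Bernoulli(lam)
    x_j = rng.choice(np.arange(1, 7), p=gamma[z_j])  # x_j ~ Cat(gamma^(z_j))
    print(z_j, x_j)
```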

SLIDE 43

Learning in Conditional Die Roll Model: Maximize (Log-)Likelihood

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_j p(x_j | z_j) p(z_j)

If you observe the z_j values, this is easy!

First: Write the Generative Story

λ = distribution over the coin (z); γ^(H) = distribution for the H die; γ^(T) = distribution for the T die
for item j = 1 to N: z_j ∼ Bernoulli(λ); x_j ∼ Cat(γ^(z_j))

Second: Generative Story → Objective

ℱ(θ) = ∑_{j=1}^{N} ( log λ_{z_j} + log γ^{(z_j)}_{x_j} ) − α ( ∑_{k=1}^{2} λ_k − 1 ) − ∑_{k=1}^{2} β_k ( ∑_{m=1}^{6} γ^{(k)}_m − 1 )
Lagrange multiplier constraints

SLIDE 44

Learning in Conditional Die Roll Model: Maximize (Log-)Likelihood

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_j p(x_j | z_j) p(z_j)

If you observe the z_j values, this is easy!

First: Write the Generative Story

λ = distribution over the coin (z); γ^(H) = distribution for the H die; γ^(T) = distribution for the T die
for item j = 1 to N: z_j ∼ Bernoulli(λ); x_j ∼ Cat(γ^(z_j))

Second: Generative Story → Objective

ℱ(θ) = ∑_{j=1}^{N} ( log λ_{z_j} + log γ^{(z_j)}_{x_j} ) − α ( ∑_{k=1}^{2} λ_k − 1 ) − ∑_{k=1}^{2} β_k ( ∑_{m=1}^{6} γ^{(k)}_m − 1 )
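With the z_j observed, maximizing this constrained objective again reduces to per-outcome counting, just as in the single-die case. A minimal sketch (Python with NumPy; the (z, x) pairs below are hypothetical data, not from the slides):

```python
import numpy as np

z = np.array([1, 0, 1, 1, 0, 1])   # observed coin flips (1 = H, 0 = T); hypothetical
x = np.array([3, 1, 1, 5, 2, 3])   # observed die rolls, sides 1..6; hypothetical

lam_hat = np.array([(z == 0).mean(), (z == 1).mean()])  # MLE of the coin distribution [T, H]
gamma_hat = np.stack([
    np.bincount(x[z == k], minlength=7)[1:] / (z == k).sum()  # MLE of die k's distribution
    for k in (0, 1)
])
print(lam_hat)
print(gamma_hat)
```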

SLIDE 45

Learning in Conditional Die Roll Model: Maximize (Log-)Likelihood

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_j p(x_j | z_j) p(z_j)

If you observe the z_j values, this is easy!

First: Write the Generative Story

λ = distribution over the coin (z); γ^(H) = distribution for the H die; γ^(T) = distribution for the T die
for item j = 1 to N: z_j ∼ Bernoulli(λ); x_j ∼ Cat(γ^(z_j))

Second: Generative Story → Objective

ℱ(θ) = ∑_{j=1}^{N} ( log λ_{z_j} + log γ^{(z_j)}_{x_j} ) − α ( ∑_{k=1}^{2} λ_k − 1 ) − ∑_{k=1}^{2} β_k ( ∑_{m=1}^{6} γ^{(k)}_m − 1 )

But if you don’t observe the z_j values, this is not easy!

SLIDE 46

Example: Conditionally Rolling a Die

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_j p(x_j | z_j) p(z_j)

if we did observe z, estimating the probability parameters would be easy… but we don’t! :(
we don’t actually observe these z values; we just see the items x
goal: maximize (log-)likelihood

SLIDE 47

Example: Conditionally Rolling a Die

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_j p(x_j | z_j) p(z_j)

we don’t actually observe these z values; we just see the items x
goal: maximize (log-)likelihood
if we knew the probability parameters, then we could estimate z and evaluate likelihood… but we don’t! :(
if we did observe z, estimating the probability parameters would be easy… but we don’t! :(

SLIDE 48

Example: Conditionally Rolling a Die

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_j p(x_j | z_j) p(z_j)

we don’t actually observe these z values
goal: maximize marginalized (log-)likelihood

SLIDE 49

Example: Conditionally Rolling a Die

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_j p(x_j | z_j) p(z_j)

we don’t actually observe these z values
goal: maximize marginalized (log-)likelihood

SLIDE 50

Example: Conditionally Rolling a Die

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_j p(x_j | z_j) p(z_j)

we don’t actually observe these z values
goal: maximize marginalized (log-)likelihood: sum the joint p(z_j, x_j) over each possible value of z_j

SLIDE 51

Example: Conditionally Rolling a Die

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_j p(x_j | z_j) p(z_j)

we don’t actually observe these z values
goal: maximize marginalized (log-)likelihood: sum the joint p(z_j, x_j) over each possible value of z_j

p(x_1, x_2, …, x_N) = ( ∑_{z_1} p(z_1, x_1) ) ( ∑_{z_2} p(z_2, x_2) ) ⋯ ( ∑_{z_N} p(z_N, x_N) )

SLIDE 52

Example: Conditionally Rolling a Die

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = p(z_1) p(x_1 | z_1) ⋯ p(z_N) p(x_N | z_N)

goal: maximize marginalized (log-)likelihood: sum the joint p(z_j, x_j) over each possible value of z_j

p(x_1, x_2, …, x_N) = ( ∑_{z_1} p(z_1, x_1) ) ( ∑_{z_2} p(z_2, x_2) ) ⋯ ( ∑_{z_N} p(z_N, x_N) )

if we did observe z, estimating the probability parameters would be easy… but we don’t! :(
if we knew the probability parameters, then we could estimate z and evaluate likelihood… but we don’t! :(
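Numerically, the marginalized likelihood is just a sum over z inside each factor. A small sketch (plain Python; the parameter values and rolls here are illustrative):

```python
import math

lam = 0.6                                # p(z = H); illustrative
gamma = {"H": [1 / 6] * 6,               # p(x | z = H): a fair die
         "T": [.5, .1, .1, .1, .1, .1]}  # p(x | z = T): a loaded die

x = [1, 5, 4]                            # observed rolls; z is never seen

# log p(x_1..x_N) = sum_j log sum_{z_j} p(z_j) p(x_j | z_j)
loglik = sum(
    math.log(lam * gamma["H"][xj - 1] + (1 - lam) * gamma["T"][xj - 1])
    for xj in x
)
print(loglik)
```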


SLIDE 55

http://blog.innotas.com/wp-content/uploads/2015/08/chicken-or-egg-cropped1.jpg

if we did observe z, estimating the probability parameters would be easy… but we don’t! :(
if we knew the probability parameters, then we could estimate z and evaluate likelihood… but we don’t! :(

Expectation Maximization: gives model estimation the needed “spark”

SLIDE 56

Outline

Latent and probabilistic modeling
Generative Modeling
  Example 1: A Model of Rolling a Die
  Example 2: A Model of Conditional Die Rolls
EM (Expectation Maximization)
  Basic idea
  Three coins example
  Why EM works

SLIDE 57

Expectation Maximization (EM)

Two-step, iterative algorithm:
0. Assume some value for your parameters
1. E-step: count under uncertainty (compute expectations)
2. M-step: maximize log-likelihood, assuming these uncertain counts

SLIDE 58

Expectation Maximization (EM): E-step

Two-step, iterative algorithm:
0. Assume some value for your parameters
1. E-step: count under uncertainty, assuming these parameters
2. M-step: maximize log-likelihood, assuming these uncertain counts
count(z_j, x_j) weighted by p(z_j)

SLIDE 59

Expectation Maximization (EM): E-step

Two-step, iterative algorithm:
0. Assume some value for your parameters
1. E-step: count under uncertainty, assuming these parameters
2. M-step: maximize log-likelihood, assuming these uncertain counts
count(z_j, x_j) weighted by p(z_j)

We’ve already seen this type of counting, when computing the gradient in maxent models.

SLIDE 60

Expectation Maximization (EM): M-step

Two-step, iterative algorithm:
0. Assume some value for your parameters
1. E-step: count under uncertainty, assuming these parameters
2. M-step: maximize log-likelihood, assuming these uncertain counts
p^(t)(z) → p^(t+1)(z), using the estimated counts

SLIDE 61

EM Math

max_θ  E_{z ∼ p_{θ^(t)}(·|x)} [ log p_θ(z, x) ]

the average log-likelihood of our complete data (z, x), averaged across all z and according to how likely our current model thinks z is

SLIDE 62

EM Math

max_θ  E_{z ∼ p_{θ^(t)}(·|x)} [ log p_θ(z, x) ]

maximize the average log-likelihood of our complete data (z, x), averaged across all z and according to how likely our current model thinks z is


SLIDE 64

EM Math

max_θ  E_{z ∼ p_{θ^(t)}(·|x)} [ log p_θ(z, x) ]
p_{θ^(t)}(·|x): the posterior distribution under the current parameters
maximize the average log-likelihood of our complete data (z, x), averaged across all z and according to how likely our current model thinks z is

SLIDE 65

EM Math

max_θ  E_{z ∼ p_{θ^(t)}(·|x)} [ log p_θ(z, x) ]
θ^(t): current parameters; θ: new parameters; p_{θ^(t)}(·|x): posterior distribution
maximize the average log-likelihood of our complete data (z, x), averaged across all z and according to how likely our current model thinks z is

SLIDE 66

EM Math

max_θ  E_{z ∼ p_{θ^(t)}(·|x)} [ log p_θ(z, x) ]
E-step: count under uncertainty (the expectation under p_{θ^(t)}(·|x)); M-step: maximize log-likelihood (the max over θ)
θ^(t): current parameters; θ: new parameters; p_{θ^(t)}(·|x): posterior distribution
maximize the average log-likelihood of our complete data (z, x), averaged across all z and according to how likely our current model thinks z is

SLIDE 67

Why EM? Un-Supervised Learning

(figure: a scatter of unlabeled data points, all marked “?”)
NO labeled data (labeled data = human annotated; relatively small/few examples)
unlabeled data: raw, not annotated; plentiful
EM/generative models in this case can be seen as a type of clustering

SLIDE 68

Why EM? Semi-Supervised Learning
(figure: a few labeled points alongside many unlabeled “?” points; EM leverages both)
labeled data: human annotated; relatively small/few examples
unlabeled data: raw, not annotated; plentiful
SLIDE 72

Outline

Latent and probabilistic modeling
Generative Modeling
  Example 1: A Model of Rolling a Die
  Example 2: A Model of Conditional Die Rolls
EM (Expectation Maximization)
  Basic idea
  Three coins example
  Why EM works

SLIDE 73

Three Coins Example

Imagine three coins
Flip 1st coin (penny)
If heads: flip 2nd coin (dollar coin)
If tails: flip 3rd coin (dime)

SLIDE 74

Three Coins Example

Imagine three coins
Flip 1st coin (penny)
If heads: flip 2nd coin (dollar coin)
If tails: flip 3rd coin (dime)
only observe these (record the heads vs. tails outcome): the 2nd/3rd coin flips
don’t observe this: the penny flip

SLIDE 75

Three Coins Example

Imagine three coins
Flip 1st coin (penny); if heads: flip 2nd coin (dollar coin); if tails: flip 3rd coin (dime)
observed: a, b, e, etc.; “We run the code” vs. “The run failed”
unobserved: part of speech? genre?

SLIDE 76

Three Coins Example

Imagine three coins
Flip 1st coin (penny); if heads: flip 2nd coin (dollar coin); if tails: flip 3rd coin (dime)
penny: p(heads) = λ, p(tails) = 1 − λ
dollar coin: p(heads) = γ, p(tails) = 1 − γ
dime: p(heads) = ψ, p(tails) = 1 − ψ

SLIDE 77

Three Coins Example

Imagine three coins
penny: p(heads) = λ, p(tails) = 1 − λ; dollar coin: p(heads) = γ, p(tails) = 1 − γ; dime: p(heads) = ψ, p(tails) = 1 − ψ
Three parameters to estimate: λ, γ, and ψ

SLIDE 78

Generative Story for Three Coins

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)
p(z_1, x_1, z_2, x_2, …, z_N, x_N) = p(z_1) p(x_1 | z_1) ⋯ p(z_N) p(x_N | z_N) = ∏_j p(x_j | z_j) p(z_j)
add complexity to better explain what we see
Generative Story
λ = distribution over the penny; γ = distribution for the dollar coin; ψ = distribution for the dime
for item j = 1 to N:
  z_j ∼ Bernoulli(λ)
  if z_j = H: x_j ∼ Bernoulli(γ)
  else: x_j ∼ Bernoulli(ψ)

SLIDE 79

Three Coins Example

If all flips were observed:
penny flips:    H H T H T H
observed flips: H T H T T T
penny: p(heads) = λ, p(tails) = 1 − λ; dollar coin: p(heads) = γ, p(tails) = 1 − γ; dime: p(heads) = ψ, p(tails) = 1 − ψ

SLIDE 80

Three Coins Example

If all flips were observed:
penny flips:    H H T H T H
observed flips: H T H T T T
maximum likelihood estimates:
penny: p(heads) = 4/6, p(tails) = 2/6
dollar coin: p(heads) = 1/4, p(tails) = 3/4
dime: p(heads) = 1/2, p(tails) = 1/2
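Reading the first row as the (normally hidden) penny flips and the second row as the recorded flips, these estimates are straightforward counts; a quick sketch in plain Python:

```python
penny = list("HHTHTH")   # penny flips (hidden in the real setting)
obs   = list("HTHTTT")   # recorded flips of the dollar coin / dime

lam = penny.count("H") / len(penny)                   # 4/6
dollar = [o for p, o in zip(penny, obs) if p == "H"]  # flips made with the dollar coin
dime   = [o for p, o in zip(penny, obs) if p == "T"]  # flips made with the dime
gamma = dollar.count("H") / len(dollar)               # 1/4
psi   = dime.count("H") / len(dime)                   # 1/2
print(lam, gamma, psi)
```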

SLIDE 81

Three Coins Example

But not all flips are observed → set parameter values
observed flips: H T H T T T (the penny flips are hidden)
penny: p(heads) = λ = .6, p(tails) = .4
dollar coin: p(heads) = .8, p(tails) = .2
dime: p(heads) = .6, p(tails) = .4

SLIDE 82

Three Coins Example

But not all flips are observed → set parameter values
observed flips: H T H T T T (the penny flips are hidden)
penny: p(heads) = λ = .6, p(tails) = .4; dollar coin: p(heads) = .8, p(tails) = .2; dime: p(heads) = .6, p(tails) = .4
Use these values to compute posteriors:
p(heads | observed item H) = p(heads & H) / p(H)
p(heads | observed item T) = p(heads & T) / p(T)

SLIDE 83

Three Coins Example

But not all flips are observed → set parameter values
observed flips: H T H T T T (the penny flips are hidden)
penny: p(heads) = λ = .6, p(tails) = .4; dollar coin: p(heads) = .8, p(tails) = .2; dime: p(heads) = .6, p(tails) = .4
Use these values to compute posteriors:
p(heads | observed item H) = p(H | heads) p(heads) / p(H)
rewrite the joint using Bayes rule; the denominator p(H) is the marginal likelihood

SLIDE 84

Three Coins Example

But not all flips are observed → set parameter values
observed flips: H T H T T T (the penny flips are hidden)
penny: p(heads) = λ = .6, p(tails) = .4; dollar coin: p(heads) = .8, p(tails) = .2; dime: p(heads) = .6, p(tails) = .4
Use these values to compute posteriors:
p(heads | observed item H) = p(H | heads) p(heads) / p(H)
p(H | heads) = .8, p(T | heads) = .2

SLIDE 85

Three Coins Example

But not all flips are observed → set parameter values
observed flips: H T H T T T (the penny flips are hidden)
penny: p(heads) = λ = .6, p(tails) = .4; dollar coin: p(heads) = .8, p(tails) = .2; dime: p(heads) = .6, p(tails) = .4
Use these values to compute posteriors:
p(H) = p(H | heads) · p(heads) + p(H | tails) · p(tails) = .8 · .6 + .6 · .4
p(heads | observed item H) = p(H | heads) p(heads) / p(H)
p(H | heads) = .8, p(T | heads) = .2

SLIDE 86

Three Coins Example

observed flips: H T H T T T
Use posteriors to update parameters:
p(heads | obs. H) = p(H | heads) p(heads) / p(H) = (.8 · .6) / (.8 · .6 + .6 · .4) ≈ 0.667
p(heads | obs. T) = p(T | heads) p(heads) / p(T) = (.2 · .6) / (.2 · .6 + .4 · .4) ≈ 0.429
Q: Is p(heads | obs. H) + p(heads | obs. T) = 1?

SLIDE 87

Three Coins Example

observed flips: H T H T T T
Use posteriors to update parameters:
p(heads | obs. H) = p(H | heads) p(heads) / p(H) = (.8 · .6) / (.8 · .6 + .6 · .4) ≈ 0.667
p(heads | obs. T) = p(T | heads) p(heads) / p(T) = (.2 · .6) / (.2 · .6 + .4 · .4) ≈ 0.429
Q: Is p(heads | obs. H) + p(heads | obs. T) = 1? A: No.

SLIDE 88

Three Coins Example

observed flips: H T H T T T
Use posteriors to update parameters
fully observed setting: p(heads) = (# heads from penny) / (# total flips of penny)
our setting, partially observed: p(heads) = (# expected heads from penny) / (# total flips of penny)
p(heads | obs. H) = (.8 · .6) / (.8 · .6 + .6 · .4) ≈ 0.667
p(heads | obs. T) = (.2 · .6) / (.2 · .6 + .4 · .4) ≈ 0.429
(in general, p(heads | obs. H) and p(heads | obs. T) do NOT sum to 1)

SLIDE 89

Three Coins Example

observed flips: H T H T T T
Use posteriors to update parameters; our setting: partially observed
p^(t+1)(heads) = (# expected heads from penny) / (# total flips of penny) = E_{p^(t)}[# heads from penny] / (# total flips of penny)
p(heads | obs. H) ≈ 0.667, p(heads | obs. T) ≈ 0.429 (as above)

SLIDE 90

Three Coins Example

observed flips: H T H T T T
Use posteriors to update parameters; our setting: partially observed
p^(t+1)(heads) = E_{p^(t)}[# heads from penny] / (# total flips of penny)
             = (2 · p(heads | obs. H) + 4 · p(heads | obs. T)) / 6 ≈ 0.508
p(heads | obs. H) = (.8 · .6) / (.8 · .6 + .6 · .4) ≈ 0.667
p(heads | obs. T) = (.2 · .6) / (.2 · .6 + .4 · .4) ≈ 0.429

SLIDE 91

Expectation Maximization (EM)

Two-step, iterative algorithm:
0. Assume some value for your parameters
1. E-step: count under uncertainty (compute expectations)
2. M-step: maximize log-likelihood, assuming these uncertain counts
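Putting both steps together for the three-coins model gives a complete (if tiny) EM loop. A sketch in plain Python, reusing the illustrative starting values and observed flips from above; each iteration re-estimates all three coins from expected counts:

```python
obs = list("HTHTTT")              # observed flips (2 heads, 4 tails)
lam, gamma, psi = 0.6, 0.8, 0.6   # step 0: assumed starting values

for t in range(10):
    # E-step: posterior p(penny = H | x_j) for each observed flip
    post = []
    for x in obs:
        ph = gamma if x == "H" else 1 - gamma   # p(x_j | penny = H): dollar coin
        pt = psi if x == "H" else 1 - psi       # p(x_j | penny = T): dime
        post.append(ph * lam / (ph * lam + pt * (1 - lam)))
    # M-step: maximize log-likelihood under these expected counts
    lam = sum(post) / len(obs)
    gamma = sum(p for p, x in zip(post, obs) if x == "H") / sum(post)
    psi = sum(1 - p for p, x in zip(post, obs) if x == "H") / sum(1 - p for p in post)
    print(t, round(lam, 3), round(gamma, 3), round(psi, 3))
```

Like any EM run, this can settle at a local optimum of the marginal likelihood, and with so little data the three parameters are not separately identifiable; that is part of the point of the pitfalls slide later in the deck.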

SLIDE 92

Outline

Latent and probabilistic modeling
Generative Modeling
  Example 1: A Model of Rolling a Die
  Example 2: A Model of Conditional Die Rolls
EM (Expectation Maximization)
  Basic idea
  Three coins example
  Why EM works

SLIDE 93

Why does EM work?

Y: observed data; Z: unobserved data
𝒟(θ) = log-likelihood of the complete data (Y, Z)
𝒬(θ) = posterior log-likelihood of the unobserved data Z (given Y)
ℳ(θ) = marginal log-likelihood of the observed data Y
what do 𝒟, ℳ, 𝒬 look like?

SLIDE 94

Why does EM work?

Y: observed data; Z: unobserved data
𝒟(θ) = log-likelihood of the complete data (Y, Z); 𝒬(θ) = posterior log-likelihood of the unobserved data Z (given Y); ℳ(θ) = marginal log-likelihood of the observed data Y
𝒟(θ) = ∑_j log p(y_j, z_j)

SLIDE 95

Why does EM work?

Y: observed data; Z: unobserved data
𝒟(θ) = log-likelihood of the complete data (Y, Z); 𝒬(θ) = posterior log-likelihood of the unobserved data Z (given Y); ℳ(θ) = marginal log-likelihood of the observed data Y
𝒟(θ) = ∑_j log p(y_j, z_j)
ℳ(θ) = ∑_j log p(y_j) = ∑_j log ∑_k p(y_j, z = k)

SLIDE 96

Why does EM work?

Y: observed data; Z: unobserved data
𝒟(θ) = log-likelihood of the complete data (Y, Z); 𝒬(θ) = posterior log-likelihood of the unobserved data Z (given Y); ℳ(θ) = marginal log-likelihood of the observed data Y
𝒟(θ) = ∑_j log p(y_j, z_j)
ℳ(θ) = ∑_j log p(y_j) = ∑_j log ∑_k p(y_j, z = k)
𝒬(θ) = ∑_j log p(z_j | y_j)

SLIDE 97

Why does EM work?

Y: observed data; Z: unobserved data
𝒟(θ) = log-likelihood of the complete data (Y, Z); 𝒬(θ) = posterior log-likelihood of the unobserved data Z (given Y); ℳ(θ) = marginal log-likelihood of the observed data Y
p_θ(Z | Y) = p_θ(Y, Z) / p_θ(Y)   (definition of conditional probability)
p_θ(Y) = p_θ(Y, Z) / p_θ(Z | Y)   (algebra)
SLIDE 98

Why does EM work?

Y: observed data; Z: unobserved data
𝒟(θ) = log-likelihood of the complete data (Y, Z); 𝒬(θ) = posterior log-likelihood of the unobserved data Z (given Y); ℳ(θ) = marginal log-likelihood of the observed data Y
p_θ(Z | Y) = p_θ(Y, Z) / p_θ(Y)    p_θ(Y) = p_θ(Y, Z) / p_θ(Z | Y)
ℳ(θ) = 𝒟(θ) − 𝒬(θ)
𝒟(θ) = ∑_j log p(y_j, z_j); ℳ(θ) = ∑_j log p(y_j) = ∑_j log ∑_k p(y_j, z = k); 𝒬(θ) = ∑_j log p(z_j | y_j)
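Written out for a single example, the identity behind ℳ = 𝒟 − 𝒬 is just the definition of conditional probability, applied inside the log:

```latex
\log p_\theta(y_j)
  = \log \frac{p_\theta(y_j, z_j)}{p_\theta(z_j \mid y_j)}
  = \log p_\theta(y_j, z_j) - \log p_\theta(z_j \mid y_j),
```

and summing over j gives ℳ(θ) = 𝒟(θ) − 𝒬(θ) for any choice of the z_j.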

SLIDE 99

Why does EM work?

Y: observed data; Z: unobserved data
𝒟(θ) = log-likelihood of the complete data (Y, Z); 𝒬(θ) = posterior log-likelihood of the unobserved data Z (given Y); ℳ(θ) = marginal log-likelihood of the observed data Y
ℳ(θ) = 𝒟(θ) − 𝒬(θ)
E_{Z∼θ^(t)}[ℳ(θ) | Y] = E_{Z∼θ^(t)}[𝒟(θ) | Y] − E_{Z∼θ^(t)}[𝒬(θ) | Y]
take a conditional expectation (why? we’ll cover this more in variational inference)

SLIDE 100

Why does EM work?

Y: observed data; Z: unobserved data
𝒟(θ) = log-likelihood of the complete data (Y, Z); 𝒬(θ) = posterior log-likelihood of the unobserved data Z (given Y); ℳ(θ) = marginal log-likelihood of the observed data Y
ℳ(θ) = 𝒟(θ) − 𝒬(θ)
E_{Z∼θ^(t)}[ℳ(θ) | Y] = E_{Z∼θ^(t)}[𝒟(θ) | Y] − E_{Z∼θ^(t)}[𝒬(θ) | Y]
ℳ(θ) = E_{Z∼θ^(t)}[𝒟(θ) | Y] − E_{Z∼θ^(t)}[𝒬(θ) | Y]
ℳ already sums over Y (it does not depend on Z): ℳ(θ) = ∑_j log p(y_j) = ∑_j log ∑_k p(y_j, z = k)

SLIDE 101

Why does EM work?

Y: observed data; Z: unobserved data
𝒟(θ) = log-likelihood of the complete data (Y, Z); 𝒬(θ) = posterior log-likelihood of the unobserved data Z (given Y); ℳ(θ) = marginal log-likelihood of the observed data Y
ℳ(θ) = E_{Z∼θ^(t)}[𝒟(θ) | Y] − E_{Z∼θ^(t)}[𝒬(θ) | Y]
E_{Z∼θ^(t)}[𝒟(θ) | Y] = ∑_j ∑_k p_{θ^(t)}(z = k | y_j) log p_θ(y_j, z = k)

SLIDE 102

Why does EM work?

Y: observed data; Z: unobserved data
𝒟(θ) = log-likelihood of the complete data (Y, Z); 𝒬(θ) = posterior log-likelihood of the unobserved data Z (given Y); ℳ(θ) = marginal log-likelihood of the observed data Y
ℳ(θ) = E_{Z∼θ^(t)}[𝒟(θ) | Y] − E_{Z∼θ^(t)}[𝒬(θ) | Y] = R(θ, θ^(t)) − S(θ, θ^(t))
Let θ* be the value that maximizes R(θ, θ^(t))

SLIDE 103

Why does EM work?

Y: observed data; Z: unobserved data
𝒟(θ) = log-likelihood of the complete data (Y, Z); 𝒬(θ) = posterior log-likelihood of the unobserved data Z (given Y); ℳ(θ) = marginal log-likelihood of the observed data Y
ℳ(θ) = R(θ, θ^(t)) − S(θ, θ^(t))
ℳ(θ*) − ℳ(θ^(t)) = [ R(θ*, θ^(t)) − R(θ^(t), θ^(t)) ] − [ S(θ*, θ^(t)) − S(θ^(t), θ^(t)) ]
Let θ* be the value that maximizes R(θ, θ^(t))

SLIDE 104

Why does EM work?

Y: observed data; Z: unobserved data
𝒟(θ) = log-likelihood of the complete data (Y, Z); 𝒬(θ) = posterior log-likelihood of the unobserved data Z (given Y); ℳ(θ) = marginal log-likelihood of the observed data Y
ℳ(θ) = R(θ, θ^(t)) − S(θ, θ^(t))
ℳ(θ*) − ℳ(θ^(t)) = [ R(θ*, θ^(t)) − R(θ^(t), θ^(t)) ] − [ S(θ*, θ^(t)) − S(θ^(t), θ^(t)) ]
Let θ* be the value that maximizes R(θ, θ^(t))
the first bracket is ≥ 0 (θ* maximizes R); the second is ≤ 0 (we’ll see why with Jensen’s inequality, in variational inference)

SLIDE 105

Why does EM work?

Y: observed data; Z: unobserved data
𝒟(θ) = log-likelihood of the complete data (Y, Z); 𝒬(θ) = posterior log-likelihood of the unobserved data Z (given Y); ℳ(θ) = marginal log-likelihood of the observed data Y
ℳ(θ) = R(θ, θ^(t)) − S(θ, θ^(t))
ℳ(θ*) − ℳ(θ^(t)) = [ R(θ*, θ^(t)) − R(θ^(t), θ^(t)) ] − [ S(θ*, θ^(t)) − S(θ^(t), θ^(t)) ]
Let θ* be the value that maximizes R(θ, θ^(t))
ℳ(θ*) − ℳ(θ^(t)) ≥ 0: EM does not decrease the marginal log-likelihood

SLIDE 106

Generalized EM

Partial M step: find a θ that simply increases, rather than maximizes, Q
Partial E step: only consider some of the variables (an online learning algorithm)

SLIDE 107

EM has its pitfalls

Objective is not convex → EM may converge to a bad local optimum
Computing expectations can be hard: the E-step may require clever algorithms
How well does log-likelihood correlate with an end task?

SLIDE 108

A Maximization-Maximization Procedure

G(θ, q) = E_q[ 𝒟(θ) ] − E_q[ log q(z) ]
𝒟(θ): the (complete-)data log-likelihood; q: any distribution over Z
we’ll see this again with variational inference
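One way to see why alternately maximizing G over q and over θ helps (a standard decomposition, not spelled out on the slide): for a single observation x with latent z,

```latex
G(\theta, q)
  = \mathbb{E}_{z \sim q}\!\left[\log p_\theta(x, z)\right]
    - \mathbb{E}_{z \sim q}\!\left[\log q(z)\right]
  = \log p_\theta(x) - \mathrm{KL}\!\left(q(z) \,\|\, p_\theta(z \mid x)\right)
  \le \log p_\theta(x),
```

with equality exactly when q is the posterior p_θ(z | x); so the E-step (maximizing over q) tightens the bound, and the M-step (maximizing over θ) pushes up the marginal log-likelihood.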

SLIDE 109

Outline

Latent and probabilistic modeling
Generative Modeling
  Example 1: A Model of Rolling a Die
  Example 2: A Model of Conditional Die Rolls
EM (Expectation Maximization)
  Basic idea
  Three coins example
  Why EM works