

SLIDE 1

Examples: MLE/MAP With Latent Variables

CMSC 691 UMBC

SLIDE 2

Outline

  • Constrained Optimization
  • Distributions of distributions
  • Example 1: A Model of Rolling a Die
  • Example 2: A Model of Conditional Die Rolls

SLIDE 3–4

Lagrange multipliers

Assume an original optimization problem. We convert it to a new optimization problem:
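
The slide's equations are images and did not survive extraction. As a reconstruction of the standard setup the outline refers to (with f the objective and g the equality constraint — symbols of my choosing, not the slide's):

```latex
% Reconstruction of the standard conversion; the slide's own notation is lost.
% Original constrained problem:
\max_{\theta} f(\theta) \quad \text{s.t.} \quad g(\theta) = 0
% New problem, via a Lagrange multiplier \lambda:
\mathcal{F}(\theta, \lambda) = f(\theta) - \lambda\, g(\theta),
\qquad \max_{\theta} \min_{\lambda} \mathcal{F}(\theta, \lambda)
```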

SLIDE 5–7

Lagrange multipliers: an equivalent problem?

Setting ∂ℱ/∂λ = 0 recovers exactly the original constraint, and setting ∂ℱ/∂θ = 0 forces the gradient of the objective to align with the gradient of the constraint, so a stationary point of ℱ satisfies the optimality conditions of the original constrained problem.

SLIDE 8

Outline

  • Constrained Optimization
  • Distributions of distributions
  • Example 1: A Model of Rolling a Die
  • Example 2: A Model of Conditional Die Rolls

SLIDE 9–10

Recap: Common Distributions

Bernoulli/Binomial, Categorical/Multinomial, Poisson, Normal, Gamma

Categorical: a single draw
  • Finite R.V. taking one of K values: 1, 2, …, K
  • X ∼ Cat(θ), θ ∈ ℝ^K
  • p(X = 1) = θ_1, p(X = 2) = θ_2, …, p(X = K) = θ_K
  • Generally, p(X = k) = ∏_j θ_j^{1[k = j]}
  • 1[t] = 1 if t is true, 0 if t is false

Multinomial: sum of N i.i.d. Categorical draws
  • A vector of size K recording how often each value k was drawn
  • Y ∼ Multinomial(N, θ), θ ∈ ℝ^K

What if we want to make θ itself a random variable?
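
A minimal sketch of these two distributions in NumPy (the parameter values and variable names are my choices, not the slides'):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.1, 0.1, 0.3, 0.1, 0.2])  # K = 6, sums to 1

# One categorical draw: a single die roll taking values 0..K-1.
x = rng.choice(len(theta), p=theta)

# p(X = k) written as the indicator product from the slide:
# prod_j theta_j ** 1[k == j] picks out exactly theta_k.
def cat_pmf(k, theta):
    return np.prod([theta[j] ** (k == j) for j in range(len(theta))])

assert np.isclose(cat_pmf(3, theta), theta[3])

# Multinomial: counts from N iid categorical draws (a vector of size K).
counts = rng.multinomial(100, theta)
```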

SLIDE 11–13

Distribution of (multinomial) distributions

If θ is a K-dimensional multinomial parameter, it lives on the probability simplex:

θ ∈ Δ^{K−1}:  θ_k ≥ 0 for all k,  ∑_k θ_k = 1

We want some density F(α) that describes θ:

θ ∼ F(α),  where ∫_θ F(θ; α) dθ = 1

SLIDE 14–16

Two Primary Options

Dirichlet distribution:

Dir(θ; α) = ( Γ(∑_k α_k) / ∏_k Γ(α_k) ) ∏_k θ_k^{α_k − 1},   α ∈ ℝ_+^K

A Beta distribution is the special case when K = 2.

https://en.wikipedia.org/wiki/Dirichlet_distribution

Logistic Normal:

θ ∼ LN(μ, Σ)  ⟺  logit(θ) ∼ N(μ, Σ)

LN(θ; μ, Σ) ∝ ( ∏_k θ_k )^{−1} exp( −½ (z − μ)^T Σ^{−1} (z − μ) ),   where z_k = log(θ_k / θ_K)

https://en.wikipedia.org/wiki/Logit-normal_distribution
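
A hedged NumPy sketch of the Dirichlet option, including a check of the K = 2 Beta special case (the concentration values are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw a die-probability vector from a Dirichlet prior.
alpha = np.full(6, 2.0)
theta = rng.dirichlet(alpha)          # lies on the 5-simplex
assert np.isclose(theta.sum(), 1.0) and (theta >= 0).all()

# K = 2 reduces to a Beta distribution: the first coordinate of
# Dir([a, b]) is distributed as Beta(a, b).
a, b = 3.0, 5.0
dir_draws = rng.dirichlet([a, b], size=100_000)[:, 0]
beta_draws = rng.beta(a, b, size=100_000)
print(dir_draws.mean(), beta_draws.mean())  # both ≈ a / (a + b) = 0.375
```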

SLIDE 17

Outline

  • Constrained Optimization
  • Distributions of distributions
  • Example 1: A Model of Rolling a Die
  • Example 2: A Model of Conditional Die Rolls

SLIDE 18–19

Generative Story for Rolling a Die

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)

N different (independent) rolls: x_1 = 1, x_2 = 5, x_3 = 4, ⋯

Generative story:
  for roll i = 1 to N:
    x_i ∼ Cat(θ)

Here θ is a probability distribution over the 6 sides of the die: ∑_{k=1}^{6} θ_k = 1 and 0 ≤ θ_k ≤ 1 for all k.

The "for each" loop becomes a product; p(x_i) is calculated according to the provided distribution.

SLIDE 20–23

Learning Parameters for the Die Model

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)

Maximize the (log-)likelihood to learn the probability parameters.

Q: Why is maximizing log-likelihood a reasonable thing to do?
A: It develops a good model for what we observe.

Q: (For discrete observations) What loss function do we minimize to maximize log-likelihood?
A: Cross-entropy.

SLIDE 24–25

Learning Parameters for the Die Model: Maximum Likelihood (Intuition)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)

Maximize the (log-)likelihood to learn the probability parameters.

If you observe these 9 rolls (shown on the slide as die faces)… what are "reasonable" estimates for p(w)?

Maximum likelihood estimates:
p(1) = 2/9   p(2) = 1/9   p(3) = 1/9   p(4) = 3/9   p(5) = 1/9   p(6) = 1/9
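
The nine rolls themselves are images on the slide, but any sequence with these counts (two 1s, one 2, one 3, three 4s, one 5, one 6) reproduces the estimates above. A quick check in NumPy:

```python
import numpy as np

# One sequence consistent with the slide's counts (the actual faces are images).
rolls = np.array([1, 1, 2, 3, 4, 4, 4, 5, 6])

# MLE for a categorical distribution: empirical frequencies N_k / N.
counts = np.bincount(rolls, minlength=7)[1:]   # index 0 unused; faces 1..6
theta_hat = counts / counts.sum()
print(theta_hat)  # [2/9, 1/9, 1/9, 3/9, 1/9, 1/9]
```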

SLIDE 26–30

Learning Parameters for the Die Model: Maximum Likelihood (Math)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)

N different (independent) rolls: x_1 = 1, x_2 = 5, x_3 = 4, ⋯

Q: What's the generative story?
A:
  for roll i = 1 to N:
    x_i ∼ Cat(θ)

Q: What's the objective?
A: Maximize the log-likelihood:

ℒ(θ) = ∑_i log p_θ(x_i) = ∑_i log θ_{x_i}

Q: What's an easy way to maximize this, as written exactly (even without calculus)?
A: Just keep increasing θ_k (we know θ must be a distribution, but the objective doesn't say so).

SLIDE 31–34

Learning Parameters for the Die Model: Maximum Likelihood (Math)

p(x_1, x_2, …, x_N) = ∏_i p(x_i), over N different (independent) rolls.

Maximize the log-likelihood, now with the distribution constraint:

ℒ(θ) = ∑_i log θ_{x_i}   s.t.   ∑_{k=1}^{6} θ_k = 1

(We can include the inequality constraints 0 ≤ θ_k, but they complicate the problem and, right now, are not needed.)

Solve using Lagrange multipliers:

ℱ(θ, λ) = ∑_i log θ_{x_i} − λ ( ∑_{k=1}^{6} θ_k − 1 )

∂ℱ/∂θ_k = ∑_{i: x_i = k} 1/θ_{x_i} − λ        ∂ℱ/∂λ = −∑_{k=1}^{6} θ_k + 1

Setting ∂ℱ/∂θ_k = 0 gives θ_k = ( ∑_{i: x_i = k} 1 ) / λ, and λ is optimal when ∑_{k=1}^{6} θ_k = 1, so:

θ_k = ( ∑_{i: x_i = k} 1 ) / ( ∑_l ∑_{i: x_i = l} 1 ) = N_k / N

SLIDE 35–36

Learning Parameters for the Die Model: Maximum Likelihood (Math) with θ as RV

p(x_1, x_2, …, x_N) = ∏_i p(x_i), over N different (independent) rolls: x_1 = 1, x_2 = 5, x_3 = 4, ⋯

Generative story with θ as RV:
  θ ∼ Dir(α)
  for roll i = 1 to N:
    x_i ∼ Cat(θ)

Objective with θ as RV:

ℒ(θ) = log Dir(θ; α) + ∑_i log θ_{x_i}   s.t.   ∑_{k=1}^{6} θ_k = 1

SLIDE 37

Outline

  • Constrained Optimization
  • Distributions of distributions
  • Example 1: A Model of Rolling a Die
  • Example 2: A Model of Conditional Die Rolls

SLIDE 38–40

Example: Conditionally Rolling a Die

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = p(z_1) p(x_1 | z_1) ⋯ p(z_N) p(x_N | z_N) = ∏_i p(x_i | z_i) p(z_i)

We add complexity to better explain what we see:

First flip a coin… z_1 = T, z_2 = H, ⋯
…then roll a different die depending on the coin flip: x_1 = 1, x_2 = 5, ⋯
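
A minimal sketch of this two-step story (the coin bias and die distributions are my choices):

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 0.7                                          # p(z = H)
gamma = {"H": np.full(6, 1 / 6),                   # fair die for heads
         "T": np.array([.5, .1, .1, .1, .1, .1])}  # loaded die for tails

N = 5
z = np.where(rng.random(N) < lam, "H", "T")                 # first flip a coin...
x = np.array([rng.choice(6, p=gamma[zi]) + 1 for zi in z])  # ...then roll its die
print(list(zip(z, x)))
```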

SLIDE 41–45

Learning in Conditional Die Roll Model: Maximize (Log-)Likelihood

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_i p(x_i | z_i) p(z_i)

If you observe the z_i values, this is easy!

First: write the generative story.

λ = distribution over the coin (z)
γ^(H) = distribution for the die when the coin comes up heads
γ^(T) = distribution for the die when the coin comes up tails
for item i = 1 to N:
  z_i ∼ Bernoulli(λ)
  x_i ∼ Cat(γ^(z_i))

Second: generative story → objective, with Lagrange-multiplier terms for the distribution constraints:

ℱ = ∑_{i=1}^{N} ( log λ_{z_i} + log γ^{(z_i)}_{x_i} ) − η ( ∑_{k=1}^{2} λ_k − 1 ) − ∑_{k=1}^{2} δ_k ( ∑_{j=1}^{6} γ^{(k)}_j − 1 )

But if you don't observe the z_i values, this is not easy!
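
With the z_i observed, the Lagrange-multiplier solution again reduces to counting. A sketch of the resulting estimators (the data are a made-up example; the count forms are the standard fully observed MLEs):

```python
import numpy as np

z = np.array(["T", "H", "H", "T", "H", "H"])       # observed coin flips
x = np.array([1, 5, 4, 1, 5, 2])                   # observed die rolls

# MLE for the coin: fraction of heads.
lam_hat = np.mean(z == "H")

# MLE for each die: per-face frequencies among rolls with that coin value.
gamma_hat = {c: np.bincount(x[z == c], minlength=7)[1:] / np.sum(z == c)
             for c in ("H", "T")}
print(lam_hat, gamma_hat)
```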

SLIDE 46–49

Example: Conditionally Rolling a Die

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = p(z_1) p(x_1 | z_1) ⋯ p(z_N) p(x_N | z_N) = ∏_i p(x_i | z_i) p(z_i)

We don't actually observe the z values; we just see the items w.

  • If we did observe z, estimating the probability parameters would be easy… but we don't! :(
  • If we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don't! :(

Goal: maximize the marginalized (log-)likelihood:

p(x_1, x_2, …, x_N) = ∑_{z_1} p(z_1, x_1) ∑_{z_2} p(z_2, x_2) ⋯ ∑_{z_N} p(z_N, x_N)

(Figure: a chain of observed items w, each paired with its own latent z.)
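
Because the rolls are independent, the marginal likelihood factorizes per item. A sketch of evaluating it for fixed parameters (parameters and data are my example; the chicken-and-egg alternation described above is what EM resolves):

```python
import numpy as np

x = np.array([1, 5, 4, 1, 5, 2])                   # observed rolls; z unobserved
lam = 0.7
gamma = {"H": np.full(6, 1 / 6),
         "T": np.array([.5, .1, .1, .1, .1, .1])}

# log p(x) = sum_i log [ sum_z p(z) p(x_i | z) ] -- the sum over z sits
# inside the log, which is what makes direct maximization hard.
pz = {"H": lam, "T": 1 - lam}
log_marg = sum(np.log(sum(pz[c] * gamma[c][xi - 1] for c in ("H", "T")))
               for xi in x)
print(log_marg)
```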