MLE/MAP With Latent Variables
CMSC 691, UMBC
Outline
- Constrained Optimization
- Distributions of distributions
- Example 1: A Model of Rolling a Die
- Example 2: A Model of Conditional Die Rolls
Lagrange multipliers

Assume an original optimization problem:

  maximize f(θ)  subject to  g(θ) = 0

We convert it to a new optimization problem:

  maximize ℒ(θ, λ) = f(θ) − λ g(θ)

Lagrange multipliers: an equivalent problem? Setting ∂ℒ/∂λ = −g(θ) = 0 recovers the constraint, so stationary points of ℒ correspond to constrained stationary points of the original problem.
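As a quick numeric sanity check (a sketch of my own, not from the slides), a constrained optimizer should agree with the Lagrange solution on a toy instance: maximize f(θ) = log θ_1 + log θ_2 subject to θ_1 + θ_2 = 1, whose Lagrangian yields θ = (1/2, 1/2).

```python
# Toy check: maximize log(theta_1) + log(theta_2) s.t. theta_1 + theta_2 = 1.
# The Lagrangian solution is theta = (1/2, 1/2); SLSQP should agree.
import numpy as np
from scipy.optimize import minimize

def neg_f(theta):
    # minimize the negative to maximize f
    return -np.sum(np.log(theta))

result = minimize(
    neg_f,
    x0=np.array([0.3, 0.7]),
    method="SLSQP",
    bounds=[(1e-9, 1.0)] * 2,                                # keep log() defined
    constraints=[{"type": "eq", "fun": lambda t: t.sum() - 1.0}],
)
print(result.x)  # ~ [0.5, 0.5]
```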
Outline
- Constrained Optimization
- Distributions of distributions
- Example 1: A Model of Rolling a Die
- Example 2: A Model of Conditional Die Rolls
Recap: Common Distributions

Bernoulli/Binomial, Categorical/Multinomial, Poisson, Normal, Gamma

Categorical: a single draw
- Finite R.V. taking one of K values: 1, 2, …, K
- X ∼ Cat(θ), θ ∈ ℝ^K
- p(X = 1) = θ_1, p(X = 2) = θ_2, …, p(X = K) = θ_K
- Generally, p(X = k) = ∏_k θ_k^𝟙[X = k]
- 𝟙[s] = 1 if s is true; 0 if s is false

Multinomial: sum of N iid Categorical draws
- Vector of size K representing how often value k was drawn
- X ∼ Multinomial(N, θ), X ∈ ℝ^K
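A minimal sketch (variable names are mine) of the Categorical/Multinomial relationship: the vector of counts from N iid Categorical draws is distributed as one Multinomial(N, θ) draw.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 6, 9
theta = np.full(K, 1.0 / K)                   # a fair six-sided die

draws = rng.choice(K, size=N, p=theta)        # N Categorical draws (values 0..K-1)
counts = np.bincount(draws, minlength=K)      # how often each value was drawn

multi = rng.multinomial(N, theta)             # one Multinomial(N, theta) draw
print(counts, multi)                          # both are K-vectors summing to N
```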
What if we want to make θ a random variable?
Distribution of (multinomial) distributions

If θ is a K-dimensional multinomial parameter, i.e.

  θ ∈ Δ^{K−1}:  θ_k ≥ 0,  Σ_k θ_k = 1

then we want some density F(α) that describes θ:

  θ ∼ F(α),  with  ∫_θ F(θ; α) dθ = 1
Two Primary Options

Dirichlet Distribution:

  Dir(θ; α) = [Γ(Σ_k α_k) / ∏_k Γ(α_k)] ∏_k θ_k^{α_k − 1},   α ∈ ℝ₊^K

A Beta distribution is the special case when K = 2.
https://en.wikipedia.org/wiki/Dirichlet_distribution

Logistic Normal:

  θ ∼ LN(μ, Σ)  ⟺  logit(θ) ∼ N(μ, Σ)

  LN(θ; μ, Σ) ∝ (∏_k θ_k)^{−1} exp(−½ (z − μ)ᵀ Σ^{−1} (z − μ)),  where z_k = log(θ_k / θ_K)

https://en.wikipedia.org/wiki/Logit-normal_distribution
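A short sketch (parameter values assumed, names mine) of drawing θ from each option; for the logistic normal it draws the (K−1)-dimensional Gaussian z and inverts z_k = log(θ_k / θ_K).

```python
import numpy as np

rng = np.random.default_rng(0)
K = 6

# Option 1: Dirichlet. alpha > 0 controls concentration on the simplex.
alpha = np.full(K, 2.0)
theta_dir = rng.dirichlet(alpha)

# Option 2: Logistic normal. Draw z ~ N(mu, Sigma) in K-1 dimensions, then
# invert z_k = log(theta_k / theta_K) to land back on the simplex.
mu, Sigma = np.zeros(K - 1), np.eye(K - 1)
z = rng.multivariate_normal(mu, Sigma)
expz = np.exp(z)
theta_ln = np.append(expz, 1.0) / (1.0 + expz.sum())

print(theta_dir.sum(), theta_ln.sum())  # both are valid distributions: sum to 1
```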
Outline
- Constrained Optimization
- Distributions of distributions
- Example 1: A Model of Rolling a Die
- Example 2: A Model of Conditional Die Rolls
Generative Story for Rolling a Die

  p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)

N different (independent) rolls: x_1 = 1, x_2 = 5, x_3 = 4, ⋯

Generative Story:

  for roll i = 1 to N:
    x_i ∼ Cat(θ)

θ is a probability distribution over the 6 sides of the die: Σ_{k=1}^{6} θ_k = 1 and 0 ≤ θ_k ≤ 1 for all k. The "for each" loop becomes a product; calculate p(x_i) according to the provided distribution.
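The generative story translates almost line-for-line into a simulation; a minimal sketch (θ values assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([2, 1, 1, 3, 1, 1]) / 9.0   # any distribution over 6 sides
N = 9

# for roll i = 1 to N: x_i ~ Cat(theta); sides reported as 1..6
rolls = rng.choice(6, size=N, p=theta) + 1

# the "for each" loop becomes a product in the joint likelihood
likelihood = np.prod(theta[rolls - 1])
print(rolls, likelihood)
```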
Learning Parameters for the Die Model

  p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)

Maximize the (log-)likelihood to learn the probability parameters.

Q: Why is maximizing log-likelihood a reasonable thing to do?
A: It develops a good model for what we observe.

Q: (for discrete observations) What loss function do we minimize to maximize log-likelihood?
A: Cross-entropy.
Learning Parameters for the Die Model: Maximum Likelihood (Intuition)

  p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)

If you observe these 9 rolls… what are "reasonable" estimates for p(w)?

Maximum likelihood estimates:
  p(1) = 2/9   p(2) = 1/9   p(3) = 1/9   p(4) = 3/9   p(5) = 1/9   p(6) = 1/9
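The 9 rolls themselves are not reproduced here, but the stated estimates imply their counts (two 1s, one 2, one 3, three 4s, one 5, one 6); a quick check that the MLE is just normalized counts:

```python
import numpy as np

# rolls reconstructed from the stated estimates above
rolls = np.array([1, 1, 2, 3, 4, 4, 4, 5, 6])
counts = np.bincount(rolls, minlength=7)[1:]   # counts for sides 1..6
theta_mle = counts / counts.sum()
print(theta_mle)   # [2/9, 1/9, 1/9, 3/9, 1/9, 1/9]
```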
Learning Parameters for the Die Model: Maximum Likelihood (Math)

  p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)

N different (independent) rolls: x_1 = 1, x_2 = 5, x_3 = 4, ⋯

Q: What's the generative story?

  for roll i = 1 to N:
    x_i ∼ Cat(θ)

Q: What's the objective? Maximize the log-likelihood:

  ℓ(θ) = Σ_i log p_θ(x_i) = Σ_i log θ_{x_i}

Q: What's an easy way to maximize this, as written exactly (even without calculus)?
A: Just keep increasing θ_k (we know θ must be a distribution, but the objective doesn't say so).

So maximize the log-likelihood with the distribution constraint:

  ℓ(θ) = Σ_i log θ_{x_i}   s.t.   Σ_{k=1}^{6} θ_k = 1

(We could also include the inequality constraints 0 ≤ θ_k, but they complicate the problem and, right now, are not needed.) Solve using Lagrange multipliers:

  ℱ(θ) = Σ_i log θ_{x_i} − λ (Σ_{k=1}^{6} θ_k − 1)

  ∂ℱ(θ)/∂θ_k = Σ_{i: x_i = k} 1/θ_k − λ
  ∂ℱ(θ)/∂λ = −Σ_{k=1}^{6} θ_k + 1

Setting ∂ℱ(θ)/∂θ_k = 0 gives θ_k = (Σ_{i: x_i = k} 1) / λ, and θ is optimal when Σ_{k=1}^{6} θ_k = 1. Substituting:

  θ_k = (Σ_{i: x_i = k} 1) / (Σ_j Σ_{i: x_i = j} 1) = N_k / N

The maximum likelihood estimate is just the observed proportion of each side.
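A numeric sanity check (my own sketch): a constrained optimizer on the log-likelihood should recover the closed form θ_k = N_k / N.

```python
import numpy as np
from scipy.optimize import minimize

rolls = np.array([1, 1, 2, 3, 4, 4, 4, 5, 6])       # the 9 rolls from before
counts = np.bincount(rolls, minlength=7)[1:].astype(float)

def neg_log_lik(theta):
    # negative of l(theta) = sum_i log theta_{x_i} = sum_k N_k log theta_k
    return -np.sum(counts * np.log(theta))

result = minimize(
    neg_log_lik,
    x0=np.full(6, 1 / 6),
    method="SLSQP",
    bounds=[(1e-9, 1.0)] * 6,
    constraints=[{"type": "eq", "fun": lambda t: t.sum() - 1.0}],
)
print(result.x)               # numeric optimum
print(counts / counts.sum())  # closed form N_k / N -- they should match
```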
Learning Parameters for the Die Model: Maximum Likelihood (Math) with θ as RV

  p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)

N different (independent) rolls: x_1 = 1, x_2 = 5, x_3 = 4, ⋯

Generative Story with θ as RV:

  θ ∼ Dir(α)
  for roll i = 1 to N:
    x_i ∼ Cat(θ)

Objective with θ as RV:

  ℓ(θ) = log Dir(θ; α) + Σ_i log θ_{x_i}   s.t.   Σ_{k=1}^{6} θ_k = 1
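The slides stop at the objective, but running the same Lagrange derivation with the Dirichlet term included gives the standard MAP closed form θ_k = (N_k + α_k − 1) / (N + Σ_j α_j − K), assuming each α_k > 1; a sketch:

```python
import numpy as np

rolls = np.array([1, 1, 2, 3, 4, 4, 4, 5, 6])
counts = np.bincount(rolls, minlength=7)[1:].astype(float)
alpha = np.full(6, 2.0)   # symmetric Dirichlet prior: one pseudo-count per side

# MAP estimate: counts smoothed by alpha_k - 1 pseudo-counts
theta_map = (counts + alpha - 1) / (counts.sum() + alpha.sum() - len(alpha))
print(theta_map, theta_map.sum())   # a smoothed estimate; still sums to 1
```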
Outline
- Constrained Optimization
- Distributions of distributions
- Example 1: A Model of Rolling a Die
- Example 2: A Model of Conditional Die Rolls
Example: Conditionally Rolling a Die

  p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)

  p(z_1, x_1, z_2, x_2, …, z_N, x_N) = p(z_1) p(x_1 | z_1) ⋯ p(z_N) p(x_N | z_N) = ∏_i p(x_i | z_i) p(z_i)

Add complexity to better explain what we see: first flip a coin (z_1 = T, z_2 = H, ⋯), then roll a different die depending on the coin flip (x_1 = 1, x_2 = 5, ⋯).
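A direct simulation of this conditional story (parameter values assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = 0.5                                             # p(z = heads)
gamma = {
    "H": np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1]),   # die used after heads
    "T": np.full(6, 1 / 6),                          # die used after tails
}

N = 5
z = np.where(rng.random(N) < pi, "H", "T")           # z_i ~ Bernoulli(pi)
x = np.array([rng.choice(6, p=gamma[zi]) + 1 for zi in z])  # x_i ~ Cat(gamma[z_i])
print(list(zip(z, x)))
```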
Learning in Conditional Die Roll Model: Maximize (Log-)Likelihood

  p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_i p(x_i | z_i) p(z_i)

If you observe the z_i values, this is easy!

First: write the generative story.

  π = distribution over the coin (z)
  γ(H) = distribution for the die when the coin comes up heads
  γ(T) = distribution for the die when the coin comes up tails
  for item i = 1 to N:
    z_i ∼ Bernoulli(π)
    x_i ∼ Cat(γ(z_i))

Second: turn the generative story into an objective, with Lagrange multipliers for the distribution constraints:

  ℱ(θ) = Σ_i (log π_{z_i} + log γ^{(z_i)}_{x_i}) − λ (Σ_{k=1}^{2} π_k − 1) − Σ_{k=1}^{2} λ_k (Σ_{j=1}^{6} γ^{(k)}_j − 1)

But if you don't observe the z_i values, this is not easy!
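In the fully observed case, the Lagrange derivation again reduces every parameter to normalized counts; a sketch (observed z and x values assumed for illustration):

```python
import numpy as np

z = np.array(["T", "H", "H", "T", "H"])   # observed coin flips
x = np.array([1, 5, 4, 2, 5])             # observed die rolls

# pi: fraction of flips landing on each side of the coin
pi_mle = {c: np.mean(z == c) for c in ("H", "T")}

# gamma^(c): per-coin normalized counts over the 6 die sides
gamma_mle = {c: np.bincount(x[z == c], minlength=7)[1:] / np.sum(z == c)
             for c in ("H", "T")}
print(pi_mle)
print(gamma_mle)   # each value is a distribution over the 6 sides
```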
Example: Conditionally Rolling a Die

  p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_i p(x_i | z_i) p(z_i)

We don't actually observe the z values; we just see the items w. The goal is still to maximize the (log-)likelihood, but:
- If we did observe z, estimating the probability parameters would be easy… but we don't! :(
- If we knew the probability parameters, we could estimate z and evaluate the likelihood… but we don't! :(

[Graphical model: each observed item w_i depends on its own latent z_i.]

Goal: maximize the marginalized (log-)likelihood, summing over what z could have been:

  p(x_1, x_2, …, x_N) = Σ_{z_1} p(z_1, x_1) · Σ_{z_2} p(z_2, x_2) ⋯ Σ_{z_N} p(z_N, x_N)
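A sketch (parameter values assumed) of evaluating that marginalized likelihood: each factor sums p(z) p(x_i | z) over the two coin outcomes, so no z values are needed.

```python
import numpy as np

pi = {"H": 0.5, "T": 0.5}
gamma = {
    "H": np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1]),
    "T": np.full(6, 1 / 6),
}
x = np.array([1, 5, 4, 2, 5])   # observed rolls; the z_i are never seen

# p(x_1..x_N) = prod_i sum_z p(z) p(x_i | z)
marginal = np.prod([sum(pi[z] * gamma[z][xi - 1] for z in ("H", "T"))
                    for xi in x])
print(marginal)   # this is the quantity the learner must maximize in theta
```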