MLE/MAP With Latent Variables (CMSC 691 UMBC)


  1. Examples: MLE/MAP With Latent Variables CMSC 691 UMBC

  2. Outline
     β€’ Constrained Optimization
     β€’ Distributions of distributions
     β€’ Example 1: A Model of Rolling a Die
     β€’ Example 2: A Model of Conditional Die Rolls

  3. Lagrange multipliers
     Assume an original optimization problem.

  4. Lagrange multipliers
     Assume an original optimization problem. We convert it to a new optimization problem.

  5.-7. Lagrange multipliers: an equivalent problem?
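The optimization problems referenced on slides 3-7 appeared as images in the original deck. The standard conversion they describe, for a problem with an equality constraint, is:

$$\max_x f(x) \;\;\text{s.t.}\;\; g(x) = 0 \quad\longrightarrow\quad \Lambda(x, \lambda) = f(x) + \lambda\, g(x),$$

whose stationary points in both $x$ and $\lambda$ (i.e., $\nabla_x f(x) = -\lambda \nabla_x g(x)$ and $g(x) = 0$) recover the constrained optima of the original problem.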

  8. Outline
     β€’ Constrained Optimization
     β€’ Distributions of distributions
     β€’ Example 1: A Model of Rolling a Die
     β€’ Example 2: A Model of Conditional Die Rolls

  9. Recap: Common Distributions
     Categorical: a single draw
     β€’ Finite R.V. taking one of K values: 1, 2, …, K
     β€’ $X \sim \mathrm{Cat}(\theta)$, $\theta \in \mathbb{R}^K$
     β€’ $p(X = 1) = \theta_1$, $p(X = 2) = \theta_2$, …, $p(X = K) = \theta_K$
     β€’ Generally, $p(X = x) = \prod_k \theta_k^{\mathbb{1}[x = k]}$, where $\mathbb{1}[t] = 1$ if $t$ is true and $0$ if $t$ is false
     Multinomial: sum of N i.i.d. Categorical draws
     β€’ Vector of size K representing how often value k was drawn
     β€’ $X \sim \mathrm{Multinomial}(N, \theta)$, $\theta \in \mathbb{R}^K$
     (sidebar: Bernoulli/Binomial, Categorical/Multinomial, Poisson, Normal, Gamma)

  10. Recap: Common Distributions (same content as slide 9, with one added question)
      What if we want to make $\theta$ a random variable?
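As a concrete illustration of the two distributions above (not part of the original slides), here is a minimal NumPy sketch; the parameter vector `theta` is a made-up example:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.1, 0.1, 0.3, 0.2, 0.1])  # example K=6 categorical parameter

# Categorical: a single draw, one face in {1, ..., K}
x = rng.choice(len(theta), p=theta) + 1

# Multinomial: N i.i.d. categorical draws summarized as a length-K count vector
counts = rng.multinomial(100, theta)

print(x)                     # a single face
print(counts, counts.sum())  # counts sum to N = 100
```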

  11. Distribution of (multinomial) distributions
      If $\theta$ is a K-dimensional multinomial parameter …

  12. Distribution of (multinomial) distributions
      If $\theta$ is a K-dimensional multinomial parameter,
      $$\theta \in \Delta^{K-1}: \quad \theta_k \ge 0, \;\; \sum_k \theta_k = 1,$$
      we want some density $F(\alpha)$ that describes $\theta$.

  13. Distribution of (multinomial) distributions
      If $\theta$ is a K-dimensional multinomial parameter,
      $$\theta \in \Delta^{K-1}: \quad \theta_k \ge 0, \;\; \sum_k \theta_k = 1,$$
      we want some density $F(\alpha)$ that describes $\theta$:
      $$\theta \sim F(\alpha), \qquad \int_\theta F(\theta; \alpha)\, d\theta = 1.$$

  14. Two Primary Options
      Dirichlet Distribution:
      $$\mathrm{Dir}(\theta; \alpha) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}, \qquad \alpha \in \mathbb{R}_+^K$$
      https://en.wikipedia.org/wiki/Logit-normal_distribution
      https://en.wikipedia.org/wiki/Dirichlet_distribution

  15. Two Primary Options
      Dirichlet Distribution:
      $$\mathrm{Dir}(\theta; \alpha) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}, \qquad \alpha \in \mathbb{R}_+^K$$
      A Beta distribution is the special case when K = 2.
      https://en.wikipedia.org/wiki/Dirichlet_distribution

  16. Two Primary Options
      Dirichlet Distribution:
      $$\mathrm{Dir}(\theta; \alpha) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}, \qquad \alpha \in \mathbb{R}_+^K$$
      (A Beta distribution is the special case when K = 2.)
      Logistic Normal:
      $$\theta \sim \mathcal{LN}(\mu, \Sigma) \;\Leftrightarrow\; \mathrm{logit}(\theta) \sim \mathcal{N}(\mu, \Sigma)$$
      $$\mathcal{LN}(\theta; \mu, \Sigma) \propto \frac{1}{\prod_k \theta_k} \exp\left(-\frac{1}{2} (z - \mu)^{T} \Sigma^{-1} (z - \mu)\right), \qquad z_k = \log \frac{\theta_k}{\theta_K}$$
      https://en.wikipedia.org/wiki/Logit-normal_distribution
      https://en.wikipedia.org/wiki/Dirichlet_distribution
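A minimal NumPy sketch (not from the slides) of drawing $\theta$ from each family; the logistic-normal draw inverts the $z_k = \log(\theta_k / \theta_K)$ mapping above, and the parameters are made-up examples:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 6

# Dirichlet: a single draw is itself a point on the (K-1)-simplex
alpha = np.ones(K)  # concentration parameters, all > 0
theta_dir = rng.dirichlet(alpha)

# Logistic normal: draw z ~ N(mu, Sigma) in K-1 dimensions, then map to the simplex:
# theta_k = exp(z_k) / (1 + sum_j exp(z_j)) for k < K, theta_K = 1 / (1 + sum_j exp(z_j))
mu, Sigma = np.zeros(K - 1), np.eye(K - 1)
z = rng.multivariate_normal(mu, Sigma)
expz = np.exp(z)
theta_ln = np.append(expz, 1.0) / (1.0 + expz.sum())

print(theta_dir.sum(), theta_ln.sum())  # both draws sum to 1
```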

  17. Outline
      β€’ Constrained Optimization
      β€’ Distributions of distributions
      β€’ Example 1: A Model of Rolling a Die
      β€’ Example 2: A Model of Conditional Die Rolls

  18. Generative Story for Rolling a Die
      N different (independent) rolls:
      $$p(x_1, x_2, \ldots, x_N) = p(x_1)\, p(x_2) \cdots p(x_N) = \prod_i p(x_i)$$
      Generative Story (the "for each" loop becomes a product):
      for roll $i = 1$ to $N$: $x_i \sim \mathrm{Cat}(\theta)$
      Calculate $p(x_i)$ according to the provided distribution.
      Example rolls: $x_1 = 1$, $x_2 = 5$, $x_3 = 4$, …

  19. Generative Story for Rolling a Die
      N different (independent) rolls:
      $$p(x_1, x_2, \ldots, x_N) = p(x_1)\, p(x_2) \cdots p(x_N) = \prod_i p(x_i)$$
      Generative Story (the "for each" loop becomes a product):
      for roll $i = 1$ to $N$: $x_i \sim \mathrm{Cat}(\theta)$
      Calculate $p(x_i)$ according to the provided distribution, where $\theta$ is a probability distribution over the 6 sides of the die:
      $$0 \le \theta_k \le 1 \;\; \forall k, \qquad \sum_{k=1}^{6} \theta_k = 1$$
      Example rolls: $x_1 = 1$, $x_2 = 5$, $x_3 = 4$, …
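A short sketch (not from the slides) of running this generative story forward, with a hypothetical $\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.1, 0.1, 0.3, 0.2, 0.1])  # hypothetical die probabilities
N = 9

# the "for each roll" loop: one categorical draw per roll, faces in {1, ..., 6}
rolls = np.array([rng.choice(6, p=theta) + 1 for _ in range(N)])
print(rolls)
```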

  20. Learning Parameters for the Die Model
      $$p(x_1, x_2, \ldots, x_N) = p(x_1)\, p(x_2) \cdots p(x_N) = \prod_i p(x_i)$$
      Maximize the (log-)likelihood to learn the probability parameters.
      Q: Why is maximizing log-likelihood a reasonable thing to do?

  21. Learning Parameters for the Die Model
      Maximize the (log-)likelihood to learn the probability parameters.
      Q: Why is maximizing log-likelihood a reasonable thing to do?
      A: To develop a good model for what we observe.

  22. Learning Parameters for the Die Model
      Q: Why is maximizing log-likelihood a reasonable thing to do?
      A: To develop a good model for what we observe.
      Q: (for discrete observations) What loss function do we minimize to maximize log-likelihood?

  23. Learning Parameters for the Die Model
      Q: (for discrete observations) What loss function do we minimize to maximize log-likelihood?
      A: Cross-entropy.
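To make the slide's answer explicit (a standard identity, not spelled out in the deck): with $n_k$ the number of rolls showing face $k$ and $\hat{p}_k = n_k / N$ the empirical frequency, the average negative log-likelihood is exactly the cross-entropy between $\hat{p}$ and the model $\theta$,

$$-\frac{1}{N} \sum_{i=1}^{N} \log \theta_{x_i} = -\sum_{k=1}^{K} \hat{p}_k \log \theta_k = H(\hat{p}, \theta),$$

so maximizing log-likelihood and minimizing cross-entropy select the same $\theta$.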

  24. Learning Parameters for the Die Model: Maximum Likelihood (Intuition)
      $$p(x_1, x_2, \ldots, x_N) = \prod_i p(x_i)$$
      Maximize the (log-)likelihood to learn the probability parameters.
      If you observe these 9 rolls… what are "reasonable" estimates for p(w)?
      p(1) = ? p(2) = ? p(3) = ? p(4) = ? p(5) = ? p(6) = ?

  25. Learning Parameters for the Die Model: Maximum Likelihood (Intuition)
      If you observe these 9 rolls… the maximum likelihood estimates are:
      p(1) = 2/9, p(2) = 1/9, p(3) = 1/9, p(4) = 3/9, p(5) = 1/9, p(6) = 1/9
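The nine rolls themselves were shown as an image; any multiset consistent with the slide's estimates (two 1s, one each of 2, 3, 5, and 6, and three 4s) reproduces them by counting, as in this sketch:

```python
import numpy as np

# a multiset of rolls consistent with the slide's estimates (order is not recoverable)
rolls = np.array([1, 1, 2, 3, 4, 4, 4, 5, 6])

counts = np.bincount(rolls, minlength=7)[1:]  # n_k for faces k = 1..6
theta_mle = counts / counts.sum()             # maximum likelihood estimate n_k / N
print(theta_mle)                              # [2/9, 1/9, 1/9, 3/9, 1/9, 1/9]
```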

  26. Learning Parameters for the Die Model: Maximum Likelihood (Math)
      N different (independent) rolls: $p(x_1, \ldots, x_N) = \prod_i p(x_i)$; example rolls $x_1 = 1$, $x_2 = 5$, $x_3 = 4$, …
      Q: What's the generative story?

  27. Learning Parameters for the Die Model: Maximum Likelihood (Math)
      Generative Story: for roll $i = 1$ to $N$: $x_i \sim \mathrm{Cat}(\theta)$
      Q: What's the objective?

  28. Learning Parameters for the Die Model: Maximum Likelihood (Math)
      Generative Story: for roll $i = 1$ to $N$: $x_i \sim \mathrm{Cat}(\theta)$
      Maximize the log-likelihood:
      $$\mathcal{L}(\theta) = \sum_i \log p_\theta(x_i) = \sum_i \log \theta_{x_i}$$
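Evaluating this objective is straightforward; the sketch below reuses the assumed rolls from above, with the slide-25 estimates as a candidate $\theta$:

```python
import numpy as np

rolls = np.array([1, 1, 2, 3, 4, 4, 4, 5, 6])  # same assumed data as above
theta = np.array([2, 1, 1, 3, 1, 1]) / 9.0     # candidate parameters (the MLE)

# L(theta) = sum_i log theta_{x_i}; faces are 1-indexed, arrays 0-indexed
log_lik = np.log(theta[rolls - 1]).sum()
print(log_lik)
```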

  29. Learning Parameters for the Die Model: Maximum Likelihood (Math)
      Maximize the log-likelihood: $\mathcal{L}(\theta) = \sum_i \log \theta_{x_i}$
      Q: What's an easy way to maximize this, as written exactly (even without calculus)?

  30. Learning Parameters for the Die Model: Maximum Likelihood (Math)
      Q: What's an easy way to maximize this, as written exactly (even without calculus)?
      A: Just keep increasing $\theta_k$ (we know $\theta$ must be a distribution, but that constraint is not specified in the objective).

  31. Learning Parameters for the Die Model: Maximum Likelihood (Math)
      Maximize the log-likelihood, with distribution constraints:
      $$\mathcal{L}(\theta) = \sum_i \log \theta_{x_i} \quad \text{s.t.} \quad \sum_{k=1}^{6} \theta_k = 1$$
      (We can include the inequality constraints $0 \le \theta_k$, but they complicate the problem and, right now, are not needed.)
      Solve using Lagrange multipliers.
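Carrying the Lagrange-multiplier solution through (a standard derivation; $n_k$ is the number of rolls showing face $k$):

$$\Lambda(\theta, \lambda) = \sum_{k=1}^{6} n_k \log \theta_k + \lambda \left(1 - \sum_{k=1}^{6} \theta_k\right), \qquad \frac{\partial \Lambda}{\partial \theta_k} = \frac{n_k}{\theta_k} - \lambda = 0 \;\Rightarrow\; \theta_k = \frac{n_k}{\lambda}.$$

The constraint $\sum_k \theta_k = 1$ forces $\lambda = \sum_k n_k = N$, so $\theta_k^{\text{MLE}} = n_k / N$: exactly the counting intuition from slide 25.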
