MLE/MAP With Latent Variables
CMSC 691, UMBC
Outline
- Constrained Optimization
- Distributions of distributions
- Example 1: A Model of Rolling a Die
- Example 2: A Model of Conditional Die Rolls
Lagrange multipliers

Assume an original optimization problem:

  maximize f(θ)  subject to  g(θ) = 0

We convert it to a new optimization problem:

  maximize ℒ(θ, λ) = f(θ) − λ g(θ)

Lagrange multipliers: an equivalent problem? Setting ∂ℒ/∂λ = −g(θ) = 0 recovers the constraint, so stationary points of ℒ correspond to constrained stationary points of the original problem.
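As a quick numeric sanity check (a sketch of my own, not from the slides), a constrained optimizer should agree with the Lagrange solution on a toy instance: maximize f(θ) = log θ_1 + log θ_2 subject to θ_1 + θ_2 = 1, whose Lagrangian yields θ = (1/2, 1/2).

```python
# Toy check: maximize log(theta_1) + log(theta_2) s.t. theta_1 + theta_2 = 1.
# The Lagrangian solution is theta = (1/2, 1/2); SLSQP should agree.
import numpy as np
from scipy.optimize import minimize

def neg_f(theta):
    # minimize the negative to maximize f
    return -np.sum(np.log(theta))

result = minimize(
    neg_f,
    x0=np.array([0.3, 0.7]),
    method="SLSQP",
    bounds=[(1e-9, 1.0)] * 2,                                # keep log() defined
    constraints=[{"type": "eq", "fun": lambda t: t.sum() - 1.0}],
)
print(result.x)  # ~ [0.5, 0.5]
```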
Outline
- Constrained Optimization
- Distributions of distributions
- Example 1: A Model of Rolling a Die
- Example 2: A Model of Conditional Die Rolls
Recap: Common Distributions

Bernoulli/Binomial, Categorical/Multinomial, Poisson, Normal, Gamma

Categorical: a single draw
- Finite R.V. taking one of K values: 1, 2, …, K
- X ∼ Cat(θ), θ ∈ ℝ^K
- p(X = 1) = θ_1, p(X = 2) = θ_2, …, p(X = K) = θ_K
- Generally, p(X = k) = ∏_k θ_k^𝟙[X = k]
- 𝟙[s] = 1 if s is true; 0 if s is false

Multinomial: sum of N iid Categorical draws
- Vector of size K representing how often value k was drawn
- X ∼ Multinomial(N, θ), X ∈ ℝ^K
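A minimal sketch (variable names are mine) of the Categorical/Multinomial relationship: the vector of counts from N iid Categorical draws is distributed as one Multinomial(N, θ) draw.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 6, 9
theta = np.full(K, 1.0 / K)                   # a fair six-sided die

draws = rng.choice(K, size=N, p=theta)        # N Categorical draws (values 0..K-1)
counts = np.bincount(draws, minlength=K)      # how often each value was drawn

multi = rng.multinomial(N, theta)             # one Multinomial(N, theta) draw
print(counts, multi)                          # both are K-vectors summing to N
```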
What if we want to make θ a random variable?
Distribution of (multinomial) distributions

If θ is a K-dimensional multinomial parameter, i.e.

  θ ∈ Δ^{K−1}:  θ_k ≥ 0,  Σ_k θ_k = 1

then we want some density F(α) that describes θ:

  θ ∼ F(α),  with  ∫_θ F(θ; α) dθ = 1
Two Primary Options

Dirichlet Distribution:

  Dir(θ; α) = [Γ(Σ_k α_k) / ∏_k Γ(α_k)] ∏_k θ_k^{α_k − 1},   α ∈ ℝ₊^K

A Beta distribution is the special case when K = 2.
https://en.wikipedia.org/wiki/Dirichlet_distribution

Logistic Normal:

  θ ∼ LN(μ, Σ)  ⟺  logit(θ) ∼ N(μ, Σ)

  LN(θ; μ, Σ) ∝ (∏_k θ_k)^{−1} exp(−½ (z − μ)ᵀ Σ^{−1} (z − μ)),  where z_k = log(θ_k / θ_K)

https://en.wikipedia.org/wiki/Logit-normal_distribution
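A short sketch (parameter values assumed, names mine) of drawing θ from each option; for the logistic normal it draws the (K−1)-dimensional Gaussian z and inverts z_k = log(θ_k / θ_K).

```python
import numpy as np

rng = np.random.default_rng(0)
K = 6

# Option 1: Dirichlet. alpha > 0 controls concentration on the simplex.
alpha = np.full(K, 2.0)
theta_dir = rng.dirichlet(alpha)

# Option 2: Logistic normal. Draw z ~ N(mu, Sigma) in K-1 dimensions, then
# invert z_k = log(theta_k / theta_K) to land back on the simplex.
mu, Sigma = np.zeros(K - 1), np.eye(K - 1)
z = rng.multivariate_normal(mu, Sigma)
expz = np.exp(z)
theta_ln = np.append(expz, 1.0) / (1.0 + expz.sum())

print(theta_dir.sum(), theta_ln.sum())  # both are valid distributions: sum to 1
```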
Outline
- Constrained Optimization
- Distributions of distributions
- Example 1: A Model of Rolling a Die
- Example 2: A Model of Conditional Die Rolls
Generative Story for Rolling a Die

  p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)

N different (independent) rolls: x_1 = 1, x_2 = 5, x_3 = 4, ⋯

Generative Story:

  for roll i = 1 to N:
    x_i ∼ Cat(θ)

θ is a probability distribution over the 6 sides of the die: Σ_{k=1}^{6} θ_k = 1 and 0 ≤ θ_k ≤ 1 for all k. The "for each" loop becomes a product; calculate p(x_i) according to the provided distribution.
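The generative story translates almost line-for-line into a simulation; a minimal sketch (θ values assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([2, 1, 1, 3, 1, 1]) / 9.0   # any distribution over 6 sides
N = 9

# for roll i = 1 to N: x_i ~ Cat(theta); sides reported as 1..6
rolls = rng.choice(6, size=N, p=theta) + 1

# the "for each" loop becomes a product in the joint likelihood
likelihood = np.prod(theta[rolls - 1])
print(rolls, likelihood)
```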
Learning Parameters for the Die Model

  p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)

Maximize the (log-)likelihood to learn the probability parameters.

Q: Why is maximizing log-likelihood a reasonable thing to do?
A: It develops a good model for what we observe.

Q: (for discrete observations) What loss function do we minimize to maximize log-likelihood?
A: Cross-entropy.
Learning Parameters for the Die Model: Maximum Likelihood (Intuition)

  p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)

If you observe these 9 rolls… what are "reasonable" estimates for p(w)?

Maximum likelihood estimates:
  p(1) = 2/9   p(2) = 1/9   p(3) = 1/9   p(4) = 3/9   p(5) = 1/9   p(6) = 1/9
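The 9 rolls themselves are not reproduced here, but the stated estimates imply their counts (two 1s, one 2, one 3, three 4s, one 5, one 6); a quick check that the MLE is just normalized counts:

```python
import numpy as np

# rolls reconstructed from the stated estimates above
rolls = np.array([1, 1, 2, 3, 4, 4, 4, 5, 6])
counts = np.bincount(rolls, minlength=7)[1:]   # counts for sides 1..6
theta_mle = counts / counts.sum()
print(theta_mle)   # [2/9, 1/9, 1/9, 3/9, 1/9, 1/9]
```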
Learning Parameters for the Die Model: Maximum Likelihood (Math)

  p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)

N different (independent) rolls: x_1 = 1, x_2 = 5, x_3 = 4, ⋯

Q: What's the generative story?

  for roll i = 1 to N:
    x_i ∼ Cat(θ)

Q: What's the objective? Maximize the log-likelihood:

  ℓ(θ) = Σ_i log p_θ(x_i) = Σ_i log θ_{x_i}

Q: What's an easy way to maximize this, as written exactly (even without calculus)?
A: Just keep increasing θ_k (we know θ must be a distribution, but the objective doesn't say so).

So maximize the log-likelihood with the distribution constraint:

  ℓ(θ) = Σ_i log θ_{x_i}   s.t.   Σ_{k=1}^{6} θ_k = 1

(We could also include the inequality constraints 0 ≤ θ_k, but they complicate the problem and, right now, are not needed.) Solve using Lagrange multipliers:

  ℱ(θ) = Σ_i log θ_{x_i} − λ (Σ_{k=1}^{6} θ_k − 1)

  ∂ℱ(θ)/∂θ_k = Σ_{i: x_i = k} 1/θ_k − λ
  ∂ℱ(θ)/∂λ = −Σ_{k=1}^{6} θ_k + 1

Setting ∂ℱ(θ)/∂θ_k = 0 gives θ_k = (Σ_{i: x_i = k} 1) / λ, and θ is optimal when Σ_{k=1}^{6} θ_k = 1. Substituting:

  θ_k = (Σ_{i: x_i = k} 1) / (Σ_j Σ_{i: x_i = j} 1) = N_k / N

The maximum likelihood estimate is just the observed proportion of each side.
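A numeric sanity check (my own sketch): a constrained optimizer on the log-likelihood should recover the closed form θ_k = N_k / N.

```python
import numpy as np
from scipy.optimize import minimize

rolls = np.array([1, 1, 2, 3, 4, 4, 4, 5, 6])       # the 9 rolls from before
counts = np.bincount(rolls, minlength=7)[1:].astype(float)

def neg_log_lik(theta):
    # negative of l(theta) = sum_i log theta_{x_i} = sum_k N_k log theta_k
    return -np.sum(counts * np.log(theta))

result = minimize(
    neg_log_lik,
    x0=np.full(6, 1 / 6),
    method="SLSQP",
    bounds=[(1e-9, 1.0)] * 6,
    constraints=[{"type": "eq", "fun": lambda t: t.sum() - 1.0}],
)
print(result.x)               # numeric optimum
print(counts / counts.sum())  # closed form N_k / N -- they should match
```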
Learning Parameters for the Die Model: Maximum Likelihood (Math) with θ as RV

  p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)

N different (independent) rolls: x_1 = 1, x_2 = 5, x_3 = 4, ⋯

Generative Story with θ as RV:

  θ ∼ Dir(α)
  for roll i = 1 to N:
    x_i ∼ Cat(θ)

Objective with θ as RV:

  ℓ(θ) = log Dir(θ; α) + Σ_i log θ_{x_i}   s.t.   Σ_{k=1}^{6} θ_k = 1
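The slides stop at the objective, but running the same Lagrange derivation with the Dirichlet term included gives the standard MAP closed form θ_k = (N_k + α_k − 1) / (N + Σ_j α_j − K), assuming each α_k > 1; a sketch:

```python
import numpy as np

rolls = np.array([1, 1, 2, 3, 4, 4, 4, 5, 6])
counts = np.bincount(rolls, minlength=7)[1:].astype(float)
alpha = np.full(6, 2.0)   # symmetric Dirichlet prior: one pseudo-count per side

# MAP estimate: counts smoothed by alpha_k - 1 pseudo-counts
theta_map = (counts + alpha - 1) / (counts.sum() + alpha.sum() - len(alpha))
print(theta_map, theta_map.sum())   # a smoothed estimate; still sums to 1
```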
Outline
- Constrained Optimization
- Distributions of distributions
- Example 1: A Model of Rolling a Die
- Example 2: A Model of Conditional Die Rolls
Example: Conditionally Rolling a Die

  p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)

  p(z_1, x_1, z_2, x_2, …, z_N, x_N) = p(z_1) p(x_1 | z_1) ⋯ p(z_N) p(x_N | z_N) = ∏_i p(x_i | z_i) p(z_i)

Add complexity to better explain what we see: first flip a coin (z_1 = T, z_2 = H, ⋯), then roll a different die depending on the coin flip (x_1 = 1, x_2 = 5, ⋯).
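A direct simulation of this conditional story (parameter values assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = 0.5                                             # p(z = heads)
gamma = {
    "H": np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1]),   # die used after heads
    "T": np.full(6, 1 / 6),                          # die used after tails
}

N = 5
z = np.where(rng.random(N) < pi, "H", "T")           # z_i ~ Bernoulli(pi)
x = np.array([rng.choice(6, p=gamma[zi]) + 1 for zi in z])  # x_i ~ Cat(gamma[z_i])
print(list(zip(z, x)))
```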
Learning in Conditional Die Roll Model: Maximize (Log-)Likelihood

  p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_i p(x_i | z_i) p(z_i)

If you observe the z_i values, this is easy!

First: write the generative story.

  π = distribution over the coin (z)
  γ(H) = distribution for the die when the coin comes up heads
  γ(T) = distribution for the die when the coin comes up tails
  for item i = 1 to N:
    z_i ∼ Bernoulli(π)
    x_i ∼ Cat(γ(z_i))

Second: turn the generative story into an objective, with Lagrange multipliers for the distribution constraints:

  ℱ(θ) = Σ_i (log π_{z_i} + log γ^{(z_i)}_{x_i}) − λ (Σ_{k=1}^{2} π_k − 1) − Σ_{k=1}^{2} λ_k (Σ_{j=1}^{6} γ^{(k)}_j − 1)

But if you don't observe the z_i values, this is not easy!
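In the fully observed case, the Lagrange derivation again reduces every parameter to normalized counts; a sketch (observed z and x values assumed for illustration):

```python
import numpy as np

z = np.array(["T", "H", "H", "T", "H"])   # observed coin flips
x = np.array([1, 5, 4, 2, 5])             # observed die rolls

# pi: fraction of flips landing on each side of the coin
pi_mle = {c: np.mean(z == c) for c in ("H", "T")}

# gamma^(c): per-coin normalized counts over the 6 die sides
gamma_mle = {c: np.bincount(x[z == c], minlength=7)[1:] / np.sum(z == c)
             for c in ("H", "T")}
print(pi_mle)
print(gamma_mle)   # each value is a distribution over the 6 sides
```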
Example: Conditionally Rolling a Die

  p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_i p(x_i | z_i) p(z_i)

We don't actually observe the z values; we just see the items w. The goal is still to maximize the (log-)likelihood, but:
- If we did observe z, estimating the probability parameters would be easy… but we don't! :(
- If we knew the probability parameters, we could estimate z and evaluate the likelihood… but we don't! :(

[Graphical model: each observed item w_i depends on its own latent z_i.]

Goal: maximize the marginalized (log-)likelihood, summing over what z could have been:

  p(x_1, x_2, …, x_N) = Σ_{z_1} p(z_1, x_1) · Σ_{z_2} p(z_2, x_2) ⋯ Σ_{z_N} p(z_N, x_N)
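A sketch (parameter values assumed) of evaluating that marginalized likelihood: each factor sums p(z) p(x_i | z) over the two coin outcomes, so no z values are needed.

```python
import numpy as np

pi = {"H": 0.5, "T": 0.5}
gamma = {
    "H": np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1]),
    "T": np.full(6, 1 / 6),
}
x = np.array([1, 5, 4, 2, 5])   # observed rolls; the z_i are never seen

# p(x_1..x_N) = prod_i sum_z p(z) p(x_i | z)
marginal = np.prod([sum(pi[z] * gamma[z][xi - 1] for z in ("H", "T"))
                    for xi in x])
print(marginal)   # this is the quantity the learner must maximize in theta
```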