  1. ML, MAP Estimation and Bayesian. CE-717: Machine Learning, Sharif University of Technology, Fall 2019. Soleymani

  2. Outline
  - Introduction
  - Maximum-Likelihood (ML) estimation
  - Maximum A Posteriori (MAP) estimation
  - Bayesian inference

  3. Relation of learning & statistics
  - The target model in a learning problem can be considered a statistical model.
  - For a fixed dataset and an underlying target (statistical model), estimation methods try to recover the target from the available data.

  4. Density estimation
  - Estimating the probability density function $p(\boldsymbol{x})$, given a set of data points $\{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$ drawn from it.
  - Main approaches to density estimation:
    - Parametric: assume a parameterized model for the density function; a number of parameters are optimized by fitting the model to the data set.
    - Nonparametric (instance-based): no specific parametric model is assumed; the form of the density function is determined entirely by the data.
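To make the contrast concrete, here is a minimal sketch (not from the slides; the synthetic sample and SciPy tools are illustrative assumptions) fitting the same data both ways: a parametric Gaussian fit versus a nonparametric kernel density estimate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=300)  # assumed synthetic sample

# Parametric: assume p(x) = N(mu, sigma^2); only two parameters are fit.
mu, sigma = x.mean(), x.std()
parametric = stats.norm(mu, sigma)

# Nonparametric: the density's form is determined entirely by the data.
kde = stats.gaussian_kde(x)

grid = np.linspace(-4, 4, 9)
print(parametric.pdf(grid))  # density under the fitted Gaussian
print(kde(grid))             # kernel density estimate at the same points
```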

  5. Parametric density estimation
  - Estimating the probability density function $p(\boldsymbol{x})$, given a set of data points $\{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$ drawn from it.
  - Assume that $p(\boldsymbol{x})$ has a specific functional form with a number of adjustable parameters.
  - Methods for parameter estimation:
    - Maximum likelihood (ML) estimation
    - Maximum A Posteriori (MAP) estimation

  6. Parametric density estimation
  - Goal: estimate the parameters of a distribution from a dataset $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)}\}$.
  - $\mathcal{D}$ contains $N$ independent, identically distributed (i.i.d.) training samples.
  - We need to determine $\boldsymbol{\theta}$ given $\{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)}\}$.
  - How to represent $\boldsymbol{\theta}$: a point estimate $\boldsymbol{\theta}^{*}$ or a distribution $p(\boldsymbol{\theta})$?

  7. Example: $P(x \mid \mu) = \mathcal{N}(x \mid \mu, 1)$ [figure]

  8. Example [figure]

  9. Maximum Likelihood Estimation (MLE)
  - Maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given data.
  - The likelihood is the conditional probability of the observations $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ given the value of the parameters $\boldsymbol{\theta}$.
  - Assuming i.i.d. observations:
$$p(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} p(x^{(i)} \mid \boldsymbol{\theta}) \quad \text{(the likelihood of } \boldsymbol{\theta} \text{ w.r.t. the samples)}$$
  - Maximum likelihood estimate:
$$\hat{\boldsymbol{\theta}}_{ML} = \arg\max_{\boldsymbol{\theta}} \; p(\mathcal{D} \mid \boldsymbol{\theta})$$

  10.–12. Maximum Likelihood Estimation (MLE) [figures only: the likelihood plotted as a function of $\theta$; $\hat{\theta}_{ML}$ is the value that best agrees with the observed samples]

  13. Maximum Likelihood Estimation (MLE)
$$\mathcal{L}(\boldsymbol{\theta}) = \ln p(\mathcal{D} \mid \boldsymbol{\theta}) = \ln \prod_{i=1}^{N} p(x^{(i)} \mid \boldsymbol{\theta}) = \sum_{i=1}^{N} \ln p(x^{(i)} \mid \boldsymbol{\theta})$$
$$\hat{\boldsymbol{\theta}}_{ML} = \arg\max_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{N} \ln p(x^{(i)} \mid \boldsymbol{\theta})$$
  - Thus, we solve $\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) = \boldsymbol{0}$ to find the global optimum.
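When no closed form is available, $\hat{\boldsymbol{\theta}}_{ML}$ can be found numerically by minimizing the negative log-likelihood. A minimal sketch for a Gaussian model, with synthetic data and SciPy's general-purpose optimizer as illustrative choices (not from the slides):

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=200)  # synthetic i.i.d. samples

def neg_log_likelihood(theta, x):
    # -L(theta) = -sum_i ln p(x_i | mu, sigma)
    mu, log_sigma = theta            # optimize log(sigma) to keep sigma > 0
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

result = optimize.minimize(neg_log_likelihood, x0=[0.0, 0.0],
                           args=(data,), method="Nelder-Mead")
mu_ml, sigma_ml = result.x[0], np.exp(result.x[1])
print(mu_ml, sigma_ml)  # should be close to the sample mean and std
```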

  14. MLE: Bernoulli
  Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ with $m$ heads (1) and $N - m$ tails (0):
$$p(x \mid \theta) = \theta^{x} (1-\theta)^{1-x}$$
$$p(\mathcal{D} \mid \theta) = \prod_{i=1}^{N} p(x^{(i)} \mid \theta) = \prod_{i=1}^{N} \theta^{x^{(i)}} (1-\theta)^{1-x^{(i)}}$$
$$\ln p(\mathcal{D} \mid \theta) = \sum_{i=1}^{N} \ln p(x^{(i)} \mid \theta) = \sum_{i=1}^{N} \left\{ x^{(i)} \ln\theta + \big(1-x^{(i)}\big) \ln(1-\theta) \right\}$$
$$\frac{\partial}{\partial\theta} \ln p(\mathcal{D} \mid \theta) = 0 \;\Rightarrow\; \hat{\theta}_{ML} = \frac{\sum_{i=1}^{N} x^{(i)}}{N} = \frac{m}{N}$$
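The closed form is one line of code. A small check with an assumed coin-toss sample (not from the slides):

```python
# Closed-form Bernoulli MLE: theta_ML = m / N, the sample frequency of heads.
import numpy as np

tosses = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # 1 = heads, 0 = tails
m, N = tosses.sum(), tosses.size
theta_ml = m / N
print(theta_ml)  # 0.7
```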

  15. MLE Bernoulli: example
  - Example: $\mathcal{D} = \{1,1,1\}$, so $\hat{\theta}_{ML} = \frac{3}{3} = 1$.
  - Prediction: all future tosses will land heads up.
  - This is overfitting to $\mathcal{D}$.

  16. MLE: Multinomial distribution
  - Multinomial distribution (on a variable with $K$ states):
$$P(\boldsymbol{x} \mid \boldsymbol{\theta}) = \prod_{k=1}^{K} \theta_k^{x_k}$$
  - Parameter space: $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$ with $\theta_k \in [0,1]$, $\sum_{k=1}^{K} \theta_k = 1$, and $P(x_k = 1) = \theta_k$.
  - One-hot representation: $\boldsymbol{x} = (x_1, \ldots, x_K)$ with $x_k \in \{0,1\}$ and $\sum_{k=1}^{K} x_k = 1$.

  17. MLE: Multinomial distribution
$$\mathcal{D} = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(N)}\}$$
$$P(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} P(\boldsymbol{x}^{(i)} \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} \prod_{k=1}^{K} \theta_k^{x_k^{(i)}} = \prod_{k=1}^{K} \theta_k^{\sum_{i} x_k^{(i)}} = \prod_{k=1}^{K} \theta_k^{N_k}$$
  where $N_k = \sum_{i=1}^{N} x_k^{(i)}$ and $\sum_{k=1}^{K} N_k = N$.
  Maximizing the log-likelihood under the constraint $\sum_k \theta_k = 1$ with a Lagrange multiplier:
$$\mathcal{L}(\boldsymbol{\theta}, \lambda) = \ln p(\mathcal{D} \mid \boldsymbol{\theta}) + \lambda \Big( 1 - \sum_{k=1}^{K} \theta_k \Big) \;\Rightarrow\; \hat{\theta}_k = \frac{N_k}{N}$$
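The resulting estimator is just per-state counting. A minimal sketch, assuming a small 3-state sample encoded as state indices rather than one-hot vectors (an illustrative choice):

```python
# Multinomial MLE theta_k = N_k / N: the per-state frequencies.
import numpy as np

K = 3
states = np.array([0, 2, 1, 0, 0, 2, 1, 0])   # observed state indices
counts = np.bincount(states, minlength=K)      # N_k for each state k
theta_ml = counts / counts.sum()
print(theta_ml)  # [0.5, 0.25, 0.25]
```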

  18. MLE Gaussian: unknown $\mu$
$$p(x \mid \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$$
$$\ln p(x^{(i)} \mid \mu) = -\frac{1}{2}\ln\big(2\pi\sigma^2\big) - \frac{1}{2\sigma^2}\big(x^{(i)} - \mu\big)^2$$
$$\frac{\partial \mathcal{L}(\mu)}{\partial \mu} = 0 \;\Rightarrow\; \frac{\partial}{\partial\mu}\sum_{i=1}^{N} \ln p(x^{(i)} \mid \mu) = 0 \;\Rightarrow\; \sum_{i=1}^{N} \frac{1}{\sigma^2}\big(x^{(i)} - \mu\big) = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}$$
  MLE corresponds to many well-known estimation methods.

  19. MLE Gaussian: unknown $\mu$ and $\sigma$
$$\boldsymbol{\theta} = (\mu, \sigma), \qquad \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) = \boldsymbol{0}$$
$$\frac{\partial \mathcal{L}(\mu, \sigma)}{\partial \mu} = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}$$
$$\frac{\partial \mathcal{L}(\mu, \sigma)}{\partial \sigma} = 0 \;\Rightarrow\; \hat{\sigma}^2_{ML} = \frac{1}{N}\sum_{i=1}^{N} \big(x^{(i)} - \hat{\mu}_{ML}\big)^2$$
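A quick check of these closed forms on an assumed synthetic sample (illustrative, not from the lecture); note that the ML variance uses the biased $1/N$ normalization, which is what NumPy's `var` computes by default:

```python
# Gaussian MLE: sample mean and (biased, 1/N) sample variance.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=-1.0, scale=2.0, size=500)

mu_ml = x.mean()                      # (1/N) * sum_i x_i
var_ml = np.mean((x - mu_ml) ** 2)    # (1/N) * sum_i (x_i - mu)^2 == x.var()
print(mu_ml, var_ml)
```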

  20. Maximum A Posteriori (MAP) estimation
  - MAP estimation:
$$\hat{\boldsymbol{\theta}}_{MAP} = \arg\max_{\boldsymbol{\theta}} \; p(\boldsymbol{\theta} \mid \mathcal{D})$$
  - Since $p(\boldsymbol{\theta} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})$:
$$\hat{\boldsymbol{\theta}}_{MAP} = \arg\max_{\boldsymbol{\theta}} \; p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})$$
  - Example of a prior distribution: $p(\theta) = \mathcal{N}(\theta_0, \sigma^2)$.

  21. MAP estimation for a Gaussian with unknown $\mu$
$$p(x \mid \mu) \sim \mathcal{N}(\mu, \sigma^2) \quad (\mu \text{ is the only unknown parameter})$$
$$p(\mu) \sim \mathcal{N}(\mu_0, \sigma_0^2) \quad (\mu_0 \text{ and } \sigma_0 \text{ are known})$$
$$\frac{d}{d\mu} \ln \Big[ p(\mu) \prod_{i=1}^{N} p(x^{(i)} \mid \mu) \Big] = 0 \;\Rightarrow\; \sum_{i=1}^{N} \frac{1}{\sigma^2}\big(x^{(i)} - \mu\big) - \frac{1}{\sigma_0^2}\big(\mu - \mu_0\big) = 0$$
$$\Rightarrow\; \hat{\mu}_{MAP} = \frac{\mu_0 + \frac{\sigma_0^2}{\sigma^2} \sum_{i=1}^{N} x^{(i)}}{1 + \frac{\sigma_0^2}{\sigma^2} N}$$
$$\frac{\sigma_0^2}{\sigma^2} \gg 1 \text{ or } N \to \infty \;\Rightarrow\; \hat{\mu}_{MAP} \approx \hat{\mu}_{ML} = \frac{\sum_{i=1}^{N} x^{(i)}}{N}$$
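This closed form drops directly into code. A minimal sketch, assuming synthetic data and illustrative prior hyperparameters (none of the numbers come from the slides):

```python
import numpy as np

def map_gaussian_mean(x, sigma2, mu0, sigma0_2):
    """mu_MAP = (mu0 + (sigma0^2/sigma^2) * sum(x)) / (1 + (sigma0^2/sigma^2) * N)."""
    r = sigma0_2 / sigma2
    return (mu0 + r * x.sum()) / (1 + r * x.size)

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=1.0, size=10)
print(map_gaussian_mean(x, sigma2=1.0, mu0=0.0, sigma0_2=0.25))  # pulled toward mu0
print(x.mean())  # mu_ML; the two coincide as N grows or the prior widens
```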

  22. Maximum A Posteriori (MAP) estimation
  - Given a set of observations $\mathcal{D}$ and a prior distribution $p(\boldsymbol{\theta})$ on the parameters, MAP finds the parameter vector that maximizes $p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})$.
  [figures: $p(\mathcal{D} \mid \theta)$ with a broad prior, where $\hat{\theta}_{MAP} \approx \hat{\theta}_{ML}$, and with an informative prior, where $\hat{\theta}_{MAP}$ is pulled away from $\hat{\theta}_{ML}$]
$$\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML}$$

  23. MAP estimation for a Gaussian with unknown $\mu$ (known $\sigma$)
$$p(\mu \mid \mathcal{D}) \propto p(\mu) \, p(\mathcal{D} \mid \mu), \qquad p(\mu \mid \mathcal{D}) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2)$$
$$\mu_N = \frac{\mu_0 + \frac{\sigma_0^2}{\sigma^2} \sum_{i=1}^{N} x^{(i)}}{1 + \frac{\sigma_0^2}{\sigma^2} N}, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$$
  [figure from Bishop: the posterior $p(\mu \mid \mathcal{D})$ sharpening as $N$ grows]
  More samples ⟹ a sharper $p(\mu \mid \mathcal{D})$ ⟹ higher confidence in the estimate.

  24. Conjugate priors
  - We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties.
  - Choose a prior such that the posterior distribution, which is proportional to $p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})$, has the same functional form as the prior:
$$\forall \boldsymbol{\beta}, \mathcal{D} \;\; \exists \boldsymbol{\beta}' : \quad P(\boldsymbol{\theta} \mid \boldsymbol{\beta}') \propto P(\mathcal{D} \mid \boldsymbol{\theta}) \, P(\boldsymbol{\theta} \mid \boldsymbol{\beta})$$
  where prior and posterior have the same functional form.

  25. Prior for the Bernoulli likelihood
  - Beta distribution over $\theta \in [0,1]$:
$$\text{Beta}(\theta \mid \alpha_1, \alpha_0) = \frac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0)\,\Gamma(\alpha_1)} \, \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1} \propto \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1}$$
  - Mean: $E[\theta] = \frac{\alpha_1}{\alpha_0 + \alpha_1}$; most probable value (mode): $\hat{\theta} = \frac{\alpha_1 - 1}{(\alpha_0 - 1) + (\alpha_1 - 1)}$.
  - The Beta distribution is the conjugate prior of the Bernoulli likelihood $P(x \mid \theta) = \theta^x (1-\theta)^{1-x}$.

  26. Beta distribution [figure: Beta densities for several settings of $(\alpha_1, \alpha_0)$]

  27. Bernoulli likelihood: posterior
  Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ with $m$ heads (1) and $N - m$ tails (0):
$$p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta) \, p(\theta) = \prod_{i=1}^{N} \theta^{x^{(i)}} (1-\theta)^{1-x^{(i)}} \times \text{Beta}(\theta \mid \alpha_1, \alpha_0)$$
$$\propto \theta^{m + \alpha_1 - 1} (1-\theta)^{N - m + \alpha_0 - 1} \propto \theta^{\alpha_1' - 1} (1-\theta)^{\alpha_0' - 1}$$
  where $m = \sum_{i=1}^{N} x^{(i)}$. Hence $p(\theta \mid \mathcal{D}) = \text{Beta}(\theta \mid \alpha_1', \alpha_0')$ with
$$\alpha_1' = \alpha_1 + m, \qquad \alpha_0' = \alpha_0 + N - m$$
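The conjugate update is a one-liner. The sketch below reuses the toss example from the following slides, a Beta(2, 2) prior with $\mathcal{D} = \{1,1,1\}$; the function names are illustrative:

```python
import numpy as np

def beta_posterior(tosses, alpha1, alpha0):
    """Posterior Beta parameters (alpha1 + m, alpha0 + N - m) for 0/1 tosses."""
    m, N = int(np.sum(tosses)), len(tosses)
    return alpha1 + m, alpha0 + N - m

def beta_mode(alpha1, alpha0):
    """Mode of Beta(theta | alpha1, alpha0), valid for alpha1, alpha0 > 1."""
    return (alpha1 - 1) / ((alpha1 - 1) + (alpha0 - 1))

a1, a0 = beta_posterior([1, 1, 1], alpha1=2, alpha0=2)
print(a1, a0)              # 5, 2
print(beta_mode(a1, a0))   # 0.8 = theta_MAP, vs. theta_ML = 1.0
```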

  28. Example
  - Likelihood (Bernoulli): $p(x \mid \theta) = \theta^x (1-\theta)^{1-x}$, so $p(x = 1 \mid \theta) = \theta$.
  - Prior: $\text{Beta}(\theta \mid \alpha_1, \alpha_0)$ with $\alpha_0 = \alpha_1 = 2$.
  - Given $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$ with $m$ heads and $N - m$ tails: $\mathcal{D} = \{1,1,1\} \Rightarrow N = 3,\; m = 3$.
  - Posterior: $\text{Beta}(\theta \mid \alpha_1', \alpha_0')$ with $\alpha_1' = 5$, $\alpha_0' = 2$.
$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid \mathcal{D}) = \frac{\alpha_1' - 1}{(\alpha_1' - 1) + (\alpha_0' - 1)} = \frac{4}{5}$$

  29. Toss example
  - MAP estimation can avoid overfitting:
  - $\mathcal{D} = \{1,1,1\}$ gives $\hat{\theta}_{ML} = 1$, but $\hat{\theta}_{MAP} = 0.8$ with the prior $p(\theta) = \text{Beta}(\theta \mid 2, 2)$.

  30. Bayesian inference
  - Treats the parameters $\boldsymbol{\theta}$ as random variables with an a priori distribution.
  - Bayesian estimation utilizes the available prior information about the unknown parameter.
  - As opposed to ML and MAP estimation, it does not seek a specific point estimate of the unknown parameter vector $\boldsymbol{\theta}$.
  - The observed samples $\mathcal{D}$ convert the prior density $p(\boldsymbol{\theta})$ into a posterior density $p(\boldsymbol{\theta} \mid \mathcal{D})$.
  - It keeps track of beliefs about the values of $\boldsymbol{\theta}$ and uses these beliefs for reaching conclusions.
  - In the Bayesian approach, we first specify $p(\boldsymbol{\theta} \mid \mathcal{D})$ and then compute the predictive distribution $p(\boldsymbol{x} \mid \mathcal{D})$; a worked instance is sketched below.
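For the Beta-Bernoulli model of the earlier slides, the predictive distribution has a closed form: integrating $\theta$ against the Beta posterior gives $p(x=1 \mid \mathcal{D}) = E[\theta \mid \mathcal{D}] = \alpha_1' / (\alpha_1' + \alpha_0')$. A minimal sketch reusing the running toss example (variable names are illustrative):

```python
# Bayesian predictive for Beta-Bernoulli: average over all theta values
# weighted by the posterior, instead of plugging in a point estimate.
a1, a0 = 2, 2                      # prior Beta(2, 2)
tosses = [1, 1, 1]                 # D = {1, 1, 1}
a1 += sum(tosses)                  # alpha1' = alpha1 + m        -> 5
a0 += len(tosses) - sum(tosses)    # alpha0' = alpha0 + (N - m)  -> 2
print(a1 / (a1 + a0))              # p(x=1|D) = 5/7 ~ 0.714, more conservative
                                   # than theta_MAP = 0.8 or theta_ML = 1.0
```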
