

SLIDE 1

ML, MAP Estimation and Bayesian

CE-717: Machine Learning, Sharif University of Technology, Fall 2019, Soleymani

SLIDE 2

Outline

• Introduction
• Maximum-Likelihood (ML) estimation
• Maximum A Posteriori (MAP) estimation
• Bayesian inference

SLIDE 3

Relation of learning & statistics

• The target model in a learning problem can be considered as a statistical model.
• For a fixed set of data and an underlying target (statistical model), estimation methods try to estimate the target from the available data.

SLIDE 4

Density estimation

• Estimating the probability density function $p(\mathbf{x})$, given a set of data points $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$ drawn from it.
• Main approaches of density estimation:
  ◦ Parametric: assume a parameterized model for the density function; a number of parameters are optimized by fitting the model to the data set.
  ◦ Nonparametric (instance-based): no specific parametric model is assumed; the form of the density function is determined entirely by the data.
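
To make the parametric/nonparametric contrast concrete, here is a minimal sketch (our illustration, not from the slides) that fits both kinds of estimator to the same sample; the data and all parameter values are arbitrary assumptions:

```python
# Sketch: parametric vs. nonparametric density estimation on the same data.
# The sample and its parameters are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=2.0, size=500)

# Parametric: optimize the two parameters (mu, sigma) of an assumed Gaussian.
mu, sigma = stats.norm.fit(data)

# Nonparametric: kernel density estimate, shaped entirely by the data.
kde = stats.gaussian_kde(data)

x0 = 0.5
print(stats.norm.pdf(x0, mu, sigma), kde(x0)[0])  # two estimates of p(x0)
```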

SLIDE 5

Parametric density estimation

• Estimating the probability density function $p(\mathbf{x})$, given a set of data points $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$ drawn from it.
• Assume that $p(\mathbf{x})$ has a specific functional form with a number of adjustable parameters.
• Methods for parameter estimation:
  ◦ Maximum Likelihood (ML) estimation
  ◦ Maximum A Posteriori (MAP) estimation

SLIDE 6

Parametric density estimation

• Goal: estimate the parameters of a distribution from a dataset $\mathcal{D} = \{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}\}$.
• $\mathcal{D}$ contains $N$ independent, identically distributed (i.i.d.) training samples.
• We need to determine $\boldsymbol{\theta}$ given $\{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}\}$.
• How to represent $\boldsymbol{\theta}$: a point estimate $\boldsymbol{\theta}^*$ or a distribution $p(\boldsymbol{\theta})$?

SLIDE 7

Example

$$p(x|\mu) = \mathcal{N}(x|\mu, 1)$$

SLIDE 8

Example

SLIDE 9

Maximum Likelihood Estimation (MLE)

• Maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given data.
• Likelihood is the conditional probability of the observations $\mathcal{D} = \{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(N)}\}$ given the value of the parameters $\boldsymbol{\theta}$.
• Assuming i.i.d. observations:

$$p(\mathcal{D}|\boldsymbol{\theta}) = \prod_{i=1}^{N} p(\mathbf{x}^{(i)}|\boldsymbol{\theta})$$

• Maximum Likelihood estimation:

$$\hat{\boldsymbol{\theta}}_{ML} = \arg\max_{\boldsymbol{\theta}} \, p(\mathcal{D}|\boldsymbol{\theta}) \quad \text{(the likelihood of } \boldsymbol{\theta} \text{ w.r.t. the samples)}$$

SLIDE 10

Maximum Likelihood Estimation (MLE)

$\hat{\boldsymbol{\theta}}_{ML}$ is the value of $\boldsymbol{\theta}$ that best agrees with the observed samples.


SLIDE 13

Maximum Likelihood Estimation (MLE)

$$\mathcal{L}(\boldsymbol{\theta}) = \ln p(\mathcal{D}|\boldsymbol{\theta}) = \ln \prod_{i=1}^{N} p(\mathbf{x}^{(i)}|\boldsymbol{\theta}) = \sum_{i=1}^{N} \ln p(\mathbf{x}^{(i)}|\boldsymbol{\theta})$$

$$\hat{\boldsymbol{\theta}}_{ML} = \arg\max_{\boldsymbol{\theta}} \, \mathcal{L}(\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{N} \ln p(\mathbf{x}^{(i)}|\boldsymbol{\theta})$$

• Thus, we solve $\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) = \mathbf{0}$ to find the optimum.
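
When $\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) = \mathbf{0}$ has no convenient closed form, the log-likelihood can also be maximized numerically. A minimal sketch (not from the slides), assuming a Gaussian likelihood with known $\sigma = 1$, where the numerical MLE should match the sample mean:

```python
# Sketch: numerical MLE by minimizing the negative log-likelihood.
# Gaussian likelihood with known sigma = 1 (an assumption for illustration).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=100)  # x^(1..N), true mu = 2

def neg_log_likelihood(mu, sigma=1.0):
    # -L(mu) = -sum_i ln p(x^(i) | mu)
    return (0.5 / sigma**2) * np.sum((data - mu) ** 2) \
        + len(data) * np.log(np.sqrt(2 * np.pi) * sigma)

res = minimize_scalar(neg_log_likelihood)
print(res.x, data.mean())  # numerical optimum ~ sample mean
```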

SLIDE 14

MLE Bernoulli

• Given: $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$, $m$ heads (1), $N - m$ tails (0)

$$p(x|\theta) = \theta^{x}(1-\theta)^{1-x}$$

$$p(\mathcal{D}|\theta) = \prod_{i=1}^{N} p(x^{(i)}|\theta) = \prod_{i=1}^{N} \theta^{x^{(i)}}(1-\theta)^{1-x^{(i)}}$$

$$\ln p(\mathcal{D}|\theta) = \sum_{i=1}^{N} \ln p(x^{(i)}|\theta) = \sum_{i=1}^{N} \left\{ x^{(i)} \ln\theta + \left(1-x^{(i)}\right) \ln(1-\theta) \right\}$$

$$\frac{\partial \ln p(\mathcal{D}|\theta)}{\partial \theta} = 0 \;\Rightarrow\; \hat{\theta}_{ML} = \frac{\sum_{i=1}^{N} x^{(i)}}{N} = \frac{m}{N}$$
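
A quick numerical check of the closed form $\hat{\theta}_{ML} = m/N$ (our illustration; the toss data below are made up):

```python
# Sketch: Bernoulli MLE, closed form vs. grid search (made-up tosses).
import numpy as np

data = np.array([1, 1, 0, 1, 0, 1, 1, 0])
m, N = data.sum(), len(data)
theta_ml = m / N                          # closed-form MLE

grid = np.linspace(1e-6, 1 - 1e-6, 10001)
loglik = m * np.log(grid) + (N - m) * np.log(1 - grid)
print(theta_ml, grid[np.argmax(loglik)])  # both ~ 0.625
```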

SLIDE 15

MLE Bernoulli: example

• Example: $\mathcal{D} = \{1,1,1\}$, $\hat{\theta}_{ML} = \frac{3}{3} = 1$
• Prediction: all future tosses will land heads up
• Overfitting to $\mathcal{D}$

SLIDE 16

MLE: Multinomial distribution

• Multinomial distribution (on a variable with $K$ states):

$$P(\mathbf{x}|\boldsymbol{\theta}) = \prod_{k=1}^{K} \theta_k^{x_k}$$

Parameter space: $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$ with $\theta_k \in [0,1]$ and $\sum_{k=1}^{K} \theta_k = 1$

Data: $\mathbf{x} = (x_1, \ldots, x_K)$ with $x_k \in \{0,1\}$ and $\sum_{k=1}^{K} x_k = 1$, so that $P(x_k = 1) = \theta_k$

(Figure: bar chart of $\theta_1, \theta_2, \theta_3$ for a 3-state example.)

SLIDE 17

MLE: Multinomial distribution

$$\mathcal{D} = \{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(N)}\}$$

$$P(\mathcal{D}|\boldsymbol{\theta}) = \prod_{i=1}^{N} P(\mathbf{x}^{(i)}|\boldsymbol{\theta}) = \prod_{i=1}^{N} \prod_{k=1}^{K} \theta_k^{x_k^{(i)}} = \prod_{k=1}^{K} \theta_k^{\sum_{i=1}^{N} x_k^{(i)}}$$

Maximizing subject to $\sum_k \theta_k = 1$ with a Lagrange multiplier $\lambda$:

$$\mathcal{L}(\boldsymbol{\theta}, \lambda) = \ln p(\mathcal{D}|\boldsymbol{\theta}) + \lambda \Big(1 - \sum_{k=1}^{K} \theta_k \Big)$$

$$\hat{\theta}_k = \frac{\sum_{i=1}^{N} x_k^{(i)}}{N} = \frac{N_k}{N}, \qquad N_k = \sum_{i=1}^{N} x_k^{(i)}, \qquad \sum_{k=1}^{K} N_k = N$$
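
The multinomial MLE is simply the vector of empirical state frequencies $N_k/N$; a small sketch with made-up 1-of-$K$ data:

```python
# Sketch: multinomial MLE = per-state frequencies (made-up one-hot data).
import numpy as np

X = np.array([[1, 0, 0],   # each row is a 1-of-K sample x^(i), K = 3
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1],
              [1, 0, 0]])
N_k = X.sum(axis=0)           # counts N_k = sum_i x_k^(i)
theta_hat = N_k / X.shape[0]  # theta_k = N_k / N
print(theta_hat)              # [0.6 0.2 0.2]
```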

SLIDE 18

MLE Gaussian: unknown $\mu$

$$p(x|\mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$$

$$\ln p(x^{(i)}|\mu) = -\ln\!\left(\sqrt{2\pi}\,\sigma\right) - \frac{1}{2\sigma^2}\left(x^{(i)} - \mu\right)^2$$

$$\frac{\partial \mathcal{L}(\mu)}{\partial \mu} = 0 \;\Rightarrow\; \frac{\partial}{\partial \mu} \sum_{i=1}^{N} \ln p(x^{(i)}|\mu) = 0 \;\Rightarrow\; \sum_{i=1}^{N} \frac{1}{\sigma^2}\left(x^{(i)} - \mu\right) = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}$$

MLE corresponds to many well-known estimation methods.

SLIDE 19

MLE Gaussian: unknown $\mu$ and $\sigma$

With $\boldsymbol{\theta} = (\mu, \sigma)$, solve $\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) = \mathbf{0}$:

$$\frac{\partial \mathcal{L}(\mu,\sigma)}{\partial \mu} = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}$$

$$\frac{\partial \mathcal{L}(\mu,\sigma)}{\partial \sigma} = 0 \;\Rightarrow\; \hat{\sigma}^2_{ML} = \frac{1}{N} \sum_{i=1}^{N} \left(x^{(i)} - \hat{\mu}_{ML}\right)^2$$
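
These two expressions are the sample mean and the biased ($1/N$) sample variance. A sketch checking them on synthetic data (the true parameters below are arbitrary); note that `np.var` uses the same $1/N$ convention by default:

```python
# Sketch: Gaussian MLE for mu and sigma^2 on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # assumed true mu=5, sigma=2

mu_ml = x.mean()                    # (1/N) sum_i x^(i)
var_ml = ((x - mu_ml) ** 2).mean()  # (1/N) sum_i (x^(i) - mu_ml)^2 (biased)
print(mu_ml, var_ml, x.var())       # np.var defaults to the same 1/N form
```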

SLIDE 20

Maximum A Posteriori (MAP) estimation

• MAP estimation:

$$\hat{\boldsymbol{\theta}}_{MAP} = \arg\max_{\boldsymbol{\theta}} \, p(\boldsymbol{\theta}|\mathcal{D})$$

• Since $p(\boldsymbol{\theta}|\mathcal{D}) \propto p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})$:

$$\hat{\boldsymbol{\theta}}_{MAP} = \arg\max_{\boldsymbol{\theta}} \, p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})$$

• Example of a prior distribution: $p(\theta) = \mathcal{N}(\theta_0, \sigma^2)$

SLIDE 21

MAP estimation Gaussian: unknown $\mu$

$$p(x|\mu) \sim \mathcal{N}(\mu, \sigma^2), \qquad p(\mu|\mu_0) \sim \mathcal{N}(\mu_0, \sigma_0^2)$$

$$\frac{d}{d\mu} \ln \left[ p(\mu) \prod_{i=1}^{N} p(x^{(i)}|\mu) \right] = 0 \;\Rightarrow\; \sum_{i=1}^{N} \frac{1}{\sigma^2}\left(x^{(i)} - \mu\right) - \frac{1}{\sigma_0^2}(\mu - \mu_0) = 0$$

$$\Rightarrow\; \hat{\mu}_{MAP} = \frac{\mu_0 + \frac{\sigma_0^2}{\sigma^2} \sum_{i=1}^{N} x^{(i)}}{1 + \frac{\sigma_0^2}{\sigma^2} N}$$

$$\frac{\sigma_0^2}{\sigma^2} \gg 1 \;\text{ or }\; N \to \infty \;\Rightarrow\; \hat{\mu}_{MAP} = \hat{\mu}_{ML} = \frac{\sum_{i=1}^{N} x^{(i)}}{N}$$

($\mu$ is the only unknown parameter; $\mu_0$ and $\sigma_0$ are known.)
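
A sketch of the $\hat{\mu}_{MAP}$ formula on synthetic data (the prior hyperparameters $\mu_0, \sigma_0$ and the true mean below are arbitrary choices). With few samples the estimate is pulled toward $\mu_0$; as $N$ grows it approaches $\hat{\mu}_{ML}$, matching the limit on the slide:

```python
# Sketch: MAP estimate of a Gaussian mean with a Gaussian prior.
import numpy as np

sigma, mu0, sigma0 = 1.0, 0.0, 0.5   # likelihood sd and prior (assumed)
rng = np.random.default_rng(2)

for N in (5, 50, 5000):
    x = rng.normal(loc=3.0, scale=sigma, size=N)  # assumed true mu = 3
    r = sigma0**2 / sigma**2
    mu_map = (mu0 + r * x.sum()) / (1 + r * N)
    print(N, mu_map, x.mean())  # MAP shrinks toward mu0, then matches ML
```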

SLIDE 22

Maximum A Posteriori (MAP) estimation

• Given a set of observations $\mathcal{D}$ and a prior distribution $p(\boldsymbol{\theta})$ on the parameters, find the parameter vector that maximizes $p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})$.
• For the Gaussian mean, the posterior mean interpolates between the prior mean and the ML estimate:

$$\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\hat{\mu}_{ML}$$

(Figure: two likelihood/prior configurations, one where $\hat{\theta}_{MAP} \cong \hat{\theta}_{ML}$ and one where $\hat{\theta}_{MAP} > \hat{\theta}_{ML}$, depending on the prior.)

SLIDE 23

MAP estimation Gaussian: unknown $\mu$ (known $\sigma$)

More samples $\Longrightarrow$ sharper $p(\mu|\mathcal{D})$, i.e., higher confidence in the estimation.

$$p(\mu|\mathcal{D}) \propto p(\mu) \, p(\mathcal{D}|\mu), \qquad p(\mu|\mathcal{D}) = \mathcal{N}(\mu \,|\, \mu_N, \sigma_N^2)$$

$$\mu_N = \frac{\mu_0 + \frac{\sigma_0^2}{\sigma^2} \sum_{i=1}^{N} x^{(i)}}{1 + \frac{\sigma_0^2}{\sigma^2} N}, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$$

[Bishop]

SLIDE 24

Conjugate Priors

• We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties.
• Choose a prior such that the posterior distribution, which is proportional to $p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})$, has the same functional form as the prior:

$$\forall \boldsymbol{\alpha}, \mathcal{D} \;\; \exists \boldsymbol{\alpha}' \quad P(\boldsymbol{\theta}|\boldsymbol{\alpha}') \propto P(\mathcal{D}|\boldsymbol{\theta}) \, P(\boldsymbol{\theta}|\boldsymbol{\alpha})$$

where prior and posterior have the same functional form.

SLIDE 25

Prior for Bernoulli Likelihood

• Beta distribution over $\theta \in [0,1]$:

$$\text{Beta}(\theta|\alpha_1, \alpha_0) = \frac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0)\Gamma(\alpha_1)} \, \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1} \propto \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1}$$

• The Beta distribution is the conjugate prior of the Bernoulli: $P(x|\theta) = \theta^x (1-\theta)^{1-x}$

$$E[\theta] = \frac{\alpha_1}{\alpha_0 + \alpha_1}, \qquad \hat{\theta} = \frac{\alpha_1 - 1}{(\alpha_0 - 1) + (\alpha_1 - 1)} \;\;\text{(the most probable } \theta\text{)}$$
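
SciPy's `beta` distribution has the same functional form, with its `(a, b)` playing the role of $(\alpha_1, \alpha_0)$ here, so the mean and mode formulas can be sanity-checked directly (the hyperparameter values below are arbitrary):

```python
# Sketch: Beta mean and mode vs. the closed forms (arbitrary a1, a0).
from scipy.stats import beta

a1, a0 = 3.0, 2.0
prior = beta(a1, a0)          # scipy's (a, b) = (alpha_1, alpha_0)

print(prior.mean(), a1 / (a0 + a1))        # E[theta] = 0.6 both ways
print((a1 - 1) / ((a0 - 1) + (a1 - 1)))    # mode: most probable theta ~ 0.667
```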

SLIDE 26

Beta distribution

SLIDE 27

Bernoulli likelihood: posterior

Given: $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$, $m$ heads (1), $N - m$ tails (0), where $m = \sum_{i=1}^{N} x^{(i)}$

$$p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta) \, p(\theta) = \left[ \prod_{i=1}^{N} \theta^{x^{(i)}} (1-\theta)^{1-x^{(i)}} \right] \text{Beta}(\theta|\alpha_1, \alpha_0) \propto \theta^{m+\alpha_1-1} (1-\theta)^{N-m+\alpha_0-1}$$

$$\Rightarrow\; p(\theta|\mathcal{D}) = \text{Beta}(\theta|\alpha_1', \alpha_0') \propto \theta^{\alpha_1'-1} (1-\theta)^{\alpha_0'-1}, \qquad \alpha_1' = \alpha_1 + m, \quad \alpha_0' = \alpha_0 + N - m$$
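
The conjugate update is just hyperparameter bookkeeping; a sketch with made-up tosses and an assumed Beta(2, 2) prior:

```python
# Sketch: conjugate Beta update for a Bernoulli likelihood.
import numpy as np

data = np.array([1, 1, 1, 0, 1])       # made-up tosses
a1, a0 = 2.0, 2.0                       # assumed prior hyperparameters
m, N = data.sum(), len(data)

a1_post, a0_post = a1 + m, a0 + N - m   # alpha_1' and alpha_0'
print(a1_post, a0_post)                 # posterior is Beta(theta | 6, 3)
```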

SLIDE 28

Example

Bernoulli likelihood $p(x|\theta) = \theta^x (1-\theta)^{1-x}$ with prior Beta: $\alpha_0 = \alpha_1 = 2$; $\mathcal{D} = \{1,1,1\} \Rightarrow N = 3, m = 3$

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \, P(\theta|\mathcal{D}) = \frac{\alpha_1' - 1}{(\alpha_1' - 1) + (\alpha_0' - 1)} = \frac{4}{5}$$

Posterior Beta: $\alpha_1' = 5$, $\alpha_0' = 2$

(Figure: prior $p(\theta)$, likelihood $p(x=1|\theta)$, and posterior as functions of $\theta$.)

SLIDE 29

Toss example

• MAP estimation can avoid overfitting:
  ◦ $\mathcal{D} = \{1,1,1\}$, $\hat{\theta}_{ML} = 1$
  ◦ $\hat{\theta}_{MAP} = 0.8$ (with prior $p(\theta) = \text{Beta}(\theta|2,2)$)
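
Reproducing the two numbers on this slide:

```python
# Sketch: ML vs. MAP on D = {1,1,1} with a Beta(2,2) prior.
data = [1, 1, 1]
m, N = sum(data), len(data)
a1, a0 = 2, 2

theta_ml = m / N                                              # 1.0 (overfits)
theta_map = (a1 + m - 1) / ((a0 + N - m - 1) + (a1 + m - 1))  # posterior mode
print(theta_ml, theta_map)                                    # 1.0  0.8
```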

SLIDE 30

Bayesian inference

• Parameters $\boldsymbol{\theta}$ are treated as random variables with an a priori distribution.
• Bayesian estimation utilizes the available prior information about the unknown parameter.
• As opposed to ML and MAP estimation, it does not seek a specific point estimate of the unknown parameter vector $\boldsymbol{\theta}$.
• The observed samples $\mathcal{D}$ convert the prior density $p(\boldsymbol{\theta})$ into a posterior density $p(\boldsymbol{\theta}|\mathcal{D})$.
• We keep track of our beliefs about $\boldsymbol{\theta}$'s values and use these beliefs to reach conclusions.
• In the Bayesian approach, we first specify $p(\boldsymbol{\theta}|\mathcal{D})$ and then compute the predictive distribution $p(\mathbf{x}|\mathcal{D})$.

SLIDE 31

Bayesian estimation: predictive distribution

• Given a set of samples $\mathcal{D} = \{\mathbf{x}^{(i)}\}_{i=1}^{N}$, a prior distribution on the parameters $P(\boldsymbol{\theta})$, and the form of the distribution $P(\mathbf{x}|\boldsymbol{\theta})$.
• We find $P(\boldsymbol{\theta}|\mathcal{D})$ and then use it to specify $\hat{P}(\mathbf{x}) = P(\mathbf{x}|\mathcal{D})$ as an estimate of $P(\mathbf{x})$:

$$P(\mathbf{x}|\mathcal{D}) = \int P(\mathbf{x}, \boldsymbol{\theta}|\mathcal{D}) \, d\boldsymbol{\theta} = \int P(\mathbf{x}|\mathcal{D}, \boldsymbol{\theta}) \, P(\boldsymbol{\theta}|\mathcal{D}) \, d\boldsymbol{\theta} = \int P(\mathbf{x}|\boldsymbol{\theta}) \, P(\boldsymbol{\theta}|\mathcal{D}) \, d\boldsymbol{\theta}$$

(Predictive distribution: if we knew the value of the parameters $\boldsymbol{\theta}$, we would know exactly the distribution of $\mathbf{x}$.)
• Analytical solutions exist only for very special forms of the involved functions.

SLIDE 32

Bernoulli likelihood: prediction

• Training samples: $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$

$$P(\theta) = \text{Beta}(\theta|\alpha_1, \alpha_0) \propto \theta^{\alpha_1-1}(1-\theta)^{\alpha_0-1}$$

$$P(\theta|\mathcal{D}) = \text{Beta}(\theta|\alpha_1 + m, \alpha_0 + N - m) \propto \theta^{\alpha_1+m-1}(1-\theta)^{\alpha_0+N-m-1}$$

$$P(x|\mathcal{D}) = \int P(x|\theta) \, P(\theta|\mathcal{D}) \, d\theta = E_{p(\theta|\mathcal{D})}\!\left[P(x|\theta)\right] \;\Rightarrow\; P(x=1|\mathcal{D}) = E_{p(\theta|\mathcal{D})}[\theta] = \frac{\alpha_1 + m}{\alpha_0 + \alpha_1 + N}$$
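
A sketch checking the closed-form predictive probability against direct numerical integration of $\int \theta \, \text{Beta}(\theta|\alpha_1', \alpha_0') \, d\theta$, using the running example ($\alpha_0 = \alpha_1 = 2$, $\mathcal{D} = \{1,1,1\}$):

```python
# Sketch: Bernoulli posterior predictive, closed form vs. numerical integral.
import numpy as np
from scipy.stats import beta

a1, a0, m, N = 2.0, 2.0, 3, 3
p_heads = (a1 + m) / (a0 + a1 + N)           # closed form: 5/7 ~ 0.714

theta = np.linspace(0.0, 1.0, 100_001)
post = beta.pdf(theta, a1 + m, a0 + N - m)   # Beta(theta | 5, 2)
dt = theta[1] - theta[0]
print(p_heads, np.sum(theta * post) * dt)    # both ~ 0.714
```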

SLIDE 33

ML, MAP, and Bayesian Estimation

• If $p(\boldsymbol{\theta}|\mathcal{D})$ has a sharp peak at $\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}$ (i.e., $p(\boldsymbol{\theta}|\mathcal{D}) \approx \delta(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})$), then $p(\mathbf{x}|\mathcal{D}) \approx p(\mathbf{x}|\hat{\boldsymbol{\theta}})$.
  ◦ In this case, the Bayesian estimate will be approximately equal to the MAP estimate.
• If $p(\mathcal{D}|\boldsymbol{\theta})$ is concentrated around a sharp peak and $p(\boldsymbol{\theta})$ is broad enough around this peak, the ML, MAP, and Bayesian estimations yield approximately the same result.
• All three methods asymptotically ($N \to \infty$) result in the same estimate.
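
A sketch of this asymptotic agreement for Bernoulli data: with an assumed true $\theta = 0.3$ and a Beta(2, 2) prior, the ML estimate, the MAP estimate, and the Bayesian predictive probability all converge as $N$ grows:

```python
# Sketch: ML, MAP, and Bayesian predictive agree as N -> infinity.
import numpy as np

rng = np.random.default_rng(4)
a1, a0 = 2.0, 2.0                 # assumed Beta prior
theta_true = 0.3                  # assumed ground truth

for N in (10, 100, 10_000):
    x = rng.binomial(1, theta_true, size=N)
    m = x.sum()
    ml = m / N
    map_ = (a1 + m - 1) / (a0 + a1 + N - 2)   # Beta posterior mode
    bayes = (a1 + m) / (a0 + a1 + N)          # P(x=1 | D)
    print(N, ml, map_, bayes)                 # all approach 0.3
```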

SLIDE 34

Summary

• ML and MAP result in a single (point) estimate of the unknown parameter vector.
  ◦ Simpler and more interpretable than Bayesian estimation.
• The Bayesian approach finds a predictive distribution using all the available information:
  ◦ expected to give better results
  ◦ needs higher computational complexity
• Bayesian methods have gained a lot of popularity over the recent decade due to advances in computer technology.
• All three methods asymptotically ($N \to \infty$) result in the same estimate.

SLIDE 35

Resource

• C. Bishop, "Pattern Recognition and Machine Learning", Springer, 2006, Chapter 2.