ML, MAP Estimation and Bayesian Inference
CE-717: Machine Learning
Sharif University of Technology, Fall 2019
Soleymani
Outline
- Introduction
- Maximum-Likelihood (ML) estimation
- Maximum A Posteriori (MAP) estimation
- Bayesian inference
Relation of learning & statistics
- The target model in a learning problem can be considered as a statistical model.
- For a fixed set of data and an underlying target (statistical model), estimation methods try to estimate the target from the available data.
Density estimation
- Estimating the probability density function $p(x)$, given a set of data points $\{x^{(i)}\}_{i=1}^{N}$ drawn from it.
- Main approaches to density estimation:
  - Parametric: a parameterized model is assumed for the density function; a number of parameters are optimized by fitting the model to the data set.
  - Nonparametric (instance-based): no specific parametric model is assumed; the form of the density function is determined entirely by the data.
Parametric density estimation
- Estimating the probability density function $p(x)$, given a set of data points $\{x^{(i)}\}_{i=1}^{N}$ drawn from it.
- Assume that $p(x)$ has a specific functional form with a number of adjustable parameters.
- Methods for parameter estimation:
  - Maximum Likelihood (ML) estimation
  - Maximum A Posteriori (MAP) estimation
Parametric density estimation
- Goal: estimate the parameters of a distribution from a dataset $\mathcal{D} = \{x^{(1)}, \dots, x^{(N)}\}$.
- $\mathcal{D}$ contains $N$ independent, identically distributed (i.i.d.) training samples.
- We need to determine $\theta$ given $\{x^{(1)}, \dots, x^{(N)}\}$.
- How should we represent $\theta$: as a point estimate $\hat{\theta}$, or as a distribution $p(\theta)$?
Example
$$p(y|\mu) = \mathcal{N}(y|\mu, 1)$$
[Figure]
Maximum Likelihood Estimation (MLE)
- Maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given data.
- The likelihood is the conditional probability of the observations $\mathcal{D} = \{y^{(1)}, y^{(2)}, \dots, y^{(N)}\}$ given the value of the parameters $\theta$.
- Assuming i.i.d. observations:
$$p(\mathcal{D}|\theta) = \prod_{i=1}^{N} p(y^{(i)}|\theta)$$
- Maximum likelihood estimation:
$$\hat{\theta}_{ML} = \underset{\theta}{\operatorname{argmax}}\ p(\mathcal{D}|\theta)$$
where $p(\mathcal{D}|\theta)$ is the likelihood of $\theta$ w.r.t. the samples.
Maximum Likelihood Estimation (MLE)
[Figures: $\hat{\theta}$ best agrees with the observed samples]
Maximum Likelihood Estimation (MLE)
- Log-likelihood:
$$\ell(\theta) = \ln p(\mathcal{D}|\theta) = \ln \prod_{i=1}^{N} p(y^{(i)}|\theta) = \sum_{i=1}^{N} \ln p(y^{(i)}|\theta)$$
$$\hat{\theta}_{ML} = \underset{\theta}{\operatorname{argmax}}\ \ell(\theta) = \underset{\theta}{\operatorname{argmax}}\ \sum_{i=1}^{N} \ln p(y^{(i)}|\theta)$$
- Thus, we solve $\nabla_{\theta}\, \ell(\theta) = 0$ to find candidate optima; this stationarity condition yields the global maximum when $\ell$ is concave, as in the examples that follow.
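When no closed form exists, the same maximization can be done numerically. Below is a minimal sketch (not from the slides), assuming the unit-variance Gaussian model $p(y|\mu) = \mathcal{N}(y|\mu, 1)$ from the earlier example; the data values and the helper `neg_log_likelihood` are made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative data, assumed drawn from N(mu, 1).
y = np.array([1.2, 0.7, 2.1, 1.5, 0.9])

def neg_log_likelihood(mu):
    # -l(mu) = -sum_i ln N(y_i | mu, 1), with the additive constant dropped
    return 0.5 * np.sum((y - mu) ** 2)

result = minimize_scalar(neg_log_likelihood)
print(result.x)     # numerical maximizer of the likelihood
print(np.mean(y))   # agrees with the closed-form solution of grad l(mu) = 0
```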
MLE Bernoulli
- Given: $\mathcal{D} = \{y^{(1)}, y^{(2)}, \dots, y^{(N)}\}$ with $m$ heads (1) and $N - m$ tails (0):
$$p(y|\theta) = \theta^{y}(1-\theta)^{1-y}$$
$$p(\mathcal{D}|\theta) = \prod_{i=1}^{N} p(y^{(i)}|\theta) = \prod_{i=1}^{N} \theta^{y^{(i)}}(1-\theta)^{1-y^{(i)}}$$
$$\ln p(\mathcal{D}|\theta) = \sum_{i=1}^{N} \ln p(y^{(i)}|\theta) = \sum_{i=1}^{N} \left\{ y^{(i)} \ln\theta + \big(1 - y^{(i)}\big) \ln(1-\theta) \right\}$$
$$\frac{d \ln p(\mathcal{D}|\theta)}{d\theta} = 0 \;\Rightarrow\; \hat{\theta}_{ML} = \frac{\sum_{i=1}^{N} y^{(i)}}{N} = \frac{m}{N}$$
MLE Bernoulli: example
- Example: $\mathcal{D} = \{1,1,1\}$, so $\hat{\theta}_{ML} = \frac{3}{3} = 1$.
- Prediction: all future tosses will land heads up.
- This is overfitting to $\mathcal{D}$ (see the sketch below).
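As a quick check on the derivation, a minimal sketch (illustrative, not part of the slides) that computes $\hat{\theta}_{ML} = m/N$ from a sample of coin tosses:

```python
import numpy as np

def bernoulli_mle(y):
    # theta_ML = m / N: the fraction of heads in the sample
    return np.mean(y)

print(bernoulli_mle(np.array([1, 0, 1, 1, 0])))  # 0.6
print(bernoulli_mle(np.array([1, 1, 1])))        # 1.0 -- the overfitting case above
```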
MLE: Multinomial distribution
- Multinomial distribution (on a variable with $K$ states):
$$p(x|\theta) = \prod_{k=1}^{K} \theta_k^{x_k}$$
- Parameter space: $\theta = [\theta_1, \dots, \theta_K]$, $\theta_k \in [0,1]$, $\sum_{k=1}^{K} \theta_k = 1$.
- Each observation is one-hot coded: $x = [x_1, \dots, x_K]$, $x_k \in \{0,1\}$, $\sum_{k=1}^{K} x_k = 1$, so that $P(x_k = 1) = \theta_k$.
MLE: Multinomial distribution
- Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$:
$$p(\mathcal{D}|\theta) = \prod_{i=1}^{N} p(x^{(i)}|\theta) = \prod_{i=1}^{N} \prod_{k=1}^{K} \theta_k^{x_k^{(i)}} = \prod_{k=1}^{K} \theta_k^{\sum_{i=1}^{N} x_k^{(i)}}$$
- Maximizing subject to the constraint $\sum_{k} \theta_k = 1$ via a Lagrange multiplier:
$$\ell(\theta, \lambda) = \ln p(\mathcal{D}|\theta) + \lambda\Big(1 - \sum_{k=1}^{K} \theta_k\Big)$$
$$m_k = \sum_{i=1}^{N} x_k^{(i)}, \qquad \hat{\theta}_k = \frac{m_k}{N}, \qquad \sum_{k=1}^{K} m_k = N$$
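The same normalized-counts result, as a short sketch (illustrative; the one-hot data matrix is made up):

```python
import numpy as np

def multinomial_mle(X):
    # X: N x K matrix whose rows are one-hot encodings of the observed states
    m = X.sum(axis=0)        # m_k = number of times state k was observed
    return m / X.shape[0]    # theta_k = m_k / N

X = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
print(multinomial_mle(X))    # [0.5, 0.25, 0.25]
```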
MLE Gaussian: unknown $\mu$
$$p(y|\mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}$$
$$\ln p(y^{(i)}|\mu) = -\ln\big(\sqrt{2\pi}\,\sigma\big) - \frac{1}{2\sigma^2}\big(y^{(i)} - \mu\big)^2$$
$$\frac{\partial \ell(\mu)}{\partial \mu} = 0 \;\Rightarrow\; \frac{\partial}{\partial \mu} \sum_{i=1}^{N} \ln p(y^{(i)}|\mu) = 0 \;\Rightarrow\; \sum_{i=1}^{N} \frac{1}{\sigma^2}\big(y^{(i)} - \mu\big) = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N}\sum_{i=1}^{N} y^{(i)}$$
- MLE corresponds to many well-known estimation methods (here, the sample mean).
MLE Gaussian: unknown $\mu$ and $\sigma$
- With $\theta = [\mu, \sigma]$, solve $\nabla_{\theta}\, \ell(\theta) = 0$:
$$\frac{\partial \ell(\mu, \sigma)}{\partial \mu} = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N}\sum_{i=1}^{N} y^{(i)}$$
$$\frac{\partial \ell(\mu, \sigma)}{\partial \sigma} = 0 \;\Rightarrow\; \hat{\sigma}^2_{ML} = \frac{1}{N}\sum_{i=1}^{N} \big(y^{(i)} - \hat{\mu}_{ML}\big)^2$$
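Both closed-form estimates in a short sketch (illustrative data); note that $\hat{\sigma}^2_{ML}$ divides by $N$, which is exactly NumPy's default (biased) variance:

```python
import numpy as np

def gaussian_mle(y):
    mu = np.mean(y)                  # mu_ML = (1/N) sum_i y_i
    var = np.mean((y - mu) ** 2)     # sigma^2_ML = (1/N) sum_i (y_i - mu_ML)^2
    return mu, var

y = np.array([1.2, 0.7, 2.1, 1.5, 0.9])
print(gaussian_mle(y))
print(np.mean(y), np.var(y))         # np.var divides by N by default, matching MLE
```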
Maximum A Posteriori (MAP) estimation
- MAP estimation:
$$\hat{\theta}_{MAP} = \underset{\theta}{\operatorname{argmax}}\ p(\theta|\mathcal{D})$$
- Since $p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)\, p(\theta)$:
$$\hat{\theta}_{MAP} = \underset{\theta}{\operatorname{argmax}}\ p(\mathcal{D}|\theta)\, p(\theta)$$
- Example of a prior distribution: $p(\theta) = \mathcal{N}(\theta_0, \sigma_0^2)$
MAP estimation Gaussian: unknown $\mu$
- $\mu$ is the only unknown parameter; $\mu_0$ and $\sigma_0$ are known:
$$p(y|\mu) \sim \mathcal{N}(\mu, \sigma^2), \qquad p(\mu) \sim \mathcal{N}(\mu_0, \sigma_0^2)$$
$$\frac{\partial}{\partial \mu} \ln\Big( p(\mu) \prod_{i=1}^{N} p(y^{(i)}|\mu) \Big) = 0 \;\Rightarrow\; \sum_{i=1}^{N} \frac{1}{\sigma^2}\big(y^{(i)} - \mu\big) - \frac{1}{\sigma_0^2}\big(\mu - \mu_0\big) = 0$$
$$\Rightarrow\; \hat{\mu}_{MAP} = \frac{\mu_0 + \frac{\sigma_0^2}{\sigma^2} \sum_{i=1}^{N} y^{(i)}}{1 + \frac{\sigma_0^2}{\sigma^2}\, N}$$
- If $\frac{\sigma_0^2}{\sigma^2} \gg 1$ or $N \to \infty$: $\hat{\mu}_{MAP} \approx \hat{\mu}_{ML} = \frac{\sum_{i=1}^{N} y^{(i)}}{N}$
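A minimal sketch of $\hat{\mu}_{MAP}$ (illustrative values), showing the two limiting behaviors: a tight prior pulls the estimate toward $\mu_0$, while a broad prior recovers $\hat{\mu}_{ML}$:

```python
import numpy as np

def gaussian_map_mean(y, mu0, sigma0_sq, sigma_sq):
    # mu_MAP = (mu0 + (sigma0^2/sigma^2) * sum_i y_i) / (1 + (sigma0^2/sigma^2) * N)
    r = sigma0_sq / sigma_sq
    return (mu0 + r * np.sum(y)) / (1 + r * len(y))

y = np.array([1.2, 0.7, 2.1, 1.5, 0.9])
print(np.mean(y))                              # ML estimate: 1.28
print(gaussian_map_mean(y, 0.0, 0.01, 1.0))    # tight prior: pulled toward mu0 = 0
print(gaussian_map_mean(y, 0.0, 100.0, 1.0))   # broad prior: close to the ML estimate
```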
Maximum A Posteriori (MAP) estimation
- Given a set of observations $\mathcal{D}$ and a prior distribution $p(\theta)$ on the parameters, the parameter vector that maximizes $p(\mathcal{D}|\theta)\, p(\theta)$ is found.
- For the Gaussian-mean example, the MAP estimate is a weighted combination of the prior mean and the ML estimate:
$$\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML}$$
[Figure: prior $p(\theta)$, likelihood $p(\mathcal{D}|\theta)$, and posterior $p(\theta|\mathcal{D})$; cases $\hat{\theta}_{MAP} \approx \hat{\theta}_{ML}$ and $\hat{\theta}_{MAP} > \hat{\theta}_{ML}$]
MAP estimation Gaussian: unknown $\mu$ (known $\sigma$)
- More samples $\Rightarrow$ sharper $p(\mu|\mathcal{D})$ $\Rightarrow$ higher confidence in the estimate.
$$p(\mu|\mathcal{D}) \propto p(\mu)\, p(\mathcal{D}|\mu), \qquad p(\mu|\mathcal{D}) = \mathcal{N}(\mu|\mu_N, \sigma_N^2)$$
$$\mu_N = \frac{\mu_0 + \frac{\sigma_0^2}{\sigma^2} \sum_{i=1}^{N} y^{(i)}}{1 + \frac{\sigma_0^2}{\sigma^2}\, N}, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$$
[Figure: posterior $p(\mu)$ sharpening as $N$ grows; from Bishop]
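A short sketch of the full posterior update (illustrative values), showing that precisions add, so $\sigma_N^2$ shrinks as $N$ grows:

```python
import numpy as np

def gaussian_posterior(y, mu0, sigma0_sq, sigma_sq):
    N = len(y)
    # Precisions add: 1/sigma_N^2 = 1/sigma_0^2 + N/sigma^2
    sigmaN_sq = 1.0 / (1.0 / sigma0_sq + N / sigma_sq)
    r = sigma0_sq / sigma_sq
    muN = (mu0 + r * np.sum(y)) / (1 + r * N)
    return muN, sigmaN_sq

rng = np.random.default_rng(0)
for N in [1, 10, 100]:
    y = rng.normal(1.0, 1.0, size=N)                # illustrative samples from N(1, 1)
    print(N, gaussian_posterior(y, 0.0, 1.0, 1.0))  # posterior sharpens with N
```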
Conjugate Priors
- We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties.
- A prior is conjugate when the posterior, which is proportional to $p(\mathcal{D}|\theta)\, p(\theta)$, has the same functional form as the prior:
$$\forall \mathcal{D}, \alpha\ \ \exists \alpha' : \quad p(\theta|\alpha') \propto p(\mathcal{D}|\theta)\, p(\theta|\alpha)$$
where $p(\theta|\alpha)$ and $p(\theta|\alpha')$ have the same functional form.
Prior for Bernoulli Likelihood
- Beta distribution over $\theta \in [0,1]$:
$$\mathrm{Beta}(\theta|\beta_1, \beta_0) = \frac{\Gamma(\beta_0 + \beta_1)}{\Gamma(\beta_0)\,\Gamma(\beta_1)}\, \theta^{\beta_1 - 1} (1-\theta)^{\beta_0 - 1} \;\propto\; \theta^{\beta_1 - 1} (1-\theta)^{\beta_0 - 1}$$
- The Beta distribution is the conjugate prior of the Bernoulli: $p(y|\theta) = \theta^{y}(1-\theta)^{1-y}$
$$E[\theta] = \frac{\beta_1}{\beta_0 + \beta_1}, \qquad \hat{\theta} = \frac{\beta_1 - 1}{(\beta_0 - 1) + (\beta_1 - 1)} \ \text{(most probable } \theta\text{)}$$
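The mean and mode formulas can be checked against SciPy's Beta distribution; a small sketch with illustrative hyperparameters (note SciPy's `(a, b)` correspond to $(\beta_1, \beta_0)$ here):

```python
from scipy.stats import beta

b1, b0 = 5, 2                            # illustrative hyperparameters
prior = beta(b1, b0)                     # scipy's a = beta_1, b = beta_0
print(prior.mean())                      # E[theta] = b1 / (b0 + b1) = 5/7
print((b1 - 1) / ((b0 - 1) + (b1 - 1)))  # mode (most probable theta) = 4/5, valid for b1, b0 > 1
```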
Beta distribution
[Figure]
Bernoulli likelihood: posterior
- Given: $\mathcal{D} = \{y^{(1)}, y^{(2)}, \dots, y^{(N)}\}$ with $m = \sum_{i=1}^{N} y^{(i)}$ heads (1) and $N - m$ tails (0):
$$p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)\, p(\theta) = \prod_{i=1}^{N} \theta^{y^{(i)}} (1-\theta)^{1-y^{(i)}} \times \mathrm{Beta}(\theta|\beta_1, \beta_0) \;\propto\; \theta^{m + \beta_1 - 1} (1-\theta)^{N - m + \beta_0 - 1}$$
$$\Rightarrow\; p(\theta|\mathcal{D}) = \mathrm{Beta}(\theta|\beta_1', \beta_0'), \qquad \beta_1' = \beta_1 + m, \quad \beta_0' = \beta_0 + N - m$$
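The conjugate update is just two additions; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def beta_posterior(y, b1, b0):
    # Beta(b1, b0) prior + Bernoulli observations -> Beta(b1 + m, b0 + N - m)
    m = int(np.sum(y))       # number of heads
    N = len(y)
    return b1 + m, b0 + N - m

print(beta_posterior([1, 1, 1], b1=2, b0=2))  # (5, 2), as in the example that follows
```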
Example
- Given: $\mathcal{D} = \{1,1,1\}$, i.e. $N = 3$ tosses with $m = 3$ heads, and Bernoulli likelihood $p(y|\theta) = \theta^{y}(1-\theta)^{1-y}$.
- Prior: $\mathrm{Beta}(\theta|\beta_1, \beta_0)$ with $\beta_0 = \beta_1 = 2$.
- Posterior: $\mathrm{Beta}(\theta|\beta_1', \beta_0')$ with $\beta_1' = 5$, $\beta_0' = 2$.
$$\hat{\theta}_{MAP} = \underset{\theta}{\operatorname{argmax}}\ p(\theta|\mathcal{D}) = \frac{\beta_1' - 1}{(\beta_1' - 1) + (\beta_0' - 1)} = \frac{4}{5}$$
[Figure: prior $p(\theta)$, likelihood $p(y=1|\theta)$, and posterior $p(\theta|\mathcal{D})$]
Toss example
- MAP estimation can avoid overfitting:
- $\mathcal{D} = \{1,1,1\}$: $\hat{\theta}_{ML} = 1$, but $\hat{\theta}_{MAP} = 0.8$ (with prior $p(\theta) = \mathrm{Beta}(\theta|2,2)$).
Bayesian inference
- The parameters $\theta$ are treated as random variables with an a priori distribution.
- Bayesian estimation utilizes the available prior information about the unknown parameter.
- As opposed to ML and MAP estimation, it does not seek a specific point estimate of the unknown parameter vector $\theta$.
- The observed samples $\mathcal{D}$ convert the prior density $p(\theta)$ into a posterior density $p(\theta|\mathcal{D})$.
- We keep track of beliefs about $\theta$'s values and use these beliefs to reach conclusions.
- In the Bayesian approach, we first specify $p(\theta|\mathcal{D})$ and then compute the predictive distribution $p(x|\mathcal{D})$.
Bayesian estimation: predictive distribution
- Given a set of samples $\mathcal{D} = \{x^{(i)}\}_{i=1}^{N}$, a prior distribution $p(\theta)$ on the parameters, and the form of the distribution $p(x|\theta)$, we find $p(\theta|\mathcal{D})$ and then use it to specify $\hat{p}(x) = p(x|\mathcal{D})$ as an estimate of $p(x)$:
$$p(x|\mathcal{D}) = \int p(x, \theta|\mathcal{D})\, d\theta = \int p(x|\mathcal{D}, \theta)\, p(\theta|\mathcal{D})\, d\theta = \int p(x|\theta)\, p(\theta|\mathcal{D})\, d\theta$$
- The last step holds because if we know the value of the parameters $\theta$, we know the distribution of $x$ exactly, so $p(x|\mathcal{D}, \theta) = p(x|\theta)$.
- Analytical solutions exist only for very special forms of the involved functions.
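When the integral has no closed form, it can be approximated by Monte Carlo: draw $\theta^{(s)} \sim p(\theta|\mathcal{D})$ and average $p(x|\theta^{(s)})$. A sketch for the Beta-Bernoulli case (posterior values taken from the earlier coin example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate p(y=1|D) = integral theta * Beta(theta | b1', b0') dtheta
b1_post, b0_post = 5, 2                           # posterior from the coin example
theta = rng.beta(b1_post, b0_post, size=100_000)  # theta^(s) ~ p(theta|D)
print(np.mean(theta))                             # ~ 5/7, the exact value derived next
```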
Bernoulli likelihood: prediction
- Training samples: $\mathcal{D} = \{y^{(1)}, \dots, y^{(N)}\}$ with $m$ heads:
$$p(\theta) = \mathrm{Beta}(\theta|\beta_1, \beta_0) \propto \theta^{\beta_1 - 1}(1-\theta)^{\beta_0 - 1}$$
$$p(\theta|\mathcal{D}) = \mathrm{Beta}(\theta|\beta_1 + m, \beta_0 + N - m) \propto \theta^{\beta_1 + m - 1}(1-\theta)^{\beta_0 + N - m - 1}$$
$$p(y|\mathcal{D}) = \int p(y|\theta)\, p(\theta|\mathcal{D})\, d\theta = E_{p(\theta|\mathcal{D})}\big[p(y|\theta)\big] \;\Rightarrow\; p(y=1|\mathcal{D}) = E_{p(\theta|\mathcal{D})}[\theta] = \frac{\beta_1 + m}{\beta_0 + \beta_1 + N}$$
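The closed form in a short sketch (illustrative), compared against the ML and MAP point estimates for the three-heads example:

```python
def bernoulli_predictive(m, N, b1, b0):
    # p(y=1|D) = (b1 + m) / (b0 + b1 + N)
    return (b1 + m) / (b0 + b1 + N)

# D = {1,1,1} with a Beta(2,2) prior:
print(bernoulli_predictive(m=3, N=3, b1=2, b0=2))  # 5/7 ~= 0.714
# Compare: theta_ML = 1.0 and theta_MAP = 0.8 for the same data and prior.
```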
ML, MAP, and Bayesian Estimation
- If $p(\theta|\mathcal{D})$ has a sharp peak at $\theta = \hat{\theta}$ (i.e., $p(\theta|\mathcal{D}) \approx \delta(\theta - \hat{\theta})$), then $p(x|\mathcal{D}) \approx p(x|\hat{\theta})$.
  - In this case, the Bayesian estimate will be approximately equal to the MAP estimate.
- If $p(\mathcal{D}|\theta)$ is concentrated around a sharp peak and $p(\theta)$ is broad enough around this peak, the ML, MAP, and Bayesian estimates yield approximately the same result.
- All three methods asymptotically ($N \to \infty$) result in the same estimate (see the sketch below).
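A quick simulation (illustrative, not from the slides) of the asymptotic claim for the Beta-Bernoulli model: as $N$ grows, the ML estimate, the MAP estimate, and the Bayesian predictive mean all converge to the true $\theta$:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true, b1, b0 = 0.3, 2, 2          # illustrative true parameter and Beta(2,2) prior

for N in [10, 100, 10_000]:
    y = rng.random(N) < theta_true      # N Bernoulli(theta_true) tosses
    m = int(y.sum())
    ml = m / N                                   # theta_ML
    map_ = (b1 + m - 1) / (b0 + b1 + N - 2)      # posterior mode (MAP)
    bayes = (b1 + m) / (b0 + b1 + N)             # posterior predictive mean
    print(N, round(ml, 4), round(map_, 4), round(bayes, 4))
```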
Summary
- ML and MAP result in a single (point) estimate of the unknown parameter vector.
  - Simpler and more interpretable than Bayesian estimation.
- The Bayesian approach finds a predictive distribution using all the available information:
  - expected to give better results
  - requires higher computational cost
- Bayesian methods have gained a lot of popularity over the recent decade due to advances in computer technology.
- All three methods asymptotically ($N \to \infty$) result in the same estimate.
Resource
- [Bishop] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.