Machine Learning
Estimation
Hamid R. Rabiee
Spring 2015 http://ce.sharif.edu/courses//93-94/2/ce717-1/
Agenda
- Introduction
- Maximum Likelihood Estimation
- Maximum A Posteriori Estimation
- Bayesian Estimators
Density Estimation
Goal: model the probability distribution p(x) of a random variable x, given a finite set x1, . . . , xN of observations. A good estimator is:
- Unbiased: the sampling distribution of the estimator is centered on the true parameter value.
- Efficient: it has the smallest possible standard error compared to other estimators.
(A quick simulation of both properties follows below.)
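A quick simulation (my own sketch, not from the slides): the sample mean and the sample median are both centered on the true Gaussian mean, but the mean has the smaller standard error, i.e. it is the more efficient estimator.

```python
# Sketch: unbiasedness and efficiency of two estimators of a Gaussian mean.
import numpy as np

rng = np.random.default_rng(0)
true_mu, n, trials = 5.0, 100, 10_000

samples = rng.normal(true_mu, 1.0, size=(trials, n))
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

# Both sampling distributions center on the true value (unbiased)...
print(f"E[mean]   ≈ {means.mean():.4f},  E[median] ≈ {medians.mean():.4f}")
# ...but the mean has the smaller standard error (more efficient).
print(f"SE[mean]   ≈ {means.std():.4f}")    # ~ 1/sqrt(n) = 0.100
print(f"SE[median] ≈ {medians.std():.4f}")  # ~ sqrt(pi/2)/sqrt(n) ≈ 0.125
```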
Methods for parameter estimation:
- Maximum Likelihood Estimation (MLE)
- Maximum A Posteriori estimation (MAP)
Likelihood Function
Consider n independent observations of x: x1, ..., xn, where x follows $f(x; \theta)$. The joint pdf of the whole data sample is
$f(x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta).$
Now evaluate this function at the data sample actually obtained and regard it as a function of the parameter(s). This is the likelihood function:
$L(\theta) = L(\theta; x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i; \theta) \quad (x_i \text{ held constant}).$
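A minimal sketch of this shift in viewpoint (my own example, assuming an N(θ, 1) model and made-up data): the joint pdf, with the sample held fixed, becomes a function of θ alone.

```python
# Sketch: the likelihood L(theta) = prod_i f(x_i; theta), with the data fixed
# and theta varying. Model assumed here: N(theta, 1).
import numpy as np
from scipy.stats import norm

x = np.array([1.2, 0.8, 1.5, 1.1])          # fixed observed sample
thetas = np.linspace(0.0, 2.0, 5)           # candidate parameter values

for theta in thetas:
    L = np.prod(norm.pdf(x, loc=theta, scale=1.0))  # joint pdf as a function of theta
    print(f"theta = {theta:.2f}  L(theta) = {L:.6f}")
# The likelihood peaks near the sample mean (here 1.15).
```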
Maximum Likelihood Estimation (MLE)
For each sample point x, let $\hat\theta(x)$ be the parameter value at which the likelihood $L(\theta \mid x)$ attains its maximum as a function of θ. The MLE of θ based on a sample x is $\hat\theta(x)$: the parameter point for which the observed sample is most likely.
Maximum Likelihood Estimation (MLE)
If the likelihood function is differentiable in the $\theta_i$, candidates for the MLE are the values $(\hat\theta_1, \ldots, \hat\theta_k)$ that solve
$\frac{\partial}{\partial \theta_i} L(\theta \mid x) = 0, \quad i = 1, \ldots, k.$
Note that these solutions are only candidates: to find the exact MLE we should check that a candidate is a global maximum (second-order conditions, boundary values, and points where the derivative does not exist).
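As an illustration of the first-order condition (my own example, using an Exponential(θ) sample rather than anything from the slides), the score equation can be solved symbolically and the second derivative checked:

```python
# Sketch: solve dl/dtheta = 0 symbolically for an Exponential(rate=theta) sample.
# Log-likelihood: l(theta) = n*log(theta) - theta*sum(x_i)
import sympy as sp

theta, n, S = sp.symbols("theta n S", positive=True)  # S = sum of the x_i
loglik = n * sp.log(theta) - theta * S
candidates = sp.solve(sp.diff(loglik, theta), theta)
print(candidates)                   # [n/S]  ->  theta_hat = n / sum(x_i)
print(sp.diff(loglik, theta, 2))    # -n/theta**2 < 0, so the candidate is a maximum
```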
Example 1
Adapted from slides of Harvard University.
Example 2
MLE for Gaussian with unknown mean
Let $x_1, x_2, \ldots, x_n$ be iid samples from $N(\theta, 1)$. Find the MLE of θ.
Solution: the log-likelihood is $\ell(\theta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^2$. Setting $\ell'(\theta) = \sum_{i}(x_i - \theta) = 0$ gives $\hat\theta = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
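A quick numeric check of this result (my own sketch; the data are simulated): maximizing the N(θ, 1) log-likelihood numerically recovers the sample mean.

```python
# Sketch: the closed-form MLE (the sample mean) matches a numeric maximization
# of the N(theta, 1) log-likelihood.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=200)

neg_loglik = lambda theta: -norm.logpdf(x, loc=theta, scale=1.0).sum()
res = minimize_scalar(neg_loglik)

print(f"numeric MLE: {res.x:.5f}")
print(f"sample mean: {x.mean():.5f}")   # identical up to solver tolerance
```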
Maximum Likelihood Estimation (MLE)
Sometimes it is more convenient to use the log-likelihood $\ell(\theta) = \log L(\theta)$. Let $x_1, x_2, \ldots, x_n$ be iid samples from Bernoulli(p); then the likelihood function is
$L(p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum_i x_i}(1-p)^{n - \sum_i x_i},$
and maximizing $\ell(p)$ gives $\hat p = \frac{1}{n}\sum_i x_i$.
Invariance: if $\hat\theta$ is the MLE of θ, then for any function $\tau(\theta)$, the MLE of $\tau(\theta)$ is $\tau(\hat\theta)$.
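A small sketch (made-up coin flips) of the Bernoulli log-likelihood and its maximizer; the grid search just makes the closed form visible.

```python
# Sketch: Bernoulli(p) log-likelihood; the MLE is the sample proportion.
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # made-up coin flips
n, s = len(x), x.sum()

p_grid = np.linspace(0.01, 0.99, 981)
loglik = s * np.log(p_grid) + (n - s) * np.log(1 - p_grid)

print(f"grid maximizer:  {p_grid[np.argmax(loglik)]:.2f}")
print(f"closed form s/n: {s / n:.2f}")          # 7/10 = 0.70
# Invariance: the MLE of the odds p/(1-p) is simply (s/n)/(1 - s/n).
```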
Example 3
MLE for Gaussian with unknown mean and variance
Let $x_1, x_2, \ldots, x_n$ be iid samples from $N(\mu, \sigma^2)$. Find the MLE for $\theta = (\mu, \sigma^2)$.
Solution: maximizing the log-likelihood in both parameters gives $\hat\mu = \bar{x}$ and $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$.
Exercise: prove that this MLE of the variance is biased, with $E[\hat\sigma^2] = \frac{n-1}{n}\sigma^2$.
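A simulation sketch of that bias (my own example, true σ² = 4): the MLE underestimates the variance by the factor (n − 1)/n.

```python
# Sketch: bias of the MLE variance (divide by n) versus the unbiased
# estimator (divide by n-1) for a Gaussian with sigma^2 = 4.
import numpy as np

rng = np.random.default_rng(2)
n, trials = 10, 100_000

samples = rng.normal(0.0, 2.0, size=(trials, n))
var_mle = samples.var(axis=1, ddof=0)       # (1/n)     * sum (x_i - x_bar)^2
var_unbiased = samples.var(axis=1, ddof=1)  # (1/(n-1)) * sum (x_i - x_bar)^2

print(f"E[MLE variance]      ≈ {var_mle.mean():.3f}  (about (n-1)/n * 4 = 3.6)")
print(f"E[unbiased variance] ≈ {var_unbiased.mean():.3f}  (about 4.0)")
```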
Property of MLE
To use two-variable calculus to verify that a function $H(\theta_1, \theta_2)$ has a maximum at $(\hat\theta_1, \hat\theta_2)$, it must be shown that the following three conditions hold:
a) The first-order partial derivatives are zero: $\frac{\partial H}{\partial \theta_1} = \frac{\partial H}{\partial \theta_2} = 0$ at $(\hat\theta_1, \hat\theta_2)$.
b) At least one second-order partial derivative is negative: $\frac{\partial^2 H}{\partial \theta_1^2} < 0$ (or $\frac{\partial^2 H}{\partial \theta_2^2} < 0$).
c) The determinant of the Hessian of second-order derivatives is positive: $\frac{\partial^2 H}{\partial \theta_1^2}\frac{\partial^2 H}{\partial \theta_2^2} - \left(\frac{\partial^2 H}{\partial \theta_1 \partial \theta_2}\right)^2 > 0.$
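A sketch of the same checklist on a toy function with an obvious maximum (my own example; the same recipe applies to a two-parameter log-likelihood):

```python
# Sketch: verify conditions (a)-(c) for H(t1, t2) with a maximum at (1, 2).
import sympy as sp

t1, t2 = sp.symbols("t1 t2")
H = -(t1 - 1)**2 - 2*(t2 - 2)**2

grad = [sp.diff(H, v) for v in (t1, t2)]
crit = sp.solve(grad, (t1, t2))   # condition (a): {t1: 1, t2: 2}

hess = sp.hessian(H, (t1, t2))
print(crit)
print(hess[0, 0])                 # -2 < 0  (condition b)
print(sp.det(hess))               #  8 > 0  (condition c: Hessian determinant)
```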
Example 4
MLE for the multinomial distribution (hint: use Lagrange multipliers)
Solution: with category counts $N_1, \ldots, N_K$ out of $N = \sum_k N_k$ trials, maximize the log-likelihood $\ell(\theta) = \sum_k N_k \log \theta_k$ subject to $\sum_k \theta_k = 1$ (derivation on the next slide).
MLE: Multinomial Distribution
Form the Lagrangian $J(\theta, \lambda) = \sum_k N_k \log \theta_k + \lambda\left(1 - \sum_k \theta_k\right)$. Setting $\partial J / \partial \theta_k = N_k/\theta_k - \lambda = 0$ gives $\theta_k = N_k/\lambda$; the constraint $\sum_k \theta_k = 1$ then forces $\lambda = N$, so $\hat\theta_k = \frac{N_k}{N}$.
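The resulting estimator is just the vector of category proportions; a one-line check with made-up counts:

```python
# Sketch: the multinomial MLE is the vector of category proportions,
# matching the Lagrange-multiplier derivation above.
import numpy as np

counts = np.array([30, 50, 20])      # N_k for 3 categories (made up)
theta_hat = counts / counts.sum()    # theta_k = N_k / N
print(theta_hat)                     # [0.3 0.5 0.2]
```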
Example 5
MLE for the uniform distribution $U(0, \theta)$
Solution: the likelihood is $L(\theta) = \theta^{-n} \prod_{i=1}^{n} \mathbb{1}[0 \le x_i \le \theta]$, where $\mathbb{1}[\cdot]$ is the indicator function. $L(\theta)$ is zero for $\theta < \max_i x_i$ and strictly decreasing for $\theta \ge \max_i x_i$, so $\hat\theta = \max_i x_i$. (The likelihood is not differentiable at its maximum, so the derivative condition does not apply here.)
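A short sketch (simulated data, true θ = 5): the MLE is the sample maximum, which always sits slightly below the true θ.

```python
# Sketch: for U(0, theta) the likelihood theta^(-n) * 1[max x_i <= theta]
# decreases in theta, so the MLE is the sample maximum.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 5.0, size=50)   # true theta = 5
theta_hat = x.max()
print(f"MLE (sample max): {theta_hat:.3f}")  # slightly below 5: biased low
```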
Maximum A Posteriori (MAP) Estimation
Approximation: instead of averaging over all parameter values, consider only the most probable value (the value with the highest posterior probability).
- Usually a very good approximation, and much simpler.
- The MAP value ≠ the expected value of the posterior.
- MAP → ML as the amount of data grows (as long as the prior is nonzero everywhere).
Given a set of observations $\mathcal{D}$ and a prior distribution on the parameters, MAP finds the parameter vector that maximizes $p(\mathcal{D} \mid \theta)\, p(\theta)$.
Maximum A Posteriori (MAP) Estimation
Priors:
- Uninformative priors: e.g., a uniform distribution.
- Conjugate priors: the posterior has a closed-form representation in the same family as the prior.

Distribution    Conjugate prior
Binomial        Beta
Multinomial     Dirichlet
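A sketch of why conjugacy is convenient (hyperparameters and data are made up): with a Beta prior and Binomial data, the posterior is again a Beta, so the MAP estimate is available in closed form as the posterior mode.

```python
# Sketch of conjugacy: Beta(a, b) prior + Binomial data -> Beta posterior.
a, b = 2.0, 2.0                      # Beta prior hyperparameters (assumed)
n, s = 10, 7                         # n trials, s successes (made-up data)

a_post, b_post = a + s, b + n - s    # conjugate update: still a Beta
p_map = (a_post - 1) / (a_post + b_post - 2)   # mode of Beta(a', b')
p_mle = s / n

print(f"posterior: Beta({a_post:.0f}, {b_post:.0f})")
print(f"MAP: {p_map:.3f}   MLE: {p_mle:.3f}")  # prior shrinks MAP toward 0.5
```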
MAP vs. MLE
Adapted from slides of A. Zisserman.
MAP vs. MLE
MLE: choose the value that maximizes the probability of the observed data:
$\hat\theta_{MLE} = \arg\max_\theta p(\mathcal{D} \mid \theta).$
MAP: choose the value that is most probable given the observed data and the prior belief:
$\hat\theta_{MAP} = \arg\max_\theta p(\theta \mid \mathcal{D}) = \arg\max_\theta p(\mathcal{D} \mid \theta)\, p(\theta).$
When are MAP and MLE the same? When the prior is uniform over the parameter space, since a constant prior factor does not affect the arg max.
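A small demonstration of both points (my own numbers: a Beta(5, 5) prior on a coin with true p = 0.8, simulated flips): the prior's pull fades as n grows, and a uniform prior makes the two estimates identical.

```python
# Sketch: MAP approaches MLE as the sample grows.
import numpy as np

rng = np.random.default_rng(4)
a, b, true_p = 5.0, 5.0, 0.8         # assumed Beta prior and true parameter

for n in (10, 100, 10_000):
    s = rng.binomial(n, true_p)
    p_mle = s / n
    p_map = (a + s - 1) / (a + b + n - 2)   # mode of the Beta posterior
    print(f"n = {n:>6}  MLE = {p_mle:.3f}  MAP = {p_map:.3f}")
# With a uniform prior (a = b = 1), MLE and MAP coincide for every n.
```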
Example 6
MAP for a Gaussian with unknown mean and a Gaussian prior
Let $x_1, x_2, \ldots, x_N$ be iid samples from $N(\mu, \sigma^2)$ with prior $\mu \sim N(\mu_0, \sigma_0^2)$. Find the MAP estimate of μ.
Solution: maximizing $\log p(x \mid \mu) + \log p(\mu)$ gives
$\hat\mu_{MAP} = \frac{\sigma_0^2 \sum_{i=1}^{N} x_i + \sigma^2 \mu_0}{N\sigma_0^2 + \sigma^2},$
a weighted average of the sample information and the prior mean.
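A sketch of this closed form with assumed numbers (known σ² = 1, prior N(0, 0.25), simulated data): the MAP estimate is pulled from the sample mean toward the prior mean.

```python
# Sketch: MAP estimate of a Gaussian mean with a N(mu0, s0^2) prior.
import numpy as np

rng = np.random.default_rng(5)
sigma2, mu0, s0_2 = 1.0, 0.0, 0.25          # known variance, prior mean/variance
x = rng.normal(1.0, np.sqrt(sigma2), size=20)
n = len(x)

mu_map = (s0_2 * x.sum() + sigma2 * mu0) / (n * s0_2 + sigma2)
print(f"MLE (sample mean): {x.mean():.3f}")
print(f"MAP:               {mu_map:.3f}")   # pulled toward the prior mean 0
```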
Bayes Estimators
Suppose that we have a prior distribution $\pi(\theta)$ for θ. Let $f(x \mid \theta)$ be the sampling distribution; then the conditional (posterior) distribution of θ given the sample x is
$\pi(\theta \mid x) = \frac{f(x \mid \theta)\, \pi(\theta)}{m(x)},$
where $m(x)$ is the marginal distribution of x:
$m(x) = \int f(x \mid \theta)\, \pi(\theta)\, d\theta.$
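A grid-based sketch of this formula (my example: N(0, 1) prior, N(θ, 1) likelihood, a single observation): the marginal m(x) is simply the normalizing constant of the posterior.

```python
# Sketch: Bayes' rule evaluated on a discrete grid over theta.
import numpy as np
from scipy.stats import norm

thetas = np.linspace(-3.0, 3.0, 601)          # grid over the parameter
dtheta = thetas[1] - thetas[0]
prior = norm.pdf(thetas, 0.0, 1.0)            # pi(theta) = N(0, 1)
x_obs = 1.5                                   # a single observation

lik = norm.pdf(x_obs, loc=thetas, scale=1.0)  # f(x | theta)
unnorm = lik * prior
m_x = unnorm.sum() * dtheta                   # marginal m(x), the normalizer
posterior = unnorm / m_x                      # pi(theta | x)

post_mean = (thetas * posterior).sum() * dtheta
print(f"posterior mean ≈ {post_mean:.3f}")    # closed form gives x/2 = 0.75
```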
Example 7
Estimation for a Gaussian with unknown mean and a Gaussian prior
Let $x_1, \ldots, x_N$ be iid samples with $x_i \sim N(\theta, \sigma_0^2)$ and prior $\theta \sim N(\mu, \sigma^2)$.
Solution: the posterior is again Gaussian, $\theta \mid x \sim N(m, v)$, with
$v = \left(\frac{N}{\sigma_0^2} + \frac{1}{\sigma^2}\right)^{-1}, \qquad m = v\left(\frac{\sum_i x_i}{\sigma_0^2} + \frac{\mu}{\sigma^2}\right).$
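A sketch of this closed form with assumed numbers (σ₀² = 1, prior N(0, 4), simulated data): unlike MLE and MAP, the output is a full distribution.

```python
# Sketch: closed-form Gaussian posterior for the setup above.
import numpy as np

rng = np.random.default_rng(6)
s0_2, mu, s_2 = 1.0, 0.0, 4.0        # sampling variance, prior mean/variance
theta_true = 2.0
x = rng.normal(theta_true, np.sqrt(s0_2), size=30)
N = len(x)

post_var = 1.0 / (N / s0_2 + 1.0 / s_2)
post_mean = post_var * (x.sum() / s0_2 + mu / s_2)
print(f"posterior: N({post_mean:.3f}, {post_var:.4f})")
# We obtain a whole distribution over theta, not just a point estimate.
```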
Bayesian Estimators
Both ML and MAP return only a single, specific value for the parameter Θ. Bayesian estimation, by contrast, computes the full posterior distribution $P(\Theta \mid X)$. If the prior is well-behaved (i.e., it does not assign zero density to any feasible parameter value), then both the MLE and the Bayesian prediction converge to the same value as the number of training samples increases.
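A short illustration of the contrast (my own numbers: a uniform Beta(1, 1) prior, which is nonzero everywhere, and made-up counts): with little data the posterior is wide, so a point estimate hides real uncertainty; with lots of data the posterior concentrates on the MLE, as claimed above.

```python
# Sketch: full Beta posterior vs. point estimates for a coin-flip model.
from scipy.stats import beta

a, b = 1.0, 1.0                        # uniform prior, nonzero everywhere
for n, s in [(10, 7), (10_000, 7_000)]:
    post = beta(a + s, b + n - s)      # conjugate Beta posterior
    lo, hi = post.interval(0.95)
    print(f"n = {n:>6}: MLE = {s/n:.3f}, posterior mean = {post.mean():.3f}, "
          f"95% interval = [{lo:.3f}, {hi:.3f}]")
```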
Any Questions?
Spring 2015
http://ce.sharif.edu/courses//93-94/2/ce717-1/