

SLIDE 1

Machine Learning

Estimation

Hamid R. Rabiee

Spring 2015 http://ce.sharif.edu/courses//93-94/2/ce717-1/

SLIDE 2


Agenda

• Introduction
• Maximum Likelihood Estimation
• Maximum A Posteriori Estimation
• Bayesian Estimators

SLIDE 3


Density Estimation

• Model the probability distribution p(x) of a random variable x, given a finite set x1, . . . , xN of observations.
• A good estimator is:
  • Unbiased: the sampling distribution of the estimator centers around the parameter value (see the simulation sketch after this list).
  • Efficient: smallest possible standard error, compared to other estimators.
• Methods for parameter estimation:
  • Maximum Likelihood Estimation (MLE)
  • Maximum A Posteriori estimation (MAP)
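As a quick illustration of the bias criterion, here is a minimal simulation sketch (not from the slides; the true variance, sample size, and seed are arbitrary) comparing the maximum-likelihood variance estimator, which divides by n, against the unbiased sample variance, which divides by n - 1:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var, n, trials = 4.0, 10, 100_000

# Draw many samples of size n from N(0, true_var).
samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))

# MLE of the variance divides by n (ddof=0, biased);
# the classical sample variance divides by n - 1 (ddof=1, unbiased).
print("true variance :", true_var)
print("mean of MLE   :", samples.var(axis=1, ddof=0).mean())  # about 3.6 = (n-1)/n * 4
print("mean unbiased :", samples.var(axis=1, ddof=1).mean())  # about 4.0
```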

SLIDE 4


Likelihood Function

• Consider n independent observations of x: x1, ..., xn, where x follows f(x; θ). The joint pdf for the whole data sample is f(x1, ..., xn; θ) = ∏_{i=1}^{n} f(xi; θ).
• Now evaluate this function with the data sample obtained and regard it as a function of the parameter(s). This is the likelihood function: L(θ) = ∏_{i=1}^{n} f(xi; θ), with the xi held constant.
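To make the "function of the parameter, data held constant" point concrete, here is a small sketch assuming a N(θ, 1) model and a made-up four-point sample; the log-likelihood is scanned over a grid of θ values:

```python
import numpy as np

x = np.array([1.2, 0.7, 1.9, 1.4])      # observed sample, held constant
theta_grid = np.linspace(-1.0, 3.0, 401)

# log L(theta | x) = -n/2 log(2 pi) - 1/2 sum_i (x_i - theta)^2 for N(theta, 1)
log_lik = np.array([-0.5 * np.sum((x - t) ** 2) for t in theta_grid])
log_lik -= 0.5 * len(x) * np.log(2 * np.pi)

print("grid maximizer:", theta_grid[np.argmax(log_lik)])  # 1.3
print("sample mean   :", x.mean())                        # 1.3, the MLE here
```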

SLIDE 5


Maximum Likelihood Estimation (MLE)

• Likelihood function: for each sample point x, let θ̂(x) be the parameter value at which L(θ|x) attains its maximum as a function of θ.
• The MLE of θ based on a sample x is θ̂(x).
• The MLE is the parameter point for which the observed sample is most likely.
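When no closed form exists, the maximizer θ̂(x) can be found numerically by minimizing the negative log-likelihood. A sketch under an assumed Exponential(rate θ) model; the model and data here are illustrative, not from the slides:

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([0.5, 1.3, 0.4, 2.1, 0.9])   # made-up observations

def neg_log_lik(theta):
    # For the Exponential pdf f(x; theta) = theta * exp(-theta * x):
    # log L = n log(theta) - theta * sum(x)
    return -(len(x) * np.log(theta) - theta * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), method="bounded")
print("numerical MLE      :", res.x)
print("closed form 1/mean :", 1 / x.mean())   # the two should agree
```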

SLIDE 6


Maximum Likelihood Estimation (MLE)

• If the likelihood function is differentiable (in each θj), possible candidates for the MLE are the values (θ1, …, θk) that solve ∂L(θ|x)/∂θj = 0 for j = 1, …, k.
• Note that these solutions are only candidates. To find the exact MLE we should also check the boundary of the parameter space and the second-order conditions (a symbolic sketch follows below).
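The first-order condition can also be handled symbolically. A sketch with SymPy for a N(θ, 1) log-likelihood over five symbolic observations (the model choice is illustrative):

```python
import sympy as sp

theta = sp.symbols('theta')
x = sp.symbols('x1:6')                             # symbolic sample x1..x5

log_lik = sum(-(xi - theta) ** 2 / 2 for xi in x)  # additive constants dropped
score = sp.diff(log_lik, theta)                    # d(log L)/d(theta)

print(sp.solve(sp.Eq(score, 0), theta))  # [(x1 + ... + x5)/5], the sample mean
print(sp.diff(log_lik, theta, 2))        # -5 < 0, so the candidate is a maximum
```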

SLIDE 7


Example 1

Adapted from slides of Harvard University

SLIDE 8


Example 2

• MLE for Gaussian with unknown mean

• Let x1, x2, …, xn be iid samples from N(θ, 1). Find the MLE of θ.
• Solution:
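The standard derivation, for reference:

```latex
\log L(\theta \mid x)
  = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^2,
\qquad
\frac{\partial \log L}{\partial \theta}
  = \sum_{i=1}^{n}(x_i - \theta) = 0
\;\Longrightarrow\;
\hat{\theta}_{\mathrm{MLE}} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
```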

SLIDE 9


Maximum Likelihood Estimation (MLE)

• Sometimes it is more convenient to use the log likelihood.
• Let x1, x2, …, xn be iid samples from Bernoulli(p); then the likelihood function is L(p|x) = ∏_{i} p^(xi) (1 − p)^(1 − xi) = p^s (1 − p)^(n − s), where s = Σ_i xi.
• Invariance: if θ̂ is the MLE of θ, then for any function τ(θ), the MLE of τ(θ) is τ(θ̂).
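Maximizing the log form gives the familiar closed form; with s = Σ_i xi:

```latex
\log L(p \mid x) = s \log p + (n - s)\log(1 - p),
\qquad
\frac{d \log L}{d p} = \frac{s}{p} - \frac{n - s}{1 - p} = 0
\;\Longrightarrow\;
\hat{p}_{\mathrm{MLE}} = \frac{s}{n} = \bar{x}.
```

By the invariance property just stated, for example, the MLE of the Bernoulli variance p(1 − p) is simply x̄(1 − x̄).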

SLIDE 10


Example 3

• MLE for Gaussian with unknown mean and variance

• Let x1, x2, …, xn be iid samples from N(μ, σ²). Find the MLE for θ = (μ, σ²).

• Solution:
• Exercise: prove that the MLE for the variance of a Gaussian is biased! (The computation is sketched below.)
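A standard solution sketch, including the bias computation the exercise asks for:

```latex
\hat{\mu}_{\mathrm{MLE}} = \bar{x},
\qquad
\hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2,
\qquad
\mathbb{E}\!\left[\hat{\sigma}^2_{\mathrm{MLE}}\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2 .
```

So the MLE of the variance is biased (it underestimates σ² by the factor (n − 1)/n), though the bias vanishes as n grows.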

SLIDE 11


Property of MLE

• To use two-variable calculus to verify that a function H(θ1, θ2) has a maximum at (θ̂1, θ̂2), it must be shown that the following three conditions hold:
  a) The first-order partial derivatives are zero.
  b) At least one second-order partial derivative is negative.
  c) The Jacobian (determinant) of the matrix of second-order derivatives is positive.
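Written out for the candidate point (θ̂1, θ̂2), the three conditions are:

```latex
\text{(a)}\quad \frac{\partial H}{\partial \theta_1} = \frac{\partial H}{\partial \theta_2} = 0,
\qquad
\text{(b)}\quad \frac{\partial^2 H}{\partial \theta_1^2} < 0
  \;\;\text{or}\;\; \frac{\partial^2 H}{\partial \theta_2^2} < 0,
\qquad
\text{(c)}\quad
\frac{\partial^2 H}{\partial \theta_1^2}\,\frac{\partial^2 H}{\partial \theta_2^2}
- \left(\frac{\partial^2 H}{\partial \theta_1 \,\partial \theta_2}\right)^{2} > 0 .
```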

SLIDE 12


Example 4

• MLE for Multinomial distribution (Hint: use Lagrange multipliers)

• Solution:
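A standard Lagrange-multiplier solution: with category counts n_j (so Σ_j n_j = n) and the constraint Σ_j θ_j = 1,

```latex
\Lambda(\theta, \lambda) = \sum_{j=1}^{k} n_j \log \theta_j
  + \lambda \Bigl( 1 - \sum_{j=1}^{k} \theta_j \Bigr),
\qquad
\frac{\partial \Lambda}{\partial \theta_j} = \frac{n_j}{\theta_j} - \lambda = 0
\;\Longrightarrow\;
\theta_j = \frac{n_j}{\lambda}.
```

Substituting into the constraint gives λ = n, hence θ̂_j = n_j / n: the MLE is the vector of observed relative frequencies.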

SLIDE 13


MLE: Multinomial Distribution

• Checking that the multinomial candidate is a maximum uses the conditions from the Property of MLE slide: the first-order partial derivatives are zero, at least one second-order partial derivative is negative, and the Jacobian of second-order derivatives is positive.

SLIDE 14


Example 5

• MLE for the uniform distribution U(0, θ)

• Solution: write the pdf with an indicator function (see the reconstruction below).
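A standard reconstruction of the solution: for x1, …, xn iid from U(0, θ),

```latex
L(\theta \mid x) = \prod_{i=1}^{n} \frac{1}{\theta}\,\mathbf{1}\{0 \le x_i \le \theta\}
  = \theta^{-n}\,\mathbf{1}\bigl\{\theta \ge \max_i x_i\bigr\},
```

which is strictly decreasing in θ wherever it is nonzero, so θ̂_MLE = max_i xi = x(n). Note the maximum lies on the boundary of the feasible set, so setting a derivative to zero would miss it.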

SLIDE 15


Maximum A Posteriori Estimation (MAP)

• Approximation: instead of averaging over all parameter values, consider only the most probable value (i.e., the value with the highest posterior probability).
• Usually a very good approximation, and much simpler.
• MAP value ≠ expected value.
• MAP → ML for infinite data (as long as the prior is nonzero everywhere).
• Given a set of observations D and a prior distribution on the parameters, find the parameter vector that maximizes p(D|θ)p(θ), as written out below.
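In symbols, since the evidence p(D) does not depend on θ:

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta}\; p(\theta \mid \mathcal{D})
  = \arg\max_{\theta}\; p(\mathcal{D} \mid \theta)\, p(\theta).
```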

SLIDE 16


Maximum A Posteriori Estimation (MAP)

• Priors:
  • Uninformative priors: uniform distribution.
  • Conjugate priors: closed-form representation of the posterior; P(θ) and P(θ|D) have the same form.

Distribution    Conjugate prior
Binomial        Beta
Multinomial     Dirichlet
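A small numeric sketch of conjugacy in action, using a Beta prior on a coin's head probability (the counts and prior pseudo-counts here are made up):

```python
# With a Beta(a, b) prior and a binomial likelihood, the posterior is
# Beta(a + heads, b + tails); the MAP estimate is the posterior mode.
heads, n = 7, 10
a, b = 2.0, 2.0                                 # Beta prior pseudo-counts

mle = heads / n                                 # maximizes p(D | theta)
map_est = (heads + a - 1) / (n + a + b - 2)     # mode of Beta(a+heads, b+n-heads)

print(f"MLE = {mle:.3f}, MAP = {map_est:.3f}")  # the prior pulls MAP toward 0.5
```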

SLIDE 17


MAP vs. MLE

Adapted from slides of A. Zisserman

SLIDE 18


MAP vs. MLE

• MLE: choose the value that maximizes the probability of the observed data.
  • Can suffer from overfitting.
• MAP: choose the value that is most probable given the observed data and the prior belief.
  • Can avoid overfitting.
• When are MAP and MLE the same? (See the simulation sketch below.)
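A quick simulation of the earlier claim that MAP approaches ML as the data grow (the true parameter, prior, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true, a, b = 0.8, 5.0, 5.0        # coin bias and Beta(5, 5) prior

for n in (10, 100, 10_000):
    heads = rng.binomial(n, theta_true)
    mle = heads / n
    map_est = (heads + a - 1) / (n + a + b - 2)           # Beta posterior mode
    print(f"n={n:6d}  MLE={mle:.4f}  MAP={map_est:.4f}")  # gap shrinks with n
```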

SLIDE 19


Example 6

• MAP for a Gaussian with unknown mean and a Gaussian prior
• Let x1, x2, …, xN be iid samples from N(μ, σ²) with prior μ ~ N(μ0, σ0²). Find the MAP estimate of μ.
• Solution:
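The standard result (the posterior is Gaussian, so its mode coincides with its mean):

```latex
\hat{\mu}_{\mathrm{MAP}}
  = \frac{\dfrac{n\bar{x}}{\sigma^2} + \dfrac{\mu_0}{\sigma_0^2}}
         {\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2}}
  = \frac{n\sigma_0^2\,\bar{x} + \sigma^2\,\mu_0}{n\sigma_0^2 + \sigma^2},
```

a precision-weighted average of the sample mean and the prior mean; as n → ∞ it tends to x̄, the MLE.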

SLIDE 20


Bayes Estimators

• Suppose that we have a prior distribution π(θ) for θ.
• Let f(x|θ) be the sampling distribution; then the conditional distribution of θ given the sample x is π(θ|x) = f(x|θ)π(θ) / m(x), where m(x) = ∫ f(x|θ)π(θ) dθ is the marginal distribution of x.
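A grid-based sketch of this formula for a Bernoulli likelihood with a deliberately non-conjugate triangular prior (all numbers illustrative):

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]

prior = 1.0 - 2.0 * np.abs(theta - 0.5)     # triangular prior peaked at 0.5
prior /= prior.sum() * dtheta               # normalize to a density

heads, n = 7, 10
lik = theta**heads * (1.0 - theta)**(n - heads)   # f(x | theta)

m_x = (lik * prior).sum() * dtheta          # marginal m(x), a plain number
posterior = lik * prior / m_x               # pi(theta | x)

print("posterior mean:", (theta * posterior).sum() * dtheta)
print("posterior mode:", theta[np.argmax(posterior)])
```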

SLIDE 21


Example 7

• Bayes estimation for a Gaussian with unknown mean and a Gaussian prior
• Let x1, …, xN be N iid samples with xi ~ N(θ, σ0²) and θ ~ N(μ, σ²).
• Solution:
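The standard conjugate result is that the posterior is again Gaussian:

```latex
\theta \mid x_1, \dots, x_N \sim N\!\left(\mu_N, \sigma_N^2\right),
\qquad
\frac{1}{\sigma_N^2} = \frac{N}{\sigma_0^2} + \frac{1}{\sigma^2},
\qquad
\mu_N = \sigma_N^2 \left( \frac{N \bar{x}}{\sigma_0^2} + \frac{\mu}{\sigma^2} \right).
```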

SLIDE 22


Bayesian Estimators

• Both ML and MAP return only a single, specific value for the parameter Θ; Bayesian estimation, by contrast, computes the full posterior distribution P(Θ|X).
• If the prior is well-behaved (i.e., does not assign zero density to any feasible parameter value), then both the MLE and the Bayesian prediction converge to the same value as the number of training data points increases.
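A sketch of the contrast for the conjugate Beta-Bernoulli case (prior and data made up): point estimates are single numbers, while the Bayesian answer is a whole distribution from which any summary can be read off.

```python
from scipy import stats

# Beta(2, 2) prior, 7 heads in 10 flips -> posterior Beta(2 + 7, 2 + 3).
posterior = stats.beta(9, 5)

print("MLE             :", 7 / 10)
print("MAP (post. mode):", (9 - 1) / (9 + 5 - 2))
print("posterior mean  :", posterior.mean())
print("95% cred. int.  :", posterior.interval(0.95))
```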

SLIDE 23


Any Questions?

End of Lecture 2. Thank you!

Spring 2015

http://ce.sharif.edu/courses//93-94/2/ce717-1/