

SLIDE 1

ML, MAP Estimation and Bayesian

CE-717: Machine Learning, Sharif University of Technology, Fall 2019, Soleymani

SLIDE 2

Outline

• Introduction
• Maximum-Likelihood (ML) estimation
• Maximum A Posteriori (MAP) estimation
• Bayesian inference

SLIDE 3

Relation of learning & statistics

• The target model in a learning problem can be considered as a statistical model.
• For a fixed set of data and an underlying target (statistical model), estimation methods try to estimate the target from the available data.

SLIDE 4

Density estimation

• Estimating the probability density function $p(\mathbf{x})$, given a set of data points $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$ drawn from it.
• Main approaches of density estimation:
  ◦ Parametric: assume a parameterized model for the density function; a number of parameters are optimized by fitting the model to the data set.
  ◦ Nonparametric (instance-based): no specific parametric model is assumed; the form of the density function is determined entirely by the data.
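
To make the parametric/nonparametric contrast concrete, here is a minimal sketch (our illustration, not from the slides) that fits both kinds of estimator to the same sample; the data and all parameter values are arbitrary assumptions:

```python
# Sketch: parametric vs. nonparametric density estimation on the same data.
# The sample and its parameters are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=2.0, size=500)

# Parametric: optimize the two parameters (mu, sigma) of an assumed Gaussian.
mu, sigma = stats.norm.fit(data)

# Nonparametric: kernel density estimate, shaped entirely by the data.
kde = stats.gaussian_kde(data)

x0 = 0.5
print(stats.norm.pdf(x0, mu, sigma), kde(x0)[0])  # two estimates of p(x0)
```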

SLIDE 5

Parametric density estimation

• Estimating the probability density function $p(\mathbf{x})$, given a set of data points $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$ drawn from it.
• Assume that $p(\mathbf{x})$ has a specific functional form with a number of adjustable parameters.
• Methods for parameter estimation:
  ◦ Maximum Likelihood (ML) estimation
  ◦ Maximum A Posteriori (MAP) estimation

SLIDE 6

Parametric density estimation

• Goal: estimate the parameters of a distribution from a dataset $\mathcal{D} = \{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}\}$.
• $\mathcal{D}$ contains $N$ independent, identically distributed (i.i.d.) training samples.
• We need to determine $\boldsymbol{\theta}$ given $\{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}\}$.
• How to represent $\boldsymbol{\theta}$: a point estimate $\boldsymbol{\theta}^*$ or a distribution $p(\boldsymbol{\theta})$?

SLIDE 7

Example

$$p(x|\mu) = \mathcal{N}(x|\mu, 1)$$

SLIDE 8

Example

SLIDE 9

Maximum Likelihood Estimation (MLE)

• Maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given data.
• Likelihood is the conditional probability of the observations $\mathcal{D} = \{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(N)}\}$ given the value of the parameters $\boldsymbol{\theta}$.
• Assuming i.i.d. observations:

$$p(\mathcal{D}|\boldsymbol{\theta}) = \prod_{i=1}^{N} p(\mathbf{x}^{(i)}|\boldsymbol{\theta})$$

• Maximum Likelihood estimation:

$$\hat{\boldsymbol{\theta}}_{ML} = \arg\max_{\boldsymbol{\theta}} \, p(\mathcal{D}|\boldsymbol{\theta}) \quad \text{(the likelihood of } \boldsymbol{\theta} \text{ w.r.t. the samples)}$$

SLIDE 10

Maximum Likelihood Estimation (MLE)

$\hat{\boldsymbol{\theta}}_{ML}$ is the value of $\boldsymbol{\theta}$ that best agrees with the observed samples.


SLIDE 13

Maximum Likelihood Estimation (MLE)

$$\mathcal{L}(\boldsymbol{\theta}) = \ln p(\mathcal{D}|\boldsymbol{\theta}) = \ln \prod_{i=1}^{N} p(\mathbf{x}^{(i)}|\boldsymbol{\theta}) = \sum_{i=1}^{N} \ln p(\mathbf{x}^{(i)}|\boldsymbol{\theta})$$

$$\hat{\boldsymbol{\theta}}_{ML} = \arg\max_{\boldsymbol{\theta}} \, \mathcal{L}(\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{N} \ln p(\mathbf{x}^{(i)}|\boldsymbol{\theta})$$

• Thus, we solve $\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) = \mathbf{0}$ to find the optimum.
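
When $\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) = \mathbf{0}$ has no convenient closed form, the log-likelihood can also be maximized numerically. A minimal sketch (not from the slides), assuming a Gaussian likelihood with known $\sigma = 1$, where the numerical MLE should match the sample mean:

```python
# Sketch: numerical MLE by minimizing the negative log-likelihood.
# Gaussian likelihood with known sigma = 1 (an assumption for illustration).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=100)  # x^(1..N), true mu = 2

def neg_log_likelihood(mu, sigma=1.0):
    # -L(mu) = -sum_i ln p(x^(i) | mu)
    return (0.5 / sigma**2) * np.sum((data - mu) ** 2) \
        + len(data) * np.log(np.sqrt(2 * np.pi) * sigma)

res = minimize_scalar(neg_log_likelihood)
print(res.x, data.mean())  # numerical optimum ~ sample mean
```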

SLIDE 14

MLE Bernoulli

• Given: $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$, $m$ heads (1), $N - m$ tails (0)

$$p(x|\theta) = \theta^{x}(1-\theta)^{1-x}$$

$$p(\mathcal{D}|\theta) = \prod_{i=1}^{N} p(x^{(i)}|\theta) = \prod_{i=1}^{N} \theta^{x^{(i)}}(1-\theta)^{1-x^{(i)}}$$

$$\ln p(\mathcal{D}|\theta) = \sum_{i=1}^{N} \ln p(x^{(i)}|\theta) = \sum_{i=1}^{N} \left\{ x^{(i)} \ln\theta + \left(1-x^{(i)}\right) \ln(1-\theta) \right\}$$

$$\frac{\partial \ln p(\mathcal{D}|\theta)}{\partial \theta} = 0 \;\Rightarrow\; \hat{\theta}_{ML} = \frac{\sum_{i=1}^{N} x^{(i)}}{N} = \frac{m}{N}$$
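
A quick numerical check of the closed form $\hat{\theta}_{ML} = m/N$ (our illustration; the toss data below are made up):

```python
# Sketch: Bernoulli MLE, closed form vs. grid search (made-up tosses).
import numpy as np

data = np.array([1, 1, 0, 1, 0, 1, 1, 0])
m, N = data.sum(), len(data)
theta_ml = m / N                          # closed-form MLE

grid = np.linspace(1e-6, 1 - 1e-6, 10001)
loglik = m * np.log(grid) + (N - m) * np.log(1 - grid)
print(theta_ml, grid[np.argmax(loglik)])  # both ~ 0.625
```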

SLIDE 15

MLE Bernoulli: example

• Example: $\mathcal{D} = \{1,1,1\}$, $\hat{\theta}_{ML} = \frac{3}{3} = 1$
• Prediction: all future tosses will land heads up
• Overfitting to $\mathcal{D}$

SLIDE 16

MLE: Multinomial distribution

• Multinomial distribution (on a variable with $K$ states):

$$P(\mathbf{x}|\boldsymbol{\theta}) = \prod_{k=1}^{K} \theta_k^{x_k}$$

Parameter space: $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$ with $\theta_k \in [0,1]$ and $\sum_{k=1}^{K} \theta_k = 1$

Data: $\mathbf{x} = (x_1, \ldots, x_K)$ with $x_k \in \{0,1\}$ and $\sum_{k=1}^{K} x_k = 1$, so that $P(x_k = 1) = \theta_k$

(Figure: bar chart of $\theta_1, \theta_2, \theta_3$ for a 3-state example.)

SLIDE 17

MLE: Multinomial distribution

$$\mathcal{D} = \{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(N)}\}$$

$$P(\mathcal{D}|\boldsymbol{\theta}) = \prod_{i=1}^{N} P(\mathbf{x}^{(i)}|\boldsymbol{\theta}) = \prod_{i=1}^{N} \prod_{k=1}^{K} \theta_k^{x_k^{(i)}} = \prod_{k=1}^{K} \theta_k^{\sum_{i=1}^{N} x_k^{(i)}}$$

Maximizing subject to $\sum_k \theta_k = 1$ with a Lagrange multiplier $\lambda$:

$$\mathcal{L}(\boldsymbol{\theta}, \lambda) = \ln p(\mathcal{D}|\boldsymbol{\theta}) + \lambda \Big(1 - \sum_{k=1}^{K} \theta_k \Big)$$

$$\hat{\theta}_k = \frac{\sum_{i=1}^{N} x_k^{(i)}}{N} = \frac{N_k}{N}, \qquad N_k = \sum_{i=1}^{N} x_k^{(i)}, \qquad \sum_{k=1}^{K} N_k = N$$
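
The multinomial MLE is simply the vector of empirical state frequencies $N_k/N$; a small sketch with made-up 1-of-$K$ data:

```python
# Sketch: multinomial MLE = per-state frequencies (made-up one-hot data).
import numpy as np

X = np.array([[1, 0, 0],   # each row is a 1-of-K sample x^(i), K = 3
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1],
              [1, 0, 0]])
N_k = X.sum(axis=0)           # counts N_k = sum_i x_k^(i)
theta_hat = N_k / X.shape[0]  # theta_k = N_k / N
print(theta_hat)              # [0.6 0.2 0.2]
```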

SLIDE 18

MLE Gaussian: unknown $\mu$

$$p(x|\mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$$

$$\ln p(x^{(i)}|\mu) = -\ln\!\left(\sqrt{2\pi}\,\sigma\right) - \frac{1}{2\sigma^2}\left(x^{(i)} - \mu\right)^2$$

$$\frac{\partial \mathcal{L}(\mu)}{\partial \mu} = 0 \;\Rightarrow\; \frac{\partial}{\partial \mu} \sum_{i=1}^{N} \ln p(x^{(i)}|\mu) = 0 \;\Rightarrow\; \sum_{i=1}^{N} \frac{1}{\sigma^2}\left(x^{(i)} - \mu\right) = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}$$

MLE corresponds to many well-known estimation methods.

SLIDE 19

MLE Gaussian: unknown $\mu$ and $\sigma$

With $\boldsymbol{\theta} = (\mu, \sigma)$, solve $\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) = \mathbf{0}$:

$$\frac{\partial \mathcal{L}(\mu,\sigma)}{\partial \mu} = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}$$

$$\frac{\partial \mathcal{L}(\mu,\sigma)}{\partial \sigma} = 0 \;\Rightarrow\; \hat{\sigma}^2_{ML} = \frac{1}{N} \sum_{i=1}^{N} \left(x^{(i)} - \hat{\mu}_{ML}\right)^2$$
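
These two expressions are the sample mean and the biased ($1/N$) sample variance. A sketch checking them on synthetic data (the true parameters below are arbitrary); note that `np.var` uses the same $1/N$ convention by default:

```python
# Sketch: Gaussian MLE for mu and sigma^2 on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # assumed true mu=5, sigma=2

mu_ml = x.mean()                    # (1/N) sum_i x^(i)
var_ml = ((x - mu_ml) ** 2).mean()  # (1/N) sum_i (x^(i) - mu_ml)^2 (biased)
print(mu_ml, var_ml, x.var())       # np.var defaults to the same 1/N form
```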

SLIDE 20

Maximum A Posteriori (MAP) estimation

• MAP estimation:

$$\hat{\boldsymbol{\theta}}_{MAP} = \arg\max_{\boldsymbol{\theta}} \, p(\boldsymbol{\theta}|\mathcal{D})$$

• Since $p(\boldsymbol{\theta}|\mathcal{D}) \propto p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})$:

$$\hat{\boldsymbol{\theta}}_{MAP} = \arg\max_{\boldsymbol{\theta}} \, p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})$$

• Example of a prior distribution: $p(\theta) = \mathcal{N}(\theta_0, \sigma^2)$

SLIDE 21

MAP estimation Gaussian: unknown $\mu$

$$p(x|\mu) \sim \mathcal{N}(\mu, \sigma^2), \qquad p(\mu|\mu_0) \sim \mathcal{N}(\mu_0, \sigma_0^2)$$

$$\frac{d}{d\mu} \ln \left[ p(\mu) \prod_{i=1}^{N} p(x^{(i)}|\mu) \right] = 0 \;\Rightarrow\; \sum_{i=1}^{N} \frac{1}{\sigma^2}\left(x^{(i)} - \mu\right) - \frac{1}{\sigma_0^2}(\mu - \mu_0) = 0$$

$$\Rightarrow\; \hat{\mu}_{MAP} = \frac{\mu_0 + \frac{\sigma_0^2}{\sigma^2} \sum_{i=1}^{N} x^{(i)}}{1 + \frac{\sigma_0^2}{\sigma^2} N}$$

$$\frac{\sigma_0^2}{\sigma^2} \gg 1 \;\text{ or }\; N \to \infty \;\Rightarrow\; \hat{\mu}_{MAP} = \hat{\mu}_{ML} = \frac{\sum_{i=1}^{N} x^{(i)}}{N}$$

($\mu$ is the only unknown parameter; $\mu_0$ and $\sigma_0$ are known.)
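
A sketch of the $\hat{\mu}_{MAP}$ formula on synthetic data (the prior hyperparameters $\mu_0, \sigma_0$ and the true mean below are arbitrary choices). With few samples the estimate is pulled toward $\mu_0$; as $N$ grows it approaches $\hat{\mu}_{ML}$, matching the limit on the slide:

```python
# Sketch: MAP estimate of a Gaussian mean with a Gaussian prior.
import numpy as np

sigma, mu0, sigma0 = 1.0, 0.0, 0.5   # likelihood sd and prior (assumed)
rng = np.random.default_rng(2)

for N in (5, 50, 5000):
    x = rng.normal(loc=3.0, scale=sigma, size=N)  # assumed true mu = 3
    r = sigma0**2 / sigma**2
    mu_map = (mu0 + r * x.sum()) / (1 + r * N)
    print(N, mu_map, x.mean())  # MAP shrinks toward mu0, then matches ML
```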

SLIDE 22

Maximum A Posteriori (MAP) estimation

• Given a set of observations $\mathcal{D}$ and a prior distribution $p(\boldsymbol{\theta})$ on the parameters, find the parameter vector that maximizes $p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})$.
• For the Gaussian mean, the posterior mean interpolates between the prior mean and the ML estimate:

$$\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\hat{\mu}_{ML}$$

(Figure: two likelihood/prior configurations, one where $\hat{\theta}_{MAP} \cong \hat{\theta}_{ML}$ and one where $\hat{\theta}_{MAP} > \hat{\theta}_{ML}$, depending on the prior.)

SLIDE 23

MAP estimation Gaussian: unknown $\mu$ (known $\sigma$)

More samples $\Longrightarrow$ sharper $p(\mu|\mathcal{D})$, i.e., higher confidence in the estimation.

$$p(\mu|\mathcal{D}) \propto p(\mu) \, p(\mathcal{D}|\mu), \qquad p(\mu|\mathcal{D}) = \mathcal{N}(\mu \,|\, \mu_N, \sigma_N^2)$$

$$\mu_N = \frac{\mu_0 + \frac{\sigma_0^2}{\sigma^2} \sum_{i=1}^{N} x^{(i)}}{1 + \frac{\sigma_0^2}{\sigma^2} N}, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$$

[Bishop]

SLIDE 24

Conjugate Priors

• We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties.
• Choose a prior such that the posterior distribution, which is proportional to $p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})$, has the same functional form as the prior:

$$\forall \boldsymbol{\alpha}, \mathcal{D} \;\; \exists \boldsymbol{\alpha}' \quad P(\boldsymbol{\theta}|\boldsymbol{\alpha}') \propto P(\mathcal{D}|\boldsymbol{\theta}) \, P(\boldsymbol{\theta}|\boldsymbol{\alpha})$$

where prior and posterior have the same functional form.

SLIDE 25

Prior for Bernoulli Likelihood

• Beta distribution over $\theta \in [0,1]$:

$$\text{Beta}(\theta|\alpha_1, \alpha_0) = \frac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0)\Gamma(\alpha_1)} \, \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1} \propto \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1}$$

• The Beta distribution is the conjugate prior of the Bernoulli: $P(x|\theta) = \theta^x (1-\theta)^{1-x}$

$$E[\theta] = \frac{\alpha_1}{\alpha_0 + \alpha_1}, \qquad \hat{\theta} = \frac{\alpha_1 - 1}{(\alpha_0 - 1) + (\alpha_1 - 1)} \;\;\text{(the most probable } \theta\text{)}$$
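
SciPy's `beta` distribution has the same functional form, with its `(a, b)` playing the role of $(\alpha_1, \alpha_0)$ here, so the mean and mode formulas can be sanity-checked directly (the hyperparameter values below are arbitrary):

```python
# Sketch: Beta mean and mode vs. the closed forms (arbitrary a1, a0).
from scipy.stats import beta

a1, a0 = 3.0, 2.0
prior = beta(a1, a0)          # scipy's (a, b) = (alpha_1, alpha_0)

print(prior.mean(), a1 / (a0 + a1))        # E[theta] = 0.6 both ways
print((a1 - 1) / ((a0 - 1) + (a1 - 1)))    # mode: most probable theta ~ 0.667
```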

SLIDE 26

Beta distribution

SLIDE 27

Bernoulli likelihood: posterior

Given: $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$, $m$ heads (1), $N - m$ tails (0), where $m = \sum_{i=1}^{N} x^{(i)}$

$$p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta) \, p(\theta) = \left[ \prod_{i=1}^{N} \theta^{x^{(i)}} (1-\theta)^{1-x^{(i)}} \right] \text{Beta}(\theta|\alpha_1, \alpha_0) \propto \theta^{m+\alpha_1-1} (1-\theta)^{N-m+\alpha_0-1}$$

$$\Rightarrow\; p(\theta|\mathcal{D}) = \text{Beta}(\theta|\alpha_1', \alpha_0') \propto \theta^{\alpha_1'-1} (1-\theta)^{\alpha_0'-1}, \qquad \alpha_1' = \alpha_1 + m, \quad \alpha_0' = \alpha_0 + N - m$$
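
The conjugate update is just hyperparameter bookkeeping; a sketch with made-up tosses and an assumed Beta(2, 2) prior:

```python
# Sketch: conjugate Beta update for a Bernoulli likelihood.
import numpy as np

data = np.array([1, 1, 1, 0, 1])       # made-up tosses
a1, a0 = 2.0, 2.0                       # assumed prior hyperparameters
m, N = data.sum(), len(data)

a1_post, a0_post = a1 + m, a0 + N - m   # alpha_1' and alpha_0'
print(a1_post, a0_post)                 # posterior is Beta(theta | 6, 3)
```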

SLIDE 28

Example

Bernoulli likelihood $p(x|\theta) = \theta^x (1-\theta)^{1-x}$ with prior Beta: $\alpha_0 = \alpha_1 = 2$; $\mathcal{D} = \{1,1,1\} \Rightarrow N = 3, m = 3$

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \, P(\theta|\mathcal{D}) = \frac{\alpha_1' - 1}{(\alpha_1' - 1) + (\alpha_0' - 1)} = \frac{4}{5}$$

Posterior Beta: $\alpha_1' = 5$, $\alpha_0' = 2$

(Figure: prior $p(\theta)$, likelihood $p(x=1|\theta)$, and posterior as functions of $\theta$.)

SLIDE 29

Toss example

• MAP estimation can avoid overfitting:
  ◦ $\mathcal{D} = \{1,1,1\}$, $\hat{\theta}_{ML} = 1$
  ◦ $\hat{\theta}_{MAP} = 0.8$ (with prior $p(\theta) = \text{Beta}(\theta|2,2)$)
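
Reproducing the two numbers on this slide:

```python
# Sketch: ML vs. MAP on D = {1,1,1} with a Beta(2,2) prior.
data = [1, 1, 1]
m, N = sum(data), len(data)
a1, a0 = 2, 2

theta_ml = m / N                                              # 1.0 (overfits)
theta_map = (a1 + m - 1) / ((a0 + N - m - 1) + (a1 + m - 1))  # posterior mode
print(theta_ml, theta_map)                                    # 1.0  0.8
```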

SLIDE 30

Bayesian inference

• Parameters $\boldsymbol{\theta}$ are treated as random variables with an a priori distribution.
• Bayesian estimation utilizes the available prior information about the unknown parameter.
• As opposed to ML and MAP estimation, it does not seek a specific point estimate of the unknown parameter vector $\boldsymbol{\theta}$.
• The observed samples $\mathcal{D}$ convert the prior density $p(\boldsymbol{\theta})$ into a posterior density $p(\boldsymbol{\theta}|\mathcal{D})$.
• We keep track of our beliefs about $\boldsymbol{\theta}$'s values and use these beliefs to reach conclusions.
• In the Bayesian approach, we first specify $p(\boldsymbol{\theta}|\mathcal{D})$ and then compute the predictive distribution $p(\mathbf{x}|\mathcal{D})$.

SLIDE 31

Bayesian estimation: predictive distribution

• Given a set of samples $\mathcal{D} = \{\mathbf{x}^{(i)}\}_{i=1}^{N}$, a prior distribution on the parameters $P(\boldsymbol{\theta})$, and the form of the distribution $P(\mathbf{x}|\boldsymbol{\theta})$.
• We find $P(\boldsymbol{\theta}|\mathcal{D})$ and then use it to specify $\hat{P}(\mathbf{x}) = P(\mathbf{x}|\mathcal{D})$ as an estimate of $P(\mathbf{x})$:

$$P(\mathbf{x}|\mathcal{D}) = \int P(\mathbf{x}, \boldsymbol{\theta}|\mathcal{D}) \, d\boldsymbol{\theta} = \int P(\mathbf{x}|\mathcal{D}, \boldsymbol{\theta}) \, P(\boldsymbol{\theta}|\mathcal{D}) \, d\boldsymbol{\theta} = \int P(\mathbf{x}|\boldsymbol{\theta}) \, P(\boldsymbol{\theta}|\mathcal{D}) \, d\boldsymbol{\theta}$$

(Predictive distribution: if we knew the value of the parameters $\boldsymbol{\theta}$, we would know exactly the distribution of $\mathbf{x}$.)
• Analytical solutions exist only for very special forms of the involved functions.

SLIDE 32

Bernoulli likelihood: prediction

• Training samples: $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$

$$P(\theta) = \text{Beta}(\theta|\alpha_1, \alpha_0) \propto \theta^{\alpha_1-1}(1-\theta)^{\alpha_0-1}$$

$$P(\theta|\mathcal{D}) = \text{Beta}(\theta|\alpha_1 + m, \alpha_0 + N - m) \propto \theta^{\alpha_1+m-1}(1-\theta)^{\alpha_0+N-m-1}$$

$$P(x|\mathcal{D}) = \int P(x|\theta) \, P(\theta|\mathcal{D}) \, d\theta = E_{p(\theta|\mathcal{D})}\!\left[P(x|\theta)\right] \;\Rightarrow\; P(x=1|\mathcal{D}) = E_{p(\theta|\mathcal{D})}[\theta] = \frac{\alpha_1 + m}{\alpha_0 + \alpha_1 + N}$$
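
A sketch checking the closed-form predictive probability against direct numerical integration of $\int \theta \, \text{Beta}(\theta|\alpha_1', \alpha_0') \, d\theta$, using the running example ($\alpha_0 = \alpha_1 = 2$, $\mathcal{D} = \{1,1,1\}$):

```python
# Sketch: Bernoulli posterior predictive, closed form vs. numerical integral.
import numpy as np
from scipy.stats import beta

a1, a0, m, N = 2.0, 2.0, 3, 3
p_heads = (a1 + m) / (a0 + a1 + N)           # closed form: 5/7 ~ 0.714

theta = np.linspace(0.0, 1.0, 100_001)
post = beta.pdf(theta, a1 + m, a0 + N - m)   # Beta(theta | 5, 2)
dt = theta[1] - theta[0]
print(p_heads, np.sum(theta * post) * dt)    # both ~ 0.714
```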

SLIDE 33

ML, MAP, and Bayesian Estimation

• If $p(\boldsymbol{\theta}|\mathcal{D})$ has a sharp peak at $\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}$ (i.e., $p(\boldsymbol{\theta}|\mathcal{D}) \approx \delta(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})$), then $p(\mathbf{x}|\mathcal{D}) \approx p(\mathbf{x}|\hat{\boldsymbol{\theta}})$.
  ◦ In this case, the Bayesian estimate will be approximately equal to the MAP estimate.
• If $p(\mathcal{D}|\boldsymbol{\theta})$ is concentrated around a sharp peak and $p(\boldsymbol{\theta})$ is broad enough around this peak, the ML, MAP, and Bayesian estimations yield approximately the same result.
• All three methods asymptotically ($N \to \infty$) result in the same estimate.
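
A sketch of this asymptotic agreement for Bernoulli data: with an assumed true $\theta = 0.3$ and a Beta(2, 2) prior, the ML estimate, the MAP estimate, and the Bayesian predictive probability all converge as $N$ grows:

```python
# Sketch: ML, MAP, and Bayesian predictive agree as N -> infinity.
import numpy as np

rng = np.random.default_rng(4)
a1, a0 = 2.0, 2.0                 # assumed Beta prior
theta_true = 0.3                  # assumed ground truth

for N in (10, 100, 10_000):
    x = rng.binomial(1, theta_true, size=N)
    m = x.sum()
    ml = m / N
    map_ = (a1 + m - 1) / (a0 + a1 + N - 2)   # Beta posterior mode
    bayes = (a1 + m) / (a0 + a1 + N)          # P(x=1 | D)
    print(N, ml, map_, bayes)                 # all approach 0.3
```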

SLIDE 34

Summary

• ML and MAP result in a single (point) estimate of the unknown parameter vector.
  ◦ Simpler and more interpretable than Bayesian estimation.
• The Bayesian approach finds a predictive distribution using all the available information:
  ◦ expected to give better results
  ◦ needs higher computational complexity
• Bayesian methods have gained a lot of popularity over the recent decade due to advances in computer technology.
• All three methods asymptotically ($N \to \infty$) result in the same estimate.

SLIDE 35

Resource

• C. Bishop, "Pattern Recognition and Machine Learning", Springer, 2006, Chapter 2.