Estimating the parameters of some probability distributions (PowerPoint presentation)


SLIDE 1

Estimating the parameters of some probability distributions: Examples

SLIDE 2

Estimating the parameter of the Bernoulli distribution: the MLE and MAP approaches

CMU, 2015 spring, Tom Mitchell, Nina Balcan, HW2, pr. 2

SLIDE 3

Suppose we observe the values of n i.i.d. (independent, identically distributed) random variables X1, . . . , Xn drawn from a single Bernoulli distribution with parameter θ. In other words, for each Xi, we know that P(Xi = 1) = θ and P(Xi = 0) = 1 − θ. Our goal is to estimate the value of θ from the observed values of X1, . . . , Xn.

SLIDE 4

Reminder: Maximum Likelihood Estimation

For any hypothetical value θ̂, we can compute the probability of observing the outcome X1, . . . , Xn if the true parameter value θ were equal to θ̂. This probability of the observed data is often called the data likelihood, and the function L(θ̂) that maps each θ̂ to the corresponding likelihood is called the likelihood function. A natural way to estimate the unknown parameter θ is to choose the θ̂ that maximizes the likelihood function. Formally,

θ̂_MLE = argmax_{θ̂} L(θ̂).

SLIDE 5

a. Write a formula for the likelihood function, L(θ̂). Your function should depend on the random variables X1, . . . , Xn and the hypothetical parameter θ̂. Does the likelihood function depend on the order of the random variables?

Solution: Since the Xi are independent, we have

L(θ̂) = P_θ̂(X1, . . . , Xn) = ∏_{i=1}^n P_θ̂(Xi) = ∏_{i=1}^n θ̂^{Xi} (1 − θ̂)^{1−Xi} = θ̂^{#{Xi=1}} · (1 − θ̂)^{#{Xi=0}},

where #{·} counts the number of Xi for which the condition in braces holds true. Note that in the third equality we used the trick Xi = I{Xi=1}. The likelihood function does not depend on the order of the data.

SLIDE 6

b. Suppose that n = 10 and the data set contains six 1s and four 0s. Write a short computer program that plots the likelihood function of this data. For the plot, the x-axis should be θ̂, and the y-axis L(θ̂). Scale your y-axis so that you can see some variation in its value. Estimate θ̂_MLE by marking on the x-axis the value of θ̂ that maximizes the likelihood.

Solution:

[Plot: L(θ̂) vs. θ̂; MLE; n = 10, six 1s, four 0s]
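The short program asked for in part b can be sketched as follows (plain Python; the grid search and its resolution are my choices, and the actual plotting is left to any plotting library):

```python
# Likelihood of a Bernoulli sample with n1 ones and n0 zeros:
# L(theta) = theta^n1 * (1 - theta)^n0.

def likelihood(theta, n1=6, n0=4):
    return theta ** n1 * (1.0 - theta) ** n0

# Evaluate L on a fine grid of hypothetical values theta-hat in [0, 1].
grid = [i / 10000 for i in range(10001)]
values = [likelihood(t) for t in grid]

# The grid point with the largest likelihood approximates theta-hat_MLE.
theta_mle = grid[values.index(max(values))]
print(theta_mle)  # 0.6, i.e. the proportion of 1s in the data
```

Feeding `grid` and `values` to, e.g., matplotlib's `plot` reproduces the figure; the maximum sits at θ̂ = 0.6, in agreement with part c.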

SLIDE 7

c. Find a closed-form formula for θ̂_MLE, the MLE estimate of θ. Does the closed form agree with the plot?

Solution: Let's consider l(θ̂) = ln(L(θ̂)). Since the ln function is increasing, the θ̂ that maximizes the log-likelihood is the same as the θ̂ that maximizes the likelihood. Using the properties of the ln function, we can rewrite l(θ̂) as follows:

l(θ̂) = ln(θ̂^{n1} · (1 − θ̂)^{n0}) = n1 ln(θ̂) + n0 ln(1 − θ̂),

where n1 denotes #{Xi = 1} and n0 denotes #{Xi = 0}. Assuming that θ̂ ≠ 0 and θ̂ ≠ 1, the first and second derivatives of l are given by

l′(θ̂) = n1/θ̂ − n0/(1 − θ̂)   and   l″(θ̂) = −n1/θ̂² − n0/(1 − θ̂)².

Since l″(θ̂) is always negative, the l function is concave, and we can find its maximizer by solving the equation l′(θ̂) = 0. The solution to this equation is given by

θ̂_MLE = n1/(n1 + n0).

SLIDE 8

d. Create three more likelihood plots: one where n = 5 and the data set contains three 1s and two 0s; one where n = 100 and the data set contains sixty 1s and forty 0s; and one where n = 10 and there are five 1s and five 0s.

Solution:

[Plot: L(θ̂) vs. θ̂; MLE; n = 5, three 1s, two 0s]
[Plot: L(θ̂) vs. θ̂; MLE; n = 100, sixty 1s, forty 0s]

SLIDE 9

Solution (to part d, continued):

[Plot: L(θ̂) vs. θ̂; MLE; n = 10, five 1s, five 0s]

e. Describe how the likelihood functions and maximum likelihood estimates compare for the different data sets.

Solution (to part e): The MLE is equal to the proportion of 1s observed in the data, so for the first three plots the MLE is always at 0.6, while for the last plot it is at 0.5. As the number of samples n increases, the likelihood function gets more peaked at its maximum value, and the values it takes on decrease.

SLIDE 10

Reminder: Maximum a Posteriori Probability Estimation

In the maximum likelihood estimate, we treated the true parameter value θ as a fixed (non-random) number. In cases where we have some prior knowledge about θ, it is useful to treat θ itself as a random variable, and express our prior knowledge in the form of a prior probability distribution over θ. For example, suppose that the X1, . . . , Xn are generated in the following way:
− First, the value of θ is drawn from a given prior probability distribution.
− Second, X1, . . . , Xn are drawn independently from a Bernoulli distribution using this value for θ.
Since both θ and the sequence X1, . . . , Xn are random, they have a joint probability distribution. In this setting, a natural way to estimate the value of θ is to simply choose its most probable value given its prior distribution plus the observed data X1, . . . , Xn.

Definition:

θ̂_MAP = argmax_{θ̂} P(θ = θ̂ | X1, . . . , Xn).

This is called the maximum a posteriori probability (MAP) estimate of θ.

SLIDE 11

Reminder (cont'd)

Using Bayes' rule, we can rewrite the posterior probability as follows:

P(θ = θ̂ | X1, . . . , Xn) = P(X1, . . . , Xn | θ = θ̂) P(θ = θ̂) / P(X1, . . . , Xn).

Since the probability in the denominator does not depend on θ̂, the MAP estimate is given by

θ̂_MAP = argmax_{θ̂} P(X1, . . . , Xn | θ = θ̂) P(θ = θ̂) = argmax_{θ̂} L(θ̂) P(θ = θ̂).

In words, the MAP estimate for θ is the value θ̂ that maximizes the likelihood function multiplied by the prior distribution on θ.

SLIDE 12

We will consider a Beta(3, 3) prior distribution for θ, which has the density function given by

p(θ̂) = θ̂² (1 − θ̂)² / B(3, 3),

where B(α, β) is the beta function and B(3, 3) = 1/30 ≈ 0.0333.

f. Suppose, as in part b, that n = 10 and we observed six 1s and four 0s. Write a short computer program that plots the function θ̂ ↦ L(θ̂) p(θ̂) for the same values of θ̂ as in part b. Estimate θ̂_MAP by marking on the x-axis the value of θ̂ that maximizes the function.

Solution:

[Plot: L(θ̂) p(θ̂) vs. θ̂; MAP; n = 10, six 1s, four 0s; Beta(3,3)]
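A sketch of the program for part f, reusing the grid search from part b; computing the Beta(3, 3) normalizing constant via `math.gamma` is my implementation choice:

```python
from math import gamma

def likelihood(t, n1=6, n0=4):            # Bernoulli likelihood from part b
    return t ** n1 * (1.0 - t) ** n0

B33 = gamma(3) * gamma(3) / gamma(6)      # Beta function B(3, 3) = 1/30

def prior(t):                             # Beta(3, 3) density
    return t ** 2 * (1.0 - t) ** 2 / B33

grid = [i / 10000 for i in range(1, 10000)]   # open interval (0, 1)
posterior = [likelihood(t) * prior(t) for t in grid]
theta_map = grid[posterior.index(max(posterior))]
print(theta_map)  # close to 8/14, matching the closed form of part g
```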

SLIDE 13

g. Find a closed-form formula for θ̂_MAP, the MAP estimate of θ. Does the closed form agree with the plot?

Solution: As in the case of the MLE, we will apply the ln function before finding the maximizer. We want to maximize the function

l(θ̂) = ln(L(θ̂) · p(θ̂)) = ln(θ̂^{n1+2} · (1 − θ̂)^{n0+2}) − ln(B(3, 3)).

The normalizing constant for the prior appears as an additive constant, and therefore the first and second derivatives are identical to those in the case of the MLE (except with n1 + 2 and n0 + 2 instead of n1 and n0, respectively). It follows that the closed-form formula for the MAP estimate is given by

θ̂_MAP = (n1 + 2)/(n1 + n0 + 4).

This formula agrees with the plot obtained in part f.

SLIDE 14

h. Compare the MAP estimate to the MLE computed from the same data in part b. Briefly explain any significant difference.

Solution: The MAP estimate is equal to the MLE with four additional virtual random variables, two that are equal to 1, and two that are equal to 0. This pulls the value of the MAP estimate closer to the value 0.5, which is why θ̂_MAP is smaller than θ̂_MLE.

i. Comment on the relationship between the MAP and MLE estimates as n goes to infinity, while the ratio #{Xi = 1}/#{Xi = 0} remains constant.

Solution: As n goes to infinity, the influence of the 4 virtual random variables diminishes, and the two estimators become equal in the limit.

SLIDE 15

The MLE estimator for the parameter of the Bernoulli distribution: the bias and [an example of] inadmissibility

CMU, 2004 fall, Tom Mitchell, Ziv Bar-Joseph, HW2, pr. 1

SLIDE 16

Suppose X is a binary random variable that takes value 0 with probability p and value 1 with probability 1 − p. Let X1, . . . , Xn be i.i.d. samples of X.

a. Compute the MLE estimate of p (denote it by p̂).

Answer: By definition,

p̂ = argmax_p P(X1, . . . , Xn | p) = argmax_p ∏_{i=1}^n P(Xi | p) = argmax_p p^k (1 − p)^{n−k},

where k is the number of 0s in x1, . . . , xn, and the second equality uses the i.i.d. assumption. Furthermore, since ln is a monotonic (strictly increasing) function,

p̂ = argmax_p ln(p^k (1 − p)^{n−k}) = argmax_p (k ln p + (n − k) ln(1 − p)).

Computing the first derivative of k ln p + (n − k) ln(1 − p) w.r.t. p leads to:

∂/∂p (k ln p + (n − k) ln(1 − p)) = k/p − (n − k)/(1 − p).

Hence,

∂/∂p (k ln p + (n − k) ln(1 − p)) = 0 ⇔ k/p = (n − k)/(1 − p) ⇔ p̂ = k/n.

SLIDE 17

b. Is p̂ an unbiased estimate of p? Prove the answer.

Answer: Since k can be seen as a sum of n [independent] Bernoulli variables of parameter p, we can write:

E[p̂] = E[k/n] = (1/n) E[k] = (1/n) E[n − Σ_{i=1}^n Xi] = (1/n) (n − Σ_{i=1}^n E[Xi]) = (1/n) (n − Σ_{i=1}^n (1 − p)) = (1/n) (n − n(1 − p)) = (1/n) np = p.

Therefore, p̂ is an unbiased estimator for the parameter p.

SLIDE 18

c. Compute the expected square error of p̂ in terms of p.

Answer:

E[(p̂ − p)²] = E[p̂²] − 2E[p̂] p + p² = E[k²]/n² − 2p² + p² = (Var(k) + E²[k])/n² − p² = (np(1 − p) + (np)²)/n² − p² = p(1 − p)/n.

We used the fact that Var(k) = E[k²] − E²[k], and also Var(k) = np(1 − p) because, as already said, k can be seen as a sum of n independent Bernoulli variables of parameter p. Note that E[p̂] = p (cf. part b), therefore

E[(p̂ − p)²] = E[(p̂ − E[p̂])²] = Var[p̂] = p(1 − p)/n.

This implies that Var[p̂] → 0 for n → ∞.

SLIDE 19

d. Prove that if you know that p lies in the interval [1/4, 3/4] and you are given only n = 3 samples of X, then p̂ is an inadmissible estimator of p when minimizing the expected square error of estimation.

Note: An estimator δ of a parameter θ is said to be inadmissible when there exists a different estimator δ′ such that R(θ, δ′) ≤ R(θ, δ) for all θ and R(θ, δ′) < R(θ, δ) for some θ, where R(θ, δ) is a risk function; in this problem it is the expected square error of the estimator.

Answer: Consider another estimator, p̃ = 1/2. Then E[(p̃ − p)²] = (1/2 − p)². For p = 1/2 we have E[(p̃ − p)²] = 0 < E[(p̂ − p)²] = 1/12. We now need to show that E[(p̃ − p)²] ≤ E[(p̂ − p)²] over p ∈ [1/4, 3/4]:

E[(p̃ − p)²] − E[(p̂ − p)²] = (1/2 − p)² − (1/3) p(1 − p) = 1/4 − (4/3)p + (4/3)p².

This is an upward parabola, so it is below or equal to zero on the whole interval if and only if it is below or equal to zero at the boundary points. Indeed, 1/4 − (4/3)p + (4/3)p² = 0 for both p = 1/4 and p = 3/4.
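The domination claim in part d can also be checked numerically; a sketch (the grid over p is my choice):

```python
# Risks (expected squared errors) for n = 3 Bernoulli samples:
#   MLE p_hat = k/n:          R(p, p_hat)   = p(1 - p)/3
#   constant estimator 1/2:   R(p, p_tilde) = (1/2 - p)^2

def risk_mle(p):
    return p * (1.0 - p) / 3.0

def risk_half(p):
    return (0.5 - p) ** 2

ps = [0.25 + 0.5 * i / 1000 for i in range(1001)]   # grid over [1/4, 3/4]
never_worse = all(risk_half(p) <= risk_mle(p) + 1e-12 for p in ps)
strictly_better = risk_half(0.5) < risk_mle(0.5)    # 0 < 1/12
print(never_worse, strictly_better)  # True True
```

The risks coincide exactly at the boundary points p = 1/4 and p = 3/4, as the closed-form argument predicts.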

SLIDE 20

Estimating the parameters of the categorical distribution: the MLE approach

CMU, 2009 spring, Ziv Bar-Joseph, HW1, pr. 2.3

SLIDE 21

In this problem we will derive the MLE for the parameters of a categorical distribution where the variable of interest, X, can take on k values, namely a1, a2, . . . , ak, the probability of seeing an event of type j being θj for j = 1, . . . , k.

a. Given data describing n independent, identically distributed observations of X, namely d1, . . . , dn, each of which can be one of k values, express the likelihood of the data given k − 1 parameters for the distribution over X. Let ni represent the number of times X takes on value i in the data.

Answer: The likelihood of the data is:

L(θ) = P(d1, . . . , dn | θ)
     = ∏_{j=1}^n ∏_{i=1}^k θi^{I{dj=ai}}    (by i.i.d.; I is the indicator function)
     = ∏_{i=1}^k θi^{ni}    (with ni = Σ_{j=1}^n I{dj=ai})
     = (1 − Σ_{i=1}^{k−1} θi)^{nk} ∏_{i=1}^{k−1} θi^{ni}    (since the thetas sum to one, θk = 1 − Σ_{i=1}^{k−1} θi).

SLIDE 22

b. Find θ̂j, the MLE for θj, one of the k − 1 parameters, by setting the partial derivative of the likelihood in part a with respect to θj equal to zero and solving for it. Hint: You may want to start by first taking the log of the likelihood from part a before taking its derivative.

Answer:

ln L(θ) = nk ln(1 − Σ_{i=1}^{k−1} θi) + Σ_{i=1}^{k−1} ni ln θi
⇒ ∂ ln L(θ)/∂θj = −nk/(1 − Σ_{i=1}^{k−1} θi) + nj/θj.

∂ ln L(θ)/∂θj = 0 ⇔ nj/θ̂j = nk/(1 − θ̂j − Σ_{i=1,i≠j}^{k−1} θi)
⇔ nj (1 − θ̂j − Σ_{i=1,i≠j}^{k−1} θi) = nk θ̂j
⇔ nj (1 − Σ_{i=1,i≠j}^{k−1} θi) = (nk + nj) θ̂j
⇔ θ̂j = nj/(nj + nk) · (1 − Σ_{i=1,i≠j}^{k−1} θi)    for all j ∈ {1, . . . , k − 1}.

SLIDE 23

c. At this point you should have k − 1 equations describing MLEs of different parameters. Show how those equations imply that the MLE for a parameter θj representing the probability that X takes on value j is equal to nj/n.

Hint: In order to remove the k-th parameter from the likelihood in part a you had to represent it with an equation, θk = f( ). At this point you may find it helpful to replace all occurrences of f( ) with θk. After replacing f( ) with θk you can substitute all occurrences of each other parameter in f( ) with its MLE from part b. This should allow you to solve for the MLE of θk, which can then be used to simplify all of the other equations.

SLIDE 24

Answer: As the likelihood function is uniquely optimal for the vector θ, the last equation in part b can be written as:

θ̂j = nj/(nj + nk) · (1 − Σ_{i=1,i≠j}^{k−1} θ̂i)
⇔ θ̂j = nj/(nj + nk) · (θ̂j + θ̂k)    (because θ̂k = 1 − Σ_{i=1}^{k−1} θ̂i)
⇔ θ̂j (1 − nj/(nj + nk)) = nj/(nj + nk) · θ̂k
⇔ θ̂j · nk/(nj + nk) = nj/(nj + nk) · θ̂k
⇔ θ̂j nk = nj θ̂k
⇔ θ̂j = (nj/nk) θ̂k    for all j ∈ {1, . . . , k − 1}.

SLIDE 25

Finally,

θ̂k = 1 − θ̂1 − . . . − θ̂k−1 = 1 − (n1/nk) θ̂k − . . . − (nk−1/nk) θ̂k
⇒ nk θ̂k = nk − (n1 + . . . + nk−1) θ̂k
⇒ θ̂k (n1 + . . . + nk−1 + nk) = nk, i.e. θ̂k · n = nk
⇒ θ̂k = nk/n
⇒ θ̂j = (nj/nk) · (nk/n) = nj/n    for all j ∈ {1, . . . , k − 1}.

Note: Even though [here] we can go from the non-hatted to the hatted form of the equation in the first step of c, this will generally not be possible. To solve for a maximum likelihood criterion under additional constraints like the thetas summing to one, a generic and useful method is the method of Lagrange multipliers.
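For completeness, the Lagrange-multiplier route mentioned in the note can be sketched as follows (a standard argument, not part of the original solution):

```latex
\max_{\theta}\ \sum_{i=1}^{k} n_i \ln\theta_i
\quad \text{subject to} \quad \sum_{i=1}^{k} \theta_i = 1,
\qquad
\mathcal{L}(\theta,\lambda) = \sum_{i=1}^{k} n_i \ln\theta_i
  + \lambda\Bigl(1 - \sum_{i=1}^{k} \theta_i\Bigr).

\frac{\partial\mathcal{L}}{\partial\theta_i} = \frac{n_i}{\theta_i} - \lambda = 0
\;\Rightarrow\; \theta_i = \frac{n_i}{\lambda};
\qquad
\sum_{i=1}^{k}\theta_i = 1 \;\Rightarrow\; \lambda = \sum_{i=1}^{k} n_i = n
\;\Rightarrow\; \hat\theta_i = \frac{n_i}{n}.
```

Here all k parameters are kept, and the sum-to-one constraint enters through the multiplier λ instead of by eliminating θk.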

SLIDE 26

The Gaussian [uni-variate] distribution: estimating µ when σ² is known

CMU, 2011 fall, Tom Mitchell, Aarti Singh, HW2, pr. 1
CMU, 2010 fall, Ziv Bar-Joseph, HW1, pr. 1.2-3

SLIDE 27

Assume we have n samples, x1, . . . , xn, independently drawn from a normal distribution with known variance σ² and unknown mean µ.

a. Derive the MLE estimator for the mean µ.

Solution:

P(x1, . . . , xn | µ) = ∏_{i=1}^n P(xi | µ) = ∏_{i=1}^n (1/(√(2π) σ)) e^{−(xi − µ)²/(2σ²)}
⇒ ln P(x1, . . . , xn | µ) = Σ_{i=1}^n [ln(1/(√(2π) σ)) − (xi − µ)²/(2σ²)]
⇒ ∂/∂µ ln P(x1, . . . , xn | µ) = Σ_{i=1}^n (xi − µ)/σ².

∂/∂µ ln P(x1, . . . , xn | µ) = 0 ⇔ Σ_{i=1}^n (xi − µ)/σ² = 0 ⇔ Σ_{i=1}^n (xi − µ) = 0 ⇔ Σ_{i=1}^n xi = nµ ⇒ µ_MLE = (Σ_{i=1}^n xi)/n.

Remark: It can be easily shown that ln P(x1, . . . , xn | µ) indeed reaches its maximum for µ = µ_MLE.

SLIDE 28

b. Show that E[µ_MLE] = µ.

Solution: The sample x1, . . . , xn can be seen as the realization of n independent random variables X1, . . . , Xn of Gaussian distribution of mean µ and variance σ². Then, due to the linearity of expectation, we get:

E[µ_MLE] = E[(X1 + . . . + Xn)/n] = (E[X1] + . . . + E[Xn])/n = nµ/n = µ.

Therefore, the µ_MLE estimator is unbiased.

c. What is Var[µ_MLE]?

Solution:

Var[µ_MLE] = Var[(1/n) Σ_{i=1}^n Xi] ⁽*⁾ = (1/n²) Σ_{i=1}^n Var[Xi] = n · (1/n²) Var[X1] = σ²/n    (using independence and identical distribution).

Therefore, Var[µ_MLE] → 0 as n → ∞.

(*) Remember that Var[aX] = a² Var[X].

SLIDE 29

d. Now derive the MAP estimator for the mean µ. Assume that the prior distribution for the mean is itself a normal distribution with mean ν and variance β².

SLIDE 30

Solution 1:

P(µ | x1, . . . , xn) = P(x1, . . . , xn | µ) P(µ) / P(x1, . . . , xn)    (Bayes' theorem)  (1)
= [∏_{i=1}^n (1/(√(2π) σ)) e^{−(xi − µ)²/(2σ²)}] · (1/(√(2π) β)) e^{−(µ − ν)²/(2β²)} / C,    (2)

where C denotes P(x1, . . . , xn).

⇒ ln P(µ | x1, . . . , xn) = −Σ_{i=1}^n [ln(√(2π) σ) + (xi − µ)²/(2σ²)] − ln(√(2π) β) − (µ − ν)²/(2β²) − ln C
⇒ ∂/∂µ ln P(µ | x1, . . . , xn) = Σ_{i=1}^n (xi − µ)/σ² − (µ − ν)/β².

∂/∂µ ln P(µ | x1, . . . , xn) = 0 ⇔ Σ_{i=1}^n (xi − µ)/σ² = (µ − ν)/β² ⇔ µ (1/β² + n/σ²) = (Σ_{i=1}^n xi)/σ² + ν/β²
⇒ µ_MAP = (σ²ν + β² Σ_{i=1}^n xi)/(σ² + nβ²).

SLIDE 31

Solution 2: Instead of computing the derivative of the posterior distribution P(µ | x1, . . . , xn), we will first show that the right-hand side of (2) is itself a Gaussian [in µ], and then use the fact that a Gaussian achieves its maximum value at its mean.

P(µ | x1, . . . , xn) = (1/C) [∏_{i=1}^n (1/(√(2π) σ)) e^{−(xi − µ)²/(2σ²)}] · (1/(√(2π) β)) e^{−(µ − ν)²/(2β²)}
= const · exp(−(Σ_{i=1}^n (xi − µ)²)/(2σ²) − (µ − ν)²/(2β²))
= const · exp(−(β² Σ_{i=1}^n (xi − µ)² + σ²(µ − ν)²)/(2σ²β²))
= const · exp(−((nβ² + σ²)/(2σ²β²)) µ² + ((β² Σ_{i=1}^n xi + νσ²)/(σ²β²)) µ − (β² Σ_{i=1}^n xi² + ν²σ²)/(2σ²β²)).

SLIDE 32

Writing m for (β² Σ_{i=1}^n xi + νσ²)/(nβ² + σ²) and c for (β² Σ_{i=1}^n xi² + ν²σ²)/(nβ² + σ²), and completing the square in µ:

P(µ | x1, . . . , xn) = const · exp(−(µ² − 2mµ + c) / (2σ²β²/(nβ² + σ²)))
= const · exp(−((µ − m)² − m² + c) / (2σ²β²/(nβ² + σ²)))
= const · exp(−(µ − m)² / (2σ²β²/(nβ² + σ²))) · exp((m² − c) / (2σ²β²/(nβ² + σ²)))
= const′ · exp(−(µ − m)² / (2σ²β²/(nβ² + σ²))).

SLIDE 33

The exp term in the last equality being [the kernel of] a Gaussian of mean (β² Σ_{i=1}^n xi + νσ²)/(nβ² + σ²) and variance σ²β²/(nβ² + σ²), it follows that its maximum is obtained for

µ = (β² Σ_{i=1}^n xi + νσ²)/(nβ² + σ²) = µ_MAP.

SLIDE 34

e. Please comment on what happens to the MLE and MAP estimators for the mean µ as the number of samples n goes to infinity.

Solution:

µ_MLE = (Σ_{i=1}^n xi)/n

µ_MAP = (σ²ν + β² Σ_{i=1}^n xi)/(σ² + nβ²) = σ²ν/(σ² + nβ²) + (β² Σ_{i=1}^n xi)/(σ² + nβ²) = σ²ν/(σ² + nβ²) + ((1/n) Σ_{i=1}^n xi)/(1 + σ²/(nβ²)) = σ²ν/(σ² + nβ²) + µ_MLE/(1 + σ²/(nβ²)).

As n → ∞, σ²ν/(σ² + nβ²) → 0 and σ²/(nβ²) → 0, hence µ_MAP → µ_MLE.
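A numerical sketch of this limit; the specific values of σ², β², ν, the true mean and the sample sizes are arbitrary choices:

```python
import random

random.seed(0)
sigma2, beta2, nu = 4.0, 1.0, 10.0     # known variance, prior variance, prior mean
xs = [random.gauss(2.0, sigma2 ** 0.5) for _ in range(100000)]

def mle_and_map(data):
    n, s = len(data), sum(data)
    mu_mle = s / n
    mu_map = (sigma2 * nu + beta2 * s) / (sigma2 + n * beta2)
    return mu_mle, mu_map

# |mu_MAP - mu_MLE| = sigma^2 |nu - mu_MLE| / (sigma^2 + n beta^2),
# which shrinks like 1/n as the sample grows.
gaps = [abs(b - a) for a, b in (mle_and_map(xs[:n]) for n in (10, 1000, 100000))]
print(gaps[0] > gaps[1] > gaps[2])  # True
```

With a prior mean ν far from the data, the pull of the prior is visible at n = 10 and negligible at n = 100000.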

SLIDE 35

The Gaussian [uni-variate] distribution: estimating σ² when µ = 0

CMU, 2009 spring, Ziv Bar-Joseph, HW1, pr. 2.1

SLIDE 36

Let X be a random variable distributed according to a Normal distribution with 0 mean and σ² variance, i.e. X ∼ N(0, σ²).

a. Find the maximum likelihood estimate for σ², i.e. σ²_MLE.

Solution: Let X1, X2, . . . , Xn be drawn i.i.d. ∼ N(0, σ²). Let f be the density function corresponding to X. Then we can write the likelihood function as:

L(X1, X2, . . . , Xn | σ²) = ∏_{i=1}^n f(Xi; µ = 0, σ²) = (1/(√(2π) σ))^n ∏_{i=1}^n exp(−(Xi − 0)²/(2σ²)) = (1/(√(2π) σ))^n exp(−(Σ_{i=1}^n Xi²)/(2σ²))

⇒ ln L = constant − (n/2) ln σ² − (1/(2σ²)) Σ_{i=1}^n Xi²
⇒ ∂ ln L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n Xi².

Therefore, ∂ ln L/∂σ² = 0 ⇔ σ²_MLE = (1/n) Σ_{i=1}^n Xi².

Note: It can be easily shown that L(X1, X2, . . . , Xn | σ²) indeed reaches its maximum for σ² = (1/n) Σ_{i=1}^n Xi².

SLIDE 37

b. Is the estimator you obtained biased?

Solution: It is unbiased, since:

E[(1/n) Σ_{i=1}^n Xi²] = (n/n) E[X²]    (since the Xi are i.i.d.)
= Var[X] + (E[X])² = Var[X] = σ²    (since E[X] = 0).

SLIDE 38

The Gaussian [uni-variate] distribution: estimating σ² (without restrictions on µ)

CMU, 2010 fall, Ziv Bar-Joseph, HW1, pr. 2.1.1-2

SLIDE 39

Let x denote (x1, . . . , xn), observed i.i.d. samples from a Gaussian distribution N(x | µ, σ²).

a. Derive σ²_MLE, the MLE for σ².

Solution: The p.d.f. for N(x | µ, σ²) has the form f(x) = (1/(√(2π) σ)) e^{−(x − µ)²/(2σ²)}. The log-likelihood function of the data x is:

ln L(x | µ, σ²) = ln ∏_{i=1}^n f(xi) = Σ_{i=1}^n [−(1/2) ln(2πσ²) − (xi − µ)²/(2σ²)] = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (xi − µ)².

The partial derivative of ln L w.r.t. σ²:

∂ ln L(x | µ, σ²)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (xi − µ)².

Solving the equation ∂ ln L(x | µ, σ²)/∂σ² = 0, we get:

σ²_MLE = (1/n) Σ_{i=1}^n (xi − µ_MLE)².

Note that we had to take into account the optimal value of µ (see problem CMU, 2011 fall, T. Mitchell, A. Singh, HW2, pr. 1).

SLIDE 40

b. Show that E[σ²_MLE] = ((n − 1)/n) σ².

Solution:

E[σ²_MLE] = E[(1/n) Σ_{i=1}^n (xi − µ_MLE)²] = E[(x1 − µ_MLE)²]    (by symmetry among the i.i.d. xi)
= E[(x1 − (1/n) Σ_{i=1}^n xi)²]
= E[x1² − (2/n) x1 Σ_{i=1}^n xi + (1/n²) (Σ_{i=1}^n xi)²]
= E[x1² − (2/n) x1 Σ_{i=1}^n xi + (1/n²) (Σ_{i=1}^n xi² + 2 Σ_{i<j} xi xj)]
= E[x1²] + (1/n²) Σ_{i=1}^n E[xi²] − (2/n) Σ_{i=1}^n E[x1 xi] + (2/n²) Σ_{i<j} E[xi xj]
= E[x1²] + (1/n²) n E[x1²] − (2/n) E[x1²] − (2/n)(n − 1) E[x1 x2] + (2/n²) (n(n − 1)/2) E[x1 x2]
= ((n − 1)/n) E[x1²] − ((n − 1)/n) E[x1 x2].

SLIDE 41

σ² = Var(x1) = E[x1²] − (E[x1])² = E[x1²] − µ² ⇒ E[x1²] = σ² + µ².

Because x1 and x2 are independent, it follows that Cov(x1, x2) = 0. Therefore,

0 = Cov(x1, x2) = E[(x1 − E[x1])(x2 − E[x2])] = E[(x1 − µ)(x2 − µ)] = E[x1 x2] − µ E[x1 + x2] + µ² = E[x1 x2] − µ(E[x1] + E[x2]) + µ² = E[x1 x2] − µ(2µ) + µ² = E[x1 x2] − µ².

So, E[x1 x2] = µ². By substituting E[x1²] = σ² + µ² and E[x1 x2] = µ² into the previously obtained equality (E[σ²_MLE] = ((n − 1)/n) E[x1²] − ((n − 1)/n) E[x1 x2]), we get:

E[σ²_MLE] = ((n − 1)/n)(σ² + µ²) − ((n − 1)/n) µ² = ((n − 1)/n) σ².

SLIDE 42

c. Find an unbiased estimator for σ².

Solution: It follows immediately [from part b] that (1/(n − 1)) Σ_{i=1}^n (xi − µ_MLE)² is an unbiased estimator of σ².
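A quick numeric illustration of the n versus n − 1 denominators (the sample values below are arbitrary):

```python
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(xs)
mu_mle = sum(xs) / n                         # sample mean (the MLE of mu)
ss = sum((x - mu_mle) ** 2 for x in xs)      # sum of squared deviations

sigma2_mle = ss / n                          # biased: E[.] = ((n-1)/n) sigma^2
sigma2_unbiased = ss / (n - 1)               # the unbiased estimator from part c

# The two estimators differ exactly by the factor (n - 1)/n.
print(sigma2_mle, sigma2_unbiased)
```

This matches NumPy's convention, where `var(xs)` uses the 1/n form and `var(xs, ddof=1)` the 1/(n − 1) form.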

SLIDE 43

The Gaussian multi-variate distribution: ML estimation of the mean and the precision matrix, Λ
(Λ is the inverse of the covariance matrix, Σ)

CMU, 2010 fall, Aarti Singh, HW1, pr. 3.2.1

SLIDE 44

The density function of a d-dimensional Gaussian distribution is as follows:

N(x | µ, Λ⁻¹) = exp(−(1/2)(x − µ)ᵀ Λ (x − µ)) / ((2π)^{d/2} |Λ⁻¹|^{1/2}),

where Λ is the inverse of the covariance matrix, or the so-called precision matrix. Let {x1, x2, . . . , xn} be an i.i.d. sample from a d-dimensional Gaussian distribution. Suppose that n ≫ d. Derive the MLE estimates µ̂ and Λ̂.

SLIDE 45

Hint: You may find useful the following formulas (taken from Matrix Identities, by Sam Roweis, 1999):

(2b) |A⁻¹| = 1/|A|
(2e) Tr(AB) = Tr(BA);ᵃ more generally, Tr(ABC . . .) = Tr(BC . . . A) = Tr(C . . . AB) = . . .
(3b) ∂/∂X Tr(XA) = ∂/∂X Tr(AX) = Aᵀ
(4b) ∂/∂X ln |X| = (X⁻¹)ᵀ = (Xᵀ)⁻¹
(5c) ∂/∂X aᵀXb = abᵀ
(5g) ∂/∂X (Xa + b)ᵀC(Xa + b) = (C + Cᵀ)(Xa + b)aᵀ

Tr(A), the trace of an n-by-n square matrix A, is defined as the sum of the elements on the main diagonal (the diagonal from the upper left to the lower right) of A, i.e., a11 + . . . + ann.

ᵃ See Theorem 1.3.d from Matrix Analysis for Statistics, 2017, James R. Schott.

SLIDE 46

Given the x1, . . . , xn data, the log-likelihood function is:

l(µ, Λ) = ln ∏_{i=1}^n N(xi | µ, Λ⁻¹) = Σ_{i=1}^n ln N(xi | µ, Λ⁻¹)
= −(nd/2) ln(2π) − (n/2) ln |Λ⁻¹| − (1/2) Σ_{i=1}^n (xi − µ)ᵀ Λ (xi − µ)
= −(nd/2) ln(2π) + (n/2) ln |Λ| − (1/2) Σ_{i=1}^n (xi − µ)ᵀ Λ (xi − µ)    (by (2b)).

For any fixed positive definite precision matrix Λ, the log-likelihood is a quadratic function of µ with a negative leading coefficient, hence a strictly concave function of µ. We then solve

∇µ l(µ, Λ) = 0 ⇔ (1/2)(Λ + Λᵀ) Σ_{i=1}^n (xi − µ) = 0    (by (5g)) ⇔ Λ Σ_{i=1}^n (xi − µ) = 0 ⇔ Λ Σ_{i=1}^n xi = nΛµ,

where the second equivalence uses the symmetry of Λ. From the last equation we get, by the assumption that Λ is invertible, the following estimate of µ:

µ̂ = (Σ_{i=1}^n xi)/n,

which coincides with the sample mean x̄ and is constant w.r.t. Λ.

SLIDE 47

Now that we have l(µ, Λ) ≤ l(µ̂, Λ) for all µ ∈ R^d and positive definite Λ, we turn to Λ by first plugging µ̂ back into the log-likelihood function:

l(µ̂, Λ) = −(nd/2) ln(2π) + (n/2) ln |Λ| − (1/2) Σ_{i=1}^n (xi − x̄)ᵀ Λ (xi − x̄)    (3)
= −(nd/2) ln(2π) + (n/2) (ln |Λ| − Tr(SΛ)),    (4)   (by (2e))

where S is the sample covariance matrix: S = (1/n) Σ_{i=1}^n (xi − x̄)(xi − x̄)ᵀ.

Explanation: (xi − x̄)ᵀ Λ (xi − x̄) is a 1-by-1 matrix, therefore (xi − x̄)ᵀ Λ (xi − x̄) = Tr((xi − x̄)ᵀ Λ (xi − x̄)), and using the (2e) rule, it can be further written as Tr((xi − x̄)(xi − x̄)ᵀ Λ). Using another simple rule, Tr(A + B) = Tr(A) + Tr(B) (which can be easily proven), it follows that

Σ_{i=1}^n (xi − x̄)ᵀ Λ (xi − x̄) = Σ_{i=1}^n Tr((xi − x̄)(xi − x̄)ᵀ Λ) = Tr(Σ_{i=1}^n (xi − x̄)(xi − x̄)ᵀ Λ) = Tr((Σ_{i=1}^n (xi − x̄)(xi − x̄)ᵀ) Λ) = Tr((nS) Λ) = n Tr(SΛ).

SLIDE 48

By the fact that ln |Λ| is strictly concave on the domain of positive definite Λ,ᵃ and that Tr(SΛ) is linear in Λ, we are able to find the maximum of the expression (4) by solving ∇Λ l(µ̂, Λ) = 0, which can be shownᵇ to be equivalent to Λ⁻¹ − S = 0. (Therefore, Σ̂ = Λ̂⁻¹ = S.) Since n ≫ d, we can safely assume that S is invertible and get the following estimate:

Λ̂ = S⁻¹.

ᵃ See, for example, Section 3.1.5 of Convex Optimization: http://www.stanford.edu/∼boyd/cvxbook/.
ᵇ Applying (4b) and (3b) to (4) you'll get (Λᵀ)⁻¹ − Sᵀ (up to the positive factor n/2). Then (Λᵀ)⁻¹ − Sᵀ = 0 ⇔ (Λ⁻¹)ᵀ = Sᵀ ⇔ Λ⁻¹ = S.

SLIDE 49

Notes

1. In the above derivation, we have ensured that the estimates µ̂ and Λ̂ are in the parameter space and satisfy l(µ, Λ) ≤ l(µ̂, Λ) ≤ l(µ̂, Λ̂) for all µ ∈ R^d and all positive definite Λ, so they are the MLE estimates.

2. Instead of using the relation (4), i.e., working with the Tr functional, one could directly compute the partial derivative of l(µ̂, Λ) in (3):

∇Λ l(µ̂, Λ) = (n/2)(Λᵀ)⁻¹ − (1/2) Σ_{i=1}^n (xi − µ̂)(xi − µ̂)ᵀ    (by (4b) and (5c))
= (n/2) Λ⁻¹ − (1/2) Σ_{i=1}^n (xi − µ̂)(xi − µ̂)ᵀ = (n/2) Λ⁻¹ − (n/2) S.

So, ∇Λ l(µ̂, Λ) = 0 ⇔ Λ̂⁻¹ = Σ̂ = S = (1/n) Σ_{i=1}^n (xi − µ̂)(xi − µ̂)ᵀ.
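The closed forms µ̂ = x̄ and Λ̂ = S⁻¹ can be checked with NumPy (a sketch; the synthetic data set and its mixing matrix are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.3, 1.5]])
# n = 500 samples of a 3-dimensional Gaussian (n >> d, so S is invertible).
X = rng.standard_normal((500, 3)) @ A + np.array([1.0, -2.0, 0.5])

n, d = X.shape
mu_hat = X.mean(axis=0)                    # MLE of the mean: the sample mean
S = (X - mu_hat).T @ (X - mu_hat) / n      # sample covariance (divide by n)
Lambda_hat = np.linalg.inv(S)              # MLE of the precision matrix

# S matches NumPy's biased covariance estimate, and Lambda_hat inverts it.
print(np.allclose(S, np.cov(X, rowvar=False, bias=True)))  # True
```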

SLIDE 50

The Gamma distribution: Maximum Likelihood Estimation of parameters

Liviu Ciortuz, 2017

SLIDE 51

The Gamma distribution of parameters r > 0 and β > 0 has the following density function:

Gamma(x | r, β) = (1/(β^r Γ(r))) x^{r−1} e^{−x/β},    for all x > 0,

where the Γ symbol designates Euler's Gamma function.

[Plot: p(x) vs. x for (r = 1.0, β = 2.0), (r = 2.0, β = 2.0), (r = 3.0, β = 2.0), (r = 5.0, β = 1.0), (r = 9.0, β = 0.5), (r = 7.5, β = 1.0), (r = 0.5, β = 1.0)]

Notes:

1. In the above definition, 1/(β^r Γ(r)) is the so-called normalization factor, since it does not depend on x, and ∫₀^{+∞} x^{r−1} e^{−x/β} dx = β^r Γ(r).

2. Euler's Gamma function is defined as follows: Γ(r) = ∫₀^{+∞} t^{r−1} e^{−t} dt, which extends to all r ∈ R except the non-positive integers. Starting from the definition of Γ, it can be easily shown that Γ(r + 1) = r Γ(r) for any r > 0, and also Γ(1) = 1. Therefore, for any positive integer r, Γ(r + 1) = r · Γ(r) = r · (r − 1) · Γ(r − 1) = . . . = r · (r − 1) · . . . · 2 · 1 = r!, which means that the Γ function generalizes the factorial function.

3. The exponential distribution is a member of the Gamma family of distributions. (Just set r to 1 in Gamma's density function.)

SLIDE 52

Consider x1, . . . , xn ∈ R₊, all of them [having been] generated by one component of the above family of distributions. Find the maximum likelihood estimation of the parameters r and β.

Solution

The likelihood function:

L(r, β) = P(x1, . . . , xn | r, β) = ∏_{i=1}^n P(xi | r, β) = β^{−rn} (Γ(r))^{−n} (∏_{i=1}^n xi)^{r−1} e^{−(1/β) Σ_{i=1}^n xi}    (by i.i.d.)

The log-likelihood function:

ℓ(r, β) = ln L(r, β) = −rn ln β − n ln Γ(r) + (r − 1) Σ_{i=1}^n ln xi − (1/β) Σ_{i=1}^n xi.

SLIDE 53

Now we calculate the partial derivative of ℓ(r, β) w.r.t. β, and equate it to 0:

∂ℓ(r, β)/∂β = −rn/β + (1/β²) Σ_{i=1}^n xi = (1/β²) (Σ_{i=1}^n xi − rnβ)

∂ℓ(r, β)/∂β = 0 ⇔ β̂ = (1/(rn)) Σ_{i=1}^n xi > 0.

By substituting β̂ into ℓ(r, β), we get:

ℓ(r, β̂) = −rn ln β̂ − n ln Γ(r) + (r − 1) Σ_{i=1}^n ln xi − (1/β̂) Σ_{i=1}^n xi
= rn ln(rn) − rn ln(Σ_{i=1}^n xi) − n ln Γ(r) + (r − 1) Σ_{i=1}^n ln xi − rn
= rn ln(rn) − rn (ln(Σ_{i=1}^n xi) + 1) − n ln Γ(r) + (r − 1) Σ_{i=1}^n ln xi.

SLIDE 54

Therefore, by computing the partial derivative of ℓ(r, β̂) with respect to r, and equating this derivative to 0, we get:

∂ℓ(r, β̂)/∂r = 0
⇔ n ln(nr) + n − n (ln(Σ_{i=1}^n xi) + 1) − n · Γ′(r)/Γ(r) + Σ_{i=1}^n ln xi = 0
⇔ n (ln r − ψ(r)) = −n ln n − Σ_{i=1}^n ln xi + n ln(Σ_{i=1}^n xi)
⇔ ln r − ψ(r) = −ln n − (1/n) Σ_{i=1}^n ln xi + ln(Σ_{i=1}^n xi),

where ψ = Γ′/Γ is the digamma function. The solution of the last equation is r̂, the maximum likelihood estimate of the parameter r.
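The last equation has no closed-form solution, but since ln r − ψ(r) is positive and strictly decreasing on (0, ∞), it can be solved numerically. A sketch using SciPy (assuming SciPy is available; the sample values are arbitrary):

```python
from math import log

from scipy.optimize import brentq
from scipy.special import digamma   # psi(r) = Gamma'(r) / Gamma(r)

xs = [1.2, 0.7, 3.5, 2.1, 0.9, 1.6, 2.8]
n = len(xs)
mean_x = sum(xs) / n

# Right-hand side of the final equation: ln(mean of x) - mean of ln(x),
# which is > 0 by Jensen's inequality (unless all samples are equal).
s = log(mean_x) - sum(log(x) for x in xs) / n

# ln r - psi(r) decreases from +inf (as r -> 0+) to 0 (as r -> inf),
# so the root can be bracketed by a wide interval.
r_hat = brentq(lambda r: log(r) - digamma(r) - s, 1e-8, 1e8)
beta_hat = mean_x / r_hat   # beta-hat = (1/(rn)) * sum of x_i
```

A Newton iteration on r would converge faster, but bracketing with `brentq` is robust and sufficient here.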