Estimating the parameters of some probability distributions: Exemplifications
1. Estimating the parameter of the Bernoulli distribution: the MLE and MAP approaches
(CMU, 2015 spring, Tom Mitchell, Nina Balcan, HW2, pr. 2)
Suppose we observe the values of n i.i.d. (independent, identically distributed) random variables X1, . . . , Xn drawn from a single Bernoulli distribution with parameter θ. In other words, for each Xi, we know that P(Xi = 1) = θ and P(Xi = 0) = 1 − θ. Our goal is to estimate the value of θ from the observed values of X1, . . . , Xn.
For any hypothetical value $\hat\theta$, we can compute the probability of observing the data X1, . . . , Xn if the true parameter value θ were equal to $\hat\theta$. This probability of the observed data is often called the data likelihood, and the function $L(\hat\theta)$ that maps each $\hat\theta$ to the corresponding likelihood is called the likelihood function. A natural way to estimate the unknown parameter θ is to choose the $\hat\theta$ that maximizes the likelihood function. Formally,
$$\hat\theta_{MLE} = \operatorname*{argmax}_{\hat\theta} L(\hat\theta).$$
a. Write a formula for the likelihood function $L(\hat\theta)$. Your function should depend on the random variables X1, . . . , Xn and the hypothetical parameter $\hat\theta$. Does the likelihood function depend on the order of the random variables?

Solution: Since the Xi are independent, we have
$$L(\hat\theta) = P_{\hat\theta}(X_1, \ldots, X_n) = \prod_{i=1}^{n} P_{\hat\theta}(X_i) = \prod_{i=1}^{n} \hat\theta^{X_i}(1-\hat\theta)^{1-X_i} = \hat\theta^{\#\{X_i=1\}}(1-\hat\theta)^{\#\{X_i=0\}},$$
where #{·} counts the number of Xi for which the condition in braces holds true. Note that in the third equality we used the trick $X_i = I_{\{X_i=1\}}$. The likelihood function does not depend on the order of the data.
b. Suppose that n = 10 and the data set contains six 1s and four 0s. Write a short computer program that plots the likelihood function of this data. For the plot, the x-axis should be $\hat\theta$, and the y-axis $L(\hat\theta)$. Scale your y-axis so that you can see some variation in its value. Estimate $\hat\theta_{MLE}$ by marking on the x-axis the value of $\hat\theta$ that maximizes the likelihood.

Solution:

[Plot: likelihood function and MLE; n = 10, six 1s, four 0s.]
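A minimal Python sketch of such a program, using numpy and matplotlib (the assignment does not prescribe a language; all names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Data from part b: n = 10 with six 1s and four 0s.
n1, n0 = 6, 4

# Evaluate the likelihood L(theta) = theta^n1 * (1 - theta)^n0 on a grid.
theta = np.linspace(0.0, 1.0, 1001)
likelihood = theta**n1 * (1.0 - theta)**n0

# The grid point with the largest likelihood approximates theta_MLE.
theta_mle = theta[np.argmax(likelihood)]
print(f"theta_MLE ~ {theta_mle:.3f}")        # expected: 0.6 = n1 / (n1 + n0)

plt.plot(theta, likelihood)
plt.axvline(theta_mle, linestyle="--")
plt.xlabel(r"$\hat\theta$")
plt.ylabel(r"$L(\hat\theta)$")
plt.title("MLE; n = 10, six 1s, four 0s")
plt.show()
```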
c. Derive a closed form formula for $\hat\theta_{MLE}$, the MLE estimate of θ. Does the closed form agree with the plot?

Solution: Let's consider $l(\hat\theta) = \ln(L(\hat\theta))$. Since the ln function is increasing, the $\hat\theta$ that maximizes the log-likelihood is the same as the $\hat\theta$ that maximizes the likelihood. We can rewrite $l(\hat\theta)$ as follows:
$$l(\hat\theta) = \ln\big(\hat\theta^{n_1}(1-\hat\theta)^{n_0}\big) = n_1 \ln\hat\theta + n_0 \ln(1-\hat\theta),$$
where $n_1 \stackrel{\text{not.}}{=} \#\{X_i = 1\}$ and $n_0 \stackrel{\text{not.}}{=} \#\{X_i = 0\}$. Assuming that $\hat\theta \neq 0$ and $\hat\theta \neq 1$, the first and second derivatives of l are given by
$$l'(\hat\theta) = \frac{n_1}{\hat\theta} - \frac{n_0}{1-\hat\theta} \quad\text{and}\quad l''(\hat\theta) = -\frac{n_1}{\hat\theta^2} - \frac{n_0}{(1-\hat\theta)^2}.$$
Since $l''(\hat\theta)$ is always negative, the function l is concave, and we can find its maximizer by solving the equation $l'(\hat\theta) = 0$. The solution to this equation is given by
$$\hat\theta_{MLE} = \frac{n_1}{n_1 + n_0}.$$
For the data in part b this gives $\hat\theta_{MLE} = 6/10 = 0.6$, which agrees with the plot.
d. Make likelihood plots for three more data sets: one where n = 5 and the data set contains three 1s and two 0s; one where n = 100 and the data set contains sixty 1s and forty 0s; and one where n = 10 and there are five 1s and five 0s.

Solution:

[Plot: likelihood function and MLE; n = 5, three 1s, two 0s.]
[Plot: likelihood function and MLE; n = 100, sixty 1s, forty 0s.]
Solution (to part d., continued):

[Plot: likelihood function and MLE; n = 10, five 1s, five 0s.]
e. Describe how the likelihood functions and maximum likelihood estimates compare for the different data sets.

Solution (to part e.): The MLE is equal to the proportion of 1s observed in the data, so for the first three plots the MLE is always at 0.6, while for the last plot it is at 0.5. As the number of samples n increases, the likelihood function gets more peaked at its maximum value, and the values it takes on decrease.
Reminder: Maximum a Posteriori Probability Estimation
In the maximum likelihood estimate, we treated the true parameter value θ as a fixed (non-random) number. In cases where we have some prior knowledge about θ, it is useful to treat θ itself as a random variable, and express our prior knowledge in the form of a prior probability distribution over θ. For example, suppose that X1, . . . , Xn are generated in the following way:
− First, the value of θ is drawn from a given prior probability distribution.
− Second, X1, . . . , Xn are drawn independently from a Bernoulli distribution using this value for θ.
Since both θ and the sequence X1, . . . , Xn are random, they have a joint probability distribution. Under this interpretation, a natural way to estimate θ is to choose its most probable value given its prior distribution plus the observed data X1, . . . , Xn.
Definition:
$$\hat\theta_{MAP} = \operatorname*{argmax}_{\hat\theta} P(\theta = \hat\theta \mid X_1, \ldots, X_n).$$
This is called the maximum a posteriori probability (MAP) estimate of θ.
Reminder (cont’d)
Using Bayes' rule, we can rewrite the posterior probability as follows:
$$P(\theta = \hat\theta \mid X_1, \ldots, X_n) = \frac{P(X_1, \ldots, X_n \mid \theta = \hat\theta)\, P(\theta = \hat\theta)}{P(X_1, \ldots, X_n)}.$$
Since the probability in the denominator does not depend on $\hat\theta$, the MAP estimate is given by
$$\hat\theta_{MAP} = \operatorname*{argmax}_{\hat\theta} P(X_1, \ldots, X_n \mid \theta = \hat\theta)\, P(\theta = \hat\theta) = \operatorname*{argmax}_{\hat\theta} L(\hat\theta)\, P(\theta = \hat\theta).$$
In words, the MAP estimate for θ is the value $\hat\theta$ that maximizes the likelihood function multiplied by the prior distribution on θ.
We will consider a Beta(3, 3) prior distribution for θ, which has the density function given by
$$p(\hat\theta) = \frac{\hat\theta^2 (1-\hat\theta)^2}{B(3, 3)},$$
where B(α, β) is the beta function and B(3, 3) ≈ 0.0333.

f. Suppose, as in part b, that n = 10 and we observed six 1s and four 0s. Write a short computer program that plots the function $\hat\theta \mapsto L(\hat\theta)\, p(\hat\theta)$ for the same values of $\hat\theta$ as in part b. Estimate $\hat\theta_{MAP}$ by marking on the x-axis the value of $\hat\theta$ that maximizes the function.

Solution:

[Plot: $L(\hat\theta)\, p(\hat\theta)$ and MAP estimate; n = 10, six 1s, four 0s; Beta(3, 3) prior.]
g. Derive a closed form formula for $\hat\theta_{MAP}$, the MAP estimate of θ. Does the closed form agree with the plot?

Solution: As in the case of the MLE, we will apply the ln function before finding the maximizer. We want to maximize the function
$$l(\hat\theta) = \ln\big(L(\hat\theta) \cdot p(\hat\theta)\big) = \ln\big(\hat\theta^{n_1+2} (1-\hat\theta)^{n_0+2}\big) - \ln B(3, 3).$$
The normalizing constant for the prior appears as an additive constant and therefore the first and second derivatives are identical to those in the case of the MLE (except with $n_1 + 2$ and $n_0 + 2$ instead of $n_1$ and $n_0$, respectively). It follows that the closed form formula for the MAP estimate is given by
$$\hat\theta_{MAP} = \frac{n_1 + 2}{n_1 + n_0 + 4}.$$
This formula agrees with the plot obtained in part f.
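As a quick numerical sanity check of this closed form (a sketch in Python; the grid search is mine, not part of the original solution):

```python
import numpy as np

n1, n0 = 6, 4                                  # data from part b
theta = np.linspace(1e-6, 1 - 1e-6, 100_001)

# Posterior-proportional objective: likelihood times the Beta(3,3) prior density.
# The constant 1/B(3,3) does not affect the argmax, so it is omitted here.
objective = theta**n1 * (1 - theta)**n0 * theta**2 * (1 - theta)**2

theta_map_numeric = theta[np.argmax(objective)]
theta_map_closed = (n1 + 2) / (n1 + n0 + 4)
print(theta_map_numeric, theta_map_closed)     # both ~ 0.5714 = 8/14
```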
h. Compare the MAP estimate with the MLE computed from the same data in part b. Briefly explain any significant difference.

Solution: The MAP estimate is equal to the MLE with four additional virtual random variables, two that are equal to 1, and two that are equal to 0. This pulls the value of the MAP estimate closer to the value 0.5, which is why $\hat\theta_{MAP}$ is smaller than $\hat\theta_{MLE}$.

i. Describe what happens to the MAP and MLE estimates as n goes to infinity, while the ratio #{Xi = 1}/#{Xi = 0} remains constant.

Solution: It is obvious that as n goes to infinity, the influence of the 4 virtual random variables diminishes, and the two estimators become equal.
Suppose X is a binary random variable that takes value 0 with probability p and value 1 with probability 1 − p. Let X1, . . . , Xn be i.i.d. samples of X.
a. Derive $\hat p$, the maximum likelihood estimate of p.

Answer: By way of definition,
$$\hat p = \operatorname*{argmax}_{p} P(X_1, \ldots, X_n \mid p) \stackrel{\text{i.i.d.}}{=} \operatorname*{argmax}_{p} \prod_{i=1}^{n} P(X_i \mid p) = \operatorname*{argmax}_{p}\; p^{k} (1-p)^{n-k},$$
where k is the number of 0's in x1, . . . , xn. Furthermore, since ln is a monotonic (strictly increasing) function,
$$\hat p = \operatorname*{argmax}_{p} \ln\big(p^{k} (1-p)^{n-k}\big) = \operatorname*{argmax}_{p} \big(k \ln p + (n-k) \ln(1-p)\big).$$
Computing the first derivative of $k \ln p + (n-k)\ln(1-p)$ w.r.t. p leads to:
$$\frac{\partial}{\partial p}\big(k \ln p + (n-k)\ln(1-p)\big) = \frac{k}{p} - \frac{n-k}{1-p}.$$
Hence,
$$\frac{\partial}{\partial p}\big(k \ln p + (n-k)\ln(1-p)\big) = 0 \;\Leftrightarrow\; \frac{k}{p} = \frac{n-k}{1-p} \;\Leftrightarrow\; \hat p = \frac{k}{n}.$$
b. Is $\hat p$ an unbiased estimate of p? Prove the answer.

Answer: Since k can be seen as a sum of n [independent] Bernoulli variables (indeed, $k = n - \sum_{i=1}^{n} X_i$),
$$E[\hat p] = E\!\left[\frac{k}{n}\right] = \frac{1}{n} E[k] = \frac{1}{n} E\!\left[n - \sum_{i=1}^{n} X_i\right] = \frac{1}{n}\left(n - \sum_{i=1}^{n} E[X_i]\right) = \frac{1}{n}\left(n - \sum_{i=1}^{n} (1-p)\right) = \frac{1}{n}\big(n - n(1-p)\big) = \frac{1}{n}\, np = p.$$
Therefore, $\hat p$ is an unbiased estimator for the parameter p.
c. Compute $E[(\hat p - p)^2]$, the expected square error of $\hat p$, in terms of p.

Answer:
$$E[(\hat p - p)^2] = E[\hat p^2] - 2E[\hat p]\, p + p^2 = \frac{E[k^2]}{n^2} - 2p^2 + p^2 = \frac{\mathrm{Var}(k) + E^2[k]}{n^2} - p^2 = \frac{np(1-p) + (np)^2}{n^2} - p^2 = \frac{p(1-p)}{n}.$$
We used the fact that $\mathrm{Var}(k) = E[k^2] - E^2[k]$, and also $\mathrm{Var}(k) = np(1-p)$ because, as already said, k can be seen as a sum of n independent Bernoulli variables of parameter p. Note that $E[\hat p] = p$ (cf. part b), therefore
$$E[(\hat p - p)^2] = E[(\hat p - E[\hat p])^2] = \mathrm{Var}[\hat p] = \frac{p(1-p)}{n}.$$
This implies that $\mathrm{Var}[\hat p] \to 0$ for $n \to \infty$.
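A small Monte Carlo sketch of parts b and c (the sampling setup and names are mine): it estimates E[p̂] and E[(p̂ − p)²] by simulation and compares them with p and p(1 − p)/n.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, trials = 0.3, 20, 200_000

# X takes value 0 with probability p and value 1 with probability 1 - p,
# so p_hat = k / n, where k is the number of zeros in each sample.
X = (rng.random((trials, n)) >= p).astype(int)   # 1 with probability 1 - p
k = n - X.sum(axis=1)                            # number of zeros per trial
p_hat = k / n

print(p_hat.mean(), p)                           # ~ p            (part b)
print(((p_hat - p)**2).mean(), p * (1 - p) / n)  # ~ p(1 - p) / n (part c)
```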
d. Prove that if you know that p lies in the interval [1/4, 3/4] and you are given only n = 3 samples of X, then $\hat p$ is an inadmissible estimator of p when minimizing the expected square error of estimation.

Note: An estimator δ of a parameter θ is said to be inadmissible when there exists a different estimator δ′ such that R(θ, δ′) ≤ R(θ, δ) for all θ and R(θ, δ′) < R(θ, δ) for some θ, where R(θ, δ) is a risk function; in this problem it is the expected square error of the estimator.

Answer: Consider another estimator, $\tilde p = 1/2$. Then $E[(\tilde p - p)^2] = (1/2 - p)^2$. For p = 1/2 we have $E[(\tilde p - p)^2] = 0 < E[(\hat p - p)^2] = 1/12$. We now need to show that $E[(\tilde p - p)^2] \le E[(\hat p - p)^2]$ over $p \in [1/4, 3/4]$:
$$E[(\tilde p - p)^2] - E[(\hat p - p)^2] = \left(\frac{1}{2} - p\right)^2 - \frac{1}{3}\, p(1-p) = \frac{1}{4} - \frac{4}{3}p + \frac{4}{3}p^2.$$
This is an upward-opening parabola, so showing that it lies below or equal to zero for $p \in [1/4, 3/4]$ is equivalent to showing that it is below or equal to 0 at the boundary points. In fact it is:
$$\frac{1}{4} - \frac{4}{3}p + \frac{4}{3}p^2 = 0 \quad \text{for both } p = 1/4 \text{ and } p = 3/4.$$
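The risk comparison in part d can also be checked numerically (a sketch; it evaluates the exact risk formulas rather than simulating):

```python
import numpy as np

n = 3
p = np.linspace(0.25, 0.75, 101)

risk_mle = p * (1 - p) / n          # E[(p_hat - p)^2]   for p_hat = k/n
risk_const = (0.5 - p)**2           # E[(p_tilde - p)^2] for p_tilde = 1/2

# p_tilde is never worse on [1/4, 3/4] and is strictly better at p = 1/2,
# so p_hat = k/n is inadmissible on this interval.
print(np.all(risk_const <= risk_mle + 1e-12))   # True
print(risk_mle[50] - risk_const[50])            # 1/12 at p = 1/2
```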
In this problem we will derive the MLE for the parameters of a categorical distribution where the variable of interest, X, can take on k values, namely a1, a2, . . . , ak, the probability of seeing an event of type j being θj for j = 1, . . . , k.

a. Given data describing n independent identically distributed observations of X, namely d1, . . . , dn, each of which can be one of k values, express the likelihood of the data using only k − 1 parameters for the distribution over the values of X.
Answer: The likelihood of the data is:
$$L(\theta) = P(d_1, \ldots, d_n \mid \theta) \stackrel{\text{i.i.d.}}{=} \prod_{j=1}^{n} \prod_{i=1}^{k} \theta_i^{\,I_{\{d_j = a_i\}}} \qquad (I \text{ is the indicator function})$$
$$= \prod_{i=1}^{k} \theta_i^{\,n_i} \qquad \left(n_i \stackrel{\text{not.}}{=} \sum_{j=1}^{n} I_{\{d_j = a_i\}}\right)$$
$$= \left(1 - \sum_{i=1}^{k-1} \theta_i\right)^{\!n_k} \prod_{i=1}^{k-1} \theta_i^{\,n_i} \qquad (\text{since the } \theta\text{'s sum to one}).$$
b. Derive an expression for $\hat\theta_j$, the MLE for θj, one of the k − 1 parameters, by setting the partial derivative of the likelihood in part a with respect to θj equal to zero and solving for it. Hint: You may want to start by first taking the log of the likelihood from part a before taking its derivative.

Answer:
$$\ln L(\theta) = n_k \ln\!\left(1 - \sum_{i=1}^{k-1}\theta_i\right) + \sum_{i=1}^{k-1} n_i \ln\theta_i \;\Rightarrow\; \frac{\partial \ln L(\theta)}{\partial \theta_j} = -\frac{n_k}{1 - \sum_{i=1}^{k-1}\theta_i} + \frac{n_j}{\theta_j}$$
$$\frac{\partial \ln L(\theta)}{\partial \theta_j} = 0 \;\Leftrightarrow\; -\frac{n_k}{1 - \hat\theta_j - \sum_{i \neq j,\, i \le k-1}\theta_i} + \frac{n_j}{\hat\theta_j} = 0 \;\Leftrightarrow\; \frac{n_j}{\hat\theta_j} = \frac{n_k}{1 - \hat\theta_j - \sum_{i \neq j,\, i \le k-1}\theta_i}$$
$$\Leftrightarrow\; n_j\left(1 - \sum_{i \neq j,\, i \le k-1}\theta_i\right) = (n_k + n_j)\,\hat\theta_j \;\Leftrightarrow\; \hat\theta_j = \frac{n_j}{n_j + n_k}\left(1 - \sum_{i=1,\, i \neq j}^{k-1}\theta_i\right) \quad \text{for all } j \in \{1, \ldots, k-1\}.$$
c. Show that the MLE for a parameter θj, representing the probability that X takes value aj, is nj / n. Hint: In order to remove the k-th parameter from the likelihood in part a you had to represent it with an equation, θk = f(·). At this point you may find it helpful to replace all occurrences of f(·) with θk. After replacing f(·) with θk you can substitute into the equations from part b. This should allow you to solve for the MLE of θk, which can then be used to simplify all of the other equations.
Answer: At the (unique) maximizer of the likelihood function, the equations derived in part b hold with all θi replaced by their estimates $\hat\theta_i$, so the last equation in part b can be written as:
$$\hat\theta_j = \frac{n_j}{n_j + n_k}\left(1 - \sum_{i=1,\, i \neq j}^{k-1}\hat\theta_i\right) = \frac{n_j}{n_j + n_k}\,(\hat\theta_j + \hat\theta_k) \qquad \left(\text{because } \hat\theta_k = 1 - \sum_{i=1}^{k-1}\hat\theta_i\right)$$
$$\Leftrightarrow\; \hat\theta_j - \frac{n_j}{n_j + n_k}\,\hat\theta_j = \frac{n_j}{n_j + n_k}\,\hat\theta_k \;\Leftrightarrow\; \hat\theta_j\,\frac{n_k}{n_j + n_k} = \frac{n_j}{n_j + n_k}\,\hat\theta_k \;\Leftrightarrow\; \hat\theta_j\, n_k = n_j\, \hat\theta_k \;\Leftrightarrow\; \hat\theta_j = \frac{n_j}{n_k}\,\hat\theta_k$$
for all j ∈ {1, . . . , k − 1}.
Finally,
$$\hat\theta_k = 1 - \hat\theta_1 - \ldots - \hat\theta_{k-1} = 1 - \frac{n_1}{n_k}\hat\theta_k - \ldots - \frac{n_{k-1}}{n_k}\hat\theta_k \;\Rightarrow\; n_k\hat\theta_k = n_k - (n_1 + \ldots + n_{k-1})\hat\theta_k$$
$$\Rightarrow\; \hat\theta_k\,(n_1 + \ldots + n_{k-1} + n_k) = n_k \;\Rightarrow\; \hat\theta_k = \frac{n_k}{n} \;\Rightarrow\; \hat\theta_j = \frac{n_j}{n_k}\cdot\frac{n_k}{n} = \frac{n_j}{n} \quad \text{for all } j \in \{1, \ldots, k-1\}.$$
Note: Even though here we could go from the non-hatted to the hatted form by direct substitution, in general, to solve a maximum likelihood criterion under additional constraints (like the thetas summing to one), a generic and useful method is the method of Lagrange multipliers.
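A minimal numeric sketch of the result $\hat\theta_j = n_j / n$ (the observations are illustrative):

```python
import numpy as np

# Hypothetical observations d_1, ..., d_n over k = 3 values a1, a2, a3.
data = np.array(["a1", "a2", "a1", "a3", "a1", "a2", "a1", "a3", "a2", "a1"])

values, counts = np.unique(data, return_counts=True)   # n_j for each value a_j
theta_mle = counts / counts.sum()                       # theta_hat_j = n_j / n

for a, t in zip(values, theta_mle):
    print(a, t)            # a1 0.5, a2 0.3, a3 0.2
```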
Assume we have n samples, x1, . . . , xn, independently drawn from a normal distribution with known variance σ2 and unknown mean µ.
a. Derive $\mu_{MLE}$, the maximum likelihood estimate of μ.

Solution:
$$P(x_1, \ldots, x_n \mid \mu) = \prod_{i=1}^{n} P(x_i \mid \mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$$
$$\Rightarrow\; \ln P(x_1, \ldots, x_n \mid \mu) = \sum_{i=1}^{n} \left(\ln\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu)^2}{2\sigma^2}\right) \;\Rightarrow\; \frac{\partial}{\partial\mu} \ln P(x_1, \ldots, x_n \mid \mu) = \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2}$$
$$\frac{\partial}{\partial\mu} \ln P(x_1, \ldots, x_n \mid \mu) = 0 \;\Leftrightarrow\; \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2} = 0 \;\Leftrightarrow\; \sum_{i=1}^{n} (x_i - \mu) = 0 \;\Leftrightarrow\; \sum_{i=1}^{n} x_i = n\mu \;\Rightarrow\; \mu_{MLE} = \frac{\sum_{i=1}^{n} x_i}{n}.$$
Remark: It can be easily shown that $\ln P(x_1, \ldots, x_n \mid \mu)$ indeed reaches its maximum for $\mu = \mu_{MLE}$.
b. Is $\mu_{MLE}$ an unbiased estimator of μ?

Solution: The sample x1, . . . , xn can be seen as the realization of n independent random variables X1, . . . , Xn of Gaussian distribution of mean μ and variance σ2. Then, due to the property of linearity of the expectation of random variables, we get:
$$E[\mu_{MLE}] = E\!\left[\frac{X_1 + \ldots + X_n}{n}\right] = \frac{E[X_1] + \ldots + E[X_n]}{n} = \frac{n\mu}{n} = \mu.$$
Therefore, the $\mu_{MLE}$ estimator is unbiased.
c. Compute the variance of $\mu_{MLE}$.

Solution:
$$\mathrm{Var}[\mu_{MLE}] = \mathrm{Var}\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] \stackrel{(\star)}{=} \frac{1}{n^2}\,\mathrm{Var}\!\left[\sum_{i=1}^{n} X_i\right] \stackrel{\text{i.i.d.}}{=} \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}[X_i] = n\,\frac{1}{n^2}\,\mathrm{Var}[X_1] = \frac{\sigma^2}{n}.$$
Therefore, $\mathrm{Var}[\mu_{MLE}] \to 0$ as $n \to \infty$.

(⋆) Remember that Var[aX] = a²Var[X].
d. Now suppose that we treat μ itself as a random variable, and that the prior distribution for the mean is itself a normal distribution with mean ν and variance β2. Derive $\mu_{MAP}$, the MAP estimate of μ.
Solution 1:
$$P(\mu \mid x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n \mid \mu)\, P(\mu)}{P(x_1, \ldots, x_n)} \qquad (1)$$
$$= \frac{1}{C}\,\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \cdot \frac{1}{\sqrt{2\pi}\,\beta}\, e^{-\frac{(\mu - \nu)^2}{2\beta^2}} \qquad (2)$$
where $C \stackrel{\text{not.}}{=} P(x_1, \ldots, x_n)$.
$$\Rightarrow\; \ln P(\mu \mid x_1, \ldots, x_n) = -\sum_{i=1}^{n}\left(\ln(\sqrt{2\pi}\,\sigma) + \frac{(x_i - \mu)^2}{2\sigma^2}\right) - \ln(\sqrt{2\pi}\,\beta) - \frac{(\mu - \nu)^2}{2\beta^2} - \ln C$$
$$\Rightarrow\; \frac{\partial}{\partial\mu} \ln P(\mu \mid x_1, \ldots, x_n) = \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2} - \frac{\mu - \nu}{\beta^2}$$
$$\frac{\partial}{\partial\mu} \ln P(\mu \mid x_1, \ldots, x_n) = 0 \;\Leftrightarrow\; \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2} = \frac{\mu - \nu}{\beta^2} \;\Leftrightarrow\; \mu\left(\frac{1}{\beta^2} + \frac{n}{\sigma^2}\right) = \frac{\sum_{i=1}^{n} x_i}{\sigma^2} + \frac{\nu}{\beta^2}$$
$$\Rightarrow\; \mu_{MAP} = \frac{\sigma^2\nu + \beta^2\sum_{i=1}^{n} x_i}{\sigma^2 + n\beta^2}.$$
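A short numerical sketch of this closed form (the simulated data and names are mine), comparing it with a brute-force maximization of the log-posterior on a grid:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma, nu, beta, n = 2.0, 1.5, 0.0, 1.0, 25
x = rng.normal(mu_true, sigma, size=n)

# Closed form: mu_MAP = (sigma^2 * nu + beta^2 * sum(x)) / (sigma^2 + n * beta^2)
mu_map = (sigma**2 * nu + beta**2 * x.sum()) / (sigma**2 + n * beta**2)

# Brute-force check: maximize the log-posterior (up to an additive constant).
grid = np.linspace(x.mean() - 2, x.mean() + 2, 20_001)
log_post = -((x[:, None] - grid)**2).sum(axis=0) / (2 * sigma**2) \
           - (grid - nu)**2 / (2 * beta**2)

print(mu_map, grid[np.argmax(log_post)])   # the two values should agree closely
```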
Solution 2:
Instead of computing the derivative of the posterior distribution P(μ | x1, . . . , xn), we will first show that the right-hand side of (2) is itself a Gaussian, and then we will use the fact that the mean of a Gaussian is where it achieves its maximum value.
$$P(\mu \mid x_1, \ldots, x_n) = \frac{1}{C}\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \cdot \frac{1}{\sqrt{2\pi}\,\beta}\, e^{-\frac{(\mu - \nu)^2}{2\beta^2}}$$
$$= \text{const} \cdot \exp\!\left(-\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{2\sigma^2} - \frac{(\mu - \nu)^2}{2\beta^2}\right) = \text{const} \cdot \exp\!\left(-\frac{\beta^2\sum_{i=1}^{n}(x_i - \mu)^2 + \sigma^2(\mu - \nu)^2}{2\sigma^2\beta^2}\right)$$
$$= \text{const} \cdot \exp\!\left(-\frac{n\beta^2 + \sigma^2}{2\sigma^2\beta^2}\,\mu^2 + \frac{\beta^2\sum_{i=1}^{n} x_i + \nu\sigma^2}{\sigma^2\beta^2}\,\mu - \frac{\beta^2\sum_{i=1}^{n} x_i^2 + \nu^2\sigma^2}{2\sigma^2\beta^2}\right)$$
$$= \text{const} \cdot \exp\!\left(-\frac{\mu^2 - 2\mu\,\dfrac{\beta^2\sum_{i=1}^{n} x_i + \nu\sigma^2}{n\beta^2 + \sigma^2} + \dfrac{\beta^2\sum_{i=1}^{n} x_i^2 + \nu^2\sigma^2}{n\beta^2 + \sigma^2}}{\dfrac{2\sigma^2\beta^2}{n\beta^2 + \sigma^2}}\right)$$
Completing the square in μ:
$$= \text{const} \cdot \exp\!\left(-\frac{\left(\mu - \dfrac{\beta^2\sum_{i} x_i + \nu\sigma^2}{n\beta^2 + \sigma^2}\right)^{\!2} - \left(\dfrac{\beta^2\sum_{i} x_i + \nu\sigma^2}{n\beta^2 + \sigma^2}\right)^{\!2} + \dfrac{\beta^2\sum_{i} x_i^2 + \nu^2\sigma^2}{n\beta^2 + \sigma^2}}{\dfrac{2\sigma^2\beta^2}{n\beta^2 + \sigma^2}}\right)$$
$$= \text{const}' \cdot \exp\!\left(-\frac{\left(\mu - \dfrac{\beta^2\sum_{i} x_i + \nu\sigma^2}{n\beta^2 + \sigma^2}\right)^{\!2}}{\dfrac{2\sigma^2\beta^2}{n\beta^2 + \sigma^2}}\right)$$
The exp term in the last equality being a Gaussian of mean $\dfrac{\beta^2\sum_{i=1}^{n} x_i + \nu\sigma^2}{n\beta^2 + \sigma^2}$ and variance $\dfrac{\sigma^2\beta^2}{n\beta^2 + \sigma^2}$, it follows that its maximum is obtained for
$$\mu = \frac{\beta^2\sum_{i=1}^{n} x_i + \nu\sigma^2}{n\beta^2 + \sigma^2} = \mu_{MAP}.$$
e. Compare the MLE and MAP estimators for the mean μ as the number of samples n goes to infinity.

Solution:
$$\mu_{MLE} = \frac{\sum_{i=1}^{n} x_i}{n}$$
$$\mu_{MAP} = \frac{\sigma^2\nu + \beta^2\sum_{i=1}^{n} x_i}{\sigma^2 + n\beta^2} = \frac{\sigma^2\nu}{\sigma^2 + n\beta^2} + \frac{\beta^2\sum_{i=1}^{n} x_i}{\sigma^2 + n\beta^2} = \frac{\sigma^2\nu}{\sigma^2 + n\beta^2} + \frac{\frac{1}{n}\sum_{i=1}^{n} x_i}{1 + \frac{\sigma^2}{n\beta^2}} = \frac{\sigma^2\nu}{\sigma^2 + n\beta^2} + \frac{\mu_{MLE}}{1 + \frac{\sigma^2}{n\beta^2}}$$
$$n \to \infty \;\Rightarrow\; \frac{\sigma^2\nu}{\sigma^2 + n\beta^2} \to 0 \;\text{ and }\; \frac{\sigma^2}{n\beta^2} \to 0 \;\Rightarrow\; \mu_{MAP} \to \mu_{MLE}.$$
Let X be a random variable distributed according to a Normal distribution with 0 mean, and σ2 variance, i.e. X ∼ N(0, σ2).
a. Derive $\sigma^2_{MLE}$, the maximum likelihood estimate of σ2.
Solution:
Let X1, X2, . . . , Xn be drawn i.i.d. ∼ N(0, σ2). Let f be the density function corresponding to X. Then we can write the likelihood function as:
$$L(X_1, X_2, \ldots, X_n \mid \sigma^2) = \prod_{i=1}^{n} f(X_i;\, \mu = 0,\, \sigma^2) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{\!n} \prod_{i=1}^{n} \exp\!\left(-\frac{X_i^2}{2\sigma^2}\right) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{\!n} \exp\!\left(-\frac{\sum_{i=1}^{n} X_i^2}{2\sigma^2}\right)$$
$$\Rightarrow\; \ln L = \text{constant} - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n} X_i^2 \;\Rightarrow\; \frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n} X_i^2.$$
Therefore,
$$\frac{\partial \ln L}{\partial \sigma^2} = 0 \;\Leftrightarrow\; \sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n} X_i^2.$$
Note: It can be easily shown that $L(X_1, X_2, \ldots, X_n \mid \sigma^2)$ indeed reaches its maximum for $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2$.
b. Is $\sigma^2_{MLE}$ an unbiased estimator of σ2?

Solution: It is unbiased, since:
$$E\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i^2\right] = \frac{n}{n}E[X^2] \;(\text{since i.i.d.}) = \mathrm{Var}[X] + (E[X])^2 = \mathrm{Var}[X] = \sigma^2 \;(\text{since } E[X] = 0).$$
Let $x \stackrel{\text{not.}}{=} (x_1, \ldots, x_n)$ be observed i.i.d. samples from a Gaussian distribution N(x | μ, σ2).

a. Derive $\sigma^2_{MLE}$, the MLE for σ2.
Solution:
The p.d.f. for N(x | μ, σ2) has the form
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}.$$
The log likelihood function of the data x is:
$$\ln L(x \mid \mu, \sigma^2) = \ln \prod_{i=1}^{n} f(x_i) = \sum_{i=1}^{n}\left(-\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2}\right) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$
The partial derivative of ln L w.r.t. σ2:
$$\frac{\partial \ln L(x \mid \mu, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2.$$
Solving the equation $\frac{\partial \ln L(x \mid \mu, \sigma^2)}{\partial \sigma^2} = 0$, we get:
$$\sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_{MLE})^2.$$
Note that we had to take into account the optimal value of μ (see problem CMU, 2011 fall, T. Mitchell, A. Singh, HW2, pr. 1).
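A minimal numeric sketch of this estimate (names mine): numpy's np.var with its default ddof=0 computes exactly $\frac{1}{n}\sum_i (x_i - \bar x)^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=1000)    # true sigma^2 = 4

mu_mle = x.mean()                                # (1/n) * sum(x_i)
sigma2_mle = ((x - mu_mle)**2).mean()            # (1/n) * sum((x_i - mu_MLE)^2)

print(sigma2_mle, np.var(x))                     # identical values, both close to 4
```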
b. Show that $E[\sigma^2_{MLE}] = \frac{n-1}{n}\,\sigma^2$.

Solution:
$$E[\sigma^2_{MLE}] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}\Big(x_i - \frac{1}{n}\sum_{j=1}^{n} x_j\Big)^2\right] = E\!\left[\Big(x_1 - \frac{1}{n}\sum_{i=1}^{n} x_i\Big)^2\right] \qquad (\text{by symmetry, since the } x_i \text{ are identically distributed})$$
$$= E\!\left[x_1^2 - \frac{2}{n}\,x_1\sum_{i=1}^{n} x_i + \frac{1}{n^2}\Big(\sum_{i=1}^{n} x_i\Big)^2\right] = E\!\left[x_1^2 - \frac{2}{n}\,x_1\sum_{i=1}^{n} x_i + \frac{1}{n^2}\sum_{i=1}^{n} x_i^2 + \frac{2}{n^2}\sum_{i<j} x_i x_j\right]$$
$$= E[x_1^2] + \frac{1}{n^2}\sum_{i=1}^{n} E[x_i^2] - \frac{2}{n}\sum_{i=1}^{n} E[x_1 x_i] + \frac{2}{n^2}\sum_{i<j} E[x_i x_j]$$
$$= E[x_1^2] + \frac{1}{n^2}\,n\,E[x_1^2] - \frac{2}{n}\,E[x_1^2] - \frac{2}{n}(n-1)\,E[x_1 x_2] + \frac{2}{n^2}\cdot\frac{n(n-1)}{2}\,E[x_1 x_2]$$
$$= \frac{n-1}{n}\,E[x_1^2] - \frac{n-1}{n}\,E[x_1 x_2]$$
$$\sigma^2 = \mathrm{Var}(x_1) = E[x_1^2] - (E[x_1])^2 = E[x_1^2] - \mu^2 \;\Rightarrow\; E[x_1^2] = \sigma^2 + \mu^2.$$
Because x1 and x2 are independent, it follows that Cov(x1, x2) = 0. Therefore,
$$0 = \mathrm{Cov}(x_1, x_2) = E[(x_1 - E[x_1])(x_2 - E[x_2])] = E[(x_1 - \mu)(x_2 - \mu)] = E[x_1 x_2] - \mu\,E[x_1 + x_2] + \mu^2 = E[x_1 x_2] - \mu(2\mu) + \mu^2 = E[x_1 x_2] - \mu^2.$$
So, $E[x_1 x_2] = \mu^2$. By substituting $E[x_1^2] = \sigma^2 + \mu^2$ and $E[x_1 x_2] = \mu^2$ into the previously obtained equality $\left(E[\sigma^2_{MLE}] = \frac{n-1}{n}E[x_1^2] - \frac{n-1}{n}E[x_1 x_2]\right)$, we get:
$$E[\sigma^2_{MLE}] = \frac{n-1}{n}(\sigma^2 + \mu^2) - \frac{n-1}{n}\mu^2 = \frac{n-1}{n}\,\sigma^2.$$
c. Indicate an unbiased estimator of σ2.

Solution: It can be immediately proven that $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \mu_{MLE})^2$ is an unbiased estimator of σ2.
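A Monte Carlo sketch of the bias result (parameters are mine): averaging $\sigma^2_{MLE}$ over many samples of size n gives approximately $\frac{n-1}{n}\sigma^2$, while the ddof=1 estimator is approximately unbiased.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, n, trials = 0.0, 4.0, 5, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
sigma2_mle = x.var(axis=1, ddof=0)    # (1/n)     * sum((x_i - x_bar)^2)
sigma2_unb = x.var(axis=1, ddof=1)    # (1/(n-1)) * sum((x_i - x_bar)^2)

print(sigma2_mle.mean())              # ~ (n-1)/n * sigma2 = 3.2
print(sigma2_unb.mean())              # ~ sigma2 = 4.0
```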
The density function of a d-dimensional Gaussian distribution is as follows:
$$\mathcal{N}(x \mid \mu, \Lambda^{-1}) = \frac{\exp\!\left(-\frac{1}{2}(x - \mu)^\top \Lambda (x - \mu)\right)}{\sqrt{(2\pi)^d\,|\Lambda^{-1}|}},$$
where Λ is the inverse of the covariance matrix, or the so-called precision matrix. Let {x1, x2, . . . , xn} be an i.i.d. sample from a d-dimensional Gaussian distribution. Suppose that n ≫ d. Derive the MLE estimates $\hat\mu$ and $\hat\Lambda$.
Hint: You may find useful the following formulas (taken from Matrix Identities, by Sam Roweis, 1999):
(2b) $|A^{-1}| = \frac{1}{|A|}$
(2e) $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$;[a] more generally, $\mathrm{Tr}(ABC\ldots) = \mathrm{Tr}(BC\ldots A) = \mathrm{Tr}(C\ldots AB) = \ldots$
(3b) $\frac{\partial}{\partial X}\,\mathrm{Tr}(XA) = \frac{\partial}{\partial X}\,\mathrm{Tr}(AX) = A^\top$
(4b) $\frac{\partial}{\partial X}\,\ln|X| = (X^{-1})^\top = (X^\top)^{-1}$
(5c) $\frac{\partial}{\partial X}\, a^\top X b = a b^\top$
(5g) $\frac{\partial}{\partial X}\,(Xa + b)^\top C (Xa + b) = (C + C^\top)(Xa + b)\,a^\top$
Tr(A), the trace of an n-by-n square matrix A, is defined as the sum of the elements on the main diagonal (the diagonal from the upper left to the lower right) of A, i.e., $a_{11} + \ldots + a_{nn}$.
[a] See Theorem 1.3.d from Matrix Analysis for Statistics, 2017, James R. Schott.
Given the data x1, . . . , xn, the log-likelihood function is:
$$l(\mu, \Lambda) \stackrel{\text{i.i.d.}}{=} \ln \prod_{i=1}^{n} \mathcal{N}(x_i \mid \mu, \Lambda^{-1}) = \sum_{i=1}^{n} \ln \mathcal{N}(x_i \mid \mu, \Lambda^{-1}) = -\frac{nd}{2}\ln(2\pi) - \frac{n}{2}\ln|\Lambda^{-1}| - \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^\top \Lambda (x_i - \mu)$$
$$\stackrel{(2b)}{=} -\frac{nd}{2}\ln(2\pi) + \frac{n}{2}\ln|\Lambda| - \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^\top \Lambda (x_i - \mu). \qquad (3)$$
For any fixed positive definite precision matrix Λ, the log-likelihood is a quadratic function of μ with a negative leading coefficient, hence a strictly concave function of μ. We then solve $\nabla_\mu l(\mu, \Lambda) = 0$:
$$\nabla_\mu l(\mu, \Lambda) = 0 \;\overset{(5g)}{\Longleftrightarrow}\; -\frac{1}{2}(\Lambda + \Lambda^\top)\sum_{i=1}^{n}(\mu - x_i) = 0 \;\Longleftrightarrow\; \Lambda \sum_{i=1}^{n}(x_i - \mu) = 0 \;\Longleftrightarrow\; \Lambda \sum_{i=1}^{n} x_i = n\Lambda\mu.$$
From the last equality we get, by the assumption that Λ is invertible, the following estimate of μ:
$$\hat\mu = \frac{\sum_{i=1}^{n} x_i}{n},$$
which coincides with the sample mean $\bar x$ and is constant w.r.t. Λ.
Now that we have $l(\mu, \Lambda) \le l(\hat\mu, \Lambda)$ for all $\mu \in \mathbb{R}^d$ and Λ positive definite, we continue to consider Λ by first plugging $\hat\mu$ back into the log-likelihood function (3):
$$l(\hat\mu, \Lambda) = -\frac{nd}{2}\ln(2\pi) + \frac{n}{2}\ln|\Lambda| - \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar x)^\top \Lambda (x_i - \bar x) \stackrel{(2e)}{=} -\frac{nd}{2}\ln(2\pi) + \frac{n}{2}\big(\ln|\Lambda| - \mathrm{Tr}(S\Lambda)\big), \qquad (4)$$
where S is the sample covariance matrix:
$$S = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)(x_i - \bar x)^\top.$$
Explanation: $(x_i - \bar x)^\top \Lambda (x_i - \bar x)$ is a 1-by-1 matrix, therefore $(x_i - \bar x)^\top \Lambda (x_i - \bar x) = \mathrm{Tr}\big((x_i - \bar x)^\top \Lambda (x_i - \bar x)\big)$, and using the (2e) rule, it can be further written as $\mathrm{Tr}\big((x_i - \bar x)(x_i - \bar x)^\top \Lambda\big)$. Using another simple rule, $\mathrm{Tr}(A + B) = \mathrm{Tr}(A) + \mathrm{Tr}(B)$ (which can be easily proven), it follows that
$$\sum_{i=1}^{n}(x_i - \bar x)^\top \Lambda (x_i - \bar x) = \sum_{i=1}^{n}\mathrm{Tr}\big((x_i - \bar x)(x_i - \bar x)^\top \Lambda\big) = \mathrm{Tr}\!\left(\Big(\sum_{i=1}^{n}(x_i - \bar x)(x_i - \bar x)^\top\Big)\Lambda\right) = \mathrm{Tr}\big((nS)\Lambda\big) = n\,\mathrm{Tr}(S\Lambda).$$
By the fact that ln|Λ| is strictly concave on the domain of positive definite Λ,[a] and that Tr(SΛ) is linear in Λ, we are able to find the maximum of the expression (4) by solving $\nabla_\Lambda l(\hat\mu, \Lambda) = 0$, which can be proven[b] to be equivalent to $\Lambda^{-1} - S = 0$. (Therefore, $\hat\Sigma = \hat\Lambda^{-1} = S$.) Since n ≫ d, we can safely assume that S is invertible and get the following estimate: $\hat\Lambda = S^{-1}$.
[a] See, for example, Section 3.1.5, Convex Optimization: http://www.stanford.edu/~boyd/cvxbook/.
[b] Applying (4b) and (3b) on (4) you'll get $(\Lambda^\top)^{-1} - S^\top$. Then $(\Lambda^\top)^{-1} - S^\top = 0 \Leftrightarrow (\Lambda^{-1})^\top = S^\top \Leftrightarrow \Lambda^{-1} = S$.
Both $\hat\mu$ and $\hat\Lambda$ are in the parameter space and satisfy $l(\mu, \Lambda) \le l(\hat\mu, \Lambda) \le l(\hat\mu, \hat\Lambda)$ for all $\mu \in \mathbb{R}^d$ and Λ positive definite, so they are the MLE estimates.

Computing the gradient of $l(\hat\mu, \Lambda)$ in (3):
$$\nabla_\Lambda l(\hat\mu, \Lambda) \overset{(4b),(5c)}{=} \frac{n}{2}(\Lambda^\top)^{-1} - \frac{1}{2}\sum_{i=1}^{n}(x_i - \hat\mu)(x_i - \hat\mu)^\top = \frac{n}{2}\Lambda^{-1} - \frac{1}{2}\sum_{i=1}^{n}(x_i - \hat\mu)(x_i - \hat\mu)^\top = \frac{n}{2}\Lambda^{-1} - \frac{n}{2}S.$$
So,
$$\nabla_\Lambda l(\hat\mu, \Lambda) = 0 \;\Leftrightarrow\; \hat\Lambda^{-1} \stackrel{\text{not.}}{=} \hat\Sigma = S \stackrel{\text{not.}}{=} \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu)(x_i - \hat\mu)^\top.$$
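A numpy sketch of the resulting estimates $\hat\mu = \bar x$ and $\hat\Lambda = S^{-1}$ (the data generation is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 3, 10_000                                 # n >> d, as assumed above
true_mu = np.array([1.0, -2.0, 0.5])
true_cov = np.array([[2.0, 0.3, 0.0],
                     [0.3, 1.0, 0.2],
                     [0.0, 0.2, 0.5]])
X = rng.multivariate_normal(true_mu, true_cov, size=n)   # rows are the x_i

mu_hat = X.mean(axis=0)                          # sample mean, MLE of mu
centered = X - mu_hat
S = centered.T @ centered / n                    # sample covariance (1/n, not 1/(n-1))
Lambda_hat = np.linalg.inv(S)                    # MLE of the precision matrix

print(mu_hat)                                    # close to true_mu
print(S)                                         # close to true_cov
```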
(Liviu Ciortuz, 2017)
The Gamma distribution of parameters r > 0 and β > 0 has the following density function:
$$\mathrm{Gamma}(x \mid r, \beta) = \frac{1}{\beta^r\,\Gamma(r)}\, x^{r-1} e^{-x/\beta}, \quad \text{for all } x > 0,$$
where the Γ symbol designates Euler's Gamma function.

[Plot: Gamma densities p(x) for (r, β) = (1.0, 2.0), (2.0, 2.0), (3.0, 2.0), (5.0, 1.0), (9.0, 0.5), (7.5, 1.0), (0.5, 1.0).]

Notes:
1. $\frac{1}{\beta^r\,\Gamma(r)}$ is the so-called normalization factor, since it does not depend on x, and $\int_{0}^{+\infty} x^{r-1} e^{-x/\beta}\, dx = \beta^r\, \Gamma(r)$.
2. Euler's Gamma function is defined as follows: $\Gamma(r) = \int_{0}^{+\infty} t^{r-1} e^{-t}\, dt$ for r > 0 (and it extends to all reals except the non-positive integers). Starting from the definition of Γ, it can be easily shown that Γ(r + 1) = rΓ(r) for any r > 0, and also Γ(1) = 1. Therefore, for any positive integer r, Γ(r + 1) = r·Γ(r) = r·(r − 1)·. . .·2·1 = r!, which means that the Γ function generalizes the factorial function.
3. The exponential distribution is a member of the Gamma family of distributions. (Just set r to 1 in Gamma's density function.)
Consider x1, . . . , xn ∈ R+, all of them having been generated by one component of the above family of distributions. Find the maximum likelihood estimates of the parameters r and β.
$$L(r, \beta) \stackrel{\text{def.}}{=} P(x_1, \ldots, x_n \mid r, \beta) \stackrel{\text{i.i.d.}}{=} \prod_{i=1}^{n} P(x_i \mid r, \beta) = \beta^{-rn}\,(\Gamma(r))^{-n} \left(\prod_{i=1}^{n} x_i\right)^{\!r-1} e^{-\frac{1}{\beta}\sum_{i=1}^{n} x_i}$$
$$\ell(r, \beta) \stackrel{\text{def.}}{=} \ln L(r, \beta) = -rn\ln\beta - n\ln\Gamma(r) + (r - 1)\sum_{i=1}^{n}\ln x_i - \frac{1}{\beta}\sum_{i=1}^{n} x_i.$$
Now we will calculate the partial derivative of ℓ(r, β) w.r.t. β, and then equate it to 0:
$$\frac{\partial}{\partial\beta}\,\ell(r, \beta) = -\frac{rn}{\beta} + \frac{1}{\beta^2}\sum_{i=1}^{n} x_i = \frac{1}{\beta^2}\left(\sum_{i=1}^{n} x_i - rn\beta\right)$$
$$\frac{\partial}{\partial\beta}\,\ell(r, \beta) = 0 \;\Leftrightarrow\; \hat\beta = \frac{1}{rn}\sum_{i=1}^{n} x_i > 0.$$
By substituting $\hat\beta$ into ℓ(r, β), we will get:
$$\ell(r, \hat\beta) = -rn\ln\hat\beta - n\ln\Gamma(r) + (r - 1)\sum_{i=1}^{n}\ln x_i - \frac{1}{\hat\beta}\sum_{i=1}^{n} x_i$$
$$= rn\ln(rn) - rn\ln\!\left(\sum_{i=1}^{n} x_i\right) - n\ln\Gamma(r) + (r - 1)\sum_{i=1}^{n}\ln x_i - \frac{rn}{\sum_{i=1}^{n} x_i}\cdot\sum_{i=1}^{n} x_i$$
$$= rn\ln(rn) - rn\ln\!\left(\sum_{i=1}^{n} x_i\right) - n\ln\Gamma(r) + (r - 1)\sum_{i=1}^{n}\ln x_i - rn.$$
Therefore, by computing the partial derivative of ℓ(r, β̂) with respect to r, and then equating this derivative to 0, we will get:
$$\frac{\partial}{\partial r}\,\ell(r, \hat\beta) = 0 \;\Leftrightarrow\; n\ln(rn) + n - n\ln\!\left(\sum_{i=1}^{n} x_i\right) - n\,\frac{\Gamma'(r)}{\Gamma(r)} + \sum_{i=1}^{n}\ln x_i - n = 0$$
$$\Leftrightarrow\; n\big(\ln r - \psi(r)\big) = -n\ln n - \sum_{i=1}^{n}\ln x_i + n\ln\!\left(\sum_{i=1}^{n} x_i\right) \;\Leftrightarrow\; \ln r - \psi(r) = -\ln n - \frac{1}{n}\sum_{i=1}^{n}\ln x_i + \ln\!\left(\sum_{i=1}^{n} x_i\right),$$
where $\psi \stackrel{\text{not.}}{=} \Gamma'/\Gamma$ is the digamma function. The solution of the last equation is $\hat r$, the maximum likelihood estimate of the parameter r; it then yields $\hat\beta = \frac{1}{\hat r\, n}\sum_{i=1}^{n} x_i$.
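The last equation has no closed-form solution, but it is easy to solve numerically. A Python sketch using scipy (ψ is scipy.special.digamma; the data generation and names are mine). Note that the right-hand side above equals ln(mean(x)) − mean(ln x).

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(5)
r_true, beta_true = 3.0, 2.0
x = rng.gamma(shape=r_true, scale=beta_true, size=5000)

# Right-hand side of:  ln r - psi(r) = ln(mean(x)) - mean(ln(x))
c = np.log(x.mean()) - np.log(x).mean()

# ln r - psi(r) is strictly decreasing in r > 0, so a bracketing root-finder works.
r_hat = brentq(lambda r: np.log(r) - digamma(r) - c, 1e-6, 1e6)
beta_hat = x.mean() / r_hat            # beta_hat = (1/(r_hat * n)) * sum(x_i)

print(r_hat, beta_hat)                 # close to r_true = 3.0 and beta_true = 2.0
```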