SLIDE 1

Maximum Likelihood Theory

Max Turgeon

STAT 4690–Applied Multivariate Analysis

SLIDE 2

Sufficient Statistics i

  • We saw in the previous lecture that the multivariate normal distribution is completely determined by its mean vector $\mu \in \mathbb{R}^p$ and its covariance matrix $\Sigma$.
  • Therefore, given a sample $Y_1, \ldots, Y_n \sim N_p(\mu, \Sigma)$ (with $n > p$), we only need to estimate $(\mu, \Sigma)$.
  • Obvious candidates: the sample mean $\bar{Y}$ and the sample covariance $S_n$ (see the sketch below).
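As a quick illustration, here is a minimal R sketch computing both candidates; it reuses the simulation settings (n, mu, Sigma, seed) from the "Visualizing the likelihood" slides later in the deck.

library(mvtnorm)
set.seed(123)
n <- 50
mu <- c(1, 2)
Sigma <- matrix(c(1, 0.5, 0.5, 1), ncol = 2)
Y <- rmvnorm(n, mean = mu, sigma = Sigma)
# Candidate estimators
y_bar <- colMeans(Y)  # sample mean
Sn <- cov(Y)          # sample covariance (divisor n - 1)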

SLIDE 3

Sufficient Statistics ii

  • Write down the likelihood:

$$L = \prod_{i=1}^n \frac{1}{\sqrt{(2\pi)^p|\Sigma|}} \exp\left(-\frac{1}{2}(y_i - \mu)^T\Sigma^{-1}(y_i - \mu)\right) = \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}} \exp\left(-\frac{1}{2}\sum_{i=1}^n (y_i - \mu)^T\Sigma^{-1}(y_i - \mu)\right).$$

  • If we take the (natural) logarithm of $L$ and drop any term that does not depend on $(\mu, \Sigma)$, we get

$$\ell = -\frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^n (y_i - \mu)^T\Sigma^{-1}(y_i - \mu).$$
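As a sanity check, this dropped-constant version differs from the full log-likelihood by exactly $\frac{np}{2}\log(2\pi)$; a minimal sketch, assuming the objects Y, mu, Sigma, n from the sketch on "Sufficient Statistics i":

# Assumes Y, mu, Sigma, n from the sketch on "Sufficient Statistics i"
ell <- -0.5*n*log(det(Sigma)) -
  0.5*sum(mahalanobis(Y, center = mu, cov = Sigma))
# Full log-likelihood via mvtnorm; here p = 2
ell_full <- sum(dmvnorm(Y, mean = mu, sigma = Sigma, log = TRUE))
all.equal(ell, ell_full + 0.5*n*2*log(2*pi))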

SLIDE 4

Sufficient Statistics iii

  • If we can re-express the second summand in terms of $\bar{Y}$ and $S_n$, then by the Fisher-Neyman factorization theorem we will know that $(\bar{Y}, S_n)$ is jointly sufficient for $(\mu, \Sigma)$.
  • First, we have:

SLIDE 5

Sufficient Statistics iv

$$\begin{aligned}
\sum_{i=1}^n (y_i - \mu)(y_i - \mu)^T &= \sum_{i=1}^n (y_i - \bar{y} + \bar{y} - \mu)(y_i - \bar{y} + \bar{y} - \mu)^T \\
&= \sum_{i=1}^n \left( (y_i - \bar{y})(y_i - \bar{y})^T + (y_i - \bar{y})(\bar{y} - \mu)^T + (\bar{y} - \mu)(y_i - \bar{y})^T + (\bar{y} - \mu)(\bar{y} - \mu)^T \right) \\
&= \sum_{i=1}^n (y_i - \bar{y})(y_i - \bar{y})^T + n(\bar{y} - \mu)(\bar{y} - \mu)^T \\
&= (n - 1)S_n + n(\bar{y} - \mu)(\bar{y} - \mu)^T,
\end{aligned}$$

where the cross terms vanish because $\sum_{i=1}^n (y_i - \bar{y}) = 0$.
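This identity is easy to verify numerically; a minimal sketch, assuming the objects from the sketch on "Sufficient Statistics i":

# Assumes Y, mu, n from the sketch on "Sufficient Statistics i"
y_bar <- colMeans(Y)
Sn <- cov(Y)
lhs <- crossprod(sweep(Y, 2, mu))             # sum of (y_i - mu)(y_i - mu)^T
rhs <- (n - 1)*Sn + n*tcrossprod(y_bar - mu)  # decomposition above
all.equal(lhs, rhs, check.attributes = FALSE)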

SLIDE 6

Sufficient Statistics v

  • Next, using the fact that $\mathrm{tr}(ABC) = \mathrm{tr}(BCA)$, we have:

SLIDE 7

Sufficient Statistics vi

$$\begin{aligned}
\sum_{i=1}^n (y_i - \mu)^T\Sigma^{-1}(y_i - \mu) &= \mathrm{tr}\left(\sum_{i=1}^n (y_i - \mu)^T\Sigma^{-1}(y_i - \mu)\right) \\
&= \mathrm{tr}\left(\sum_{i=1}^n \Sigma^{-1}(y_i - \mu)(y_i - \mu)^T\right) \\
&= \mathrm{tr}\left(\Sigma^{-1}\sum_{i=1}^n (y_i - \mu)(y_i - \mu)^T\right) \\
&= (n - 1)\,\mathrm{tr}\left(\Sigma^{-1}S_n\right) + n\,\mathrm{tr}\left(\Sigma^{-1}(\bar{y} - \mu)(\bar{y} - \mu)^T\right) \\
&= (n - 1)\,\mathrm{tr}\left(\Sigma^{-1}S_n\right) + n(\bar{y} - \mu)^T\Sigma^{-1}(\bar{y} - \mu).
\end{aligned}$$
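The trace identity can also be checked numerically; a minimal sketch, assuming the objects from the sketch on "Sufficient Statistics i":

# Assumes Y, mu, Sigma, n from the sketch on "Sufficient Statistics i"
y_bar <- colMeans(Y)
Sn <- cov(Y)
lhs <- sum(mahalanobis(Y, center = mu, cov = Sigma))
rhs <- (n - 1)*sum(diag(solve(Sigma) %*% Sn)) +
  n*drop(t(y_bar - mu) %*% solve(Sigma) %*% (y_bar - mu))
all.equal(lhs, rhs)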

SLIDE 8

Maximum Likelihood Estimators

  • Going back to the log-likelihood, we get:

$$\ell = -\frac{n}{2}\log|\Sigma| - \frac{n-1}{2}\mathrm{tr}\left(\Sigma^{-1}S_n\right) - \frac{n}{2}(\bar{y} - \mu)^T\Sigma^{-1}(\bar{y} - \mu).$$

  • Since $\Sigma^{-1}$ is positive definite, for $\Sigma$ fixed, the log-likelihood is maximised at $\hat{\mu} = \bar{y}$.
  • With extra effort, it can be shown that $-\log|\Sigma| - \frac{n-1}{n}\mathrm{tr}(\Sigma^{-1}S_n)$ is maximised at

$$\hat{\Sigma} = \frac{n-1}{n}S_n = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})(y_i - \bar{y})^T.$$

  • In other words: $(\bar{Y}, \hat{\Sigma})$ are the maximum likelihood estimators for $(\mu, \Sigma)$.
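A minimal sketch confirming that the two expressions for the MLE of $\Sigma$ agree, assuming the objects from the sketch on "Sufficient Statistics i":

# Assumes Y, n from the sketch on "Sufficient Statistics i"
y_bar <- colMeans(Y)
Sigma_hat <- (n - 1)/n * cov(Y)            # rescaled sample covariance
all.equal(Sigma_hat,
          crossprod(sweep(Y, 2, y_bar))/n, # average of outer products
          check.attributes = FALSE)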

SLIDE 9

Maximum Likelihood Estimators

  • Since the multivariate normal density is "well-behaved", we can deduce the usual properties:
  • Consistency: $(\bar{Y}, \hat{\Sigma})$ converges in probability to $(\mu, \Sigma)$.
  • Efficiency: Asymptotically, the covariance of $(\bar{Y}, \hat{\Sigma})$ achieves the Cramér-Rao lower bound.
  • Invariance: For any transformation $(g(\mu), G(\Sigma))$ of $(\mu, \Sigma)$, its MLE is $(g(\bar{Y}), G(\hat{\Sigma}))$ (see the sketch below).
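For example, taking $G$ to be the map from a covariance matrix to the corresponding correlation matrix, invariance says the MLE of the correlation matrix is obtained by applying the same map to $\hat{\Sigma}$; a minimal sketch, assuming the objects from the earlier sketch:

# Assumes Y, n from the sketch on "Sufficient Statistics i"
Sigma_hat <- (n - 1)/n * cov(Y)
# MLE of the correlation matrix, by invariance; note that the
# factor (n - 1)/n cancels, so this equals cov2cor(cov(Y))
cov2cor(Sigma_hat)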

SLIDE 10

Visualizing the likelihood

library(mvtnorm)
set.seed(123)
n <- 50; p <- 2
mu <- c(1, 2)
Sigma <- matrix(c(1, 0.5, 0.5, 1), ncol = p)
Y <- rmvnorm(n, mean = mu, sigma = Sigma)

SLIDE 11

Visualizing the likelihood

loglik <- function(mu, sigma, data = Y) {
  # Compute quantities (note: we use `data`, not the global Y)
  n <- nrow(data)
  y_bar <- colMeans(data)
  Sn <- cov(data)
  Sigma_inv <- solve(sigma)
  # Compute quadratic form
  quad_form <- drop(t(y_bar - mu) %*% Sigma_inv %*% (y_bar - mu))
  # Log-likelihood, up to an additive constant
  -0.5*n*log(det(sigma)) -
    0.5*(n - 1)*sum(diag(Sigma_inv %*% Sn)) -
    0.5*n*quad_form
}

SLIDE 12

grid_xy <- expand.grid(seq(0.5, 1.5, length.out = 32),
                       seq(1, 3, length.out = 32))
contours <- purrr::map_df(seq_len(nrow(grid_xy)), function(i) {
  # Where we will evaluate loglik
  mu_obs <- as.numeric(grid_xy[i, ])
  # Evaluate at the population covariance
  z <- loglik(mu_obs, sigma = Sigma)
  # Output data.frame
  data.frame(x = mu_obs[1], y = mu_obs[2], z = z)
})

SLIDE 13

Visualizing the likelihood i

library(tidyverse)
library(ggrepel)
# Create df with pop. and sample means
data_means <- data.frame(x = c(mu[1], mean(Y[, 1])),
                         y = c(mu[2], mean(Y[, 2])),
                         label = c("Pop.", "Sample"))

SLIDE 14

Visualizing the likelihood ii

contours %>%
  ggplot(aes(x, y)) +
  geom_contour(aes(z = z)) +
  geom_point(data = data_means) +
  geom_label_repel(data = data_means, aes(label = label))

SLIDE 15

Visualizing the likelihood iii

[Figure: contour plot of the log-likelihood over the (x, y) grid of mean values, with the population ("Pop.") and sample ("Sample") means labelled.]

SLIDE 16

Visualizing the likelihood iv

library(scatterplot3d)
with(contours, scatterplot3d(x, y, z))

SLIDE 17

Visualizing the likelihood v

[Figure: 3D scatter plot of the log-likelihood values z over the (x, y) grid of mean values.]

SLIDE 18

Sampling Distributions

  • Recall the univariate case:
    • $\bar{X} \sim N(\mu, \sigma^2/n)$;
    • $(n-1)s^2/\sigma^2 \sim \chi^2(n-1)$;
    • $\bar{X}$ and $s^2$ are independent.
  • In the multivariate case, we have similar results (see the simulation sketch below):
    • $\bar{Y} \sim N_p\left(\mu, \frac{1}{n}\Sigma\right)$;
    • $(n-1)S_n = n\hat{\Sigma}$ follows a Wishart distribution with $n-1$ degrees of freedom;
    • $\bar{Y}$ and $S_n$ are independent.
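The first multivariate result is easy to check by simulation; a minimal, self-contained sketch comparing the Monte Carlo covariance of the sample mean with $\frac{1}{n}\Sigma$:

library(mvtnorm)
set.seed(123)
n <- 50
mu <- c(1, 2)
Sigma <- matrix(c(1, 0.5, 0.5, 1), ncol = 2)
# Covariance of the sample mean across 5000 replicated samples
means <- t(replicate(5000, colMeans(rmvnorm(n, mean = mu, sigma = Sigma))))
cov(means)  # should be close to Sigma/n
Sigma/n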

SLIDE 19

Wishart Distribution

  • Suppose $Z_1, \ldots, Z_n \sim N_p(0, \Sigma)$ are independently distributed. Then we say that

$$W = \sum_{i=1}^n Z_iZ_i^T$$

follows a Wishart distribution $W_n(\Sigma)$ with $n$ degrees of freedom.

  • Note that since $E(Z_iZ_i^T) = \Sigma$, we have $E(W) = n\Sigma$ (see the sketch below).
  • From the previous slide: $\sum_{i=1}^n (Y_i - \bar{Y})(Y_i - \bar{Y})^T$ has the same distribution as $\sum_{i=1}^{n-1} Z_iZ_i^T$ for some choice of $Z_1, \ldots, Z_{n-1} \sim N_p(0, \Sigma)$.
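A minimal sketch checking $E(W) = n\Sigma$ by simulation, using stats::rWishart:

set.seed(123)
Sigma <- matrix(c(1, 0.5, 0.5, 1), ncol = 2)
df <- 10
W <- rWishart(5000, df = df, Sigma = Sigma)  # p x p x 5000 array of draws
apply(W, c(1, 2), mean)                      # should be close to df*Sigma
df*Sigma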

SLIDE 20

Useful Properties

  • If $W_1 \sim W_{n_1}(\Sigma)$ and $W_2 \sim W_{n_2}(\Sigma)$ are independent, then $W_1 + W_2 \sim W_{n_1+n_2}(\Sigma)$.
  • If $W \sim W_n(\Sigma)$ and $C$ is $q \times p$, then $CWC^T \sim W_n(C\Sigma C^T)$ (a first-moment check follows below).
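The second property can be sanity-checked through first moments, since both sides should have expectation $n\,C\Sigma C^T$; a minimal sketch with a hypothetical contrast matrix C:

set.seed(123)
Sigma <- matrix(c(1, 0.5, 0.5, 1), ncol = 2)
C <- matrix(c(1, -1), nrow = 1)   # hypothetical contrast, q = 1, p = 2
W <- rWishart(5000, df = 10, Sigma = Sigma)
CWC <- apply(W, 3, function(A) C %*% A %*% t(C))
mean(CWC)                          # should be close to 10 * C Sigma C^T
drop(10 * C %*% Sigma %*% t(C))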

SLIDE 21

Density function

  • Let $\Sigma$ be a fixed $p \times p$ positive definite matrix. The density of the Wishart distribution with $n$ degrees of freedom, with $n \geq p$, is given by

$$w_n(A; \Sigma) = \frac{|A|^{(n-p-1)/2}\exp\left(-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}A)\right)}{2^{np/2}\,\pi^{p(p-1)/4}\,|\Sigma|^{n/2}\,\prod_{i=1}^p \Gamma\left(\frac{1}{2}(n - i + 1)\right)},$$

where $A$ ranges over all $p \times p$ positive definite matrices.
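Since the formula is explicit, it can be transcribed into R; a minimal sketch on the log scale, with a $p = 1$ sanity check using the fact that $W_n(\sigma^2)$ is the distribution of $\sigma^2$ times a $\chi^2(n)$ variable:

# Log-density of the Wishart W_n(Sigma), transcribed from the formula above
log_dwishart <- function(A, Sigma, n) {
  p <- nrow(Sigma)
  log_num <- 0.5*(n - p - 1)*as.numeric(determinant(A)$modulus) -
    0.5*sum(diag(solve(Sigma, A)))
  log_den <- 0.5*n*p*log(2) + 0.25*p*(p - 1)*log(pi) +
    0.5*n*as.numeric(determinant(Sigma)$modulus) +
    sum(lgamma(0.5*(n - seq_len(p) + 1)))
  log_num - log_den
}
# p = 1 check: both lines should return the same value
log_dwishart(matrix(4), matrix(2), n = 5)
dchisq(4/2, df = 5, log = TRUE) - log(2)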

SLIDE 22

Eigenvalue density function

  • For a random matrix $A \sim W_n(I_p)$ with $n \geq p$, the joint distribution of its eigenvalues $\lambda_1 \geq \cdots \geq \lambda_p$ has density

$$C_{n,p}\exp\left(-\frac{1}{2}\sum_{i=1}^p \lambda_i\right)\prod_{i=1}^p \lambda_i^{(n-p-1)/2}\prod_{i<j}|\lambda_i - \lambda_j|,$$

for some constant $C_{n,p}$.

  • We will study this distribution in STAT 7200–Multivariate Analysis I. (A simulation sketch follows below.)
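We will not derive $C_{n,p}$ here, but the eigenvalue distribution is easy to explore by simulation; a minimal sketch drawing from $W_n(I_p)$ and extracting the ordered eigenvalues:

set.seed(123)
n <- 10; p <- 3
W <- rWishart(1000, df = n, Sigma = diag(p))
# Ordered eigenvalues (largest to smallest) of each draw
lambda <- apply(W, 3, function(A) eigen(A, symmetric = TRUE)$values)
rowMeans(lambda)  # Monte Carlo means of the ordered eigenvalues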
