  1. Maximum Likelihood Theory
     Max Turgeon
     STAT 4690–Applied Multivariate Analysis

  2. Sufficient Statistics i
  • We saw in the previous lecture that the multivariate normal distribution is completely determined by its mean vector µ ∈ R^p and its covariance matrix Σ.
  • Therefore, given a sample Y_1, ..., Y_n ∼ N_p(µ, Σ) (with n > p), we only need to estimate (µ, Σ).
  • Obvious candidates: the sample mean Ȳ and the sample covariance S_n (both are one-liners in R; see the sketch below).
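As a quick aside (not on the original slides), here is a minimal R sketch of both candidate estimators; the simulation setup deliberately mirrors the one used in the "Visualizing the likelihood" slides further down.

    # Simulate a multivariate normal sample (same setup as the later slides)
    library(mvtnorm)
    set.seed(123)
    n <- 50; p <- 2
    mu <- c(1, 2)
    Sigma <- matrix(c(1, 0.5, 0.5, 1), ncol = p)
    Y <- rmvnorm(n, mean = mu, sigma = Sigma)

    colMeans(Y)  # sample mean vector
    cov(Y)       # sample covariance matrix S_n (uses the 1/(n - 1) scaling)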

  3. Sufficient Statistics ii
  • Write down the likelihood:

    L(\mu, \Sigma) = \prod_{i=1}^n \frac{1}{\sqrt{(2\pi)^p |\Sigma|}} \exp\left( -\frac{1}{2} (y_i - \mu)^T \Sigma^{-1} (y_i - \mu) \right)
                   = \frac{1}{(2\pi)^{np/2} |\Sigma|^{n/2}} \exp\left( -\frac{1}{2} \sum_{i=1}^n (y_i - \mu)^T \Sigma^{-1} (y_i - \mu) \right).

  • If we take the (natural) logarithm of L and drop any term that does not depend on (µ, Σ), we get

    \ell = -\frac{n}{2} \log|\Sigma| - \frac{1}{2} \sum_{i=1}^n (y_i - \mu)^T \Sigma^{-1} (y_i - \mu).
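As a sanity check (our addition, reusing the simulated Y, mu, Sigma from the sketch above), the closed-form expression, with the constant term kept, matches the sum of log-densities computed by mvtnorm::dmvnorm:

    # Log-likelihood via the density function
    ll_direct <- sum(dmvnorm(Y, mean = mu, sigma = Sigma, log = TRUE))

    # Same quantity from the closed form, keeping the -np/2 * log(2*pi) constant
    quad <- sum(apply(Y, 1, function(y) t(y - mu) %*% solve(Sigma) %*% (y - mu)))
    ll_formula <- -n * p / 2 * log(2 * pi) - n / 2 * log(det(Sigma)) - quad / 2

    all.equal(ll_direct, ll_formula)  # TRUE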

  4. Sufficient Statistics iii
  • If we can re-express the second summand in terms of Ȳ and S_n, then by the Fisher–Neyman factorization theorem, we will know that (Ȳ, S_n) is jointly sufficient for (µ, Σ).
  • First, we have the following decomposition:

  5. Sufficient Statistics iv

    \sum_{i=1}^n (y_i - \mu)(y_i - \mu)^T
      = \sum_{i=1}^n (y_i - \bar{y} + \bar{y} - \mu)(y_i - \bar{y} + \bar{y} - \mu)^T
      = \sum_{i=1}^n \left[ (y_i - \bar{y})(y_i - \bar{y})^T + (y_i - \bar{y})(\bar{y} - \mu)^T + (\bar{y} - \mu)(y_i - \bar{y})^T + (\bar{y} - \mu)(\bar{y} - \mu)^T \right]
      = \sum_{i=1}^n (y_i - \bar{y})(y_i - \bar{y})^T + n (\bar{y} - \mu)(\bar{y} - \mu)^T
      = (n - 1) S_n + n (\bar{y} - \mu)(\bar{y} - \mu)^T,

    where the cross terms vanish because \sum_{i=1}^n (y_i - \bar{y}) = 0.
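The decomposition is easy to confirm numerically (our sketch, same simulated objects as above):

    # Left-hand side: sum of outer products around mu
    lhs <- Reduce(`+`, lapply(seq_len(n), function(i) tcrossprod(Y[i, ] - mu)))

    # Right-hand side: (n - 1) * S_n + n * outer product of (y_bar - mu)
    y_bar <- colMeans(Y)
    rhs <- (n - 1) * cov(Y) + n * tcrossprod(y_bar - mu)

    all.equal(lhs, rhs)  # TRUE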

  6. Sufficient Statistics v
  • Next, using the fact that tr(ABC) = tr(BCA), we have:

  7. Sufficient Statistics vi

    \sum_{i=1}^n (y_i - \mu)^T \Sigma^{-1} (y_i - \mu)
      = \operatorname{tr}\left( \sum_{i=1}^n (y_i - \mu)^T \Sigma^{-1} (y_i - \mu) \right)
      = \sum_{i=1}^n \operatorname{tr}\left( \Sigma^{-1} (y_i - \mu)(y_i - \mu)^T \right)
      = \operatorname{tr}\left( \Sigma^{-1} \sum_{i=1}^n (y_i - \mu)(y_i - \mu)^T \right)
      = (n - 1) \operatorname{tr}\left( \Sigma^{-1} S_n \right) + n \operatorname{tr}\left( \Sigma^{-1} (\bar{y} - \mu)(\bar{y} - \mu)^T \right)
      = (n - 1) \operatorname{tr}\left( \Sigma^{-1} S_n \right) + n (\bar{y} - \mu)^T \Sigma^{-1} (\bar{y} - \mu).
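Again, a quick numerical confirmation (our sketch):

    Sigma_inv <- solve(Sigma)
    y_bar <- colMeans(Y)

    lhs <- sum(apply(Y, 1, function(y) t(y - mu) %*% Sigma_inv %*% (y - mu)))
    rhs <- (n - 1) * sum(diag(Sigma_inv %*% cov(Y))) +
      n * drop(t(y_bar - mu) %*% Sigma_inv %*% (y_bar - mu))

    all.equal(lhs, rhs)  # TRUE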

  8. Maximum Likelihood Estimators
  • Going back to the log-likelihood, we get:

    \ell = -\frac{n}{2} \log|\Sigma| - \frac{n - 1}{2} \operatorname{tr}\left( \Sigma^{-1} S_n \right) - \frac{n}{2} (\bar{y} - \mu)^T \Sigma^{-1} (\bar{y} - \mu).

  • Since Σ^{-1} is positive definite, for Σ fixed, the log-likelihood is maximised at µ̂ = ȳ.
  • With extra effort, it can be shown that -\log|\Sigma| - \frac{n - 1}{n} \operatorname{tr}\left( \Sigma^{-1} S_n \right) is maximised at

    \hat{\Sigma} = \frac{n - 1}{n} S_n = \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})(y_i - \bar{y})^T.

  • In other words: (Ȳ, Σ̂) are the maximum likelihood estimators for (µ, Σ); see the short check below.
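In R the two estimators are immediate; note the 1/n scaling for Σ̂. By definition of the MLE, the log-likelihood at (µ̂, Σ̂) is at least as large as at any other parameter value, e.g. the population values (a sketch with the simulated data):

    mu_hat <- colMeans(Y)              # MLE of mu
    Sigma_hat <- (n - 1) / n * cov(Y)  # MLE of Sigma (1/n, not 1/(n - 1))

    # The MLE beats any other candidate, including the true (mu, Sigma)
    sum(dmvnorm(Y, mean = mu_hat, sigma = Sigma_hat, log = TRUE)) >=
      sum(dmvnorm(Y, mean = mu, sigma = Sigma, log = TRUE))  # TRUE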

  9. Maximum Likelihood Estimators
  • Since the multivariate normal density is "well-behaved", we can deduce the usual properties:
  • Consistency: (Ȳ, Σ̂) converges in probability to (µ, Σ).
  • Efficiency: Asymptotically, the covariance of (Ȳ, Σ̂) achieves the Cramér-Rao lower bound.
  • Invariance: For any transformation (g(µ), G(Σ)) of (µ, Σ), its MLE is (g(Ȳ), G(Σ̂)).
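Consistency can be glimpsed with a short simulation (our sketch; the sample sizes are arbitrary):

    set.seed(42)
    for (n_i in c(10, 100, 1000, 10000)) {
      Y_i <- rmvnorm(n_i, mean = mu, sigma = Sigma)
      cat("n =", format(n_i, width = 5),
          " max error in mu_hat:", max(abs(colMeans(Y_i) - mu)),
          " max error in Sigma_hat:", max(abs((n_i - 1) / n_i * cov(Y_i) - Sigma)),
          "\n")
    }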

  10. Visualizing the likelihood

    library(mvtnorm)
    set.seed(123)
    n <- 50; p <- 2
    mu <- c(1, 2)
    Sigma <- matrix(c(1, 0.5, 0.5, 1), ncol = p)
    Y <- rmvnorm(n, mean = mu, sigma = Sigma)

  11. Visualizing the likelihood

    loglik <- function(mu, sigma, data = Y) {
      # Compute summary statistics
      y_bar <- colMeans(data)
      Sn <- cov(data)
      Sigma_inv <- solve(sigma)
      # Compute quadratic form in (y_bar - mu)
      quad_form <- drop(t(y_bar - mu) %*% Sigma_inv %*% (y_bar - mu))
      # Log-likelihood evaluated at (mu, sigma)
      -0.5 * n * log(det(sigma)) -
        0.5 * (n - 1) * sum(diag(Sigma_inv %*% Sn)) -
        0.5 * n * quad_form
    }

  12.

    grid_xy <- expand.grid(seq(0.5, 1.5, length.out = 32),
                           seq(1, 3, length.out = 32))

    contours <- purrr::map_df(seq_len(nrow(grid_xy)), function(i) {
      # Where we will evaluate loglik
      mu_obs <- as.numeric(grid_xy[i, ])
      # Evaluate at the pop covariance
      z <- loglik(mu_obs, sigma = Sigma)
      # Output data.frame
      data.frame(x = mu_obs[1], y = mu_obs[2], z = z)
    })

  13. Visualizing the likelihood i

    library(tidyverse)
    library(ggrepel)

    # Create df with pop and sample means
    data_means <- data.frame(x = c(mu[1], mean(Y[, 1])),
                             y = c(mu[2], mean(Y[, 2])),
                             label = c("Pop.", "Sample"))

  14. Visualizing the likelihood ii

    contours %>%
      ggplot(aes(x, y)) +
      geom_contour(aes(z = z)) +
      geom_point(data = data_means) +
      geom_label_repel(data = data_means, aes(label = label))

  15. Visualizing the likelihood iii
  [Figure: contour plot of the log-likelihood over the (x, y) grid, with the population ("Pop.") and sample ("Sample") means labelled.]

  16. Visualizing the likelihood iv

    library(scatterplot3d)
    with(contours, scatterplot3d(x, y, z))

  17. Visualizing the likelihood v
  [Figure: 3D scatterplot of the log-likelihood surface z over the (x, y) grid.]

  18. Sampling Distributions
  • Recall the univariate case:
    • X̄ ∼ N(µ, σ²/n);
    • (n − 1)s²/σ² ∼ χ²(n − 1);
    • X̄ and s² are independent.
  • In the multivariate case, we have similar results:
    • Ȳ ∼ N_p(µ, (1/n)Σ);
    • (n − 1)S_n = nΣ̂ follows a Wishart distribution with n − 1 degrees of freedom;
    • Ȳ and S_n are independent.
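Both multivariate facts can be checked by simulation (our sketch, reusing n, p, mu, Sigma from the slides above):

    set.seed(2024)
    sims <- replicate(5000, {
      Y_r <- rmvnorm(n, mean = mu, sigma = Sigma)
      c(colMeans(Y_r), (n - 1) * cov(Y_r))
    })

    # Covariance of the replicated sample means: compare with Sigma / n
    cov(t(sims[1:2, ]))
    Sigma / n

    # Mean of (n - 1) * S_n: compare with (n - 1) * Sigma, the Wishart mean
    matrix(rowMeans(sims[-(1:2), ]), ncol = p)
    (n - 1) * Sigma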

  19. Wishart Distribution
  • Suppose Z_1, ..., Z_n ∼ N_p(0, Σ) are independently distributed. Then we say that

    W = \sum_{i=1}^n Z_i Z_i^T

    follows a Wishart distribution W_n(Σ) with n degrees of freedom.
  • Note that since E(Z_i Z_i^T) = Σ, we have E(W) = nΣ.
  • From the previous slide: \sum_{i=1}^n (Y_i - \bar{Y})(Y_i - \bar{Y})^T has the same distribution as \sum_{i=1}^{n-1} Z_i Z_i^T for some choice of Z_1, ..., Z_{n-1} ∼ N_p(0, Σ).
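The definition transcribes directly into R (our sketch; n_w is an arbitrary degrees-of-freedom choice):

    set.seed(1)
    n_w <- 20
    Z <- rmvnorm(n_w, mean = rep(0, p), sigma = Sigma)
    W <- crossprod(Z)  # t(Z) %*% Z, i.e. the sum of the Z_i Z_i^T

    # Monte Carlo check that E(W) = n_w * Sigma
    W_mean <- Reduce(`+`, replicate(2000, crossprod(rmvnorm(n_w, sigma = Sigma)),
                                    simplify = FALSE)) / 2000
    W_mean
    n_w * Sigma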

  20. Useful Properties
  • If W_1 ∼ W_{n_1}(Σ) and W_2 ∼ W_{n_2}(Σ) are independent, then W_1 + W_2 ∼ W_{n_1 + n_2}(Σ).
  • If W ∼ W_n(Σ) and C is q × p, then CWC^T ∼ W_n(CΣC^T).
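Both properties can be illustrated with stats::rWishart, which draws from the Wishart with a given degrees of freedom and scale matrix (our sketch):

    set.seed(7)
    df1 <- 10; df2 <- 15

    # Additivity: the mean of W1 + W2 should be close to (df1 + df2) * Sigma
    W_sum <- rWishart(2000, df1, Sigma) + rWishart(2000, df2, Sigma)
    apply(W_sum, c(1, 2), mean)
    (df1 + df2) * Sigma

    # Linear transformation: with C = (1, 0), C W C^T is a 1 x 1 Wishart,
    # i.e. Sigma[1, 1] times a chi-squared with df1 degrees of freedom
    C <- matrix(c(1, 0), nrow = 1)
    W1 <- rWishart(1, df1, Sigma)[, , 1]
    drop(C %*% W1 %*% t(C))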

  21. Density function
  • Let Σ be a fixed p × p positive definite matrix. The density of the Wishart distribution with n degrees of freedom, with n ≥ p, is given by

    w_n(A; \Sigma) = \frac{|A|^{(n-p-1)/2} \exp\left( -\frac{1}{2} \operatorname{tr}(\Sigma^{-1} A) \right)}{2^{np/2} \, \pi^{p(p-1)/4} \, |\Sigma|^{n/2} \prod_{i=1}^p \Gamma\left( \frac{1}{2}(n - i + 1) \right)},

    where A ranges over all p × p positive definite matrices.
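The density transcribes directly into R; the function name dwishart below is ours (base R has rWishart but no density function), and lgamma/determinant are used for numerical stability:

    # Wishart log-density, a direct transcription of the formula above
    dwishart <- function(A, n, Sigma, log = FALSE) {
      p <- nrow(Sigma)
      log_dens <- 0.5 * (n - p - 1) * as.numeric(determinant(A)$modulus) -
        0.5 * sum(diag(solve(Sigma, A))) -   # -tr(Sigma^{-1} A) / 2
        0.5 * n * p * log(2) -
        0.25 * p * (p - 1) * log(pi) -
        0.5 * n * as.numeric(determinant(Sigma)$modulus) -
        sum(lgamma(0.5 * (n - seq_len(p) + 1)))
      if (log) log_dens else exp(log_dens)
    }

    dwishart(diag(2), n = 5, Sigma = diag(2))  # example evaluation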

  22. Eigenvalue density function
  • For a random matrix A ∼ W_n(I_p) with n ≥ p, the joint distribution of its eigenvalues λ_1 ≥ ⋯ ≥ λ_p has density

    C_{n,p} \exp\left( -\frac{1}{2} \sum_{i=1}^p \lambda_i \right) \prod_{i=1}^p \lambda_i^{(n-p-1)/2} \prod_{i<j} |\lambda_i - \lambda_j|,

    for some constant C_{n,p}.
  • We will study this distribution in STAT 7200–Multivariate Analysis I.
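To get a feel for this density, one can simulate eigenvalues of Wishart draws (our sketch):

    set.seed(3)
    eigs <- replicate(5000,
      eigen(rWishart(1, df = 10, Sigma = diag(2))[, , 1],
            symmetric = TRUE, only.values = TRUE)$values)

    # Marginal histogram of the largest eigenvalue of W_10(I_2)
    hist(eigs[1, ], breaks = 50, main = "Largest eigenvalue of W_10(I_2)",
         xlab = "lambda_1")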
