SLIDE 1

Maximum Likelihood Theory

Max Turgeon

STAT 4690–Applied Multivariate Analysis

SLIDE 2

Sufficient Statistics i

  • We saw in the previous lecture that the multivariate normal distribution is completely determined by its mean vector $\mu \in \mathbb{R}^p$ and its covariance matrix $\Sigma$.
  • Therefore, given a sample $Y_1, \ldots, Y_n \sim N_p(\mu, \Sigma)$ (with $n > p$), we only need to estimate $(\mu, \Sigma)$.
  • Obvious candidates: the sample mean $\bar{Y}$ and the sample covariance $S_n$ (see the sketch below).
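As a quick illustration, here is a minimal R sketch computing both candidates; it reuses the simulation settings (n, mu, Sigma, seed) from the "Visualizing the likelihood" slides later in the deck.

library(mvtnorm)
set.seed(123)
n <- 50
mu <- c(1, 2)
Sigma <- matrix(c(1, 0.5, 0.5, 1), ncol = 2)
Y <- rmvnorm(n, mean = mu, sigma = Sigma)
# Candidate estimators
y_bar <- colMeans(Y)  # sample mean
Sn <- cov(Y)          # sample covariance (divisor n - 1)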

SLIDE 3

Sufficient Statistics ii

  • Write down the likelihood:

$$L = \prod_{i=1}^n \frac{1}{\sqrt{(2\pi)^p|\Sigma|}} \exp\left(-\frac{1}{2}(y_i - \mu)^T\Sigma^{-1}(y_i - \mu)\right) = \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}} \exp\left(-\frac{1}{2}\sum_{i=1}^n (y_i - \mu)^T\Sigma^{-1}(y_i - \mu)\right).$$

  • If we take the (natural) logarithm of $L$ and drop any term that does not depend on $(\mu, \Sigma)$, we get

$$\ell = -\frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^n (y_i - \mu)^T\Sigma^{-1}(y_i - \mu).$$
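As a sanity check, this dropped-constant version differs from the full log-likelihood by exactly $\frac{np}{2}\log(2\pi)$; a minimal sketch, assuming the objects Y, mu, Sigma, n from the sketch on "Sufficient Statistics i":

# Assumes Y, mu, Sigma, n from the sketch on "Sufficient Statistics i"
ell <- -0.5*n*log(det(Sigma)) -
  0.5*sum(mahalanobis(Y, center = mu, cov = Sigma))
# Full log-likelihood via mvtnorm; here p = 2
ell_full <- sum(dmvnorm(Y, mean = mu, sigma = Sigma, log = TRUE))
all.equal(ell, ell_full + 0.5*n*2*log(2*pi))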

SLIDE 4

Sufficient Statistics iii

  • If we can re-express the second summand in terms of $\bar{Y}$ and $S_n$, then by the Fisher-Neyman factorization theorem we will know that $(\bar{Y}, S_n)$ is jointly sufficient for $(\mu, \Sigma)$.
  • First, we have:

SLIDE 5

Sufficient Statistics iv

$$\begin{aligned}
\sum_{i=1}^n (y_i - \mu)(y_i - \mu)^T &= \sum_{i=1}^n (y_i - \bar{y} + \bar{y} - \mu)(y_i - \bar{y} + \bar{y} - \mu)^T \\
&= \sum_{i=1}^n \left( (y_i - \bar{y})(y_i - \bar{y})^T + (y_i - \bar{y})(\bar{y} - \mu)^T + (\bar{y} - \mu)(y_i - \bar{y})^T + (\bar{y} - \mu)(\bar{y} - \mu)^T \right) \\
&= \sum_{i=1}^n (y_i - \bar{y})(y_i - \bar{y})^T + n(\bar{y} - \mu)(\bar{y} - \mu)^T \\
&= (n - 1)S_n + n(\bar{y} - \mu)(\bar{y} - \mu)^T,
\end{aligned}$$

where the cross terms vanish because $\sum_{i=1}^n (y_i - \bar{y}) = 0$.
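This identity is easy to verify numerically; a minimal sketch, assuming the objects from the sketch on "Sufficient Statistics i":

# Assumes Y, mu, n from the sketch on "Sufficient Statistics i"
y_bar <- colMeans(Y)
Sn <- cov(Y)
lhs <- crossprod(sweep(Y, 2, mu))             # sum of (y_i - mu)(y_i - mu)^T
rhs <- (n - 1)*Sn + n*tcrossprod(y_bar - mu)  # decomposition above
all.equal(lhs, rhs, check.attributes = FALSE)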

SLIDE 6

Sufficient Statistics v

  • Next, using the fact that $\mathrm{tr}(ABC) = \mathrm{tr}(BCA)$, we have:

SLIDE 7

Sufficient Statistics vi

$$\begin{aligned}
\sum_{i=1}^n (y_i - \mu)^T\Sigma^{-1}(y_i - \mu) &= \mathrm{tr}\left(\sum_{i=1}^n (y_i - \mu)^T\Sigma^{-1}(y_i - \mu)\right) \\
&= \mathrm{tr}\left(\sum_{i=1}^n \Sigma^{-1}(y_i - \mu)(y_i - \mu)^T\right) \\
&= \mathrm{tr}\left(\Sigma^{-1}\sum_{i=1}^n (y_i - \mu)(y_i - \mu)^T\right) \\
&= (n - 1)\,\mathrm{tr}\left(\Sigma^{-1}S_n\right) + n\,\mathrm{tr}\left(\Sigma^{-1}(\bar{y} - \mu)(\bar{y} - \mu)^T\right) \\
&= (n - 1)\,\mathrm{tr}\left(\Sigma^{-1}S_n\right) + n(\bar{y} - \mu)^T\Sigma^{-1}(\bar{y} - \mu).
\end{aligned}$$
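The trace identity can also be checked numerically; a minimal sketch, assuming the objects from the sketch on "Sufficient Statistics i":

# Assumes Y, mu, Sigma, n from the sketch on "Sufficient Statistics i"
y_bar <- colMeans(Y)
Sn <- cov(Y)
lhs <- sum(mahalanobis(Y, center = mu, cov = Sigma))
rhs <- (n - 1)*sum(diag(solve(Sigma) %*% Sn)) +
  n*drop(t(y_bar - mu) %*% solve(Sigma) %*% (y_bar - mu))
all.equal(lhs, rhs)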

SLIDE 8

Maximum Likelihood Estimators

  • Going back to the log-likelihood, we get:

$$\ell = -\frac{n}{2}\log|\Sigma| - \frac{n-1}{2}\mathrm{tr}\left(\Sigma^{-1}S_n\right) - \frac{n}{2}(\bar{y} - \mu)^T\Sigma^{-1}(\bar{y} - \mu).$$

  • Since $\Sigma^{-1}$ is positive definite, for $\Sigma$ fixed, the log-likelihood is maximised at $\hat{\mu} = \bar{y}$.
  • With extra effort, it can be shown that $-\log|\Sigma| - \frac{n-1}{n}\mathrm{tr}(\Sigma^{-1}S_n)$ is maximised at

$$\hat{\Sigma} = \frac{n-1}{n}S_n = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})(y_i - \bar{y})^T.$$

  • In other words: $(\bar{Y}, \hat{\Sigma})$ are the maximum likelihood estimators for $(\mu, \Sigma)$.
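A minimal sketch confirming that the two expressions for the MLE of $\Sigma$ agree, assuming the objects from the sketch on "Sufficient Statistics i":

# Assumes Y, n from the sketch on "Sufficient Statistics i"
y_bar <- colMeans(Y)
Sigma_hat <- (n - 1)/n * cov(Y)            # rescaled sample covariance
all.equal(Sigma_hat,
          crossprod(sweep(Y, 2, y_bar))/n, # average of outer products
          check.attributes = FALSE)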

SLIDE 9

Maximum Likelihood Estimators

  • Since the multivariate normal density is "well-behaved", we can deduce the usual properties:
  • Consistency: $(\bar{Y}, \hat{\Sigma})$ converges in probability to $(\mu, \Sigma)$.
  • Efficiency: Asymptotically, the covariance of $(\bar{Y}, \hat{\Sigma})$ achieves the Cramér-Rao lower bound.
  • Invariance: For any transformation $(g(\mu), G(\Sigma))$ of $(\mu, \Sigma)$, its MLE is $(g(\bar{Y}), G(\hat{\Sigma}))$ (see the sketch below).
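For example, taking $G$ to be the map from a covariance matrix to the corresponding correlation matrix, invariance says the MLE of the correlation matrix is obtained by applying the same map to $\hat{\Sigma}$; a minimal sketch, assuming the objects from the earlier sketch:

# Assumes Y, n from the sketch on "Sufficient Statistics i"
Sigma_hat <- (n - 1)/n * cov(Y)
# MLE of the correlation matrix, by invariance; note that the
# factor (n - 1)/n cancels, so this equals cov2cor(cov(Y))
cov2cor(Sigma_hat)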

SLIDE 10

Visualizing the likelihood

library(mvtnorm)
set.seed(123)
n <- 50; p <- 2
mu <- c(1, 2)
Sigma <- matrix(c(1, 0.5, 0.5, 1), ncol = p)
Y <- rmvnorm(n, mean = mu, sigma = Sigma)

SLIDE 11

Visualizing the likelihood

loglik <- function(mu, sigma, data = Y) {
  # Compute quantities (note: we use `data`, not the global Y)
  n <- nrow(data)
  y_bar <- colMeans(data)
  Sn <- cov(data)
  Sigma_inv <- solve(sigma)
  # Compute quadratic form
  quad_form <- drop(t(y_bar - mu) %*% Sigma_inv %*% (y_bar - mu))
  # Log-likelihood, up to an additive constant
  -0.5*n*log(det(sigma)) -
    0.5*(n - 1)*sum(diag(Sigma_inv %*% Sn)) -
    0.5*n*quad_form
}

SLIDE 12

grid_xy <- expand.grid(seq(0.5, 1.5, length.out = 32),
                       seq(1, 3, length.out = 32))
contours <- purrr::map_df(seq_len(nrow(grid_xy)), function(i) {
  # Where we will evaluate loglik
  mu_obs <- as.numeric(grid_xy[i, ])
  # Evaluate at the population covariance
  z <- loglik(mu_obs, sigma = Sigma)
  # Output data.frame
  data.frame(x = mu_obs[1], y = mu_obs[2], z = z)
})

SLIDE 13

Visualizing the likelihood i

library(tidyverse)
library(ggrepel)
# Create df with pop. and sample means
data_means <- data.frame(x = c(mu[1], mean(Y[, 1])),
                         y = c(mu[2], mean(Y[, 2])),
                         label = c("Pop.", "Sample"))

SLIDE 14

Visualizing the likelihood ii

contours %>%
  ggplot(aes(x, y)) +
  geom_contour(aes(z = z)) +
  geom_point(data = data_means) +
  geom_label_repel(data = data_means, aes(label = label))

SLIDE 15

Visualizing the likelihood iii

[Figure: contour plot of the log-likelihood over the (x, y) grid of mean values, with the population ("Pop.") and sample ("Sample") means labelled.]

SLIDE 16

Visualizing the likelihood iv

library(scatterplot3d)
with(contours, scatterplot3d(x, y, z))

SLIDE 17

Visualizing the likelihood v

[Figure: 3D scatter plot of the log-likelihood values z over the (x, y) grid of mean values.]

SLIDE 18

Sampling Distributions

  • Recall the univariate case:
    • $\bar{X} \sim N(\mu, \sigma^2/n)$;
    • $(n-1)s^2/\sigma^2 \sim \chi^2(n-1)$;
    • $\bar{X}$ and $s^2$ are independent.
  • In the multivariate case, we have similar results (see the simulation sketch below):
    • $\bar{Y} \sim N_p\left(\mu, \frac{1}{n}\Sigma\right)$;
    • $(n-1)S_n = n\hat{\Sigma}$ follows a Wishart distribution with $n-1$ degrees of freedom;
    • $\bar{Y}$ and $S_n$ are independent.
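The first multivariate result is easy to check by simulation; a minimal, self-contained sketch comparing the Monte Carlo covariance of the sample mean with $\frac{1}{n}\Sigma$:

library(mvtnorm)
set.seed(123)
n <- 50
mu <- c(1, 2)
Sigma <- matrix(c(1, 0.5, 0.5, 1), ncol = 2)
# Covariance of the sample mean across 5000 replicated samples
means <- t(replicate(5000, colMeans(rmvnorm(n, mean = mu, sigma = Sigma))))
cov(means)  # should be close to Sigma/n
Sigma/n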

SLIDE 19

Wishart Distribution

  • Suppose $Z_1, \ldots, Z_n \sim N_p(0, \Sigma)$ are independently distributed. Then we say that

$$W = \sum_{i=1}^n Z_iZ_i^T$$

follows a Wishart distribution $W_n(\Sigma)$ with $n$ degrees of freedom.

  • Note that since $E(Z_iZ_i^T) = \Sigma$, we have $E(W) = n\Sigma$ (see the sketch below).
  • From the previous slide: $\sum_{i=1}^n (Y_i - \bar{Y})(Y_i - \bar{Y})^T$ has the same distribution as $\sum_{i=1}^{n-1} Z_iZ_i^T$ for some choice of $Z_1, \ldots, Z_{n-1} \sim N_p(0, \Sigma)$.
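A minimal sketch checking $E(W) = n\Sigma$ by simulation, using stats::rWishart:

set.seed(123)
Sigma <- matrix(c(1, 0.5, 0.5, 1), ncol = 2)
df <- 10
W <- rWishart(5000, df = df, Sigma = Sigma)  # p x p x 5000 array of draws
apply(W, c(1, 2), mean)                      # should be close to df*Sigma
df*Sigma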

SLIDE 20

Useful Properties

  • If $W_1 \sim W_{n_1}(\Sigma)$ and $W_2 \sim W_{n_2}(\Sigma)$ are independent, then $W_1 + W_2 \sim W_{n_1+n_2}(\Sigma)$.
  • If $W \sim W_n(\Sigma)$ and $C$ is $q \times p$, then $CWC^T \sim W_n(C\Sigma C^T)$ (a first-moment check follows below).
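The second property can be sanity-checked through first moments, since both sides should have expectation $n\,C\Sigma C^T$; a minimal sketch with a hypothetical contrast matrix C:

set.seed(123)
Sigma <- matrix(c(1, 0.5, 0.5, 1), ncol = 2)
C <- matrix(c(1, -1), nrow = 1)   # hypothetical contrast, q = 1, p = 2
W <- rWishart(5000, df = 10, Sigma = Sigma)
CWC <- apply(W, 3, function(A) C %*% A %*% t(C))
mean(CWC)                          # should be close to 10 * C Sigma C^T
drop(10 * C %*% Sigma %*% t(C))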

SLIDE 21

Density function

  • Let $\Sigma$ be a fixed $p \times p$ positive definite matrix. The density of the Wishart distribution with $n$ degrees of freedom, with $n \geq p$, is given by

$$w_n(A; \Sigma) = \frac{|A|^{(n-p-1)/2}\exp\left(-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}A)\right)}{2^{np/2}\,\pi^{p(p-1)/4}\,|\Sigma|^{n/2}\,\prod_{i=1}^p \Gamma\left(\frac{1}{2}(n - i + 1)\right)},$$

where $A$ ranges over all $p \times p$ positive definite matrices.
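Since the formula is explicit, it can be transcribed into R; a minimal sketch on the log scale, with a $p = 1$ sanity check using the fact that $W_n(\sigma^2)$ is the distribution of $\sigma^2$ times a $\chi^2(n)$ variable:

# Log-density of the Wishart W_n(Sigma), transcribed from the formula above
log_dwishart <- function(A, Sigma, n) {
  p <- nrow(Sigma)
  log_num <- 0.5*(n - p - 1)*as.numeric(determinant(A)$modulus) -
    0.5*sum(diag(solve(Sigma, A)))
  log_den <- 0.5*n*p*log(2) + 0.25*p*(p - 1)*log(pi) +
    0.5*n*as.numeric(determinant(Sigma)$modulus) +
    sum(lgamma(0.5*(n - seq_len(p) + 1)))
  log_num - log_den
}
# p = 1 check: both lines should return the same value
log_dwishart(matrix(4), matrix(2), n = 5)
dchisq(4/2, df = 5, log = TRUE) - log(2)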

SLIDE 22

Eigenvalue density function

  • For a random matrix $A \sim W_n(I_p)$ with $n \geq p$, the joint distribution of its eigenvalues $\lambda_1 \geq \cdots \geq \lambda_p$ has density

$$C_{n,p}\exp\left(-\frac{1}{2}\sum_{i=1}^p \lambda_i\right)\prod_{i=1}^p \lambda_i^{(n-p-1)/2}\prod_{i<j}|\lambda_i - \lambda_j|,$$

for some constant $C_{n,p}$.

  • We will study this distribution in STAT 7200–Multivariate Analysis I. (A simulation sketch follows below.)
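We will not derive $C_{n,p}$ here, but the eigenvalue distribution is easy to explore by simulation; a minimal sketch drawing from $W_n(I_p)$ and extracting the ordered eigenvalues:

set.seed(123)
n <- 10; p <- 3
W <- rWishart(1000, df = n, Sigma = diag(p))
# Ordered eigenvalues (largest to smallest) of each draw
lambda <- apply(W, 3, function(A) eigen(A, symmetric = TRUE)$values)
rowMeans(lambda)  # Monte Carlo means of the ordered eigenvalues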
