

SLIDE 1

Stat 5102 Lecture Slides Deck 3

Charles J. Geyer
School of Statistics
University of Minnesota

SLIDE 2

Likelihood Inference

We have learned one very general method of estimation: the method of moments. Now we learn another: the method of maximum likelihood.

SLIDE 3

Likelihood

Suppose we have a parametric statistical model specified by a PMF or PDF. Our convention of using boldface to distinguish between scalar data $x$ and vector data $\mathbf{x}$, and between a scalar parameter $\theta$ and a vector parameter $\boldsymbol{\theta}$, becomes a nuisance here. To begin our discussion we write the PMF or PDF as $f_\theta(x)$. But it makes no difference in likelihood inference if the data $x$ is a vector, nor does it make a difference in the fundamental definitions if the parameter $\theta$ is a vector. You may consider $x$ and $\theta$ to be scalars, but much of what we say until further notice works equally well if either $x$ or $\theta$ is a vector, or both are.

SLIDE 4

Likelihood

The PMF or PDF, considered as a function of the unknown parameter or parameters rather than of the data, is called the likelihood function
$$L(\theta) = f_\theta(x)$$
Although $L(\theta)$ also depends on the data $x$, we suppress this in the notation. If the data are considered random, then $L(\theta)$ is a random variable and the function $L$ is a random function. If the data are considered nonrandom, as when the observed value of the data is plugged in, then $L(\theta)$ is a number and $L$ is an ordinary mathematical function. Since the data $X$ or $x$ do not appear in the notation $L(\theta)$, we cannot distinguish these cases notationally and must do so by context.

SLIDE 5

Likelihood (cont.)

For all purposes that likelihood gets used in statistics — it is the key to both likelihood inference and Bayesian inference — it does not matter if multiplicative terms not containing unknown parameters are dropped from the likelihood function. If $L(\theta)$ is a likelihood function for a given problem, then so is
$$L^*(\theta) = \frac{L(\theta)}{h(x)}$$
where $h$ is any strictly positive real-valued function.

SLIDE 6

Log Likelihood

In frequentist inference, the log likelihood function, which is the logarithm of the likelihood function, is more useful. If $L$ is the likelihood function, we write
$$l(\theta) = \log L(\theta)$$
for the log likelihood. When discussing asymptotics, we often add a subscript denoting sample size, so the likelihood becomes $L_n(\theta)$ and the log likelihood becomes $l_n(\theta)$. Note: we have yet another capital and lower case convention: capital $L$ for likelihood and lower case $l$ for log likelihood.

SLIDE 7

Log Likelihood (cont.)

As we said before (slide 5), we may drop multiplicative terms not containing unknown parameters from the likelihood function. If
$$L(\theta) = h(x)\, g(x, \theta)$$
we may drop the term $h(x)$. Since
$$l(\theta) = \log h(x) + \log g(x, \theta)$$
this means we may drop additive terms not containing unknown parameters from the log likelihood function.

SLIDE 8

Examples

Suppose $X$ is $\mathrm{Bin}(n, p)$; then the likelihood is
$$L_n(p) = \binom{n}{x} p^x (1 - p)^{n - x}$$
but we may, if we like, drop the term that does not contain the parameter, so
$$L_n(p) = p^x (1 - p)^{n - x}$$
is another (simpler) version of the likelihood. The log likelihood is
$$l_n(p) = x \log(p) + (n - x) \log(1 - p)$$
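Because a likelihood is only determined up to multiplicative constants, maximizing the simpler version gives the same answer. A minimal Python sketch (the data, $x = 7$ successes in $n = 20$ trials, are made up for illustration) checks numerically that the maximizer of this log likelihood is $x/n$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Binomial log likelihood l_n(p) = x log p + (n - x) log(1 - p),
# with the constant binomial coefficient term dropped.
def log_lik(p, x, n):
    return x * np.log(p) + (n - x) * np.log(1 - p)

n, x = 20, 7  # hypothetical data: 7 successes in 20 trials

# Maximize by minimizing the negative log likelihood over (0, 1).
res = minimize_scalar(lambda p: -log_lik(p, x, n),
                      bounds=(1e-8, 1 - 1e-8), method="bounded")

print(res.x)   # numerical maximizer, approximately 0.35
print(x / n)   # closed-form MLE x / n
```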

SLIDE 9

Examples (cont.)

Suppose $X_1, \ldots, X_n$ are IID $N(\mu, \nu)$; then the likelihood is
$$L_n(\mu, \nu) = \prod_{i=1}^n f_\theta(x_i) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \nu}} e^{-(x_i - \mu)^2 / (2 \nu)} = (2 \pi)^{-n/2} \nu^{-n/2} \exp\left( -\frac{1}{2 \nu} \sum_{i=1}^n (x_i - \mu)^2 \right)$$
but we may, if we like, drop the term that does not contain parameters, so
$$L_n(\mu, \nu) = \nu^{-n/2} \exp\left( -\frac{1}{2 \nu} \sum_{i=1}^n (x_i - \mu)^2 \right)$$

SLIDE 10

Examples (cont.)

The log likelihood is
$$l_n(\mu, \nu) = -\frac{n}{2} \log(\nu) - \frac{1}{2 \nu} \sum_{i=1}^n (x_i - \mu)^2$$
We can further simplify this using the empirical mean square error formula (slide 7, deck 1)
$$\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2 = v_n + (\bar{x}_n - \mu)^2$$
where $\bar{x}_n$ is the mean and $v_n$ the variance of the empirical distribution. Hence
$$l_n(\mu, \nu) = -\frac{n}{2} \log(\nu) - \frac{n v_n}{2 \nu} - \frac{n (\bar{x}_n - \mu)^2}{2 \nu}$$

SLIDE 11

Log Likelihood (cont.)

What we consider the log likelihood may depend on what we consider the unknown parameters. If we say $\mu$ is unknown but $\nu$ is known in the preceding example, then we may drop additive terms not containing $\mu$ from the log likelihood, obtaining
$$l_n(\mu) = -\frac{n (\bar{x}_n - \mu)^2}{2 \nu}$$
If we say $\nu$ is unknown but $\mu$ is known in the preceding example, then every term contains $\nu$, so there is nothing to drop, but we do change the argument of $l_n$ to be only the unknown parameter
$$l_n(\nu) = -\frac{n}{2} \log(\nu) - \frac{1}{2 \nu} \sum_{i=1}^n (x_i - \mu)^2$$

SLIDE 12

Maximum Likelihood Estimation

The maximum likelihood estimate (MLE) of an unknown parameter $\theta$ (which may be a vector) is the value of $\theta$ that maximizes the likelihood in some sense. It is hard to find the global maximizer of the likelihood. Thus a local maximizer is often used and also called an MLE. The global maximizer can behave badly or fail to exist when the right choice of local maximizer can behave well. More on this later. $\hat{\theta}_n$ is a global maximizer of $L_n$ if and only if it is a global maximizer of $l_n$. The same holds with local replacing global.

SLIDE 13

Local Maxima

Suppose $W$ is an open interval of $\mathbb{R}$ and $f : W \to \mathbb{R}$ is a differentiable function. From calculus, a necessary condition for a point $x \in W$ to be a local maximum of $f$ is
$$f'(x) = 0 \qquad (*)$$
Also from calculus, if $f$ is twice differentiable and $(*)$ holds, then $f''(x) \le 0$ is another necessary condition for $x$ to be a local maximum, and $f''(x) < 0$ is a sufficient condition for $x$ to be a local maximum.

SLIDE 14

Global Maxima

Conditions for global maxima are, in general, very difficult. Every known procedure requires exhaustive search over many possible solutions. There is one special case — concavity — that occurs in many likelihood applications and guarantees global maximizers.

SLIDE 15

Concavity

Suppose $W$ is an open interval of $\mathbb{R}$ and $f : W \to \mathbb{R}$ is a twice-differentiable function. Then $f$ is concave if
$$f''(x) \le 0, \quad \text{for all } x \in W$$
and $f$ is strictly concave if
$$f''(x) < 0, \quad \text{for all } x \in W$$

SLIDE 16

Concavity (cont.)

From the fundamental theorem of calculus
$$f'(y) = f'(x) + \int_x^y f''(s)\, ds$$
Hence concavity implies $f'$ is nonincreasing
$$f'(x) \ge f'(y), \quad \text{whenever } x < y$$
and strict concavity implies $f'$ is decreasing
$$f'(x) > f'(y), \quad \text{whenever } x < y$$

SLIDE 17

Concavity (cont.)

Suppose $f'(x) = 0$. Another application of the fundamental theorem of calculus gives
$$f(y) = f(x) + \int_x^y f'(s)\, ds$$
Suppose $f$ is concave. If $x < y$, then $0 = f'(x) \ge f'(s)$ when $s > x$, hence
$$\int_x^y f'(s)\, ds \le 0$$
and $f(y) \le f(x)$. If $y < x$, then $f'(s) \ge f'(x) = 0$ when $s < x$, hence
$$-\int_x^y f'(s)\, ds = \int_y^x f'(s)\, ds \ge 0$$
and again $f(y) \le f(x)$.

SLIDE 18

Concavity (cont.)

Summarizing, if $f'(x) = 0$ and $f$ is concave, then
$$f(y) \le f(x), \quad \text{whenever } y \in W$$
hence $x$ is a global maximizer of $f$. By a similar argument, if $f'(x) = 0$ and $f$ is strictly concave, then
$$f(y) < f(x), \quad \text{whenever } y \in W \text{ and } y \ne x$$
hence $x$ is the unique global maximizer of $f$.

SLIDE 19

Concavity (cont.)

The first and second order conditions are almost the same with and without concavity. Suppose we find a point $x$ satisfying
$$f'(x) = 0$$
This is a candidate local or global optimizer. We check the second derivative: $f''(x) < 0$ implies $x$ is a local maximizer, and $f''(y) < 0$ for all $y$ in the domain of $f$ implies $x$ is the unique global maximizer. The only difference is whether we check the second derivative only at $x$ or at all points.

SLIDE 20

Examples (cont.)

For the binomial distribution the log likelihood
$$l_n(p) = x \log(p) + (n - x) \log(1 - p)$$
has derivatives
$$l_n'(p) = \frac{x}{p} - \frac{n - x}{1 - p} = \frac{x - np}{p(1 - p)}$$
$$l_n''(p) = -\frac{x}{p^2} - \frac{n - x}{(1 - p)^2}$$
Setting $l_n'(p) = 0$ and solving for $p$ gives $p = x/n$. Since $l_n''(p) < 0$ for all $p$ we have strict concavity, and $\hat{p}_n = x/n$ is the unique global maximizer of the log likelihood.

SLIDE 21

Examples (cont.)

The analysis on the preceding slide doesn't work when $\hat{p}_n = 0$ or $\hat{p}_n = 1$ because the log likelihood and its derivatives are undefined when $p = 0$ or $p = 1$. More on this later.

SLIDE 22

Examples (cont.)

For IID normal data with known mean $\mu$ and unknown variance $\nu$ the log likelihood
$$l_n(\nu) = -\frac{n}{2} \log(\nu) - \frac{1}{2 \nu} \sum_{i=1}^n (x_i - \mu)^2$$
has derivatives
$$l_n'(\nu) = -\frac{n}{2 \nu} + \frac{1}{2 \nu^2} \sum_{i=1}^n (x_i - \mu)^2$$
$$l_n''(\nu) = \frac{n}{2 \nu^2} - \frac{1}{\nu^3} \sum_{i=1}^n (x_i - \mu)^2$$

SLIDE 23

Examples (cont.)

Setting $l_n'(\nu) = 0$ and solving for $\nu$ we get
$$\hat{\sigma}_n^2 = \hat{\nu}_n = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2$$
(recall that $\mu$ is supposed known, so this is a statistic). Since
$$l_n''(\hat{\nu}_n) = \frac{n}{2 \hat{\nu}_n^2} - \frac{1}{\hat{\nu}_n^3} \sum_{i=1}^n (x_i - \mu)^2 = -\frac{n}{2 \hat{\nu}_n^2}$$
we can say this MLE is a local maximizer of the log likelihood.
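The closed-form MLE can be checked against a numerical maximization of $l_n(\nu)$; a Python sketch on simulated data (the known mean $\mu = 2$ and the sample are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
mu = 2.0                       # known mean (hypothetical value)
x = rng.normal(mu, 1.5, 500)   # simulated data, true variance 2.25

# Log likelihood l_n(nu) = -(n/2) log(nu) - (1/(2 nu)) sum (x_i - mu)^2
def log_lik(nu):
    return -0.5 * len(x) * np.log(nu) - np.sum((x - mu) ** 2) / (2 * nu)

nu_hat = np.mean((x - mu) ** 2)  # closed-form MLE from the slide
res = minimize_scalar(lambda nu: -log_lik(nu),
                      bounds=(1e-6, 100.0), method="bounded")
print(nu_hat, res.x)  # the two maximizers agree
```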

SLIDE 24

Examples (cont.)

Since
$$l_n''(\nu) = \frac{n}{2 \nu^2} - \frac{1}{\nu^3} \sum_{i=1}^n (x_i - \mu)^2 = \frac{n}{2 \nu^2} - \frac{n \hat{\nu}_n}{\nu^3}$$
is not negative for all data and all $\nu > 0$, we cannot say the MLE is the unique global maximizer, at least not from this analysis. More on this later.

SLIDE 25

MLE on Boundary of Parameter Space

All of this goes out the window when we consider possible maxima that occur on the boundary of the domain of a function. For a function whose domain is a one-dimensional interval, this means the endpoints of the interval.

SLIDE 26

MLE on Boundary of Parameter Space (cont.)

Suppose $X_1, \ldots, X_n$ are IID $\mathrm{Unif}(0, \theta)$. The likelihood is
$$L_n(\theta) = \prod_{i=1}^n f_\theta(x_i) = \prod_{i=1}^n \frac{1}{\theta} I_{[0, \theta]}(x_i) = \theta^{-n} \prod_{i=1}^n I_{[0, \theta]}(x_i) = \theta^{-n} I_{[x_{(n)}, \infty)}(\theta)$$
The indicator functions $I_{[0, \theta]}(x_i)$ are all equal to one if and only if $x_i \le \theta$ for all $i$, which happens if and only if $x_{(n)} \le \theta$, a condition that is captured in the indicator function on the bottom line.

SLIDE 27

MLE on Boundary of Parameter Space (cont.)

[Figure: likelihood for the $\mathrm{Unif}(0, \theta)$ model, $L_n(\theta)$ plotted against $\theta$, with the jump at $\theta = x_{(n)}$.]

SLIDE 28

MLE on Boundary of Parameter Space (cont.)

It is clear from the picture that the unique global maximizer of the likelihood is
$$\hat{\theta}_n = x_{(n)}$$
the $n$-th order statistic, which is the largest data value. For those who want more math, it is often easier to work with the likelihood rather than the log likelihood when the MLE is on the boundary. It is clear from the picture that $\theta \mapsto \theta^{-n}$ is a decreasing function, hence the maximum must occur at the lower end of the range of validity of this formula, which is at $\theta = x_{(n)}$.
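A quick simulation illustrates the boundary MLE; the true $\theta = 3$ and sample size below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 3.0                       # true parameter (hypothetical)
x = rng.uniform(0.0, theta, 1000)

# MLE = largest data value, the n-th order statistic x_(n).
# Any theta below x.max() has zero likelihood; any theta above it
# gives a smaller value of theta**(-n), so the maximum is at x.max().
theta_hat = x.max()
print(theta_hat)  # slightly below the true theta = 3
```

Note that $\hat{\theta}_n \le \theta$ always, so this estimator is biased downward, which is consistent with the maximum occurring on the boundary.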

SLIDE 29

MLE on Boundary of Parameter Space (cont.)

If one doesn't want to use the picture at all,
$$L_n'(\theta) = -n \theta^{-(n+1)}, \quad \theta > x_{(n)}$$
shows the derivative of $L_n$ is negative, hence $L_n$ is a decreasing function when $\theta > x_{(n)}$, which is the interesting part of the domain.

SLIDE 30

MLE on Boundary of Parameter Space (cont.)

Because of the way we defined the likelihood at $\theta = x_{(n)}$, the maximum is achieved. This came from the way we defined the PDF
$$f_\theta(x) = \frac{1}{\theta} I_{[0, \theta]}(x)$$
Recall that the definition of a PDF at particular points is arbitrary. In particular, we could have defined it arbitrarily at $0$ and $\theta$. We chose the definition we did so that the value of the likelihood function at the discontinuity, which is at $\theta = x_{(n)}$, is the upper value, as indicated by the solid and hollow dots in the picture of the likelihood function.

SLIDE 31

MLE on Boundary of Parameter Space (cont.)

For the binomial distribution, there were two cases we did not do: $x = 0$ and $x = n$. If $\hat{p}_n = x/n$ is also the correct MLE for them, then the MLE is on the boundary. Again we use the likelihood
$$L_n(p) = p^x (1 - p)^{n - x}, \quad 0 < p < 1$$
In case $x = 0$, this becomes
$$L_n(p) = (1 - p)^n, \quad 0 \le p \le 1$$
Now that we no longer have to worry about $0^0$ being undefined, we can extend the domain to $0 \le p \le 1$. It is easy to check that $L_n$ is a decreasing function: draw the graph or check that $L_n'(p) < 0$ for $0 < p < 1$. Hence the unique global maximum occurs at $p = 0$. The case $x = n$ is similar. In all cases $\hat{p}_n = x/n$.

SLIDE 32

Usual Asymptotics of MLE

The method of maximum likelihood estimation is remarkable in that we can determine the asymptotic distribution of estimators that are defined only implicitly — the maximizer of the log likelihood — and perhaps can only be calculated by computer optimization. In case we do have an explicit expression for the MLE, the asymptotic distribution we now derive must agree with the one calculated via the delta method, but is easier to calculate.

SLIDE 33

Asymptotics for Log Likelihood Derivatives

Consider the identity
$$\int f_\theta(x)\, dx = 1$$
or the analogous identity with summation replacing integration for the discrete case. We assume we can differentiate with respect to $\theta$ under the integral sign
$$\frac{d}{d\theta} \int f_\theta(x)\, dx = \int \frac{d}{d\theta} f_\theta(x)\, dx$$
This operation is usually valid. We won't worry about precise technical conditions.

SLIDE 34

Asymptotics for Log Likelihood Derivatives (cont.)

Since the derivative of a constant is zero, we have
$$\int \frac{d}{d\theta} f_\theta(x)\, dx = 0$$
Also
$$l'(\theta) = \frac{d}{d\theta} \log f_\theta(x) = \frac{1}{f_\theta(x)} \frac{d}{d\theta} f_\theta(x)$$
Hence
$$\frac{d}{d\theta} f_\theta(x) = l'(\theta) f_\theta(x)$$
and
$$0 = \int l'(\theta) f_\theta(x)\, dx = E_\theta\{l'(\theta)\}$$

SLIDE 35

Asymptotics for Log Likelihood Derivatives (cont.)

This gives us the first log likelihood derivative identity
$$E_\theta\{l_n'(\theta)\} = 0$$
which always holds whenever differentiation under the integral sign is valid (which is usually). Note that it is important that we write $E_\theta$ for expectation rather than $E$. The identity holds when the $\theta$ in $l_n'(\theta)$ and the $\theta$ in $E_\theta$ are the same.

SLIDE 36

Asymptotics for Log Likelihood Derivatives (cont.)

For our next trick we differentiate under the integral sign again
$$\int \frac{d^2}{d\theta^2} f_\theta(x)\, dx = 0$$
Also
$$l''(\theta) = \frac{d}{d\theta} \left( \frac{1}{f_\theta(x)} \frac{d}{d\theta} f_\theta(x) \right) = \frac{1}{f_\theta(x)} \frac{d^2}{d\theta^2} f_\theta(x) - \frac{1}{f_\theta(x)^2} \left( \frac{d}{d\theta} f_\theta(x) \right)^2$$
Hence
$$\frac{d^2}{d\theta^2} f_\theta(x) = l''(\theta) f_\theta(x) + l'(\theta)^2 f_\theta(x)$$
and
$$0 = \int l''(\theta) f_\theta(x)\, dx + \int l'(\theta)^2 f_\theta(x)\, dx = E_\theta\{l''(\theta)\} + E_\theta\{l'(\theta)^2\}$$

SLIDE 37

Asymptotics for Log Likelihood Derivatives (cont.)

This gives us the second log likelihood derivative identity
$$\mathrm{var}_\theta\{l_n'(\theta)\} = -E_\theta\{l_n''(\theta)\}$$
which always holds whenever differentiation under the integral sign is valid (which is usually). The reason why
$$\mathrm{var}_\theta\{l_n'(\theta)\} = E_\theta\{l_n'(\theta)^2\}$$
is the first log likelihood derivative identity $E_\theta\{l_n'(\theta)\} = 0$. Note that it is again important that we write $E_\theta$ for expectation and $\mathrm{var}_\theta$ for variance rather than $E$ and $\mathrm{var}$.

SLIDE 38

Asymptotics for Log Likelihood Derivatives (cont.)

Summary:
$$E_\theta\{l_n'(\theta)\} = 0$$
$$\mathrm{var}_\theta\{l_n'(\theta)\} = -E_\theta\{l_n''(\theta)\}$$
These hold whether the data is discrete or continuous (for discrete data just replace integrals by sums in the preceding proofs).
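Both identities can be checked by Monte Carlo. The sketch below uses an exponential model with rate $\theta$ (the value $\theta = 2$ and the sample size are arbitrary choices, not from the slides), for which $l'(\theta) = 1/\theta - x$ and $l''(\theta) = -1/\theta^2$:

```python
import numpy as np

# Monte Carlo check of the two log likelihood derivative identities
# for X ~ Exponential(rate theta): f_theta(x) = theta * exp(-theta x).
rng = np.random.default_rng(42)
theta = 2.0
x = rng.exponential(1 / theta, 200_000)

score = 1 / theta - x   # l'(theta) evaluated at each draw

print(score.mean())     # approximately 0 (first identity)
print(score.var())      # approximately 1/theta**2 = 0.25
print(1 / theta**2)     # -E{l''(theta)}, matching the second identity
```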

SLIDE 39

Fisher Information

Either side of the second log likelihood derivative identity is called Fisher information
$$I_n(\theta) = \mathrm{var}_\theta\{l_n'(\theta)\} = -E_\theta\{l_n''(\theta)\}$$

SLIDE 40

Asymptotics for Log Likelihood Derivatives (cont.)

When the data are IID, the log likelihood and its derivatives are sums of IID terms
$$l_n(\theta) = \sum_{i=1}^n \log f_\theta(x_i)$$
$$l_n'(\theta) = \sum_{i=1}^n \frac{d}{d\theta} \log f_\theta(x_i)$$
$$l_n''(\theta) = \sum_{i=1}^n \frac{d^2}{d\theta^2} \log f_\theta(x_i)$$
From either of the last two equations we see that
$$I_n(\theta) = n I_1(\theta)$$

SLIDE 41

Asymptotics for Log Likelihood Derivatives (cont.)

If we divide either of the equations for likelihood derivatives on the preceding overhead by $n$, the sums become averages of IID random variables
$$n^{-1} l_n'(\theta) = \frac{1}{n} \sum_{i=1}^n \frac{d}{d\theta} \log f_\theta(x_i)$$
$$n^{-1} l_n''(\theta) = \frac{1}{n} \sum_{i=1}^n \frac{d^2}{d\theta^2} \log f_\theta(x_i)$$
Hence the LLN and CLT apply to them.

SLIDE 42

Asymptotics for Log Likelihood Derivatives (cont.)

To apply the LLN we need to know the expectation of the individual terms
$$E_\theta\left\{ \frac{d}{d\theta} \log f_\theta(x_i) \right\} = E_\theta\{l_1'(\theta)\} = 0$$
$$E_\theta\left\{ \frac{d^2}{d\theta^2} \log f_\theta(x_i) \right\} = E_\theta\{l_1''(\theta)\} = -I_1(\theta)$$

SLIDE 43

Asymptotics for Log Likelihood Derivatives (cont.)

Hence the LLN applied to log likelihood derivatives says
$$n^{-1} l_n'(\theta) \xrightarrow{P} 0$$
$$n^{-1} l_n''(\theta) \xrightarrow{P} -I_1(\theta)$$
It is assumed here that $\theta$ is the true unknown parameter value, that is, $X_1, X_2, \ldots$ are IID with PDF or PMF $f_\theta$.

SLIDE 44

Asymptotics for Log Likelihood Derivatives (cont.)

To apply the CLT we need to know the mean and variance of the individual terms
$$E_\theta\left\{ \frac{d}{d\theta} \log f_\theta(x_i) \right\} = E_\theta\{l_1'(\theta)\} = 0$$
$$\mathrm{var}_\theta\left\{ \frac{d}{d\theta} \log f_\theta(x_i) \right\} = \mathrm{var}_\theta\{l_1'(\theta)\} = I_1(\theta)$$
We don't know the variance of $l_n''(\theta)$, so we don't obtain a CLT for it.

SLIDE 45

Asymptotics for Log Likelihood Derivatives (cont.)

Hence the CLT applied to the log likelihood first derivative says
$$\sqrt{n}\left( n^{-1} l_n'(\theta) - 0 \right) \xrightarrow{D} N\big(0, I_1(\theta)\big)$$
or (cleaning this up a bit)
$$n^{-1/2} l_n'(\theta) \xrightarrow{D} N\big(0, I_1(\theta)\big)$$
It is assumed here that $\theta$ is the true unknown parameter value, that is, $X_1, X_2, \ldots$ are IID with PDF or PMF $f_\theta$.

SLIDE 46

Asymptotics for MLE

The MLE $\hat{\theta}_n$ satisfies
$$l_n'(\hat{\theta}_n) = 0$$
because the MLE is a local maximizer (at least) of the log likelihood. Expand the first derivative of the log likelihood in a Taylor series about the true unknown parameter value, which we now start calling $\theta_0$
$$l_n'(\theta) = l_n'(\theta_0) + l_n''(\theta_0)(\theta - \theta_0) + \text{higher order terms}$$

SLIDE 47

Asymptotics for MLE

We rewrite this
$$n^{-1/2} l_n'(\theta) = n^{-1/2} l_n'(\theta_0) + n^{-1} l_n''(\theta_0)\, n^{1/2} (\theta - \theta_0) + \text{higher order terms}$$
because we know the asymptotics of $n^{-1/2} l_n'(\theta_0)$ and $n^{-1} l_n''(\theta_0)$. Then we assume the higher order terms are negligible when $\hat{\theta}_n$ is plugged in for $\theta$
$$0 = n^{-1/2} l_n'(\theta_0) + n^{-1} l_n''(\theta_0)\, n^{1/2} (\hat{\theta}_n - \theta_0) + o_p(1)$$

SLIDE 48

Asymptotics for MLE

This implies
$$\sqrt{n}(\hat{\theta}_n - \theta_0) = -\frac{n^{-1/2} l_n'(\theta_0)}{n^{-1} l_n''(\theta_0)} + o_p(1)$$
and by Slutsky's theorem
$$\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{D} -\frac{Y}{I_1(\theta_0)}$$
where $Y \sim N\big(0, I_1(\theta_0)\big)$.
SLIDE 49

Asymptotics for MLE

Since
$$E\left\{ \frac{Y}{I_1(\theta_0)} \right\} = 0$$
$$\mathrm{var}\left\{ \frac{Y}{I_1(\theta_0)} \right\} = \frac{I_1(\theta_0)}{I_1(\theta_0)^2} = I_1(\theta_0)^{-1}$$
we get
$$\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{D} N\big(0, I_1(\theta_0)^{-1}\big)$$

SLIDE 50

Asymptotics for MLE

It is now safe to get sloppy
$$\hat{\theta}_n \approx N\big(\theta_0,\, n^{-1} I_1(\theta_0)^{-1}\big)$$
or
$$\hat{\theta}_n \approx N\big(\theta_0,\, I_n(\theta_0)^{-1}\big)$$
This is a remarkable result. Without knowing anything about the functional form of the MLE, we have derived its asymptotic distribution.

SLIDE 51

Examples (cont.)

We already know the asymptotic distribution for the MLE of the binomial distribution, because it follows directly from the CLT (5101, deck 7, slide 36)
$$\sqrt{n}(\hat{p}_n - p) \xrightarrow{D} N\big(0,\, p(1 - p)\big)$$
but let us calculate this using likelihood theory
$$I_n(p) = -E_p\{l_n''(p)\} = -E_p\left\{ -\frac{X}{p^2} - \frac{n - X}{(1 - p)^2} \right\} = \frac{np}{p^2} + \frac{n - np}{(1 - p)^2} = \frac{n}{p(1 - p)}$$
(the formula for $l_n''(p)$ is from slide 20)

SLIDE 52

Examples (cont.)

Hence
$$I_n(p)^{-1} = \frac{p(1 - p)}{n}$$
and
$$\hat{p}_n \approx N\left(p,\, \frac{p(1 - p)}{n}\right)$$
as we already knew.

SLIDE 53

Examples (cont.)

For IID normal data with known mean $\mu$ and unknown variance $\nu$
$$I_n(\nu) = -E_\nu\{l_n''(\nu)\} = -E_\nu\left\{ \frac{n}{2 \nu^2} - \frac{1}{\nu^3} \sum_{i=1}^n (X_i - \mu)^2 \right\} = -\frac{n}{2 \nu^2} + \frac{1}{\nu^3} \cdot n \nu = \frac{n}{2 \nu^2}$$
(the formula for $l_n''(\nu)$ is from slide 22). Hence
$$\hat{\nu}_n \approx N\left(\nu,\, \frac{2 \nu^2}{n}\right)$$
SLIDE 54

Examples (cont.)

Or for IID normal data
$$\hat{\sigma}_n^2 \approx N\left(\sigma^2,\, \frac{2 \sigma^4}{n}\right)$$
because $\nu = \sigma^2$. We already knew this from homework problem 4-9.

SLIDE 55

Examples (cont.)

Here's an example we don't know. Suppose $X_1, X_2, \ldots$ are IID $\mathrm{Gam}(\alpha, \lambda)$, where $\alpha$ is unknown and $\lambda$ is known. Then
$$L_n(\alpha) = \prod_{i=1}^n \frac{\lambda^\alpha}{\Gamma(\alpha)} x_i^{\alpha - 1} e^{-\lambda x_i} = \left( \frac{\lambda^\alpha}{\Gamma(\alpha)} \right)^n \prod_{i=1}^n x_i^{\alpha - 1} e^{-\lambda x_i} = \left( \frac{\lambda^\alpha}{\Gamma(\alpha)} \right)^n \left( \prod_{i=1}^n x_i \right)^{\alpha - 1} \exp\left( -\lambda \sum_{i=1}^n x_i \right)$$
and we can drop the term that does not contain $\alpha$.

SLIDE 56

Examples (cont.)

The log likelihood is
$$l_n(\alpha) = n \alpha \log \lambda - n \log \Gamma(\alpha) + (\alpha - 1) \log\left( \prod_{i=1}^n x_i \right)$$
Every term except $n \log \Gamma(\alpha)$ is linear in $\alpha$ and hence has second derivative with respect to $\alpha$ equal to zero. Hence
$$l_n''(\alpha) = -n \frac{d^2}{d\alpha^2} \log \Gamma(\alpha)$$
and
$$I_n(\alpha) = n \frac{d^2}{d\alpha^2} \log \Gamma(\alpha)$$
because the expectation of a constant is a constant.

SLIDE 57

Examples (cont.)

The second derivative of the logarithm of the gamma function is not something we know how to do, but is a "brand name function" called the trigamma function, which can be calculated by R or Mathematica. Again we say, this is a remarkable result. We have no closed form expression for the MLE, but we know its asymptotic distribution is
$$\hat{\alpha}_n \approx N\left(\alpha,\, \frac{1}{n \operatorname{trigamma}(\alpha)}\right)$$
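Besides R and Mathematica, the trigamma function is available in scipy as `polygamma(1, .)`. A sketch of the asymptotic standard error of $\hat{\alpha}_n$ (the values of $\alpha$ and $n$ are hypothetical):

```python
import numpy as np
from scipy.special import polygamma

# trigamma(alpha) = d^2/dalpha^2 log Gamma(alpha) = polygamma(1, alpha)
alpha = 1.6   # hypothetical value of the shape parameter
n = 100       # hypothetical sample size

trigamma = polygamma(1, alpha)
se = 1 / np.sqrt(n * trigamma)   # asymptotic standard error of the MLE
print(trigamma, se)
```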
SLIDE 58

Plug-In for Asymptotic Variance

Since we do not know the true unknown parameter value $\theta_0$, we do not know the Fisher information $I_1(\theta_0)$ either. In order to use the asymptotics of the MLE for confidence intervals and hypothesis tests, we need plug-in. If $\theta \mapsto I_1(\theta)$ is a continuous function, then
$$I_1(\hat{\theta}_n) \xrightarrow{P} I_1(\theta_0)$$
by the continuous mapping theorem. Hence by the plug-in principle (Slutsky's theorem)
$$\frac{\sqrt{n}\,(\hat{\theta}_n - \theta)}{I_1(\hat{\theta}_n)^{-1/2}} = (\hat{\theta}_n - \theta)\, I_n(\hat{\theta}_n)^{1/2} \xrightarrow{D} N(0, 1)$$
is an asymptotically pivotal quantity that can be used to construct confidence intervals and hypothesis tests.

SLIDE 59

Plug-In for Asymptotic Variance (cont.)

If $z_\alpha$ is the $1 - \alpha$ quantile of the standard normal distribution, then
$$\hat{\theta}_n \pm z_{\alpha/2}\, I_n(\hat{\theta}_n)^{-1/2}$$
is an asymptotic $100(1 - \alpha)\%$ confidence interval for $\theta$.

SLIDE 60

Plug-In for Asymptotic Variance (cont.)

The test statistic
$$T = (\hat{\theta}_n - \theta_0)\, I_n(\hat{\theta}_n)^{1/2}$$
is asymptotically standard normal under the null hypothesis
$$H_0 : \theta = \theta_0$$
As usual, the approximate P-values for upper-tail, lower-tail, and two-tail tests are, respectively,
$$\Pr_{\theta_0}(T \ge t) \approx 1 - \Phi(t)$$
$$\Pr_{\theta_0}(T \le t) \approx \Phi(t)$$
$$\Pr_{\theta_0}(|T| \ge |t|) \approx 2\big(1 - \Phi(|t|)\big) = 2\, \Phi(-|t|)$$
where $\Phi$ is the DF of the standard normal distribution.
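These three approximations are one-liners in Python; $t = 1.8$ below is a made-up observed value of the test statistic, not from the slides:

```python
from scipy.stats import norm

t = 1.8  # hypothetical observed test statistic

p_upper = 1 - norm.cdf(t)       # upper-tail test: Pr(T >= t)
p_lower = norm.cdf(t)           # lower-tail test: Pr(T <= t)
p_two = 2 * norm.cdf(-abs(t))   # two-tail test: Pr(|T| >= |t|)

print(p_upper, p_lower, p_two)
```

Note the identity $2(1 - \Phi(|t|)) = 2\Phi(-|t|)$ from the slide holds exactly here by symmetry of the standard normal DF.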

SLIDE 61

Plug-In for Asymptotic Variance (cont.)

Sometimes the expectation involved in calculating Fisher information is too hard to do. Then we use the following idea. The LLN for the second derivative of the log likelihood (slide 43) says
$$n^{-1} l_n''(\theta_0) \xrightarrow{P} -I_1(\theta_0)$$
which motivates the following definition:
$$J_n(\theta) = -l_n''(\theta)$$
is called observed Fisher information. For contrast, $I_n(\theta)$ or $I_1(\theta)$ is called expected Fisher information, although, strictly speaking, the "expected" is unnecessary.

SLIDE 62

Plug-In for Asymptotic Variance (cont.)

The LLN for the second derivative of the log likelihood can be written sloppily
$$J_n(\theta) \approx I_n(\theta)$$
from which
$$J_n(\hat{\theta}_n) \approx I_n(\hat{\theta}_n) \approx I_n(\theta_0)$$
should also hold, and usually does (although this requires more than just the continuous mapping theorem, so we don't give a proof).

SLIDE 63

Plug-In for Asymptotic Variance (cont.)

This gives us two asymptotic $100(1 - \alpha)\%$ confidence intervals for $\theta$
$$\hat{\theta}_n \pm z_{\alpha/2}\, I_n(\hat{\theta}_n)^{-1/2}$$
$$\hat{\theta}_n \pm z_{\alpha/2}\, J_n(\hat{\theta}_n)^{-1/2}$$
and the latter does not require any expectations. If we can write down the log likelihood and differentiate it twice, then we can construct the latter confidence interval.

SLIDE 64

Plug-In for Asymptotic Variance (cont.)

Similarly, we have two test statistics
$$T = (\hat{\theta}_n - \theta_0)\, I_n(\hat{\theta}_n)^{1/2}$$
$$T = (\hat{\theta}_n - \theta_0)\, J_n(\hat{\theta}_n)^{1/2}$$
which are asymptotically standard normal under the null hypothesis $H_0 : \theta = \theta_0$ and can be used to perform hypothesis tests (as described on slide 60). Again, if we can write down the log likelihood and differentiate it twice, then we can perform the test using the latter test statistic.

SLIDE 65

Plug-In for Asymptotic Variance (cont.)

Sometimes even differentiating the log likelihood is too hard to do. Then we use the following idea. Derivatives can be approximated by "finite differences"
$$f'(x) \approx \frac{f(x + h) - f(x)}{h}, \quad \text{when } h \text{ is small}$$
When derivatives are too hard to do by calculus, they can be approximated by finite differences.

SLIDE 66

Plug-In for Asymptotic Variance (cont.)

The R code on the computer examples web page about maximum likelihood for the gamma distribution with shape parameter $\alpha$ unknown and rate parameter $\lambda = 1$ known:

Rweb> n <- length(x)
Rweb> mlogl <- function(a) sum(- dgamma(x, a, log = TRUE))
Rweb> out <- nlm(mlogl, mean(x), hessian = TRUE, fscale = n)
Rweb> ahat <- out$estimate
Rweb> z <- qnorm(0.975)
Rweb> ahat + c(-1, 1) * z / sqrt(n * trigamma(ahat))
[1] 1.271824 2.065787
Rweb> ahat + c(-1, 1) * z / sqrt(out$hessian)
[1] 1.271798 2.065813

SLIDE 67

The Information Inequality

Suppose $\hat{\theta}_n$ is any unbiased estimator of $\theta$. Then
$$\mathrm{var}_\theta(\hat{\theta}_n) \ge I_n(\theta)^{-1}$$
which is called the information inequality or the Cramér-Rao lower bound. Proof:
$$\mathrm{cov}_\theta\{\hat{\theta}, l'(\theta)\} = E_\theta\{\hat{\theta} \cdot l'(\theta)\} - E_\theta(\hat{\theta})\, E_\theta\{l'(\theta)\} = E_\theta\{\hat{\theta} \cdot l'(\theta)\}$$
because of the first log likelihood derivative identity.

SLIDE 68

The Information Inequality (cont.)

And
$$E_\theta\{\hat{\theta} \cdot l'(\theta)\} = \int \hat{\theta}(x) \left( \frac{1}{f_\theta(x)} \cdot \frac{d}{d\theta} f_\theta(x) \right) f_\theta(x)\, dx = \int \hat{\theta}(x)\, \frac{d}{d\theta} f_\theta(x)\, dx = \frac{d}{d\theta} \int \hat{\theta}(x) f_\theta(x)\, dx = \frac{d}{d\theta} E_\theta(\hat{\theta})$$
assuming differentiation under the integral sign is valid.

SLIDE 69

The Information Inequality (cont.)

By assumption $\hat{\theta}$ is unbiased, which means
$$E_\theta(\hat{\theta}) = \theta$$
and
$$\frac{d}{d\theta} E_\theta(\hat{\theta}) = 1$$
Hence
$$\mathrm{cov}_\theta\{\hat{\theta}, l'(\theta)\} = 1$$

SLIDE 70

The Information Inequality (cont.)

From the correlation inequality (5101, deck 6, slide 61)
$$1 \ge \mathrm{cor}_\theta\{\hat{\theta}, l'(\theta)\}^2 = \frac{\mathrm{cov}_\theta\{\hat{\theta}, l'(\theta)\}^2}{\mathrm{var}_\theta(\hat{\theta})\, \mathrm{var}_\theta\{l'(\theta)\}} = \frac{1}{\mathrm{var}_\theta(\hat{\theta})\, I(\theta)}$$
from which the information inequality follows immediately.

SLIDE 71

The Information Inequality (cont.)

The information inequality says no unbiased estimator can be more efficient than the MLE. But what about biased estimators? They can be more efficient. An estimator that is better than the MLE in the ARE sense is called superefficient, and such estimators do exist. The Hájek convolution theorem says no estimator that is asymptotically unbiased in a certain sense can be superefficient. The Le Cam convolution theorem says no estimator can be superefficient except at a set of true unknown parameter points of measure zero.

SLIDE 72

The Information Inequality (cont.)

In summary, the MLE is about as efficient as an estimator can be. For exact theory, we only know that no unbiased estimator can be superefficient. For asymptotic theory, we know that no estimator can be superefficient except at a negligible set of true unknown parameter values.

SLIDE 73

Multiparameter Maximum Likelihood

The basic ideas are the same when there are multiple unknown parameters rather than just one. We have to generalize each topic:

  • conditions for local and global maxima,
  • log likelihood derivative identities,
  • Fisher information, and
  • asymptotics and plug-in.

SLIDE 74

Multivariate Differentiation

This topic was introduced last semester (5101, deck 7, slides 96–98). Here we review and specialize to scalar-valued functions. If $W$ is an open region of $\mathbb{R}^p$, then $f : W \to \mathbb{R}$ is differentiable if all partial derivatives exist and are continuous, in which case the vector of partial derivatives evaluated at $x$ is called the gradient vector at $x$ and is denoted $\nabla f(x)$.
$$\nabla f(x) = \begin{pmatrix} \partial f(x)/\partial x_1 \\ \partial f(x)/\partial x_2 \\ \vdots \\ \partial f(x)/\partial x_p \end{pmatrix}$$

SLIDE 75

Multivariate Differentiation (cont.)

If $W$ is an open region of $\mathbb{R}^p$, then $f : W \to \mathbb{R}$ is twice differentiable if all second partial derivatives exist and are continuous, in which case the matrix of second partial derivatives evaluated at $x$ is called the Hessian matrix at $x$ and is denoted $\nabla^2 f(x)$.
$$\nabla^2 f(x) = \begin{pmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_p} \\ \frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_p} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_p \partial x_1} & \frac{\partial^2 f(x)}{\partial x_p \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_p^2} \end{pmatrix}$$

SLIDE 76

Local Maxima

Suppose $W$ is an open region of $\mathbb{R}^p$ and $f : W \to \mathbb{R}$ is a twice-differentiable function. A necessary condition for a point $x \in W$ to be a local maximum of $f$ is
$$\nabla f(x) = 0$$
and a sufficient condition for $x$ to be a local maximum is that $\nabla^2 f(x)$ is a negative definite matrix.

SLIDE 77

Positive Definite Matrices

A symmetric matrix $M$ is positive semi-definite (5101, deck 2, slides 68–69 and deck 5, slides 103–105) if
$$w^T M w \ge 0, \quad \text{for all vectors } w$$
and positive definite if
$$w^T M w > 0, \quad \text{for all nonzero vectors } w$$
A symmetric matrix $M$ is negative semi-definite if $-M$ is positive semi-definite, and $M$ is negative definite if $-M$ is positive definite.

SLIDE 78

Positive Definite Matrices (cont.)

There are two ways to check that the Hessian matrix is negative definite. First, one can try to verify that
$$\sum_{i=1}^p \sum_{j=1}^p w_i w_j \frac{\partial^2 f(x)}{\partial x_i \partial x_j} < 0$$
holds for all real numbers $w_1, \ldots, w_p$, at least one of which is nonzero (5101, deck 2, slides 68–69). This is hard. Second, one can verify that all the eigenvalues are negative (5101, deck 5, slides 103–105). This can be done by computer, but can only be applied to a numerical matrix that has particular values plugged in for all variables and parameters.

SLIDE 79

Local Maxima (cont.)

The first-order condition for a local maximum is not much harder than before: set all first partial derivatives to zero and solve for the variables. The second-order condition is harder when done by hand. The computer check that all eigenvalues are negative is easy.

SLIDE 80

Examples (cont.)

The log likelihood for the two-parameter normal model is
$$l_n(\mu, \nu) = -\frac{n}{2} \log(\nu) - \frac{n v_n}{2 \nu} - \frac{n (\bar{x}_n - \mu)^2}{2 \nu}$$
(slide 10). The first partial derivatives are
$$\frac{\partial l_n(\mu, \nu)}{\partial \mu} = \frac{n (\bar{x}_n - \mu)}{\nu}$$
$$\frac{\partial l_n(\mu, \nu)}{\partial \nu} = -\frac{n}{2 \nu} + \frac{n v_n}{2 \nu^2} + \frac{n (\bar{x}_n - \mu)^2}{2 \nu^2}$$

SLIDE 81

Examples (cont.)

The second partial derivatives are
$$\frac{\partial^2 l_n(\mu, \nu)}{\partial \mu^2} = -\frac{n}{\nu}$$
$$\frac{\partial^2 l_n(\mu, \nu)}{\partial \mu\, \partial \nu} = -\frac{n (\bar{x}_n - \mu)}{\nu^2}$$
$$\frac{\partial^2 l_n(\mu, \nu)}{\partial \nu^2} = \frac{n}{2 \nu^2} - \frac{n v_n}{\nu^3} - \frac{n (\bar{x}_n - \mu)^2}{\nu^3}$$

SLIDE 82

Examples (cont.)

Setting the first partial derivative with respect to $\mu$ equal to zero and solving for $\mu$ gives
$$\mu = \bar{x}_n$$
Plugging that into the first partial derivative with respect to $\nu$ set equal to zero gives
$$-\frac{n}{2 \nu} + \frac{n v_n}{2 \nu^2} = 0$$
and solving for $\nu$ gives
$$\nu = v_n$$

SLIDE 83

Examples (cont.)

Thus the MLEs for the two-parameter normal model are
$$\hat{\mu}_n = \bar{x}_n$$
$$\hat{\nu}_n = v_n$$
and we can also denote the latter $\hat{\sigma}_n^2 = v_n$.

SLIDE 84

Examples (cont.)

Plugging the MLE into the second partial derivatives gives
$$\frac{\partial^2 l_n(\hat{\mu}_n, \hat{\nu}_n)}{\partial \mu^2} = -\frac{n}{\hat{\nu}_n}$$
$$\frac{\partial^2 l_n(\hat{\mu}_n, \hat{\nu}_n)}{\partial \mu\, \partial \nu} = 0$$
$$\frac{\partial^2 l_n(\hat{\mu}_n, \hat{\nu}_n)}{\partial \nu^2} = \frac{n}{2 \hat{\nu}_n^2} - \frac{n v_n}{\hat{\nu}_n^3} = -\frac{n}{2 \hat{\nu}_n^2}$$
Hence the Hessian matrix is diagonal, and is negative definite if each of the diagonal terms is negative (5101, deck 5, slide 106), which they are. Thus the MLE is a local maximizer of the log likelihood.
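The eigenvalue check from slide 78 can be carried out by computer once numbers are plugged in; the values $n = 50$ and $\hat{\nu}_n = 2$ below are hypothetical:

```python
import numpy as np

# Hessian of the two-parameter normal log likelihood at the MLE,
# for hypothetical values n = 50 and nu_hat = 2.0.
n, nu_hat = 50, 2.0
hessian = np.array([[-n / nu_hat, 0.0],
                    [0.0, -n / (2 * nu_hat**2)]])

# All eigenvalues negative means the Hessian is negative definite,
# so the MLE is a local maximizer.
eigenvalues = np.linalg.eigvalsh(hessian)
print(eigenvalues)  # both entries negative: -25 and -6.25
```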

SLIDE 85

Global Maxima

A region $W$ of $\mathbb{R}^p$ is convex if
$$s x + (1 - s) y \in W, \quad \text{whenever } x \in W \text{ and } y \in W \text{ and } 0 < s < 1$$
Suppose $W$ is an open convex region of $\mathbb{R}^p$ and $f : W \to \mathbb{R}$ is a twice-differentiable function. If $\nabla^2 f(y)$ is a negative definite matrix for all $y \in W$, then $f$ is called strictly concave. In this case
$$\nabla f(x) = 0$$
is a sufficient condition for $x$ to be the unique global maximum.

SLIDE 86

Log Likelihood Derivative Identities

The same differentiation under the integral sign argument applied to partial derivatives results in
$$E_\theta\left\{ \frac{\partial l_n(\theta)}{\partial \theta_i} \right\} = 0$$
$$E_\theta\left\{ \frac{\partial l_n(\theta)}{\partial \theta_i} \cdot \frac{\partial l_n(\theta)}{\partial \theta_j} \right\} = -E_\theta\left\{ \frac{\partial^2 l_n(\theta)}{\partial \theta_i \partial \theta_j} \right\}$$
which can be rewritten in matrix notation as
$$E_\theta\{\nabla l_n(\theta)\} = 0$$
$$\mathrm{var}_\theta\{\nabla l_n(\theta)\} = -E_\theta\{\nabla^2 l_n(\theta)\}$$
(compare with slide 38).

SLIDE 87

Fisher Information

As in the uniparameter case, either side of the second log likelihood derivative identity is called Fisher information
$$I_n(\theta) = \mathrm{var}_\theta\{\nabla l_n(\theta)\} = -E_\theta\{\nabla^2 l_n(\theta)\}$$
Being a variance matrix, the Fisher information matrix is symmetric and positive semi-definite. Usually the Fisher information matrix is actually positive definite, and we will always assume this.

SLIDE 88

Examples (cont.)

Returning to the two-parameter normal model, taking expectations of the second partial derivatives gives
$$E_{\mu,\nu}\left\{ \frac{\partial^2 l_n(\mu, \nu)}{\partial \mu^2} \right\} = -\frac{n}{\nu}$$
$$E_{\mu,\nu}\left\{ \frac{\partial^2 l_n(\mu, \nu)}{\partial \mu\, \partial \nu} \right\} = -\frac{n E_{\mu,\nu}(\overline{X}_n - \mu)}{\nu^2} = 0$$
$$E_{\mu,\nu}\left\{ \frac{\partial^2 l_n(\mu, \nu)}{\partial \nu^2} \right\} = \frac{n}{2 \nu^2} - \frac{n E_{\mu,\nu}(V_n)}{\nu^3} - \frac{n E_{\mu,\nu}\{(\overline{X}_n - \mu)^2\}}{\nu^3} = \frac{n}{2 \nu^2} - \frac{(n - 1) E_{\mu,\nu}(S_n^2)}{\nu^3} - \frac{n\, \mathrm{var}_{\mu,\nu}(\overline{X}_n)}{\nu^3} = -\frac{n}{2 \nu^2}$$

slide-89
SLIDE 89

Examples (cont.) Hence for the two-parameter normal model the Fisher information matrix is
\[
I_n(\theta) = \begin{pmatrix} n/\nu & 0 \\ 0 & n/(2\nu^2) \end{pmatrix}
\]

89
slide-90
SLIDE 90

Asymptotics for Log Likelihood Derivatives (cont.) The same CLT argument applied to the gradient vector gives
\[
n^{-1/2} \nabla l_n(\theta_0) \xrightarrow{D} N\bigl(0, I_1(\theta_0)\bigr)
\]
and the same LLN argument applied to the Hessian matrix gives
\[
-n^{-1} \nabla^2 l_n(\theta_0) \xrightarrow{P} I_1(\theta_0)
\]
These are multivariate convergence in distribution and multivariate convergence in probability statements (5101, deck 7, slides 73–78 and 79–85).

90

slide-91
SLIDE 91

Asymptotics for MLE (cont.) The same argument used in the univariate case (expand the gradient of the log likelihood in a Taylor series, assume terms after the first two are negligible, and apply Slutsky) gives for the asymptotics of the MLE
\[
\sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow{D} N\bigl(0, I_1(\theta_0)^{-1}\bigr)
\]
or the sloppy version
\[
\hat{\theta}_n \approx N\bigl(\theta_0, I_n(\theta_0)^{-1}\bigr)
\]
(compare slides 46–50). Since Fisher information is a matrix, $I_n(\theta_0)^{-1}$ must be a matrix inverse.

91

slide-92
SLIDE 92

Examples (cont.) Returning to the two-parameter normal model, inverse Fisher information is
\[
I_n(\theta)^{-1}
= \begin{pmatrix} \nu/n & 0 \\ 0 & 2\nu^2/n \end{pmatrix}
= \begin{pmatrix} \sigma^2/n & 0 \\ 0 & 2\sigma^4/n \end{pmatrix}
\]
Because the asymptotic covariance is zero, the two components of the MLE are asymptotically independent (actually we know they are exactly, not just asymptotically, independent, deck 1, slide 58 ff.) and their asymptotic distributions are
\[
\bar{X}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right)
\qquad
V_n \approx N\left(\sigma^2, \frac{2\sigma^4}{n}\right)
\]

92
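A small Monte Carlo sketch (not from the slides, in Python rather than the deck's Rweb, with made-up simulation settings) checking that $\bar{X}_n$ and $V_n$ behave as the inverse Fisher information predicts: $\operatorname{var}(\bar{X}_n) \approx \sigma^2/n$, $\operatorname{var}(V_n) \approx 2\sigma^4/n$, and the two are (asymptotically) uncorrelated.

```python
# Sketch: simulate many normal samples and compare the empirical
# variances of xbar_n and V_n with the asymptotic values from the
# inverse Fisher information matrix.
import random
import statistics

random.seed(42)
mu, sigma, n, reps = 0.0, 1.0, 40, 5000

xbars, vns = [], []
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    xbars.append(xbar)
    vns.append(sum((xi - xbar) ** 2 for xi in x) / n)

var_xbar = statistics.pvariance(xbars)
var_vn = statistics.pvariance(vns)
mx, mv = statistics.mean(xbars), statistics.mean(vns)
cov = sum((a - mx) * (b - mv) for a, b in zip(xbars, vns)) / reps

# compare with the asymptotic variances sigma^2/n and 2 sigma^4/n
assert abs(var_xbar - sigma**2 / n) < 0.005
assert abs(var_vn - 2 * sigma**4 / n) < 0.01
assert abs(cov) < 0.01       # asymptotic (here, exact) independence
```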
slide-93
SLIDE 93

Examples (cont.) We already knew these asymptotic distributions, the former being the CLT and the latter being homework problem 4-9.

93

slide-94
SLIDE 94

Examples (cont.) Now for something we didn't already know. Taking logs in the formula for the likelihood of the gamma distribution (slide 55) gives
\[
\begin{aligned}
l_n(\alpha, \lambda)
&= n\alpha \log \lambda - n \log \Gamma(\alpha) + (\alpha - 1) \log\left( \prod_{i=1}^n x_i \right) - \lambda \sum_{i=1}^n x_i \\
&= n\alpha \log \lambda - n \log \Gamma(\alpha) + (\alpha - 1) \sum_{i=1}^n \log(x_i) - \lambda \sum_{i=1}^n x_i \\
&= n\alpha \log \lambda - n \log \Gamma(\alpha) + n(\alpha - 1)\bar{y}_n - n\lambda \bar{x}_n
\end{aligned}
\]
where
\[
\bar{y}_n = \frac{1}{n} \sum_{i=1}^n \log(x_i)
\]

94

slide-95
SLIDE 95

Examples (cont.)
\[
\begin{aligned}
\frac{\partial l_n(\alpha, \lambda)}{\partial \alpha} &= n \log \lambda - n \operatorname{digamma}(\alpha) + n\bar{y}_n \\
\frac{\partial l_n(\alpha, \lambda)}{\partial \lambda} &= \frac{n\alpha}{\lambda} - n\bar{x}_n \\
\frac{\partial^2 l_n(\alpha, \lambda)}{\partial \alpha^2} &= -n \operatorname{trigamma}(\alpha) \\
\frac{\partial^2 l_n(\alpha, \lambda)}{\partial \alpha \, \partial \lambda} &= \frac{n}{\lambda} \\
\frac{\partial^2 l_n(\alpha, \lambda)}{\partial \lambda^2} &= -\frac{n\alpha}{\lambda^2}
\end{aligned}
\]

95

slide-96
SLIDE 96

Examples (cont.) If we set the first partial derivatives equal to zero and solve for the parameters, we find we cannot. The MLE can only be found by the computer, maximizing the log likelihood for particular data. We do, however, know the asymptotic distribution of the MLE
\[
\begin{pmatrix} \hat{\alpha}_n \\ \hat{\lambda}_n \end{pmatrix}
\approx N\left( \begin{pmatrix} \alpha \\ \lambda \end{pmatrix}, I_n(\theta)^{-1} \right)
\]
where
\[
I_n(\theta) = \begin{pmatrix} n \operatorname{trigamma}(\alpha) & -n/\lambda \\ -n/\lambda & n\alpha/\lambda^2 \end{pmatrix}
\]

96
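A sketch (not from the slides, in Python rather than the deck's Rweb) of finding the gamma MLE numerically, as the slide says must be done. Assumptions: the data are simulated with made-up parameter values, digamma and trigamma are approximated by finite differences of `math.lgamma`, $\lambda$ is profiled out via $\hat{\lambda} = \alpha/\bar{x}_n$, and the Newton iteration starts at the method of moments estimate (cf. the later slide on starting points for optimization).

```python
# Sketch: gamma MLE via Newton's method on the profiled score in alpha.
import math
import random

def digamma(a, h=1e-5):
    # central finite difference of log Gamma (an approximation)
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2 * h)

def trigamma(a, h=1e-4):
    # second finite difference of log Gamma (an approximation)
    return (math.lgamma(a + h) - 2 * math.lgamma(a) + math.lgamma(a - h)) / h**2

random.seed(1)
# simulate data: true alpha = 2.5, true rate lambda = 1.5 (scale = 1/1.5)
x = [random.gammavariate(2.5, 1.0 / 1.5) for _ in range(500)]
n = len(x)
xbar = sum(x) / n
ybar = sum(math.log(xi) for xi in x) / n

# Setting d l_n / d lambda = 0 gives lambda = alpha / xbar; substituting
# into d l_n / d alpha = 0 leaves one equation in alpha, solved by Newton,
# starting from the method of moments estimate.
alpha = xbar**2 / (sum((xi - xbar)**2 for xi in x) / n)
for _ in range(50):
    g = math.log(alpha) - math.log(xbar) + ybar - digamma(alpha)
    gprime = 1.0 / alpha - trigamma(alpha)
    alpha -= g / gprime
lam = alpha / xbar

# both score equations should now be (nearly) zero
assert abs(n * math.log(lam) - n * digamma(alpha) + n * ybar) < 1e-6
assert abs(n * alpha / lam - n * xbar) < 1e-6
```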
slide-97
SLIDE 97

Plug-In for Asymptotic Variance (cont.) As always, since we don't know $\theta$, we must use a plug-in estimate for asymptotic variance. As in the uniparameter case, we can use either expected Fisher information
\[
\hat{\theta}_n \approx N\bigl(\theta, I_n(\hat{\theta}_n)^{-1}\bigr)
\]
or observed Fisher information
\[
\hat{\theta}_n \approx N\bigl(\theta, J_n(\hat{\theta}_n)^{-1}\bigr)
\]
where
\[
J_n(\theta) = -\nabla^2 l_n(\theta)
\]

97

slide-98
SLIDE 98

Caution There is a big difference between the Right Thing (standard errors for the MLE are square roots of diagonal elements of the inverse Fisher information matrix) and the Wrong Thing (reciprocals of square roots of diagonal elements of the Fisher information matrix).

Rweb:> fish
          [,1]      [,2]
[1,]  24.15495 -29.71683
[2,] -29.71683  49.46866
Rweb:> 1 / sqrt(diag(fish)) # Wrong Thing
[1] 0.2034684 0.1421788
Rweb:> sqrt(diag(solve(fish))) # Right Thing
[1] 0.3983007 0.2783229

98
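The same Right Thing / Wrong Thing computation can be redone (not from the slides, in Python rather than Rweb) with an explicit $2 \times 2$ matrix inverse, reproducing the numbers in the session above.

```python
# Sketch: invert the Fisher information matrix by hand and compare the
# correct standard errors with the wrong ones.
import math

fish = [[24.15495, -29.71683],
        [-29.71683, 49.46866]]

det = fish[0][0] * fish[1][1] - fish[0][1] * fish[1][0]
inv_diag = [fish[1][1] / det, fish[0][0] / det]   # diagonal of fish^{-1}

wrong = [1 / math.sqrt(fish[i][i]) for i in range(2)]   # 1/sqrt(diag(fish))
right = [math.sqrt(inv_diag[i]) for i in range(2)]      # sqrt(diag(solve(fish)))

assert abs(wrong[0] - 0.2034684) < 1e-5 and abs(wrong[1] - 0.1421788) < 1e-5
assert abs(right[0] - 0.3983007) < 1e-5 and abs(right[1] - 0.2783229) < 1e-5
```

The Wrong Thing ignores the off-diagonal terms of the information matrix, which is why its standard errors come out much too small here.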

slide-99
SLIDE 99

Starting Points for Optimization When a maximum likelihood problem is not concave, there can be more than one local maximum. Theory says one of those local maxima is the efficient estimator which has inverse Fisher information for its asymptotic variance. The rest of the local maxima are no good. How to find the right one? Theory says to start the optimization at a "root $n$ consistent" estimator $\tilde{\theta}_n$, that is, one satisfying
\[
\tilde{\theta}_n = \theta_0 + O_p(n^{-1/2})
\]
Any CAN estimator satisfies this, for example, method of moments estimators and sample quantiles.

99

slide-100
SLIDE 100

Invariance of Maximum Likelihood If $\psi = g(\theta)$ is an invertible change-of-parameter, and $\hat{\theta}_n$ is the MLE for $\theta$, then $\hat{\psi}_n = g(\hat{\theta}_n)$ is the MLE for $\psi$.

This is obvious if one thinks of $\psi$ and $\theta$ as locations in different coordinate systems for the same geometric object, which denotes a probability distribution. The likelihood function, while defined as a function of the parameter, clearly only depends on the distribution the parameter indicates. Hence this invariance.

100

slide-101
SLIDE 101

Invariance of Maximum Likelihood (cont.) This invariance does not extend to derivatives of the log likelihood. If $I_n(\theta)$ is the Fisher information matrix for $\theta$ and $\tilde{I}_n(\psi)$ is the Fisher information matrix for $\psi$, then the chain rule and the log likelihood derivative identities give
\[
I_n(\theta) = \bigl(\nabla g(\theta)\bigr)^T \tilde{I}_n(g(\theta)) \nabla g(\theta)
\]
101
slide-102
SLIDE 102

Invariance of Maximum Likelihood (cont.) We only do the one-parameter case. Write $l_n(\theta) = \tilde{l}_n(g(\theta))$. Then
\[
\begin{aligned}
l_n'(\theta) &= \tilde{l}_n'(g(\theta))\, g'(\theta) \\
l_n''(\theta) &= \tilde{l}_n''(g(\theta))\, g'(\theta)^2 + \tilde{l}_n'(g(\theta))\, g''(\theta)
\end{aligned}
\]
Taking expectations, the second term in the second derivative is zero by the first log likelihood derivative identity. This leaves the one-parameter case of what was to be proved.

102
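The one-parameter transformation rule can be checked numerically (not from the slides, a Python illustration) on the binomial model, where the Fisher information for the success probability is $n/(p(1-p))$, the Fisher information for $\psi = \operatorname{logit}(p)$ is $n p (1-p)$, and $g'(p) = 1/(p(1-p))$.

```python
# Sketch: verify I(p) = g'(p)^2 * Itilde(g(p)) for the binomial model
# with g = logit, at several values of p.
n = 20
for p in [0.1, 0.3, 0.5, 0.8]:
    info_p = n / (p * (1 - p))      # Fisher information for p
    info_psi = n * p * (1 - p)      # Fisher information for psi = logit(p)
    gprime = 1 / (p * (1 - p))      # d logit(p) / dp
    assert abs(info_p - gprime**2 * info_psi) < 1e-9
```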

slide-103
SLIDE 103

Non-Existence of Global Maxima The web page about maximum likelihood does a normal mixture model. The data are IID with PDF
\[
f(x) = \frac{p}{\sigma_1}\, \varphi\left( \frac{x - \mu_1}{\sigma_1} \right)
+ \frac{1 - p}{\sigma_2}\, \varphi\left( \frac{x - \mu_2}{\sigma_2} \right)
\]
where $\varphi$ is the standard normal PDF. Hence the log likelihood is
\[
\begin{aligned}
l_n(\theta)
&= \sum_{i=1}^n \log\left[ \frac{p}{\sigma_1}\, \varphi\left( \frac{x_i - \mu_1}{\sigma_1} \right)
+ \frac{1 - p}{\sigma_2}\, \varphi\left( \frac{x_i - \mu_2}{\sigma_2} \right) \right] \\
&= \sum_{i=1}^n \log\left[ \frac{p}{\sqrt{2\pi}\,\sigma_1} \exp\left( \frac{-(x_i - \mu_1)^2}{2\sigma_1^2} \right)
+ \frac{1 - p}{\sqrt{2\pi}\,\sigma_2} \exp\left( \frac{-(x_i - \mu_2)^2}{2\sigma_2^2} \right) \right]
\end{aligned}
\]

103
slide-104
SLIDE 104

Non-Existence of Global Maxima (cont.) If we set $\mu_1 = x_i$ for some $i$, then the $i$-th term of the log likelihood becomes
\[
\log\left[ \frac{p}{\sqrt{2\pi}\,\sigma_1}
+ \frac{1 - p}{\sqrt{2\pi}\,\sigma_2} \exp\left( \frac{-(x_i - \mu_2)^2}{2\sigma_2^2} \right) \right]
\]
and this goes to infinity as $\sigma_1 \to 0$. Hence the supremum of the log likelihood is $+\infty$ and no values of the parameters achieve the supremum. Nevertheless, the good local maximizer is the efficient estimator.

104
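The blow-up can be seen numerically (not from the slides, a Python illustration with made-up parameter values): with $\mu_1$ pinned at $x_1$, the $i = 1$ term of the mixture log likelihood grows without bound as $\sigma_1 \to 0$.

```python
# Sketch: evaluate the i = 1 mixture log-density term with mu1 = x1
# for a decreasing sequence of sigma1 values and watch it diverge.
import math

x1, p, mu2, sigma2 = 1.0, 0.5, 0.0, 1.0   # made-up values

def term(sigma1):
    # log of the i = 1 mixture density term with mu1 = x1
    return math.log(p / (math.sqrt(2 * math.pi) * sigma1)
                    + (1 - p) / (math.sqrt(2 * math.pi) * sigma2)
                      * math.exp(-(x1 - mu2)**2 / (2 * sigma2**2)))

values = [term(10.0**(-k)) for k in range(1, 7)]   # sigma1 = 0.1, ..., 1e-6
assert all(a < b for a, b in zip(values, values[1:]))   # strictly increasing
assert values[-1] > 10   # already far above any fixed bound
```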

slide-105
SLIDE 105

Exponential Families of Distributions A statistical model is called an exponential family if the log likelihood has the form
\[
l(\theta) = \sum_{i=1}^p t_i(x)\, g_i(\theta) - h(\theta) + u(x)
\]
and the last term can, of course, be dropped. The only term in the log likelihood that contains both statistics and parameters is a finite sum of terms, each of which is a product of a function of the data times a function of the parameter.

105

slide-106
SLIDE 106

Exponential Families of Distributions (cont.) Introduce new statistics and parameters
\[
y_i = t_i(x) \qquad \psi_i = g_i(\theta)
\]
which are components of the natural statistic vector $y$ and the natural parameter vector $\psi$. That $\psi$ is actually a parameter is shown by the fact that the PDF or PMF must integrate or sum to one for all $\theta$. Hence $h(\theta)$ must actually be a function of $\psi$.

106

slide-107
SLIDE 107

Exponential Families of Distributions (cont.) The log likelihood in terms of natural parameters and statistics has the simple form
\[
l(\psi) = y^T \psi - c(\psi)
\]
and derivatives
\[
\nabla l(\psi) = y - \nabla c(\psi) \qquad \nabla^2 l(\psi) = -\nabla^2 c(\psi)
\]

107

slide-108
SLIDE 108

Exponential Families of Distributions (cont.) The log likelihood derivative identities give
\[
E_\psi(Y) = \nabla c(\psi) \qquad \operatorname{var}_\psi(Y) = \nabla^2 c(\psi)
\]
Hence the MLE is a method of moments estimator that sets the observed value of the natural statistic vector equal to its expected value. The second derivative of the log likelihood is always nonrandom, so observed and expected Fisher information for the natural parameter vector are the same, and the log likelihood for the natural parameter is always concave, and strictly concave unless the distribution of the natural statistic is degenerate. Hence any local maximizer of the log likelihood is the unique global maximizer.

108
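The "observed = expected" characterization can be illustrated (not from the slides, a Python sketch with made-up count data) on the Poisson model, an exponential family with natural statistic $y = \sum x_i$, natural parameter $\psi = \log \lambda$, and cumulant function $c(\psi) = n e^\psi$. Setting $\nabla c(\hat{\psi}) = y$ gives $\hat{\lambda} = \bar{x}_n$.

```python
# Sketch: Poisson MLE as moment matching on the natural statistic.
import math

x = [3, 1, 4, 1, 5, 9, 2, 6]      # made-up count data
n, y = len(x), sum(x)

psihat = math.log(y / n)          # solves grad c(psi) = n * exp(psi) = y
lambdahat = math.exp(psihat)

assert abs(n * math.exp(psihat) - y) < 1e-9     # observed = expected
assert abs(lambdahat - sum(x) / n) < 1e-12      # MLE is the sample mean
```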

slide-109
SLIDE 109

Exponential Families of Distributions (cont.) By invariance of maximum likelihood, the property that any local maximizer is the unique global maximizer holds for any parametrization. Brand name distributions that are exponential families: Bernoulli, binomial, Poisson, geometric, negative binomial (p unknown, r known), normal, exponential, gamma, beta, multinomial, multivariate normal.

109

slide-110
SLIDE 110

Exponential Families of Distributions (cont.) For binomial, the log likelihood is
\[
l(p) = x \log(p) + (n - x) \log(1 - p)
= x \bigl[ \log(p) - \log(1 - p) \bigr] + n \log(1 - p)
\]
hence is exponential family with natural statistic $x$ and natural parameter
\[
\theta = \operatorname{logit}(p) = \log(p) - \log(1 - p) = \log\left( \frac{p}{1 - p} \right)
\]

110
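The algebraic rearrangement above can be checked numerically (not from the slides, a Python sketch): the binomial log likelihood equals $x \operatorname{logit}(p) + n \log(1 - p)$ at every $p$.

```python
# Sketch: verify the exponential family form of the binomial log
# likelihood at several values of p.
import math

def logit(p):
    return math.log(p) - math.log(1 - p)

n, x = 10, 7
for p in [0.2, 0.5, 0.7, 0.9]:
    lhs = x * math.log(p) + (n - x) * math.log(1 - p)
    rhs = x * logit(p) + n * math.log(1 - p)
    assert abs(lhs - rhs) < 1e-12
```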
slide-111
SLIDE 111

Exponential Families of Distributions (cont.) Suppose $X_1, \ldots, X_n$ are IID from an exponential family distribution with log likelihood for sample size one
\[
l_1(\theta) = \sum_{i=1}^p t_i(x)\, g_i(\theta) - h(\theta)
\]
Then the log likelihood for sample size $n$ is
\[
l_n(\theta) = \sum_{i=1}^p \left[ \sum_{j=1}^n t_i(x_j) \right] g_i(\theta) - n h(\theta)
\]
hence the distribution for sample size $n$ is also an exponential family with the same natural parameter vector as for sample size one and natural statistic vector $y$ with components
\[
y_i = \sum_{j=1}^n t_i(x_j)
\]

111

slide-112
SLIDE 112

Exponential Families of Distributions (cont.) For the two-parameter normal, the log likelihood for sample size one is
\[
\begin{aligned}
l_1(\mu, \sigma^2)
&= -\tfrac{1}{2} \log(\sigma^2) - \frac{1}{2\sigma^2}(x - \mu)^2 \\
&= -\tfrac{1}{2} \log(\sigma^2) - \frac{1}{2\sigma^2}(x^2 - 2x\mu + \mu^2) \\
&= -\frac{1}{2\sigma^2} \cdot x^2 + \frac{\mu}{\sigma^2} \cdot x - \tfrac{1}{2} \log(\sigma^2) - \frac{\mu^2}{2\sigma^2}
\end{aligned}
\]
Since this is a two-parameter family, the natural parameter and statistic must also be two dimensional. We can choose
\[
y = \begin{pmatrix} x \\ x^2 \end{pmatrix}
\qquad
\theta = \begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix}
\]

112
slide-113
SLIDE 113

Exponential Families of Distributions (cont.) It seems weird to think of the natural statistic being two-dimensional when we usually think of the normal distribution as being one-dimensional. But for sample size $n$, the natural statistics become
\[
y_1 = \sum_{i=1}^n x_i \qquad y_2 = \sum_{i=1}^n x_i^2
\]
and it no longer seems weird.

113
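Tying this back to the normal MLE (not from the slides, a Python sketch with made-up data): the MLE depends on the data only through the natural statistics, since $\hat{\mu}_n = y_1/n$ and $\hat{\nu}_n = y_2/n - (y_1/n)^2 = v_n$.

```python
# Sketch: recover the two-parameter normal MLE from the natural
# statistics y1 = sum(x_i) and y2 = sum(x_i^2) alone.
x = [2.3, 1.7, 3.1, 2.8, 2.1, 2.6]   # made-up data
n = len(x)
y1 = sum(x)
y2 = sum(xi**2 for xi in x)

muhat = y1 / n
nuhat = y2 / n - muhat**2            # shortcut formula for v_n

vn = sum((xi - muhat)**2 for xi in x) / n
assert abs(nuhat - vn) < 1e-12       # agrees with the direct computation
```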