

  1. Stat 8931 (Aster Models) Lecture Slides Deck 4. Charles J. Geyer, School of Statistics, University of Minnesota. June 7, 2015.

  2. The Delta Method The delta method is a method (duh!) of deriving the approximate distribution of a nonlinear function of an estimator from the approximate distribution of the estimator itself. What it does is linearize the nonlinear function. If g is a nonlinear, differentiable vector-to-vector function, the best linear approximation, which is the Taylor series up through linear terms, is g(y) − g(x) ≈ ∇g(x)(y − x), where ∇g(x) is the matrix of partial derivatives, sometimes called the Jacobian matrix. If g_i(x) denotes the i-th component of the vector g(x), then the (i, j)-th component of the Jacobian matrix is ∂g_i(x)/∂x_j.
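
A minimal numerical sketch of this linearization (not from the slides; the function g and the points x and y are made up for illustration), in Python with NumPy and a finite-difference Jacobian:

    # check g(y) - g(x) ~= grad_g(x) (y - x) for a made-up vector-to-vector g
    import numpy as np

    def g(x):
        # example nonlinear map from R^2 to R^2 (illustrative only)
        return np.array([np.exp(x[0]) * x[1], x[0] ** 2 + np.sin(x[1])])

    def jacobian(f, x, eps=1e-6):
        # forward-difference Jacobian; (i, j) entry approximates d f_i / d x_j
        fx = f(x)
        J = np.empty((fx.size, x.size))
        for j in range(x.size):
            step = np.zeros_like(x)
            step[j] = eps
            J[:, j] = (f(x + step) - fx) / eps
        return J

    x = np.array([0.5, 1.0])
    y = x + np.array([0.01, -0.02])            # y close to x
    print(g(y) - g(x))                         # exact difference
    print(jacobian(g, x) @ (y - x))            # linear approximation, nearly equal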

  3. The Delta Method (cont.) The delta method is particularly useful when θ̂ is an estimator and θ is the unknown true (vector) parameter value it estimates, and the delta method says g(θ̂) − g(θ) ≈ ∇g(θ)(θ̂ − θ). It is not necessary that θ and g(θ) be vectors of the same dimension. Hence it is not necessary that ∇g(θ) be a square matrix.

  4. The Delta Method (cont.) The delta method gives good or bad approximations depending on whether the spread of the distribution of θ̂ − θ is small or large compared to the nonlinearity of the function g in the neighborhood of θ. The Taylor series approximation the delta method uses is a good approximation for sufficiently small values of θ̂ − θ and a bad approximation for sufficiently large values of θ̂ − θ. So the overall method is good if those "sufficiently large" values have small probability, and bad otherwise.

  5. The Delta Method (cont.) As with nearly every application of approximation in statistics, we rarely (if ever) do the (very difficult) analysis to know whether the approximation is good or bad. We just use the delta method and hope it gives good results. If we are really worried, we can check it using simulation (also called the parametric bootstrap).

  6. The Delta Method (cont.) The delta method is particularly easy to use when the distribution of θ̂ − θ is multivariate normal, exactly or approximately. If it is only approximately normal, then this is another approximation in addition to the Taylor series approximation. The reason this is easy is that normal distributions are determined by their mean vector and variance matrix, and there is a theorem which gives the mean vector and variance matrix of a linear function of a random vector.

  7. The Delta Method (cont.) Theorem. Suppose X is a random vector, a is a nonrandom vector, and B is a nonrandom matrix such that a + BX makes sense (because a, B, and X have dimensions such that the indicated vector addition and matrix-vector multiplication are defined). Then E(a + BX) = a + B E(X) and var(a + BX) = B var(X) Bᵀ. A proof is given on slides 64–67 of deck 2 of my Stat 5101 course slides. Another way to say this is that if E(X) = µ and var(X) = V, then E(a + BX) = a + Bµ and var(a + BX) = BVBᵀ.
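
A quick Monte Carlo check of this theorem (not from the slides; a, B, µ, and V are made-up values):

    import numpy as np

    rng = np.random.default_rng(42)
    mu = np.array([1.0, -2.0, 0.5])
    V = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.0],
                  [0.1, 0.0, 0.5]])
    a = np.array([1.0, 2.0])
    B = np.array([[1.0, 0.0, 2.0],
                  [0.0, 3.0, -1.0]])                   # maps R^3 to R^2, so a + B X makes sense

    X = rng.multivariate_normal(mu, V, size=100_000)   # one draw of X per row
    Y = a + X @ B.T                                    # a + B X, rowwise

    print(Y.mean(axis=0), a + B @ mu)                  # sample mean vs a + B mu
    print(np.cov(Y, rowvar=False))                     # sample variance vs ...
    print(B @ V @ B.T)                                 # ... B V B^T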

  8. The Delta Method (cont.) So suppose θ̂ is normal with mean vector θ and variance matrix V, and write B = ∇g(θ). Then θ̂ − θ has mean vector 0 and variance matrix V, and E{g(θ̂) − g(θ)} ≈ 0 and var{g(θ̂) − g(θ)} ≈ BVBᵀ.

  9. The Delta Method (cont.) The Delta Method for Approximately Normal Estimators. Suppose θ̂ is approximately normal with mean vector θ and variance matrix V(θ). Suppose g is a vector-to-vector function with derivative ∇g(θ) = B(θ). Then g(θ̂) is approximately normal with mean vector g(θ) and variance matrix B(θ)V(θ)B(θ)ᵀ.
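
A minimal sketch of this result in use (not from the slides; g, θ, and V are made up for illustration): compute the delta-method variance B(θ)V(θ)B(θ)ᵀ and compare it with simulation.

    import numpy as np

    def g(theta):
        # example vector-to-vector function, R^2 -> R^2
        return np.array([np.exp(theta[0]), theta[0] * theta[1]])

    def B(theta):
        # exact Jacobian of g; (i, j) entry is d g_i / d theta_j
        return np.array([[np.exp(theta[0]), 0.0],
                         [theta[1], theta[0]]])

    theta = np.array([0.2, 1.5])             # "true" parameter (made up)
    V = np.array([[0.04, 0.01],
                  [0.01, 0.09]])             # variance matrix of theta_hat (made up)

    var_delta = B(theta) @ V @ B(theta).T    # delta-method variance of g(theta_hat)

    rng = np.random.default_rng(1)
    samples = rng.multivariate_normal(theta, V, size=100_000)
    g_hats = np.column_stack([np.exp(samples[:, 0]), samples[:, 0] * samples[:, 1]])
    print(var_delta)
    print(np.cov(g_hats, rowvar=False))      # close when V is small relative to the curvature of g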

  10. The Delta Method (cont.) An approximate confidence region for g(θ) is centered at g(θ̂) and has extent determined by B(θ)V(θ)B(θ)ᵀ. But we do not know that because we do not know θ (the true unknown parameter value). Thus we make a last approximation, plug in θ̂ for θ in the variance, and use B(θ̂)V(θ̂)B(θ̂)ᵀ. This is known as the plug-in principle. (For the statisticians in the audience, it is an application of Slutsky's theorem.)
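
A minimal sketch of a plug-in delta-method (Wald) interval for a scalar function g(θ) (not from the slides; the estimate, its estimated variance matrix, and g are hypothetical stand-ins for the output of an actual model fit):

    import numpy as np
    from scipy import stats

    theta_hat = np.array([0.8, -0.3])        # hypothetical MLE
    V_hat = np.array([[0.050, 0.010],
                      [0.010, 0.020]])       # hypothetical estimated variance of theta_hat

    def g(theta):
        return np.exp(theta[0] + theta[1])   # made-up scalar function of interest

    def grad_g(theta):
        return np.exp(theta[0] + theta[1]) * np.array([1.0, 1.0])

    point = g(theta_hat)
    b = grad_g(theta_hat)                    # B(theta_hat), here a single row
    se = np.sqrt(b @ V_hat @ b)              # plug-in standard error
    crit = stats.norm.ppf(0.975)             # 95% two-sided normal critical value
    print(point - crit * se, point + crit * se)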

  11. The Delta Method (cont.) Recall from deck 2 of these slides that the maximum likelihood estimator in an unconditional canonical affine submodel of an aster model can be written β̂ = h⁻¹(Mᵀy), where h is the transformation from canonical to mean value parameters given by h(β) = ∇c_sub(β) = Mᵀ∇c(a + Mβ) and has derivative ∇h(β) = ∇²c_sub(β) = Mᵀ∇²c(a + Mβ)M.
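
A minimal sketch of solving h(β) = Mᵀy by Newton's method (not from the slides; it uses an independent-Poisson cumulant function, c(ϕ) = Σ exp(ϕ_i), as a simple stand-in for the aster cumulant c, and M, a, and y are made up):

    import numpy as np

    def grad_c(phi):
        return np.exp(phi)                   # del c(phi), componentwise, for independent Poisson

    def hess_c(phi):
        return np.diag(np.exp(phi))          # del^2 c(phi), diagonal for independent Poisson

    rng = np.random.default_rng(3)
    M = 0.5 * rng.normal(size=(30, 2))       # hypothetical model matrix
    a = np.zeros(30)                         # hypothetical offset vector
    y = rng.poisson(lam=2.0, size=30)        # hypothetical response vector

    beta = np.zeros(2)
    for _ in range(50):                      # Newton iteration for h(beta) = M^T y
        phi = a + M @ beta
        score = M.T @ (y - grad_c(phi))      # M^T y - h(beta)
        fisher = M.T @ hess_c(phi) @ M       # del h(beta) = M^T del^2 c(a + M beta) M
        step = np.linalg.solve(fisher, score)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    print(beta)                              # beta_hat = h^{-1}(M^T y)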

  12. The Delta Method (cont.) And by the inverse function theorem of real analysis, the derivative of the inverse function is the (matrix) inverse of the derivative of the forward function: ∇h⁻¹(τ) = (∇h(β))⁻¹, when τ = h(β) and β = h⁻¹(τ).
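
A numerical check of the inverse function theorem (not from the slides; the map h here is a made-up smooth map on R² with a closed-form inverse, not the aster parameter map):

    import numpy as np

    def h(beta):
        return np.array([np.exp(beta[0]), beta[0] + beta[1] ** 3])

    def h_inverse(tau):
        b1 = np.log(tau[0])
        return np.array([b1, np.cbrt(tau[1] - b1)])

    def grad_h(beta):
        return np.array([[np.exp(beta[0]), 0.0],
                         [1.0, 3.0 * beta[1] ** 2]])

    beta = np.array([0.3, 0.7])
    tau = h(beta)

    eps = 1e-6                               # finite-difference Jacobian of h^{-1} at tau
    J_inv = np.empty((2, 2))
    for j in range(2):
        step = np.zeros(2)
        step[j] = eps
        J_inv[:, j] = (h_inverse(tau + step) - h_inverse(tau)) / eps

    print(J_inv)
    print(np.linalg.inv(grad_h(beta)))       # the two matrices nearly agree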

  13. Fisher Information The matrix that appeared in the derivative of the canonical-to-mean-value parameter map plays a very important role in likelihood inference. The observed Fisher information matrix is minus the second derivative matrix of the log likelihood. The expected Fisher information matrix is the expectation of the observed Fisher information matrix.

  14. Fisher Information (cont.) What Fisher information is depends on what the parameter is (what you are differentiating with respect to). It also depends on what the model is (what the log likelihood is). Thus, to be pedantically correct, we need decoration to indicate observed or expected, the model, and the parameter. Sometimes we are not so fussy and let the context indicate what we mean.

  15. Fisher Information (cont.) For log likelihood l for parameter ϕ, observed Fisher information (for this model and parameter) is I_obs(ϕ) = −∇²l(ϕ) and expected Fisher information (for this model and parameter) is I_exp(ϕ) = E_ϕ{I_obs(ϕ)} = E_ϕ{−∇²l(ϕ)}.
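
A small illustration of these two definitions (not from the slides, and not an aster model): observed versus expected Fisher information for a binomial(n, p) model parameterized by the success probability p, where observed information depends on the data but expected information does not.

    import numpy as np

    n, p = 20, 0.3
    rng = np.random.default_rng(7)

    def obs_info(y, p, n):
        # minus the second derivative of l(p) = y log p + (n - y) log(1 - p)
        return y / p ** 2 + (n - y) / (1.0 - p) ** 2

    exp_info = n / (p * (1.0 - p))           # E_p{ I_obs(p) }, the standard closed form

    y = rng.binomial(n, p)
    print(obs_info(y, p, n))                 # depends on the observed data y
    print(exp_info)                          # does not depend on the data

    ys = rng.binomial(n, p, size=200_000)
    print(obs_info(ys, p, n).mean())         # expectation of observed info: close to exp_info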

  16. Fisher Information (cont.) If this is the log likelihood for a full exponential family, l(ϕ) = ⟨y, ϕ⟩ − c(ϕ), then I_obs(ϕ) = −∇²l(ϕ) = ∇²c(ϕ), and since this is a nonrandom quantity, it is its own expectation (expectation of a constant is that constant), so I_exp(ϕ) = ∇²c(ϕ) too.
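
A small illustration (not from the slides): for a Bernoulli model in canonical (logit) parameterization, with c(ϕ) = log(1 + exp(ϕ)), minus the second derivative of the log likelihood does not involve the data at all, matching I(ϕ) = ∇²c(ϕ).

    import numpy as np

    def c_second_deriv(phi):
        p = 1.0 / (1.0 + np.exp(-phi))       # mean value parameter
        return p * (1.0 - p)                 # del^2 c(phi) for c(phi) = log(1 + exp(phi))

    def obs_info(y, phi, eps=1e-4):
        # numerical minus-second-derivative of l(phi) = y * phi - c(phi)
        def l(ph):
            return y * ph - np.log1p(np.exp(ph))
        return -(l(phi + eps) - 2.0 * l(phi) + l(phi - eps)) / eps ** 2

    phi = 0.7
    print(obs_info(y=0, phi=phi))            # same whether y = 0 ...
    print(obs_info(y=1, phi=phi))            # ... or y = 1 ...
    print(c_second_deriv(phi))               # ... and equal to del^2 c(phi)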

  17. Fisher Information (cont.) Thus for a full exponential family, in general, and for saturated aster models and their unconditional canonical affine submodels, in particular, there is no difference between observed and expected Fisher information for the unconditional canonical parameter, and we can just write I(ϕ) = ∇²c(ϕ).

  18. Fisher Information (cont.) But even restricting to Fisher information for the unconditional canonical parameter, we distinguish Fisher information for saturated models and canonical affine submodels: I_sat(ϕ) = ∇²c(ϕ) and I_sub(β) = ∇²c_sub(β) = Mᵀ∇²c(a + Mβ)M.
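
A minimal sketch contrasting the two matrices (not from the slides; the independent-Poisson cumulant is again a stand-in for the aster cumulant, and M, a, and β are made up): I_sat is n × n while I_sub is p × p.

    import numpy as np

    rng = np.random.default_rng(5)
    M = rng.normal(size=(8, 3))              # hypothetical model matrix (n = 8, p = 3)
    a = np.zeros(8)                          # hypothetical offset vector
    beta = np.array([0.1, -0.2, 0.3])        # hypothetical submodel parameter

    phi = a + M @ beta                       # saturated canonical parameter
    I_sat = np.diag(np.exp(phi))             # del^2 c(phi) for independent Poisson, n x n
    I_sub = M.T @ I_sat @ M                  # M^T del^2 c(a + M beta) M, p x p
    print(I_sat.shape, I_sub.shape)          # (8, 8) and (3, 3)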

  19. Fisher Information (cont.) To figure out Fisher information for other parameters, there are two ways to go: write the log likelihood in terms of the new parameter, differentiate it twice, negate it, and take an expectation if expected Fisher information is wanted; or prove a theorem about how Fisher information transforms under change-of-parameter. (The latter is just the former done abstractly and once and for all, rather than concretely and repeated for each problem.)

  20. Fisher Information Transforms by Covariance If ψ is another parameter, then ∂l(ψ)/∂ψ_i = Σ_k [∂l(ϕ)/∂ϕ_k][∂ϕ_k/∂ψ_i] (the multivariable chain rule), and ∂²l(ψ)/∂ψ_i∂ψ_j = Σ_k Σ_l [∂²l(ϕ)/∂ϕ_k∂ϕ_l][∂ϕ_k/∂ψ_i][∂ϕ_l/∂ψ_j] + Σ_k [∂l(ϕ)/∂ϕ_k][∂²ϕ_k/∂ψ_i∂ψ_j]. This is somewhat ugly. But if we plug in the MLE for ϕ, the second term is zero because ∇l(ϕ̂) = 0 (the first derivative is zero at the maximum). The second term also goes away for expected Fisher information because E_ϕ{∇l(ϕ)} = 0 by a differentiation under the integral sign argument proved in theoretical statistics courses (slides 33–35 and 86 of my 5102 course slides).

  21. Fisher Information Transforms by Covariance (cont.) This gives the transformation rules I_exp,ψ(ψ) = B(ψ)ᵀ I_exp,ϕ(ϕ) B(ψ), where ϕ = h(ψ) and B(ψ) = ∇h(ψ), and I_obs,ψ(ψ̂) = B(ψ̂)ᵀ I_obs,ϕ(ϕ̂) B(ψ̂), with the same conditions and ϕ̂ = h(ψ̂).
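
A one-dimensional check of the transformation rule (not from the slides): take the Bernoulli family with ψ = p (the mean) and ϕ = logit(p) (the canonical parameter), where B(p) = dϕ/dp and both Fisher informations have standard closed forms.

    import numpy as np

    p = 0.3
    phi = np.log(p / (1.0 - p))              # phi = h(p), the canonical (logit) parameter
    B = 1.0 / (p * (1.0 - p))                # dh/dp, here a 1 x 1 "matrix"
    I_phi = p * (1.0 - p)                    # Fisher information in phi (= del^2 c(phi))

    print(B * I_phi * B)                     # B^T I_phi B ...
    print(1.0 / (p * (1.0 - p)))             # ... equals the usual Bernoulli information in p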

  22. Fisher Information and MLE The so-called "usual" asymptotics of maximum likelihood says the asymptotic (large sample, approximate) distribution of the MLE is normal with mean vector the true unknown parameter value and variance matrix the inverse Fisher information (either observed or expected, but for that particular model and parameter). For full exponential families, this is an application of the delta method.

  23. Fisher Information and MLE (cont.) Recall again (from just before we started talking about Fisher information) that for an unconditional canonical affine submodel of an aster model β̂ = h⁻¹(Mᵀy), where h(β) = ∇c_sub(β) = Mᵀ∇c(a + Mβ) and ∇h(β) = ∇²c_sub(β) = Mᵀ∇²c(a + Mβ)M, and ∇h⁻¹(τ) = (∇h(β))⁻¹ when τ = h(β) and β = h⁻¹(τ).
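
A minimal simulation sketch of where this is heading (not from the slides; the independent-Poisson cumulant is again a stand-in and M, a, and β are made up): applying the delta method to β̂ = h⁻¹(Mᵀy), and using the standard exponential-family fact that var(y) = ∇²c(ϕ), so that var(Mᵀy) = Mᵀ∇²c(a + Mβ)M = I_sub(β) = ∇h(β), gives var(β̂) approximately equal to I_sub(β)⁻¹.

    import numpy as np

    rng = np.random.default_rng(11)
    M = rng.normal(size=(30, 2)) * 0.3       # hypothetical model matrix
    a = np.zeros(30)                         # hypothetical offset vector
    beta_true = np.array([0.2, -0.1])
    mu = np.exp(a + M @ beta_true)           # Poisson means = del c(a + M beta)
    I_sub = M.T @ np.diag(mu) @ M            # submodel Fisher information at beta_true

    def mle(y):
        # Newton iteration solving M^T y = M^T del c(a + M beta) for beta
        beta = np.zeros(2)
        for _ in range(25):
            phi = a + M @ beta
            fisher = M.T @ np.diag(np.exp(phi)) @ M
            beta = beta + np.linalg.solve(fisher, M.T @ (y - np.exp(phi)))
        return beta

    beta_hats = np.array([mle(rng.poisson(mu)) for _ in range(2000)])
    print(np.cov(beta_hats, rowvar=False))   # Monte Carlo variance of beta_hat ...
    print(np.linalg.inv(I_sub))              # ... close to inverse Fisher information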
