

  1. Stat 8931 (Aster Models) Lecture Slides Deck 4. Charles J. Geyer, School of Statistics, University of Minnesota. June 7, 2015.

  2. The Delta Method The delta method is a method (duh!) of deriving the approximate distribution of a nonlinear function of an estimator from the approximate distribution of the estimator itself. What it does is linearize the nonlinear function. If g is a nonlinear, differentiable vector-to-vector function, the best linear approximation, which is the Taylor series up through linear terms, is g(y) − g(x) ≈ ∇g(x)(y − x), where ∇g(x) is the matrix of partial derivatives, sometimes called the Jacobian matrix. If g_i(x) denotes the i-th component of the vector g(x), then the (i, j)-th component of the Jacobian matrix is ∂g_i(x)/∂x_j.
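
A minimal numerical sketch of this linearization (not from the slides; the function g and the points x and y are made up for illustration), in Python with NumPy and a finite-difference Jacobian:

    # check g(y) - g(x) ~= grad_g(x) (y - x) for a made-up vector-to-vector g
    import numpy as np

    def g(x):
        # example nonlinear map from R^2 to R^2 (illustrative only)
        return np.array([np.exp(x[0]) * x[1], x[0] ** 2 + np.sin(x[1])])

    def jacobian(f, x, eps=1e-6):
        # forward-difference Jacobian; (i, j) entry approximates d f_i / d x_j
        fx = f(x)
        J = np.empty((fx.size, x.size))
        for j in range(x.size):
            step = np.zeros_like(x)
            step[j] = eps
            J[:, j] = (f(x + step) - fx) / eps
        return J

    x = np.array([0.5, 1.0])
    y = x + np.array([0.01, -0.02])            # y close to x
    print(g(y) - g(x))                         # exact difference
    print(jacobian(g, x) @ (y - x))            # linear approximation, nearly equal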

  3. The Delta Method (cont.) The delta method is particularly useful when θ̂ is an estimator and θ is the unknown true (vector) parameter value it estimates, and the delta method says g(θ̂) − g(θ) ≈ ∇g(θ)(θ̂ − θ). It is not necessary that θ and g(θ) be vectors of the same dimension. Hence it is not necessary that ∇g(θ) be a square matrix.

  4. The Delta Method (cont.) The delta method gives good or bad approximations depending on whether the spread of the distribution of θ̂ − θ is small or large compared to the nonlinearity of the function g in the neighborhood of θ. The Taylor series approximation the delta method uses is a good approximation for sufficiently small values of θ̂ − θ and a bad approximation for sufficiently large values of θ̂ − θ. So the overall method is good if those "sufficiently large" values have small probability, and bad otherwise.

  5. The Delta Method (cont.) As with nearly every application of approximation in statistics, we rarely (if ever) do the (very difficult) analysis to know whether the approximation is good or bad. We just use the delta method and hope it gives good results. If we are really worried, we can check it using simulation (also called the parametric bootstrap).

  6. The Delta Method (cont.) The delta method is particularly easy to use when the distribution of θ̂ − θ is multivariate normal, exactly or approximately. If it is only approximately normal, then this is another approximation in addition to the Taylor series approximation. The reason this is easy is that normal distributions are determined by their mean vector and variance matrix, and there is a theorem which gives the mean vector and variance matrix of a linear function of a random vector.

  7. The Delta Method (cont.) Theorem. Suppose X is a random vector, a is a nonrandom vector, and B is a nonrandom matrix such that a + BX makes sense (because a, B, and X have dimensions such that the indicated vector addition and matrix-vector multiplication are defined). Then E(a + BX) = a + B E(X) and var(a + BX) = B var(X) Bᵀ. A proof is given on slides 64–67 of deck 2 of my Stat 5101 course slides. Another way to say this is that if E(X) = µ and var(X) = V, then E(a + BX) = a + Bµ and var(a + BX) = BVBᵀ.
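
A quick Monte Carlo check of this theorem (not from the slides; a, B, µ, and V are made-up values):

    import numpy as np

    rng = np.random.default_rng(42)
    mu = np.array([1.0, -2.0, 0.5])
    V = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.0],
                  [0.1, 0.0, 0.5]])
    a = np.array([1.0, 2.0])
    B = np.array([[1.0, 0.0, 2.0],
                  [0.0, 3.0, -1.0]])                   # maps R^3 to R^2, so a + B X makes sense

    X = rng.multivariate_normal(mu, V, size=100_000)   # one draw of X per row
    Y = a + X @ B.T                                    # a + B X, rowwise

    print(Y.mean(axis=0), a + B @ mu)                  # sample mean vs a + B mu
    print(np.cov(Y, rowvar=False))                     # sample variance vs ...
    print(B @ V @ B.T)                                 # ... B V B^T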

  8. The Delta Method (cont.) So suppose θ̂ is normal with mean vector θ and variance matrix V, and write B = ∇g(θ). Then θ̂ − θ has mean vector 0 and variance matrix V, and E{g(θ̂) − g(θ)} ≈ 0 and var{g(θ̂) − g(θ)} ≈ BVBᵀ.

  9. The Delta Method (cont.) The Delta Method for Approximately Normal Estimators. Suppose θ̂ is approximately normal with mean vector θ and variance matrix V(θ). Suppose g is a vector-to-vector function with derivative ∇g(θ) = B(θ). Then g(θ̂) is approximately normal with mean vector g(θ) and variance matrix B(θ)V(θ)B(θ)ᵀ.
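
A minimal sketch of this result in use (not from the slides; g, θ, and V are made up for illustration): compute the delta-method variance B(θ)V(θ)B(θ)ᵀ and compare it with simulation.

    import numpy as np

    def g(theta):
        # example vector-to-vector function, R^2 -> R^2
        return np.array([np.exp(theta[0]), theta[0] * theta[1]])

    def B(theta):
        # exact Jacobian of g; (i, j) entry is d g_i / d theta_j
        return np.array([[np.exp(theta[0]), 0.0],
                         [theta[1], theta[0]]])

    theta = np.array([0.2, 1.5])             # "true" parameter (made up)
    V = np.array([[0.04, 0.01],
                  [0.01, 0.09]])             # variance matrix of theta_hat (made up)

    var_delta = B(theta) @ V @ B(theta).T    # delta-method variance of g(theta_hat)

    rng = np.random.default_rng(1)
    samples = rng.multivariate_normal(theta, V, size=100_000)
    g_hats = np.column_stack([np.exp(samples[:, 0]), samples[:, 0] * samples[:, 1]])
    print(var_delta)
    print(np.cov(g_hats, rowvar=False))      # close when V is small relative to the curvature of g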

  10. The Delta Method (cont.) An approximate confidence region for g(θ) is centered at g(θ̂) and has extent determined by B(θ)V(θ)B(θ)ᵀ. But we do not know that because we do not know θ (the true unknown parameter value). Thus we make a last approximation, plug in θ̂ for θ in the variance, and use B(θ̂)V(θ̂)B(θ̂)ᵀ. This is known as the plug-in principle. (For the statisticians in the audience, it is an application of Slutsky's theorem.)
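
A minimal sketch of a plug-in delta-method (Wald) interval for a scalar function g(θ) (not from the slides; the estimate, its estimated variance matrix, and g are hypothetical stand-ins for the output of an actual model fit):

    import numpy as np
    from scipy import stats

    theta_hat = np.array([0.8, -0.3])        # hypothetical MLE
    V_hat = np.array([[0.050, 0.010],
                      [0.010, 0.020]])       # hypothetical estimated variance of theta_hat

    def g(theta):
        return np.exp(theta[0] + theta[1])   # made-up scalar function of interest

    def grad_g(theta):
        return np.exp(theta[0] + theta[1]) * np.array([1.0, 1.0])

    point = g(theta_hat)
    b = grad_g(theta_hat)                    # B(theta_hat), here a single row
    se = np.sqrt(b @ V_hat @ b)              # plug-in standard error
    crit = stats.norm.ppf(0.975)             # 95% two-sided normal critical value
    print(point - crit * se, point + crit * se)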

  11. The Delta Method (cont.) Recall from deck 2 of these slides that the maximum likelihood estimator in an unconditional canonical affine submodel of an aster model can be written β̂ = h⁻¹(Mᵀy), where h is the transformation from canonical to mean value parameters given by h(β) = ∇c_sub(β) = Mᵀ∇c(a + Mβ) and has derivative ∇h(β) = ∇²c_sub(β) = Mᵀ∇²c(a + Mβ)M.
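
A minimal sketch of solving h(β) = Mᵀy by Newton's method (not from the slides; it uses an independent-Poisson cumulant function, c(ϕ) = Σ exp(ϕ_i), as a simple stand-in for the aster cumulant c, and M, a, and y are made up):

    import numpy as np

    def grad_c(phi):
        return np.exp(phi)                   # del c(phi), componentwise, for independent Poisson

    def hess_c(phi):
        return np.diag(np.exp(phi))          # del^2 c(phi), diagonal for independent Poisson

    rng = np.random.default_rng(3)
    M = 0.5 * rng.normal(size=(30, 2))       # hypothetical model matrix
    a = np.zeros(30)                         # hypothetical offset vector
    y = rng.poisson(lam=2.0, size=30)        # hypothetical response vector

    beta = np.zeros(2)
    for _ in range(50):                      # Newton iteration for h(beta) = M^T y
        phi = a + M @ beta
        score = M.T @ (y - grad_c(phi))      # M^T y - h(beta)
        fisher = M.T @ hess_c(phi) @ M       # del h(beta) = M^T del^2 c(a + M beta) M
        step = np.linalg.solve(fisher, score)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    print(beta)                              # beta_hat = h^{-1}(M^T y)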

  12. The Delta Method (cont.) And by the inverse function theorem of real analysis, the derivative of the inverse function is the (matrix) inverse of the derivative of the forward function: ∇h⁻¹(τ) = (∇h(β))⁻¹, when τ = h(β) and β = h⁻¹(τ).
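
A numerical check of the inverse function theorem (not from the slides; the map h here is a made-up smooth map on R² with a closed-form inverse, not the aster parameter map):

    import numpy as np

    def h(beta):
        return np.array([np.exp(beta[0]), beta[0] + beta[1] ** 3])

    def h_inverse(tau):
        b1 = np.log(tau[0])
        return np.array([b1, np.cbrt(tau[1] - b1)])

    def grad_h(beta):
        return np.array([[np.exp(beta[0]), 0.0],
                         [1.0, 3.0 * beta[1] ** 2]])

    beta = np.array([0.3, 0.7])
    tau = h(beta)

    eps = 1e-6                               # finite-difference Jacobian of h^{-1} at tau
    J_inv = np.empty((2, 2))
    for j in range(2):
        step = np.zeros(2)
        step[j] = eps
        J_inv[:, j] = (h_inverse(tau + step) - h_inverse(tau)) / eps

    print(J_inv)
    print(np.linalg.inv(grad_h(beta)))       # the two matrices nearly agree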

  13. Fisher Information The matrix that appeared in the derivative of the canonical-to-mean-value parameter map plays a very important role in likelihood inference. The observed Fisher information matrix is minus the second derivative matrix of the log likelihood. The expected Fisher information matrix is the expectation of the observed Fisher information matrix.

  14. Fisher Information (cont.) What Fisher information is depends on what the parameter is (what you are differentiating with respect to). It also depends on what the model is (what the log likelihood is). Thus, to be pedantically correct, we need decoration to indicate observed or expected, the model, and the parameter. Sometimes we are not so fussy and let the context indicate what we mean.

  15. Fisher Information (cont.) For log likelihood l for parameter ϕ, observed Fisher information (for this model and parameter) is I_obs(ϕ) = −∇²l(ϕ) and expected Fisher information (for this model and parameter) is I_exp(ϕ) = E_ϕ{I_obs(ϕ)} = E_ϕ{−∇²l(ϕ)}.
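
A small illustration of these two definitions (not from the slides, and not an aster model): observed versus expected Fisher information for a binomial(n, p) model parameterized by the success probability p, where observed information depends on the data but expected information does not.

    import numpy as np

    n, p = 20, 0.3
    rng = np.random.default_rng(7)

    def obs_info(y, p, n):
        # minus the second derivative of l(p) = y log p + (n - y) log(1 - p)
        return y / p ** 2 + (n - y) / (1.0 - p) ** 2

    exp_info = n / (p * (1.0 - p))           # E_p{ I_obs(p) }, the standard closed form

    y = rng.binomial(n, p)
    print(obs_info(y, p, n))                 # depends on the observed data y
    print(exp_info)                          # does not depend on the data

    ys = rng.binomial(n, p, size=200_000)
    print(obs_info(ys, p, n).mean())         # expectation of observed info: close to exp_info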

  16. Fisher Information (cont.) If this is the log likelihood for a full exponential family, l(ϕ) = ⟨y, ϕ⟩ − c(ϕ), then I_obs(ϕ) = −∇²l(ϕ) = ∇²c(ϕ), and since this is a nonrandom quantity, it is its own expectation (expectation of a constant is that constant), so I_exp(ϕ) = ∇²c(ϕ) too.
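
A small illustration (not from the slides): for a Bernoulli model in canonical (logit) parameterization, with c(ϕ) = log(1 + exp(ϕ)), minus the second derivative of the log likelihood does not involve the data at all, matching I(ϕ) = ∇²c(ϕ).

    import numpy as np

    def c_second_deriv(phi):
        p = 1.0 / (1.0 + np.exp(-phi))       # mean value parameter
        return p * (1.0 - p)                 # del^2 c(phi) for c(phi) = log(1 + exp(phi))

    def obs_info(y, phi, eps=1e-4):
        # numerical minus-second-derivative of l(phi) = y * phi - c(phi)
        def l(ph):
            return y * ph - np.log1p(np.exp(ph))
        return -(l(phi + eps) - 2.0 * l(phi) + l(phi - eps)) / eps ** 2

    phi = 0.7
    print(obs_info(y=0, phi=phi))            # same whether y = 0 ...
    print(obs_info(y=1, phi=phi))            # ... or y = 1 ...
    print(c_second_deriv(phi))               # ... and equal to del^2 c(phi)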

  17. Fisher Information (cont.) Thus for a full exponential family, in general, and for saturated aster models and their unconditional canonical affine submodels, in particular, there is no difference between observed and expected Fisher information for the unconditional canonical parameter, and we can just write I(ϕ) = ∇²c(ϕ).

  18. Fisher Information (cont.) But even restricting to Fisher information for the unconditional canonical parameter, we distinguish Fisher information for saturated models and canonical affine submodels: I_sat(ϕ) = ∇²c(ϕ) and I_sub(β) = ∇²c_sub(β) = Mᵀ∇²c(a + Mβ)M.
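
A minimal sketch contrasting the two matrices (not from the slides; the independent-Poisson cumulant is again a stand-in for the aster cumulant, and M, a, and β are made up): I_sat is n × n while I_sub is p × p.

    import numpy as np

    rng = np.random.default_rng(5)
    M = rng.normal(size=(8, 3))              # hypothetical model matrix (n = 8, p = 3)
    a = np.zeros(8)                          # hypothetical offset vector
    beta = np.array([0.1, -0.2, 0.3])        # hypothetical submodel parameter

    phi = a + M @ beta                       # saturated canonical parameter
    I_sat = np.diag(np.exp(phi))             # del^2 c(phi) for independent Poisson, n x n
    I_sub = M.T @ I_sat @ M                  # M^T del^2 c(a + M beta) M, p x p
    print(I_sat.shape, I_sub.shape)          # (8, 8) and (3, 3)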

  19. Fisher Information (cont.) To figure out Fisher information for other parameters, there are two ways to go: write the log likelihood in terms of the new parameter, differentiate it twice, negate it, and take an expectation if expected Fisher information is wanted; or prove a theorem about how Fisher information transforms under change-of-parameter. (The latter is just the former done abstractly and once and for all, rather than concretely and repeated for each problem.)

  20. Fisher Information Transforms by Covariance If ψ is another parameter, then ∂l(ψ)/∂ψ_i = Σ_k [∂l(ϕ)/∂ϕ_k][∂ϕ_k/∂ψ_i] (the multivariable chain rule), and ∂²l(ψ)/∂ψ_i∂ψ_j = Σ_k Σ_l [∂²l(ϕ)/∂ϕ_k∂ϕ_l][∂ϕ_k/∂ψ_i][∂ϕ_l/∂ψ_j] + Σ_k [∂l(ϕ)/∂ϕ_k][∂²ϕ_k/∂ψ_i∂ψ_j]. This is somewhat ugly. But if we plug in the MLE for ϕ, the second term is zero because ∇l(ϕ̂) = 0 (the first derivative is zero at the maximum). The second term also goes away for expected Fisher information because E_ϕ{∇l(ϕ)} = 0 by a differentiation under the integral sign argument proved in theoretical statistics courses (slides 33–35 and 86 of my 5102 course slides).

  21. Fisher Information Transforms by Covariance (cont.) This gives the transformation rules I_exp,ψ(ψ) = B(ψ)ᵀ I_exp,ϕ(ϕ) B(ψ), where ϕ = h(ψ) and B(ψ) = ∇h(ψ), and I_obs,ψ(ψ̂) = B(ψ̂)ᵀ I_obs,ϕ(ϕ̂) B(ψ̂), with the same conditions and ϕ̂ = h(ψ̂).
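
A one-dimensional check of the transformation rule (not from the slides): take the Bernoulli family with ψ = p (the mean) and ϕ = logit(p) (the canonical parameter), where B(p) = dϕ/dp and both Fisher informations have standard closed forms.

    import numpy as np

    p = 0.3
    phi = np.log(p / (1.0 - p))              # phi = h(p), the canonical (logit) parameter
    B = 1.0 / (p * (1.0 - p))                # dh/dp, here a 1 x 1 "matrix"
    I_phi = p * (1.0 - p)                    # Fisher information in phi (= del^2 c(phi))

    print(B * I_phi * B)                     # B^T I_phi B ...
    print(1.0 / (p * (1.0 - p)))             # ... equals the usual Bernoulli information in p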

  22. Fisher Information and MLE The so-called "usual" asymptotics of maximum likelihood says the asymptotic (large sample, approximate) distribution of the MLE is normal with mean vector the true unknown parameter value and variance matrix the inverse Fisher information (either observed or expected, but for that particular model and parameter). For full exponential families, this is an application of the delta method.

  23. Fisher Information and MLE (cont.) Recall again (from just before we started talking about Fisher information) that for an unconditional canonical affine submodel of an aster model β̂ = h⁻¹(Mᵀy), where h(β) = ∇c_sub(β) = Mᵀ∇c(a + Mβ) and ∇h(β) = ∇²c_sub(β) = Mᵀ∇²c(a + Mβ)M, and ∇h⁻¹(τ) = (∇h(β))⁻¹ when τ = h(β) and β = h⁻¹(τ).
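
A minimal simulation sketch of where this is heading (not from the slides; the independent-Poisson cumulant is again a stand-in and M, a, and β are made up): applying the delta method to β̂ = h⁻¹(Mᵀy), and using the standard exponential-family fact that var(y) = ∇²c(ϕ), so that var(Mᵀy) = Mᵀ∇²c(a + Mβ)M = I_sub(β) = ∇h(β), gives var(β̂) approximately equal to I_sub(β)⁻¹.

    import numpy as np

    rng = np.random.default_rng(11)
    M = rng.normal(size=(30, 2)) * 0.3       # hypothetical model matrix
    a = np.zeros(30)                         # hypothetical offset vector
    beta_true = np.array([0.2, -0.1])
    mu = np.exp(a + M @ beta_true)           # Poisson means = del c(a + M beta)
    I_sub = M.T @ np.diag(mu) @ M            # submodel Fisher information at beta_true

    def mle(y):
        # Newton iteration solving M^T y = M^T del c(a + M beta) for beta
        beta = np.zeros(2)
        for _ in range(25):
            phi = a + M @ beta
            fisher = M.T @ np.diag(np.exp(phi)) @ M
            beta = beta + np.linalg.solve(fisher, M.T @ (y - np.exp(phi)))
        return beta

    beta_hats = np.array([mle(rng.poisson(mu)) for _ in range(2000)])
    print(np.cov(beta_hats, rowvar=False))   # Monte Carlo variance of beta_hat ...
    print(np.linalg.inv(I_sub))              # ... close to inverse Fisher information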
