
Stat 5101 Lecture Slides, Deck 7: Asymptotics, also called Large Sample Theory. Charles J. Geyer, School of Statistics, University of Minnesota.

Asymptotic Approximation. The last big subject in probability theory is asymptotic approximation.


  1. Cauchy Distribution Violates the LLN (cont.) This gives the trivial convergence in distribution result $\bar{X}_n \xrightarrow{D} \text{Cauchy}(\mu, \sigma)$, "trivial" because the left-hand side has the Cauchy($\mu, \sigma$) distribution for all $n$. When thought of as being about distributions rather than variables (which it is), this is a constant sequence, which has a trivial limit (the limit of a constant sequence is that constant).

  2. Cauchy Distribution Violates the LLN (cont.) The result $\bar{X}_n \xrightarrow{D} \text{Cauchy}(\mu, \sigma)$ is not convergence in distribution to a constant, the right-hand side not being a constant random variable. It is not surprising that the LLN, which specifies $\bar{X}_n \xrightarrow{P} E(X_1)$, does not hold, because the mean $E(X_1)$ does not exist in this case.

  3. Cauchy Distribution Violates the LLN (cont.) What is surprising is that $\bar{X}_n$ does not get closer to $\mu$ as $n$ increases. We saw (deck 2, slides 113–123) that when second moments exist we actually have $\bar{X}_n - \mu = O_p(n^{-1/2})$. When only first moments exist, we have only the weaker statement (the LLN) $\bar{X}_n - \mu = o_p(1)$. But here in the Cauchy case, where not even first moments exist, we have only the even weaker statement $\bar{X}_n - \mu = O_p(1)$, which doesn't say $\bar{X}_n - \mu$ decreases in any sense.
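
A quick R simulation (not part of the original slides) illustrates this: the running mean of IID standard Cauchy draws keeps jumping around no matter how large n gets.

set.seed(42)
n <- 1e5
x <- rcauchy(n)                              # standard Cauchy: location 0, scale 1
running.mean <- cumsum(x) / seq_len(n)
running.mean[c(100, 1000, 10000, 100000)]    # no sign of settling down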

  4. The Central Limit Theorem When second moments exist, we actually have something much stronger than $\bar{X}_n - \mu = O_p(n^{-1/2})$. If $X_1, X_2, \ldots$ are IID random variables having mean $\mu$ and variance $\sigma^2$, then
$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{D} N(0, \sigma^2)$$
This fact is called the central limit theorem (CLT). The CLT is much too hard to prove in this course.
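
A minimal R sketch (not part of the original slides), using IID Exp(1) data, which have mean 1 and variance 1, so the CLT limit is N(0, 1):

set.seed(42)
n <- 100
z <- replicate(1e4, sqrt(n) * (mean(rexp(n)) - 1))   # Exp(1) has mean 1, variance 1
c(mean(z), var(z))                                   # close to the limit's 0 and 1
c(mean(z > 1.96), pnorm(1.96, lower.tail = FALSE))   # tail probabilities roughly agree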

  5. The Central Limit Theorem (cont.) When $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{D} N(0, \sigma^2)$ holds, the jargon says one has "asymptotic normality". When $\bar{X}_n - \mu = O_p(n^{-1/2})$ holds, the jargon says one has "root n rate". It is not necessary to have independence or identical distribution to get asymptotic normality. It also holds in many examples, such as the ones we looked at in deck 2, where one has root n rate. But precise conditions under which asymptotic normality obtains without IID are beyond the scope of this course.

  6. The Central Limit Theorem (cont.) The CLT also has a "sloppy version". If $\sqrt{n}(\bar{X}_n - \mu)$ actually had exactly the $N(0, \sigma^2)$ distribution, then $\bar{X}_n$ itself would have the $N(\mu, \sigma^2/n)$ distribution. This leads to the statement
$$\bar{X}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right)$$
where the $\approx$ means something like "approximately distributed as", although it doesn't precisely mean anything. The correct mathematical statement is given on the preceding slide. The "sloppy" version cannot be a correct mathematical statement because a limit as $n \to \infty$ cannot have an $n$ in the putative limit.

  7. The CLT and Addition Rules Any distribution that has second moments and appears as the distribution of the sum of IID random variables (an "addition rule") is approximately normal when the number of terms in the sum is large. Bin($n, p$) is approximately normal when $n$ is large and neither $np$ nor $n(1-p)$ is near zero. NegBin($r, p$) is approximately normal when $r$ is large and neither $rp$ nor $r(1-p)$ is near zero. Poi($\mu$) is approximately normal when $\mu$ is large. Gam($\alpha, \lambda$) is approximately normal when $\alpha$ is large.
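
As a rough R check (not part of the original slides) of two of these approximations, with illustrative values:

c(ppois(110, 100), pnorm(110, mean = 100, sd = 10))           # Poi(100) vs N(100, 100)
c(pbinom(35, 100, 0.3), pnorm(35, mean = 30, sd = sqrt(21)))  # Bin(100, 0.3) vs N(30, 21)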

  8. The CLT and Addition Rules (cont.) Suppose $X_1, X_2, \ldots$ are IID Ber($p$) and $Y = X_1 + \cdots + X_n$, so $Y$ is Bin($n, p$). Then the CLT says
$$\bar{X}_n \approx N\left(p, \frac{p(1-p)}{n}\right)$$
and
$$Y = n\bar{X}_n \approx N\big(np, np(1-p)\big)$$
by the continuous mapping theorem (which we will cover in slides 46–49, this deck). The disclaimer that neither $np$ nor $n(1-p)$ be near zero comes from the fact that if $np \to \mu$ we get the Poisson approximation, not the normal approximation, and if $n(1-p) \to \mu$ we get a Poisson approximation for $n - Y$.
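
A quick R check (not part of the original slides) of the disclaimer, with the illustrative values n = 1000 and p = 0.002, so np = 2 is small:

n <- 1000; p <- 0.002                    # np = 2, far from large
rbind(binomial = dbinom(0:5, n, p),
      poisson  = dpois(0:5, n * p),
      normal   = dnorm(0:5, mean = n * p, sd = sqrt(n * p * (1 - p))))
# the Poisson values track the binomial ones closely; the normal values do not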

  9. The CLT and Addition Rules (cont.) Suppose $X_1, X_2, \ldots$ are IID Gam($\alpha, \lambda$) and $Y = X_1 + \cdots + X_n$, so $Y$ is Gam($n\alpha, \lambda$). Then the CLT says
$$\bar{X}_n \approx N\left(\frac{\alpha}{\lambda}, \frac{\alpha}{n\lambda^2}\right)$$
and
$$Y = n\bar{X}_n \approx N\left(\frac{n\alpha}{\lambda}, \frac{n\alpha}{\lambda^2}\right)$$
by the continuous mapping theorem. Writing $\beta = n\alpha$, we see that if $Y$ is Gam($\beta, \lambda$) and $\beta$ is large, then
$$Y \approx N\left(\frac{\beta}{\lambda}, \frac{\beta}{\lambda^2}\right)$$
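
A rough R check (not part of the original slides) with the illustrative values β = 50 and λ = 2:

beta <- 50; lambda <- 2                  # mean beta/lambda = 25, variance beta/lambda^2 = 12.5
c(pgamma(30, shape = beta, rate = lambda),
  pnorm(30, mean = beta / lambda, sd = sqrt(beta) / lambda))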

  10. Correction for Continuity A trick known as "continuity correction" improves the normal approximation for integer-valued random variables. Suppose $X$ has an integer-valued distribution. For a concrete example, take Bin(10, 1/3), which has mean and variance
$$E(X) = np = \frac{10}{3} \qquad \operatorname{var}(X) = np(1-p) = \frac{20}{9}$$

  11. Correction for Continuity (cont.) [Figure: the Bin(10, 1/3) distribution function $F(x)$ (black step function) and its normal approximation (red curve), plotted for $x$ from 0 to 10.] The approximation is best at the points where the red curve crosses the black steps, approximately in the middle of each step.

  12. Correction for Continuity (cont.) If $X$ is an integer-valued random variable whose distribution is approximately that of $Y$, a normal random variable with the same mean and variance as $X$, and $F$ is the DF of $X$ and $G$ is the DF of $Y$, then the correction for continuity says, for integer $x$,
$$\Pr(X \le x) = F(x) \approx G(x + 1/2)$$
and
$$\Pr(X \ge x) = 1 - F(x - 1) \approx 1 - G(x - 1/2)$$
so for integer $a$ and $b$
$$\Pr(a \le X \le b) \approx \Pr(a - 1/2 < Y < b + 1/2) = G(b + 1/2) - G(a - 1/2)$$
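
This rule can be packaged as a small R helper (not part of the original slides; the function name prob.cc is made up for illustration):

prob.cc <- function(a, b, mean, sd) {
  # continuity-corrected normal approximation to Pr(a <= X <= b)
  pnorm(b + 0.5, mean, sd) - pnorm(a - 0.5, mean, sd)
}
# example: Pr(2 <= X <= 4) for X ~ Bin(10, 1/3)
c(exact = sum(dbinom(2:4, 10, 1/3)),
  corrected = prob.cc(2, 4, 10/3, sqrt(20/9)))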

  13. Correction for Continuity (cont.) Let's try it. $X$ is Bin(10, 1/3); we calculate $\Pr(X \le 2)$ exactly, approximately without correction for continuity, and with correction for continuity.

> pbinom(2, 10, 1/3)
[1] 0.2991414
> pnorm(2, 10/3, sqrt(20/9))
[1] 0.1855467
> pnorm(2.5, 10/3, sqrt(20/9))
[1] 0.2880751

The correction for continuity is clearly more accurate.

  14. Correction for Continuity (cont.) Again, this time $\Pr(X \ge 6)$.

> 1 - pbinom(5, 10, 1/3)
[1] 0.07656353
> 1 - pnorm(6, 10/3, sqrt(20/9))
[1] 0.03681914
> 1 - pnorm(5.5, 10/3, sqrt(20/9))
[1] 0.07305023

Again, the correction for continuity is clearly more accurate.

  15. Correction for Continuity (cont.) Always use the correction for continuity when the random variable being approximated is integer-valued. Never use the correction for continuity when the random variable being approximated is continuous. It is debatable whether to use the correction for continuity when the random variable being approximated is discrete but not integer-valued, yet has a known relationship to an integer-valued random variable.

  16. Infinitely Divisible Distributions A distribution is said to be infinitely divisible if for any positive integer $n$ the distribution is that of the sum of $n$ IID random variables. For example, the Poisson distribution is infinitely divisible because Poi($\mu$) is the distribution of the sum of $n$ IID Poi($\mu/n$) random variables.

  17. Infinitely Divisible Distributions and the CLT Infinitely divisible distributions show what is wrong with the "sloppy" version of the CLT, which says the sum of $n$ IID random variables is approximately normal whenever $n$ is "large". Poi($\mu$) is always the distribution of the sum of $n$ IID random variables, for any $n$. Pick $n$ as large as you please. But that cannot mean that every Poisson distribution is approximately normal. For small and moderate size $\mu$, the Poi($\mu$) distribution is not close to normal.
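
A quick R comparison (not part of the original slides) of Poi(2) with its normal approximation N(2, 2):

mu <- 2
pnorm(0, mean = mu, sd = sqrt(mu))      # probability at or below zero under the normal, but 0 < x never happens for Poi(2)... the Poisson puts no mass below 0
rbind(poisson = ppois(0:4, mu),
      normal  = pnorm(0:4 + 0.5, mean = mu, sd = sqrt(mu)))   # noticeable disagreement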

  18. The Continuous Mapping Theorem Suppose $X_n \xrightarrow{D} X$ and $g$ is a function that is continuous on a set $A$ such that $\Pr(X \in A) = 1$. Then
$$g(X_n) \xrightarrow{D} g(X)$$
This fact is called the continuous mapping theorem.

  19. The Continuous Mapping Theorem (cont.) The continuous mapping theorem is widely used with simple functions. If $\sigma > 0$, then $z \mapsto z/\sigma$ is continuous. The CLT says $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{D} Y$, where $Y$ is $N(0, \sigma^2)$. Applying the continuous mapping theorem we get
$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{D} \frac{Y}{\sigma}$$
Since $Y/\sigma$ has the standard normal distribution, we can rewrite this as
$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{D} N(0, 1)$$

  20. The Continuous Mapping Theorem (cont.) Suppose $X_n \xrightarrow{D} X$, where $X$ is a continuous random variable, so $\Pr(X = 0) = 0$. Then the continuous mapping theorem implies
$$\frac{1}{X_n} \xrightarrow{D} \frac{1}{X}$$
The fact that $x \mapsto 1/x$ is not continuous at zero is not a problem, because this is allowed by the continuous mapping theorem.

  21. The Continuous Mapping Theorem (cont.) As a special case of the preceding slide, suppose $X_n \xrightarrow{P} a$, where $a \ne 0$ is a constant. Then the continuous mapping theorem implies
$$\frac{1}{X_n} \xrightarrow{P} \frac{1}{a}$$
The fact that $x \mapsto 1/x$ is not continuous at zero is not a problem, because this is allowed by the continuous mapping theorem.

  22. Slutsky's Theorem Suppose $(X_i, Y_i)$, $i = 1, 2, \ldots$ are random vectors and
$$X_n \xrightarrow{D} X \qquad Y_n \xrightarrow{P} a$$
where $X$ is a random variable and $a$ is a constant. Then
$$X_n + Y_n \xrightarrow{D} X + a$$
$$X_n - Y_n \xrightarrow{D} X - a$$
$$X_n Y_n \xrightarrow{D} aX$$
and, if $a \ne 0$,
$$X_n / Y_n \xrightarrow{D} X/a$$

  23. Slutsky's Theorem (cont.) As an example of Slutsky's theorem, we show that convergence in distribution does not imply convergence of moments. Let $X$ have the standard normal distribution and $Y$ have the standard Cauchy distribution, and define
$$Z_n = X + \frac{Y}{n}$$
By Slutsky's theorem, $Z_n \xrightarrow{D} X$. But $Z_n$ does not have first moments and $X$ has moments of all orders.
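
A quick R sketch (not part of the original slides): for large n the distribution of Z_n is nearly standard normal, yet the Cauchy part still produces occasional huge values that wreck the moments.

set.seed(42)
n <- 1000
z <- rnorm(1e5) + rcauchy(1e5) / n
c(quantile(z, c(0.025, 0.975)), qnorm(c(0.025, 0.975)))  # central quantiles match the N(0, 1) limit
max(abs(z))                                              # but the heavy Cauchy tail still shows up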

  24. The Delta Method The "delta" is supposed to remind one of the $\Delta y / \Delta x$ woof about differentiation, since it involves derivatives. Suppose
$$n^\alpha (X_n - \theta) \xrightarrow{D} Y,$$
where $\alpha > 0$, and suppose $g$ is a function differentiable at $\theta$. Then
$$n^\alpha [g(X_n) - g(\theta)] \xrightarrow{D} g'(\theta) Y.$$

  25. The Delta Method (cont.) The assumption that $g$ is differentiable at $\theta$ means
$$g(\theta + h) = g(\theta) + g'(\theta) h + o(h)$$
where here the "little oh" of $h$ refers to $h \to 0$ rather than $h \to \infty$. It refers to a term of the form $|h| \psi(h)$, where $\psi(h) \to 0$ as $h \to 0$. And this implies
$$n^\alpha [g(X_n) - g(\theta)] = g'(\theta)\, n^\alpha (X_n - \theta) + n^\alpha\, o(X_n - \theta)$$
and the first term on the right-hand side converges to $g'(\theta) Y$ by the continuous mapping theorem.

  26. The Delta Method (cont.) By our discussion of "little oh" we can rewrite the second term on the right-hand side as
$$|n^\alpha (X_n - \theta)|\, \psi(X_n - \theta)$$
and
$$|n^\alpha (X_n - \theta)| \xrightarrow{D} |Y|$$
by the continuous mapping theorem. And $X_n - \theta \xrightarrow{P} 0$ by Slutsky's theorem (by an argument analogous to homework problem 11-6). Hence $\psi(X_n - \theta) \xrightarrow{P} 0$ by the continuous mapping theorem.

  27. The Delta Method (cont.) Putting this all together,
$$|n^\alpha (X_n - \theta)|\, \psi(X_n - \theta) \xrightarrow{P} 0$$
by Slutsky's theorem. Finally,
$$n^\alpha [g(X_n) - g(\theta)] = g'(\theta)\, n^\alpha (X_n - \theta) + n^\alpha\, o(X_n - \theta) \xrightarrow{D} g'(\theta) Y$$
by another application of Slutsky's theorem.

  28. The Delta Method (cont.) If $X_1, X_2, \ldots$ are IID Exp($\lambda$) random variables and
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i,$$
then the CLT says
$$\sqrt{n}\left(\bar{X}_n - \frac{1}{\lambda}\right) \xrightarrow{D} N\left(0, \frac{1}{\lambda^2}\right)$$
We want to "turn this upside down", applying the delta method with
$$g(x) = \frac{1}{x} \qquad g'(x) = -\frac{1}{x^2}$$

  29. The Delta Method (cont.)
$$\sqrt{n}\left(\bar{X}_n - \frac{1}{\lambda}\right) \xrightarrow{D} Y$$
implies
$$\sqrt{n}\left[g(\bar{X}_n) - g\left(\frac{1}{\lambda}\right)\right] = \sqrt{n}\left(\frac{1}{\bar{X}_n} - \lambda\right) \xrightarrow{D} g'\left(\frac{1}{\lambda}\right) Y = -\lambda^2 Y$$

  30. The Delta Method (cont.) Recall that in the limit $-\lambda^2 Y$ the random variable $Y$ had the $N(0, 1/\lambda^2)$ distribution. Since a linear function of normal is normal, $-\lambda^2 Y$ is normal with parameters
$$E(-\lambda^2 Y) = -\lambda^2 E(Y) = 0 \qquad \operatorname{var}(-\lambda^2 Y) = (-\lambda^2)^2 \operatorname{var}(Y) = \lambda^2$$
Hence we have finally arrived at
$$\sqrt{n}\left(\frac{1}{\bar{X}_n} - \lambda\right) \xrightarrow{D} N(0, \lambda^2)$$
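
A simulation check (not part of the original slides), with the illustrative values λ = 2 and n = 500:

set.seed(42)
lambda <- 2
n <- 500
z <- replicate(1e4, sqrt(n) * (1 / mean(rexp(n, rate = lambda)) - lambda))
c(mean(z), var(z), lambda^2)             # the variance should be close to lambda^2 = 4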

  31. The Delta Method (cont.) Since we routinely use the delta method in the case where the rate is $\sqrt{n}$ and the limiting distribution is normal, it is worthwhile working out some details of that case. Suppose
$$\sqrt{n}(X_n - \theta) \xrightarrow{D} N(0, \sigma^2),$$
and suppose $g$ is a function differentiable at $\theta$. Then the delta method says
$$\sqrt{n}[g(X_n) - g(\theta)] \xrightarrow{D} N\big(0, [g'(\theta)]^2 \sigma^2\big)$$

  32. The Delta Method (cont.) Let $Y$ have the $N(0, \sigma^2)$ distribution. Then the general delta method says
$$\sqrt{n}[g(X_n) - g(\theta)] \xrightarrow{D} g'(\theta) Y$$
As in our example, $g'(\theta) Y$ is normal with parameters
$$E\{g'(\theta) Y\} = g'(\theta) E(Y) = 0 \qquad \operatorname{var}\{g'(\theta) Y\} = [g'(\theta)]^2 \operatorname{var}(Y) = [g'(\theta)]^2 \sigma^2$$

  33. The Delta Method (cont.) We can turn this into a "sloppy" version of the delta method. If
$$X_n \approx N\left(\theta, \frac{\sigma^2}{n}\right)$$
then
$$g(X_n) \approx N\left(g(\theta), \frac{[g'(\theta)]^2 \sigma^2}{n}\right)$$

  34. The Delta Method (cont.) In particular, if we start with the "sloppy version" of the CLT
$$\bar{X}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right)$$
we obtain the "sloppy version" of the delta method
$$g(\bar{X}_n) \approx N\left(g(\mu), \frac{[g'(\mu)]^2 \sigma^2}{n}\right)$$

  35. The Delta Method (cont.) Be careful not to think of the last special case as all there is to the delta method, since the delta method is really much more general. The delta method turns one convergence in distribution result into another. The first convergence in distribution result need not be the CLT. The parameter $\theta$ in the general theorem need not be the mean.

  36. Variance Stabilizing Transformations An important application of the delta method is variance stabilizing transformations. The idea is to find a function $g$ such that the limit in the delta method
$$n^\alpha [g(X_n) - g(\theta)] \xrightarrow{D} g'(\theta) Y$$
has variance that does not depend on the parameter $\theta$. Of course, the variance is
$$\operatorname{var}\{g'(\theta) Y\} = [g'(\theta)]^2 \operatorname{var}(Y)$$
so for this problem to make sense $\operatorname{var}(Y)$ must be a function of $\theta$ and no other parameters. Thus variance stabilizing transformations usually apply only to distributions having a single parameter.

  37. Variance Stabilizing Transformations (cont.) Write $\operatorname{var}_\theta(Y) = v(\theta)$. Then we are trying to find $g$ such that
$$[g'(\theta)]^2 v(\theta) = c$$
for some constant $c$, or, equivalently,
$$g'(\theta) = \frac{c}{v(\theta)^{1/2}}$$
The fundamental theorem of calculus assures us that any indefinite integral of the right-hand side will do.

  38. Variance Stabilizing Transformations (cont.) The CLT applied to an IID Ber($p$) sequence gives
$$\sqrt{n}(\bar{X}_n - p) \xrightarrow{D} N\big(0, p(1-p)\big)$$
so our method says we need to find an indefinite integral of $c / \sqrt{p(1-p)}$. The change of variable $p = (1 + w)/2$ gives
$$\int \frac{c\, dp}{\sqrt{p(1-p)}} = \int \frac{c\, dw}{\sqrt{1 - w^2}} = c \operatorname{asin}(w) + d$$
where $d$, like $c$, is an arbitrary constant and asin denotes the arcsine function (inverse of the sine function).

  39. Variance Stabilizing Transformations (cont.) Thus
$$g(p) = \operatorname{asin}(2p - 1), \qquad 0 \le p \le 1,$$
is a variance stabilizing transformation for the Bernoulli distribution. We check this using
$$g'(p) = \frac{1}{\sqrt{p(1-p)}}$$
so the delta method gives
$$\sqrt{n}[g(\bar{X}_n) - g(p)] \xrightarrow{D} N(0, 1)$$
and the "sloppy" delta method gives
$$g(\bar{X}_n) \approx N\left(g(p), \frac{1}{n}\right)$$
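
A simulation check (not part of the original slides) that this transformation really does stabilize the variance near 1/n, across several values of p:

set.seed(42)
n <- 200
for (p in c(0.1, 0.3, 0.5, 0.8)) {
  g <- replicate(1e4, asin(2 * mean(rbinom(n, 1, p)) - 1))
  cat("p =", p, " n * var(g) =", round(n * var(g), 3), "\n")   # should be near 1
}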

  40. Variance Stabilizing Transformations (cont.) It is important that the parameter $\theta$ in the discussion of variance stabilizing transformations is as it appears in the convergence in distribution result we start with,
$$n^\alpha (X_n - \theta) \xrightarrow{D} Y$$
In particular, if we start with the CLT
$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{D} Y$$
the "theta" must be the mean. We need to find an indefinite integral of $v(\mu)^{-1/2}$, where $v(\mu)$ is the variance expressed as a function of the mean, not some other parameter.

  41. Variance Stabilizing Transformations (cont.) To see how this works, consider the Geo($p$) distribution with
$$E(X) = \frac{1-p}{p} \qquad \operatorname{var}(X) = \frac{1-p}{p^2}$$
The usual parameter $p$ expressed as a function of the mean is
$$p = \frac{1}{1 + \mu}$$
and the variance expressed as a function of the mean is
$$v(\mu) = \mu(1 + \mu)$$

  42. Variance Stabilizing Transformations (cont.) Our method says we need to find an indefinite integral of the function $x \mapsto c / \sqrt{x(1+x)}$. According to Mathematica, it is
$$g(x) = 2 \operatorname{asinh}(\sqrt{x})$$
where asinh denotes the hyperbolic arc sine function, the inverse of the hyperbolic sine function
$$\sinh(x) = \frac{e^x - e^{-x}}{2}$$
so
$$\operatorname{asinh}(x) = \log\left(x + \sqrt{1 + x^2}\right)$$

  43. Variance Stabilizing Transformations (cont.) Thus
$$g(x) = 2 \operatorname{asinh}(\sqrt{x}), \qquad 0 \le x < \infty,$$
is a variance stabilizing transformation for the geometric distribution. We check this using
$$g'(x) = \frac{1}{\sqrt{x(1+x)}}$$

  44. Variance Stabilizing Transformations (cont.) So the delta method gives
$$\sqrt{n}\left[g(\bar{X}_n) - g\left(\frac{1-p}{p}\right)\right] \xrightarrow{D} N\left(0, \left[g'\left(\frac{1-p}{p}\right)\right]^2 \frac{1-p}{p^2}\right) = N\left(0, \frac{1}{\frac{1-p}{p}\left(1 + \frac{1-p}{p}\right)} \cdot \frac{1-p}{p^2}\right) = N(0, 1)$$
and the "sloppy" delta method gives
$$g(\bar{X}_n) \approx N\left(g\left(\frac{1-p}{p}\right), \frac{1}{n}\right)$$
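
A simulation check (not part of the original slides) for the geometric case. Note that rgeom() in R counts failures before the first success, which matches the mean (1 − p)/p used here.

set.seed(42)
n <- 200
for (p in c(0.2, 0.5, 0.8)) {
  g <- replicate(1e4, 2 * asinh(sqrt(mean(rgeom(n, p)))))
  cat("p =", p, " n * var(g) =", round(n * var(g), 3), "\n")   # should be near 1
}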

  45. Multivariate Convergence in Probability We introduce the following notation for the length of a vector
$$\|x\| = \sqrt{x^T x} = \sqrt{\sum_{i=1}^n x_i^2}$$
where $x = (x_1, \ldots, x_n)$. Then we say a sequence $X_1, X_2, \ldots$ of random vectors (here subscripts do not indicate components) converges in probability to a constant vector $a$ if
$$\|X_n - a\| \xrightarrow{P} 0$$
which, by the continuous mapping theorem, happens if and only if
$$\|X_n - a\|^2 \xrightarrow{P} 0$$

  46. Multivariate Convergence in Probability (cont.) We write
$$X_n \xrightarrow{P} a$$
or
$$X_n - a = o_p(1)$$
to denote $\|X_n - a\|^2 \xrightarrow{P} 0$.

  47. Multivariate Convergence in Probability (cont.) Thus we have defined multivariate convergence in probability to a constant in terms of univariate convergence in probability to a constant. Now we consider the relationship further. Write
$$X_n = (X_{n1}, \ldots, X_{nk}) \qquad a = (a_1, \ldots, a_k)$$
Then
$$\|X_n - a\|^2 = \sum_{i=1}^k (X_{ni} - a_i)^2$$
so
$$(X_{ni} - a_i)^2 \le \|X_n - a\|^2$$

  48. Multivariate Convergence in Probability (cont.) It follows that $X_n \xrightarrow{P} a$ implies
$$X_{ni} \xrightarrow{P} a_i, \qquad i = 1, \ldots, k$$
In words, joint convergence in probability to a constant (of random vectors) implies marginal convergence in probability to a constant (of each component of those random vectors).

  49. Multivariate Convergence in Probability (cont.) Conversely, if we have
$$X_{ni} \xrightarrow{P} a_i, \qquad i = 1, \ldots, k,$$
then the continuous mapping theorem implies
$$(X_{ni} - a_i)^2 \xrightarrow{P} 0, \qquad i = 1, \ldots, k,$$
and Slutsky's theorem implies
$$(X_{n1} - a_1)^2 + (X_{n2} - a_2)^2 \xrightarrow{P} 0$$
and another application of Slutsky's theorem implies
$$(X_{n1} - a_1)^2 + (X_{n2} - a_2)^2 + (X_{n3} - a_3)^2 \xrightarrow{P} 0$$
and so forth. So by mathematical induction,
$$\|X_n - a\|^2 \xrightarrow{P} 0$$

  50. Multivariate Convergence in Probability (cont.) In words, joint convergence in probability to a constant (of random vectors) implies and is implied by marginal convergence in probability to a constant (of each component of those random vectors). But multivariate convergence in distribution is different!

  51. Multivariate Convergence in Distribution If $X_1, X_2, \ldots$ is a sequence of $k$-dimensional random vectors, and $X$ is another $k$-dimensional random vector, then we say $X_n$ converges in distribution to $X$ if
$$E\{g(X_n)\} \to E\{g(X)\} \qquad \text{for all bounded continuous functions } g : \mathbb{R}^k \to \mathbb{R},$$
and we write $X_n \xrightarrow{D} X$ to indicate this.

  52. Multivariate Convergence in Distribution (cont.) The Cramér–Wold theorem asserts that the following is an equivalent characterization of multivariate convergence in distribution: $X_n \xrightarrow{D} X$ if and only if
$$a^T X_n \xrightarrow{D} a^T X$$
for every constant vector $a$ (of the same dimension as the $X_n$ and $X$).

  53. Multivariate Convergence in Distribution (cont.) Thus we have defined multivariate convergence in distribution in terms of univariate convergence in distribution. If we use vectors $a$ having only one nonzero component in the Cramér–Wold theorem, we see that joint convergence in distribution (of random vectors) implies marginal convergence in distribution (of each component of those random vectors). But the converse is not, in general, true!

  54. Multivariate Convergence in Distribution (cont.) Here is a simple example where marginal convergence in distribution holds but joint convergence in distribution fails. Define
$$X_n = \begin{pmatrix} X_{n1} \\ X_{n2} \end{pmatrix}$$
where $X_{n1}$ is standard normal for all $n$ and $X_{n2} = (-1)^n X_{n1}$ (hence is also standard normal for all $n$). Trivially,
$$X_{ni} \xrightarrow{D} N(0, 1), \qquad i = 1, 2,$$
so we have marginal convergence in distribution.

  55. Multivariate Convergence in Distribution (cont.) But checking $a = (1, 1)$ in the Cramér–Wold condition we get
$$a^T X_n = \begin{pmatrix} 1 & 1 \end{pmatrix} \begin{pmatrix} X_{n1} \\ X_{n2} \end{pmatrix} = X_{n1}\left[1 + (-1)^n\right] = \begin{cases} 2 X_{n1}, & n \text{ even} \\ 0, & n \text{ odd} \end{cases}$$
And this sequence does not converge in distribution, so we do not have joint convergence in distribution; that is, $X_n \xrightarrow{D} Y$ cannot hold for any random vector $Y$.
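
A quick R illustration (not part of the original slides): the standard deviation of X_n1 + X_n2 alternates between 0 and 2 as n alternates between odd and even, so the sequence of distributions cannot settle down to one limit.

set.seed(42)
z <- rnorm(1e4)                          # X_n1 is standard normal for every n
for (n in c(7, 8)) {                     # one odd n, one even n
  xn2 <- (-1)^n * z                      # X_n2 = (-1)^n X_n1
  cat("n =", n, " sd of X_n1 + X_n2 =", round(sd(z + xn2), 3), "\n")
}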

  56. Multivariate Convergence in Distribution (cont.) In words, joint convergence in distribution (of random vectors) implies but is not implied by marginal convergence in distribution (of each component of those random vectors).

  57. Multivariate Convergence in Distribution (cont.) There is one special case where marginal convergence in distribution implies joint convergence in distribution. This is when the components of the random vectors are independent. Suppose
$$X_{ni} \xrightarrow{D} Y_i, \qquad i = 1, \ldots, k,$$
$X_n$ denotes the random vector having independent components $X_{n1}, \ldots, X_{nk}$, and $Y$ denotes the random vector having independent components $Y_1, \ldots, Y_k$. Then
$$X_n \xrightarrow{D} Y$$
(again we do not have the tools to prove this).

  58. The Multivariate Continuous Mapping Theorem Suppose $X_n \xrightarrow{D} X$ and $g$ is a function that is continuous on a set $A$ such that $\Pr(X \in A) = 1$. Then
$$g(X_n) \xrightarrow{D} g(X)$$
This fact is called the continuous mapping theorem. Here $g$ may be a function that maps vectors to vectors.

  59. Multivariate Slutsky's Theorem Suppose
$$X_n = \begin{pmatrix} X_{n1} \\ X_{n2} \end{pmatrix}$$
are partitioned random vectors and
$$X_{n1} \xrightarrow{D} Y \qquad X_{n2} \xrightarrow{P} a$$
where $Y$ is a random vector and $a$ is a constant vector. Then
$$X_n \xrightarrow{D} \begin{pmatrix} Y \\ a \end{pmatrix}$$
where the joint distribution of the right-hand side is defined in the obvious way.

  60. Multivariate Slutsky's Theorem (cont.) By an argument analogous to that in homework problem 5-6, the constant random vector $a$ is necessarily independent of the random vector $Y$, because a constant random vector is independent of any other random vector. Thus there is only one distribution the partitioned random vector
$$\begin{pmatrix} Y \\ a \end{pmatrix}$$
can have.

  61. Multivariate Slutsky's Theorem (cont.) In conjunction with the continuous mapping theorem, this more general version of Slutsky's theorem implies the earlier version. For any function $g$ that is continuous at points of the form
$$\begin{pmatrix} y \\ a \end{pmatrix}$$
we have
$$g(X_{n1}, X_{n2}) \xrightarrow{D} g(Y, a)$$

  62. The Multivariate CLT Suppose $X_1, X_2, \ldots$ is an IID sequence of random vectors having mean vector $\mu$ and variance matrix $M$, and
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$$
Then
$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{D} N(0, M)$$
which has "sloppy version"
$$\bar{X}_n \approx N\left(\mu, \frac{M}{n}\right)$$

  63. The Multivariate CLT (cont.) The multivariate CLT follows from the univariate CLT and the Cramér–Wold theorem:
$$a^T\left[\sqrt{n}(\bar{X}_n - \mu)\right] = \sqrt{n}\left(a^T \bar{X}_n - a^T \mu\right) \xrightarrow{D} N(0, a^T M a)$$
because
$$E(a^T X_n) = a^T \mu \qquad \operatorname{var}(a^T X_n) = a^T M a$$
and because, if $Y$ has the $N(0, M)$ distribution, then $a^T Y$ has the $N(0, a^T M a)$ distribution.

  64. Normal Approximation to the Multinomial The Multi($n, p$) distribution is the distribution of the sum of $n$ IID random vectors having mean vector $p$ and variance matrix $P - p p^T$, where $P$ is diagonal and its diagonal components are the components of $p$ in the same order (deck 5, slide 83). Thus the multivariate CLT ("sloppy" version) says
$$\operatorname{Multi}(n, p) \approx N\big(np, n(P - p p^T)\big)$$
when $n$ is large and $n p_i$ is not close to zero for any $i$, where $p = (p_1, \ldots, p_k)$. Note that both sides are degenerate. On both sides we have the property that the components of the random vector in question sum to $n$.
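
A quick R check (not part of the original slides) of the mean vector and variance matrix, with the illustrative values n = 100 and p = (0.2, 0.3, 0.5):

set.seed(42)
n <- 100
p <- c(0.2, 0.3, 0.5)
x <- t(rmultinom(1e4, size = n, prob = p))       # one Multi(n, p) draw per row
round(colMeans(x) - n * p, 2)                    # should be near the zero vector
round(var(x) - n * (diag(p) - p %*% t(p)), 2)    # should be near the zero matrix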

  65. The Multivariate CLT (cont.) Recall the notation (deck 3, slide 151) for ordinary moments, $\alpha_i = E(X^i)$. Consider a sequence $X_1, X_2, \ldots$ of IID random variables having moments of order $2k$, and define the random vectors
$$Y_n = \begin{pmatrix} X_n \\ X_n^2 \\ \vdots \\ X_n^k \end{pmatrix}$$
Then
$$E(Y_n) = \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_k \end{pmatrix}$$

  66. The Multivariate CLT (cont.) And the $(i, j)$ component of $\operatorname{var}(Y_n)$ is
$$\operatorname{cov}(X_n^i, X_n^j) = E(X_n^i X_n^j) - E(X_n^i) E(X_n^j) = \alpha_{i+j} - \alpha_i \alpha_j$$
so
$$\operatorname{var}(Y_n) = \begin{pmatrix} \alpha_2 - \alpha_1^2 & \alpha_3 - \alpha_1 \alpha_2 & \cdots & \alpha_{k+1} - \alpha_1 \alpha_k \\ \alpha_3 - \alpha_1 \alpha_2 & \alpha_4 - \alpha_2^2 & \cdots & \alpha_{k+2} - \alpha_2 \alpha_k \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_{k+1} - \alpha_1 \alpha_k & \alpha_{k+2} - \alpha_2 \alpha_k & \cdots & \alpha_{2k} - \alpha_k^2 \end{pmatrix}$$
Because of the assumption that moments of order $2k$ exist, $E(Y_n)$ and $\operatorname{var}(Y_n)$ exist.

  67. The Multivariate CLT (cont.) Define
$$\bar{Y}_n = \frac{1}{n} \sum_{i=1}^n Y_i$$
Then the multivariate CLT says
$$\sqrt{n}(\bar{Y}_n - \mu) \xrightarrow{D} N(0, M)$$
where
$$E(Y_n) = \mu \qquad \operatorname{var}(Y_n) = M$$

  68. Multivariate Differentiation A function $g : \mathbb{R}^d \to \mathbb{R}^k$ is differentiable at a point $x$ if there exists a matrix $B$ such that
$$g(x + h) = g(x) + Bh + o(\|h\|)$$
in which case the matrix $B$ is unique, is called the derivative of the function $g$ at the point $x$, and is denoted $\nabla g(x)$, read "del $g$ of $x$".

  69. Multivariate Differentiation (cont.) A sufficient but not necessary condition for the function
$$x = (x_1, \ldots, x_d) \mapsto g(x) = \big(g_1(x), \ldots, g_k(x)\big)$$
to be differentiable at a point $y$ is that all of the partial derivatives $\partial g_i(x) / \partial x_j$ exist and are continuous at $x = y$, in which case
$$\nabla g(x) = \begin{pmatrix} \dfrac{\partial g_1(x)}{\partial x_1} & \dfrac{\partial g_1(x)}{\partial x_2} & \cdots & \dfrac{\partial g_1(x)}{\partial x_d} \\ \dfrac{\partial g_2(x)}{\partial x_1} & \dfrac{\partial g_2(x)}{\partial x_2} & \cdots & \dfrac{\partial g_2(x)}{\partial x_d} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial g_k(x)}{\partial x_1} & \dfrac{\partial g_k(x)}{\partial x_2} & \cdots & \dfrac{\partial g_k(x)}{\partial x_d} \end{pmatrix}$$
Note that $\nabla g(x)$ is $k \times d$, as it must be in order for $[\nabla g(x)] h$ to make sense when $h$ is $d \times 1$.

  70. Multivariate Differentiation (cont.) Note also that $\nabla g(x)$ is the matrix whose determinant is the Jacobian determinant in the multivariate change-of-variable formula. For this reason it is sometimes called the Jacobian matrix.

  71. The Multivariate Delta Method The multivariate delta method is just like the univariate delta method. The proofs are analogous. Suppose
$$n^\alpha (X_n - \theta) \xrightarrow{D} Y,$$
where $\alpha > 0$, and suppose $g$ is a function differentiable at $\theta$. Then
$$n^\alpha [g(X_n) - g(\theta)] \xrightarrow{D} [\nabla g(\theta)] Y.$$

  72. The Multivariate Delta Method (cont.) Since we routinely use the delta method in the case where the rate is $\sqrt{n}$ and the limiting distribution is normal, it is worthwhile working out some details of that case. Suppose
$$\sqrt{n}(X_n - \theta) \xrightarrow{D} N(0, M),$$
and suppose $g$ is a function differentiable at $\theta$. Then the delta method says
$$\sqrt{n}[g(X_n) - g(\theta)] \xrightarrow{D} N\big(0, B M B^T\big),$$
where $B = \nabla g(\theta)$.
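
A minimal R sketch (not part of the original slides) of this result, applied to the vector of the first two sample moments of Exp(1) data with g(m1, m2) = m2 - m1^2 (the plug-in variance); the specific numbers are for illustration only.

set.seed(42)
n <- 500
theta <- c(1, 2)                          # (E(X), E(X^2)) for Exp(1)
M <- matrix(c(1, 4, 4, 20), 2, 2)         # var of (X, X^2), from alpha_{i+j} - alpha_i alpha_j
B <- matrix(c(-2 * theta[1], 1), 1, 2)    # gradient of g(m1, m2) = m2 - m1^2 at theta
B %*% M %*% t(B)                          # delta-method asymptotic variance (8 here)
g <- replicate(1e4, { x <- rexp(n); mean(x^2) - mean(x)^2 })
n * var(g)                                # simulation check: close to the value above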
