Theory of statistical inference: a lazy approach to obtaining asymptotic results in parametric models


  1. Theory of statistical inference: a lazy approach to obtaining asymptotic results in parametric models. Hien D. Nguyen¹,². ¹DECRA Research Fellow, Australian Research Council. ²Lecturer, Department of Mathematics and Statistics, La Trobe University, Melbourne, Australia. (Contact: h.nguyen5@latrobe.edu.au; Twitter: @tresbienhien; Website: hiendn.github.io) S4D, Caen, 2018 June 21

  2. Framework. Suppose that we observe $\{Z_i\}$, $i \in \{1,\dots,n\}$, from some data generating process (DGP). Define a function $Q_n(\theta)$ that depends on $\{Z_i\}$, where $\theta \in \Theta$ and $\Theta$ is a subset of a Euclidean space. We call $Q_n$ the objective function, $\theta$ the parameter vector, and $\Theta$ the parameter space.

  3. Extremum estimation. Following the nomenclature of Amemiya (1985), we say that the vector
$$\theta_0 \equiv \operatorname*{argmax}_{\theta \in \Theta} Q(\theta)$$
is the extremum parameter of $Q$, where $n^{-1} Q_n \to Q$ in some sense (to be defined). We call
$$\hat{\theta}_n \equiv \operatorname*{argmax}_{\theta \in \Theta} Q_n(\theta)$$
the extremum estimator of $\theta_0$.

  4. A rose by any other name... We call the process of obtaining the extremum estimator extremum estimation. Extremum estimation has appeared in the literature under numerous names: empirical risk minimization (Vapnik, 1998, 2000); M-estimation (Huber, 1964; Serfling, 1980); minimum contrast estimation (Pfanzagl, 1969; Bickel and Doksum, 2000).

  5. Some specific cases. Important cases include: generalized method of moments; loss function minimization (e.g., fitting support vector machines, neural networks, etc.); maximum likelihood estimation (including empirical-, partial-, penalized-, pseudo-, quasi-, restricted-, etc.); maximum a posteriori estimation; minimum distance estimation (e.g., least squares, least absolute deviation, etc.).

  6. Statistical inference. Since $\theta_0$ is defined as the maximizer of $Q$, it must contain some information regarding the DGP of $\{Z_i\}$. 1. We hope that, given $Q_n$, $\hat{\theta}_n$ will provide us with the same information regarding $Q$, provided that $n$ is large enough. 2. We also hope that $\hat{\theta}_n$ has some DGP that depends on $\theta_0$, which allows us to assess a priori hypotheses regarding $\theta_0$.

  7. Ordinary least squares (1A). Suppose that we observe independent and identically distributed (IID) data pairs $Z_i = (X_i, Y_i)$, where $Y_i = X_i^\top \theta_* + E_i$, $\mathrm{E}(E_i) = 0$, and the DGP of $Z_i$ is, in some sense, well behaved. Here, $\theta_* \in \Theta \subset \mathbb{R}^p$, $X_i \in \mathbb{X} \subset \mathbb{R}^p$, $p \in \mathbb{N}$, and $\{E_i\}$ is independent of $\{X_i\}$. Define the (negative) sum of squares as
$$Q_n(\theta) = -\frac{1}{2} \sum_{i=1}^n \left( Y_i - X_i^\top \theta \right)^2.$$
The least-squares estimator is defined as
$$\hat{\theta}_n \equiv \operatorname*{argmax}_{\theta \in \Theta} \, -\frac{1}{2} \sum_{i=1}^n \left( Y_i - X_i^\top \theta \right)^2.$$

  8. Ordinary least squares (1B). We can obtain $\hat{\theta}_n$ by solving the first-order condition (FOC)
$$\nabla Q_n = \sum_{i=1}^n X_i \left( Y_i - X_i^\top \theta \right) = 0
\implies \sum_{i=1}^n X_i X_i^\top \theta = \sum_{i=1}^n X_i Y_i
\implies \hat{\theta}_n = \left( \sum_{i=1}^n X_i X_i^\top \right)^{-1} \sum_{i=1}^n X_i Y_i.$$
More familiarly, if we put $X_i^\top$ into the $i$th row of $\mathbf{X}_n \in \mathbb{R}^{n \times p}$ and put $Y_i$ into the $i$th position of $\mathbf{y}_n \in \mathbb{R}^n$, then we can write
$$\hat{\theta}_n = \left( \mathbf{X}_n^\top \mathbf{X}_n \right)^{-1} \mathbf{X}_n^\top \mathbf{y}_n.$$
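A minimal numerical sketch of this closed form, assuming a standard NumPy setup; the simulated DGP below (standard normal covariates and errors, and the particular $\theta_*$) is an illustrative choice, not one from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 3
theta_star = np.array([1.0, -2.0, 0.5])  # hypothetical generative parameter theta_*

X = rng.normal(size=(n, p))              # rows are X_i^T; stacked, this is X_n
y = X @ theta_star + rng.normal(size=n)  # Y_i = X_i^T theta_* + E_i, with E_i ~ N(0, 1)

# Normal equations: (X_n^T X_n) theta = X_n^T y_n; solve rather than invert
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # approximately theta_star for large n
```

Solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse, which is the standard numerically stable choice.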

  9. Ordinary least squares (1C). Since $\hat{\theta}_n$ is an estimate of $\theta_0$, we must determine whether there is a sensible relationship between $Q_n$ and $\theta_0$. The following is a heuristic argument. Note that $\stackrel{p}{\to}$ denotes convergence in probability. 1. Notice that $n^{-1} Q_n = n^{-1} \sum_{i=1}^n g(Z_i)$, where $g(Z_i) = -\frac{1}{2} \left( Y_i - X_i^\top \theta \right)^2$. 2. Since $Z_i$ is well behaved, a weak law of large numbers implies that
$$n^{-1} Q_n \stackrel{p}{\to} \mathrm{E}\left[ g(Z_i) \right] = -\frac{1}{2} \mathrm{E}\left[ \left( Y_i - X_i^\top \theta \right)^2 \right] \equiv Q.$$
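The pointwise convergence in step 2 can be checked by simulation. A sketch under an assumed DGP with $\mathrm{E}(X_i X_i^\top) = I$ and $\sigma^2 = 1$, for which the limit works out to $Q(\theta) = -\frac{1}{2}(\|\theta_* - \theta\|^2 + 1)$; the specific parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = np.array([1.0, -2.0])
theta = np.array([0.5, 0.5])  # an arbitrary fixed evaluation point

def scaled_objective(n):
    """Compute n^{-1} Q_n(theta) = -(1/(2n)) sum_i (Y_i - X_i^T theta)^2."""
    X = rng.normal(size=(n, 2))
    y = X @ theta_star + rng.normal(size=n)
    return -0.5 * np.mean((y - X @ theta) ** 2)

# With E(X_i X_i^T) = I and sigma^2 = 1, the limit is
# Q(theta) = -(1/2) * (||theta_star - theta||^2 + 1).
Q_limit = -0.5 * (np.sum((theta_star - theta) ** 2) + 1.0)

for n in [10**2, 10**4, 10**6]:
    print(n, scaled_objective(n), Q_limit)  # gap shrinks as n grows
```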

  10. Ordinary least squares (1D). 3. Suppose that we can exchange integration and differentiation; then the FOC implies that
$$\nabla Q = \mathrm{E}\left[ X_i \left( Y_i - X_i^\top \theta \right) \right]
= \mathrm{E}\left[ X_i \left( X_i^\top \theta_* + E_i - X_i^\top \theta \right) \right]
= \mathrm{E}\left( X_i X_i^\top \right) \theta_* + \mathrm{E}\left( X_i E_i \right) - \mathrm{E}\left( X_i X_i^\top \right) \theta.$$
4. Under the assumption that $\mathrm{E}(X_i E_i) = 0$ (e.g., independence between $\{X_i\}$ and $\{E_i\}$), we have
$$0 = \mathrm{E}\left( X_i X_i^\top \right) \theta_* - \mathrm{E}\left( X_i X_i^\top \right) \theta
\implies \theta_0 = \operatorname*{argmax}_{\theta \in \Theta} Q = \theta_*.$$
Thus, in this case, we have found that $\theta_0$ is the generative parameter $\theta_*$!
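A sketch of the moment-equation view of step 4: with a large sample standing in for the population moments (the mixing matrix and $\theta_*$ below are hypothetical choices), solving $\mathrm{E}(X_i X_i^\top)\,\theta = \mathrm{E}(X_i Y_i)$ recovers $\theta_*$ even with correlated covariates:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_star = np.array([1.0, -2.0])
A = np.array([[1.0, 0.7], [0.0, 1.0]])   # mixing matrix: makes covariates correlated

n = 10**6                                # large n, so averages approximate moments
X = rng.normal(size=(n, 2)) @ A.T        # E(X_i X_i^T) = A A^T, positive definite
y = X @ theta_star + rng.normal(size=n)  # E(X_i E_i) = 0 by independence

# theta_0 solves E(X_i X_i^T) theta = E(X_i Y_i); estimate both sides by averages
Exx = X.T @ X / n
Exy = X.T @ y / n
print(np.linalg.solve(Exx, Exy))  # approximately theta_star, as the argument predicts
```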

  11. Consistency. We must now make precise the notion of how $\hat{\theta}_n$ and $\theta_0$ are related. Earlier, we defined $\stackrel{p}{\to}$ to denote convergence in probability. We say that a random variable $U_n$ converges in probability to another random variable $U$ if, for every $\varepsilon > 0$, we have
$$\lim_{n \to \infty} \mathrm{P}\left( \| U_n - U \| > \varepsilon \right) = 0,$$
where $\|\cdot\|$ is some appropriate norm (usually Euclidean, in our case). We say that $\hat{\theta}_n$ is a consistent estimator of $\theta_0$ if $\hat{\theta}_n \stackrel{p}{\to} \theta_0$.
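The defining probability can be estimated by Monte Carlo: a sketch, under the same illustrative DGP as in the earlier snippets, showing the exceedance frequency falling as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(3)
theta_star = np.array([1.0, -2.0])
eps, reps = 0.1, 500  # tolerance and number of Monte Carlo replications

def ols(n):
    """One draw of the extremum (least-squares) estimator at sample size n."""
    X = rng.normal(size=(n, 2))
    y = X @ theta_star + rng.normal(size=n)
    return np.linalg.solve(X.T @ X, X.T @ y)

for n in [50, 500, 5000]:
    exceed = np.mean([np.linalg.norm(ols(n) - theta_star) > eps
                      for _ in range(reps)])
    print(n, exceed)  # estimated P(||theta_hat_n - theta_star|| > eps) decreases in n
```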

  12. Proving consistency (1). We present the consistency result of Amemiya (1985, Thm. 4.1.1); see also van der Vaart (1998, Thm. 5.7). Make the following assumptions: (A) The parameter space $\Theta$ is a compact subset of a Euclidean space $\mathbb{R}^p$ ($p \in \mathbb{N}$). (B) $Q_n(\theta)$ is continuous in $\theta$ for all $\{Z_i\}$, and measurable in $\{Z_i\}$ for all $\theta$. (C) $n^{-1} Q_n(\theta)$ converges to a non-stochastic function $Q(\theta)$ in probability, uniformly in $\theta$ over $\Theta$. (D) $Q(\theta)$ attains a unique global maximum at $\theta_0$.

  13. Proving consistency (2). Under Assumptions (A)–(D), the extremum estimator, defined as
$$\hat{\theta}_n \equiv \operatorname*{argmax}_{\theta \in \Theta} Q_n(\theta),$$
is consistent, in the sense that $\hat{\theta}_n \stackrel{p}{\to} \theta_0$. Here, we say that $n^{-1} Q_n(\theta)$ converges in probability uniformly to $Q(\theta)$ if, for any $\varepsilon > 0$,
$$\lim_{n \to \infty} \mathrm{P}\left( \sup_{\theta \in \Theta} \left| n^{-1} Q_n(\theta) - Q(\theta) \right| > \varepsilon \right) = 0.$$
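Uniform convergence can be probed numerically by taking the maximum gap over a grid of $\theta$ values (a finite-grid surrogate for the supremum over $\Theta$); a sketch in the scalar case, with the same illustrative DGP assumptions as before:

```python
import numpy as np

rng = np.random.default_rng(4)
theta_star = 1.0                      # scalar case, with Theta = [-3, 3]
grid = np.linspace(-3.0, 3.0, 301)   # finite grid standing in for Theta

def sup_gap(n):
    """max over the grid of |n^{-1} Q_n(theta) - Q(theta)|."""
    x = rng.normal(size=n)
    y = x * theta_star + rng.normal(size=n)
    qn = np.array([-0.5 * np.mean((y - x * t) ** 2) for t in grid])
    # Q(theta) = -(1/2)((theta_star - theta)^2 + 1), since E(x^2) = sigma^2 = 1
    q = -0.5 * ((theta_star - grid) ** 2 + 1.0)
    return np.max(np.abs(qn - q))

for n in [10**2, 10**4, 10**6]:
    print(n, sup_gap(n))  # the sup-gap shrinks, in line with Assumption (C)
```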

  14. Uniform weak law of large numbers. The most difficult part, in general, of applying Amemiya (1985, Thm. 4.1.1) is checking Assumption (C). The main traditional tool that we will apply is the uniform weak law of large numbers of Jennrich (1969) (see also Amemiya, 1985, Thm. 4.2.1): Let $Q_n(\theta) = \sum_{i=1}^n g(Z_i; \theta)$ be a measurable function of the IID sequence $\{Z_i\}$, where $Z_i$ is supported on a Euclidean space, for each $\theta \in \Theta$, where $\Theta$ is compact and Euclidean. If $\mathrm{E}[g(Z_i; \theta)]$ exists and $\mathrm{E}[\sup_{\theta \in \Theta} g(Z_i; \theta)] < \infty$, then $n^{-1} Q_n(\theta)$ converges in probability uniformly to $Q(\theta) = \mathrm{E}[g(Z_i; \theta)]$.

  15. Ordinary least squares (2A). Make the following assumptions: (a) $\{Z_i\}$ is an IID sequence and the DGP of $Z_i = (X_i, Y_i)$ is such that $\mathrm{E}(X_i X_i^\top)$ exists and is positive definite, $\mathrm{E}(E_i) = 0$, $\mathrm{E}(E_i^2) = \sigma^2 < \infty$, and $\mathrm{E}(X_i E_i) = 0$, where $Y_i = X_i^\top \theta_* + E_i$. (b) The parameter space is $\Theta = [-L, L]^p$, where $L$ is sufficiently large.

  16. Ordinary least squares (2B). By (b), $\Theta$ is a compact subset of Euclidean space; thus, (A) is validated. We can write $Q_n(\theta) = \sum_{i=1}^n g(Z_i; \theta)$, where $g(Z_i; \theta) = -\frac{1}{2} \left( Y_i - X_i^\top \theta \right)^2$,
$$\left( Y_i - X_i^\top \theta \right)^2 = Y_i^2 - 2 Y_i X_i^\top \theta + \theta^\top X_i X_i^\top \theta,$$
and
$$\mathrm{E}\left[ \left( Y_i - X_i^\top \theta \right)^2 \right] = \mathrm{E}\left( Y_i^2 \right) - 2 \mathrm{E}\left( Y_i X_i^\top \right) \theta + \theta^\top \mathrm{E}\left( X_i X_i^\top \right) \theta.$$

  17. Ordinary least squares (2C). Continuing from the previous slide, substituting $Y_i = X_i^\top \theta_* + E_i$ and applying (a), we have:
$$\begin{aligned}
\mathrm{E}\left[ \left( Y_i - X_i^\top \theta \right)^2 \right]
&= \theta_*^\top \mathrm{E}\left( X_i X_i^\top \right) \theta_* + 2 \mathrm{E}\left( E_i X_i^\top \right) \theta_* + \mathrm{E}\left( E_i^2 \right) \\
&\quad - 2 \theta^\top \mathrm{E}\left( X_i X_i^\top \right) \theta_* - 2 \mathrm{E}\left( E_i X_i^\top \right) \theta + \theta^\top \mathrm{E}\left( X_i X_i^\top \right) \theta \\
&= \theta_*^\top \mathrm{E}\left( X_i X_i^\top \right) \theta_* - 2 \theta^\top \mathrm{E}\left( X_i X_i^\top \right) \theta_* + \theta^\top \mathrm{E}\left( X_i X_i^\top \right) \theta + \sigma^2.
\end{aligned}$$
Since $\mathrm{E}(X_i X_i^\top)$ exists, $Q_n$ is measurable, and $g$ is quadratic in $\theta$ and thus continuous, we have the validation of (B).
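The closed-form expectation above can be sanity-checked by Monte Carlo; a sketch in which the second-moment matrix $M = \mathrm{E}(X_i X_i^\top)$ and the parameter values are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(5)
theta_star = np.array([1.0, -2.0])
theta = np.array([0.3, 0.9])            # an arbitrary evaluation point
M = np.array([[2.0, 0.5], [0.5, 1.0]])  # chosen E(X_i X_i^T), positive definite
sigma2 = 1.0

n = 10**6
L = np.linalg.cholesky(M)
X = rng.normal(size=(n, 2)) @ L.T       # so that E(X_i X_i^T) = M
E = rng.normal(scale=np.sqrt(sigma2), size=n)
y = X @ theta_star + E

lhs = np.mean((y - X @ theta) ** 2)     # Monte Carlo E[(Y_i - X_i^T theta)^2]
rhs = (theta_star @ M @ theta_star - 2 * theta @ M @ theta_star
       + theta @ M @ theta + sigma2)    # closed form from the slide
print(lhs, rhs)  # agree up to Monte Carlo error
```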

  18. Ordinary least squares (2D). Write $Q_n = \sum_{i=1}^n g(Z_i; \theta)$, where $g(Z_i; \theta) = -\frac{1}{2} \left( Y_i - X_i^\top \theta \right)^2$. From the previous slide, we have the fact that
$$\mathrm{E}\left[ g(Z_i; \theta) \right] = -\frac{1}{2} \theta_*^\top \mathrm{E}\left( X_i X_i^\top \right) \theta_* + \theta^\top \mathrm{E}\left( X_i X_i^\top \right) \theta_* - \frac{1}{2} \theta^\top \mathrm{E}\left( X_i X_i^\top \right) \theta - \frac{\sigma^2}{2}.$$
By (b), $\Theta$ is compact, and we have established that $g$ is continuous. Thus, via the Weierstrass extreme value theorem,
$$\mathrm{E}\left[ \sup_{\theta \in \Theta} g(Z_i; \theta) \right] \le M < \infty.$$
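As a final sketch, the closed-form $\mathrm{E}[g(Z_i; \theta)]$ can be evaluated over a grid to confirm that its unique maximizer is $\theta_*$, in line with Assumption (D); the moment matrix and parameter values are the same hypothetical choices as above:

```python
import numpy as np

theta_star = np.array([1.0, -2.0])
M = np.array([[2.0, 0.5], [0.5, 1.0]])  # E(X_i X_i^T), as in the previous sketch
sigma2 = 1.0

def Eg(theta):
    """Closed-form E[g(Z_i; theta)] from the slide."""
    return (-0.5 * theta_star @ M @ theta_star + theta @ M @ theta_star
            - 0.5 * theta @ M @ theta - 0.5 * sigma2)

# Maximize over a grid on Theta = [-3, 3]^2 (a finite surrogate for Theta)
ts = np.linspace(-3.0, 3.0, 121)
vals = np.array([[Eg(np.array([a, b])) for b in ts] for a in ts])
i, j = np.unravel_index(np.argmax(vals), vals.shape)
print(ts[i], ts[j])  # (1.0, -2.0) = theta_star: the unique maximizer
```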
