

SLIDE 1

Theory of statistical inference: a lazy approach to obtaining asymptotic results in parametric models

Hien D. Nguyen¹,²

¹DECRA Research Fellow, Australian Research Council. ²Lecturer, Department of Mathematics and Statistics, La Trobe University, Melbourne, Australia. (Email: h.nguyen5@latrobe.edu.au, Twitter: @tresbienhien, Website: hiendn.github.io)

S4D, Caen, 2018 June 21

SLIDE 2

Framework

Suppose that we observe $\{Z_i\}$, $i \in \{1,\dots,n\}$, from some data generating process (DGP).

Define a function $Q_n(\theta)$ that depends on $\{Z_i\}$, where $\theta \in \Theta$ and $\Theta$ is a subset of a Euclidean space. We call $Q_n$ the objective function and $\theta$ the parameter vector. We say that $\Theta$ is the parameter space.

SLIDE 3

Extremum estimation

Following the nomenclature of Amemiya (1985), we say that the vector
$$\theta_0 \equiv \underset{\theta\in\Theta}{\arg\max}\, Q(\theta)$$
is the extremum parameter of $Q$, where $n^{-1}Q_n \to Q$ in some sense (to be defined). We call
$$\hat\theta_n \equiv \underset{\theta\in\Theta}{\arg\max}\, Q_n(\theta)$$
the extremum estimator of $\theta_0$.

SLIDE 4

A rose by any other name...

We call the process of obtaining the extremum estimator: extremum estimation. Extremum estimation has appeared in the literature under numerous names:

  • Empirical risk minimization (Vapnik, 1998, 2000).
  • M-estimation (Huber, 1964; Serfling, 1980).
  • Minimum contrast estimation (Pfanzagl, 1969; Bickel and Doksum, 2000).

SLIDE 5

Some specific cases

Important cases include:

  • Generalized method of moments.
  • Loss function minimization (e.g. fitting support vector machines, neural networks, etc.).
  • Maximum likelihood estimation (including empirical-, partial-, penalized-, pseudo-, quasi-, restricted-, etc.).
  • Maximum a posteriori estimation.
  • Minimum distance estimation (e.g. least squares, least absolute deviations, etc.).

SLIDE 6

Statistical inference

Since $\theta_0$ is defined as the maximizer of $Q$, it must contain some information regarding the DGP of $\{Z_i\}$.

  • 1. We hope that, given $Q_n$, $\hat\theta_n$ will provide us with the same information regarding $Q$, provided that $n$ is large enough.
  • 2. We also hope that $\hat\theta_n$ has some DGP that depends on $\theta_0$, which allows us to assess a priori hypotheses regarding $\theta$.

SLIDE 7

Ordinary least squares (1A)

Suppose that we observe independent and identically distributed (IID) data pairs $Z_i = (X_i, Y_i)$, where $Y_i = X_i^\top\theta_* + E_i$, $\mathbb{E}(E_i) = 0$, and the DGP of $Z_i$ is, in some sense, well-behaved. Here, $\theta_* \in \Theta \subset \mathbb{R}^p$, $X_i \in \mathbb{X} \subset \mathbb{R}^p$, $p \in \mathbb{N}$, and $\{E_i\}$ is independent of $\{X_i\}$.

Define the (negative) sum-of-squares as
$$Q_n(\theta) = -\frac{1}{2}\sum_{i=1}^n \big(Y_i - X_i^\top\theta\big)^2.$$
The least-squares estimator is defined as
$$\hat\theta_n \equiv \underset{\theta\in\Theta}{\arg\max}\; -\frac{1}{2}\sum_{i=1}^n \big(Y_i - X_i^\top\theta\big)^2.$$

SLIDE 8

Ordinary least squares (1B)

We can obtain $\hat\theta_n$ by solving the first-order condition (FOC)
$$\nabla Q_n = \sum_{i=1}^n X_i\big(Y_i - X_i^\top\theta\big) = 0 \implies \sum_{i=1}^n X_iX_i^\top\,\theta = \sum_{i=1}^n X_iY_i \implies \hat\theta_n = \Big(\sum_{i=1}^n X_iX_i^\top\Big)^{-1}\sum_{i=1}^n X_iY_i.$$
More familiarly, if we put $X_i^\top$ into the $i$th row of $\mathbf{X}_n \in \mathbb{R}^{n\times p}$ and put $Y_i$ into the $i$th position of $\mathbf{y}_n \in \mathbb{R}^n$, then we can write
$$\hat\theta_n = \big(\mathbf{X}_n^\top\mathbf{X}_n\big)^{-1}\mathbf{X}_n^\top\mathbf{y}_n.$$
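The matrix form of the estimator can be checked numerically. The following is a minimal sketch (not from the talk), with a simulated Gaussian design and an assumed generative parameter `theta_star`:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
theta_star = np.array([1.0, -2.0, 0.5])   # assumed generative parameter
X = rng.normal(size=(n, p))               # rows are X_i^T
Y = X @ theta_star + rng.normal(size=n)   # Y_i = X_i^T theta* + E_i

# Normal equations: theta_hat = (X^T X)^{-1} X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against a standard least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(theta_hat, theta_lstsq)
```

Both routes give the same maximizer of $Q_n$; solving the normal equations directly mirrors the algebra on the slide, while `lstsq` is the numerically safer choice in practice.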

SLIDE 9

Ordinary least squares (1C)

Since $\hat\theta_n$ is an estimate of $\theta_0$, we must determine whether there is a sensible relationship between $Q_n$ and $\theta_0$. The following is a heuristic argument. Note that $\xrightarrow{p}$ denotes convergence in probability.

  • 1. Notice that $n^{-1}Q_n = n^{-1}\sum_{i=1}^n g(Z_i)$, for $g(Z_i) = -\frac{1}{2}\big(Y_i - X_i^\top\theta\big)^2$.
  • 2. Since $Z_i$ is well-behaved, a weak law of large numbers implies that
$$n^{-1}Q_n \xrightarrow{p} \mathbb{E}[g(Z_i)] = -\frac{1}{2}\mathbb{E}\big(Y_i - X_i^\top\theta\big)^2 \equiv Q.$$
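The convergence $n^{-1}Q_n \xrightarrow{p} Q$ can be illustrated by simulation. A sketch of my own (not from the talk), assuming $X_i \sim \mathcal{N}(0, I)$ so that $\mathbb{E}(X_iX_i^\top) = I$ and the limit $Q$ at a fixed $\theta$ has the closed form $-\frac{1}{2}(\|\theta_* - \theta\|^2 + \sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = np.array([1.0, -2.0, 0.5])
theta = np.zeros(3)          # evaluate Q_n at a fixed theta
sigma2 = 1.0

# With X_i ~ N(0, I), E[X X^T] = I, so
# Q(theta) = -(1/2) * (||theta* - theta||^2 + sigma^2)
Q_limit = -0.5 * (np.sum((theta_star - theta) ** 2) + sigma2)

for n in (100, 100_000):
    X = rng.normal(size=(n, 3))
    Y = X @ theta_star + rng.normal(size=n)
    Qn_over_n = -0.5 * np.mean((Y - X @ theta) ** 2)
    print(n, Qn_over_n, Q_limit)  # gap shrinks as n grows
```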

SLIDE 10

Ordinary least squares (1D)

  • 3. Suppose that we can exchange integration and differentiation; then the FOC implies that
$$\nabla Q = \mathbb{E}\big[X_i\big(Y_i - X_i^\top\theta\big)\big] = \mathbb{E}\big[X_i\big(X_i^\top\theta_* + E_i - X_i^\top\theta\big)\big] = \mathbb{E}\big(X_iX_i^\top\big)\theta_* + \mathbb{E}(X_iE_i) - \mathbb{E}\big(X_iX_i^\top\big)\theta.$$
  • 4. Under the assumption that $\mathbb{E}(X_iE_i) = 0$ (e.g. independence between $\{X_i\}$ and $\{E_i\}$), we have
$$\nabla Q = \mathbb{E}\big(X_iX_i^\top\big)\theta_* - \mathbb{E}\big(X_iX_i^\top\big)\theta \implies \theta_0 = \underset{\theta\in\Theta}{\arg\max}\, Q = \theta_*.$$
Thus, in this case, we have found that $\theta_0$ is the generative parameter $\theta_*$!

SLIDE 11

Consistency

We must now make precise the notion of how $\hat\theta_n$ and $\theta_0$ are related. Earlier, we defined $\xrightarrow{p}$ to denote convergence in probability. We say that a random variable $U_n$ converges in probability to another random variable $U$ if, for every $\varepsilon > 0$, we have
$$\lim_{n\to\infty}\mathbb{P}\big(\|U_n - U\| > \varepsilon\big) = 0,$$
where $\|\cdot\|$ is some appropriate norm (usually Euclidean, in our case). We say that $\hat\theta_n$ is a consistent estimator of $\theta_0$ if $\hat\theta_n \xrightarrow{p} \theta_0$.

SLIDE 12

Proving consistency (1)

We present the consistency result of Amemiya (1985, Thm. 4.1.1). See also van der Vaart (1998, Thm. 5.7). Make the following assumptions:

(A) The parameter space $\Theta$ is a compact subset of a Euclidean space $\mathbb{R}^p$ ($p \in \mathbb{N}$).
(B) $Q_n(\theta)$ is a continuous function in $\theta$ for all $\{Z_i\}$, and measurable in $\{Z_i\}$ for all $\theta$.
(C) $n^{-1}Q_n(\theta)$ converges to a non-stochastic function $Q(\theta)$ in probability, uniformly in $\theta$ over $\Theta$.
(D) $Q(\theta)$ attains a unique global maximum at $\theta_0$.

SLIDE 13

Proving consistency (2)

Under Assumptions (A)–(D), the extremum estimator (EE), defined as $\hat\theta_n \equiv \arg\max_{\theta\in\Theta} Q_n(\theta)$, is consistent, in the sense that $\hat\theta_n \xrightarrow{p} \theta_0$. Here, we say that $n^{-1}Q_n(\theta)$ converges in probability uniformly to $Q(\theta)$ if, for any $\varepsilon > 0$,
$$\lim_{n\to\infty}\mathbb{P}\Big(\sup_{\theta\in\Theta}\big|n^{-1}Q_n(\theta) - Q(\theta)\big| > \varepsilon\Big) = 0.$$

SLIDE 14

Uniform weak law of large numbers

The most difficult part, in general, of applying Amemiya (1985, Thm. 4.1.1) is checking Assumption (C). The main traditional tool that we will apply is the uniform weak law of large numbers of Jennrich (1969) (see also Amemiya, 1985, Thm. 4.2.1): Let $Q_n(\theta) = \sum_{i=1}^n g(Z_i;\theta)$ be a measurable function of the IID sequence $\{Z_i\}$, where $Z_i$ is supported on a Euclidean space, for each $\theta \in \Theta$, where $\Theta$ is compact and Euclidean. If $\mathbb{E}[g(Z_i;\theta)]$ exists and $\mathbb{E}\big[\sup_{\theta\in\Theta}|g(Z_i;\theta)|\big] < \infty$, then $n^{-1}Q_n(\theta)$ converges in probability uniformly to $Q(\theta) = \mathbb{E}[g(Z_i;\theta)]$.

SLIDE 15

Ordinary least squares (2A)

Make the following assumptions:

(a) $\{Z_i\}$ is an IID sequence, and the DGP of $Z_i = (X_i, Y_i)$ is such that $\mathbb{E}\big(X_iX_i^\top\big)$ exists and is positive definite, $\mathbb{E}(E_i) = 0$, $\mathbb{E}\big(E_i^2\big) = \sigma^2 < \infty$, and $\mathbb{E}(X_iE_i) = 0$, where $Y_i = X_i^\top\theta_* + E_i$.
(b) The parameter space is $\Theta = [-L, L]^p$, where $L$ is sufficiently large.

SLIDE 16

Ordinary least squares (2B)

By (b), $\Theta$ is a compact subset of a Euclidean space, so (A) is validated. We can write $Q_n(\theta) = \sum_{i=1}^n g(Z_i;\theta)$, where
$$-2g = \big(Y_i - X_i^\top\theta\big)^2 = Y_i^2 - 2Y_iX_i^\top\theta + \theta^\top X_iX_i^\top\theta$$
and
$$\mathbb{E}\big(Y_i - X_i^\top\theta\big)^2 = \mathbb{E}\big(Y_i^2\big) - 2\mathbb{E}\big(Y_iX_i^\top\big)\theta + \theta^\top\mathbb{E}\big(X_iX_i^\top\big)\theta.$$

SLIDE 17

Ordinary least squares (2C)

Continuing from the previous slide and applying (a), we have:
$$\mathbb{E}\big(Y_i - X_i^\top\theta\big)^2 = \theta_*^\top\mathbb{E}\big(X_iX_i^\top\big)\theta_* + 2\mathbb{E}\big(E_iX_i^\top\big)\theta_* + \mathbb{E}\big(E_i^2\big) - 2\theta^\top\mathbb{E}\big(X_iX_i^\top\big)\theta_* - 2\mathbb{E}\big(E_iX_i^\top\big)\theta + \theta^\top\mathbb{E}\big(X_iX_i^\top\big)\theta$$
$$= \theta_*^\top\mathbb{E}\big(X_iX_i^\top\big)\theta_* - 2\theta^\top\mathbb{E}\big(X_iX_i^\top\big)\theta_* + \theta^\top\mathbb{E}\big(X_iX_i^\top\big)\theta + \sigma^2.$$
Since $\mathbb{E}\big(X_iX_i^\top\big)$ exists, $Q_n$ is measurable; and $g$ is quadratic in $\theta$, hence continuous. This validates (B).

SLIDE 18

Ordinary least squares (2D)

Write $Q_n = \sum_{i=1}^n g(Z_i;\theta)$, where $g(Z_i;\theta) = -\frac{1}{2}\big(Y_i - X_i^\top\theta\big)^2$. From the previous slide, we have
$$\mathbb{E}[g(Z_i;\theta)] = -\tfrac{1}{2}\theta_*^\top\mathbb{E}\big(X_iX_i^\top\big)\theta_* + \theta^\top\mathbb{E}\big(X_iX_i^\top\big)\theta_* - \tfrac{1}{2}\theta^\top\mathbb{E}\big(X_iX_i^\top\big)\theta - \tfrac{1}{2}\sigma^2.$$
By (b), $\Theta$ is compact, and we have established that $g$ is continuous. Thus, via the Weierstrass extreme value theorem,
$$\mathbb{E}\Big[\sup_{\theta\in\Theta}|g(Z_i;\theta)|\Big] \le M < \infty.$$

SLIDE 19

Ordinary least squares (2E)

Via the theorem of Jennrich (1969), we conclude that $n^{-1}Q_n$ converges in probability uniformly to $\mathbb{E}[g(Z_i;\theta)]$. Finally, we observe that $\mathbb{E}[g(Z_i;\theta)]$ is a concave quadratic in $\theta$, since $\mathbb{E}\big(X_iX_i^\top\big)$ is positive definite (it may be linear otherwise), so $\mathbb{E}[g(Z_i;\theta)]$ has a unique global maximum, and thus (D) is validated.

The global maximum is $\theta_0 = \theta_*$.

We have validated (A)–(D), and thus can conclude that $\hat\theta_n$ is a consistent estimator of $\theta_0$.
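The consistency conclusion can be watched happening in simulation. A sketch of my own (not from the talk), with simulated data and an assumed `theta_star`; the estimation error should shrink roughly like $n^{-1/2}$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_star = np.array([1.0, -2.0, 0.5])

def ols_error(n):
    # Distance between the extremum estimator and theta_0 = theta*
    X = rng.normal(size=(n, 3))
    Y = X @ theta_star + rng.normal(size=n)
    theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    return np.linalg.norm(theta_hat - theta_star)

errors = [ols_error(n) for n in (10**2, 10**3, 10**4, 10**5)]
print(errors)
```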

SLIDE 20

Asymptotic normality

We would now like to establish, in a more precise manner, how $\hat\theta_n$ fluctuates around $\theta_0$ as it converges. In most cases,
$$n^{1/2}\big(\hat\theta_n - \theta_0\big) \xrightarrow{d} \mathcal{N}(0, \Sigma).$$
We write $\xrightarrow{d}$ to denote convergence in distribution, and $\mathcal{N}(\mu, \Sigma)$ to denote the multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$.

Convergence in distribution can be characterized in numerous ways (cf. the famous Portmanteau lemma; see, e.g., van der Vaart, 1998, Lem. 2.2). The Lévy continuity theorem states that $U_n$ converges in distribution to $U$ if and only if the characteristic function of $U_n$ converges pointwise to that of $U$ (cf. van der Vaart, 1998, Thm. 2.13).

SLIDE 21

Proving asymptotic normality (1)

We now present the asymptotic normality result of Amemiya (1985, Thm. 4.1.6). Make the following assumptions:

(A1) The parameter $\theta_0$ is in the interior (an open subset) of the Euclidean parameter space $\Theta$.
(B1) The objective $Q_n(\theta)$ is continuous and measurable with respect to $\{Z_i\}$, for all $\theta \in \Theta$, and the partial derivative $(\nabla Q_n)(\theta)$ exists and is continuous in an open neighborhood $N_1$ of $\theta_0$.
(C1) There exists an open neighborhood $N_2$ of $\theta_0$ in which $n^{-1}Q_n(\theta)$ converges in probability uniformly to a non-stochastic function $Q(\theta)$, and $Q(\theta)$ attains a strict local maximum at $\theta_0$.

SLIDE 22

Proving asymptotic normality (2)

Make the further assumptions:

(A2) The Hessian matrix $(HQ_n)(\theta) \equiv \partial^2 Q_n/\partial\theta\,\partial\theta^\top$ exists and is continuous in an open and convex neighborhood of $\theta_0$.
(B2) For any sequence $\theta_n$ such that $\theta_n \xrightarrow{p} \theta_0$, $n^{-1}(HQ_n)(\theta_n)$ converges in probability to
$$A(\theta_0) \equiv \lim_{n\to\infty}\mathbb{E}\big[n^{-1}(HQ_n)(\theta_0)\big].$$
(C2) $n^{-1/2}(\nabla Q_n)(\theta_0) \xrightarrow{d} \mathcal{N}(0, B(\theta_0))$, where
$$B(\theta_0) \equiv \lim_{n\to\infty}\mathbb{E}\big[n^{-1}(\nabla Q_n)(\theta_0)(\nabla Q_n)^\top(\theta_0)\big].$$

SLIDE 23

Proving asymptotic normality (3)

Define $\bar\Theta_n$ to be the set $\bar\Theta_n = \{\theta_n : (\nabla Q_n)(\theta_n) = 0\}$. Under Assumptions (A1)–(C1) and (A2)–(C2), if $\hat\theta_n$ is a sequence of local maximizers taking values in $\bar\Theta_n$, such that $\hat\theta_n \xrightarrow{p} \theta_0$, then
$$n^{1/2}\big(\hat\theta_n - \theta_0\big) \xrightarrow{d} \mathcal{N}\big(0,\, A^{-1}(\theta_0)\,B(\theta_0)\,A^{-1}(\theta_0)\big).$$

SLIDE 24

Ordinary least squares (3A)

Make the following assumptions:

(a) $\{Z_i\}$ is an IID sequence, and the DGP of $Z_i = (X_i, Y_i)$ is such that $\mathbb{E}\big(X_iX_i^\top\big)$ exists and is positive definite, $\mathbb{E}(E_i) = 0$, $\mathbb{E}\big(E_i^2\big) = \sigma^2 < \infty$, and $\mathbb{E}(X_iE_i) = 0$, where $Y_i = X_i^\top\theta_* + E_i$.
(b*) The parameter space is $\Theta = [-L, L]^p$, where $L$ is sufficiently large, and $\theta_0$ is in the interior of $\Theta$.

Under (a) and (b*), we have the fulfillment of Assumptions (A1)–(C1).

SLIDE 25

Ordinary least squares (3B)

Recall that
$$\nabla Q_n = \sum_{i=1}^n X_i\big(Y_i - X_i^\top\theta\big) = \sum_{i=1}^n X_iY_i - \sum_{i=1}^n X_iX_i^\top\theta \implies (HQ_n)(\theta) = -\sum_{i=1}^n X_iX_i^\top.$$
Thus, we observe that $(HQ_n)(\theta)$ is constant in $\theta$, and is thus continuous, which fulfills (A2).

SLIDE 26

Ordinary least squares (3C)

At $\theta_0$, we have
$$(\nabla g)(\nabla g)^\top = X_i\big(Y_i - X_i^\top\theta_0\big)\big(Y_i - X_i^\top\theta_0\big)^\top X_i^\top = \big(Y_i - X_i^\top\theta_0\big)^2 X_iX_i^\top.$$
Recalling that $\theta_0 = \theta_*$, the residual term equals
$$Y_i - X_i^\top\theta_0 = X_i^\top\theta_* - X_i^\top\theta_0 + E_i = E_i.$$
Therefore, we have $(\nabla g)(\nabla g)^\top = E_i^2 X_iX_i^\top$, and therefore the expectation is
$$\mathbb{E}\big[(\nabla g)(\nabla g)^\top\big] = \mathbb{E}\big(E_i^2 X_iX_i^\top\big) = \mathbb{E}\big(E_i^2\big)\,\mathbb{E}\big(X_iX_i^\top\big) = \sigma^2\,\mathbb{E}\big(X_iX_i^\top\big).$$

SLIDE 27

Ordinary least squares (3D)

By Assumption (a), $\{Z_i\}$ is IID, and by the definition of $\theta_0$, we have
$$\mathbb{E}\big[n^{-1}\nabla Q_n\big] = \mathbb{E}[\nabla g(Z_i;\theta_0)] = 0.$$
Again, since $\{Z_i\}$ is IID (so the cross terms vanish in expectation), we have
$$\mathrm{cov}\big(n^{-1/2}\nabla Q_n\big) = \mathbb{E}\Big[\Big(n^{-1/2}\sum_{i=1}^n\nabla g\Big)\Big(n^{-1/2}\sum_{i=1}^n\nabla g\Big)^\top\Big] = \mathbb{E}\big[(\nabla g)(\nabla g)^\top\big],$$
which exists!

SLIDE 28

Ordinary least squares (3E)

We now need to establish that
$$n^{-1/2}\nabla Q_n = n^{-1/2}\sum_{i=1}^n \nabla g(Z_i;\theta_0)$$
converges in distribution to $\mathcal{N}\big(0, \sigma^2\,\mathbb{E}\big(X_iX_i^\top\big)\big)$.

The multivariate Lindeberg–Lévy central limit theorem (CLT; van der Vaart, 1998, Thm. 2.18) states that if $\{U_i\}$ is an IID sequence with finite mean vector $\mu$ and covariance matrix $\Sigma$, then
$$n^{1/2}\Big(n^{-1}\sum_{i=1}^n U_i - \mu\Big) \xrightarrow{d} \mathcal{N}(0, \Sigma).$$
Since $n^{-1/2}\sum_{i=1}^n \nabla g(Z_i;\theta_0) = n^{1/2}\big(n^{-1}\sum_{i=1}^n \nabla g(Z_i;\theta_0) - 0\big)$, we have the desired result, and (C2) is validated with $B(\theta_0) = \sigma^2\,\mathbb{E}\big(X_iX_i^\top\big)$.

SLIDE 29

Ordinary least squares (3F)

Lastly,
$$n^{-1}(HQ_n)(\theta_n) = -n^{-1}\sum_{i=1}^n X_iX_i^\top.$$
Since $\{Z_i\}$ is identically distributed, we have $\mathbb{E}\big[n^{-1}(HQ_n)(\theta_0)\big] = -\mathbb{E}\big(X_iX_i^\top\big)$, and via the weak law of large numbers, we have
$$n^{-1}(HQ_n)(\theta_n) \xrightarrow{p} A(\theta_0), \quad\text{where } A(\theta_0) = -\mathbb{E}\big(X_iX_i^\top\big).$$
Thus, (B2) is validated.

SLIDE 30

Ordinary least squares (3G)

Finally, compute the matrix:
$$A^{-1}BA^{-1} = \big[-\mathbb{E}\big(X_iX_i^\top\big)\big]^{-1}\,\sigma^2\,\mathbb{E}\big(X_iX_i^\top\big)\,\big[-\mathbb{E}\big(X_iX_i^\top\big)\big]^{-1} = \sigma^2\,\big[\mathbb{E}\big(X_iX_i^\top\big)\big]^{-1}.$$
Under Assumptions (a) and (b*), the ordinary least squares estimator is asymptotically normal, in the sense that
$$n^{1/2}\big(\hat\theta_n - \theta_0\big) \xrightarrow{d} \mathcal{N}\big(0,\, \sigma^2\,\big[\mathbb{E}\big(X_iX_i^\top\big)\big]^{-1}\big).$$
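The limiting covariance can be compared with a Monte Carlo estimate. A sketch of my own (not from the talk): with $X_i \sim \mathcal{N}(0, I)$ and $\sigma^2 = 1$, the limit $\sigma^2[\mathbb{E}(X_iX_i^\top)]^{-1}$ is just the identity matrix, so the empirical covariance of $n^{1/2}(\hat\theta_n - \theta_*)$ over many replications should be close to $I$:

```python
import numpy as np

rng = np.random.default_rng(3)
theta_star = np.array([1.0, -2.0])
n, reps = 2000, 2000

draws = np.empty((reps, 2))
for r in range(reps):
    X = rng.normal(size=(n, 2))
    Y = X @ theta_star + rng.normal(size=n)
    theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    draws[r] = np.sqrt(n) * (theta_hat - theta_star)

# Limit covariance here: sigma^2 E[X X^T]^{-1} = identity
print(np.cov(draws.T))
```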

SLIDE 31

A bonus prize

Under Assumptions (A1)–(C1), Amemiya (1985, Thm. 4.1.2) states the Wald consistency result (cf. Wald, 1949). See also van der Vaart (1998, Thm. 5.14). If (A1)–(C1) hold, and $\{\hat\theta_n\}$ is a sequence of local maximizers that take values in $\bar\Theta_n = \{\theta_n : (\nabla Q_n)(\theta_n) = 0\}$, then for any $\varepsilon > 0$,
$$\lim_{n\to\infty}\mathbb{P}\Big(\inf_{\theta_n\in\bar\Theta_n}\|\theta_n - \theta_0\| > \varepsilon\Big) = 0.$$
We read this as: "there exists a consistent sequence of locally maximal roots $\hat\theta_n$, taking values in $\bar\Theta_n$".

SLIDE 32

Mixture of normal distributions (1)

We say that the IID random sequence $\{Z_i\}$ arises from an $m$-component mixture of normal distributions if it has a DGP characterized by the PDF
$$f(z_i; \mu, \pi, \sigma) = \sum_{j=1}^m \pi_j\,\phi\big(z_i; \mu_j, \sigma_j^2\big),$$
where $\mu \in [-L, L]^m$, $\sigma \in [S^{-1}, S]^m$, and
$$\pi \in \mathbb{S}_{m-1} = \Big\{(\pi_1,\dots,\pi_m) : \pi_j \ge 0,\ \sum_{j=1}^m \pi_j = 1\Big\},$$
for large $L$ and $S > 1$. We write $\theta \in \Theta$ for the concatenation of $\mu$, $\pi$, and $\sigma$.

SLIDE 33

Mixture of normal distributions (2)

Upon observing $\{Z_i\}$, we wish to estimate the parameter vector $\theta$ via maximization of the log-likelihood function
$$Q_n(\theta) = \sum_{i=1}^n \log\Big(\sum_{j=1}^m \pi_j\,\phi\big(z_i; \mu_j, \sigma_j^2\big)\Big).$$
Unfortunately, it is well known that $Q_n$ has multiple global maxima, due to lack of identifiability (cf. Titterington et al., 1985, Sec. 3.1)! For example, consider that $\pi_1\phi\big(z_i;\mu_1,\sigma_1^2\big) + \pi_2\phi\big(z_i;\mu_2,\sigma_2^2\big)$ is the same as $\pi_2\phi\big(z_i;\mu_2,\sigma_2^2\big) + \pi_1\phi\big(z_i;\mu_1,\sigma_1^2\big)$.
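The mixture log-likelihood and its label-switching invariance can be checked directly. A sketch of my own (not from the talk), with all parameter values assumed for illustration; permuting the component labels leaves $Q_n$ unchanged:

```python
import numpy as np

def log_phi(z, mu, s2):
    # Log of the normal PDF phi(z; mu, sigma^2)
    return -0.5 * np.log(2 * np.pi * s2) - (z - mu) ** 2 / (2 * s2)

def mixture_loglik(z, pi, mu, s2):
    # Q_n(theta) = sum_i log sum_j pi_j phi(z_i; mu_j, sigma_j^2)
    comp = np.log(pi) + log_phi(z[:, None], mu[None, :], s2[None, :])
    m = comp.max(axis=1, keepdims=True)   # stabilised log-sum-exp
    return np.sum(m[:, 0] + np.log(np.exp(comp - m).sum(axis=1)))

rng = np.random.default_rng(4)
z = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 0.5, 50)])

pi = np.array([0.75, 0.25])
mu = np.array([-2.0, 3.0])
s2 = np.array([1.0, 0.25])
perm = [1, 0]  # relabel the two components

# Label switching: the two evaluations are identical
print(mixture_loglik(z, pi, mu, s2),
      mixture_loglik(z, pi[perm], mu[perm], s2[perm]))
```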

SLIDE 34

Mixture of normal distributions (3)

Since $Q_n$ does not have a unique global maximum, we cannot apply Amemiya (1985, Thm. 4.1.1). We can instead use the Wald consistency theorem by checking:

(A1) The parameter $\theta_0$ is in the interior (an open subset) of the Euclidean parameter space $\Theta$.
(B1) The objective $Q_n(\theta)$ is continuous and measurable with respect to $\{Z_i\}$, for all $\theta \in \Theta$, and the partial derivative $(\nabla Q_n)(\theta)$ exists and is continuous in an open neighborhood $N_1$ of $\theta_0$.
(C1) There exists an open neighborhood $N_2$ of $\theta_0$ in which $n^{-1}Q_n(\theta)$ converges in probability uniformly to a non-stochastic function $Q(\theta)$, and $Q(\theta)$ attains a strict local maximum at $\theta_0$.

SLIDE 35

Mixture of normal distributions (4)

Clearly, $\Theta = [-L,L]^m \times [S^{-1},S]^m \times \mathbb{S}_{m-1}$ is a compact subset of a Euclidean space. We thus must simply make the assumption that

(a1) $\theta_0$ is in the interior of $\Theta$.

This validates (A1). Since the normal PDF is continuous, $Q_n$ is continuous (it is a sum of logarithms of convex combinations of normal PDFs). We now need to validate the integrability of the summands of $Q_n$ by showing that
$$\mathbb{E}\Big|\log\sum_{j=1}^m \pi_j\,\phi\big(Z_i; \mu_j, \sigma_j^2\big)\Big| < \infty.$$

SLIDE 36

Mixture of normal distributions (5)

Luckily, by Atienza et al. (2007), we have
$$\Big|\log\sum_{j=1}^m \pi_j\,\phi\big(z_i; \mu_j, \sigma_j^2\big)\Big| \le \sum_{j=1}^m \big|\log\phi\big(z_i; \mu_j, \sigma_j^2\big)\big|.$$
We can write
$$\log\phi\big(z_i; \mu_j, \sigma_j^2\big) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log\sigma_j^2 - \frac{(z_i - \mu_j)^2}{2\sigma_j^2},$$
which is quadratic in $z_i$! So $\mathbb{E}\log\phi\big(Z_i; \mu_j, \sigma_j^2\big)$ exists, since normal random variables have second moments. Thus, we have the required integrability for $Q_n$.

SLIDE 37

Mixture of normal distributions (6)

Since the PDF $f$ is smooth in all components of $\theta$, we also have the existence of a continuous $\nabla Q_n$, and thus (B1). Now, recall that we have already proved that
$$\mathbb{E}\Big|\log\sum_{j=1}^m \pi_j\,\phi\big(Z_i; \mu_j, \sigma_j^2\big)\Big| < \infty.$$
Since $\{Z_i\}$ is IID and $\Theta$ is compact, we can directly apply the uniform weak law of large numbers to obtain the convergence of $n^{-1}Q_n$ to $\mathbb{E}\big[\log\sum_{j=1}^m \pi_j\,\phi\big(Z_i;\mu_j,\sigma_j^2\big)\big]$, uniformly in probability. We therefore have (C1), if we also assume that $\hat\theta_n$ is a sequence from $\bar\Theta_n$.

SLIDE 38

Mixture of normal distributions (7)

Assume that $\theta_0$ is a locally maximal root of $\mathbb{E}\big[\log\sum_{j=1}^m \pi_j\,\phi\big(Z_i;\mu_j,\sigma_j^2\big)\big]$, and that $\hat\theta_n$ is a sequence of locally maximal roots from the set $\bar\Theta_n = \{\theta_n : (\nabla Q_n)(\theta_n) = 0\}$. If $\{Z_i\}$ is an IID sequence from a model with density $f(z_i;\mu,\pi,\sigma)$, then for every $\varepsilon > 0$,
$$\lim_{n\to\infty}\mathbb{P}\Big(\inf_{\theta_n\in\bar\Theta_n}\|\theta_n - \theta_0\| > \varepsilon\Big) = 0.$$
An interpretation of the result: if you enumerated all of the local maxima of $Q_n$ at each $n$, then one of the sequences of local maxima would converge to the parameter vector $\theta_0$, in probability.

SLIDE 39

A modern problem

Consider the LASSO problem of Tibshirani (1996) (see also Hastie et al., 2015), where we maximize the negative regularized sum-of-squares:
$$Q_n(\theta) = -\frac{1}{2}\sum_{i=1}^n\big(Y_i - X_i^\top\theta\big)^2 - n\lambda\sum_{j=1}^p|\theta_j|,$$
where $\theta \in \Theta = [-L,L]^p$ for large $L$, $\lambda > 0$, and $\{Z_i\}$ is an IID sequence with $Z_i = (X_i, Y_i)$. Here, $Y_i = X_i^\top\theta_S + E_i$, where $\mathbb{E}(E_i) = 0$, $\mathbb{E}\big(E_i^2\big) = \sigma^2 < \infty$, and $\mathbb{E}\big(X_iX_i^\top\big)$ exists and is positive definite. We say that $\theta_S$ is $q$-sparse ($q \in \mathbb{N}$, $q < p$), in the sense that $\theta_S = (\theta_1, \theta_2, \dots, \theta_q, 0, \dots, 0)$.
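A standard way to compute the LASSO solution is cyclic coordinate descent with soft-thresholding, as covered in Hastie et al. (2015). The following is a minimal sketch of my own (not from the talk), with simulated data and all particular values assumed; it works with the minimization form of the objective, which is equivalent to maximizing $Q_n$:

```python
import numpy as np

def soft_threshold(a, b):
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def lasso_cd(X, Y, lam, n_iter=200):
    # Minimise (1/2) sum_i (Y_i - X_i^T theta)^2 + n*lam*||theta||_1
    # (equivalently, maximise Q_n) by cyclic coordinate descent.
    n, p = X.shape
    theta = np.zeros(p)
    col_ss = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = Y - X @ theta + X[:, j] * theta[j]   # partial residual
            theta[j] = soft_threshold(X[:, j] @ r, n * lam) / col_ss[j]
    return theta

rng = np.random.default_rng(5)
n, p = 400, 6
theta_S = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])  # q-sparse, q = 2
X = rng.normal(size=(n, p))
Y = X @ theta_S + rng.normal(size=n)

th = lasso_cd(X, Y, lam=0.1)
print(th.round(2))
```

Note how the non-zero coefficients are shrunk towards zero by roughly $n\lambda/\|x_j\|^2$, which is the bias that the slides go on to discuss: the LASSO maximizer need not converge to $\theta_S$ itself.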

SLIDE 40

A consistency result? (1)

We can check the following assumptions to prove consistency via the result of Amemiya (1985, Thm. 4.1.1):

(A) The parameter space $\Theta$ is a compact subset of a Euclidean space $\mathbb{R}^p$ ($p \in \mathbb{N}$).
(B) $Q_n(\theta)$ is a continuous function in $\theta$ for all $\{Z_i\}$, and measurable in $\{Z_i\}$ for all $\theta$.
(C) $n^{-1}Q_n(\theta)$ converges to a non-stochastic function $Q(\theta)$ in probability, uniformly in $\theta$ over $\Theta$.
(D) $Q(\theta)$ attains a unique global maximum at $\theta_0$.

SLIDE 41

A consistency result? (2)

Clearly, (A) is validated, since $\Theta = [-L,L]^p$. Both the quadratic and the absolute value functions are continuous, and thus $Q_n$ is continuous. Write
$$g(Z_i;\theta) = -\frac{1}{2}\big(Y_i - X_i^\top\theta\big)^2 - \lambda\sum_{j=1}^p|\theta_j|.$$
By the same argument as for ordinary least squares, the first part is measurable. The second part does not depend on $Z_i$, and is therefore also measurable. (B) is therefore validated.

SLIDE 42

A consistency result? (3)

Again, we know that $\mathbb{E}\big(Y_i - X_i^\top\theta\big)^2$ exists, and since $\lambda\sum_{j=1}^p|\theta_j|$ is non-random, the expectation also exists. We can apply the uniform weak law of large numbers to prove (C): $n^{-1}Q_n$ converges uniformly in probability to
$$Q = \mathbb{E}[g(Z_i;\theta)] = -\frac{1}{2}\mathbb{E}\big(Y_i - X_i^\top\theta\big)^2 - \lambda\sum_{j=1}^p|\theta_j|.$$
Finally, note that the expected sum-of-squares is strictly convex in $\theta$ (under the positive definiteness of $\mathbb{E}\big(X_iX_i^\top\big)$) and the absolute value function is convex; thus $-Q$ is strictly convex, so $Q$ attains a unique global maximum at some $\theta_0 \in \Theta$, validating (D).

SLIDE 43

A consistency result? (4)

We have therefore proved that, under the assumptions of the model, the sequence of global maximizers $\hat\theta_n$ of
$$Q_n = -\frac{1}{2}\sum_{i=1}^n\big(Y_i - X_i^\top\theta\big)^2 - n\lambda\sum_{j=1}^p|\theta_j|$$
converges in probability to some $\theta_0 \in \Theta$ that globally maximizes $Q$. But does $\theta_0 = \theta_S$?

Unless $\lambda$ is sufficiently small, the answer is no, since the regularization parameter $\lambda$ enforces an $l_1$-ball constraint.

SLIDE 44

A consistency result? (5)

Consider the $l_1$ ball, for $\kappa > 0$: $\sum_{j=1}^p|\theta_j| \le \kappa$. From Osborne et al. (2000), we have the result that $\lambda(\kappa) \equiv \lambda = C_1 - C_2\kappa$, for a real constant $C_1$ and a positive constant $C_2$. So if $\lambda(\kappa)$ is such that
$$\Theta_{\lambda(\kappa)} \equiv \Big\{\theta : \sum_{j=1}^p|\theta_j| \le \kappa\Big\} \subset \Theta,$$
and $\theta_S \in \Theta\setminus\Theta_\kappa$, then $\theta_0 \ne \theta_S$.

SLIDE 45

A consistency result? (5)

Figure: Schematic of the parameter spaces $\Theta_\kappa$ and $\Theta$ (with $\theta_S$ outside $\Theta_\kappa$, and $\theta_0$ inside it).

SLIDE 46

The method of sieves

The method of sieves is a general estimation philosophy that was first introduced in Grenander (1981, Ch. 8). The modern interpretation of the method of sieves is as follows (cf. Chen, 2007):

Let $\theta_0 \in \Theta$ be the parameter of interest, where $\Theta$ is a compact Euclidean space. At each $n \in \mathbb{N}$, define a compact set $\Theta_n$, the sieve space, where $\Theta_n \subset \Theta_{n+1} \subset \cdots \subset \Theta$. Define the sieve estimator, at $n$, as
$$\tilde\theta_n \equiv \underset{\theta\in\Theta_n}{\arg\max}\, Q_n(\theta),$$
where $Q_n$ is constructed from the data $\{Z_i\}$.

SLIDE 47

Consistency of the sieve estimator (1)

Let $\Pi_n$ be a (loosely defined) projection operator onto the set $\Theta_n$, and make the following assumptions:

(A3) The parameter space $\Theta$ is compact and $Q_n(\theta)$ is continuous with respect to $\theta \in \Theta$. There exists a $Q$ such that $\theta_0$ is the unique global maximizer of $Q$, and $Q(\theta_0) > -\infty$.
(B3) For all $k \ge 1$, $\Theta_k \subset \Theta_{k+1} \subset \Theta$ is compact, and for any $\theta \in \Theta$, there exists a $\Pi_k\theta \in \Theta_k$ such that $\lim_{k\to\infty}\|\theta - \Pi_k\theta\| = 0$.
(C3) $Q_n$ is measurable with respect to $\{Z_i\}$ for all $\theta \in \Theta_k$, and $Q_n$ is continuous for every $\{Z_i\}$.
(D3) For each $k \ge 1$, $n^{-1}Q_n$ converges in probability uniformly to $Q$ on the sieve space $\Theta_k$.

SLIDE 48

Consistency of the sieve estimator (2)

Theorem 3.1 of Chen (2007) provides the following result. Under Assumptions (A3)–(D3), the sieve estimator is consistent, in the sense that $\tilde\theta_n \xrightarrow{p} \theta_0$. As a note, (A3)–(D3) are one of many possible sets of assumptions that yield the same theorem.

SLIDE 49

A simple oracle (1)

Make the following assumptions:

(a*) $\{Z_i\}$ is an IID sequence, and the DGP of $Z_i = (X_i, Y_i)$ is such that $\mathbb{E}\big(X_iX_i^\top\big)$ exists and is positive definite, $\mathbb{E}(E_i) = 0$, $\mathbb{E}\big(E_i^2\big) = \sigma^2 < \infty$, and $\mathbb{E}(X_iE_i) = 0$, where $Y_i = X_i^\top\theta_S + E_i$.
(b**) The parameter space is $\Theta = [-L, L]^p$, where $L$ is sufficiently large, and $\theta_S$ is in $\Theta$.

SLIDE 50

A simple oracle (2)

Let $\kappa(n) \equiv \kappa$ be a positive, strictly increasing, and unbounded function of $n$, and define the set
$$\Theta_n = \Big\{\theta : \sum_{j=1}^p|\theta_j| \le \kappa(n)\Big\} \cap \Theta.$$
Clearly, $\Theta_n \subset \Theta_{n+1} \subset \Theta$ for each $n$, and $\Theta_n$ is compact. Define $\Pi_n\theta = \arg\min_{\theta_n\in\Theta_n}\|\theta_n - \theta\|$. For sufficiently large $N$, $\Theta_N = \Theta$, and thus $\Pi_N\theta = \theta$; hence $\Pi_n\theta \to \theta$ for all $\theta \in \Theta$. We have therefore fulfilled Assumption (B3). We also note that $\theta_0 = \theta_S$, due to Assumption (B3).

SLIDE 51

A simple oracle (3)

Let $\lambda(\kappa(n))$ fulfill the relationship $\lambda(\kappa(n)) = C_1 - C_2\kappa(n)$, such that the problem
$$\max_{\theta\in\Theta}\, Q_n = -\frac{1}{2}\sum_{i=1}^n\big(Y_i - X_i^\top\theta\big)^2 - n\lambda(\kappa(n))\sum_{j=1}^p|\theta_j|$$
is equivalent to the problem
$$\max_{\theta\in\Theta_n}\, -\frac{1}{2}\sum_{i=1}^n\big(Y_i - X_i^\top\theta\big)^2.$$
Under the assumptions on the model, the first problem is strictly concave and thus has a unique global maximizer $\hat\theta_n$, which implies the satisfaction of Assumption (A3).

SLIDE 52

A simple oracle (4)

We have already proved that $Q_n$ is measurable and continuous, and thus (C3) is fulfilled. For each constant $k$, $\mathbb{E}\big(Y_i - X_i^\top\theta\big)^2$ is finite, since $\Theta_k$ is compact, $\mathbb{E}\big(E_i^2\big) < \infty$, and $\mathbb{E}\big(X_iX_i^\top\big)$ exists. Thus, (D3) is fulfilled.

Under (a*) and (b**), if $\kappa(n)$ is a positive and strictly increasing function of $n$, and
$$\Theta_n = \Big\{\theta : \sum_{j=1}^p|\theta_j| \le \kappa(n)\Big\} \cap \Theta,$$
then the sieve estimator
$$\tilde\theta_n = \underset{\theta\in\Theta_n}{\arg\max}\, -\frac{1}{2}\sum_{i=1}^n\big(Y_i - X_i^\top\theta\big)^2$$
is a consistent estimator of $\theta_0 = \theta_S$.
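A toy one-dimensional illustration of the sieve idea (my own sketch, not from the talk): in one dimension, the maximizer of the concave least-squares objective over $[-\kappa(n), \kappa(n)]$ is simply the unconstrained maximizer clipped to that interval. With an assumed true parameter lying outside the early sieve sets, the sieve estimator is truncated at first, but recovers the truth as $\kappa(n)$ grows:

```python
import numpy as np

rng = np.random.default_rng(6)
theta_S = 3.0  # assumed true parameter, outside early sieve sets

def sieve_estimate(n):
    # Sieve space Theta_n = [-kappa(n), kappa(n)], with kappa(n) = log(n)
    kappa = np.log(n)
    x = rng.normal(size=n)
    y = theta_S * x + rng.normal(size=n)
    theta_hat = (x @ y) / (x @ x)             # unconstrained least squares
    return np.clip(theta_hat, -kappa, kappa)  # 1-D constrained maximiser

for n in (10, 1000, 100_000):
    print(n, sieve_estimate(n))
```

At $n = 10$ the estimate is pinned at the sieve boundary $\log 10 \approx 2.3$; for large $n$ the boundary no longer binds and the estimate approaches $\theta_S$.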

SLIDE 53

A simple oracle (5)

Figure: Schematic of the behaviour of the sieve estimator (the nested sets $\Theta_1 \subset \Theta_2 \subset \cdots \subset \Theta_K \subset \Theta$, with $\theta_S$ eventually captured).

SLIDE 54

A different kind of oracle (1A)

Make the same assumptions as in the previous example:

(a*) $\{Z_i\}$ is an IID sequence, and the DGP of $Z_i = (X_i, Y_i)$ is such that $\mathbb{E}\big(X_iX_i^\top\big)$ exists and is positive definite, $\mathbb{E}(E_i) = 0$, $\mathbb{E}\big(E_i^2\big) = \sigma^2 < \infty$, and $\mathbb{E}(X_iE_i) = 0$, where $Y_i = X_i^\top\theta_S + E_i$.
(b**) The parameter space is $\Theta = [-L, L]^p$, where $L$ is sufficiently large, and $\theta_S$ is in $\Theta$.

SLIDE 55

A different kind of oracle (1B)

Suppose now that we want to estimate the $q$-sparse parameter $\theta_S$ again, but via a sequence of estimators $\hat\theta_k \in \hat\Theta_k^S$, where
$$\hat\Theta_k^S = \Big\{\hat\theta : \hat\theta = \underset{\theta\in\Theta_k^S}{\arg\max}\,\mathbb{E}[g(Z_i;\theta)]\Big\},$$
$$\Theta_k^S = \{\theta \in \Theta : \theta \text{ is } k\text{-sparse (has at most } k \text{ non-zero elements)}\},$$
and $k \in \{1,\dots,q,\dots,K\}$. Recall that $g(Z_i;\theta) = -\big(Y_i - X_i^\top\theta\big)^2/2$.

Is there an estimation method that uses the sequence $\hat\theta_k$ (or the estimator sequence $\hat\theta_{k,n}$) to select the correct $k$, say $\hat k_n$, where $\hat k_n$ goes to $q$ in $n$, in some sense?

SLIDE 56

A model selection result (1)

Define $\{\Theta_k^M\}$ to be a collection of models $\Theta_k^M \subset \mathbb{R}^{d_k}$, where $k \in \{1,2,\dots,K\}$ and $d_1 \le d_2 \le \cdots \le d_K \in \mathbb{N}$. Let $Q_n(\theta) = \sum_{i=1}^n g(Z_i;\theta)$, for the sequence of data $\{Z_i\}$, be such that $\theta \in \cup_k\Theta_k^M$. Define $\hat\theta_k \in \hat\Theta_k^M$, with
$$\hat\Theta_k^M = \Big\{\hat\theta_k : \hat\theta_k = \underset{\theta\in\Theta_k^M}{\arg\max}\,\mathbb{E}[g(Z_i;\theta)]\Big\}.$$
The following result arises from Theorem 8.1 of Baudry (2015).

SLIDE 57

A model selection result (2)

Make the assumptions:

(A4) Suppose that there exists some
$$k_0 = \min\Big(\underset{k\in\{1,\dots,K\}}{\arg\max}\,\mathbb{E}\big[g\big(Z_i;\hat\theta_k\big)\big]\Big).$$
(B4) For all $k$, $\hat\theta_{k,n} \in \Theta_k^M$ is such that
$$Q_n\big(\hat\theta_{k,n}\big) \ge Q_n\big(\hat\theta_k\big) \quad\text{and}\quad n^{-1}Q_n\big(\hat\theta_{k,n}\big) \xrightarrow{p} \mathbb{E}\big[g\big(Z_i;\hat\theta_k\big)\big].$$

SLIDE 58

A model selection result (3)

(C4) We can define a penalty function $\mathrm{pen}(k,n)$ such that $\mathrm{pen}(k,n) > 0$, $\lim_{n\to\infty}\mathrm{pen}(k,n) = 0$, and $n\big[\mathrm{pen}(k_2,n) - \mathrm{pen}(k_1,n)\big] \to \infty$ when $k_2 > k_1$.
(D4) For any $\hat k \in \arg\max_{k\in\{1,\dots,K\}}\mathbb{E}\big[g\big(Z_i;\hat\theta_k\big)\big]$,
$$Q_n\big(\hat\theta_{k_0,n}\big) - Q_n\big(\hat\theta_{\hat k,n}\big) = O_p(1).$$

Under (A4)–(D4), $\lim_{n\to\infty}\mathbb{P}\big(\hat k_n \ne k_0\big) = 0$, where
$$\hat k_n = \min\Big(\underset{k\in\{1,\dots,K\}}{\arg\min}\,\big\{-n^{-1}Q_n\big(\hat\theta_{k,n}\big) + \mathrm{pen}(k,n)\big\}\Big).$$

SLIDE 59

A model selection result (4)

The most difficult assumption to prove, in general, is (D4). A set of conditions guaranteeing (D4) is provided in Corollary 8.2 of Baudry (2015).

(c) Some conditions that suffice are:

  • $g$ is twice continuously differentiable.
  • $\Theta_k^M$ is compact for each $k$.
  • $\{Z_i\}$ is a sequence of bounded random variables.
  • The Hessian $(H\,\mathbb{E}g)\big(\hat\theta_{k_0}\big)$ is nonsingular.

SLIDE 60

A different kind of oracle (2A)

(A4) must be assumed, and we will restate it as the existence of
$$k_0 = \min\Big(\underset{k\in\{1,\dots,K\}}{\arg\max}\,\mathbb{E}\big[-\big(Y_i - X_i^\top\hat\theta_k\big)^2/2\big]\Big).$$
We have proved (B4) in all of the previous examples (since $Q_n$ is still concave, and the law of large numbers still applies). We must propose a penalty that has the properties that we desire. We can check that the penalty
$$\mathrm{pen}(k,n) = \frac{k\log n}{n}$$
satisfies the criteria of (C4):

  • $k \ge 1$ and $n \ge 2$ give $\mathrm{pen}(k,n) > 0$, and $\mathrm{pen}(k,n) \to 0$ as $n \to \infty$.
  • $n\big[\mathrm{pen}(k_2,n) - \mathrm{pen}(k_1,n)\big] = (k_2 - k_1)\log n \to \infty$, since $k_2 > k_1$.

SLIDE 61

A different kind of oracle (2B)

Assumption (c) only requires us to assume that each $|Y_i| \le C_1$ and $\|X_i\| \le C_2$, for some constants $C_1$ and $C_2$; we make these extra assumptions and validate (D4). We therefore have the following result:

For each $k$, define the $k$-sparse parameter space to be $\Theta_k^S = \{\theta \in \Theta : \theta \text{ is } k\text{-sparse (has at most } k \text{ non-zero elements)}\}$. Assume that (a*), (b**), and (c) hold. If
$$\hat\theta_{k,n} = \underset{\theta\in\Theta_k^S}{\arg\max}\, -\frac{1}{2}\sum_{i=1}^n\big(Y_i - X_i^\top\theta\big)^2,$$
then $\lim_{n\to\infty}\mathbb{P}\big(\hat k_n \ne k_0\big) = 0$, where
$$\hat k_n = \min\Big(\underset{k\in\{1,\dots,K\}}{\arg\min}\,\Big\{\frac{1}{2n}\sum_{i=1}^n\big(Y_i - X_i^\top\hat\theta_{k,n}\big)^2 + \frac{k\log n}{n}\Big\}\Big).$$
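The selection rule above can be tried on simulated data. A sketch of my own (not from the talk), with all particular values assumed; for small $p$, the $k$-sparse maximizer can be found by exhaustive search over supports, and the penalized criterion then picks out the sparsity level:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
n, p = 2000, 5
theta_S = np.array([1.5, -1.0, 0.0, 0.0, 0.0])  # q-sparse, q = 2
X = rng.normal(size=(n, p))
Y = X @ theta_S + rng.normal(size=n)

def best_rss(k):
    # Exhaustive search over k-sparse supports (feasible for small p)
    best = np.inf
    for S in combinations(range(p), k):
        Xs = X[:, S]
        th = np.linalg.solve(Xs.T @ Xs, Xs.T @ Y)
        best = min(best, np.sum((Y - Xs @ th) ** 2))
    return best

# Criterion: (1/2n) RSS_k + k log(n)/n, minimised over k
crit = [best_rss(k) / (2 * n) + k * np.log(n) / n for k in range(1, p + 1)]
k_hat = 1 + int(np.argmin(crit))
print(k_hat)
```

The fit term rewards larger supports, while the $k\log n/n$ penalty makes adding a spurious variable unprofitable, so the criterion settles on the generative sparsity level.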

SLIDE 62

Some final notes

Note that there is a distinct lack of independence assumptions in the main theorems: Amemiya (1985, Thms. 4.1.1, 4.1.2, 4.1.6), Chen (2007, Thm. 3.1), and Baudry (2015, Thm. 8.1). Each of the theorems relies on the use of some law of large numbers, uniform law of large numbers, or central limit theorem. Generic laws of large numbers for non-IID data can be found in Davidson (1994), Potscher and Prucha (1997), and White (2001). Generic uniform laws can be found in Andrews (1992), Potscher and Prucha (1997), and Jenish and Prucha (2009).

SLIDE 63

References I

Amemiya, T. (1985). Advanced Econometrics. Harvard University Press, Cambridge.

Andrews, D. W. K. (1992). Generic uniform convergence. Econometric Theory, 8:241–257.

Atienza, N., Garcia-Heras, J., Munoz-Pichardo, J. M., and Villa, R. (2007). On the consistency of MLE in finite mixture models of exponential families. Journal of Statistical Planning and Inference, 137:496–505.

Baudry, J.-P. (2015). Estimation and model selection for model-based clustering with the conditional classification likelihood. Electronic Journal of Statistics, 9:1041–1077.

SLIDE 64

References II

Bickel, P. J. and Doksum, K. A. (2000). Mathematical Statistics: Basic Ideas and Selected Topics, volume 1. Prentice Hall, Upper Saddle River.

Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. In Handbook of Econometrics, volume 6B, pages 5549–5632. Elsevier.

Davidson, J. (1994). Stochastic Limit Theory. Oxford University Press, Oxford.

Grenander, U. (1981). Abstract Inference. Wiley, New York.

Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton.

Huber, P. J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35:73–101.

SLIDE 65

References III

Jenish, N. and Prucha, I. R. (2009). Central limit theorems and uniform laws of large numbers for arrays of random fields. Journal of Econometrics.

Jennrich, R. I. (1969). Asymptotic properties of non-linear least squares estimators. Annals of Mathematical Statistics, 40:633–643.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20:389–404.

Pfanzagl, J. (1969). On the measurability and consistency of minimum contrast estimates. Metrika, 14:249–272.

Potscher, B. M. and Prucha, I. R. (1997). Dynamic Nonlinear Econometric Models: Asymptotic Theory. Springer, Berlin.

Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.

SLIDE 66

References IV

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B, 58:267–288.

Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. Wiley, New York.

van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.

Vapnik, V. (2000). The Nature of Statistical Learning Theory. Springer, New York.

Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York.

Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Annals of Mathematical Statistics, 20:595–601.

White, H. (2001). Asymptotic Theory for Econometricians. Academic Press, San Diego.

SLIDE 67

Thank you for your attention!

Email: h.nguyen5@latrobe.edu.au
Twitter: @tresbienhien
Website: https://hiendn.github.io