

  1. Statistics for Applications Chapter 7: Regression

  2. Heuristics of the linear regression (1). Consider a cloud of i.i.d. random points $(X_i, Y_i)$, $i = 1, \dots, n$.

  3. Heuristics of the linear regression (2)
     - Idea: fit the best line to the data.
     - Approximation: $Y_i \approx a + bX_i$, $i = 1, \dots, n$, for some (unknown) $a, b \in \mathbb{R}$.
     - Find $\hat{a}, \hat{b}$ that approach $a$ and $b$.
     - More generally: $Y_i \in \mathbb{R}$, $X_i \in \mathbb{R}^d$, $Y_i \approx a + X_i^\top b$, $a \in \mathbb{R}$, $b \in \mathbb{R}^d$.
     - Goal: write a rigorous model and estimate $a$ and $b$.

  4. Heuristics of the linear regression (3). Examples:
     - Economics: demand and price, $D_i \approx a + b\, p_i$, $i = 1, \dots, n$.
     - Ideal gas law ($PV = nRT$): $\log P_i \approx a + b \log V_i + c \log T_i$, $i = 1, \dots, n$.

  5. Linear regression of a r.v. Y on a r.v. X (1). Let $X$ and $Y$ be two real random variables (not necessarily independent) with two finite moments and such that $\mathrm{Var}(X) \neq 0$. The theoretical linear regression of $Y$ on $X$ is the best approximation in quadratic mean of $Y$ by a linear function of $X$, i.e. the r.v. $a + bX$, where $a$ and $b$ are the two real numbers minimizing $\mathbb{E}\left[(Y - a - bX)^2\right]$. By some simple algebra:
     - $b = \dfrac{\mathrm{cov}(X, Y)}{\mathrm{Var}(X)}$,
     - $a = \mathbb{E}[Y] - b\,\mathbb{E}[X] = \mathbb{E}[Y] - \dfrac{\mathrm{cov}(X, Y)}{\mathrm{Var}(X)}\,\mathbb{E}[X]$.
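
As a quick numerical illustration (my addition, not part of the slides), the Python sketch below simulates a dependent pair (X, Y), plugs the sample moments into b = cov(X, Y)/Var(X) and a = E[Y] - b E[X], and checks that the result does at least as well, in mean squared error, as nearby coefficients. The distributions and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulate a dependent pair (X, Y); the "true" line Y = 2 + 3X + noise is arbitrary.
X = rng.normal(loc=1.0, scale=2.0, size=n)
Y = 2.0 + 3.0 * X + rng.normal(scale=1.0, size=n)

# Empirical versions of b = cov(X, Y) / Var(X) and a = E[Y] - b E[X].
b = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)
a = Y.mean() - b * X.mean()
print(a, b)  # close to 2 and 3

# Sanity check: (a, b) should minimize the empirical quadratic error.
def mse(a_, b_):
    return np.mean((Y - a_ - b_ * X) ** 2)

assert mse(a, b) <= mse(a + 0.1, b) and mse(a, b) <= mse(a, b - 0.1)
```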

  6. Linear regression of a r.v. Y on a r.v. X (2). If $\varepsilon = Y - (a + bX)$, then $Y = a + bX + \varepsilon$, with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{cov}(X, \varepsilon) = 0$. Conversely: assume that $Y = a + bX + \varepsilon$ for some $a, b \in \mathbb{R}$ and some centered r.v. $\varepsilon$ that satisfies $\mathrm{cov}(X, \varepsilon) = 0$ (e.g., if $X$ is independent of $\varepsilon$, or if $\mathbb{E}[\varepsilon \mid X] = 0$, then $\mathrm{cov}(X, \varepsilon) = 0$). Then $a + bX$ is the theoretical linear regression of $Y$ on $X$.

  7. Linear regression of a r.v. Y on a r.v. X (3). A sample of $n$ i.i.d. random pairs $(X_1, Y_1), \dots, (X_n, Y_n)$ with the same distribution as $(X, Y)$ is available. We want to estimate $a$ and $b$.

  8. (Slides 8 through 11 repeat the statement of slide 7.)

  12. Linear regression of a r.v. Y on a r.v. X (4). Definition: the least squared error (LSE) estimator of $(a, b)$ is the minimizer of the sum of squared errors $\sum_{i=1}^n (Y_i - a - bX_i)^2$. The minimizer $(\hat{a}, \hat{b})$ is given by
      $\hat{b} = \dfrac{\overline{XY} - \bar{X}\,\bar{Y}}{\overline{X^2} - \bar{X}^2}, \qquad \hat{a} = \bar{Y} - \hat{b}\,\bar{X}.$
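
A minimal sketch (again my own, on an arbitrary simulated dataset) of the closed-form LSE above, cross-checked against numpy's least-squares polynomial fit:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.uniform(0, 10, size=n)
Y = 1.5 - 0.7 * X + rng.normal(scale=0.5, size=n)  # arbitrary true (a, b) = (1.5, -0.7)

# LSE via the closed form: b_hat = (mean(XY) - mean(X)mean(Y)) / (mean(X^2) - mean(X)^2).
b_hat = (np.mean(X * Y) - X.mean() * Y.mean()) / (np.mean(X ** 2) - X.mean() ** 2)
a_hat = Y.mean() - b_hat * X.mean()

# Cross-check against numpy's degree-1 least-squares fit (returns slope, then intercept).
b_np, a_np = np.polyfit(X, Y, deg=1)
assert np.allclose([a_hat, b_hat], [a_np, b_np])
print(a_hat, b_hat)
```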

  13. Linear regression of a r.v. Y on a r.v. X (5)

  14. Multivariate case (1). Model: $Y_i = X_i^\top \beta + \varepsilon_i$, $i = 1, \dots, n$.
      - Vector of explanatory variables or covariates: $X_i \in \mathbb{R}^p$ (w.l.o.g., assume its first coordinate is 1).
      - Dependent variable: $Y_i$.
      - $\beta = (a, b)$; $\beta_1 (= a)$ is called the intercept.
      - $\{\varepsilon_i\}_{i=1,\dots,n}$: noise terms satisfying $\mathrm{cov}(X_i, \varepsilon_i) = 0$.
      Definition: the least squared error (LSE) estimator of $\beta$ is the minimizer of the sum of squared errors: $\hat{\beta} = \operatorname{argmin}_{t \in \mathbb{R}^p} \sum_{i=1}^n (Y_i - X_i^\top t)^2$.

  15. Multivariate case (2): LSE in matrix form.
      - Let $Y = (Y_1, \dots, Y_n)^\top \in \mathbb{R}^n$.
      - Let $X$ be the $n \times p$ matrix whose rows are $X_1^\top, \dots, X_n^\top$ ($X$ is called the design matrix).
      - Let $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^\top \in \mathbb{R}^n$ (unobserved noise), so that $Y = X\beta + \varepsilon$.
      - The LSE $\hat{\beta}$ satisfies $\hat{\beta} = \operatorname{argmin}_{t \in \mathbb{R}^p} \|Y - Xt\|_2^2$.

  16. Multivariate case (3). Assume that $\mathrm{rank}(X) = p$.
      - Analytic computation of the LSE: $\hat{\beta} = (X^\top X)^{-1} X^\top Y$.
      - Geometric interpretation of the LSE: $X\hat{\beta}$ is the orthogonal projection of $Y$ onto the subspace spanned by the columns of $X$: $X\hat{\beta} = PY$, where $P = X(X^\top X)^{-1}X^\top$.
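
The closed form and the projection interpretation are easy to verify numerically. The toy design below is my own choice; in practice one would use a QR or lstsq routine rather than forming the inverse explicitly, but the explicit formulas are kept here to mirror the slide.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column = intercept
beta = np.array([1.0, 2.0, -1.0, 0.5])                          # arbitrary true beta
Y = X @ beta + rng.normal(scale=0.3, size=n)

# LSE via the normal equations (X'X is invertible since rank(X) = p here).
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Same answer from a numerically safer least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)

# Geometric interpretation: X beta_hat = P Y with P the orthogonal projector onto col(X).
P = X @ np.linalg.inv(X.T @ X) @ X.T
assert np.allclose(X @ beta_hat, P @ Y)
assert np.allclose(P @ P, P)   # P is idempotent
assert np.allclose(P, P.T)     # and symmetric
```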

  17. Linear regression with deterministic design and Gaussian noise (1). Assumptions:
      - The design matrix $X$ is deterministic and $\mathrm{rank}(X) = p$.
      - The model is homoscedastic: $\varepsilon_1, \dots, \varepsilon_n$ are i.i.d.
      - The noise vector $\varepsilon$ is Gaussian: $\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I_n)$, for some known or unknown $\sigma^2 > 0$.

  18. Linear regression with deterministic design and Gaussian noise (2)
      - LSE = MLE: $\hat{\beta} \sim \mathcal{N}_p\big(\beta,\ \sigma^2 (X^\top X)^{-1}\big)$.
      - Quadratic risk of $\hat{\beta}$: $\mathbb{E}\big[\|\hat{\beta} - \beta\|_2^2\big] = \sigma^2 \operatorname{tr}\big((X^\top X)^{-1}\big)$.
      - Prediction error: $\mathbb{E}\big[\|Y - X\hat{\beta}\|_2^2\big] = \sigma^2 (n - p)$.
      - Unbiased estimator of $\sigma^2$: $\hat{\sigma}^2 = \dfrac{1}{n - p}\,\|Y - X\hat{\beta}\|_2^2$.
      Theorem: $(n - p)\,\dfrac{\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}$, and $\hat{\beta}$ is independent of $\hat{\sigma}^2$.
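
A Monte Carlo sketch (my own, with an arbitrary fixed design and sigma) of two of the statements above: the average of sigma_hat^2 = ||Y - X beta_hat||^2 / (n - p) matches sigma^2, and the empirical covariance of beta_hat matches sigma^2 (X'X)^{-1}.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 50, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # fixed (deterministic) design
beta = np.array([1.0, -2.0, 0.5])

sigma2_hats, beta_hats = [], []
for _ in range(10_000):
    Y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    resid = Y - X @ beta_hat
    sigma2_hats.append(resid @ resid / (n - p))  # unbiased estimator of sigma^2
    beta_hats.append(beta_hat)

print(np.mean(sigma2_hats))                # ~ sigma^2 = 4
print(np.cov(np.array(beta_hats).T))       # ~ sigma^2 * (X'X)^{-1}
print(sigma ** 2 * np.linalg.inv(X.T @ X))
```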

  19. Significance tests (1). Test whether the $j$-th explanatory variable is significant in the linear regression ($1 \le j \le p$): $H_0: \beta_j = 0$ vs. $H_1: \beta_j \neq 0$. If $\gamma_j$ is the $j$-th diagonal coefficient of $(X^\top X)^{-1}$ ($\gamma_j > 0$), then
      $\dfrac{\hat{\beta}_j - \beta_j}{\sqrt{\hat{\sigma}^2 \gamma_j}} \sim t_{n-p}.$
      Let $T_n^{(j)} = \dfrac{\hat{\beta}_j}{\sqrt{\hat{\sigma}^2 \gamma_j}}$. Test with non-asymptotic level $\alpha \in (0, 1)$:
      $\delta_\alpha^{(j)} = \mathbf{1}\{|T_n^{(j)}| > q_{\alpha/2}(t_{n-p})\},$
      where $q_{\alpha/2}(t_{n-p})$ is the $(1 - \alpha/2)$-quantile of $t_{n-p}$.
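
A sketch of this t-test on simulated data (the design, coefficients and level are my own choices): it computes T_n^(j) for every coordinate and compares |T_n^(j)| to the (1 - alpha/2)-quantile of t_{n-p} obtained from scipy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p, alpha = 100, 3, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([0.5, 0.0, 1.0])          # the second covariate is truly irrelevant
Y = X @ beta + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (n - p)

gamma = np.diag(XtX_inv)                  # gamma_j = j-th diagonal entry of (X'X)^{-1}
T = beta_hat / np.sqrt(sigma2_hat * gamma)
q = stats.t.ppf(1 - alpha / 2, df=n - p)  # (1 - alpha/2)-quantile of t_{n-p}

for j, (t_j, rej) in enumerate(zip(T, np.abs(T) > q), start=1):
    print(f"beta_{j}: T = {t_j:+.2f}, reject H0: beta_{j} = 0 ? {rej}")
```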

  20. Significance tests (2). Test whether a group of explanatory variables is significant in the linear regression: $H_0: \beta_j = 0\ \forall j \in S$ vs. $H_1: \exists j \in S,\ \beta_j \neq 0$, where $S \subseteq \{1, \dots, p\}$. Bonferroni's test: $\delta_\alpha^B = \max_{j \in S} \delta_{\alpha/k}^{(j)}$, where $k = |S|$. The test $\delta_\alpha^B$ has non-asymptotic level at most $\alpha$.
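
A sketch of Bonferroni's group test in the same kind of simulated setting (my own toy choices): each coordinate in S is tested at level alpha/k, and the group null is rejected as soon as one individual test rejects.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, p, alpha = 100, 5, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 0.0, 0.0, 0.8, 0.0])   # only one non-intercept coefficient is nonzero
Y = X @ beta + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (n - p)
T = beta_hat / np.sqrt(sigma2_hat * np.diag(XtX_inv))

S = [1, 2, 3, 4]                              # 0-based indices of the tested group
k = len(S)
q = stats.t.ppf(1 - alpha / (2 * k), df=n - p)  # each individual test run at level alpha/k
reject = any(abs(T[j]) > q for j in S)
print("reject H0: beta_j = 0 for all j in S ?", reject)
```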

  21. More tests (1). Let $G$ be a $k \times p$ matrix with $\mathrm{rank}(G) = k$ ($k \le p$) and $\lambda \in \mathbb{R}^k$. Consider the hypotheses $H_0: G\beta = \lambda$ vs. $H_1: G\beta \neq \lambda$. The setup of the previous slide is a particular case. If $H_0$ is true, then
      $G\hat{\beta} - \lambda \sim \mathcal{N}_k\big(0,\ \sigma^2 G(X^\top X)^{-1}G^\top\big),$
      and
      $\sigma^{-2}\,(G\hat{\beta} - \lambda)^\top \big(G(X^\top X)^{-1}G^\top\big)^{-1}(G\hat{\beta} - \lambda) \sim \chi^2_k.$

  22. More tests (2). Let $S_n = \dfrac{1}{k\hat{\sigma}^2}\,(G\hat{\beta} - \lambda)^\top \big(G(X^\top X)^{-1}G^\top\big)^{-1}(G\hat{\beta} - \lambda)$. If $H_0$ is true, then $S_n \sim F_{k,\,n-p}$. Test with non-asymptotic level $\alpha \in (0, 1)$: $\delta_\alpha = \mathbf{1}\{S_n > q_\alpha(F_{k,\,n-p})\}$, where $q_\alpha(F_{k,\,n-p})$ is the $(1 - \alpha)$-quantile of $F_{k,\,n-p}$. Definition: the Fisher distribution with $p$ and $q$ degrees of freedom, denoted by $F_{p,q}$, is the distribution of $\dfrac{U/p}{V/q}$, where $U \sim \chi^2_p$, $V \sim \chi^2_q$, and $U$ is independent of $V$.
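
A sketch of this F-test on simulated data (the design, G and lambda are my own choices): G selects the last two coordinates and lambda = 0, so the test asks whether those two covariates are jointly significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p, alpha = 100, 4, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 0.5, 0.0, 0.0])     # the last two coefficients are truly zero
Y = X @ beta + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (n - p)

# Test H0: G beta = lam, with G selecting coordinates 3 and 4 (so k = 2) and lam = 0.
G = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
lam = np.zeros(2)
k = G.shape[0]

diff = G @ beta_hat - lam
S_n = diff @ np.linalg.solve(G @ XtX_inv @ G.T, diff) / (k * sigma2_hat)
q = stats.f.ppf(1 - alpha, dfn=k, dfd=n - p)   # (1 - alpha)-quantile of F_{k, n-p}
print(S_n, q, "reject H0" if S_n > q else "do not reject H0")
```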

  23. Concluding remarks
      - Linear regression exhibits correlations, NOT causality.
      - Normality of the noise: one can use goodness-of-fit tests to test whether the residuals $\hat{\varepsilon}_i = Y_i - X_i^\top\hat{\beta}$ are Gaussian.
      - Deterministic design: if $X$ is not deterministic, all of the above can be understood conditionally on $X$, provided the noise is assumed to be Gaussian conditionally on $X$.

  24. Linear regression and lack of identifiability (1). Consider the following model: $Y = X\beta + \varepsilon$, with:
      1. $Y \in \mathbb{R}^n$ (dependent variables), $X \in \mathbb{R}^{n \times p}$ (deterministic design);
      2. $\beta \in \mathbb{R}^p$, unknown;
      3. $\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I_n)$.
      Previously, we assumed that $X$ had rank $p$, so we could invert $X^\top X$. What if $X$ is not of rank $p$, e.g., if $p > n$? Then $\beta$ is no longer identified: estimation of $\beta$ is vain (unless we add more structure).

  25. Linear regression and lack of identifiability (2). What about prediction? $X\beta$ is still identified. Let $\hat{Y}$ be the orthogonal projection of $Y$ onto the linear span of the columns of $X$:
      $\hat{Y} = X\hat{\beta} = X(X^\top X)^\dagger X^\top Y,$
      where $A^\dagger$ stands for the (Moore-Penrose) pseudo-inverse of a matrix $A$. Similarly as before, if $k = \mathrm{rank}(X)$:
      $\dfrac{\|\hat{Y} - Y\|_2^2}{\sigma^2} \sim \chi^2_{n-k}, \qquad \text{and } \|\hat{Y} - Y\|_2^2 \text{ is independent of } \hat{Y}.$
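
A sketch (mine) of prediction without identifiability: the design below has rank 10 with p > n, so beta is not identified, yet Y_hat = X (X'X)^dagger X' Y is well defined and coincides with what a plain least-squares solver returns.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma = 30, 50, 0.5
# Build a rank-deficient design: rank(X) = 10 < n < p, so beta is not identified.
X = rng.normal(size=(n, 10)) @ rng.normal(size=(10, p))
beta = rng.normal(size=p)
Y = X @ beta + sigma * rng.normal(size=n)

# Y_hat = X (X'X)^dagger X' Y, with the Moore-Penrose pseudo-inverse;
# rcond is set explicitly to cut off the numerically-zero singular values.
Y_hat = X @ np.linalg.pinv(X.T @ X, rcond=1e-10) @ X.T @ Y

# The same projection comes out of a plain least-squares solve.
Y_hat_ls = X @ np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.max(np.abs(Y_hat - Y_hat_ls)))          # ~ 0

k = np.linalg.matrix_rank(X)                     # k = 10 here
print(k, np.sum((Y_hat - Y) ** 2) / sigma ** 2)  # one draw of a chi^2_{n-k} variable
```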

  26. Linear regression and lack of identifiability (3). In particular, $\mathbb{E}\big[\|\hat{Y} - Y\|_2^2\big] = (n - k)\,\sigma^2$. Unbiased estimator of the variance: $\hat{\sigma}^2 = \dfrac{1}{n - k}\,\|\hat{Y} - Y\|_2^2$.

  27. Linear regression in high dimension (1). Consider again the following model: $Y = X\beta + \varepsilon$, with:
      1. $Y \in \mathbb{R}^n$ (dependent variables), $X \in \mathbb{R}^{n \times p}$ (deterministic design);
      2. $\beta \in \mathbb{R}^p$, unknown: to be estimated;
      3. $\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I_n)$.
      For each $i$, $X_i \in \mathbb{R}^p$ is the vector of covariates of the $i$-th individual. If $p$ is too large ($p > n$), there are too many parameters to be estimated (the model overfits), even though some covariates may be irrelevant. Solution: reduction of the dimension.

  28. Linear regression in high dimension (2). Idea: assume that only a few coordinates of $\beta$ are nonzero (but we do not know which ones). Based on the sample, select a subset of covariates and estimate the corresponding coordinates of $\beta$. For $S \subseteq \{1, \dots, p\}$, let
      $\hat{\beta}_S \in \operatorname{argmin}_{t \in \mathbb{R}^S} \|Y - X_S t\|_2^2,$
      where $X_S$ is the submatrix of $X$ obtained by keeping only the covariates indexed in $S$.

  29. Linear regression in high dimension (3). Select a subset $S$ that minimizes the prediction error penalized by the complexity (or size) of the model:
      $\|Y - X_S \hat{\beta}_S\|_2^2 + \lambda |S|,$
      where $\lambda > 0$ is a tuning parameter.
      - If $\lambda = 2\hat{\sigma}^2$, this is Mallows' $C_p$ or the AIC criterion.
      - If $\lambda = \hat{\sigma}^2 \log n$, this is the BIC criterion.
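
A brute-force sketch of this criterion (my own toy version, feasible only for small p since it enumerates all 2^p subsets): it fits beta_hat_S on every subset S of columns and keeps the S minimizing ||Y - X_S beta_hat_S||^2 + lambda |S|, with lambda = 2 sigma_hat^2 estimated from the full model as one possible (AIC-type) choice.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
n, p = 60, 8
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[[0, 3]] = [2.0, -1.5]                   # only two covariates are truly relevant
Y = X @ beta + rng.normal(size=n)

# Plug-in estimate of sigma^2 from the full model (one possible choice).
beta_full, *_ = np.linalg.lstsq(X, Y, rcond=None)
sigma2_hat = np.sum((Y - X @ beta_full) ** 2) / (n - p)
lam = 2 * sigma2_hat                         # AIC / Mallows-type penalty

best_S, best_crit = None, np.inf
for size in range(p + 1):
    for S in combinations(range(p), size):
        if S:
            X_S = X[:, list(S)]
            res = Y - X_S @ np.linalg.lstsq(X_S, Y, rcond=None)[0]
        else:
            res = Y                          # empty model predicts 0
        crit = res @ res + lam * len(S)
        if crit < best_crit:
            best_S, best_crit = S, crit

print(best_S)   # typically recovers the truly relevant covariates {0, 3}
```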
