

  1. Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square
     Yuxin Chen, Electrical Engineering, Princeton University

  2. Coauthors: Pragya Sur (Stanford Statistics) and Emmanuel Candès (Stanford Statistics & Math)

  3. In memory of Tom Cover (1938 - 2012). [Photo: Tom @ Stanford EE.] “We all know the feeling that follows when one investigates a problem, goes through a large amount of algebra, and finally investigates the answer to find that the entire problem is illuminated not by the analysis but by the inspection of the answer”

  4. Inference in regression problems. Example: logistic regression,
     y_i ∼ logistic-model(x_i^⊤ β), 1 ≤ i ≤ n

  5. Inference in regression problems. Example: logistic regression,
     y_i ∼ logistic-model(x_i^⊤ β), 1 ≤ i ≤ n
     One wishes to determine which covariates are of importance, i.e. test β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p)

  6. Classical tests: β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p). Standard approaches (widely used in R, Matlab, etc.): use asymptotic distributions of certain statistics

  7. Classical tests: β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p). Standard approaches (widely used in R, Matlab, etc.): use asymptotic distributions of certain statistics
     • Wald test: Wald statistic → χ²
     • Likelihood ratio test: log-likelihood ratio statistic → χ²
     • Score test: score → N(0, Fisher Info)
     • ...

  8. Example: logistic regression in R (n = 100, p = 30)

    > fit = glm(y ~ X, family = binomial)
    > summary(fit)

    Call:
    glm(formula = y ~ X, family = binomial)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.7727  -0.8718   0.3307   0.8637   2.3141

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept)  0.086602   0.247561   0.350  0.72647
    X1           0.268556   0.307134   0.874  0.38190
    X2           0.412231   0.291916   1.412  0.15790
    X3           0.667540   0.363664   1.836  0.06642 .
    X4          -0.293916   0.331553  -0.886  0.37536
    X5           0.207629   0.272031   0.763  0.44531
    X6           1.104661   0.345493   3.197  0.00139 **
    ...
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Can these inference calculations (e.g. p-values) be trusted?
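
The slide does not show how the data were generated; the following is a minimal R sketch that produces this kind of output, under the assumption of an i.i.d. Gaussian design and a global null (neither is stated on this slide):

    # Sketch of the experiment above (n = 100, p = 30).
    # ASSUMPTION: i.i.d. Gaussian design and a global null (beta = 0).
    set.seed(1)
    n <- 100; p <- 30
    X <- matrix(rnorm(n * p), n, p)   # i.i.d. Gaussian design
    y <- rbinom(n, 1, 0.5)            # responses under the global null
    fit <- glm(y ~ X, family = binomial)
    summary(fit)                      # classical z-values and p-values, as above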

  9. This talk: likelihood ratio test (LRT). β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p). Log-likelihood ratio (LLR) statistic:
     LLR_j := ℓ(β̂) − ℓ(β̂^(−j))
     • ℓ(·): log-likelihood
     • β̂ = argmax_β ℓ(β): unconstrained MLE

  10. This talk: likelihood ratio test (LRT). β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p). Log-likelihood ratio (LLR) statistic:
     LLR_j := ℓ(β̂) − ℓ(β̂^(−j))
     • ℓ(·): log-likelihood
     • β̂ = argmax_β ℓ(β): unconstrained MLE
     • β̂^(−j) = argmax_{β: β_j = 0} ℓ(β): constrained MLE
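
In R, LLR_j follows directly from the two fits via logLik(); a minimal sketch for j = 1, reusing X and y from the earlier sketch:

    # LLR_j = loglik(unconstrained MLE) - loglik(MLE constrained to beta_j = 0).
    full    <- glm(y ~ X, family = binomial)         # beta-hat
    reduced <- glm(y ~ X[, -1], family = binomial)   # beta-hat^(-j) with j = 1
    LLR_1 <- as.numeric(logLik(full) - logLik(reduced))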

  11. Wilks’ phenomenon ’1938 (Samuel Wilks, Princeton). β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p). The LRT statistic asymptotically follows a chi-square distribution (under the null):
     2 LLR_j → χ²₁ in distribution (p fixed, n → ∞)

  12. Wilks’ phenomenon ’1938 (Samuel Wilks, Princeton). β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p). The LRT statistic asymptotically follows a chi-square distribution (under the null):
     2 LLR_j → χ²₁ in distribution (p fixed, n → ∞)
     so a p-value computed from χ²₁ can be used to assess the significance of coefficients
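
Concretely, the classical Wilks p-value is the upper χ²₁ tail at twice the LLR statistic; in R, continuing the sketch above:

    # Classical p-value from Wilks' theorem: 2 * LLR_j ~ chi-square(1).
    p_classical <- pchisq(2 * LLR_1, df = 1, lower.tail = FALSE)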

  13. Classical LRT in high dimensions. Linear regression: y = Xβ + η, with η i.i.d. Gaussian and n/p ∈ (1, ∞). [Figure: histogram of classical p-values, approximately uniform.] For linear regression (with Gaussian noise) in high dimensions, 2 LLR_j ∼ χ²₁, so the classical test always works

  14. Classical LRT in high dimensions. Logistic regression: y ∼ logistic-model(Xβ), with p = 1200, n = 4000. [Figure: histogram of classical p-values.] Classical p-values are highly nonuniform

  15. Classical LRT in high dimensions. Logistic regression: y ∼ logistic-model(Xβ), with p = 1200, n = 4000. [Figure: histogram of classical p-values.] Classical p-values are highly nonuniform. Wilks’ theorem seems inadequate to accommodate logistic regression in high dimensions

  16. Bartlett correction? (n = 4000, p = 1200). [Figure: histogram of classical Wilks p-values.]
     • Bartlett correction (a finite-sample correction): 2 LLR_j / (1 + α_n/n) ∼ χ²₁

  17. Bartlett correction? (n = 4000, p = 1200). [Figures: histograms of p-values, classical Wilks vs. Bartlett-corrected.]
     • Bartlett correction (a finite-sample correction): 2 LLR_j / (1 + α_n/n) ∼ χ²₁
     ◦ p-values are still non-uniform → this is NOT a finite-sample effect

  18. Bartlett correction? (n = 4000, p = 1200). [Figures: histograms of p-values, classical Wilks vs. Bartlett-corrected.]
     • Bartlett correction (a finite-sample correction): 2 LLR_j / (1 + α_n/n) ∼ χ²₁
     ◦ p-values are still non-uniform → this is NOT a finite-sample effect
     What happens in high dimensions?
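
The slides do not say how α_n is obtained; one standard route, sketched here purely as an assumption, is to calibrate the factor by Monte Carlo so the corrected statistic has the χ²₁ null mean of 1 (reusing n, p, X from the earlier sketch):

    # Hypothetical Monte Carlo calibration of a Bartlett-type correction factor,
    # testing beta_1 = 0 under the global null with the design X held fixed.
    stats <- replicate(200, {
      y0 <- rbinom(n, 1, 0.5)                        # fresh null responses
      f1 <- glm(y0 ~ X, family = binomial)
      f0 <- glm(y0 ~ X[, -1], family = binomial)
      2 * as.numeric(logLik(f1) - logLik(f0))
    })
    factor_n <- mean(stats)                          # estimate of 1 + alpha_n / n
    p_bartlett <- pchisq(stats / factor_n, df = 1, lower.tail = FALSE)
    hist(p_bartlett)                                 # still non-uniform in high dimensions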

  19. Our findings. [Figures: histograms of p-values under three calibrations: classical Wilks, Bartlett-corrected, and rescaled χ².]
     • Bartlett correction (a finite-sample correction): 2 LLR_j / (1 + α_n/n) ∼ χ²₁
     ◦ p-values are still non-uniform → this is NOT a finite-sample effect
     • A glimpse of our theory: the LRT statistic follows a rescaled χ² distribution

  20. Problem formulation (formal). [Diagram: design matrix X (n × p), response y, coefficient vector β.]
     • Gaussian design: X_i ∼ N(0, Σ), independently
     • Logistic model: for 1 ≤ i ≤ n,
       y_i = 1 with prob. 1/(1 + exp(−X_i^⊤ β)); y_i = −1 with prob. 1/(1 + exp(X_i^⊤ β))
     • Proportional growth: p/n → constant
     • Global null: β = 0

  21. When does the MLE exist?
     (MLE) maximize_β ℓ(β) = −Σ_{i=1}^n log(1 + exp(−y_i X_i^⊤ β)), where each summand is ≤ 0
     [Figure: point clouds {i : y_i = 1} and {i : y_i = −1}, split by a hyperplane.]
     The MLE is unbounded if ∃ a perfect separating hyperplane

  22. When does the MLE exist?
     (MLE) maximize_β ℓ(β) = −Σ_{i=1}^n log(1 + exp(−y_i X_i^⊤ β)), where each summand is ≤ 0
     If ∃ a hyperplane that perfectly separates the {y_i}, i.e. ∃ β̃ s.t. y_i X_i^⊤ β̃ > 0 for all i, then the MLE is unbounded:
     lim_{a→∞} ℓ(a β̃) = 0, so the supremum is approached only as ∥β∥ → ∞
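
This unboundedness is easy to see numerically. A toy sketch with an assumed one-dimensional separable dataset (the names Xs, ys and the data are illustrative, not from the slide):

    # Under separation, ell(a * beta_tilde) increases toward 0 and is never attained.
    Xs <- matrix(c(-2, -1, 1, 2), ncol = 1)
    ys <- c(-1, -1, 1, 1)                      # y_i = sign(x_i): perfectly separable
    ell <- function(beta) -sum(log(1 + exp(-ys * (Xs %*% beta))))
    sapply(c(1, 10, 100), ell)                 # climbs toward 0 as a = 1, 10, 100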

  23. When does the MLE exist? Separating capacity (Tom Cover, Ph.D. thesis ’1965). [Figure: configurations with n = 2, n = 4, and n = 12 points, labeled y_i = 1 and y_i = −1.]
     As the number of samples n increases ⇒ more difficult to find a separating hyperplane

  24. When does the MLE exist? Separating capacity (Tom Cover, Ph.D. thesis ’1965). [Figure: as on the previous slide.]
     Theorem 1 (Cover ’1965). Under i.i.d. Gaussian design, a separating hyperplane exists with high probability iff n/p < 2 (asymptotically)
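
Cover's threshold can be probed numerically: the data admit a separating hyperplane iff the feasibility linear program below has a solution. A sketch assuming the lpSolve package (β is split into nonnegative parts because lp() only handles nonnegative variables):

    library(lpSolve)

    # TRUE iff some beta achieves y_i * X_i' beta >= 1 for every i
    # (equivalent, by rescaling beta, to strict separation y_i * X_i' beta > 0).
    separable <- function(X, y) {
      A <- cbind(y * X, -(y * X))               # variables: beta_plus, beta_minus
      res <- lp(direction = "min",
                objective.in = rep(0, 2 * ncol(X)),
                const.mat = A,
                const.dir  = rep(">=", nrow(X)),
                const.rhs  = rep(1, nrow(X)))
      res$status == 0                           # 0 = feasible, 2 = infeasible
    }

    # Fraction of separable null datasets on either side of n/p = 2.
    cover_check <- function(n, p, reps = 50) {
      mean(replicate(reps, {
        X <- matrix(rnorm(n * p), n, p)
        y <- sample(c(-1, 1), n, replace = TRUE) # global null labels
        separable(X, y)
      }))
    }
    cover_check(n = 80,  p = 50)                # n/p = 1.6 < 2: separable w.h.p.
    cover_check(n = 120, p = 50)                # n/p = 2.4 > 2: separation becomes rare as p grows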

  25. Main result: asymptotic distribution of the LRT
     Theorem 2 (Sur, Chen, Candès ’2017). Suppose n/p > 2. Under i.i.d. Gaussian design and the global null,
     2 LLR_j → α(p/n) · χ²₁ in distribution (a rescaled χ²)

  26. Main result: asymptotic distribution of the LRT
     Theorem 2 (Sur, Chen, Candès ’2017). Suppose n/p > 2. Under i.i.d. Gaussian design and the global null,
     2 LLR_j → α(p/n) · χ²₁ in distribution (a rescaled χ²)
     • α(p/n) can be determined by solving a system of 2 nonlinear equations in 2 unknowns (τ, b):
       τ² = (n/p) · E[(Ψ(τZ; b))²]
       p/n = E[Ψ′(τZ; b)]
     where Z ∼ N(0, 1), Ψ is some operator, and α(p/n) = τ²/b

  27. Main result: asymptotic distribution of the LRT
     Theorem 2 (Sur, Chen, Candès ’2017). Suppose n/p > 2. Under i.i.d. Gaussian design and the global null,
     2 LLR_j → α(p/n) · χ²₁ in distribution (a rescaled χ²)
     • α(p/n) can be determined by solving a system of 2 nonlinear equations in 2 unknowns
     ◦ α(·) depends only on the aspect ratio p/n
     ◦ It is not a finite-sample effect
     ◦ α(0) = 1: matches classical theory
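
The slides leave Ψ unspecified; in the companion paper the operator is Ψ(z; b) = b·ρ′(prox_{bρ}(z)) with ρ(t) = log(1 + eᵗ) and prox_{bρ}(z) = argmin_t { bρ(t) + (t − z)²/2 }. Taking that definition as an assumption, a numerical sketch of the system and of α = τ²/b:

    # Solve the 2-equation system for (tau, b), then return alpha = tau^2 / b.
    # ASSUMPTION: Psi(z; b) = b * rho'(prox_{b rho}(z)), rho(t) = log(1 + exp(t)),
    # as in the companion paper; the slide itself leaves Psi unspecified.
    rho1 <- function(t) 1 / (1 + exp(-t))            # rho'
    rho2 <- function(t) rho1(t) * (1 - rho1(t))      # rho''

    prox <- function(z, b)                           # root of b*rho'(t) + t - z = 0
      uniroot(function(t) b * rho1(t) + t - z,
              c(-50 - abs(z), 50 + abs(z)))$root

    Psi  <- function(z, b) b * rho1(prox(z, b))
    dPsi <- function(z, b) {                         # d/dz of Psi, by implicit differentiation
      t <- prox(z, b)
      b * rho2(t) / (1 + b * rho2(t))
    }

    # E[f(Z)] for Z ~ N(0, 1), by quadrature (Psi is not vectorized, hence sapply).
    Egauss <- function(f) integrate(function(z) sapply(z, f) * dnorm(z), -8, 8)$value

    alpha_of <- function(kappa) {                    # kappa = p / n
      resid <- function(par) {                       # par = (log tau, log b)
        tau <- exp(par[1]); b <- exp(par[2])
        c(tau^2 - Egauss(function(z) Psi(tau * z, b)^2) / kappa,
          kappa - Egauss(function(z) dPsi(tau * z, b)))
      }
      sol <- optim(c(0, 0), function(par) sum(resid(par)^2))
      exp(2 * sol$par[1]) / exp(sol$par[2])          # alpha = tau^2 / b
    }

Nelder-Mead on the squared residuals is a crude but serviceable solver for a sketch; the log-parameterization simply keeps τ and b positive.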

  28. Our adjusted LRT theory in practice. [Left figure: rescaling constant α(p/n) for the logistic model, rising from 1.00 at κ = p/n = 0 toward 2.00 near κ = 0.40. Right figure: histogram of empirical p-values, ≈ Unif(0, 1).]
     Empirically, the LRT statistic ≈ rescaled χ²₁ (as predicted)
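
Putting the pieces together, the adjusted p-value divides the LLR statistic by the rescaling constant before taking the χ²₁ tail; a sketch reusing alpha_of and LLR_1 from above (the aspect ratio comes from the n = 100, p = 30 example, so n/p > 2 as the theorem requires):

    # Adjusted p-value: compare 2 * LLR_j / alpha(p/n) with chi-square(1).
    alpha <- alpha_of(30 / 100)
    p_adjusted <- pchisq(2 * LLR_1 / alpha, df = 1, lower.tail = FALSE)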
