Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square

Yuxin Chen
Electrical Engineering, Princeton University

Coauthors:
Pragya Sur, Stanford Statistics
Emmanuel Candès, Stanford Statistics & Math
In memory of Tom Cover (1938–2012)

Tom @ Stanford EE

"We all know the feeling that follows when one investigates a problem, goes through a large amount of algebra, and finally investigates the answer to find that the entire problem is illuminated not by the analysis but by the inspection of the answer."
Inference in regression problems

Example: logistic regression

    yi ∼ logistic-model(xi⊤β),   1 ≤ i ≤ n

One wishes to determine which covariates are of importance, i.e. test βj = 0 vs. βj ≠ 0 (1 ≤ j ≤ p).
Classical tests

βj = 0 vs. βj ≠ 0 (1 ≤ j ≤ p)

Standard approaches (widely used in R, Matlab, etc.): use asymptotic distributions of certain statistics
- Wald test: Wald statistic → χ²
- Likelihood ratio test: log-likelihood ratio statistic → χ²
- Score test: score → N(0, Fisher Info)
- ...
Example: logistic regression in R (n = 100, p = 30)

> fit = glm(y ~ X, family = binomial)
> summary(fit)

Call:
glm(formula = y ~ X, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7727  -0.8718   0.3307   0.8637   2.3141

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.086602   0.247561   0.350  0.72647
X1           0.268556   0.307134   0.874  0.38190
X2           0.412231   0.291916   1.412  0.15790
X3           0.667540   0.363664   1.836  0.06642 .
X4          -0.293916   0.331553  -0.886  0.37536
X5           0.207629   0.272031   0.763  0.44531
X6           1.104661   0.345493   3.197  0.00139 **
...
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Can these inference calculations (e.g. p-values) be trusted?
This talk: likelihood ratio test (LRT)

βj = 0 vs. βj ≠ 0 (1 ≤ j ≤ p)

Log-likelihood ratio (LLR) statistic:

    LLRj := ℓ(β̂) − ℓ(β̂(−j))

- ℓ(·): log-likelihood
- β̂ = arg maxβ ℓ(β): unconstrained MLE
- β̂(−j) = arg maxβ:βj=0 ℓ(β): constrained MLE
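To make the statistic concrete, here is a minimal R sketch (mine, not from the talk) that computes LLRj by fitting the unconstrained and constrained logistic models; the data-generating step is purely illustrative, and the final line applies the classical Wilks χ²₁ approximation discussed on the next slide.

# Sketch: LLR_j from full vs. constrained fits, plus the classical Wilks p-value
set.seed(1)
n <- 100; p <- 30; j <- 1
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 0.5)                            # illustrative null data
fit_full <- glm(y ~ X, family = binomial)         # unconstrained MLE
fit_cons <- glm(y ~ X[, -j], family = binomial)   # constrained MLE (beta_j = 0)
llr <- as.numeric(logLik(fit_full) - logLik(fit_cons))
p_classical <- pchisq(2 * llr, df = 1, lower.tail = FALSE)   # Wilks' chi-square(1)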
Wilks' phenomenon '1938

Samuel Wilks, Princeton

[figure: χ²₁ density with shaded p-value tail]

Assess significance of coefficients: βj = 0 vs. βj ≠ 0 (1 ≤ j ≤ p)

LRT asymptotically follows a chi-square distribution (under the null):

    2 LLRj →d χ²₁   (p fixed, n → ∞)
Classical LRT in high dimensions

n/p ∈ (1, ∞). Linear regression: y = Xβ + η, with i.i.d. Gaussian noise η

[histogram of p-values: approximately uniform]

For linear regression (with Gaussian noise) in high dimensions, 2 LLRj ∼ χ²₁ and the classical p-values remain uniform (the classical test always works).
Classical LRT in high dimensions

p = 1200, n = 4000. Logistic regression: y ∼ logistic-model(Xβ)

[histogram of classical p-values: highly nonuniform]

Classical p-values are highly nonuniform. Wilks' theorem seems inadequate in accommodating logistic regression in high dimensions.
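This failure is easy to reproduce; a minimal Monte Carlo sketch along the slide's lines (not the authors' code; dimensions scaled down from n = 4000, p = 1200 for speed, keeping the same aspect ratio p/n = 0.3):

set.seed(2)
n <- 400; p <- 120; B <- 200                 # same aspect ratio, smaller for speed
pvals <- replicate(B, {
  X <- matrix(rnorm(n * p), n, p)
  y <- rbinom(n, 1, 0.5)                     # global null: beta = 0
  f1 <- glm(y ~ X, family = binomial)
  f0 <- glm(y ~ X[, -1], family = binomial)  # drop the tested covariate
  pchisq(2 * as.numeric(logLik(f1) - logLik(f0)), df = 1, lower.tail = FALSE)
})
hist(pvals)                                  # visibly skewed toward 0, not uniform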
Bartlett correction? (n = 4000, p = 1200)

[histograms of p-values: classical Wilks vs. Bartlett-corrected]

- Bartlett correction (finite-sample effect): 2 LLRj / (1 + αn/n) ∼ χ²₁
- p-values are still non-uniform → this is NOT a finite-sample effect

What happens in high dimensions?

Our findings

[histograms of p-values: classical Wilks, Bartlett-corrected, rescaled χ²]

- A glimpse of our theory: the LRT follows a rescaled χ² distribution
Problem formulation (formal)

[diagram: design matrix X (n × p), response y, coefficient vector β]

- Gaussian design: Xi ∼ N(0, Σ), independently
- Logistic model: for 1 ≤ i ≤ n,
      yi = +1 with prob. 1 / (1 + exp(−Xi⊤β)),
      yi = −1 with prob. 1 / (1 + exp(Xi⊤β))
- Proportional growth: p/n → constant
- Global null: β = 0
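A minimal R sketch of this data model (my illustration, assuming Σ = I for simplicity):

set.seed(3)
n <- 4000; p <- 1200
X <- matrix(rnorm(n * p), n, p)              # Gaussian design, Sigma = I
beta <- rep(0, p)                            # global null
prob <- plogis(drop(X %*% beta))             # P(y_i = +1) = 1/(1 + exp(-X_i' beta))
y <- ifelse(runif(n) < prob, 1, -1)          # responses in {-1, +1}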
When does MLE exist?

(MLE)   maximizeβ ℓ(β) = −∑_{i=1}^n log(1 + exp(−yi Xi⊤β))

(each summand log(1 + exp(·)) is ≥ 0, hence ℓ(β) ≤ 0)

[figure: two point clouds, yi = 1 vs. yi = −1, split by a hyperplane]

If ∃ a hyperplane that perfectly separates {yi}, i.e. ∃ β s.t. yi Xi⊤β > 0 for all i, then the MLE is unbounded:

    lim_{a→∞} ℓ(aβ) = 0
When does MLE exist?

Separating capacity (Tom Cover, Ph.D. thesis '1965)

[figure: sample configurations for n = 2, 4, 12, with yi = 1 vs. yi = −1]

As the number of samples n increases, it becomes more difficult to find a separating hyperplane.

Theorem 1 (Cover '1965). Under i.i.d. Gaussian design, a separating hyperplane exists with high prob. iff n/p < 2 (asymptotically).
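Cover's threshold can be probed numerically: separability is an LP feasibility question (does some β satisfy yi Xi⊤β ≥ 1 for all i?). A sketch of such a check, assuming the lpSolve package; β is split as u − v with u, v ≥ 0 to fit lpSolve's nonnegativity convention:

library(lpSolve)
separable <- function(X, y) {
  A <- cbind(y * X, -(y * X))                 # constraints in (u, v), beta = u - v
  sol <- lp("min", rep(0, 2 * ncol(X)), A, rep(">=", nrow(X)), rep(1, nrow(X)))
  sol$status == 0                             # 0 = feasible, 2 = infeasible
}
set.seed(4)
p <- 50
for (n in c(60, 90, 120)) {                   # n/p = 1.2, 1.8, 2.4: below/above 2
  freq <- mean(replicate(50, {
    X <- matrix(rnorm(n * p), n, p)
    y <- sample(c(-1, 1), n, replace = TRUE)  # global null
    separable(X, y)
  }))
  cat("n/p =", n / p, " P(separable) ~", freq, "\n")
}

Empirically, the separability frequency should drop from near 1 to near 0 as n/p crosses 2 (the transition is smeared at this small p), matching Theorem 1.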
Main result: asymptotic distribution of LRT

Theorem 2 (Sur, Chen, Candès '2017). Suppose n/p > 2. Under i.i.d. Gaussian design and the global null,

    2 LLRj →d α(p/n) · χ²₁   (rescaled χ²)

- α(p/n) can be determined by solving a system of 2 nonlinear equations in 2 unknowns (τ, b):

      τ² = (n/p) E[(Ψ(τZ; b))²]
      p/n = E[Ψ′(τZ; b)]

  where Z ∼ N(0, 1), Ψ is some operator, and α(p/n) = τ²/b
- α(·) depends only on the aspect ratio p/n
- It is not a finite-sample effect
- α(0) = 1: matches classical theory
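The slide leaves Ψ unspecified; in the paper it is built from the proximal operator of the logistic loss ρ(t) = log(1 + exp(t)), namely Ψ(z; b) = z − prox_{bρ}(z). Under that reading, here is a numerical sketch for α(p/n); treat the operator details and the solver as my assumptions, not the authors' code.

# Sketch: solve the two-equation system for (tau, b) at kappa = p/n,
# then alpha = tau^2 / b.  Assumes Psi(z; b) = z - prox_{b*rho}(z) with
# rho(t) = log(1 + exp(t)); expectation over Z ~ N(0,1) by quadrature.
prox <- function(z, b) {                     # Newton iterations for x + b*rho'(x) = z
  x <- z
  for (k in 1:50) x <- x - (x + b * plogis(x) - z) / (1 + b * dlogis(x))
  x
}
E <- function(f) integrate(function(z) f(z) * dnorm(z), -8, 8)$value
resid2 <- function(par, kappa) {
  tau <- exp(par[1]); b <- exp(par[2])       # log-parametrization keeps tau, b > 0
  Psi  <- function(z) z - prox(z, b)
  dPsi <- function(z) { u <- b * dlogis(prox(z, b)); u / (1 + u) }
  c(tau^2 - E(function(z) Psi(tau * z)^2) / kappa,    # tau^2 = E[Psi^2] / kappa
    kappa - E(function(z) dPsi(tau * z)))             # kappa = E[Psi']
}
alpha <- function(kappa) {
  fit <- optim(c(0, 0), function(par) sum(resid2(par, kappa)^2),
               control = list(maxit = 5000, reltol = 1e-14))
  tau <- exp(fit$par[1]); b <- exp(fit$par[2])
  tau^2 / b
}
alpha(0.3)   # rescaling constant at p/n = 0.3; exceeds 1, cf. the curve on the next slide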
Our adjusted LRT theory in practice

[left: rescaling constant α(p/n) as a function of p/n; right: histogram of adjusted p-values, ≈ Unif(0, 1)]

Empirically, the LRT ≈ rescaled χ²₁ (as predicted).
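Given the rescaling constant, adjusting the classical test is a one-liner; a sketch, with llr, n, p, and alpha() as in the earlier snippets:

alpha_hat <- alpha(p / n)                    # rescaling constant from the theory
p_adjusted <- pchisq(2 * llr / alpha_hat, df = 1, lower.tail = FALSE)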
Validity of tail approximation

[empirical CDF of adjusted p-values vs. the diagonal, zoomed into [0, 0.01]]

Empirical CDF of adjusted p-values (n = 4000, p = 1200). The empirical CDF is in near-perfect agreement with the diagonal, suggesting that our theory is remarkably accurate even when we zoom in on the tail.
Efficacy under moderate sample sizes

[empirical CDF of adjusted p-values vs. the diagonal]

Empirical CDF of adjusted p-values (n = 200, p = 60). Our theory seems adequate for moderately large samples.
Universality: non-Gaussian covariates
10000 20000 30000 0.00 0.25 0.50 0.75 1.00 P−Values Counts 5000 10000 15000 0.25 0.50 0.75 1.00 P−Values Counts 2500 5000 7500 10000 12500 0.00 0.25 0.50 0.75 1.00 P−Values Counts
classical Wilks Bartlett-corrected rescaled χ2
i.i.d. Bernoulli design, n = 4000, p = 1200
19/ 26
Connection to convex geometry

[figure: range(X) vs. the orthant R^n_+, in the separable and non-separable cases]

WLOG, if y1 = · · · = yn = 1, then

    separability ⟺ range(X) ∩ R^n_+ ≠ {0}

- can be analyzed via convex geometry (e.g. Amelunxen et al.)
Connection to robust M-estimation

Since yi = ±1 and Xi ∼ N(0, Σ) (so that, under the global null, yi Xi =d Xi),

    maximizeβ ℓ(β) = −∑_{i=1}^n log(1 + exp(−yi Xi⊤β))
                   =d −∑_{i=1}^n log(1 + exp(−Xi⊤β)) = −∑_{i=1}^n ρ(−Xi⊤β)

where ρ(t) := log(1 + exp(t)). Therefore

    MLE: β̂ = arg minβ ∑_{i=1}^n ρ(yi − Xi⊤β)   (robust M-estimation with y = 0)
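A quick numerical check of this identity (my illustration, not from the talk); R's glm wants responses in {0, 1}, hence the recoding:

# Verify: logistic log-likelihood = -sum(rho(-y_i * X_i' beta)) for y_i in {-1, +1}
rho <- function(t) log1p(exp(t))
set.seed(5)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- sample(c(-1, 1), n, replace = TRUE)
fit <- glm((y + 1) / 2 ~ X - 1, family = binomial)   # no intercept, {0,1} response
all.equal(as.numeric(logLik(fit)),
          -sum(rho(-y * drop(X %*% coef(fit)))))     # TRUE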
Connection to robust M-estimation

    MLE: β̂ = arg minβ ∑_{i=1}^n ρ(yi − Xi⊤β)   (robust M-estimation)

Variance inflation as p/n grows (El Karoui et al. '13, Donoho, Montanari '13)

[figure: rescaling constant α(p/n) as a function of p/n]

variance inflation → increasing rescaling factor
This is not just about logistic regression

Our theory is applicable to
- logistic model
- probit model
- linear model (under Gaussian noise): rescaling constant α(p/n) ≡ 1 (consistent with classical theory)
- linear model (under non-Gaussian noise)
- ...
Key ingredients in establishing our theory

Key step is to realize that

    2 LLRj →d (p / b(p/n)) · β̂j²   (rescaled χ²)

where b(·) depends only on p/n, and β̂j is asymptotically

    N(0, α(p/n) b(p/n) / p)

so that (p / b(p/n)) · β̂j² ∼ α(p/n) χ²₁.

1. Convex geometry: show ‖β̂‖ = O(1)
2. Approximate message passing: determine the asymptotic distribution of β̂
3. Leave-one-out analysis: characterize the marginal distribution of β̂j (rescaled Gaussian) and ensure higher-order terms vanish
A small sample of prior work

- Candès, Fan, Janson, Lv '16: observed empirically the nonuniformity of p-values in logistic regression
- Fan, Demirkaya, Lv '17: classical asymptotic normality of the MLE (basis of the Wald test) fails to hold in logistic regression when p ≍ n^{2/3}
- El Karoui, Bean, Bickel, Lim, Yu '13, El Karoui '13, Donoho, Montanari '13, El Karoui '15: robust M-estimation for linear models (assuming strong convexity)
Summary

- Caution needs to be exercised when applying classical statistical procedures in a high-dimensional regime, a regime of growing interest and relevance
- What shall we do under the non-null (β ≠ 0)?

Paper: "The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square", Pragya Sur, Yuxin Chen, Emmanuel Candès, 2017.