Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square — PowerPoint PPT Presentation


SLIDE 1

Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square

Yuxin Chen Electrical Engineering, Princeton University

SLIDE 2

Coauthors

Pragya Sur, Stanford Statistics · Emmanuel Candès, Stanford Statistics & Math


SLIDE 3

In memory of Tom Cover (1938 - 2012)

Tom @ Stanford EE

“We all know the feeling that follows when one investigates a problem, goes through a large amount of algebra, and finally investigates the answer to find that the entire problem is illuminated not by the analysis but by the inspection of the answer”



SLIDE 5

Inference in regression problems

Example: logistic regression y_i ∼ logistic-model(x_i^⊤ β), 1 ≤ i ≤ n

One wishes to determine which covariates are of importance, i.e. test β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p)


SLIDE 7

Classical tests

β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p)

Standard approaches (widely used in R, Matlab, etc.): use asymptotic distributions of certain statistics

  • Wald test: Wald statistic → χ²
  • Likelihood ratio test: log-likelihood ratio statistic → χ²
  • Score test: score → N(0, Fisher information)
  • ...

SLIDE 8

Example: logistic regression in R (n = 100, p = 30)

  > fit = glm(y ~ X, family = binomial)
  > summary(fit)

  Call:  glm(formula = y ~ X, family = binomial)

  Deviance Residuals:
      Min       1Q   Median       3Q      Max
  -1.7727  -0.8718   0.3307   0.8637   2.3141

  Coefficients:
               Estimate Std. Error z value Pr(>|z|)
  (Intercept)  0.086602   0.247561   0.350  0.72647
  X1           0.268556   0.307134   0.874  0.38190
  X2           0.412231   0.291916   1.412  0.15790
  X3           0.667540   0.363664   1.836  0.06642 .
  X4          -0.293916   0.331553  -0.886  0.37536
  X5           0.207629   0.272031   0.763  0.44531
  X6           1.104661   0.345493   3.197  0.00139 **
  ...
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Can these inference calculations (e.g. p-values) be trusted?


SLIDE 10

This talk: likelihood ratio test (LRT)

β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p)

Log-likelihood ratio (LLR) statistic: LLR_j := ℓ(β̂) − ℓ(β̂^(−j))

  • ℓ(·): log-likelihood
  • β̂ = arg max_β ℓ(β): unconstrained MLE
  • β̂^(−j) = arg max_{β : β_j = 0} ℓ(β): constrained MLE
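The LLR statistic above can be computed directly by fitting the unconstrained and constrained MLEs. Below is a minimal Python sketch (not the authors' code; it assumes numpy and scipy, with sample sizes chosen arbitrarily so that Wilks' classical calibration applies) that fits both logistic MLEs by generic numerical optimization and reports the classical Wilks p-value:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def neg_log_lik(beta, X, y):
    # -ell(beta) = sum_i log(1 + exp(-y_i x_i^T beta)), y_i in {-1, +1}
    return np.sum(np.logaddexp(0.0, -y * (X @ beta)))

def llr_statistic(X, y, j):
    # LLR_j = ell(beta_hat) - ell(beta_hat^(-j)): full fit vs. fit with beta_j = 0
    p = X.shape[1]
    full = minimize(neg_log_lik, np.zeros(p), args=(X, y), method="BFGS")
    restricted = minimize(neg_log_lik, np.zeros(p - 1),
                          args=(np.delete(X, j, axis=1), y), method="BFGS")
    return restricted.fun - full.fun   # difference of maximized log-likelihoods

rng = np.random.default_rng(0)
n, p = 200, 5                          # low-dimensional, so Wilks applies
X = rng.standard_normal((n, p))
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)   # global null: beta = 0
llr = llr_statistic(X, y, 0)
pval = chi2.sf(2.0 * llr, df=1)        # classical Wilks calibration: 2*LLR ~ chi2_1
```

Since the full fit optimizes over a superset of the constrained fit's parameter space, the statistic is nonnegative up to solver tolerance.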


SLIDE 12

Wilks’ phenomenon ’1938

Samuel Wilks, Princeton

[Figure: χ²_1 density; a p-value is the shaded tail area]

Assess significance of coefficients: β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p)

The LRT statistic asymptotically follows a chi-square distribution (under the null):

2 LLR_j →_d χ²_1   (p fixed, n → ∞)

SLIDE 13

Classical LRT in high dimensions

Linear regression y = Xβ + η, with η i.i.d. Gaussian; n/p ∈ (1, ∞)

[Figure: histogram of p-values; classical p-values are uniform]

For linear regression (with Gaussian noise) in high dimensions, 2 LLR_j ∼ χ²_1 exactly (the classical test always works)


SLIDE 15

Classical LRT in high dimensions

Logistic regression y ∼ logistic-model(Xβ); p = 1200, n = 4000

[Figure: histogram of p-values; classical p-values are highly nonuniform]

Wilks’ theorem seems inadequate in accommodating logistic regression in high dimensions


SLIDE 18

Bartlett correction? (n = 4000, p = 1200)

[Figure: p-value histograms, classical Wilks vs. Bartlett-corrected]

  • Bartlett correction (a finite-sample effect): 2 LLR_j / (1 + α_n/n) ∼ χ²_1
  • p-values are still non-uniform → this is NOT a finite-sample effect

What happens in high dimensions?

SLIDE 19

Our findings

[Figure: p-value histograms, classical Wilks vs. Bartlett-corrected vs. rescaled χ²]

  • Bartlett correction (a finite-sample effect): 2 LLR_j / (1 + α_n/n) ∼ χ²_1
  • p-values are still non-uniform → this is NOT a finite-sample effect
  • A glimpse of our theory: the LRT statistic follows a rescaled χ² distribution

SLIDE 20

Problem formulation (formal)

[Figure: schematic of the n × p design matrix X, response y, and coefficient vector β]

  • Gaussian design: X_i (independent) ∼ N(0, Σ)
  • Logistic model: y_i = +1 with prob. 1/(1 + exp(−X_i^⊤ β)), and y_i = −1 with prob. 1/(1 + exp(X_i^⊤ β)), for 1 ≤ i ≤ n
  • Proportional growth: p/n → constant
  • Global null: β = 0
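The model on this slide is easy to simulate. Here is a short sketch (assuming numpy; it takes Σ = I for simplicity, with the deck's running sizes n = 4000, p = 1200):

```python
import numpy as np

def simulate_logistic(n, p, beta, rng):
    # Gaussian design X_i ~ N(0, I_p); y_i = +1 with prob. 1/(1+exp(-X_i^T beta)),
    # y_i = -1 otherwise, exactly as on the slide (Sigma = I for simplicity)
    X = rng.standard_normal((n, p))
    prob_pos = 1.0 / (1.0 + np.exp(-X @ beta))
    y = np.where(rng.random(n) < prob_pos, 1, -1)
    return X, y

rng = np.random.default_rng(1)
n, p = 4000, 1200                       # proportional growth: p/n = 0.3
X, y = simulate_logistic(n, p, np.zeros(p), rng)   # global null: beta = 0
```

Under the global null the labels are fair coin flips, so the empirical mean of y concentrates near zero.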


SLIDE 22

When does MLE exist?

(MLE)   maximize_β   ℓ(β) = − Σ_{i=1}^n log(1 + exp(−y_i X_i^⊤ β))  ≤ 0

If ∃ a hyperplane that perfectly separates the classes {i : y_i = 1} and {i : y_i = −1}, i.e. ∃ β s.t. y_i X_i^⊤ β > 0 for all i, then the MLE is unbounded:

lim_{a→∞} ℓ(a β) = 0

SLIDE 23

When does MLE exist?

Separating capacity (Tom Cover, Ph.D. thesis ’1965)

[Figure: labeled point sets (y_i = 1 vs. y_i = −1) for n = 2, 4, 12]

As the number of samples n increases, it becomes more difficult to find a separating hyperplane

SLIDE 24

When does MLE exist?

Separating capacity (Tom Cover, Ph.D. thesis ’1965)

[Figure: labeled point sets (y_i = 1 vs. y_i = −1) for n = 2, 4, 12]

Theorem 1 (Cover ’1965). Under i.i.d. Gaussian design, a separating hyperplane exists with high prob. iff n/p < 2 (asymptotically)
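Separability is a linear feasibility question, so the event in Cover's theorem can be checked directly with a linear program. The sketch below (assuming numpy and scipy's `linprog`; the Monte Carlo sizes are arbitrary illustrations, not from the talk) tests for a margin-1 separating hyperplane and estimates the separation frequency on both sides of the n/p = 2 threshold:

```python
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    # A separating hyperplane (y_i X_i^T beta > 0 for all i) exists iff the
    # rescaled system y_i X_i^T beta >= 1 is feasible; check it with an LP.
    n, p = X.shape
    A_ub = -(y[:, None] * X)            # y_i X_i^T beta >= 1  <=>  -y_i X_i^T beta <= -1
    res = linprog(c=np.zeros(p), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * p, method="highs")
    return res.status == 0              # 0 = feasible/optimal, 2 = infeasible

def separation_frequency(n, p, rng, trials=40):
    # Monte Carlo frequency of perfect separation under the global null
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
        hits += separable(X, y)
    return hits / trials

rng = np.random.default_rng(2)
freq_low = separation_frequency(n=40, p=80, rng=rng)    # n/p = 0.5 < 2
freq_high = separation_frequency(n=400, p=80, rng=rng)  # n/p = 5 > 2
```

With n/p well below 2 the data are (essentially always) separable and the MLE blows up; with n/p well above 2 separation essentially never happens.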


SLIDE 26

Main result: asymptotic distribution of LRT

Theorem 2 (Sur, Chen, Candès ’2017). Suppose n/p > 2. Under i.i.d. Gaussian design and the global null,

2 LLR_j →_d α(p/n) · χ²_1   (a rescaled χ²)

  • α(p/n) can be determined by solving a system of 2 nonlinear equations in 2 unknowns (τ, b):

      τ² = (n/p) · E[(Ψ(τZ; b))²]
      p/n = E[Ψ′(τZ; b)]

    where Z ∼ N(0, 1), Ψ is some operator, and α(p/n) = τ²/b
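The slide leaves the operator Ψ unspecified, but the mechanics of solving such a two-equation system are easy to sketch. In the Python sketch below, `Psi` is a purely hypothetical stand-in (b·tanh, chosen only so that the system has a nontrivial root at the arbitrary aspect ratio κ = 0.8) and is NOT the operator from the paper; the point is the structure, an expectation over Z ∼ N(0, 1) plus one-dimensional root finding:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def expect(f):
    # E[f(Z)] for Z ~ N(0, 1), via adaptive quadrature on [-12, 12]
    val, _ = quad(lambda z: f(z) * np.exp(-0.5 * z * z) / np.sqrt(2 * np.pi),
                  -12.0, 12.0, points=[0.0], limit=200)
    return val

def Psi(t, b):
    # HYPOTHETICAL stand-in for the unspecified operator Psi on the slide
    return b * np.tanh(t)

def dPsi(t, b):
    # derivative of the stand-in Psi in its first argument
    return b / np.cosh(t) ** 2

def b_of_tau(tau, kappa):
    # second equation, p/n = E[Psi'(tau Z; b)]; since Psi' here is linear
    # in b, it solves for b in closed form
    return kappa / expect(lambda z: dPsi(tau * z, 1.0))

def first_eq(tau, kappa):
    # first equation, tau^2 = (n/p) E[(Psi(tau Z; b))^2], as a residual in tau
    b = b_of_tau(tau, kappa)
    return tau**2 - (1.0 / kappa) * expect(lambda z: Psi(tau * z, b) ** 2)

kappa = 0.8                              # aspect ratio p/n (arbitrary choice)
tau = brentq(first_eq, 0.1, 20.0, args=(kappa,))
b = b_of_tau(tau, kappa)
alpha = tau**2 / b                       # rescaling constant alpha(p/n) = tau^2 / b
```

Swapping in the paper's actual Ψ would only change the two small function definitions; the nested solve is unchanged.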
SLIDE 27

Main result: asymptotic distribution of LRT

Theorem 2 (Sur, Chen, Candès ’2017). Suppose n/p > 2. Under i.i.d. Gaussian design and the global null,

2 LLR_j →_d α(p/n) · χ²_1   (a rescaled χ²)

  • α(p/n) can be determined by solving a system of 2 nonlinear equations in 2 unknowns
  • α(·) depends only on the aspect ratio p/n
  • it is not a finite-sample effect
  • α(0) = 1: matches classical theory

SLIDE 28

Our adjusted LRT theory in practice

[Figure, left: rescaling constant α(p/n) as a function of κ = p/n; right: histogram of adjusted p-values, empirically ≈ Unif(0, 1)]

Empirically, the LRT statistic ≈ rescaled χ²_1 (as predicted)
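Once α(p/n) is known, adjusting the test is a one-line change to the classical recipe. In this sketch the LLR values and the constant `ALPHA` are hypothetical placeholders (the slide's curve only shows that α exceeds 1 for κ > 0), not numbers from the paper:

```python
import numpy as np
from scipy.stats import chi2

def adjusted_pvalue(llr, alpha):
    # Rescaled chi-square calibration: 2*LLR_j / alpha(p/n) ~ chi2_1 under the
    # null, so divide the statistic by alpha before the chi-square tail lookup.
    return chi2.sf(2.0 * llr / alpha, df=1)

# Hypothetical inputs for illustration only: a few LLR values and a placeholder
# rescaling constant
llr_stats = np.array([0.31, 1.92, 0.05, 2.77])
ALPHA = 1.5
p_classical = chi2.sf(2.0 * llr_stats, df=1)     # classical Wilks p-values
p_adjusted = adjusted_pvalue(llr_stats, ALPHA)   # rescaled-chi2 p-values
# with alpha > 1 the adjusted p-values are larger, undoing the classical
# test's overstatement of significance
```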

SLIDE 29

Validity of tail approximation

[Figure: empirical CDF of adjusted p-values near the origin (n = 4000, p = 1200)]

The empirical CDF is in near-perfect agreement with the diagonal, suggesting that our theory is remarkably accurate even when we zoom in on the tail

SLIDE 30

Efficacy under moderate sample sizes

[Figure: empirical CDF of adjusted p-values (n = 200, p = 60)]

Our theory seems adequate for moderately large samples

SLIDE 31

Universality: non-Gaussian covariates

[Figure: p-value histograms under i.i.d. Bernoulli design (n = 4000, p = 1200), classical Wilks vs. Bartlett-corrected vs. rescaled χ²]


SLIDE 33

Connection to convex geometry

[Figure: two cases, a separating hyperplane exists vs. no separating hyperplane, illustrated via range(X) and the orthant R^n_+]

WLOG, if y_1 = · · · = y_n = 1, then separability ⇐⇒ range(X) ∩ R^n_+ ≠ {0}

  • can be analyzed via convex geometry (e.g. Amelunxen et al.)


SLIDE 35

Connection to robust M-estimation

Since y_i = ±1 and X_i (independent) ∼ N(0, Σ),

maximize_β ℓ(β) = − Σ_{i=1}^n log(1 + exp(−y_i X_i^⊤ β))
               =_d − Σ_{i=1}^n log(1 + exp(−X_i^⊤ β)) =: − Σ_{i=1}^n ρ(−X_i^⊤ β)

⇒ MLE β̂ = arg min_β Σ_{i=1}^n ρ(y_i − X_i^⊤ β) with y = 0   (robust M-estimation)
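The algebraic step behind this slide, absorbing the label y_i into the covariate vector, can be verified numerically. The following sketch (assuming numpy/scipy; sizes are arbitrary) checks that the logistic MLE coincides with the M-estimator of Σ_i ρ(−x̃_i^⊤ β) for x̃_i = y_i X_i, with ρ(t) = log(1 + e^t):

```python
import numpy as np
from scipy.optimize import minimize

def rho(t):
    # rho(t) = log(1 + e^t), the effective loss on the slide
    return np.logaddexp(0.0, t)

def logistic_nll(beta, X, y):
    # -ell(beta) = sum_i log(1 + exp(-y_i X_i^T beta))
    return np.sum(np.logaddexp(0.0, -y * (X @ beta)))

def m_est_obj(beta, Xt):
    # sum_i rho(-x~_i^T beta), with x~_i = y_i X_i; algebraically identical
    return np.sum(rho(-(Xt @ beta)))

rng = np.random.default_rng(3)
n, p = 120, 4                           # n/p large, so the MLE exists w.h.p.
X = rng.standard_normal((n, p))
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
Xt = y[:, None] * X                     # absorb the label into the covariate

mle = minimize(logistic_nll, np.zeros(p), args=(X, y), method="BFGS").x
m_est = minimize(m_est_obj, np.zeros(p), args=(Xt,), method="BFGS").x
# the two objectives are the same function of beta, so the minimizers agree
```

(The slide's stronger claim, that under the null one may further replace y_i X_i by X_i in distribution, uses the sign symmetry of the Gaussian design; the code checks only the algebraic identity.)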


SLIDE 37

Connection to robust M-estimation

MLE β̂ = arg min_β Σ_{i=1}^n ρ(y_i − X_i^⊤ β)   (robust M-estimation)

Variance inflation as p/n grows (El Karoui et al. ’13; Donoho, Montanari ’13)

[Figure: rescaling constant α(p/n) increases with κ = p/n]

variance inflation → increasing rescaling factor

SLIDE 38

This is not just about logistic regression

Our theory is applicable to

  • the logistic model
  • the probit model
  • the linear model (under Gaussian noise): rescaling constant α(p/n) ≡ 1 (consistent with classical theory)
  • the linear model (under non-Gaussian noise)
  • · · ·

SLIDE 40

Key ingredients in establishing our theory

Key step is to realize that

2 LLR_j →_d (p / b(p/n)) · β̂_j²   (a rescaled χ²)

where b(·) depends only on p/n, and √p · β̂_j →_d N(0, α(p/n) · b(p/n))

  1. Convex geometry: show ‖β̂‖ = O(1)
  2. Approximate message passing: determine the asymptotic distribution of β̂
  3. Leave-one-out analysis: characterize the marginal distribution of β̂_j (a rescaled Gaussian) and ensure higher-order terms vanish

SLIDE 41

A small sample of prior work

  • Candès, Fan, Janson, Lv ’16: empirically observed the nonuniformity of p-values in logistic regression
  • Fan, Demirkaya, Lv ’17: classical asymptotic normality of the MLE (the basis of the Wald test) fails to hold in logistic regression when p ≍ n^{2/3}
  • El Karoui, Bean, Bickel, Lim, Yu ’13; El Karoui ’13; Donoho, Montanari ’13; El Karoui ’15: robust M-estimation for linear models (assuming strong convexity)

SLIDE 42

Summary

  • Caution needs to be exercised when applying classical statistical procedures in a high-dimensional regime — a regime of growing interest and relevance
  • What shall we do under the non-null (β ≠ 0)?

Paper: “The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square”, Pragya Sur, Yuxin Chen, Emmanuel Candès, 2017.