

SLIDE 1

De-biasing arbitrary convex regularizers and asymptotic normality

Pierre C Bellec, Rutgers University Mathematical Methods of Modern Statistics 2, June 2020

SLIDE 2

Joint work with Cun-Hui Zhang (Rutgers).

◮ Second order Poincaré inequalities and de-biasing arbitrary convex regularizers, arXiv:1912.11943
◮ De-biasing the Lasso with degrees-of-freedom adjustment, arXiv:1902.08885

SLIDE 3

High-dimensional statistics

◮ n data points (x_i, Y_i), i = 1, ..., n
◮ p covariates, x_i ∈ R^p

Regimes: p ≥ n,  p ≥ cn,  p ≥ n^α.

For instance, the linear model Y_i = x_i^⊤ β + ε_i for unknown β.

SLIDE 4

M-estimators and regularization

β̂ = arg min_{b ∈ R^p} [ (1/n) Σ_{i=1}^n ℓ(x_i^⊤ b, Y_i) + regularizer(b) ]

for some loss ℓ(·, ·) and regularization penalty.

Typically in the linear model, with the least-squares loss,

β̂ = arg min_{b ∈ R^p} [ ‖y − Xb‖²/(2n) + g(b) ]   with g convex.

Examples
◮ Lasso, Elastic-Net
◮ Bridge: g(b) = Σ_{j=1}^p |b_j|^c
◮ Group-Lasso
◮ Nuclear norm penalty
◮ Sorted ℓ1 penalty (SLOPE)
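As an illustration (not from the slides), here is a minimal sketch computing β̂ for one choice of g, the Lasso; scikit-learn's Lasso minimizes exactly ‖y − Xb‖²/(2n) + λ‖b‖₁, matching the normalization above. The dimensions and tuning parameter are assumptions for the example.

```python
# Minimal sketch: the regularized least-squares M-estimator with g = Lasso.
# scikit-learn's Lasso objective is ||y - Xb||^2 / (2n) + alpha * ||b||_1.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 300                                # illustrative sizes, p > n
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:10] = 1.0            # sparse ground truth
y = X @ beta + rng.standard_normal(n)

lam = 0.1                                      # illustrative tuning parameter
beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
```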

SLIDE 5

Different goals, different scales

β̂ = arg min_{b ∈ R^p} [ ‖y − Xb‖²/(2n) + g(b) ],   g convex

1. Design of the regularizer g with intuition about complexity, structure:
◮ convex relaxation of unknown structure (sparsity, low rank)
◮ ℓ1 balls are spiky at sparse vectors

2. Upper and lower bounds on the risk of β̂:  c·r_n ≤ ‖β̂ − β‖² ≤ C·r_n.

3. Characterization of the risk:  ‖β̂ − β‖² = r_n(1 + o_P(1))  under some asymptotics, e.g., p/n → γ or s log(p/s)/n → 0.

4. Asymptotic distribution in a fixed direction a_0 ∈ R^p (resp. a_0 = e_j) and confidence interval for a_0^⊤β (resp. β_j):

√n a_0^⊤(β̂ − β) →? N(0, V_0),   √n(β̂_j − β_j) →? N(0, V_j).

SLIDE 6

Focus of today: Confidence interval in the linear model

based on convex regularized estimators of the form

β̂ = arg min_{b ∈ R^p} [ ‖y − Xb‖²/(2n) + g(b) ],   g convex,

√n(β̂_j − β_j) ⇒ N(0, V_j),   β_j the unknown parameter of interest.

SLIDE 7

Confidence interval in the linear model

Design X with iid N(0, Σ) rows, known Σ, noise ε ∼ N(0, σ²I_n), y = Xβ + ε, and a given initial estimator β̂.

Goal: inference for θ = a_0^⊤β, the projection of β in the direction a_0.

Examples:
◮ a_0 = e_j: inference on the j-th coefficient β_j
◮ a_0 = x_new, where x_new collects the characteristics of a new patient: inference for x_new^⊤β.
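A minimal sketch of this setting (the covariance, sizes, and sparsity below are assumptions for the example):

```python
# Minimal sketch of the model: Gaussian design with known covariance Sigma,
# Gaussian noise, and target theta = <a0, beta>.
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 750, 500, 1.0
# an illustrative known covariance: AR(1)-type correlations
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T  # rows iid N(0, Sigma)
beta = np.zeros(p); beta[:10] = 1.0
y = X @ beta + sigma * rng.standard_normal(n)

a0 = np.zeros(p); a0[0] = 1.0                  # direction e_j: target is beta_j
theta = a0 @ beta
```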

SLIDE 8

De-biasing, confidence intervals for the Lasso

SLIDE 9

Confidence interval in the linear model

Design X with iid N(0, Σ) rows, known Σ, noise ε ∼ N(0, σ²I_n), y = Xβ + ε, and a given initial estimator β̂.

Goal: inference for θ = a_0^⊤β, the projection of β in the direction a_0.

Examples:
◮ a_0 = e_j: inference on the j-th coefficient β_j
◮ a_0 = x_new, where x_new collects the characteristics of a new patient: inference for x_new^⊤β.

De-biasing: construct an unbiased estimate in the direction a_0, i.e., find a correction such that [a_0^⊤β̂ − correction] is an unbiased estimator of a_0^⊤β.

SLIDE 10

Existing results

Lasso

◮ Zhang and Zhang (2014) (s log(p/s)/n → 0)
◮ Javanmard and Montanari (2014a); Javanmard and Montanari (2014b); Javanmard and Montanari (2018) (s log(p/s)/n → 0)
◮ Van de Geer et al. (2014) (s log(p/s)/n → 0)
◮ Bayati and Montanari (2012); Miolane and Montanari (2018) (p/n → γ)

Beyond Lasso?

◮ Robust M-estimators: El Karoui et al. (2013); Lei, Bickel, and El Karoui (2018); Donoho and Montanari (2016) (p/n → γ)
◮ Celentano and Montanari (2019): symmetric convex penalty and (Σ = I_p, p/n → γ), using Approximate Message Passing ideas from statistical physics
◮ Logistic regression: Sur and Candès (2018) (Σ = I_p, p/n → γ)

SLIDE 11

Focus today: General theory for confidence intervals

based on any convex regularized estimator of the form

β̂ = arg min_{b ∈ R^p} [ ‖y − Xb‖²/(2n) + g(b) ],   g convex.

Little or no constraint on the convex regularizer g.

SLIDE 12

Degrees-of-freedom of estimator

β̂ = arg min_{b ∈ R^p} [ ‖y − Xb‖²/(2n) + g(b) ]

◮ the map y ↦ Xβ̂ for fixed X is 1-Lipschitz,
◮ so the Jacobian of y ↦ Xβ̂ exists almost everywhere (Rademacher's theorem), and

df̂ = trace ∇(y ↦ Xβ̂),   i.e.,   df̂ = trace[ X ∂β̂(X, y)/∂y ],

used for instance in Stein's Unbiased Risk Estimate (SURE).

The Jacobian matrix Ĥ is also useful; Ĥ is always symmetric¹:

Ĥ = X ∂β̂(X, y)/∂y ∈ R^{n×n}

¹ P.C.B. and C.-H. Zhang (2019), Second order Poincaré inequalities and de-biasing arbitrary convex regularizers when p/n → γ.
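As a sanity check (not from the slides), df̂ can be approximated numerically by finite differences with random probes, a Hutchinson-type trace estimate; for the Lasso it should be close to the number of nonzero coefficients (see the later QQ-plot slide). The sizes and tuning parameter below are assumptions for the example.

```python
# Minimal sketch: df_hat = trace(d(X beta_hat)/dy), estimated by
# finite-difference directional derivatives and Hutchinson's trace trick.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 200, 300
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:10] = 1.0
y = X @ beta + rng.standard_normal(n)

def Xbeta_hat(y_vec):                          # the map y -> X beta_hat(X, y)
    b = Lasso(alpha=0.1, fit_intercept=False,
              tol=1e-10, max_iter=50_000).fit(X, y_vec).coef_
    return X @ b

eps, base = 1e-6, Xbeta_hat(y)
probes = rng.standard_normal((50, n))          # E[u' H u] = trace(H)
df_hat = np.mean([u @ (Xbeta_hat(y + eps * u) - base) / eps for u in probes])
# For the Lasso, df_hat should be close to the number of nonzero coefficients.
```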

SLIDE 13

Isotropic design, any g, p/n → γ (B. and Zhang, 2019)

Assumptions
◮ Sequence of linear regression problems y = Xβ + ε,
◮ with n, p → +∞ and p/n → γ ∈ (0, ∞),
◮ g: R^p → R coercive convex penalty, strongly convex if γ ≥ 1,
◮ rows of X are iid N(0, I_p), and
◮ noise ε ∼ N(0, σ²I_n) is independent of X.

SLIDE 14

Isotropic design, any penalty g, p/n → γ

Theorem (B. and Zhang, 2019)

β̂ = arg min_{b ∈ R^p} [ ‖y − Xb‖²/(2n) + g(b) ]

◮ β_j = ⟨e_j, β⟩ the parameter of interest,
◮ Ĥ = X(∂/∂y)β̂,  df̂ = trace Ĥ,
◮ V̂(β_j) = ‖y − Xβ̂‖² + trace[(Ĥ − I_n)²](β̂_j − β_j)².

Then there exists a subset J_p ⊂ [p] of size at least p − log log p such that

sup_{j ∈ J_p} | P( [ (n − df̂)(β̂_j − β_j) + e_j^⊤X^⊤(y − Xβ̂) ] / V̂(β_j)^{1/2} ≤ t ) − Φ(t) | → 0.
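A minimal sketch of this pivot for the Lasso (not from the slides; sizes and tuning are assumptions). For the Lasso, Ĥ is the orthogonal projection onto the active columns of X, so trace[(Ĥ − I_n)²] = n − df̂, and df̂ is the number of nonzero coefficients.

```python
# Minimal sketch: the de-biased pivot of the theorem, for the Lasso with
# isotropic design (Sigma = I_p), using df_hat = #{j : beta_hat_j != 0}.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p, j = 750, 500, 0
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:10] = 1.0
y = X @ beta + rng.standard_normal(n)

beta_hat = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_
resid = y - X @ beta_hat
df_hat = np.count_nonzero(beta_hat)

# For the Lasso, H_hat is a projection, so trace[(H_hat - I_n)^2] = n - df_hat.
V_hat = resid @ resid + (n - df_hat) * (beta_hat[j] - beta[j]) ** 2
pivot = ((n - df_hat) * (beta_hat[j] - beta[j]) + X[:, j] @ resid) / np.sqrt(V_hat)
p_value = 2 * norm.sf(abs(pivot))    # approx Uniform(0, 1) over repetitions
```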
SLIDE 15

Correlated design, any g, p/n → γ

Assumptions
◮ Sequence of linear regression problems y = Xβ + ε,
◮ with n, p → +∞ and p/n → γ ∈ (0, ∞),
◮ g: R^p → R coercive convex penalty, strongly convex if γ ≥ 1,
◮ rows of X are iid N(0, Σ), and
◮ noise ε ∼ N(0, σ²I_n) is independent of X.

SLIDE 16

Correlated design, any penalty g, p/n → γ

Theorem (B. and Zhang, 2019)

β̂ = arg min_{b ∈ R^p} [ ‖y − Xb‖²/(2n) + g(b) ]

◮ θ = ⟨a_0, β⟩ the parameter of interest,
◮ Ĥ = X(∂/∂y)β̂,  df̂ = trace Ĥ,
◮ V̂(θ) = ‖y − Xβ̂‖² + trace[(Ĥ − I_n)²](⟨a_0, β̂⟩ − θ)²,
◮ assume a_0^⊤Σa_0 = 1 and set z_0 = Σ^{-1}a_0.

Then there exists a subset S ⊂ S^{p−1} with relative volume |S|/|S^{p−1}| ≥ 1 − 2 exp(−p^0.99) such that

sup_{a_0 ∈ Σ^{1/2}S} | P( [ (n − df̂)(⟨β̂, a_0⟩ − θ) + ⟨z_0, y − Xβ̂⟩ ] / V̂(θ)^{1/2} ≤ t ) − Φ(t) | → 0.

This applies to at least p − φ_cond(Σ) log log p indices j ∈ [p].

SLIDE 17

Resulting 0.95 confidence interval

ĈI = { θ ∈ R :  | (n − df̂)(⟨β̂, a_0⟩ − θ) + ⟨z_0, y − Xβ̂⟩ | / V̂(θ)^{1/2} ≤ 1.96 }

Variance approximation

Typically V̂(θ) ≈ ‖y − Xβ̂‖², and the length of the interval is 2 · 1.96 · ‖y − Xβ̂‖/(n − df̂):

ĈI_approx = { θ ∈ R :  | (n − df̂)(⟨β̂, a_0⟩ − θ) + ⟨z_0, y − Xβ̂⟩ | / ‖y − Xβ̂‖ ≤ 1.96 }.
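Solving the inequality in ĈI_approx for θ gives an interval centered at the de-biased estimate ⟨a_0, β̂⟩ + ⟨z_0, y − Xβ̂⟩/(n − df̂). A minimal sketch, assuming Σ is known and df̂ has already been computed:

```python
# Minimal sketch: the approximate 0.95 interval CI_approx above.
import numpy as np

def debiased_ci(X, y, beta_hat, df_hat, a0, Sigma, z=1.96):
    n = len(y)
    resid = y - X @ beta_hat
    z0 = np.linalg.solve(Sigma, a0)            # z0 = Sigma^{-1} a0, Sigma known
    center = a0 @ beta_hat + z0 @ resid / (n - df_hat)
    half_width = z * np.linalg.norm(resid) / (n - df_hat)
    return center - half_width, center + half_width
```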
SLIDE 18

Simulations using the approximation V̂(θ) ≈ ‖y − Xβ̂‖²

n = 750, p = 500, correlated Σ. β is the vectorization of a row-sparse matrix of size 25 × 20. a_0 is a direction that leads to large initial bias.

Estimators: 7 different penalty functions:
◮ Group Lasso with tuning parameters µ1, µ2
◮ Lasso with tuning parameters λ1, ..., λ4
◮ Nuclear norm penalty

Boxplots of initial errors √n a_0^⊤(β̂ − β) (biased!)

SLIDE 19

Simulations using the approximation V̂(θ) ≈ ‖y − Xβ̂‖²

n = 750, p = 500, correlated Σ. β is the vectorization of a row-sparse matrix of size 25 × 20.

Estimators: 7 different penalty functions:
◮ Group Lasso with tuning parameters µ1, µ2
◮ Lasso with tuning parameters λ1, ..., λ4
◮ Nuclear norm penalty

Boxplots of √n[a_0^⊤(β̂ − β) + z_0^⊤(y − Xβ̂)]

SLIDE 20

Before/after bias correction

SLIDE 21

QQ-plot, Lasso, λ1, λ2, λ3, λ4.

For the Lasso, df̂ = |{j = 1, ..., p : β̂_j ≠ 0}|.

Pivotal quantity when using ‖y − Xβ̂‖² instead of V̂(θ) for the variance.
◮ The visible discrepancy in the last plot is fixed when using V̂(θ) instead.
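A numerical analogue of these QQ-plots (not from the slides; sizes and tuning are assumptions): simulate many replications, compute the pivot with the ‖y − Xβ̂‖ normalization, and compare its quantiles to N(0, 1).

```python
# Minimal sketch: Monte Carlo quantiles of the pivot vs. N(0, 1) quantiles,
# the numerical counterpart of a QQ-plot.
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p, j, lam = 300, 200, 0, 0.1
beta = np.zeros(p); beta[:10] = 1.0

pivots = []
for _ in range(200):
    X = rng.standard_normal((n, p))
    y = X @ beta + rng.standard_normal(n)
    bh = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    r = y - X @ bh
    df = np.count_nonzero(bh)
    pivots.append(((n - df) * (bh[j] - beta[j]) + X[:, j] @ r) / np.linalg.norm(r))

osm, osr = stats.probplot(np.array(pivots), dist="norm")[0]
# osm: theoretical N(0, 1) quantiles; osr: ordered pivots; they should align.
```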

SLIDE 22

QQ-plot, Group Lasso, µ1, µ2. Explicit formula for df̂.

SLIDE 23

QQ-plot, Nuclear norm penalty

No explicit formula for df̂ is available, although it is possible to compute numerical approximations.

SLIDE 24

Summary of the main result²

Asymptotic normality result, and valid 1 − α confidence interval by de-biasing any convex regularized M-estimator.
◮ Asymptotics p/n → γ
◮ Gaussian design, known covariance matrix Σ
◮ Strong convexity of the penalty required if γ ≥ 1; otherwise any penalty is allowed.

² P.C.B. and C.-H. Zhang (2019), Second order Poincaré inequalities and de-biasing arbitrary convex regularizers when p/n → γ.

SLIDE 25

Time permitting

1. Necessity of degrees-of-freedom adjustment
2. Central Limit Theorems and Second Order Poincaré inequalities
3. Unknown Σ.
SLIDE 26
1. Necessity of degrees-of-freedom adjustment

The previous de-biasing correction features a "degrees-of-freedom" adjustment in the form of multiplication by (1 − df̂/n) or, depending on the normalization, multiplication by n − df̂.

This generalizes, to high dimensions, the classical normalization by n − p that yields unbiased estimates when p ≪ n.

This degrees-of-freedom adjustment for the Lasso was initially motivated by statistical physics arguments³.

³ Javanmard and Montanari (2014b), Hypothesis Testing in High-Dimensional Regression under the Gaussian Random Design Model: Asymptotic Theory.

SLIDE 27

Initial proposals for de-biasing the Lasso do not include the “degrees-of-freedom” adjustment

SLIDE 28
1. Necessity of degrees-of-freedom adjustment

◮ Sparse linear regression y = Xβ + ε, sparsity s_0 = ‖β‖_0
◮ X has iid N(0, Σ) rows, noise ε ∼ N(0, σ²I_n)

θ̂_ν: de-biased estimate with adjustment of the form (1 − ν/n); here ν represents a possible degrees-of-freedom adjustment or the absence thereof (ν = 0).

Boxplots of √n(θ̂_ν − θ) when the initial estimator is the Lasso.

The pivotal quantity for ν = 0 (unadjusted) is biased (yellow boxplot); the degrees-of-freedom adjustment exactly repairs this. For s_0 ≫ n^{2/3}, the absence of degrees-of-freedom adjustment provably leads to incorrect coverage for certain directions a_0.⁴

⁴ B. and Zhang (2018), De-biasing the Lasso with degrees-of-freedom adjustment.

SLIDE 29
2. Central Limit Theorems / Second Order Poincaré inequalities

If f : R^n → R^n and z_0 ∼ N(0, I_n), then the random variable

z_0^⊤ f(z_0) − div f(z_0)

is close to normal when E‖∇f(z_0)‖_F² / E‖f(z_0)‖² is small⁵.
◮ This leads to the asymptotic normality results when de-biasing convex regularizers.
◮ Mechanically computing/bounding gradients leads to asymptotic normality results (Second Order Poincaré inequalities, see Chatterjee (2009)).

⁵ P.C.B. and C.-H. Zhang (2019), Second order Poincaré inequalities and de-biasing arbitrary convex regularizers when p/n → γ.
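A small numerical illustration (the choice f = tanh applied coordinatewise is an assumption for the example, not from the talk): for coordinatewise f, div f(z) = Σ_i f′(z_i), and by Stein's identity z_0^⊤f(z_0) − div f(z_0) has mean zero; here it is standardized empirically and compared to N(0, 1).

```python
# Minimal sketch: z0' f(z0) - div f(z0) is approximately Gaussian.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, reps = 500, 2000
f = np.tanh
f_prime = lambda z: 1.0 - np.tanh(z) ** 2      # derivative of tanh

Z = rng.standard_normal((reps, n))
raw = np.einsum("ij,ij->i", Z, f(Z)) - f_prime(Z).sum(axis=1)
W = (raw - raw.mean()) / raw.std()             # empirical standardization
print(stats.kstest(W, "norm"))                 # large p-value: close to N(0, 1)
```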

SLIDE 30
3. Unknown Σ

The general theory of de-biasing/asymptotic normality for arbitrary regularizers is applicable to any penalty when Σ is known.

In practice, z_0 = Σ^{-1}a_0 needs to be estimated:
◮ sample splitting,
◮ case-by-case basis for a given regularizer g,
◮ e.g., nodewise Lasso; dense and sparse a_0 have to be handled differently,⁶
◮ leaves open interesting problems for different regularizers.

⁶ B. and Zhang (2018), Section 2.2, De-biasing the Lasso with degrees-of-freedom adjustment.
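For illustration only, a simplified sample-splitting sketch (this is not the paper's nodewise-Lasso construction; the ridge term is an assumption to keep the solve well-posed when p exceeds the holdout size):

```python
# Minimal sketch: estimate z0 = Sigma^{-1} a0 on a held-out half of the rows,
# then run the de-biasing on the remaining half with this z0_hat.
import numpy as np

def estimate_z0(X_holdout, a0, ridge=0.1):
    n2, p = X_holdout.shape
    Sigma_hat = X_holdout.T @ X_holdout / n2 + ridge * np.eye(p)  # regularized
    return np.linalg.solve(Sigma_hat, a0)
```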

SLIDE 31

Thank you!

Asymptotic normality result, and valid 1 − α confidence interval⁷ by de-biasing any convex regularized M-estimator.
◮ Asymptotics p/n → γ
◮ Gaussian design, known covariance matrix Σ
◮ Strong convexity of the penalty required if γ ≥ 1; otherwise any penalty is allowed.

⁷ P.C.B. and C.-H. Zhang (2019), Second order Poincaré inequalities and de-biasing arbitrary convex regularizers when p/n → γ.

SLIDE 32

References I

Bayati, Mohsen, and Andrea Montanari. 2012. "The Lasso Risk for Gaussian Matrices." IEEE Transactions on Information Theory 58 (4): 1997–2017.

Celentano, Michael, and Andrea Montanari. 2019. "Fundamental Barriers to High-Dimensional Regression with Convex Penalties." arXiv preprint arXiv:1903.10603.

Chatterjee, Sourav. 2009. "Fluctuations of Eigenvalues and Second Order Poincaré Inequalities." Probability Theory and Related Fields 143 (1-2): 1–40.

Donoho, David, and Andrea Montanari. 2016. "High Dimensional Robust M-Estimation: Asymptotic Variance via Approximate Message Passing." Probability Theory and Related Fields 166 (3-4): 935–69.

SLIDE 33

References II

El Karoui, Noureddine, Derek Bean, Peter J. Bickel, Chinghway Lim, and Bin Yu. 2013. "On Robust Regression with High-Dimensional Predictors." Proceedings of the National Academy of Sciences 110 (36): 14557–62.

Javanmard, Adel, and Andrea Montanari. 2014a. "Confidence Intervals and Hypothesis Testing for High-Dimensional Regression." The Journal of Machine Learning Research 15 (1): 2869–2909.

———. 2014b. "Hypothesis Testing in High-Dimensional Regression Under the Gaussian Random Design Model: Asymptotic Theory." IEEE Transactions on Information Theory 60 (10): 6522–54.

———. 2018. "Debiasing the Lasso: Optimal Sample Size for Gaussian Designs." The Annals of Statistics 46 (6A): 2593–2622.

SLIDE 34

References III

Lei, Lihua, Peter J. Bickel, and Noureddine El Karoui. 2018. "Asymptotics for High Dimensional Regression M-Estimates: Fixed Design Results." Probability Theory and Related Fields 172 (3-4): 983–1079.

Miolane, Léo, and Andrea Montanari. 2018. "The Distribution of the Lasso: Uniform Control over Sparse Balls and Adaptive Parameter Tuning." arXiv preprint arXiv:1811.01212.

Sur, Pragya, and Emmanuel J. Candès. 2018. "A Modern Maximum-Likelihood Theory for High-Dimensional Logistic Regression." arXiv preprint arXiv:1803.06964.

Van de Geer, Sara, Peter Bühlmann, Ya'acov Ritov, and Ruben Dezeure. 2014. "On Asymptotically Optimal Confidence Regions and Tests for High-Dimensional Models." The Annals of Statistics 42 (3): 1166–1202.

SLIDE 35

References IV

Zhang, Cun-Hui, and Stephanie S. Zhang. 2014. "Confidence Intervals for Low Dimensional Parameters in High Dimensional Linear Models." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (1): 217–42.