

SLIDE 1

De-biasing arbitrary convex regularizers and asymptotic normality

Pierre C Bellec, Rutgers University Mathematical Methods of Modern Statistics 2, June 2020

SLIDE 2

Joint work with Cun-Hui Zhang (Rutgers).

◮ Second order Poincaré inequalities and de-biasing arbitrary convex regularizers, arXiv:1912.11943
◮ De-biasing the Lasso with degrees-of-freedom adjustment, arXiv:1902.08885

SLIDE 3

High-dimensional statistics

◮ n data points (x_i, Y_i), i = 1, ..., n
◮ p covariates, x_i ∈ R^p

Regimes: p ≥ n,  p ≥ cn,  p ≥ n^α.

For instance, the linear model Y_i = x_i^⊤ β + ε_i for unknown β.

SLIDE 4

M-estimators and regularization

β̂ = arg min_{b ∈ R^p} [ (1/n) Σ_{i=1}^n ℓ(x_i^⊤ b, Y_i) + regularizer(b) ]

for some loss ℓ(·, ·) and regularization penalty.

Typically in the linear model, with the least-squares loss,

β̂ = arg min_{b ∈ R^p} [ ‖y − Xb‖²/(2n) + g(b) ]   with g convex.

Examples
◮ Lasso, Elastic-Net
◮ Bridge: g(b) = Σ_{j=1}^p |b_j|^c
◮ Group-Lasso
◮ Nuclear norm penalty
◮ Sorted ℓ1 penalty (SLOPE)
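As an illustration (not from the slides), here is a minimal sketch computing β̂ for one choice of g, the Lasso; scikit-learn's Lasso minimizes exactly ‖y − Xb‖²/(2n) + λ‖b‖₁, matching the normalization above. The dimensions and tuning parameter are assumptions for the example.

```python
# Minimal sketch: the regularized least-squares M-estimator with g = Lasso.
# scikit-learn's Lasso objective is ||y - Xb||^2 / (2n) + alpha * ||b||_1.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 300                                # illustrative sizes, p > n
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:10] = 1.0            # sparse ground truth
y = X @ beta + rng.standard_normal(n)

lam = 0.1                                      # illustrative tuning parameter
beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
```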

SLIDE 5

Different goals, different scales

β̂ = arg min_{b ∈ R^p} [ ‖y − Xb‖²/(2n) + g(b) ],   g convex

1. Design of the regularizer g with intuition about complexity, structure:
◮ convex relaxation of unknown structure (sparsity, low rank)
◮ ℓ1 balls are spiky at sparse vectors

2. Upper and lower bounds on the risk of β̂:  c·r_n ≤ ‖β̂ − β‖² ≤ C·r_n.

3. Characterization of the risk:  ‖β̂ − β‖² = r_n(1 + o_P(1))  under some asymptotics, e.g., p/n → γ or s log(p/s)/n → 0.

4. Asymptotic distribution in a fixed direction a_0 ∈ R^p (resp. a_0 = e_j) and confidence interval for a_0^⊤β (resp. β_j):

√n a_0^⊤(β̂ − β) →? N(0, V_0),   √n(β̂_j − β_j) →? N(0, V_j).

SLIDE 6

Focus of today: Confidence interval in the linear model

based on convex regularized estimators of the form

β̂ = arg min_{b ∈ R^p} [ ‖y − Xb‖²/(2n) + g(b) ],   g convex,

√n(β̂_j − β_j) ⇒ N(0, V_j),   β_j the unknown parameter of interest.

SLIDE 7

Confidence interval in the linear model

Design X with iid N(0, Σ) rows, known Σ, noise ε ∼ N(0, σ²I_n), y = Xβ + ε, and a given initial estimator β̂.

Goal: inference for θ = a_0^⊤β, the projection of β in the direction a_0.

Examples:
◮ a_0 = e_j: inference on the j-th coefficient β_j
◮ a_0 = x_new, where x_new collects the characteristics of a new patient: inference for x_new^⊤β.
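A minimal sketch of this setting (the covariance, sizes, and sparsity below are assumptions for the example):

```python
# Minimal sketch of the model: Gaussian design with known covariance Sigma,
# Gaussian noise, and target theta = <a0, beta>.
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 750, 500, 1.0
# an illustrative known covariance: AR(1)-type correlations
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T  # rows iid N(0, Sigma)
beta = np.zeros(p); beta[:10] = 1.0
y = X @ beta + sigma * rng.standard_normal(n)

a0 = np.zeros(p); a0[0] = 1.0                  # direction e_j: target is beta_j
theta = a0 @ beta
```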

SLIDE 8

De-biasing, confidence intervals for the Lasso

SLIDE 9

Confidence interval in the linear model

Design X with iid N(0, Σ) rows, known Σ, noise ε ∼ N(0, σ²I_n), y = Xβ + ε, and a given initial estimator β̂.

Goal: inference for θ = a_0^⊤β, the projection of β in the direction a_0.

Examples:
◮ a_0 = e_j: inference on the j-th coefficient β_j
◮ a_0 = x_new, where x_new collects the characteristics of a new patient: inference for x_new^⊤β.

De-biasing: construct an unbiased estimate in the direction a_0, i.e., find a correction such that [a_0^⊤β̂ − correction] is an unbiased estimator of a_0^⊤β.

SLIDE 10

Existing results

Lasso

◮ Zhang and Zhang (2014) (s log(p/s)/n → 0)
◮ Javanmard and Montanari (2014a); Javanmard and Montanari (2014b); Javanmard and Montanari (2018) (s log(p/s)/n → 0)
◮ Van de Geer et al. (2014) (s log(p/s)/n → 0)
◮ Bayati and Montanari (2012); Miolane and Montanari (2018) (p/n → γ)

Beyond Lasso?

◮ Robust M-estimators: El Karoui et al. (2013); Lei, Bickel, and El Karoui (2018); Donoho and Montanari (2016) (p/n → γ)
◮ Celentano and Montanari (2019): symmetric convex penalty and (Σ = I_p, p/n → γ), using Approximate Message Passing ideas from statistical physics
◮ Logistic regression: Sur and Candès (2018) (Σ = I_p, p/n → γ)

SLIDE 11

Focus today: General theory for confidence intervals

based on any convex regularized estimator of the form

β̂ = arg min_{b ∈ R^p} [ ‖y − Xb‖²/(2n) + g(b) ],   g convex.

Little or no constraint on the convex regularizer g.

SLIDE 12

Degrees-of-freedom of estimator

β̂ = arg min_{b ∈ R^p} [ ‖y − Xb‖²/(2n) + g(b) ]

◮ the map y ↦ Xβ̂ for fixed X is 1-Lipschitz,
◮ so the Jacobian of y ↦ Xβ̂ exists almost everywhere (Rademacher's theorem), and

df̂ = trace ∇(y ↦ Xβ̂),   i.e.,   df̂ = trace[ X ∂β̂(X, y)/∂y ],

used for instance in Stein's Unbiased Risk Estimate (SURE).

The Jacobian matrix Ĥ is also useful; Ĥ is always symmetric¹:

Ĥ = X ∂β̂(X, y)/∂y ∈ R^{n×n}

¹ P.C.B. and C.-H. Zhang (2019), Second order Poincaré inequalities and de-biasing arbitrary convex regularizers when p/n → γ.
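As a sanity check (not from the slides), df̂ can be approximated numerically by finite differences with random probes, a Hutchinson-type trace estimate; for the Lasso it should be close to the number of nonzero coefficients (see the later QQ-plot slide). The sizes and tuning parameter below are assumptions for the example.

```python
# Minimal sketch: df_hat = trace(d(X beta_hat)/dy), estimated by
# finite-difference directional derivatives and Hutchinson's trace trick.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 200, 300
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:10] = 1.0
y = X @ beta + rng.standard_normal(n)

def Xbeta_hat(y_vec):                          # the map y -> X beta_hat(X, y)
    b = Lasso(alpha=0.1, fit_intercept=False,
              tol=1e-10, max_iter=50_000).fit(X, y_vec).coef_
    return X @ b

eps, base = 1e-6, Xbeta_hat(y)
probes = rng.standard_normal((50, n))          # E[u' H u] = trace(H)
df_hat = np.mean([u @ (Xbeta_hat(y + eps * u) - base) / eps for u in probes])
# For the Lasso, df_hat should be close to the number of nonzero coefficients.
```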

SLIDE 13

Isotropic design, any g, p/n → γ (B. and Zhang, 2019)

Assumptions
◮ Sequence of linear regression problems y = Xβ + ε,
◮ with n, p → +∞ and p/n → γ ∈ (0, ∞),
◮ g: R^p → R coercive convex penalty, strongly convex if γ ≥ 1,
◮ rows of X are iid N(0, I_p), and
◮ noise ε ∼ N(0, σ²I_n) is independent of X.

SLIDE 14

Isotropic design, any penalty g, p/n → γ

Theorem (B. and Zhang, 2019)

β̂ = arg min_{b ∈ R^p} [ ‖y − Xb‖²/(2n) + g(b) ]

◮ β_j = ⟨e_j, β⟩ the parameter of interest,
◮ Ĥ = X(∂/∂y)β̂,  df̂ = trace Ĥ,
◮ V̂(β_j) = ‖y − Xβ̂‖² + trace[(Ĥ − I_n)²](β̂_j − β_j)².

Then there exists a subset J_p ⊂ [p] of size at least p − log log p such that

sup_{j ∈ J_p} | P( [ (n − df̂)(β̂_j − β_j) + e_j^⊤X^⊤(y − Xβ̂) ] / V̂(β_j)^{1/2} ≤ t ) − Φ(t) | → 0.
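A minimal sketch of this pivot for the Lasso (not from the slides; sizes and tuning are assumptions). For the Lasso, Ĥ is the orthogonal projection onto the active columns of X, so trace[(Ĥ − I_n)²] = n − df̂, and df̂ is the number of nonzero coefficients.

```python
# Minimal sketch: the de-biased pivot of the theorem, for the Lasso with
# isotropic design (Sigma = I_p), using df_hat = #{j : beta_hat_j != 0}.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p, j = 750, 500, 0
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:10] = 1.0
y = X @ beta + rng.standard_normal(n)

beta_hat = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_
resid = y - X @ beta_hat
df_hat = np.count_nonzero(beta_hat)

# For the Lasso, H_hat is a projection, so trace[(H_hat - I_n)^2] = n - df_hat.
V_hat = resid @ resid + (n - df_hat) * (beta_hat[j] - beta[j]) ** 2
pivot = ((n - df_hat) * (beta_hat[j] - beta[j]) + X[:, j] @ resid) / np.sqrt(V_hat)
p_value = 2 * norm.sf(abs(pivot))    # approx Uniform(0, 1) over repetitions
```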
SLIDE 15

Correlated design, any g, p/n → γ

Assumptions
◮ Sequence of linear regression problems y = Xβ + ε,
◮ with n, p → +∞ and p/n → γ ∈ (0, ∞),
◮ g: R^p → R coercive convex penalty, strongly convex if γ ≥ 1,
◮ rows of X are iid N(0, Σ), and
◮ noise ε ∼ N(0, σ²I_n) is independent of X.

SLIDE 16

Correlated design, any penalty g, p/n → γ

Theorem (B. and Zhang, 2019)

β̂ = arg min_{b ∈ R^p} [ ‖y − Xb‖²/(2n) + g(b) ]

◮ θ = ⟨a_0, β⟩ the parameter of interest,
◮ Ĥ = X(∂/∂y)β̂,  df̂ = trace Ĥ,
◮ V̂(θ) = ‖y − Xβ̂‖² + trace[(Ĥ − I_n)²](⟨a_0, β̂⟩ − θ)²,
◮ assume a_0^⊤Σa_0 = 1 and set z_0 = Σ^{-1}a_0.

Then there exists a subset S ⊂ S^{p−1} with relative volume |S|/|S^{p−1}| ≥ 1 − 2 exp(−p^0.99) such that

sup_{a_0 ∈ Σ^{1/2}S} | P( [ (n − df̂)(⟨β̂, a_0⟩ − θ) + ⟨z_0, y − Xβ̂⟩ ] / V̂(θ)^{1/2} ≤ t ) − Φ(t) | → 0.

This applies to at least p − φ_cond(Σ) log log p indices j ∈ [p].

SLIDE 17

Resulting 0.95 confidence interval

ĈI = { θ ∈ R :  | (n − df̂)(⟨β̂, a_0⟩ − θ) + ⟨z_0, y − Xβ̂⟩ | / V̂(θ)^{1/2} ≤ 1.96 }

Variance approximation

Typically V̂(θ) ≈ ‖y − Xβ̂‖², and the length of the interval is 2 · 1.96 · ‖y − Xβ̂‖/(n − df̂):

ĈI_approx = { θ ∈ R :  | (n − df̂)(⟨β̂, a_0⟩ − θ) + ⟨z_0, y − Xβ̂⟩ | / ‖y − Xβ̂‖ ≤ 1.96 }.
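Solving the inequality in ĈI_approx for θ gives an interval centered at the de-biased estimate ⟨a_0, β̂⟩ + ⟨z_0, y − Xβ̂⟩/(n − df̂). A minimal sketch, assuming Σ is known and df̂ has already been computed:

```python
# Minimal sketch: the approximate 0.95 interval CI_approx above.
import numpy as np

def debiased_ci(X, y, beta_hat, df_hat, a0, Sigma, z=1.96):
    n = len(y)
    resid = y - X @ beta_hat
    z0 = np.linalg.solve(Sigma, a0)            # z0 = Sigma^{-1} a0, Sigma known
    center = a0 @ beta_hat + z0 @ resid / (n - df_hat)
    half_width = z * np.linalg.norm(resid) / (n - df_hat)
    return center - half_width, center + half_width
```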
SLIDE 18

Simulations using the approximation V̂(θ) ≈ ‖y − Xβ̂‖²

n = 750, p = 500, correlated Σ. β is the vectorization of a row-sparse matrix of size 25 × 20. a_0 is a direction that leads to large initial bias.

Estimators: 7 different penalty functions:
◮ Group Lasso with tuning parameters µ1, µ2
◮ Lasso with tuning parameters λ1, ..., λ4
◮ Nuclear norm penalty

Boxplots of initial errors √n a_0^⊤(β̂ − β) (biased!)

SLIDE 19

Simulations using the approximation V̂(θ) ≈ ‖y − Xβ̂‖²

n = 750, p = 500, correlated Σ. β is the vectorization of a row-sparse matrix of size 25 × 20.

Estimators: 7 different penalty functions:
◮ Group Lasso with tuning parameters µ1, µ2
◮ Lasso with tuning parameters λ1, ..., λ4
◮ Nuclear norm penalty

Boxplots of √n[a_0^⊤(β̂ − β) + z_0^⊤(y − Xβ̂)]

SLIDE 20

Before/after bias correction

SLIDE 21

QQ-plot, Lasso, λ1, λ2, λ3, λ4.

For the Lasso, df̂ = |{j = 1, ..., p : β̂_j ≠ 0}|.

Pivotal quantity when using ‖y − Xβ̂‖² instead of V̂(θ) for the variance.
◮ The visible discrepancy in the last plot is fixed when using V̂(θ) instead.
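A numerical analogue of these QQ-plots (not from the slides; sizes and tuning are assumptions): simulate many replications, compute the pivot with the ‖y − Xβ̂‖ normalization, and compare its quantiles to N(0, 1).

```python
# Minimal sketch: Monte Carlo quantiles of the pivot vs. N(0, 1) quantiles,
# the numerical counterpart of a QQ-plot.
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p, j, lam = 300, 200, 0, 0.1
beta = np.zeros(p); beta[:10] = 1.0

pivots = []
for _ in range(200):
    X = rng.standard_normal((n, p))
    y = X @ beta + rng.standard_normal(n)
    bh = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    r = y - X @ bh
    df = np.count_nonzero(bh)
    pivots.append(((n - df) * (bh[j] - beta[j]) + X[:, j] @ r) / np.linalg.norm(r))

osm, osr = stats.probplot(np.array(pivots), dist="norm")[0]
# osm: theoretical N(0, 1) quantiles; osr: ordered pivots; they should align.
```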

SLIDE 22

QQ-plot, Group Lasso, µ1, µ2. Explicit formula for df̂.

SLIDE 23

QQ-plot, Nuclear norm penalty

No explicit formula for df̂ is available, although it is possible to compute numerical approximations.

SLIDE 24

Summary of the main result²

Asymptotic normality result, and valid 1 − α confidence interval by de-biasing any convex regularized M-estimator.
◮ Asymptotics p/n → γ
◮ Gaussian design, known covariance matrix Σ
◮ Strong convexity of the penalty required if γ ≥ 1; otherwise any penalty is allowed.

² P.C.B. and C.-H. Zhang (2019), Second order Poincaré inequalities and de-biasing arbitrary convex regularizers when p/n → γ.

SLIDE 25

Time permitting

1. Necessity of degrees-of-freedom adjustment
2. Central Limit Theorems and Second Order Poincaré inequalities
3. Unknown Σ.
SLIDE 26
1. Necessity of degrees-of-freedom adjustment

The previous de-biasing correction features a "degrees-of-freedom" adjustment in the form of multiplication by (1 − df̂/n) or, depending on the normalization, multiplication by n − df̂.

This generalizes, to high dimensions, the classical normalization by n − p that yields unbiased estimates when p ≪ n.

This degrees-of-freedom adjustment for the Lasso was initially motivated by statistical physics arguments³.

³ Javanmard and Montanari (2014b), Hypothesis Testing in High-Dimensional Regression under the Gaussian Random Design Model: Asymptotic Theory.

SLIDE 27

Initial proposals for de-biasing the Lasso do not include the “degrees-of-freedom” adjustment

SLIDE 28
1. Necessity of degrees-of-freedom adjustment

◮ Sparse linear regression y = Xβ + ε, sparsity s_0 = ‖β‖_0
◮ X has iid N(0, Σ) rows, noise ε ∼ N(0, σ²I_n)

θ̂_ν: de-biased estimate with adjustment of the form (1 − ν/n); here ν represents a possible degrees-of-freedom adjustment or the absence thereof (ν = 0).

Boxplots of √n(θ̂_ν − θ) when the initial estimator is the Lasso.

The pivotal quantity for ν = 0 (unadjusted) is biased (yellow boxplot); the degrees-of-freedom adjustment exactly repairs this. For s_0 ≫ n^{2/3}, the absence of degrees-of-freedom adjustment provably leads to incorrect coverage for certain directions a_0.⁴

⁴ B. and Zhang (2018), De-biasing the Lasso with degrees-of-freedom adjustment.

SLIDE 29
2. Central Limit Theorems / Second Order Poincaré inequalities

If f : R^n → R^n and z_0 ∼ N(0, I_n), then the random variable

z_0^⊤ f(z_0) − div f(z_0)

is close to normal when E‖∇f(z_0)‖_F² / E‖f(z_0)‖² is small⁵.
◮ This leads to the asymptotic normality results when de-biasing convex regularizers.
◮ Mechanically computing/bounding gradients leads to asymptotic normality results (Second Order Poincaré inequalities, see Chatterjee (2009)).

⁵ P.C.B. and C.-H. Zhang (2019), Second order Poincaré inequalities and de-biasing arbitrary convex regularizers when p/n → γ.
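A small numerical illustration (the choice f = tanh applied coordinatewise is an assumption for the example, not from the talk): for coordinatewise f, div f(z) = Σ_i f′(z_i), and by Stein's identity z_0^⊤f(z_0) − div f(z_0) has mean zero; here it is standardized empirically and compared to N(0, 1).

```python
# Minimal sketch: z0' f(z0) - div f(z0) is approximately Gaussian.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, reps = 500, 2000
f = np.tanh
f_prime = lambda z: 1.0 - np.tanh(z) ** 2      # derivative of tanh

Z = rng.standard_normal((reps, n))
raw = np.einsum("ij,ij->i", Z, f(Z)) - f_prime(Z).sum(axis=1)
W = (raw - raw.mean()) / raw.std()             # empirical standardization
print(stats.kstest(W, "norm"))                 # large p-value: close to N(0, 1)
```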

SLIDE 30
3. Unknown Σ

The general theory of de-biasing/asymptotic normality for arbitrary regularizers is applicable to any penalty when Σ is known.

In practice, z_0 = Σ^{-1}a_0 needs to be estimated:
◮ sample splitting,
◮ case-by-case basis for a given regularizer g,
◮ e.g., nodewise Lasso; dense and sparse a_0 have to be handled differently,⁶
◮ leaves open interesting problems for different regularizers.

⁶ B. and Zhang (2018), Section 2.2, De-biasing the Lasso with degrees-of-freedom adjustment.
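For illustration only, a simplified sample-splitting sketch (this is not the paper's nodewise-Lasso construction; the ridge term is an assumption to keep the solve well-posed when p exceeds the holdout size):

```python
# Minimal sketch: estimate z0 = Sigma^{-1} a0 on a held-out half of the rows,
# then run the de-biasing on the remaining half with this z0_hat.
import numpy as np

def estimate_z0(X_holdout, a0, ridge=0.1):
    n2, p = X_holdout.shape
    Sigma_hat = X_holdout.T @ X_holdout / n2 + ridge * np.eye(p)  # regularized
    return np.linalg.solve(Sigma_hat, a0)
```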

SLIDE 31

Thank you!

Asymptotic normality result, and valid 1 − α confidence interval⁷ by de-biasing any convex regularized M-estimator.
◮ Asymptotics p/n → γ
◮ Gaussian design, known covariance matrix Σ
◮ Strong convexity of the penalty required if γ ≥ 1; otherwise any penalty is allowed.

⁷ P.C.B. and C.-H. Zhang (2019), Second order Poincaré inequalities and de-biasing arbitrary convex regularizers when p/n → γ.

SLIDE 32

References I

Bayati, Mohsen, and Andrea Montanari. 2012. "The Lasso Risk for Gaussian Matrices." IEEE Transactions on Information Theory 58 (4): 1997–2017.

Celentano, Michael, and Andrea Montanari. 2019. "Fundamental Barriers to High-Dimensional Regression with Convex Penalties." arXiv preprint arXiv:1903.10603.

Chatterjee, Sourav. 2009. "Fluctuations of Eigenvalues and Second Order Poincaré Inequalities." Probability Theory and Related Fields 143 (1-2): 1–40.

Donoho, David, and Andrea Montanari. 2016. "High Dimensional Robust M-Estimation: Asymptotic Variance via Approximate Message Passing." Probability Theory and Related Fields 166 (3-4): 935–69.

SLIDE 33

References II

El Karoui, Noureddine, Derek Bean, Peter J. Bickel, Chinghway Lim, and Bin Yu. 2013. "On Robust Regression with High-Dimensional Predictors." Proceedings of the National Academy of Sciences 110 (36): 14557–62.

Javanmard, Adel, and Andrea Montanari. 2014a. "Confidence Intervals and Hypothesis Testing for High-Dimensional Regression." The Journal of Machine Learning Research 15 (1): 2869–2909.

———. 2014b. "Hypothesis Testing in High-Dimensional Regression Under the Gaussian Random Design Model: Asymptotic Theory." IEEE Transactions on Information Theory 60 (10): 6522–54.

———. 2018. "Debiasing the Lasso: Optimal Sample Size for Gaussian Designs." The Annals of Statistics 46 (6A): 2593–2622.

SLIDE 34

References III

Lei, Lihua, Peter J. Bickel, and Noureddine El Karoui. 2018. "Asymptotics for High Dimensional Regression M-Estimates: Fixed Design Results." Probability Theory and Related Fields 172 (3-4): 983–1079.

Miolane, Léo, and Andrea Montanari. 2018. "The Distribution of the Lasso: Uniform Control over Sparse Balls and Adaptive Parameter Tuning." arXiv preprint arXiv:1811.01212.

Sur, Pragya, and Emmanuel J. Candès. 2018. "A Modern Maximum-Likelihood Theory for High-Dimensional Logistic Regression." arXiv preprint arXiv:1803.06964.

Van de Geer, Sara, Peter Bühlmann, Ya'acov Ritov, and Ruben Dezeure. 2014. "On Asymptotically Optimal Confidence Regions and Tests for High-Dimensional Models." The Annals of Statistics 42 (3): 1166–1202.

SLIDE 35

References IV

Zhang, Cun-Hui, and Stephanie S. Zhang. 2014. "Confidence Intervals for Low Dimensional Parameters in High Dimensional Linear Models." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (1): 217–42.