SLIDE 1

The Strong Screening Rule for SLOPE

Statistical Learning Seminar

Johan Larsson¹, Małgorzata Bogdan¹,², Jonas Wallin¹

¹Department of Statistics, Lund University; ²Department of Mathematics, University of Wroclaw

May 8, 2020

SLIDE 2

Recap: SLOPE

The SLOPE (Bogdan et al. 2015) estimate is

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \bigl\{ g(\beta) + J(\beta; \lambda) \bigr\},$$

where $J(\beta; \lambda) = \sum_{i=1}^p \lambda_i |\beta|_{(i)}$ is the sorted $\ell_1$ norm, with

$$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0, \qquad |\beta|_{(1)} \geq |\beta|_{(2)} \geq \cdots \geq |\beta|_{(p)}.$$

Here we are interested in fitting a path of regularization penalties $\lambda^{(1)}, \lambda^{(2)}, \ldots, \lambda^{(m)}$. We let $\hat{\beta}(\lambda^{(i)})$ denote the solution to SLOPE at the $i$th step on the path.

[Figure: level curves of $h(\beta; y, X)$ and the SLOPE estimate $\hat{\beta}$ in the $(\beta_1, \beta_2)$ plane.]
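To make the penalty concrete, here is a minimal NumPy sketch (my own, not from the slides) of the sorted $\ell_1$ norm $J(\beta; \lambda)$; the names are illustrative.

```python
import numpy as np

def sorted_l1_norm(beta, lam):
    """Sorted l1 norm J(beta; lam) = sum_i lam_i * |beta|_(i),
    where lam is non-increasing and non-negative and |beta|_(i)
    is the i-th largest absolute coefficient."""
    abs_sorted = np.sort(np.abs(beta))[::-1]  # |beta|_(1) >= ... >= |beta|_(p)
    return float(np.dot(lam, abs_sorted))

beta = np.array([0.5, -2.0, 1.0])
lam = np.array([3.0, 2.0, 1.0])
print(sorted_l1_norm(beta, lam))  # 3*2.0 + 2*1.0 + 1*0.5 = 8.5
```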

SLIDE 3–5

Predictor screening rules

Motivation: many of the solutions $\hat{\beta}$ along the regularization path will be sparse, which means some predictors (columns of $X$) will be inactive, especially if $p \gg n$.

Basic idea: what if we could, based on a relatively cheap test, determine which predictors will be inactive before fitting the model? It turns out we can!

• Safe rules certify that discarded predictors are not in the model.
• Heuristic rules may incorrectly discard some predictors, which means the problem must sometimes be solved several times (in practice never more than twice).

SLIDE 6

Motivation for lasso strong rule

Assume we are solving the lasso, i.e. minimizing $g(\beta) + h(\beta)$ with

$$h(\beta) := \lambda \sum_{i=1}^p |\beta_i|.$$

The KKT stationarity condition is $0 \in \nabla g(\hat{\beta}) + \partial h(\hat{\beta})$, where $\partial h(\hat{\beta})$ is the subdifferential of the $\ell_1$ norm, with elements

$$\partial h(\hat{\beta})_i = \begin{cases} \{\operatorname{sign}(\hat{\beta}_i)\lambda\} & \hat{\beta}_i \neq 0, \\ [-\lambda, \lambda] & \hat{\beta}_i = 0, \end{cases}$$

which means that $|\nabla g(\hat{\beta})_i| < \lambda \implies \hat{\beta}_i = 0$.
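To make the discard test concrete, here is a minimal NumPy sketch (my own, not from the slides), assuming $g(\beta) = \frac{1}{2n}\lVert y - X\beta \rVert_2^2$ so that $\nabla g(\beta) = X^\top(X\beta - y)/n$:

```python
import numpy as np

def lasso_inactive_set(X, y, beta_hat, lam):
    """Indices i with |grad g(beta_hat)_i| < lam; by KKT stationarity
    these coefficients must be zero in the lasso solution."""
    n = X.shape[0]
    grad = X.T @ (X @ beta_hat - y) / n  # gradient of the least-squares loss
    return np.flatnonzero(np.abs(grad) < lam)
```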

SLIDE 7

Gradient estimate

Assume that we are fitting a regularization path and have $\hat{\beta}(\lambda^{(k-1)})$, the solution for $\lambda^{(k-1)}$, and want to discard predictors for the problem at $\lambda^{(k)}$. Basic idea: replace $\nabla g(\hat{\beta})$ with an estimate and apply the KKT stationarity criterion, discarding predictors that are estimated to be zero. What estimate should we use?

SLIDE 8

The unit slope bound

A simple (and conservative) estimate turns out to be $\lambda^{(k-1)} - \lambda^{(k)}$, i.e. we assume that the gradient is a piecewise-linear function of $\lambda$ with slope bounded by 1.

[Figure: $|\nabla g(\hat{\beta})|$ as a function of $\lambda$, with the unit slope bound $\lambda^{(k-1)} - \lambda^{(k)}$ drawn between $\lambda^{(k)}$ and $\lambda^{(k-1)}$.]
SLIDE 9

The strong rule for the lasso

Discard the $j$th predictor if

$$\underbrace{\bigl|\nabla g\bigl(\hat{\beta}(\lambda^{(k-1)})\bigr)_j\bigr| + \lambda^{(k-1)} - \lambda^{(k)}}_{\text{gradient prediction for } k} < \lambda^{(k)} \iff \bigl|\nabla g\bigl(\hat{\beta}(\lambda^{(k-1)})\bigr)_j\bigr| < 2\lambda^{(k)} - \lambda^{(k-1)},$$

where the prediction adds the unit slope bound $\lambda^{(k-1)} - \lambda^{(k)}$ to the previous gradient.

Empirical results show that the strong rule leads to remarkable performance improvements in the $p \gg n$ regime (and no penalty otherwise) (Tibshirani et al. 2012).
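As a hedged illustration (my own sketch, not the authors' code), the screen is a one-liner in NumPy:

```python
import numpy as np

def strong_rule_keep(grad_prev, lam_prev, lam_new):
    """Strong rule for the lasso: keep predictor j unless
    |grad g(beta_hat(lam_prev))_j| < 2*lam_new - lam_prev."""
    return np.abs(grad_prev) >= 2 * lam_new - lam_prev

# Hypothetical gradient at the previous solution on the path:
grad_prev = np.array([0.9, 0.2, -0.5])
keep = strong_rule_keep(grad_prev, lam_prev=1.0, lam_new=0.8)
print(np.flatnonzero(keep))  # [0]: only predictor 0 survives the screen
```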

SLIDE 10

Strong rule for lasso in action

[Figure: $|\nabla g(\hat{\beta})|$ versus $\lambda$, with the strong bound drawn between $\lambda^{(k)}$ and $\lambda^{(k-1)}$.]

SLIDE 11

Strong rule for SLOPE

Exactly the same idea as for the lasso strong rule. The subdifferential for SLOPE is the set of all $g \in \mathbb{R}^p$ such that

$$g_{\mathcal{A}_i} \in \begin{cases} \bigl\{ s \in \mathbb{R}^{\operatorname{card} \mathcal{A}_i} : \operatorname{cumsum}\bigl(|s|_\downarrow - \lambda_{R(s)_{\mathcal{A}_i}}\bigr) \preceq 0 \bigr\} & \text{if } \beta_{\mathcal{A}_i} = 0, \\ \bigl\{ s : \operatorname{cumsum}\bigl(|s|_\downarrow - \lambda_{R(s)_{\mathcal{A}_i}}\bigr) \preceq 0 \ \wedge \ \sum_{j \in \mathcal{A}_i} \bigl(|s_j| - \lambda_{R(s)_j}\bigr) = 0 \bigr\} & \text{otherwise.} \end{cases}$$

Here $\mathcal{A}_i$ is a cluster containing the indices of coefficients equal in absolute value, $R(x)$ is an operator that returns the ranks of the elements of $|x|$, and $|x|_\downarrow$ returns the absolute values of $x$ sorted in non-increasing order.
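To make the cluster condition concrete, here is a small NumPy sketch (my own, not from the slides) checking the first part, $\operatorname{cumsum}(|s|_\downarrow - \lambda_{R(s)}) \preceq 0$, for a single cluster; the names are illustrative.

```python
import numpy as np

def cluster_condition(s, lam):
    """Check cumsum(|s|_down - lam) <= 0 elementwise for one cluster,
    where lam holds the penalties for the ranks occupied by |s|."""
    s_sorted = np.sort(np.abs(s))[::-1]  # |s| in non-increasing order
    return bool(np.all(np.cumsum(s_sorted - lam) <= 0))

print(cluster_condition(np.array([0.4, -0.3]), np.array([0.5, 0.4])))  # True
```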

SLIDE 12

Strong rule algorithm for SLOPE

Require: $c \in \mathbb{R}^p$, $\lambda \in \mathbb{R}^p$, where $\lambda_1 \geq \cdots \geq \lambda_p \geq 0$.

1: $S, B \leftarrow \emptyset$
2: for $i \leftarrow 1, \ldots, p$ do
3:   $B \leftarrow B \cup \{i\}$
4:   if $\sum_{j \in B} (c_j - \lambda_j) \geq 0$ then
5:     $S \leftarrow S \cup B$
6:     $B \leftarrow \emptyset$
7:   end if
8: end for
9: return $S$

Set $c := \bigl|\nabla g(\hat{\beta}) + \lambda^{(k-1)} - \lambda^{(k)}\bigr|_\downarrow$ and $\lambda := \lambda^{(k)}$, and run the algorithm above; the result is the predicted support for $\hat{\beta}(\lambda^{(k)})$ (subject to a permutation).
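A direct transcription of the algorithm above in NumPy (my own sketch; the slide gives only pseudocode):

```python
import numpy as np

def slope_strong_rule_set(c, lam):
    """Predicted support for SLOPE. Both c and lam are non-increasing,
    non-negative vectors of equal length (c pre-sorted as |.|_down)."""
    S, B = [], []
    total = 0.0                  # running value of sum_{j in B} (c_j - lam_j)
    for i in range(len(c)):
        B.append(i)
        total += c[i] - lam[i]
        if total >= 0:           # flush the buffer B into the screened set S
            S.extend(B)
            B = []
            total = 0.0
    return np.array(S, dtype=int)

c = np.array([0.9, 0.5, 0.3])    # sorted gradient estimate |.|_down
lam = np.array([0.8, 0.6, 0.2])  # lambda^(k)
print(slope_strong_rule_set(c, lam))  # [0 1 2]
```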

SLIDE 13

Efficiency for simulated data

[Figure: number of screened and active predictors versus $\sigma / \max(\sigma)$, in panels for several values of the pairwise correlation $\rho$.]

Figure 1: Gaussian design, $X \in \mathbb{R}^{200 \times 5000}$, predictors pairwise correlated with correlation $\rho$. There were no violations of the strong rule here.

SLIDE 14

Efficiency for real data

[Figure: fraction of screened and active predictors versus penalty index for OLS and logistic models on the arcene, dorothea, gisette, and golub data sets.]

Figure 2: Efficiency for real data sets. The dimensions of the predictor matrices are 100 × 9920 (arcene), 800 × 88119 (dorothea), 6000 × 4955 (gisette), and 38 × 7129 (golub).

SLIDE 15

Violations

Violations may occur if the unit slope bound fails, which can happen if the ordering permutation of the absolute gradient changes, or if any predictor becomes active between $\lambda^{(k-1)}$ and $\lambda^{(k)}$. Thankfully, such violations turn out to be rare.

[Figure: fraction of fits with violations versus $\sigma / \max(\sigma)$, in panels for $p \in \{20, 50, 100, 500, 1000\}$.]

Figure 3: Violations for sorted $\ell_1$-regularized least-squares regression with predictors pairwise correlated with $\rho = 0.5$, $X \in \mathbb{R}^{100 \times p}$.

SLIDE 16

Performance

[Figure: wall-clock time (s) with and without screening, versus $\rho$, in panels for OLS, logistic, Poisson, and multinomial models.]

Figure 4: Performance benchmarks for various generalized linear models with $X \in \mathbb{R}^{200 \times 20000}$. Predictors are autocorrelated through an AR(1) process with correlation $\rho$.

SLIDE 17–18

Algorithms

The original strong rule paper (Tibshirani et al. 2012) presents two strategies for using the screening rule. For SLOPE, we have two slightly modified versions of these algorithms (a sketch of the first loop follows the list below).

Strong set algorithm: initialize $E$ with the strong rule set.
1. Fit SLOPE to the predictors in $E$.
2. Check the KKT criteria against $E^C$; if there are any failures, add the predictors that fail the check to $E$ and go back to step 1.

Previous set algorithm: initialize $E$ with the ever-active predictors.
1. Fit SLOPE to the predictors in $E$.
2. Check the KKT criteria against the predictors in the strong set.
   • If there are any failures, include these predictors in $E$ and go back to step 1.
   • If there are no failures, check the KKT criteria against the remaining predictors; if there are any failures, add these to $E$ and go back to step 1.
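A schematic of the strong set loop (my own sketch; `fit_slope` and `kkt_failures` are hypothetical stand-ins for a SLOPE solver and a KKT check, not part of the authors' code):

```python
import numpy as np

def strong_set_strategy(X, y, strong_set, fit_slope, kkt_failures):
    """Schematic working-set loop: fit on the screened set E, then
    verify the KKT criteria on its complement; grow E until clean."""
    p = X.shape[1]
    E = set(strong_set)
    while True:
        beta = fit_slope(X[:, sorted(E)], y)       # solve restricted problem
        outside = sorted(set(range(p)) - E)
        failures = kkt_failures(X, y, beta, sorted(E), outside)
        if not failures:                           # no violations: done
            return beta, sorted(E)
        E.update(failures)                         # augment E and refit
```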

SLIDE 19

Comparing algorithms

The strong set strategy is marginally better for low to medium correlation; the previous set strategy starts to become useful for high correlation.

[Figure: wall-clock time (s) versus $\rho$ for the strong and previous set strategies.]

Figure 5: Performance of the strong and previous set strategies for OLS problems with varying correlation between predictors.

SLIDE 20

Limitations

• The unit slope bound is generally very conservative.
• It does not use second-order structure in any way.
• Current methods for solving SLOPE (FISTA, ADMM) do not make as good use of screening rules as coordinate descent does (for the lasso).

SLIDE 21

The SLOPE package for R

The strong screening rule for SLOPE has been implemented in the R package SLOPE (https://CRAN.R-project.org/package=SLOPE). Features include

• OLS, logistic, Poisson, and multinomial models
• support for sparse and dense predictors
• cross-validation
• an efficient codebase in C++

We also have a Google Summer of Code student involved in implementing a proximal Newton solver for SLOPE this summer.

SLIDE 22

References

Małgorzata Bogdan et al. "SLOPE – Adaptive Variable Selection via Convex Optimization". In: The Annals of Applied Statistics 9.3 (2015), pp. 1103–1140. ISSN: 1932-6157. DOI: 10.1214/15-AOAS842.

Robert Tibshirani et al. "Strong Rules for Discarding Predictors in Lasso-Type Problems". In: Journal of the Royal Statistical Society, Series B: Statistical Methodology 74.2 (Mar. 2012), pp. 245–266. ISSN: 1369-7412. DOI: 10/c4bb85.