

SLIDE 1

Sparse regression

DS-GA 1013 / MATH-GA 2824 Mathematical Tools for Data Science

https://cims.nyu.edu/~cfgranda/pages/MTDS_spring20/index.html

Carlos Fernandez-Granda

SLIDE 2

Sparse regression

Linear regression is challenging when the number of features p is large

Solution: Select a subset of features I ⊂ {1, . . . , p} such that

$$y \approx \sum_{i \in I} \beta[i]\, x[i]$$

Equivalently, find a sparse coefficient vector β ∈ Rᵖ such that y ≈ ⟨x, β⟩

Problem: How to promote sparsity?
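To see why explicit subset selection is hard, note that with p features there are 2ᵖ candidate subsets. The following minimal sketch (synthetic data; all names are illustrative, and the n × p design convention is the transpose of the X used later in these slides) brute-forces all subsets of a fixed size, which is only feasible for tiny p:

```python
# Brute-force subset selection on synthetic data: a sketch of why this
# approach does not scale. With p features there are 2^p candidate subsets;
# here we only search subsets of a fixed size k.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 10, 2
X = rng.standard_normal((n, p))             # n x p design matrix
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]                 # only features 0 and 1 are active
y = X @ beta_true + 0.1 * rng.standard_normal(n)

def fit_error(cols):
    """Squared error of the least-squares fit restricted to these features."""
    coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    return np.sum((y - X[:, cols] @ coef) ** 2)

best = min(itertools.combinations(range(p), k), key=lambda s: fit_error(list(s)))
print(best)  # typically (0, 1), the truly active features
```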

SLIDE 3

Toy problem

Find t such that

$$v_t := \begin{bmatrix} t \\ t - 1 \\ t - 1 \end{bmatrix}$$

is sparse. Equivalently, find arg minₜ ‖v_t‖₀

SLIDE 4

ℓ0 “norm”

The number of nonzero entries of a vector

Not a norm! It is not absolutely homogeneous: for any nonzero x,

$$\|2x\|_0 = \|x\|_0 \neq 2\,\|x\|_0$$
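A quick numeric check of the failed homogeneity (a minimal NumPy sketch):

```python
# The "l0 norm" counts nonzero entries; scaling a vector does not change it,
# so absolute homogeneity (||2x|| = 2||x||) fails.
import numpy as np

def l0(x):
    return np.count_nonzero(x)

x = np.array([1.0, 0.0, -2.0])
print(l0(2 * x), l0(x), 2 * l0(x))  # 2 2 4: ||2x||_0 = ||x||_0 != 2||x||_0
```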

SLIDE 5

Toy problem

[Figure: ‖v_t‖₀ as a function of t]

SLIDE 6

Alternative strategy

Minimize another norm: f(t) := ‖v_t‖

SLIDE 7

Toy problem

[Figure: ‖v_t‖₀, ‖v_t‖₁, ‖v_t‖₂, and ‖v_t‖∞ as functions of t]
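The figure can be reproduced numerically. This sketch evaluates each norm of v_t on a grid and prints its minimizer; the ℓ1 norm, like the ℓ0 “norm”, is minimized at the sparse point t = 1, while the ℓ2 and ℓ∞ norms are not:

```python
# Evaluate ||v_t|| for v_t = (t, t-1, t-1) on a grid under several norms
# and report the minimizer of each.
import numpy as np

t = np.linspace(-0.4, 1.4, 1801)
v = np.stack([t, t - 1, t - 1])                     # each column is one v_t

norms = {
    "l0":   np.count_nonzero(np.abs(v) > 1e-9, axis=0),
    "l1":   np.sum(np.abs(v), axis=0),
    "l2":   np.sqrt(np.sum(v ** 2, axis=0)),
    "linf": np.max(np.abs(v), axis=0),
}
for name, values in norms.items():
    print(name, round(t[np.argmin(values)], 3))
# l0 -> 1.0, l1 -> 1.0, l2 -> 0.667, linf -> 0.5
```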

SLIDE 8

◮ The lasso
◮ Convexity
◮ Subgradients
◮ Analysis of the lasso estimator for a simple example

SLIDE 9

Sparse linear regression

Find a small subset of useful features: a model selection problem

Two objectives:

◮ Good fit to the data: ‖X^T β − y‖₂² should be as small as possible
◮ Using a small number of features: β should be as sparse as possible

SLIDE 10

The lasso

Uses ℓ1-norm regularization to promote sparse coefficients:

$$\beta_{\text{lasso}} := \arg\min_{\beta} \; \frac{1}{2} \left\| y - X^T \beta \right\|_2^2 + \lambda \|\beta\|_1$$
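As a concrete illustration, here is a minimal sketch using scikit-learn's Lasso (assuming the library is available; the data are synthetic). Note the scaling convention: sklearn minimizes (1/(2n))‖y − Xw‖₂² + α‖w‖₁, so α plays the role of λ/n in the cost above.

```python
# Fit the lasso on synthetic data and inspect the support of the estimate.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))     # n x p design (transpose of the slides' X)
beta_true = np.zeros(p)
beta_true[[0, 3]] = [2.0, -1.5]     # sparse ground truth
y = X @ beta_true + 0.1 * rng.standard_normal(n)

model = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
print(np.flatnonzero(model.coef_))  # typically [0 3]: a sparse support
```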

SLIDE 11

Temperature prediction via linear regression

◮ Dataset of hourly temperatures measured at weather stations all over the US
◮ Goal: Predict the temperature in Jamestown (North Dakota) from the other temperatures
◮ Response: Temperature in Jamestown
◮ Features: Temperatures at 133 other stations (p = 133) in 2015
◮ Test set: 10³ measurements
◮ Additional test set: All measurements from 2016

SLIDE 12

Ridge regression (n := 135)

[Figure: ridge-regression coefficients as a function of the regularization parameter λ/n (log scale, 10⁻¹ to 10⁶); highlighted coefficients: Wolf Point MT, Aberdeen SD, Buffalo SD]

SLIDE 13

Lasso (n := 135)

[Figure: lasso coefficients as a function of the regularization parameter λ/n (log scale, 10⁻⁵ to 10²); highlighted coefficients: Wolf Point MT, Aberdeen SD, Buffalo SD]

SLIDE 14

Lasso (n := 135)

[Figure: training and validation error (average error, degrees Celsius) as a function of the regularization parameter (log scale, 10⁻⁵ to 10²)]
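A hedged sketch of how such curves are produced: sweep the regularization parameter over a log grid, fit on the training set, and evaluate on held-out data. The station dataset is not reproduced here, so synthetic stand-in data are used.

```python
# Train/validation error as a function of the lasso regularization parameter.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n_train, n_val, p = 135, 200, 133
X = rng.standard_normal((n_train + n_val, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.0, 0.5, -0.5]
y = X @ beta_true + 0.5 * rng.standard_normal(n_train + n_val)
X_tr, y_tr = X[:n_train], y[:n_train]
X_va, y_va = X[n_train:], y[n_train:]

for alpha in np.logspace(-4, 1, 6):            # alpha ~ lambda/n, log grid
    m = Lasso(alpha=alpha, fit_intercept=False, max_iter=100_000).fit(X_tr, y_tr)
    err_tr = np.abs(y_tr - m.predict(X_tr)).mean()
    err_va = np.abs(y_va - m.predict(X_va)).mean()
    print(f"alpha={alpha:.0e}  train={err_tr:.2f}  val={err_va:.2f}")
# training error grows with alpha; validation error is U-shaped
```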

SLIDE 15

Lasso

[Figure: regularization parameter (log scale, 10⁻⁵ to 10⁻²) as a function of the number of training data n (log scale, 10² to 10⁴)]

SLIDE 16

Ridge-regression coefficients

[Figure: ridge-regression coefficients as a function of the number of training data (10² to 10⁴); highlighted coefficients: Wolf Point MT, Aberdeen SD, Buffalo SD]

SLIDE 17

Lasso coefficients

[Figure: lasso coefficients as a function of the number of training data (10² to 10⁴); highlighted coefficients: Wolf Point MT, Aberdeen SD, Buffalo SD]

SLIDE 18

Results

[Figure: average error (degrees Celsius) as a function of the number of training data (10² to 10⁴): training error, test error, and 2016 test error for ridge regression (RR) and the lasso]

SLIDE 19

◮ The lasso
◮ Convexity
◮ Subgradients
◮ Analysis of the lasso estimator for a simple example

SLIDE 20

Convex functions

A function f : Rⁿ → R is convex if for any x, y ∈ Rⁿ and any θ ∈ (0, 1)

$$\theta f(x) + (1 - \theta)\, f(y) \geq f(\theta x + (1 - \theta)\, y)$$

SLIDE 21

Convex functions

[Figure: the chord θf(x) + (1 − θ)f(y) between (x, f(x)) and (y, f(y)) lies above f(θx + (1 − θ)y)]

SLIDE 22

Strictly convex functions

A function f : Rⁿ → R is strictly convex if for any x, y ∈ Rⁿ with x ≠ y and any θ ∈ (0, 1)

$$\theta f(x) + (1 - \theta)\, f(y) > f(\theta x + (1 - \theta)\, y)$$

SLIDE 23

Linear and quadratic functions

Linear functions are convex (with equality):

$$f(\theta x + (1 - \theta)\, y) = \theta f(x) + (1 - \theta)\, f(y)$$

Positive definite quadratic forms are strictly convex

SLIDE 24

Norms are convex

For any x, y ∈ Rⁿ and any θ ∈ (0, 1), by the triangle inequality and homogeneity,

$$\|\theta x + (1 - \theta)\, y\| \leq \|\theta x\| + \|(1 - \theta)\, y\| = \theta \|x\| + (1 - \theta) \|y\|$$

SLIDE 25

ℓ0 “norm” is not convex

Let x := (1, 0) and y := (0, 1). For any θ ∈ (0, 1)

$$\|\theta x + (1 - \theta)\, y\|_0 = 2 > \theta \|x\|_0 + (1 - \theta) \|y\|_0 = 1$$
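These two facts can be checked numerically: sample random pairs to test the convexity inequality for the ℓ1 norm, and plug in the counterexample above for the ℓ0 “norm” (a minimal sketch):

```python
# The l1 norm passes the convexity inequality on random samples;
# the l0 "norm" fails it at the counterexample from this slide.
import numpy as np

rng = np.random.default_rng(0)
l1 = lambda v: np.sum(np.abs(v))
l0 = lambda v: float(np.count_nonzero(v))

for _ in range(1000):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    th = rng.uniform(0, 1)
    assert l1(th * x + (1 - th) * y) <= th * l1(x) + (1 - th) * l1(y) + 1e-9

x, y, th = np.array([1.0, 0.0]), np.array([0.0, 1.0]), 0.5
print(l0(th * x + (1 - th) * y), th * l0(x) + (1 - th) * l0(y))  # 2.0 > 1.0
```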

SLIDE 26

Is the lasso cost function convex?

If f is strictly convex and g is convex, what about h := f + λg (λ > 0)?

$$\begin{aligned}
h(\theta x + (1-\theta)\, y) &= f(\theta x + (1-\theta)\, y) + \lambda g(\theta x + (1-\theta)\, y) \\
&< \theta f(x) + (1-\theta)\, f(y) + \lambda\theta g(x) + \lambda(1-\theta)\, g(y) \\
&= \theta h(x) + (1-\theta)\, h(y)
\end{aligned}$$

so h is strictly convex

SLIDE 27

Lasso cost function is convex

A sum of convex functions is convex

If at least one summand is strictly convex, the sum is strictly convex

Scaling by a positive factor preserves convexity

The lasso cost function is convex!

SLIDE 28

Local minima are global

Any local minimum of a convex function is also a global minimum

SLIDE 29

Strictly convex functions

Strictly convex functions have at most one global minimum

Proof: Assume two distinct minima x ≠ y exist, with value v_min. Then

$$f(0.5x + 0.5y) < 0.5 f(x) + 0.5 f(y) = v_{\min},$$

which contradicts v_min being the minimum value

SLIDE 30

◮ The lasso
◮ Convexity
◮ Subgradients
◮ Analysis of the lasso estimator for a simple example

SLIDE 31

Epigraph

The epigraph of f : Rⁿ → R is

$$\operatorname{epi}(f) := \left\{ x \in \mathbb{R}^{n+1} \,\middle|\, f\big(x[1], \ldots, x[n]\big) \leq x[n+1] \right\}$$

SLIDE 32

Epigraph

[Figure: a function f and its epigraph epi(f)]

SLIDE 33

Supporting hyperplane

A hyperplane H is a supporting hyperplane of a set S at x if

◮ H and S intersect at x
◮ S is contained in one of the half-spaces bounded by H

SLIDE 34

Supporting hyperplane

SLIDE 35

Subgradient

A function f : Rⁿ → R is convex if and only if its epigraph has a supporting hyperplane at every point (x, f(x))

It is strictly convex if and only if, for every x ∈ Rⁿ, the epigraph intersects the supporting hyperplane only at that point

SLIDE 36

Subgradients

A subgradient of f : Rⁿ → R at x ∈ Rⁿ is a vector g ∈ Rⁿ such that

$$f(y) \geq f(x) + g^T (y - x) \quad \text{for all } y \in \mathbb{R}^n$$

The hyperplane

$$H_g := \left\{ y \in \mathbb{R}^{n+1} \,\middle|\, y[n+1] = f(x) + g^T \big( (y[1], \ldots, y[n]) - x \big) \right\}$$

is a supporting hyperplane of the epigraph at x

The set of all subgradients at x is called the subdifferential

SLIDE 37

Subgradients

SLIDE 38

Subgradient of differentiable function

If a function is differentiable, the only subgradient at each point is the gradient

SLIDE 39

Proof

Assume g is a subgradient at x. For any α ≥ 0 and any standard basis vector e_i,

$$f(x + \alpha e_i) \geq f(x) + g^T \alpha e_i = f(x) + g[i]\,\alpha$$

$$f(x) \leq f(x - \alpha e_i) + g^T \alpha e_i = f(x - \alpha e_i) + g[i]\,\alpha$$

Combining both inequalities,

$$\frac{f(x) - f(x - \alpha e_i)}{\alpha} \leq g[i] \leq \frac{f(x + \alpha e_i) - f(x)}{\alpha}$$

Letting α → 0 implies $g[i] = \frac{\partial f(x)}{\partial x[i]}$

SLIDE 40

Optimality condition for nondifferentiable functions

x is a minimum of f if and only if the zero vector is a subgradient of f at x:

$$f(y) \geq f(x) + 0^T (y - x) = f(x) \quad \text{for all } y \in \mathbb{R}^n$$

Under strict convexity the minimum is unique

SLIDE 41

Sum of subgradients

Let g₁ and g₂ be subgradients at x ∈ Rⁿ of f₁ : Rⁿ → R and f₂ : Rⁿ → R. Then g := g₁ + g₂ is a subgradient of f := f₁ + f₂ at x

Proof: For any y ∈ Rⁿ

$$f(y) = f_1(y) + f_2(y) \geq f_1(x) + g_1^T (y - x) + f_2(x) + g_2^T (y - x) = f(x) + g^T (y - x)$$

SLIDE 42

Subgradient of scaled function

Let g₁ be a subgradient at x ∈ Rⁿ of f₁ : Rⁿ → R. For any α ≥ 0, g₂ := αg₁ is a subgradient of f₂ := αf₁ at x

Proof: For any y ∈ Rⁿ

$$f_2(y) = \alpha f_1(y) \geq \alpha \left( f_1(x) + g_1^T (y - x) \right) = f_2(x) + g_2^T (y - x)$$

SLIDE 43

Subdifferential of absolute value

At x ≠ 0, f(x) = |x| is differentiable, so the only subgradient is g = sign(x)

At x = 0, we need f(0 + y) ≥ f(0) + g(y − 0), i.e. |y| ≥ gy for all y, which holds if and only if |g| ≤ 1
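A small numeric confirmation: test the defining inequality |y| ≥ gy at x = 0 on a grid of candidate slopes g.

```python
# The inequality |y| >= g*y for all y holds exactly when |g| <= 1.
import numpy as np

ys = np.linspace(-2, 2, 401)
for g in [-1.5, -1.0, 0.0, 0.7, 1.0, 1.5]:
    ok = bool(np.all(np.abs(ys) >= g * ys - 1e-12))
    print(f"g = {g:+.1f}: subgradient of |x| at 0? {ok}")
# True exactly for |g| <= 1
```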

SLIDE 44

Subdifferential of absolute value

[Figure: f(x) = |x| and its supporting lines at x = 0]

SLIDE 45

Subdifferential of ℓ1 norm

g is a subgradient of the ℓ1 norm at x ∈ Rⁿ if and only if

◮ g[i] = sign(x[i]) if x[i] ≠ 0
◮ |g[i]| ≤ 1 if x[i] = 0
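This characterization translates directly into a membership test (a minimal sketch; the function name is illustrative):

```python
# Test whether g is a subgradient of the l1 norm at x, entrywise.
import numpy as np

def in_l1_subdifferential(g, x, tol=1e-9):
    active = x != 0
    return (np.allclose(g[active], np.sign(x[active]), atol=tol)
            and bool(np.all(np.abs(g[~active]) <= 1 + tol)))

x = np.array([2.0, 0.0, -1.0])
print(in_l1_subdifferential(np.array([1.0, 0.3, -1.0]), x))  # True
print(in_l1_subdifferential(np.array([1.0, 1.2, -1.0]), x))  # False
```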

SLIDE 46

Proof (one direction)

Assume g[i] is a subgradient of |·| at x[i] for 1 ≤ i ≤ n. For any y ∈ Rⁿ

$$\|y\|_1 = \sum_{i=1}^n |y[i]| \geq \sum_{i=1}^n \big( |x[i]| + g[i]\,(y[i] - x[i]) \big) = \|x\|_1 + g^T (y - x)$$

SLIDE 47

Subdifferential of ℓ1 norm

SLIDE 48

Subdifferential of ℓ1 norm

SLIDE 49

Subdifferential of ℓ1 norm

SLIDE 50

◮ The lasso
◮ Convexity
◮ Subgradients
◮ Analysis of the lasso estimator for a simple example

SLIDE 51

Additive model

$$\tilde{y}_{\text{train}} := X^T \beta_{\text{true}} + \tilde{z}_{\text{train}}$$

Goal: Gain intuition about why the lasso promotes sparse solutions

SLIDE 52

Decomposition of lasso cost function

$$\arg\min_{\beta} \left\| \tilde{y}_{\text{train}} - X^T\beta \right\|_2^2 + \lambda \|\beta\|_1
= \arg\min_{\beta} \; (\beta - \beta_{\text{true}})^T X X^T (\beta - \beta_{\text{true}}) + \lambda \|\beta\|_1 - 2\,\tilde{z}_{\text{train}}^T X^T \beta$$
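The identity holds because expanding the square shows the two costs differ only by terms that do not depend on β (namely ‖z̃_train‖₂² and 2 z̃_train^T X^T β_true). A quick numeric check of that claim, using the slides' p × n convention for X:

```python
# Verify that the two cost functions differ by a beta-independent constant,
# so they have the same minimizer.
import numpy as np

rng = np.random.default_rng(0)
p, n, lam = 5, 30, 0.3
X = rng.standard_normal((p, n))                  # p x n, as in the slides
beta_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5])
z = 0.1 * rng.standard_normal(n)
y = X.T @ beta_true + z

def cost_left(b):
    return np.sum((y - X.T @ b) ** 2) + lam * np.sum(np.abs(b))

def cost_right(b):
    d = b - beta_true
    return d @ (X @ X.T) @ d + lam * np.sum(np.abs(b)) - 2 * z @ (X.T @ b)

diffs = [cost_left(b) - cost_right(b) for b in rng.standard_normal((4, p))]
print(np.allclose(diffs, diffs[0]))              # True: constant offset only
```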

SLIDE 53

Sparse regression with two features

One true feature: ỹ := x_true + z̃

We fit a model using an additional feature:

$$X := \begin{bmatrix} x_{\text{true}} & x_{\text{other}} \end{bmatrix}^T, \qquad \beta_{\text{true}} := \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$

SLIDE 54

$(\beta - \beta_{\text{true}})^T X X^T (\beta - \beta_{\text{true}})$

[Figure: contour lines of the quadratic term over (β[1], β[2]), centered at β_true]

SLIDE 55

$\|\beta\|_1$

[Figure: contour lines of the ℓ1 norm over (β[1], β[2])]

SLIDE 56

$(\beta - \beta_{\text{true}})^T X X^T (\beta - \beta_{\text{true}}) + \lambda \|\beta\|_1$

[Figure: contour lines of the penalized cost over (β[1], β[2]), with β_true and β_lasso marked]

SLIDE 57

$(\beta - \beta_{\text{true}})^T X X^T (\beta - \beta_{\text{true}}) + \lambda \|\beta\|_1 - 2\,\tilde{z}_{\text{train}}^T X^T \beta$

[Figure: contour lines of the full lasso cost over (β[1], β[2]), with β_OLS, β_lasso, and β_true marked]

SLIDE 58

$(\beta - \beta_{\text{true}})^T X X^T (\beta - \beta_{\text{true}}) + \lambda \|\beta\|_1 - 2\,\tilde{z}_{\text{train}}^T X^T \beta$

[Figure: contour lines of the full lasso cost over (β[1], β[2]), with β_OLS, β_lasso, and β_true marked]

SLIDE 59

$(\beta - \beta_{\text{true}})^T X X^T (\beta - \beta_{\text{true}}) + \lambda \|\beta\|_1 - 2\,\tilde{z}_{\text{train}}^T X^T \beta$

[Figure: contour lines of the full lasso cost over (β[1], β[2]), with β_OLS, β_lasso, and β_true marked]

SLIDE 60

$(\beta - \beta_{\text{true}})^T X X^T (\beta - \beta_{\text{true}}) + \lambda \|\beta\|_1 - 2\,\tilde{z}_{\text{train}}^T X^T \beta$

[Figure: contour lines of the full lasso cost over (β[1], β[2]), with β_OLS, β_lasso, and β_true marked]

SLIDE 61

λ = 0.02

[Figure: lasso cost contours over (β[1], β[2]) for λ = 0.02, with β_true marked]

SLIDE 62

λ = 0.2

[Figure: lasso cost contours over (β[1], β[2]) for λ = 0.2, with β_true marked]

SLIDE 63

λ = 2

[Figure: lasso cost contours over (β[1], β[2]) for λ = 2, with β_true marked]

SLIDE 64

λ = 4

[Figure: lasso cost contours over (β[1], β[2]) for λ = 4, with β_true marked]

SLIDE 65

Sparse regression with two features

Feature vectors and noise are fixed n-dimensional vectors: y := x_true + z

We fit a model using an additional feature:

$$X := \begin{bmatrix} x_{\text{true}} & x_{\text{other}} \end{bmatrix}^T, \qquad \beta_{\text{true}} := \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad \|x_{\text{true}}\|_2 = \|x_{\text{other}}\|_2 = 1$$
SLIDE 66

Sparse regression with two features

If λ satisfies

$$\frac{\left| x_{\text{other}}^T z - \rho\, x_{\text{true}}^T z \right|}{1 - |\rho|} \leq \lambda \leq 1 + x_{\text{true}}^T z,$$

where ρ := x_true^T x_other, then the lasso coefficient estimate equals

$$\beta_{\text{lasso}} = \begin{bmatrix} 1 + x_{\text{true}}^T z - \lambda \\ 0 \end{bmatrix}$$
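This closed form can be verified numerically. The sketch below solves the lasso with a generic proximal-gradient (ISTA) loop, a standard solver that is not part of the slides, and compares the result with the formula; the noise is kept small so that the stated condition on λ holds.

```python
# Verify the two-feature closed form: solve the lasso by proximal gradient
# (ISTA) and compare with 1 + x_true^T z - lambda.
import numpy as np

rng = np.random.default_rng(3)
n, lam = 200, 0.2
x_true = rng.standard_normal(n);  x_true /= np.linalg.norm(x_true)
x_other = rng.standard_normal(n); x_other /= np.linalg.norm(x_other)
z = 0.01 * rng.standard_normal(n)        # small noise: the condition on lam holds
y = x_true + z
A = np.stack([x_true, x_other], axis=1)  # n x 2 design (A = X^T in slide notation)

L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
b = np.zeros(2)
for _ in range(5000):
    w = b - A.T @ (A @ b - y) / L        # gradient step on (1/2)||y - Ab||^2
    b = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)   # soft thresholding

print(b)                                 # second coefficient is (numerically) 0
print(1 + x_true @ z - lam)              # matches b[0]
```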

SLIDE 67

Lasso coefficients

[Figure: lasso coefficients (0 to 1) as a function of the regularization parameter (0 to 0.2)]

SLIDE 68

Analyzing the lasso

How do we prove this? There is no closed-form solution!

Show that there is a horizontal supporting hyperplane of the cost function at β_lasso

Equivalently, show that zero is a subgradient of the lasso cost function at β_lasso

SLIDE 69

Subgradients of lasso cost function

Gradient of $\frac{1}{2}\left\|X^T\beta - y\right\|_2^2$ at β_lasso:

$$X \left( X^T \beta_{\text{lasso}} - y \right)$$

Subgradient of the ℓ1 norm at β_lasso, if only the first entry is nonzero and positive:

$$g_{\ell_1} := \begin{bmatrix} 1 \\ \gamma \end{bmatrix}, \qquad |\gamma| \leq 1$$

Subgradient of the lasso cost function at such a β_lasso:

$$g_{\text{lasso}} := X \left( X^T \beta_{\text{lasso}} - y \right) + \lambda \begin{bmatrix} 1 \\ \gamma \end{bmatrix}, \qquad |\gamma| \leq 1$$
SLIDE 70

Subgradients of lasso cost function

Using ‖x_true‖₂ = 1 and ρ := x_true^T x_other,

$$\begin{aligned}
g_{\text{lasso}} &:= X \left( X^T \beta_{\text{lasso}} - y \right) + \lambda \begin{bmatrix} 1 \\ \gamma \end{bmatrix} \\
&= X \left( \beta_{\text{lasso}}[1]\, x_{\text{true}} - x_{\text{true}} - z \right) + \lambda \begin{bmatrix} 1 \\ \gamma \end{bmatrix} \\
&= \begin{bmatrix} x_{\text{true}}^T \left( (\beta_{\text{lasso}}[1] - 1)\, x_{\text{true}} - z \right) + \lambda \\ x_{\text{other}}^T \left( (\beta_{\text{lasso}}[1] - 1)\, x_{\text{true}} - z \right) + \lambda \gamma \end{bmatrix} \\
&= \begin{bmatrix} \beta_{\text{lasso}}[1] - 1 - x_{\text{true}}^T z + \lambda \\ \rho\,(\beta_{\text{lasso}}[1] - 1) - x_{\text{other}}^T z + \lambda \gamma \end{bmatrix}
\end{aligned}$$

SLIDE 71

Is zero a valid subgradient?

Setting g_lasso = 0 yields

$$\beta_{\text{lasso}}[1] = 1 - \lambda + x_{\text{true}}^T z, \qquad \gamma = \frac{x_{\text{other}}^T z - \rho\, x_{\text{true}}^T z}{\lambda} + \rho$$

We need β_lasso[1] ≥ 0, which holds when

$$\lambda \leq 1 + x_{\text{true}}^T z$$

We need |γ| ≤ 1, which holds when

$$\frac{\left| x_{\text{other}}^T z - \rho\, x_{\text{true}}^T z \right|}{1 - |\rho|} \leq \lambda$$