

SLIDE 1

Lecture 9: Regularized/penalized regression

Felix Held, Mathematical Sciences

MSA220/MVE440 Statistical Learning for Big Data, 15th April 2019

SLIDE 2

Revisited: Expectation-Maximization (I)

New target function: Maximize
\[
\log p(\mathbf{X} \mid \boldsymbol{\theta}) = \mathbb{E}_{q(\mathbf{Z})}\left[\log \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}\right] - \mathbb{E}_{q(\mathbf{Z})}\left[\log \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}\right]
\]
with respect to $q(\mathbf{Z})$ and $\boldsymbol{\theta}$.

Note:

▶ The left-hand side is independent of $q(\mathbf{Z})$.
▶ The difference on the right-hand side always has the same value, irrespective of the chosen $q(\mathbf{Z})$. Choosing $q(\mathbf{Z})$ is therefore a trade-off between $\mathbb{E}_{q(\mathbf{Z})}[\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) / q(\mathbf{Z})]$ and $\mathbb{E}_{q(\mathbf{Z})}[\log p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}) / q(\mathbf{Z})]$.

SLIDE 3

Revisited: Expectation-Maximization (II)

1. Expectation step: For given parameters $\boldsymbol{\theta}^{(n)}$ the density $q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(n)})$ minimizes the second term and thereby maximizes the first one. Set
\[
Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(n)}) = \mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(n)})}\left[\log \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(n)})}\right]
\]

2. Maximization step: Maximize the first term with
\[
\boldsymbol{\theta}^{(n+1)} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(n)})
\]

Note: Since $\mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(n)})}\left[\log \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(n)})}{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(n)})}\right] = 0$ it follows that
\[
\log p(\mathbf{X} \mid \boldsymbol{\theta}^{(n)}) = \mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(n)})}\left[\log \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}^{(n)})}{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(n)})}\right]
\]
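To make the two steps concrete, here is a minimal numpy sketch (not from the original slides) of EM for a two-component univariate Gaussian mixture, where the latent $\mathbf{Z}$ are the component labels; all function and variable names are illustrative.

import numpy as np

def em_gaussian_mixture(x, n_iter=100):
    """Minimal EM for a two-component univariate Gaussian mixture."""
    x = np.sort(np.asarray(x, dtype=float))
    # Crude initialisation of theta^(0): split the sorted data in half
    mu = np.array([x[: len(x) // 2].mean(), x[len(x) // 2:].mean()])
    sigma2 = np.array([x.var(), x.var()])
    w = np.array([0.5, 0.5])                      # mixing proportions
    for _ in range(n_iter):
        # E-step: responsibilities p(Z_i = k | x_i, theta^(n))
        dens = w / np.sqrt(2 * np.pi * sigma2) * np.exp(
            -0.5 * (x[:, None] - mu) ** 2 / sigma2
        )
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximize Q(theta, theta^(n)), available in closed form here
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma2 = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, sigma2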

SLIDE 4

Regularized/penalized regression

SLIDE 5

Remember ordinary least-squares (OLS)

Consider the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ where

▶ $\mathbf{y} \in \mathbb{R}^n$ is the outcome, $\mathbf{X} \in \mathbb{R}^{n \times (p+1)}$ is the design matrix, $\boldsymbol{\beta} \in \mathbb{R}^{p+1}$ are the regression coefficients, and $\boldsymbol{\varepsilon} \in \mathbb{R}^n$ is the additive error.
▶ Five basic assumptions have to be checked: the underlying relationship is linear (1), and the errors have zero mean (2), are uncorrelated (3), have constant variance (4) and are (roughly) normally distributed (5).
▶ Centring ($\frac{1}{n}\sum_{i=1}^{n} x_{ij} = 0$) and standardisation ($\frac{1}{n}\sum_{i=1}^{n} x_{ij}^2 = 1$) of the predictors simplifies interpretation.
▶ Centring the outcome ($\frac{1}{n}\sum_{i=1}^{n} y_i = 0$) and the features removes the need to estimate the intercept.

SLIDE 6

Feature selection as motivation

An analytical solution exists when $\mathbf{X}^T\mathbf{X}$ is invertible:
\[
\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}
\]
This can be unstable or fail in case of

▶ high correlation between predictors, or
▶ $p > n$.

Solutions: Regularisation or feature selection.
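A short numpy sketch (not from the slides) of the closed-form OLS estimate, illustrating how nearly collinear predictors make it unstable; the data-generating setup is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)          # almost collinear with x1
X = np.column_stack([x1, x2])                # centred-ish predictors, no intercept
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=n)

# Closed-form OLS: (X^T X)^{-1} X^T y; X^T X is nearly singular here, so the
# individual coefficients are typically huge and far from (1, 2), even though
# their sum is estimated well.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ols)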

SLIDE 7

Filtering for feature selection

▶ Choose features through pre-processing
  ▶ Features with maximum variance
  ▶ Use only the first $k$ PCA components
▶ Examples of other useful measures
  ▶ Use a univariate criterion, e.g. the F-score: features that correlate most with the response (see the sketch below)
  ▶ Mutual information: reduction in uncertainty about $\mathbf{x}$ after observing $y$
  ▶ Variable importance: determine variable importance with random forests
▶ Summary
  ▶ Pro: fast and easy
  ▶ Con: filtering mostly operates on single features and is not geared towards a certain method
  ▶ Care with cross-validation and multiple testing is necessary
  ▶ Filtering is often more of a pre-processing step and less of a proper feature selection step
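As an illustration of a univariate filter (not from the slides), a minimal numpy sketch that keeps the $k$ features most correlated with the response; using the absolute correlation as the score is an assumption, but for a single predictor it ranks features in the same order as the F-statistic of a univariate regression.

import numpy as np

def filter_top_k(X, y, k):
    """Keep the k columns of X with the largest |correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    keep = np.argsort(corr)[::-1][:k]        # indices of the k highest scores
    return keep, X[:, keep]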

SLIDE 8

Wrapping for feature selection

▶ Idea: Determine the best set of features by fitting models of different complexity and comparing their performance
▶ Best subset selection: Try all possible (exponentially many) subsets of features and compare model performance, e.g. with cross-validation
▶ Forward selection: Start with just an intercept and, in each step, add the variable that improves the fit the most (greedy algorithm; see the sketch below)
▶ Backward selection: Start with all variables included and then sequentially remove the one with the least impact (greedy algorithm)
▶ As discrete procedures, all of these methods exhibit high variance (small changes in the data can lead to different predictors being selected, resulting in a potentially very different model)
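A minimal numpy sketch of greedy forward selection (not from the slides), scoring candidate variables by residual sum of squares on centred data so that no intercept is needed; in practice the number of selected variables would be chosen by cross-validation.

import numpy as np

def forward_selection(X, y, max_vars):
    """Greedily add the variable that reduces the RSS the most."""
    n, p = X.shape
    selected = []
    for _ in range(max_vars):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            Xs = X[:, selected + [j]]
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)   # least-squares fit
            rss = np.sum((y - Xs @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected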

SLIDE 9

Embedding for feature selection

▶ Embed/include the feature selection into the model estimation procedure
▶ Ideally, penalisation on the number of included features
\[
\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=1}^{p} \mathbb{1}(\beta_j \neq 0)
\]
However, discrete optimization problems are hard to solve.
▶ Softer regularisation methods can help
\[
\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_q^q
\]
where $\lambda$ is a tuning parameter and $q \geq 1$ or $q = \infty$.

SLIDE 10

Constrained regression

The optimization problem
\[
\arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 \quad \text{subject to} \quad \|\boldsymbol{\beta}\|_q^q \leq t
\]
for $q > 0$ is equivalent to
\[
\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_q^q
\]
when $q \geq 1$. This is the Lagrangian of the constrained problem.

▶ Clear when $q > 1$: convex constraint and target function, and both are differentiable
▶ Harder to prove for $q = 1$, but possible (e.g. with subgradients)

SLIDE 11

Ridge regression

For π‘Ÿ = 2 the constrained problem is ridge regression Μ‚ 𝜸ridge(πœ‡) = arg min

𝜸

‖𝐳 βˆ’ π˜πœΈβ€–2

2 + πœ‡β€–πœΈβ€–2 2

where β€–πœΈβ€–2

2 = βˆ‘ π‘ž π‘˜=1 𝛾2 π‘˜ .

An analytical solution exists if π˜π‘ˆπ˜ + πœ‡π‰π‘ž is invertible Μ‚ 𝜸ridge(πœ‡) = (π˜π‘ˆπ˜ + πœ‡π‰π‘ž)βˆ’1π˜π‘ˆπ³ If π˜π‘ˆπ˜ = π‰π‘ž, then Μ‚ 𝜸ridge(πœ‡) = Μ‚ 𝜸OLS 1 + πœ‡, i.e. Μ‚ 𝜸ridge(πœ‡) is biased but has lower variance.

9/24
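A minimal numpy sketch of the closed-form ridge estimate above, assuming centred/standardised predictors and a centred outcome so that no intercept is needed; the function name is illustrative.

import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate (X^T X + lambda I_p)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# lam = 0 recovers OLS (when X^T X is invertible); increasing lam shrinks the
# coefficients towards zero and stabilises the nearly collinear example from
# the OLS slide.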

SLIDE 12

SVD and ridge regression

Recall: The SVD of a matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ was $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$.

The analytical solution for ridge regression becomes (for $n \geq p$)
\[
\hat{\boldsymbol{\beta}}_{\mathrm{ridge}}(\lambda)
= (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I}_p)^{-1}\mathbf{X}^T\mathbf{y}
= (\mathbf{V}\mathbf{D}^2\mathbf{V}^T + \lambda \mathbf{I}_p)^{-1}\mathbf{V}\mathbf{D}\mathbf{U}^T\mathbf{y}
= \mathbf{V}(\mathbf{D}^2 + \lambda \mathbf{I}_p)^{-1}\mathbf{D}\mathbf{U}^T\mathbf{y}
= \sum_{j=1}^{p} \frac{d_j}{d_j^2 + \lambda}\, \mathbf{v}_j \mathbf{u}_j^T \mathbf{y}
\]
Ridge regression acts most on principal components with lower eigenvalues, e.g. in the presence of correlation between features.
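The same estimate computed through the SVD, as a small numpy sketch (thin SVD, assuming $n \geq p$); it agrees with the closed-form ridge sketch above up to numerical error.

import numpy as np

def ridge_svd(X, y, lam):
    """Ridge estimate via X = U D V^T: V diag(d_j / (d_j^2 + lam)) U^T y."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ ((d / (d ** 2 + lam)) * (U.T @ y))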

SLIDE 13

Effective degrees of freedom

Recall the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ in OLS. The trace of $\mathbf{H}$,
\[
\mathrm{tr}(\mathbf{H}) = \mathrm{tr}(\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T) = \mathrm{tr}(\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}) = \mathrm{tr}(\mathbf{I}_p) = p,
\]
equals the degrees of freedom of the regression coefficients. In analogy, define for ridge regression
\[
\mathbf{H}(\lambda) := \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I}_p)^{-1}\mathbf{X}^T
\quad \text{and} \quad
\mathrm{df}(\lambda) := \mathrm{tr}(\mathbf{H}(\lambda)) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda},
\]
the effective degrees of freedom.
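A one-function numpy sketch of the effective degrees of freedom, computed from the singular values of $\mathbf{X}$ as in the formula above; note that $\mathrm{df}(0) = p$ for full-rank $\mathbf{X}$ and $\mathrm{df}(\lambda) \to 0$ as $\lambda \to \infty$.

import numpy as np

def effective_df(X, lam):
    """df(lambda) = sum_j d_j^2 / (d_j^2 + lambda) for ridge regression."""
    d = np.linalg.svd(X, compute_uv=False)   # singular values of X
    return np.sum(d ** 2 / (d ** 2 + lam))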

SLIDE 14

Lasso regression

For π‘Ÿ = 1 the constrained problem is known as the lasso Μ‚ 𝜸ridge(πœ‡) = arg min

𝜸

‖𝐳 βˆ’ π˜πœΈβ€–2

2 + πœ‡β€–πœΈβ€–1

β–Ά Smallest π‘Ÿ in penalty such that constraint is still convex β–Ά Performs feature selection

12/24

SLIDE 15

Intuition for the penalties (I)

Assume the OLS solution $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ exists and set $\mathbf{r} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$. It follows for the residual sum of squares (RSS) that
\[
\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2
= \|(\mathbf{X}\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} + \mathbf{r}) - \mathbf{X}\boldsymbol{\beta}\|_2^2
= \|\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}_{\mathrm{OLS}}) - \mathbf{r}\|_2^2
= (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}_{\mathrm{OLS}})^T \mathbf{X}^T\mathbf{X} (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}_{\mathrm{OLS}}) - 2\mathbf{r}^T\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}_{\mathrm{OLS}}) + \mathbf{r}^T\mathbf{r},
\]
whose contours are ellipses (at least in 2D) centred on $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$.

SLIDE 16

Intuition for the penalties (II)

The least-squares RSS is minimized at $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$. If a constraint ($\|\boldsymbol{\beta}\|_q^q \leq t$) is added, then the RSS is minimized by the closest $\boldsymbol{\beta}$ that fulfils the constraint.

[Figure: constraint regions and RSS contours in the $(\beta_1, \beta_2)$ plane, showing $\hat{\boldsymbol{\beta}}_{\mathrm{lasso}}$ (lasso panel) and $\hat{\boldsymbol{\beta}}_{\mathrm{ridge}}$ (ridge panel) relative to $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$. The blue lines are the contour lines for the RSS.]

SLIDE 17

Intuition for the penalties (III)

Depending on $q$, the different constraints lead to different solutions. If $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ lies in one of the coloured areas or on a line, the constrained solution will be at the corresponding dot.

[Figure: constraint regions in the $(\beta_1, \beta_2)$ plane for $q = 2$, $q = \infty$, $q = 0.7$ and $q = 1$.]

SLIDE 18

Computational aspects of the Lasso (I)

What estimates does the lasso produce? Target function:
\[
\arg\min_{\boldsymbol{\beta}} \tfrac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1
\]
Special case: $\mathbf{X}^T\mathbf{X} = \mathbf{I}_p$. Then
\[
\tfrac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1
= \tfrac{1}{2}\mathbf{y}^T\mathbf{y} - \underbrace{\mathbf{y}^T\mathbf{X}}_{=\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}^T}\,\boldsymbol{\beta} + \tfrac{1}{2}\boldsymbol{\beta}^T\boldsymbol{\beta} + \lambda \|\boldsymbol{\beta}\|_1 = g(\boldsymbol{\beta})
\]
How do we find the solution $\hat{\boldsymbol{\beta}}$ in the presence of the non-differentiable penalisation $\|\boldsymbol{\beta}\|_1$?

SLIDE 19

Computational aspects of the Lasso (II)

For π˜π‘ˆπ˜ = π‰π‘ž the target function can be written as arg min

𝜸 π‘ž

βˆ‘

π‘˜=1

βˆ’π›ΎOLS,π‘˜π›Ύ

π‘˜ + 1

2𝛾2

π‘˜ + πœ‡|𝛾 π‘˜|

This results in π‘ž uncoupled optimization problems.

β–Ά If 𝛾OLS,π‘˜ > 0, then 𝛾

π‘˜ > 0 to minimize the target

β–Ά If 𝛾OLS,π‘˜ ≀ 0, then 𝛾

π‘˜ ≀ 0

Each case results in Λ† 𝛾

π‘˜ = sign(𝛾OLS,π‘˜)(|𝛾OLS,π‘˜| βˆ’ πœ‡)+ = ST(𝛾OLS,π‘˜, πœ‡),

where 𝑦+ = {𝑦 𝑦 > 0

  • therwise

and ST is the soft-thresholding operator

17/24
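A minimal numpy sketch of the soft-thresholding operator and of the lasso estimate in the orthonormal case $\mathbf{X}^T\mathbf{X} = \mathbf{I}_p$; this special case is illustrative only, the general case needs an iterative solver (see the coordinate descent sketch further down).

import numpy as np

def soft_threshold(z, lam):
    """ST(z, lambda) = sign(z) * (|z| - lambda)_+, applied element-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Orthonormal design (X^T X = I_p): beta_OLS = X^T y and the lasso estimate
# is simply the soft-thresholded OLS estimate.
def lasso_orthonormal(X, y, lam):
    return soft_threshold(X.T @ y, lam)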

SLIDE 20

Relation to OLS estimates

Both the ridge regression and the lasso estimates can be written as functions of $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ if $\mathbf{X}^T\mathbf{X} = \mathbf{I}_p$:
\[
\hat{\beta}_{\mathrm{ridge},j} = \frac{\hat{\beta}_{\mathrm{OLS},j}}{1 + \lambda}
\quad \text{and} \quad
\hat{\beta}_{\mathrm{lasso},j} = \mathrm{sign}(\hat{\beta}_{\mathrm{OLS},j})\,(|\hat{\beta}_{\mathrm{OLS},j}| - \lambda)_+
\]

[Figure: visualisation of the transformations applied to the OLS estimates (ridge and lasso).]

SLIDE 21

Shrinkage

When πœ‡ is fixed, the shrinkage of the lasso estimate 𝜸lasso(πœ‡) compared to the OLS estimate 𝜸OLS is defined as 𝑑(πœ‡) = β€–πœΈlasso(πœ‡)β€–1 β€–πœΈOLSβ€–1 Note: 𝑑(πœ‡) ∈ [0, 1] with 𝑑(πœ‡) β†’ 0 for increasing πœ‡ and 𝑑(πœ‡) = 1 if πœ‡ = 0

19/24

SLIDE 22

A regularisation path

Prostate cancer dataset ($n = 67$, $p = 8$)

Red dashed lines indicate the $\lambda$ selected by cross-validation.

[Figure: regularisation paths of the coefficients. Ridge coefficients are plotted against the effective degrees of freedom and lasso coefficients against the shrinkage factor; both are also shown against $\log(\lambda)$.]

SLIDE 23

Notes on the lasso

▶ In the general case, i.e. $\mathbf{X}^T\mathbf{X} \neq \mathbf{I}_p$, there is no explicit solution.
▶ A numerical solution is possible, e.g. with coordinate descent (see the sketch below)
▶ As for ridge regression, the estimates are biased
▶ But
  ▶ Asymptotic consistency: If $\lambda = o(n)$ then $\hat{\boldsymbol{\beta}}_{\mathrm{lasso}} \to \boldsymbol{\beta}_{\mathrm{true}}$ for $n \to \infty$
  ▶ Model selection consistency: If $\lambda \propto n^{1/2}$, then there is a non-zero probability of identifying the true model
▶ Degrees of freedom: The degrees of freedom are equal to the number of non-zero coefficients
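A minimal numpy sketch of coordinate descent for the formulation $\tfrac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1$ from slide 18 (not the lecture's reference implementation); it assumes centred $\mathbf{y}$ and centred, ideally standardised, columns of $\mathbf{X}$, and uses a fixed number of sweeps instead of a proper convergence check.

import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for 0.5 * ||y - X beta||_2^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)        # x_j^T x_j for each column
    resid = y - X @ beta                 # current residual
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]   # partial residual without feature j
            beta[j] = soft_threshold(X[:, j] @ resid, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta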

SLIDE 24

Potential caveats of the lasso (I)

▶ Sparsity of the true model:
  ▶ The lasso only works if the data is generated from a sparse process.
  ▶ However, a dense process with many variables and not enough data, or high correlation between predictors, can be unidentifiable either way
▶ Correlations: Many non-relevant variables correlated with relevant variables can lead to the selection of the wrong model, even for large $n$
▶ Irrepresentable condition: Split $\mathbf{X}$ such that $\mathbf{X}_1$ contains all relevant variables and $\mathbf{X}_2$ contains all irrelevant variables. If, element-wise,
\[
|\mathbf{X}_2^T\mathbf{X}_1(\mathbf{X}_1^T\mathbf{X}_1)^{-1}| < 1 - \theta
\]
for some $\theta > 0$, then the lasso is (almost) guaranteed to pick the true model.
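The condition involves the (unknown) set of truly relevant variables, so it can only be evaluated in simulation studies; here is a small numpy sketch of the element-wise check stated above (the function name and the boolean-mask interface are illustrative).

import numpy as np

def irrepresentable_ok(X, relevant, theta=0.0):
    """Check |X2^T X1 (X1^T X1)^{-1}| < 1 - theta element-wise.

    relevant: boolean numpy mask marking the truly relevant columns of X.
    """
    X1, X2 = X[:, relevant], X[:, ~relevant]
    M = X2.T @ X1 @ np.linalg.inv(X1.T @ X1)
    return bool(np.all(np.abs(M) < 1 - theta))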

SLIDE 25

Potential caveats of the lasso (II)

In practice, neither the sparsity of the true model nor the irrepresentable condition can be checked.

▶ Assumptions and domain knowledge have to be used

SLIDE 26

Take-home message

▶ Filtering and wrapping methods are useful for feature selection in practice but can be unprincipled or have high variance
▶ Penalisation gives stability to regression
▶ The lasso performs variable selection and variance stabilisation at the same time