Sketched Ridge Regression: Optimization and Statistical Perspectives - PowerPoint PPT Presentation



SLIDE 1

Sketched Ridge Regression: Optimization and Statistical Perspectives

Shusen Wang (UC Berkeley), Alex Gittens (RPI), Michael Mahoney (UC Berkeley)

SLIDE 2

Overview

SLIDE 3

Ridge Regression

Over-determined: $n \gg d$, with $\mathbf{X} \in \mathbb{R}^{n \times d}$.

$$\min_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$$

SLIDE 4

Ridge Regression

$$\min_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$$

  • Efficient and approximate solution?
  • Use only part of the data?
SLIDE 5

Ridge Regression

$$\min_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$$

Matrix sketching:

  • Random selection
  • Random projection
SLIDE 6

Approximate Ridge Regression

$$\min_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$$

  • Sketched solution: $\hat{\mathbf{w}}$
SLIDE 7

Approximate Ridge Regression

$$\min_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$$

  • Sketched solution: $\hat{\mathbf{w}}$
  • Sketch size $s = O(d/\epsilon)$

[Plot: approximation error versus sketch size]
SLIDE 8

Approximate Ridge Regression

$$\min_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$$

  • Sketched solution: $\hat{\mathbf{w}}$
  • Sketch size $s = O(d/\epsilon)$
  • $f(\hat{\mathbf{w}}) \le (1 + \epsilon) \min_{\mathbf{w}} f(\mathbf{w})$

[Plot: approximation error versus sketch size]

Optimization Perspective

SLIDE 9

Approximate Ridge Regression

$$\min_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$$

Statistical Perspective

  • Bias
  • Variance
SLIDE 10

Related Work

  • Least squares regression: $\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$

References

  • Drineas, Mahoney, and Muthukrishnan: Sampling algorithms for ℓ2 regression and applications. In SODA, 2006.
  • Drineas, Mahoney, Muthukrishnan, and Sarlos: Faster least squares approximation. Numerische Mathematik, 2011.
  • Clarkson and Woodruff: Low rank approximation and regression in input sparsity time. In STOC, 2013.
  • Ma, Mahoney, and Yu: A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 2015.
  • Pilanci and Wainwright: Iterative Hessian sketch: fast and accurate solution approximation for constrained least squares. Journal of Machine Learning Research, 2015.
  • Raskutti and Mahoney: A statistical perspective on randomized sketching for ordinary least-squares. Journal of Machine Learning Research, 2016.
  • Etc.
SLIDE 11

Sketched Ridge Regression

SLIDE 12

Matrix Sketching

$$\mathbf{X} \;\longrightarrow\; \mathbf{S}^\top\mathbf{X}$$

  • Turn a big matrix into a smaller one.
  • $\mathbf{X} \in \mathbb{R}^{n \times d} \;\Rightarrow\; \mathbf{S}^\top\mathbf{X} \in \mathbb{R}^{s \times d}$.
  • $\mathbf{S} \in \mathbb{R}^{n \times s}$ is called the sketching matrix, e.g.,
  • Uniform sampling
  • Leverage score sampling
  • Gaussian projection
  • Subsampled randomized Hadamard transform (SRHT)
  • Count sketch (sparse embedding)
  • Etc.
SLIDE 13

Matrix Sketching

$$\mathbf{X} \;\longrightarrow\; \mathbf{S}^\top\mathbf{X}$$

  • Some matrix sketching methods are efficient.
  • Time cost is $o(nds)$: lower than naive matrix multiplication.
  • Examples:
  • Leverage score sampling: $O(nd \log n)$ time
  • SRHT: $O(nd \log s)$ time
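
To make these operations concrete, here is a minimal NumPy sketch of two of the methods above, uniform row sampling and Gaussian projection. The function names and the scaling convention (chosen so that $\mathbb{E}[\mathbf{S}\mathbf{S}^\top] = \mathbf{I}_n$) are our own illustration, not code from the paper:

```python
import numpy as np

def uniform_sampling_sketch(X, y, s, rng):
    """Apply S^T to (X, y), where S selects s rows uniformly at random
    (with replacement), rescaled by sqrt(n/s) so that E[S S^T] = I_n."""
    n = X.shape[0]
    idx = rng.integers(0, n, size=s)
    scale = np.sqrt(n / s)
    return scale * X[idx], scale * y[idx]

def gaussian_sketch(X, y, s, rng):
    """Apply S^T to (X, y), where S has i.i.d. N(0, 1/s) entries.
    Forming S densely costs O(nds); fine for illustration only."""
    n = X.shape[0]
    S = rng.normal(0.0, 1.0 / np.sqrt(s), size=(n, s))
    return S.T @ X, S.T @ y
```
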
SLIDE 14

Ridge Regression

  • Objective function:

$$f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$$

  • Optimal solution:

$$\mathbf{w}^\star = \operatorname*{argmin}_{\mathbf{w}} f(\mathbf{w}) = \left(\mathbf{X}^\top\mathbf{X} + n\gamma\mathbf{I}_d\right)^{-1}\mathbf{X}^\top\mathbf{y}$$

  • Time cost: $O(nd^2)$
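
As a sanity check of this closed form, a minimal NumPy implementation (our own illustration; the function name is hypothetical):

```python
import numpy as np

def ridge_exact(X, y, gamma):
    """Optimal solution w* = (X^T X + n*gamma*I_d)^{-1} X^T y.
    Forming X^T X costs O(n d^2); the solve adds O(d^3)."""
    n, d = X.shape
    A = X.T @ X + n * gamma * np.eye(d)
    return np.linalg.solve(A, X.T @ y)  # solve; never invert explicitly
```
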
SLIDE 15

Sketched Ridge Regression

  • Goal: efficiently and approximately solve

$$\operatorname*{argmin}_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2 .$$

SLIDE 16

Sketched Ridge Regression

  • Goal: efficiently and approximately solve

$$\operatorname*{argmin}_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2 .$$

  • Approach: reduce the size of $\mathbf{X}$ and $\mathbf{y}$ by matrix sketching.
SLIDE 17

Sketched Ridge Regression

  • Sketched solution:

$$\hat{\mathbf{w}} = \operatorname*{argmin}_{\mathbf{w}} \; \frac{1}{n}\|\mathbf{S}^\top\mathbf{X}\mathbf{w} - \mathbf{S}^\top\mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2 = \left(\mathbf{X}^\top\mathbf{S}\mathbf{S}^\top\mathbf{X} + n\gamma\mathbf{I}_d\right)^{-1}\mathbf{X}^\top\mathbf{S}\mathbf{S}^\top\mathbf{y}$$

SLIDE 18

Sketched Ridge Regression

  • Sketched solution:

$$\hat{\mathbf{w}} = \operatorname*{argmin}_{\mathbf{w}} \; \frac{1}{n}\|\mathbf{S}^\top\mathbf{X}\mathbf{w} - \mathbf{S}^\top\mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2 = \left(\mathbf{X}^\top\mathbf{S}\mathbf{S}^\top\mathbf{X} + n\gamma\mathbf{I}_d\right)^{-1}\mathbf{X}^\top\mathbf{S}\mathbf{S}^\top\mathbf{y}$$

  • Time: $O(sd^2) + T_s$
  • $T_s$ is the cost of forming the sketch $\mathbf{S}^\top\mathbf{X}$
  • E.g. $T_s = O(nd \log s)$ for SRHT.
  • E.g. $T_s = O(nd \log n)$ for leverage score sampling.
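
Combining the two previous snippets gives the sketched solver (again our own illustration; it reuses `uniform_sampling_sketch` from above, and note that the regularizer keeps the factor $1/n$, not $1/s$, exactly as in the formula):

```python
import numpy as np

def ridge_sketched(X, y, gamma, s, sketch=uniform_sampling_sketch,
                   rng=np.random.default_rng(0)):
    """Sketched solution
    w_hat = (X^T S S^T X + n*gamma*I_d)^{-1} X^T S S^T y."""
    n, d = X.shape
    Xs, ys = sketch(X, y, s, rng)   # S^T X (s x d) and S^T y (s,)
    A = Xs.T @ Xs + n * gamma * np.eye(d)
    return np.linalg.solve(A, Xs.T @ ys)
```

With uniform sampling, `ridge_sketched(X, y, gamma, s)` only ever touches the $s$ sampled rows of $\mathbf{X}$ after the sketch is drawn.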

SLIDE 19

Theory: Optimization Perspective

SLIDE 20

Optimization Perspective

  • Recall the objective function $f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$.
  • Bound $f(\hat{\mathbf{w}}) - f(\mathbf{w}^\star)$.
  • $\frac{1}{n}\|\mathbf{X}\hat{\mathbf{w}} - \mathbf{X}\mathbf{w}^\star\|_2^2 \le f(\hat{\mathbf{w}}) - f(\mathbf{w}^\star)$, so a bound on the suboptimality also bounds the prediction difference; see the derivation below.
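
The last inequality deserves a one-line justification (our addition; it is implicit on the slide): $f$ is quadratic and $\nabla f(\mathbf{w}^\star) = \mathbf{0}$, so the second-order Taylor expansion around $\mathbf{w}^\star$ is exact:

$$f(\hat{\mathbf{w}}) - f(\mathbf{w}^\star) = \frac{1}{n}\|\mathbf{X}(\hat{\mathbf{w}} - \mathbf{w}^\star)\|_2^2 + \gamma\|\hat{\mathbf{w}} - \mathbf{w}^\star\|_2^2 \;\ge\; \frac{1}{n}\|\mathbf{X}\hat{\mathbf{w}} - \mathbf{X}\mathbf{w}^\star\|_2^2 .$$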

SLIDE 21

Optimization Perspective

For the sketching methods

  • SRHT or leverage sampling with $s = O\!\left(\frac{\beta d}{\epsilon}\right)$,
  • uniform sampling with $s = O\!\left(\frac{\mu \beta d \log d}{\epsilon}\right)$,

$f(\hat{\mathbf{w}}) - f(\mathbf{w}^\star) \le \epsilon\, f(\mathbf{w}^\star)$ holds w.p. 0.9.

  • $\mathbf{X} \in \mathbb{R}^{n \times d}$: the design matrix
  • $\gamma$: the regularization parameter
  • $\beta = \frac{\|\mathbf{X}\|_2^2}{n\gamma + \|\mathbf{X}\|_2^2} \in (0, 1]$
  • $\mu \in \left[1, \frac{n}{d}\right]$: the row coherence of $\mathbf{X}$

SLIDE 22

Optimization Perspective

For the sketching methods

  • SRHT or leverage sampling with $s = O\!\left(\frac{\beta d}{\epsilon}\right)$,
  • uniform sampling with $s = O\!\left(\frac{\mu \beta d \log d}{\epsilon}\right)$,

$f(\hat{\mathbf{w}}) - f(\mathbf{w}^\star) \le \epsilon\, f(\mathbf{w}^\star)$ holds w.p. 0.9, and therefore

$$\frac{1}{n}\|\mathbf{X}\hat{\mathbf{w}} - \mathbf{X}\mathbf{w}^\star\|_2^2 \le \epsilon\, f(\mathbf{w}^\star).$$

  • $\mathbf{X} \in \mathbb{R}^{n \times d}$: the design matrix
  • $\gamma$: the regularization parameter
  • $\beta = \frac{\|\mathbf{X}\|_2^2}{n\gamma + \|\mathbf{X}\|_2^2} \in (0, 1]$
  • $\mu \in \left[1, \frac{n}{d}\right]$: the row coherence of $\mathbf{X}$

SLIDE 23

Theory: Statistical Perspective

SLIDE 24

Statistical Model

  • $\mathbf{X} \in \mathbb{R}^{n \times d}$: fixed design matrix
  • $\mathbf{w}_0 \in \mathbb{R}^d$: the true and unknown model
  • $\mathbf{y} = \mathbf{X}\mathbf{w}_0 + \boldsymbol{\varepsilon}$: observed response vector
  • $\varepsilon_1, \cdots, \varepsilon_n$ are random noise
  • $\mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0}$ and $\mathbb{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^\top] = \xi^2 \mathbf{I}_n$

SLIDE 25

Bias-Variance Decomposition

  • Risk: $R(\mathbf{w}) = \frac{1}{n}\mathbb{E}\|\mathbf{X}\mathbf{w} - \mathbf{X}\mathbf{w}_0\|_2^2$
  • $\mathbb{E}$ is taken w.r.t. the random noise $\boldsymbol{\varepsilon}$.
SLIDE 26

Bias-Variance Decomposition

  • Risk: $R(\mathbf{w}) = \frac{1}{n}\mathbb{E}\|\mathbf{X}\mathbf{w} - \mathbf{X}\mathbf{w}_0\|_2^2$
  • $\mathbb{E}$ is taken w.r.t. the random noise $\boldsymbol{\varepsilon}$.
  • Risk measures prediction error.
SLIDE 27

Bias-Variance Decomposition

  • Risk: $R(\mathbf{w}) = \frac{1}{n}\mathbb{E}\|\mathbf{X}\mathbf{w} - \mathbf{X}\mathbf{w}_0\|_2^2$
  • $R(\mathbf{w}) = \mathrm{bias}^2(\mathbf{w}) + \mathrm{var}(\mathbf{w})$
SLIDE 28

Bias-Variance Decomposition

  • Risk: $R(\mathbf{w}) = \frac{1}{n}\mathbb{E}\|\mathbf{X}\mathbf{w} - \mathbf{X}\mathbf{w}_0\|_2^2$
  • $R(\mathbf{w}) = \mathrm{bias}^2(\mathbf{w}) + \mathrm{var}(\mathbf{w})$

Optimal solution:

  • $\mathrm{bias}(\mathbf{w}^\star) = \sqrt{n}\,\gamma\,\left\|\left(\boldsymbol{\Sigma}^2 + n\gamma\mathbf{I}_d\right)^{-1}\boldsymbol{\Sigma}\mathbf{V}^\top\mathbf{w}_0\right\|_2$,
  • $\mathrm{var}(\mathbf{w}^\star) = \frac{\xi^2}{n}\left\|\left(\mathbf{I}_d + n\gamma\boldsymbol{\Sigma}^{-2}\right)^{-1}\right\|_F^2$.

Sketched solution:

  • $\mathrm{bias}(\hat{\mathbf{w}}) = \sqrt{n}\,\gamma\,\left\|\left(\boldsymbol{\Sigma}\mathbf{U}^\top\mathbf{S}\mathbf{S}^\top\mathbf{U}\boldsymbol{\Sigma} + n\gamma\mathbf{I}_d\right)^{-1}\boldsymbol{\Sigma}\mathbf{V}^\top\mathbf{w}_0\right\|_2$,
  • $\mathrm{var}(\hat{\mathbf{w}}) = \frac{\xi^2}{n}\left\|\left(\mathbf{U}^\top\mathbf{S}\mathbf{S}^\top\mathbf{U} + n\gamma\boldsymbol{\Sigma}^{-2}\right)^{-1}\mathbf{U}^\top\mathbf{S}\mathbf{S}^\top\right\|_F^2$.

  • Here $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ is the SVD.
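
These closed forms are straightforward to check numerically. A minimal NumPy transcription of all four formulas (our own illustration; it works with an explicit dense sketching matrix, which is fine for verification but not for large-scale use):

```python
import numpy as np

def bias_variance(X, w0, gamma, xi, S=None):
    """bias and var of ridge via the SVD X = U diag(sig) V^T.
    S=None -> optimal solution w*;  S (n x s) -> sketched solution w_hat.
    xi**2 is the noise variance."""
    n, d = X.shape
    U, sig, Vt = np.linalg.svd(X, full_matrices=False)
    if S is None:
        C = np.eye(d)   # U^T S S^T U reduces to I_d for the exact solution
        T = np.eye(n)   # S S^T reduces to I_n
    else:
        C = U.T @ S @ S.T @ U
        T = S @ S.T
    M1 = np.diag(sig) @ C @ np.diag(sig) + n * gamma * np.eye(d)
    bias = np.sqrt(n) * gamma * np.linalg.norm(
        np.linalg.solve(M1, sig * (Vt @ w0)))
    M2 = C + n * gamma * np.diag(sig ** -2.0)
    var = (xi ** 2 / n) * np.linalg.norm(
        np.linalg.solve(M2, U.T @ T), 'fro') ** 2
    return bias, var
```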

SLIDE 29

Statistical Perspective

For the sketching methods

  • SRHT or leverage sampling with $s = O\!\left(\frac{d}{\epsilon^2}\right)$,
  • uniform sampling with $s = O\!\left(\frac{\mu d \log d}{\epsilon^2}\right)$,

the following hold w.p. 0.9:

$$1 - \epsilon \;\le\; \frac{\mathrm{bias}(\hat{\mathbf{w}})}{\mathrm{bias}(\mathbf{w}^\star)} \;\le\; 1 + \epsilon, \qquad (1 - \epsilon)\,\frac{n}{s} \;\le\; \frac{\mathrm{var}(\hat{\mathbf{w}})}{\mathrm{var}(\mathbf{w}^\star)} \;\le\; (1 + \epsilon)\,\frac{n}{s}.$$

  • $\mathbf{X} \in \mathbb{R}^{n \times d}$: the design matrix
  • $\mu \in \left[1, \frac{n}{d}\right]$: the row coherence of $\mathbf{X}$

The bias ratio is good; the variance ratio is bad, because $n \gg s$.

SLIDE 30

Statistical Perspective

For the sketching methods

  • SRHT or leverage sampling with $s = O\!\left(\frac{d}{\epsilon^2}\right)$,
  • uniform sampling with $s = O\!\left(\frac{\mu d \log d}{\epsilon^2}\right)$,

the following hold w.p. 0.9:

$$1 - \epsilon \;\le\; \frac{\mathrm{bias}(\hat{\mathbf{w}})}{\mathrm{bias}(\mathbf{w}^\star)} \;\le\; 1 + \epsilon, \qquad (1 - \epsilon)\,\frac{n}{s} \;\le\; \frac{\mathrm{var}(\hat{\mathbf{w}})}{\mathrm{var}(\mathbf{w}^\star)} \;\le\; (1 + \epsilon)\,\frac{n}{s}.$$

  • $\mathbf{X} \in \mathbb{R}^{n \times d}$: the design matrix
  • $\mu \in \left[1, \frac{n}{d}\right]$: the row coherence of $\mathbf{X}$

If $\mathbf{y}$ is noisy, the variance dominates the bias, and $R(\hat{\mathbf{w}}) \gg R(\mathbf{w}^\star)$.

SLIDE 31

Conclusions

  • Use the sketched solution to initialize numerical optimization.
  • $\mathbf{X}\hat{\mathbf{w}}$ is close to $\mathbf{X}\mathbf{w}^\star$.

Optimization Perspective

SLIDE 32

Conclusions

  • Use the sketched solution to initialize numerical optimization.
  • $\mathbf{X}\hat{\mathbf{w}}$ is close to $\mathbf{X}\mathbf{w}^\star$.
  • $\mathbf{w}^{(t)}$: output of the $t$-th iteration of the conjugate gradient (CG) algorithm.

$$\frac{\|\mathbf{X}\mathbf{w}^{(t)} - \mathbf{X}\mathbf{w}^\star\|_2^2}{\|\mathbf{X}\mathbf{w}^{(0)} - \mathbf{X}\mathbf{w}^\star\|_2^2} \;\le\; 2\left(\frac{\kappa(\mathbf{X}^\top\mathbf{X}) - 1}{\kappa(\mathbf{X}^\top\mathbf{X}) + 1}\right)^{t}.$$

  • Initialization is important.

Optimization Perspective
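
For instance, with SciPy's conjugate gradient the warm start is just the `x0` argument (a minimal sketch under our own naming; the slides do not prescribe this exact code):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def ridge_cg(X, y, gamma, w_init=None):
    """Solve (X^T X + n*gamma*I_d) w = X^T y by CG,
    warm-started at w_init (e.g. the sketched solution w_hat)."""
    n, d = X.shape
    A = LinearOperator((d, d),
                       matvec=lambda v: X.T @ (X @ v) + n * gamma * v)
    w, info = cg(A, X.T @ y, x0=w_init)  # info == 0 means converged
    return w
```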

SLIDE 33

Conclusions

  • Use the sketched solution to initialize numerical optimization.
  • $\mathbf{X}\hat{\mathbf{w}}$ is close to $\mathbf{X}\mathbf{w}^\star$.
  • Never use the sketched solution to replace the optimal solution.
  • Much higher variance → bad generalization.

Optimization Perspective / Statistical Perspective

SLIDE 34

Model Averaging

SLIDE 35

Model Averaging

  • Independently draw $\mathbf{S}_1, \cdots, \mathbf{S}_g$.
  • Compute the sketched solutions $\hat{\mathbf{w}}_1, \cdots, \hat{\mathbf{w}}_g$.
  • Model averaging: $\bar{\mathbf{w}} = \frac{1}{g}\sum_{i=1}^{g} \hat{\mathbf{w}}_i$.
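In code, model averaging is one step on top of the sketched solver from Slide 18 (our own illustration, reusing `ridge_sketched` and `uniform_sampling_sketch` from above):

```python
import numpy as np

def ridge_model_averaging(X, y, gamma, s, g,
                          sketch=uniform_sampling_sketch):
    """Draw g independent sketches, solve g sketched problems,
    and return w_bar = (1/g) * sum_i w_hat_i."""
    w_hats = [ridge_sketched(X, y, gamma, s, sketch,
                             np.random.default_rng(i))
              for i in range(g)]
    return np.mean(w_hats, axis=0)
```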

SLIDE 36

Optimization Perspective

  • For sufficiently large $s$,

$$\frac{f(\hat{\mathbf{w}}) - f(\mathbf{w}^\star)}{f(\mathbf{w}^\star)} \;\le\; \epsilon \quad \text{holds w.h.p.} \qquad \text{(without model averaging)}$$

SLIDES 37-38

Optimization Perspective

  • For sufficiently large $s$,

$$\frac{f(\hat{\mathbf{w}}) - f(\mathbf{w}^\star)}{f(\mathbf{w}^\star)} \;\le\; \epsilon \quad \text{holds w.h.p.} \qquad \text{(without model averaging)}$$

  • Using the same matrix sketching and the same $s$,

$$\frac{f(\bar{\mathbf{w}}) - f(\mathbf{w}^\star)}{f(\mathbf{w}^\star)} \;\le\; \frac{\epsilon}{g} + \epsilon^2 \quad \text{holds w.h.p.} \qquad \text{(with model averaging)}$$

SLIDE 39

Optimization Perspective

  • For sufficiently large $s$,

$$\frac{f(\hat{\mathbf{w}}) - f(\mathbf{w}^\star)}{f(\mathbf{w}^\star)} \;\le\; \epsilon \quad \text{holds w.h.p.} \qquad \text{(without model averaging)}$$

  • Using the same matrix sketching and the same $s$,

$$\frac{f(\bar{\mathbf{w}}) - f(\mathbf{w}^\star)}{f(\mathbf{w}^\star)} \;\le\; \frac{\epsilon}{g} + \epsilon^2 \quad \text{holds w.h.p.} \qquad \text{(with model averaging)}$$

  • If $s \gg d$, then $\epsilon^2$ is very small, and the error bound is $\propto \frac{\epsilon}{g}$.
SLIDE 40

Statistical Perspective

  • Risk: $R(\mathbf{w}) = \frac{1}{n}\mathbb{E}\|\mathbf{X}\mathbf{w} - \mathbf{X}\mathbf{w}_0\|_2^2 = \mathrm{bias}^2(\mathbf{w}) + \mathrm{var}(\mathbf{w})$
  • Model averaging:
  • $\mathrm{bias}(\bar{\mathbf{w}}) = \sqrt{n}\,\gamma\,\left\|\frac{1}{g}\sum_{i=1}^{g}\left(\boldsymbol{\Sigma}\mathbf{U}^\top\mathbf{S}_i\mathbf{S}_i^\top\mathbf{U}\boldsymbol{\Sigma} + n\gamma\mathbf{I}_d\right)^{-1}\boldsymbol{\Sigma}\mathbf{V}^\top\mathbf{w}_0\right\|_2$,
  • $\mathrm{var}(\bar{\mathbf{w}}) = \frac{\xi^2}{n}\left\|\frac{1}{g}\sum_{i=1}^{g}\left(\mathbf{U}^\top\mathbf{S}_i\mathbf{S}_i^\top\mathbf{U} + n\gamma\boldsymbol{\Sigma}^{-2}\right)^{-1}\mathbf{U}^\top\mathbf{S}_i\mathbf{S}_i^\top\right\|_F^2$.
  • Here $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ is the SVD.
SLIDES 41-43

Statistical Perspective

  • For sufficiently large $s$, the following hold w.h.p. (without model averaging):

$$\frac{\mathrm{bias}(\hat{\mathbf{w}})}{\mathrm{bias}(\mathbf{w}^\star)} \le 1 + \epsilon \quad \text{and} \quad \frac{\mathrm{var}(\hat{\mathbf{w}})}{\mathrm{var}(\mathbf{w}^\star)} \le \frac{n}{s}\,(1 + \epsilon).$$

  • Using the same sketching methods and the same $s$, the following hold w.h.p. (with model averaging):

$$\frac{\mathrm{bias}(\bar{\mathbf{w}})}{\mathrm{bias}(\mathbf{w}^\star)} \le 1 + \epsilon \quad \text{and} \quad \frac{\mathrm{var}(\bar{\mathbf{w}})}{\mathrm{var}(\mathbf{w}^\star)} \lesssim \frac{n}{s}\left(\frac{1}{g} + \epsilon^2\right).$$

SLIDE 44

Statistical Perspective

  • For sufficiently large $s$, the following hold w.h.p. (without model averaging):

$$\frac{\mathrm{bias}(\hat{\mathbf{w}})}{\mathrm{bias}(\mathbf{w}^\star)} \le 1 + \epsilon \quad \text{and} \quad \frac{\mathrm{var}(\hat{\mathbf{w}})}{\mathrm{var}(\mathbf{w}^\star)} \le \frac{n}{s}\,(1 + \epsilon).$$

  • Using the same sketching methods and the same $s$, the following hold w.h.p. (with model averaging):

$$\frac{\mathrm{bias}(\bar{\mathbf{w}})}{\mathrm{bias}(\mathbf{w}^\star)} \le 1 + \epsilon \quad \text{and} \quad \frac{\mathrm{var}(\bar{\mathbf{w}})}{\mathrm{var}(\mathbf{w}^\star)} \lesssim \frac{n}{s}\left(\frac{1}{g} + \epsilon^2\right).$$

  • If $\epsilon$ is small, then $\mathrm{var}(\bar{\mathbf{w}}) \propto \frac{1}{g}$.
SLIDE 45

Applications to Distributed Optimization

  • $(\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_n, y_n)$ are (randomly) split among $g$ machines.
  • Equivalent to uniform sampling with $s = \frac{n}{g}$.
SLIDE 46

Optimization Perspective

  • Application to distributed optimization:
  • If $s = \frac{n}{g} \gg d$, $\bar{\mathbf{w}}$ is very close to $\mathbf{w}^\star$ (provably).
  • $\bar{\mathbf{w}}$ is a good initialization for distributed optimization algorithms.
SLIDE 47

Statistical Perspective

  • Application to distributed machine learning:
  • If $s = \frac{n}{g} \gg d$, then $R(\bar{\mathbf{w}})$ is comparable to $R(\mathbf{w}^\star)$.
  • If a low-precision solution suffices, then $\bar{\mathbf{w}}$ is a good substitute for $\mathbf{w}^\star$.
  • One-shot solution.
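
A minimal simulation of the one-shot scheme (our own illustration): each machine solves ridge regression on its local $n/g$ rows, which coincides with the uniform-sampling sketched solution at $s = n/g$, and the driver averages the $g$ local solutions:

```python
import numpy as np

def one_shot_distributed_ridge(X, y, gamma, g,
                               rng=np.random.default_rng(0)):
    """Randomly partition the n rows among g 'machines', solve a local
    ridge problem on each, and average the g local solutions."""
    n, d = X.shape
    w_locals = []
    for part in np.array_split(rng.permutation(n), g):
        Xi, yi = X[part], y[part]
        A = Xi.T @ Xi + len(part) * gamma * np.eye(d)  # local n is s = n/g
        w_locals.append(np.linalg.solve(A, Xi.T @ yi))
    return np.mean(w_locals, axis=0)
```
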
SLIDE 48

Thank You!

The paper is at arXiv:1702.04837