Sketched Ridge Regression: Optimization and Statistical Perspectives - PowerPoint PPT Presentation


  1. Sketched Ridge Regression: Optimization and Statistical Perspectives. Shusen Wang (UC Berkeley), Alex Gittens (RPI), Michael Mahoney (UC Berkeley)

  2. Overview

  3. Ridge Regression • Objective: $\min_{\mathbf{w}} f(\mathbf{w}) = \frac{1}{n} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma \|\mathbf{w}\|_2^2$, where $\mathbf{X} \in \mathbb{R}^{n \times d}$ • Over-determined: $n \gg d$

  4. Ridge Regression • $\min_{\mathbf{w}} f(\mathbf{w}) = \frac{1}{n} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma \|\mathbf{w}\|_2^2$, $\mathbf{X} \in \mathbb{R}^{n \times d}$ • Efficient and approximate solution? • Use only part of the data?

  5. Ridge Regression • $\min_{\mathbf{w}} f(\mathbf{w}) = \frac{1}{n} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma \|\mathbf{w}\|_2^2$ • Matrix sketching: • Random selection • Random projection

  6. Approximate Ridge Regression • $\min_{\mathbf{w}} f(\mathbf{w}) = \frac{1}{n} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma \|\mathbf{w}\|_2^2$ • Sketched solution: $\mathbf{w}^s$

  7. Approximate Ridge Regression • $\min_{\mathbf{w}} f(\mathbf{w}) = \frac{1}{n} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma \|\mathbf{w}\|_2^2$ • Sketched solution: $\mathbf{w}^s$ • Sketch size $s$ [figure: approximation quality vs. sketch size]

  8. Approximate Ridge Regression • $\min_{\mathbf{w}} f(\mathbf{w}) = \frac{1}{n} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma \|\mathbf{w}\|_2^2$ • Sketched solution: $\mathbf{w}^s$, sketch size $s$ • Optimization perspective: $f(\mathbf{w}^s) \le (1 + \epsilon) \min_{\mathbf{w}} f(\mathbf{w})$

  9. Approximate Ridge Regression • $\min_{\mathbf{w}} f(\mathbf{w}) = \frac{1}{n} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma \|\mathbf{w}\|_2^2$ • Statistical perspective: • Bias • Variance

  10. Related Work • Least squares regression: $\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$. References: • Drineas, Mahoney, and Muthukrishnan: Sampling algorithms for ℓ2 regression and applications. In SODA, 2006. • Drineas, Mahoney, Muthukrishnan, and Sarlos: Faster least squares approximation. Numerische Mathematik, 2011. • Clarkson and Woodruff: Low rank approximation and regression in input sparsity time. In STOC, 2013. • Ma, Mahoney, and Yu: A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 2015. • Pilanci and Wainwright: Iterative Hessian sketch: fast and accurate solution approximation for constrained least squares. Journal of Machine Learning Research, 2015. • Raskutti and Mahoney: A statistical perspective on randomized sketching for ordinary least-squares. Journal of Machine Learning Research, 2016. • Etc.

  11. Sketched Ridge Regression

  12. Matrix Sketching • Turn a big matrix into a smaller one: $\mathbf{X} \in \mathbb{R}^{n \times d} \to \mathbf{S}^T \mathbf{X} \in \mathbb{R}^{s \times d}$ • $\mathbf{S} \in \mathbb{R}^{n \times s}$ is called the sketching matrix, e.g., • Uniform sampling • Leverage score sampling • Gaussian projection • Subsampled randomized Hadamard transform (SRHT) • Count sketch (sparse embedding) • Etc.
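To make the operation $\mathbf{S}^T \mathbf{X}$ concrete, here is a minimal NumPy sketch of two of the listed operators, uniform row sampling and Gaussian projection. The function names and scaling conventions are illustrative choices, not from the talk.

```python
import numpy as np

def uniform_sampling_sketch(A, s, rng):
    """S^T A for uniform row sampling: keep s rows of A chosen uniformly
    at random, rescaled by sqrt(n/s) so that E[S S^T] = I_n."""
    n = A.shape[0]
    idx = rng.integers(0, n, size=s)
    return A[idx] * np.sqrt(n / s)

def gaussian_sketch(A, s, rng):
    """S^T A for Gaussian projection: S in R^{n x s} has i.i.d.
    N(0, 1/s) entries, so again E[S S^T] = I_n."""
    n = A.shape[0]
    S = rng.normal(0.0, 1.0 / np.sqrt(s), size=(n, s))
    return S.T @ A
```

Note that the sampling version never forms $\mathbf{S}$ explicitly; it only gathers and rescales rows.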

  13. Matrix Sketching • Some matrix sketching methods are efficient. • Time cost is $o(nds)$, lower than the cost of computing $\mathbf{S}^T \mathbf{X}$ by direct multiplication. • Examples: • Leverage score sampling: $O(nd \log n)$ time • SRHT: $O(nd \log s)$ time

  14. Ridge Regression • Objective function: $f(\mathbf{w}) = \frac{1}{n} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma \|\mathbf{w}\|_2^2$ • Optimal solution: $\mathbf{w}^\star = \operatorname{argmin}_{\mathbf{w}} f(\mathbf{w}) = (\mathbf{X}^T \mathbf{X} + n\gamma \mathbf{I}_d)^{-1} \mathbf{X}^T \mathbf{y}$ • Time cost: $O(nd^2)$
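For reference, a direct NumPy implementation of this closed form (a minimal sketch; the helper name `ridge_exact` is ours):

```python
import numpy as np

def ridge_exact(X, y, gamma):
    """Optimal solution w* = (X^T X + n*gamma*I_d)^{-1} X^T y.
    Forming X^T X costs O(n d^2); the d x d solve costs O(d^3)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * gamma * np.eye(d), X.T @ y)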

  15. Sketched Ridge Regression • Goal: efficiently and approximately solve $\operatorname{argmin}_{\mathbf{w}} f(\mathbf{w}) = \frac{1}{n} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma \|\mathbf{w}\|_2^2$.

  16. Sketched Ridge Regression • Goal: efficiently and approximately solve $\operatorname{argmin}_{\mathbf{w}} f(\mathbf{w}) = \frac{1}{n} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma \|\mathbf{w}\|_2^2$. • Approach: reduce the size of $\mathbf{X}$ and $\mathbf{y}$ by matrix sketching.

  17. Sketched Ridge Regression • Sketched solution: $\mathbf{w}^s = \operatorname{argmin}_{\mathbf{w}} \frac{1}{n} \|\mathbf{S}^T \mathbf{X}\mathbf{w} - \mathbf{S}^T \mathbf{y}\|_2^2 + \gamma \|\mathbf{w}\|_2^2 = (\mathbf{X}^T \mathbf{S}\mathbf{S}^T \mathbf{X} + n\gamma \mathbf{I}_d)^{-1} \mathbf{X}^T \mathbf{S}\mathbf{S}^T \mathbf{y}$

  18. Sketched Ridge Regression • Sketched solution: $\mathbf{w}^s = \operatorname{argmin}_{\mathbf{w}} \frac{1}{n} \|\mathbf{S}^T \mathbf{X}\mathbf{w} - \mathbf{S}^T \mathbf{y}\|_2^2 + \gamma \|\mathbf{w}\|_2^2 = (\mathbf{X}^T \mathbf{S}\mathbf{S}^T \mathbf{X} + n\gamma \mathbf{I}_d)^{-1} \mathbf{X}^T \mathbf{S}\mathbf{S}^T \mathbf{y}$ • Time: $O(sd^2) + T_s$, where $T_s$ is the cost of sketching $\mathbf{S}^T \mathbf{X}$ • E.g. $T_s = O(nd \log s)$ for SRHT • E.g. $T_s = O(nd \log n)$ for leverage score sampling
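Putting the pieces together, here is a hedged sketch of the sketched solver. It applies one sketch to the concatenation $[\mathbf{X}, \mathbf{y}]$ so the same $\mathbf{S}$ hits both, and it reuses the illustrative sketch functions defined earlier.

```python
import numpy as np

def ridge_sketched(X, y, gamma, sketch, s, rng):
    """Sketched solution w^s = (X^T S S^T X + n*gamma*I_d)^{-1} X^T S S^T y.
    The sketch is applied once to [X, y]; then a small s x d ridge
    problem is solved in O(s d^2 + d^3) time."""
    n, d = X.shape
    SXy = sketch(np.column_stack([X, y]), s, rng)  # same S for X and y
    SX, Sy = SXy[:, :d], SXy[:, d]
    return np.linalg.solve(SX.T @ SX + n * gamma * np.eye(d), SX.T @ Sy)
```

Note the regularizer stays $n\gamma$, not $s\gamma$, matching the slide's formula for $\mathbf{w}^s$.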

  19. Theory: Optimization Perspective

  20. Optimization Perspective • Recall the objective function $f(\mathbf{w}) = \frac{1}{n} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma \|\mathbf{w}\|_2^2$. • Bound $f(\mathbf{w}^s) - f(\mathbf{w}^\star)$. • Note that $\frac{1}{n} \|\mathbf{X}\mathbf{w}^s - \mathbf{X}\mathbf{w}^\star\|_2^2 \le f(\mathbf{w}^s) - f(\mathbf{w}^\star)$.

  21. Optimization Perspective • $\mathbf{X} \in \mathbb{R}^{n \times d}$: the design matrix • $\gamma$: the regularization parameter • $\beta = \frac{\|\mathbf{X}\|_2^2}{\|\mathbf{X}\|_2^2 + n\gamma} \in (0, 1]$ • $\mu \in [1, \frac{n}{d}]$: the row coherence of $\mathbf{X}$ • For SRHT or leverage score sampling with $s = O(\beta d / \epsilon)$, or uniform sampling with $s = O(\mu \beta d \log d / \epsilon)$: • $f(\mathbf{w}^s) - f(\mathbf{w}^\star) \le \epsilon \, f(\mathbf{w}^\star)$ holds w.p. 0.9.

  22. Optimization Perspective • $\mathbf{X} \in \mathbb{R}^{n \times d}$: the design matrix • $\gamma$: the regularization parameter • $\beta = \frac{\|\mathbf{X}\|_2^2}{\|\mathbf{X}\|_2^2 + n\gamma} \in (0, 1]$ • $\mu \in [1, \frac{n}{d}]$: the row coherence of $\mathbf{X}$ • For SRHT or leverage score sampling with $s = O(\beta d / \epsilon)$, or uniform sampling with $s = O(\mu \beta d \log d / \epsilon)$: • $f(\mathbf{w}^s) - f(\mathbf{w}^\star) \le \epsilon \, f(\mathbf{w}^\star)$ holds w.p. 0.9. • Consequently, $\frac{1}{n} \|\mathbf{X}\mathbf{w}^s - \mathbf{X}\mathbf{w}^\star\|_2^2 \le \epsilon \, f(\mathbf{w}^\star)$.
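A quick numerical sanity check of this bound, assuming the illustrative helpers `ridge_exact`, `ridge_sketched`, and `uniform_sampling_sketch` sketched earlier. A Gaussian design has low coherence, so uniform sampling is covered by the theorem; all sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 20_000, 20, 1e-3
X = rng.normal(size=(n, d))                      # low-coherence design
y = X @ rng.normal(size=d) + rng.normal(size=n)

def f(w):
    """The ridge objective f(w) = (1/n)||Xw - y||_2^2 + gamma*||w||_2^2."""
    return np.mean((X @ w - y) ** 2) + gamma * np.sum(w ** 2)

w_star = ridge_exact(X, y, gamma)
w_s = ridge_sketched(X, y, gamma, uniform_sampling_sketch, s=2_000, rng=rng)
print((f(w_s) - f(w_star)) / f(w_star))          # small for large enough s
```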

  23. Theory: Statistical Perspective

  24. Statistical Model • $\mathbf{X} \in \mathbb{R}^{n \times d}$: fixed design matrix • $\mathbf{w}_0 \in \mathbb{R}^d$: the true and unknown model • $\mathbf{y} = \mathbf{X}\mathbf{w}_0 + \boldsymbol{\varepsilon}$: observed response vector • $\varepsilon_1, \cdots, \varepsilon_n$ are random noise • $\mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0}$ and $\mathbb{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T] = \sigma^2 \mathbf{I}_n$
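Synthetic data matching this model can be generated as follows; the sizes and noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, sigma = 10_000, 50, 0.5
X = rng.normal(size=(n, d))                # fixed design matrix
w0 = rng.normal(size=d)                    # true, unknown model
y = X @ w0 + sigma * rng.normal(size=n)    # y = X w0 + noise, E[eps eps^T] = sigma^2 I_n
```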

  25. Bias-Variance Decomposition • Risk: $R(\mathbf{w}) = \frac{1}{n} \mathbb{E} \|\mathbf{X}\mathbf{w} - \mathbf{X}\mathbf{w}_0\|_2^2$ • $\mathbb{E}$ is taken w.r.t. the random noise $\boldsymbol{\varepsilon}$.

  26. Bias-Variance Decomposition • Risk: $R(\mathbf{w}) = \frac{1}{n} \mathbb{E} \|\mathbf{X}\mathbf{w} - \mathbf{X}\mathbf{w}_0\|_2^2$ • $\mathbb{E}$ is taken w.r.t. the random noise $\boldsymbol{\varepsilon}$. • Risk measures prediction error.

  27. Bias-Variance Decomposition • Risk: $R(\mathbf{w}) = \frac{1}{n} \mathbb{E} \|\mathbf{X}\mathbf{w} - \mathbf{X}\mathbf{w}_0\|_2^2$ • $R(\mathbf{w}) = \mathrm{bias}^2(\mathbf{w}) + \mathrm{var}(\mathbf{w})$

  28. Bias-Variance Decomposition • Risk: $R(\mathbf{w}) = \frac{1}{n} \mathbb{E} \|\mathbf{X}\mathbf{w} - \mathbf{X}\mathbf{w}_0\|_2^2$ • $R(\mathbf{w}) = \mathrm{bias}^2(\mathbf{w}) + \mathrm{var}(\mathbf{w})$ • Optimal solution: $\mathrm{bias}(\mathbf{w}^\star) = \sqrt{n}\,\gamma \,\|(\boldsymbol{\Sigma}^2 + n\gamma \mathbf{I}_d)^{-1} \boldsymbol{\Sigma} \mathbf{V}^T \mathbf{w}_0\|_2$, $\mathrm{var}(\mathbf{w}^\star) = \frac{\sigma^2}{n} \|(\mathbf{I}_d + n\gamma \boldsymbol{\Sigma}^{-2})^{-1}\|_F^2$ • Sketched solution: $\mathrm{bias}(\mathbf{w}^s) = \sqrt{n}\,\gamma \,\|(\boldsymbol{\Sigma} \mathbf{U}^T \mathbf{S}\mathbf{S}^T \mathbf{U} \boldsymbol{\Sigma} + n\gamma \mathbf{I}_d)^{-1} \boldsymbol{\Sigma} \mathbf{V}^T \mathbf{w}_0\|_2$, $\mathrm{var}(\mathbf{w}^s) = \frac{\sigma^2}{n} \|(\mathbf{U}^T \mathbf{S}\mathbf{S}^T \mathbf{U} + n\gamma \boldsymbol{\Sigma}^{-2})^{-1} \mathbf{U}^T \mathbf{S}\mathbf{S}^T\|_F^2$ • Here $\mathbf{X} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T$ is the SVD.
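These closed forms can be evaluated numerically from the SVD. Below is a minimal sketch for the sketched solution, given an explicit $n \times s$ sketching matrix `S`; the optimal-solution formulas are the special case $\mathbf{S}\mathbf{S}^T = \mathbf{I}_n$. The helper name is ours.

```python
import numpy as np

def bias_var_sketched(X, w0, gamma, sigma, S):
    """Evaluate the slide's bias/variance formulas for w^s."""
    n, d = X.shape
    U, sig, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(sig) V^T
    SU = S.T @ U                                        # s x d
    # bias(w^s) = sqrt(n)*gamma*||(Sig U^T S S^T U Sig + n*gamma*I)^{-1} Sig V^T w0||_2
    M = sig[:, None] * (SU.T @ SU) * sig[None, :] + n * gamma * np.eye(d)
    bias = np.sqrt(n) * gamma * np.linalg.norm(np.linalg.solve(M, sig * (Vt @ w0)))
    # var(w^s) = (sigma^2/n)*||(U^T S S^T U + n*gamma*Sig^{-2})^{-1} U^T S S^T||_F^2
    G = np.linalg.solve(SU.T @ SU + n * gamma * np.diag(sig ** -2.0), SU.T @ S.T)
    var = sigma ** 2 / n * np.linalg.norm(G, "fro") ** 2
    return bias, var
```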

  29. Statistical Perspective • $\mathbf{X} \in \mathbb{R}^{n \times d}$: the design matrix • $\mu \in [1, \frac{n}{d}]$: the row coherence of $\mathbf{X}$ • For SRHT or leverage score sampling with $s = O(d / \epsilon^2)$, or uniform sampling with $s = O(\mu d \log d / \epsilon^2)$, the following hold w.p. 0.9: • $1 - \epsilon \le \frac{\mathrm{bias}(\mathbf{w}^s)}{\mathrm{bias}(\mathbf{w}^\star)} \le 1 + \epsilon$. Good! • $(1 - \epsilon) \frac{n}{s} \le \frac{\mathrm{var}(\mathbf{w}^s)}{\mathrm{var}(\mathbf{w}^\star)} \le (1 + \epsilon) \frac{n}{s}$. Bad! Because $n \gg s$.

  30. Statistical Perspective • $\mathbf{X} \in \mathbb{R}^{n \times d}$: the design matrix • $\mu \in [1, \frac{n}{d}]$: the row coherence of $\mathbf{X}$ • For SRHT or leverage score sampling with $s = O(d / \epsilon^2)$, or uniform sampling with $s = O(\mu d \log d / \epsilon^2)$, the following hold w.p. 0.9: • $1 - \epsilon \le \frac{\mathrm{bias}(\mathbf{w}^s)}{\mathrm{bias}(\mathbf{w}^\star)} \le 1 + \epsilon$ • $(1 - \epsilon) \frac{n}{s} \le \frac{\mathrm{var}(\mathbf{w}^s)}{\mathrm{var}(\mathbf{w}^\star)} \le (1 + \epsilon) \frac{n}{s}$ • If $\mathbf{y}$ is noisy, the variance dominates the bias, so $R(\mathbf{w}^s) \gg R(\mathbf{w}^\star)$.

  31. Conclusions • Optimization perspective: $\mathbf{X}\mathbf{w}^s$ is close to $\mathbf{X}\mathbf{w}^\star$. • Use the sketched solution to initialize numerical optimization.

  32. Conclusions • Optimization perspective: $\mathbf{X}\mathbf{w}^s$ is close to $\mathbf{X}\mathbf{w}^\star$. • Use the sketched solution to initialize numerical optimization. • $\mathbf{w}_t$: output of the $t$-th iteration of the CG algorithm. • $\frac{\|\mathbf{X}\mathbf{w}_t - \mathbf{X}\mathbf{w}^\star\|_2}{\|\mathbf{X}\mathbf{w}_0 - \mathbf{X}\mathbf{w}^\star\|_2} \le 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^t$, where $\kappa$ is the condition number of $\mathbf{X}^T \mathbf{X}$. • Initialization is important.
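In code, warm-starting looks like the following sketch, which runs SciPy's conjugate gradient on the regularized normal equations starting from the sketched solution; `refine_with_cg` and the setup are ours, not from the talk.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def refine_with_cg(X, y, gamma, w_init):
    """Solve (X^T X + n*gamma*I_d) w = X^T y by CG, warm-started at the
    sketched solution w_init instead of the usual all-zeros start."""
    n, d = X.shape
    A = LinearOperator((d, d), matvec=lambda v: X.T @ (X @ v) + n * gamma * v)
    w, info = cg(A, X.T @ y, x0=w_init)
    return w
```

Because the CG error bound is relative to the starting error, a warm start at $\mathbf{w}^s$ (already close to $\mathbf{w}^\star$) means far fewer iterations than starting from zero.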

  33. Conclusions • Optimization perspective: $\mathbf{X}\mathbf{w}^s$ is close to $\mathbf{X}\mathbf{w}^\star$; use the sketched solution to initialize numerical optimization. • Statistical perspective: never use the sketched solution to replace the optimal solution. • Much higher variance leads to bad generalization.

  34. Model Averaging

  35. Model Averaging • Independently draw $\mathbf{S}_1, \cdots, \mathbf{S}_g$. • Compute the sketched solutions $\mathbf{w}_1^s, \cdots, \mathbf{w}_g^s$. • Model averaging: $\bar{\mathbf{w}} = \frac{1}{g} \sum_{i=1}^{g} \mathbf{w}_i^s$.
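A direct sketch of this procedure, reusing the illustrative `ridge_sketched` helper from earlier; the $g$ sketches are independent because each call draws fresh randomness from `rng`.

```python
import numpy as np

def ridge_model_average(X, y, gamma, sketch, s, g, rng):
    """Average g independently sketched solutions: w_bar = (1/g) sum_i w_i^s."""
    ws = [ridge_sketched(X, y, gamma, sketch, s, rng) for _ in range(g)]
    return np.mean(ws, axis=0)
```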

  36. Optimization Perspective • For sufficiently large $s$, without model averaging: • $\frac{f(\mathbf{w}^s) - f(\mathbf{w}^\star)}{f(\mathbf{w}^\star)} \le \epsilon$ holds w.h.p.

  37. Optimization Perspective • For sufficiently large $s$, without model averaging: $\frac{f(\mathbf{w}^s) - f(\mathbf{w}^\star)}{f(\mathbf{w}^\star)} \le \epsilon$ holds w.h.p. • Using the same matrix sketching and the same $s$, with model averaging: • $\frac{f(\bar{\mathbf{w}}) - f(\mathbf{w}^\star)}{f(\mathbf{w}^\star)} \le \frac{\epsilon}{g} + \epsilon^2$ holds w.h.p.

