Sketched Ridge Regression: Optimization and Statistical Perspectives
Shusen Wang (UC Berkeley), Alex Gittens (RPI), Michael Mahoney (UC Berkeley)
Overview
Ridge Regression
- Over-determined: n ≫ d, with design matrix A ∈ ℝⁿˣᵈ.
- min_w f(w) = (1/n)‖Aw − y‖₂² + λ‖w‖₂²
Ridge Regression
min_w f(w) = (1/n)‖Aw − y‖₂² + λ‖w‖₂²
- Efficient and approximate solution?
- Use only part of the data?
Matrix sketching:
- Random selection
- Random projection
Approximate Ridge Regression
min_w f(w) = (1/n)‖Aw − y‖₂² + λ‖w‖₂²
- Sketched solution: ŵ
- Sketch size: s, with s ≪ n
- f(ŵ) ≤ (1 + ε)·min_w f(w)
[Plot: approximation error versus sketch size s]
Optimization Perspective
- Bound the suboptimality f(ŵ) − f(w*) of the objective f(w) = (1/n)‖Aw − y‖₂² + λ‖w‖₂².
Statistical Perspective
- Bias
- Variance
Related Work
- Least squares regression: min_w ‖Aw − y‖₂²
References
- Drineas, Mahoney, and Muthukrishnan: Sampling algorithms for ℓ₂ regression and applications. In SODA, 2006.
- Drineas, Mahoney, Muthukrishnan, and Sarlós: Faster least squares approximation. Numerische Mathematik, 2011.
- Clarkson and Woodruff: Low rank approximation and regression in input sparsity time. In STOC, 2013.
- Ma, Mahoney, and Yu: A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 2015.
- Pilanci and Wainwright: Iterative Hessian sketch: fast and accurate solution approximation for constrained least squares. Journal of Machine Learning Research, 2015.
- Raskutti and Mahoney: A statistical perspective on randomized sketching for ordinary least-squares. Journal of Machine Learning Research, 2016.
- Etc.
Sketched Ridge Regression
Matrix Sketching
A ↦ SᵀA
- Turn a big matrix into a smaller one.
- A ∈ ℝⁿˣᵈ ⟹ SᵀA ∈ ℝˢˣᵈ.
- S ∈ ℝⁿˣˢ is called the sketching matrix, e.g.,
  - Uniform sampling
  - Leverage score sampling
  - Gaussian projection
  - Subsampled randomized Hadamard transform (SRHT)
  - Count sketch (sparse embedding)
  - Etc.
Matrix Sketching
A ↦ SᵀA
- Some matrix sketching methods are efficient: the time cost is o(nds), lower than the cost of forming SᵀA by direct matrix multiplication.
- Examples (see the sketch below):
  - Leverage score sampling: O(nd log n) time
  - SRHT: O(nd log s) time
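To make the operation A ↦ SᵀA concrete, here is a minimal NumPy sketch (an illustration added here, not the speakers' code) of two of the simpler methods listed above: uniform row sampling and Gaussian projection. The scalings are chosen so that E[SSᵀ] = Iₙ.

```python
import numpy as np

def uniform_sampling_sketch(A, s, rng):
    """Sample s rows of A uniformly with replacement, rescaled by
    sqrt(n/s) so that E[S S^T] = I_n."""
    n = A.shape[0]
    idx = rng.integers(0, n, size=s)
    return np.sqrt(n / s) * A[idx]

def gaussian_sketch(A, s, rng):
    """S^T has i.i.d. N(0, 1/s) entries; S^T A is a random projection."""
    n = A.shape[0]
    St = rng.standard_normal((s, n)) / np.sqrt(s)
    return St @ A

rng = np.random.default_rng(0)
A = rng.standard_normal((10_000, 20))              # n = 10000 >> d = 20
print(uniform_sampling_sketch(A, 200, rng).shape)  # (200, 20)
print(gaussian_sketch(A, 200, rng).shape)          # (200, 20)
```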
Ridge Regression
- Objective function: f(w) = (1/n)‖Aw − y‖₂² + λ‖w‖₂²
- Optimal solution: w* = argmin_w f(w) = (AᵀA + nλI_d)⁻¹Aᵀy
- Time cost: O(nd²), or O(ndt) for t iterations of an iterative solver
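For reference, the closed-form solution above is a few lines of NumPy; this is a hedged sketch of the O(nd²) route (form AᵀA, then solve), not the speakers' code.

```python
import numpy as np

def ridge_exact(A, y, lam):
    """Exact ridge solution w* = (A^T A + n*lam*I_d)^{-1} A^T y."""
    n, d = A.shape
    return np.linalg.solve(A.T @ A + n * lam * np.eye(d), A.T @ y)
```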
Sketched Ridge Regression
- Goal: efficiently and approximately solve
  argmin_w f(w) = (1/n)‖Aw − y‖₂² + λ‖w‖₂².
- Approach: reduce the size of A and y by matrix sketching.
Sketched Ridge Regression
- Sketched solution:
  ŵ = argmin_w (1/n)‖SᵀAw − Sᵀy‖₂² + λ‖w‖₂²
    = (AᵀSSᵀA + nλI_d)⁻¹AᵀSSᵀy
- Time: O(sd²) + T_sketch
  - T_sketch is the cost of forming SᵀA.
  - E.g., T_sketch = O(nd log s) for SRHT.
  - E.g., T_sketch = O(nd log n) for leverage score sampling.
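Putting the previous snippets together, a minimal sketched-ridge implementation (again an added illustration, here using uniform row sampling as the sketch) looks like this; only the data term is sketched, while the nλ regularizer is kept exactly.

```python
import numpy as np

def sketched_ridge(A, y, lam, s, rng):
    """w_hat = (A^T S S^T A + n*lam*I_d)^{-1} A^T S S^T y,
    with S^T implemented as scaled uniform row sampling."""
    n, d = A.shape
    idx = rng.integers(0, n, size=s)
    scale = np.sqrt(n / s)
    SA, Sy = scale * A[idx], scale * y[idx]   # S^T A and S^T y
    return np.linalg.solve(SA.T @ SA + n * lam * np.eye(d), SA.T @ Sy)
```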
Theory: Optimization Perspective
Optimization Perspective
- Recall the objective function f(w) = (1/n)‖Aw − y‖₂² + λ‖w‖₂².
- Goal: bound f(ŵ) − f(w*).
- Note that (1/n)‖Aŵ − Aw*‖₂² ≤ f(ŵ) − f(w*), so a bound on the suboptimality also bounds the prediction difference.
Optimization Perspective
For the sketching methods
- SRHT or leverage sampling with s = π
R
jS T ,
- uniform sampling with s = π
k jS lmn S T
, π π±O β π π±β β€ ππ π±β holds w.p. 0.9.
- π β β^ΓS: the design matrix
- πΏ: the regularization parameter
- πΎ =
π p
p
^qr π p
p β (0, 1]
- π β 1,
^ S : the row coherence of π
Optimization Perspective
For the sketching methods
- SRHT or leverage sampling with s = π
R
jS T ,
- uniform sampling with s = π
k jS lmn S T
, π π±O β π π±β β€ ππ π±β holds w.p. 0.9.
i ^ ππ±O β ππ±β 7 7 β€ ππ π±β .
- π β β^ΓS: the design matrix
- πΏ: the regularization parameter
- πΎ =
π p
p
^qr π p
p β (0, 1]
- π β 1,
^ S : the row coherence of π
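The bound can be checked empirically. The snippet below (an added experiment, reusing the hypothetical ridge_exact and sketched_ridge helpers from the earlier snippets) prints f(ŵ)/f(w*) for growing sketch sizes; the ratio should drop toward 1 as s grows, consistent with f(ŵ) ≤ (1 + ε)·f(w*).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 100_000, 50, 1e-3
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + rng.standard_normal(n)

def f(w):
    # f(w) = (1/n) * ||Aw - y||_2^2 + lam * ||w||_2^2
    return np.mean((A @ w - y) ** 2) + lam * np.sum(w ** 2)

w_star = ridge_exact(A, y, lam)
for s in (500, 2_000, 8_000):
    w_hat = sketched_ridge(A, y, lam, s, rng)
    print(s, f(w_hat) / f(w_star))
```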
Theory: Statistical Perspective
Statistical Model
- A ∈ ℝⁿˣᵈ: fixed design matrix
- w₀ ∈ ℝᵈ: the true and unknown model
- y = Aw₀ + ε: observed response vector
- ε₁, …, εₙ are random noise with E[ε] = 0 and E[εεᵀ] = σ²Iₙ
Bias-Variance Decomposition
- Risk: R(w) = (1/n)·E‖Aw − Aw₀‖₂²
- The expectation E is taken w.r.t. the random noise ε.
- Risk measures prediction error.
- R(w) = bias²(w) + var(w)
Bias-Variance Decomposition
- Risk: R(w) = (1/n)·E‖Aw − Aw₀‖₂² = bias²(w) + var(w)
- Here A = UΣVᵀ is the SVD.
Optimal solution:
- bias(w*) = λ√n·‖(Σ² + nλI_d)⁻¹ΣVᵀw₀‖₂
- var(w*) = (σ²/n)·‖(I_d + nλΣ⁻²)⁻¹‖_F²
Sketched solution:
- bias(ŵ) = λ√n·‖(ΣUᵀSSᵀUΣ + nλI_d)⁻¹ΣVᵀw₀‖₂
- var(ŵ) = (σ²/n)·‖(UᵀSSᵀU + nλΣ⁻²)⁻¹UᵀSSᵀ‖_F²
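These closed-form expressions are straightforward to evaluate numerically. Below is a hedged reconstruction (my addition, following the formulas above) that computes bias and variance for the optimal and sketched solutions from the SVD A = UΣVᵀ and a sketching matrix given as Sᵀ.

```python
import numpy as np

def bias_var_optimal(sig, Vt, w0, lam, sigma_noise, n):
    """bias(w*) and var(w*) from the singular values sig of A = U Sigma V^T."""
    shrink = sig**2 / (sig**2 + n * lam)       # diag of (I + n*lam*Sigma^{-2})^{-1}
    bias = lam * np.sqrt(n) * np.linalg.norm(sig / (sig**2 + n * lam) * (Vt @ w0))
    var = (sigma_noise**2 / n) * np.sum(shrink**2)
    return bias, var

def bias_var_sketched(U, sig, Vt, w0, lam, sigma_noise, St):
    """Same quantities for the sketched solution, given S^T of shape (s, n)."""
    n = U.shape[0]
    SU = St @ U                                 # S^T U
    M = sig[:, None] * (SU.T @ SU) * sig[None, :] + n * lam * np.eye(len(sig))
    bias = lam * np.sqrt(n) * np.linalg.norm(np.linalg.solve(M, sig * (Vt @ w0)))
    G = np.linalg.solve(SU.T @ SU + n * lam * np.diag(sig**-2.0), SU.T @ St)
    var = (sigma_noise**2 / n) * np.linalg.norm(G) ** 2   # squared Frobenius norm
    return bias, var
```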
Statistical Perspective
For the sketching methods
- SRHT or leverage score sampling with s = O(d/ε²),
- uniform sampling with s = O(μd·log(d)/ε²),
the following hold w.p. 0.9:
  1 − ε ≤ bias(ŵ)/bias(w*) ≤ 1 + ε,
  (1 − ε)·(n/s) ≤ var(ŵ)/var(w*) ≤ (1 + ε)·(n/s).
Notation:
- A ∈ ℝⁿˣᵈ: the design matrix
- μ ∈ [1, n/d]: the row coherence of A
The bias ratio is good; the variance ratio is bad, because n ≫ s.
If y is noisy, the variance dominates the bias, so R(ŵ) ≫ R(w*).
Conclusions
Optimization perspective:
- Use the sketched solution to initialize numerical optimization: Aŵ is close to Aw*, and initialization is important.
- Let w⁽ᵗ⁾ be the output of the t-th iteration of the conjugate gradient (CG) algorithm. Then
  ‖Aw⁽ᵗ⁾ − Aw*‖₂² / ‖Aw⁽⁰⁾ − Aw*‖₂² ≤ 2·((√κ − 1)/(√κ + 1))ᵗ,
  where κ is the condition number; a good initialization w⁽⁰⁾ shrinks the starting error.
Statistical perspective:
- Never use the sketched solution to replace the optimal solution: its much higher variance means bad generalization.
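The CG warm start is easy to express with SciPy. A minimal sketch (my addition; the scipy.sparse.linalg calls are standard, but the setup is an assumption) that runs CG on the ridge normal equations (AᵀA + nλI)w = Aᵀy, starting from the sketched solution:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def ridge_cg(A, y, lam, w_init):
    """CG on (A^T A + n*lam*I) w = A^T y, warm-started at w_init
    (e.g., a sketched solution w_hat)."""
    n, d = A.shape
    # Matrix-free Hessian-vector product: v -> A^T (A v) + n*lam*v
    H = LinearOperator((d, d), matvec=lambda v: A.T @ (A @ v) + n * lam * v)
    w, info = cg(H, A.T @ y, x0=w_init)
    return w
```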
Model Averaging
Model Averaging
- Independently draw sketching matrices S₁, …, S_g.
- Compute the sketched solutions ŵ₁, …, ŵ_g.
- Model averaging: w̄ = (1/g)·Σᵢ₌₁ᵍ ŵᵢ.
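In code, model averaging is a one-liner on top of the earlier hypothetical sketched_ridge helper: draw g independent sketches and average the resulting solutions.

```python
import numpy as np

def averaged_sketched_ridge(A, y, lam, s, g, rng):
    """w_bar = (1/g) * sum_i w_hat_i, each from an independent sketch."""
    return np.mean([sketched_ridge(A, y, lam, s, rng) for _ in range(g)], axis=0)
```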
Optimization Perspective
- Without model averaging: for sufficiently large s,
  (f(ŵᵢ) − f(w*)) / f(w*) ≤ ε holds w.h.p.
- With model averaging, using the same matrix sketching and the same s,
  (f(w̄) − f(w*)) / f(w*) ≤ ε/g + ε² holds w.h.p.
- If s ≫ d, then ε² is very small, and the error bound is ≈ ε/g.
Statistical Perspective
- Risk: R(w) = (1/n)·E‖Aw − Aw₀‖₂² = bias²(w) + var(w)
- Model averaging:
  - bias(w̄) = λ√n·‖(1/g)·Σᵢ₌₁ᵍ (ΣUᵀSᵢSᵢᵀUΣ + nλI_d)⁻¹ΣVᵀw₀‖₂
  - var(w̄) = (σ²/n)·‖(1/g)·Σᵢ₌₁ᵍ (UᵀSᵢSᵢᵀU + nλΣ⁻²)⁻¹UᵀSᵢSᵢᵀ‖_F²
- Here A = UΣVᵀ is the SVD.
Statistical Perspective
- Without model averaging: for sufficiently large s, the following hold w.h.p.:
  bias(ŵ)/bias(w*) ≤ 1 + ε and var(ŵ)/var(w*) ≤ (n/s)·(1 + ε).
- With model averaging, using the same sketching methods and the same s, the following hold w.h.p.:
  bias(w̄)/bias(w*) ≤ 1 + ε and var(w̄)/var(w*) ≲ (n/s)·(1/g + ε).
- If ε is small, then var(w̄) ≈ (1/g)·var(ŵ): averaging over g sketches cuts the variance by roughly a factor of g.
Applications to Distributed Optimization
- (x₁, y₁), …, (xₙ, yₙ) are (randomly) split among m machines.
- Equivalent to uniform sampling with s = n/m.
Optimization Perspective
- Application to distributed optimization:
  - If s = n/m ≫ d, the averaged sketched solution w̄ is very close to w* (provably).
  - w̄ is a good initialization for distributed optimization algorithms.
Statistical Perspective
- Application to distributed machine learning:
  - If s = n/m ≫ d, then R(w̄) is comparable to R(w*).
  - If a low-precision solution suffices, then w̄ is a good substitute for w*.
  - This is a one-shot (single communication round) solution; a simulation sketch follows.
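The one-shot scheme can be simulated on a single machine. The sketch below (an added illustration under the slide's assumptions) randomly partitions the rows across m "machines", solves a local ridge problem on each part, and averages the solutions; each local problem is exactly the uniform-sampling sketched problem with s ≈ n/m.

```python
import numpy as np

def one_shot_distributed_ridge(A, y, lam, m, rng):
    """Each of m machines solves ridge on its local rows; average the results."""
    n, d = A.shape
    parts = np.array_split(rng.permutation(n), m)   # random split of the rows
    sols = [np.linalg.solve(A[p].T @ A[p] + len(p) * lam * np.eye(d),
                            A[p].T @ y[p])
            for p in parts]
    return np.mean(sols, axis=0)
```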