SLIDE 1
On Coresets for Regularized Regression
ICML 2020
Rachit Chhaya, Anirban Dasgupta and Supratim Shit
IIT Gandhinagar
June 15, 2020
SLIDE 2
Motivation
◮ Coresets: a small summary of the data that serves as a proxy for the original data with respect to some cost function
◮ Smaller coresets for ridge regression were shown by [ACW17]
◮ Coresets for regularized regression with a general p-norm have not been studied before
SLIDE 3
Our Contributions
◮ No coreset for $\min_{x \in \mathbb{R}^d} \|Ax - b\|_p^r + \lambda\|x\|_q^s$, where $r \neq s$, can be smaller than a coreset for $\min_{x \in \mathbb{R}^d} \|Ax - b\|_p^r$
◮ This implies that no coreset for Lasso can be smaller than one for least squares regression
◮ We introduce the modified Lasso and build a smaller coreset for it
◮ Coresets for ℓp regression with ℓp regularization, extending to multiple-response regression
◮ Empirical evaluations
SLIDE 4
Coresets
Definition: For $\epsilon > 0$, a dataset $A$, a non-negative function $f$ and a query space $Q$, $C$ is an $\epsilon$-coreset of $A$ if for all $q \in Q$,
$$|f_q(A) - f_q(C)| \le \epsilon\, f_q(A)$$
We construct coresets that are (rescaled) subsamples of the original data, as illustrated below.
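To make the guarantee concrete, here is a minimal numerical sketch (our illustration, not code from the paper) that measures the relative cost error of a rescaled uniform subsample under a least squares cost; an actual coreset biases the sampling, as described next.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
A = rng.normal(size=(n, d))

def cost(M, x):
    # f_x(M) = ||Mx||_2^2, a least squares cost for query x
    return np.linalg.norm(M @ x) ** 2

# Uniform rescaled subsample; real coresets bias the sampling instead
m = 2_000
idx = rng.choice(n, size=m, replace=False)
C = A[idx] * np.sqrt(n / m)  # rescale so the subsample cost is unbiased

x = rng.normal(size=d)  # one query
rel_err = abs(cost(A, x) - cost(C, x)) / cost(A, x)
print(f"relative error for this query: {rel_err:.4f}")
```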
SLIDE 5
Sensitivity [LS10]
Definition: The sensitivity of the $i$-th point of a dataset $X$ for a function $f$ and query space $Q$ is defined as
$$s_i = \sup_{q \in Q} \frac{f_q(x_i)}{\sum_{x' \in X} f_q(x')}$$
◮ It captures the highest fractional contribution of a point to the cost function
◮ It can be used to create coresets; the coreset size is a function of the sum of the sensitivities and the dimension of the query space (see the sketch below)
◮ Upper bounds on the sensitivities suffice [FL11, BFL16]
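A minimal sketch (our illustration, following the standard sensitivity-sampling recipe of [LS10, FL11] rather than code from the paper): sample each row with probability proportional to its sensitivity upper bound and rescale so the sampled cost stays unbiased.

```python
import numpy as np

def sensitivity_sample(A, s_bounds, m, p=2, seed=0):
    """Rescaled row sample of A driven by sensitivity upper bounds.

    A        : (n, d) data matrix
    s_bounds : (n,) array of upper bounds on the sensitivities s_i
    m        : number of rows to sample
    p        : the cost is a sum of p-th powers, so sampled rows are
               rescaled by (1 / (m * prob_i)) ** (1 / p)
    """
    rng = np.random.default_rng(seed)
    probs = s_bounds / s_bounds.sum()
    idx = rng.choice(len(A), size=m, p=probs)      # favor high-sensitivity rows
    scale = (1.0 / (m * probs[idx])) ** (1.0 / p)  # keeps the cost unbiased
    return A[idx] * scale[:, None]
```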
SLIDE 6
Coresets for Regularized Regression
◮ Regularization is important to prevent overfitting, improve numerical stability, induce sparsity, etc.
We are interested in the following problem: for $\lambda > 0$,
$$\min_{x \in \mathbb{R}^d} \|Ax - b\|_p^r + \lambda\|x\|_q^s$$
for $p, q \ge 1$ and $r, s > 0$. A coreset for this problem is a pair $(\tilde{A}, \tilde{b})$ such that for all $x \in \mathbb{R}^d$ and all $\lambda > 0$,
$$\|\tilde{A}x - \tilde{b}\|_p^r + \lambda\|x\|_q^s \in (1 \pm \epsilon)\left(\|Ax - b\|_p^r + \lambda\|x\|_q^s\right)$$
SLIDE 7
Main Question
◮ Coresets for unregularized regression also work for the regularized counterpart
◮ [ACW17] showed a coreset for ridge regression using ridge leverage scores; it is smaller than coresets for least squares regression
◮ Intuition: regularization imposes a constraint on the solution space
◮ Can we expect all regularized problems to have smaller coresets than their unregularized versions, e.g. Lasso?
SLIDE 8
Our Main Result
Theorem: Given a matrix $A \in \mathbb{R}^{n \times d}$ and $\lambda > 0$, any coreset for the problem $\|Ax\|_p^r + \lambda\|x\|_q^s$, where $r \neq s$, $p, q \ge 1$ and $r, s > 0$, is also a coreset for $\|Ax\|_p^r$.
Implication: smaller coresets for the regularized problem are not obtained when $r \neq s$.
The popular Lasso problem ($r = 2$, $s = 1$) falls in this category and hence does not have a coreset smaller than one for least squares regression.
Proof by contradiction: roughly, because the data term and the regularizer scale differently in $x$ when $r \neq s$, suitable rescalings of the query isolate the data term, so the regularized guarantee forces the unregularized one.
SLIDE 9
Modified Lasso
$$\min_{x \in \mathbb{R}^d} \|Ax - b\|_2^2 + \lambda\|x\|_1^2$$
◮ Its constrained version is the same as that of Lasso
◮ Empirically shown to induce sparsity like Lasso
◮ Allows a smaller coreset than least squares regression (a solver sketch follows below)
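For concreteness, a minimal sketch (our illustration; the paper does not prescribe a solver) that minimizes the modified Lasso objective by subgradient descent. The step size and iteration count are arbitrary choices.

```python
import numpy as np

def modified_lasso(A, b, lam, iters=5000, lr=1e-3):
    """Subgradient descent on ||Ax - b||_2^2 + lam * ||x||_1^2."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = 2 * A.T @ (A @ x - b)                         # least squares gradient
        g += lam * 2 * np.linalg.norm(x, 1) * np.sign(x)  # subgradient of ||x||_1^2
        x -= lr * g
    return x

# Toy usage: a sparse ground truth is recovered approximately
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 10))
x_true = np.zeros(10)
x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true + 0.1 * rng.normal(size=200)
print(np.round(modified_lasso(A, b, lam=1.0), 2))
```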
SLIDE 10
Coreset for Modified Lasso
Theorem: Given a matrix $A \in \mathbb{R}^{n \times d}$ and a corresponding vector $b \in \mathbb{R}^n$, any coreset for the function $\|Ax - b\|_p^p + \lambda\|x\|_p^p$ is also a coreset for the function $\|Ax - b\|_p^p + \lambda\|x\|_q^p$, where $q \le p$ and $p, q \ge 1$.
◮ Implication: coresets for ridge regression also work for the modified Lasso
◮ With high probability this yields a coreset of size $O\!\left(\frac{sd_\lambda(A)\log sd_\lambda(A)}{\epsilon^2}\right)$ for the modified Lasso, where $sd_\lambda(A) = \sum_{j \in [d]} \frac{1}{1 + \lambda/\sigma_j^2} \le d$ (a computation sketch follows below)
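A small sketch (our illustration) computing the statistical dimension $sd_\lambda(A)$ from the singular values, together with the ridge leverage scores $\tau_i = a_i^\top (A^\top A + \lambda I)^{-1} a_i$ that drive the sampling in [ACW17]; their sum equals $sd_\lambda(A)$.

```python
import numpy as np

def statistical_dimension(A, lam):
    """sd_lambda(A) = sum_j 1 / (1 + lam / sigma_j^2)."""
    sigma = np.linalg.svd(A, compute_uv=False)
    return np.sum(1.0 / (1.0 + lam / sigma**2))

def ridge_leverage_scores(A, lam):
    """tau_i = a_i^T (A^T A + lam * I)^{-1} a_i for each row a_i."""
    d = A.shape[1]
    G_inv = np.linalg.inv(A.T @ A + lam * np.eye(d))
    return np.einsum("ij,jk,ik->i", A, G_inv, A)

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 20))
print(statistical_dimension(A, 5.0))        # at most d = 20
print(ridge_leverage_scores(A, 5.0).sum())  # equals sd_lambda(A)
```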
SLIDE 11
Coresets for ℓp Regression with ℓp Regularization
The ℓp regression with ℓp regularization problem is given as
$$\min_{x \in \mathbb{R}^d} \|Ax - b\|_p^p + \lambda\|x\|_p^p$$
Coresets for ℓp regression are constructed using a well-conditioned basis.
Well-Conditioned Basis [DDH+09]: A matrix $U$ is an $(\alpha, \beta, p)$ well-conditioned basis for $A$ if $\|U\|_p \le \alpha$ and, for all $x \in \mathbb{R}^d$, $\|x\|_q \le \beta\|Ux\|_p$, where $\frac{1}{p} + \frac{1}{q} = 1$. For example, for $p = 2$ an orthonormal basis is a $(\sqrt{d}, 1, 2)$ well-conditioned basis (verified in the sketch below).
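A quick numerical check (our illustration) of the $p = 2$ case: the orthonormal factor from a QR decomposition satisfies the definition with $\alpha = \sqrt{d}$ and $\beta = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 8
A = rng.normal(size=(n, d))
U, _ = np.linalg.qr(A)  # orthonormal basis for the column space of A

print(np.linalg.norm(U))  # entrywise 2-norm equals sqrt(d), so alpha = sqrt(d)
x = rng.normal(size=d)
print(np.linalg.norm(x), np.linalg.norm(U @ x))  # equal, so beta = 1 suffices
```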
SLIDE 12
◮ By sampling using the $p$-th power of the $p$-norm of the rows of an $(\alpha, \beta, p)$ well-conditioned basis of $[A, b]$, we can obtain a coreset of size $\tilde{O}((\alpha\beta)^p)$ with high probability for ℓp regression
◮ For ℓp regression with ℓp regularization we bound the sensitivities by
$$s_i \le \frac{\beta^p\|u_i\|_p^p}{1 + \lambda/\|A'\|_{(p)}^p} + \frac{1}{n}$$
(see the sketch below)
◮ The sum of the sensitivities is bounded by
$$S \le \frac{(\alpha\beta)^p}{1 + \lambda/\|A'\|_{(p)}^p} + 1$$
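Putting the pieces together, a hedged sketch (our illustration, with $p = 2$ for simplicity, a QR factor as the well-conditioned basis so $\beta = 1$, and $\|A'\|_{(p)}$ read as the entrywise $p$-norm of the augmented matrix, which is an assumption on our part) that turns the displayed bound into per-row sampling scores:

```python
import numpy as np

def sensitivity_bounds_l2(A, b, lam, beta=1.0):
    """Sensitivity upper bounds for l2 regression with l2 regularization,
    following the displayed bound with p = 2 and a QR basis (beta = 1)."""
    n = A.shape[0]
    Ab = np.column_stack([A, b])        # augmented matrix A' = [A, b]
    U, _ = np.linalg.qr(Ab)             # well-conditioned basis for p = 2
    row_norms_p = np.sum(U**2, axis=1)  # ||u_i||_2^2
    denom = 1.0 + lam / np.sum(Ab**2)   # 1 + lambda / ||A'||_(2)^2 (assumed entrywise norm)
    return beta**2 * row_norms_p / denom + 1.0 / n
```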
SLIDE 13
◮ The coreset size is $O\!\left(\frac{(\alpha\beta)^p\, d \log\frac{1}{\epsilon}}{\left(1 + \lambda/\|A'\|_{(p)}^p\right)\epsilon^2}\right)$ with high probability
◮ The coreset size is decreasing in $\lambda$
◮ Specifically, for the regularized least absolute deviation (RLAD) problem we get a coreset of size $O\!\left(\frac{d^{5/2}\log\frac{1}{\epsilon}}{\left(1 + \lambda/\|A'\|_{(1)}\right)\epsilon^2}\right)$
◮ The results also extend to multiple-response regularized regression
SLIDE 14
Empirical Results
Sparsity Induced by Modified Lasso
[Figure: number of zero coefficients vs. λ (ranging from 0.001 to 50) for Modified LASSO, LASSO, and Ridge Regression]
SLIDE 15
Comparison with Uniform Sampling
Matrix size: 100000 × 30; a matrix with non-uniform leverage scores [YMM15]

Table 1: Relative error at different coreset sizes for modified Lasso, λ = 0.5

Sample Size   Ridge Leverage Score Sampling   Uniform Sampling
30            0.059                           0.8289
50            0.044                           0.8289
100           0.031                           0.8286
150           0.028                           0.8286
200           0.013                           0.8287
SLIDE 16
Table 2: Relative error at different coreset sizes for RLAD, λ = 0.5

Sample Size   Sensitivity-Based Sampling   Uniform Sampling
30            0.69                         385.99
50            0.65                         112.70
100           0.34                         98.53
150           0.19                         96.09
200           0.17                         27.49
SLIDE 17
Conclusion and Future Work
◮ We present the first work on coresets for regularized regression with a general p-norm
Open Questions
◮ Tighter bounds on the sensitivity scores
◮ Coresets for other models with regularization and/or constraints
SLIDE 18
References I
[ACW17] Haim Avron, Kenneth L. Clarkson, and David P. Woodruff. Sharper bounds for regularized data fitting. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
[BFL16] Vladimir Braverman, Dan Feldman, and Harry Lang. New frameworks for offline and streaming coreset constructions. arXiv preprint arXiv:1612.00889, 2016.
[DDH+09] Anirban Dasgupta, Petros Drineas, Boulos Harb, Ravi Kumar, and Michael W. Mahoney. Sampling algorithms and coresets for ℓp regression. SIAM Journal on Computing, 38(5):2060–2078, 2009.
SLIDE 19
References II
[DMM06] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Sampling algorithms for ℓ2 regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1127–1136. Society for Industrial and Applied Mathematics, 2006.
[FL11] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 569–578. ACM, 2011.
[Hau95] David Haussler. Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217–232, 1995.
SLIDE 20
References III
[LS10] Michael Langberg and Leonard J. Schulman. Universal ε-approximators for integrals. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 598–607. SIAM, 2010.
[PW15] Mert Pilanci and Martin J. Wainwright. Randomized sketches of convex programs with sharp guarantees. IEEE Transactions on Information Theory, 61(9):5096–5115, 2015.
[YMM15] Jiyan Yang, Xiangrui Meng, and Michael W. Mahoney. Implementing randomized matrix algorithms in parallel and distributed environments. Proceedings of the IEEE, 104(1):58–92, 2015.