SLIDE 1

On Coresets For Regularized Regression

ICML 2020

Rachit Chhaya, Anirban Dasgupta and Supratim Shit

IIT Gandhinagar

June 15, 2020

SLIDES 2–5

Motivation

◮ Coresets: a small summary of the data that serves as a proxy for the original data with respect to some cost function.
◮ Smaller coresets for ridge regression were shown by [ACW17].
◮ Coresets for regularized regression under a general p-norm have not been studied before.

SLIDES 6–9

Our Contributions

◮ No coreset for $\min_{x \in \mathbb{R}^d} \|Ax - b\|_p^r + \lambda \|x\|_q^s$, where $r \neq s$, is smaller in size than one for $\min_{x \in \mathbb{R}^d} \|Ax - b\|_p^r$.
◮ This implies no coreset for Lasso smaller in size than one for least squares regression.
◮ We introduce the modified Lasso and build a smaller coreset for it.
◮ Coresets for ℓp regression with ℓp regularization, with an extension to multiple-response regression.
◮ Empirical evaluations.

SLIDE 10

Coresets

Definition: For $\epsilon > 0$, a dataset $A$, a non-negative function $f$, and a query space $Q$, $C$ is an $\epsilon$-coreset of $A$ if $\forall q \in Q$

$$|f_q(A) - f_q(C)| \leq \epsilon \, f_q(A)$$

We construct coresets which are (rescaled) subsamples of the original data.
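To make the definition concrete, here is a minimal sketch in code (our own illustration, not the paper's construction; the data, sizes, and uniform sampling scheme are all hypothetical): a uniformly sampled, rescaled coreset for the least squares cost, checked against the $\epsilon$-coreset guarantee at a few random queries.

```python
import numpy as np

# Minimal sketch (illustration only, not the paper's construction): build a
# uniformly sampled, rescaled coreset for the least squares cost
# f_x(A, b) = ||Ax - b||_2^2 and check the epsilon-coreset guarantee.
rng = np.random.default_rng(0)
n, d, m = 10000, 5, 500                    # data size, dimension, coreset size
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

idx = rng.choice(n, size=m, replace=True)  # uniform sampling, p_i = 1/n
scale = np.sqrt(n / m)                     # rescaling keeps the cost unbiased
C, c = scale * A[idx], scale * b[idx]

for _ in range(3):
    x = rng.standard_normal(d)             # a random query
    full = np.sum((A @ x - b) ** 2)
    core = np.sum((C @ x - c) ** 2)
    print(f"|f_x(A) - f_x(C)| / f_x(A) = {abs(full - core) / full:.4f}")
```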
SLIDES 11–14

Sensitivity [LS10]

Definition: The sensitivity of the i-th point of a dataset $X$ for a function $f$ and query space $Q$ is defined as

$$s_i = \sup_{q \in Q} \frac{f_q(x_i)}{\sum_{x' \in X} f_q(x')}$$

◮ Determines the highest fractional contribution of a point to the cost function.
◮ Can be used to create coresets: the coreset size is a function of the sum of sensitivities and the dimension of the query space (see the sketch below).
◮ Upper bounds on the sensitivities suffice [FL11, BFL16].
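A sketch of sensitivity-based construction (our own illustration, not the paper's exact algorithm): for the least squares cost, the leverage scores of $[A, b]$ upper-bound the sensitivities, so sampling rows proportionally to them and rescaling yields a coreset.

```python
import numpy as np

# Sketch of sensitivity-based sampling (our own illustration): for
# f_x(a_i, b_i) = (a_i^T x - b_i)^2 the leverage scores of [A, b] are
# sensitivity upper bounds; sample rows proportionally and rescale by
# 1 / sqrt(m * p_i) so that costs stay unbiased.
rng = np.random.default_rng(1)
n, d, m = 10000, 5, 400
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

Ab = np.hstack([A, b[:, None]])
U, _, _ = np.linalg.svd(Ab, full_matrices=False)   # orthonormal basis of [A, b]
lev = np.sum(U ** 2, axis=1)                       # leverage score of row i
p = lev / lev.sum()                                # sampling distribution

idx = rng.choice(n, size=m, replace=True, p=p)
w = 1.0 / np.sqrt(m * p[idx])
C, c = w[:, None] * A[idx], w * b[idx]

x = np.linalg.lstsq(A, b, rcond=None)[0]           # query at the full-data optimum
full = np.sum((A @ x - b) ** 2)
core = np.sum((C @ x - c) ** 2)
print(f"relative error: {abs(full - core) / full:.4f}")
```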

SLIDES 15–17

Coresets for Regularized Regression

◮ Regularization is important to prevent overfitting, improve numerical stability, induce sparsity, etc.

We are interested in the following problem: for $\lambda > 0$,

$$\min_{x \in \mathbb{R}^d} \|Ax - b\|_p^r + \lambda \|x\|_q^s$$

for $p, q \geq 1$ and $r, s > 0$. A coreset for this problem is a pair $(\tilde{A}, \tilde{b})$ such that $\forall x \in \mathbb{R}^d$ and $\forall \lambda > 0$,

$$\|\tilde{A}x - \tilde{b}\|_p^r + \lambda \|x\|_q^s \in (1 \pm \epsilon) \left( \|Ax - b\|_p^r + \lambda \|x\|_q^s \right)$$

SLIDES 18–21

Main Question

◮ Coresets for unregularized regression also work for the regularized counterpart.
◮ [ACW17] showed a coreset for ridge regression using ridge leverage scores, smaller than coresets for least squares regression.
◮ Intuition: regularization imposes a constraint on the solution space.
◮ Can we expect all regularized problems to have smaller coresets than their unregularized versions, e.g. Lasso?

SLIDES 22–25

Our Main Result

Theorem: Given a matrix $A \in \mathbb{R}^{n \times d}$ and $\lambda > 0$, any coreset for the problem $\|Ax\|_p^r + \lambda \|x\|_q^s$, where $r \neq s$, $p, q \geq 1$ and $r, s > 0$, is also a coreset for $\|Ax\|_p^r$.

Implication: smaller coresets for the regularized problem are not obtained when $r \neq s$. The popular Lasso problem falls in this category and hence does not have a coreset smaller than one for least squares regression.

Proof by contradiction.

SLIDES 26–28

Modified Lasso

$$\min_{x \in \mathbb{R}^d} \|Ax - b\|_2^2 + \lambda \|x\|_1^2$$

◮ Its constrained version is the same as that of Lasso.
◮ Empirically shown to induce sparsity like Lasso (a toy demo follows below).
◮ Allows a smaller coreset than least squares regression.
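A rough demo of the modified Lasso objective (our own sketch; the slides prescribe no solver, and the data, step sizes, and thresholds here are all hypothetical): subgradient descent on $\|Ax - b\|_2^2 + \lambda \|x\|_1^2$, which is convex since the squared ℓ1 penalty is convex.

```python
import numpy as np

# Rough demo (our own sketch, not the paper's method): subgradient descent
# on the modified Lasso objective ||Ax - b||_2^2 + lam * ||x||_1^2.
rng = np.random.default_rng(2)
n, d, lam = 500, 20, 5.0
x_true = np.zeros(d)
x_true[:3] = [2.0, -1.5, 1.0]                     # sparse ground truth
A = rng.standard_normal((n, d))
b = A @ x_true + 0.1 * rng.standard_normal(n)

L = np.linalg.norm(A, 2) ** 2                     # spectral norm squared
x = np.zeros(d)
for t in range(1, 5001):
    grad = 2 * A.T @ (A @ x - b)                  # gradient of the data term
    sub = 2 * lam * np.abs(x).sum() * np.sign(x)  # subgradient of lam * ||x||_1^2
    x -= (grad + sub) * 0.5 / (L * np.sqrt(t))    # decaying step size

obj = np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum() ** 2
print(f"objective = {obj:.3f}, coefficients below 1e-2: {(np.abs(x) < 1e-2).sum()}")
```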

SLIDES 29–31

Coreset for Modified Lasso

Theorem: Given a matrix $A \in \mathbb{R}^{n \times d}$ and a corresponding vector $b \in \mathbb{R}^n$, any coreset for the function $\|Ax - b\|_p^p + \lambda \|x\|_p^p$ is also a coreset for the function $\|Ax - b\|_p^p + \lambda \|x\|_q^p$ where $q \leq p$ and $p, q \geq 1$.

◮ Implication: coresets for ridge regression also work for the modified Lasso.
◮ A coreset of size $O\left(\frac{\mathrm{sd}_\lambda(A) \log \mathrm{sd}_\lambda(A)}{\epsilon^2}\right)$ for the modified Lasso, with high probability.
◮ $\mathrm{sd}_\lambda(A) = \sum_{j \in [d]} \frac{1}{1 + \lambda / \sigma_j^2} \leq d$ (computed below).
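A short sketch of these quantities (our own illustration; the ridge leverage score formula follows the definition in [ACW17]): compute the scores via the SVD and check that their sum equals $\mathrm{sd}_\lambda(A)$.

```python
import numpy as np

# Sketch (illustration only): ridge leverage scores
# tau_i = a_i^T (A^T A + lam * I)^{-1} a_i via the SVD, and the statistical
# dimension sd_lam(A), which equals their sum and is at most d.
rng = np.random.default_rng(3)
n, d, lam = 5000, 10, 2.0
A = rng.standard_normal((n, d)) * np.arange(1, d + 1)     # non-uniform columns

U, S, _ = np.linalg.svd(A, full_matrices=False)
tau = np.sum(U ** 2 * (S ** 2 / (S ** 2 + lam)), axis=1)  # ridge leverage scores
sd_lam = np.sum(1.0 / (1.0 + lam / S ** 2))               # statistical dimension

print(f"sum of ridge leverage scores = {tau.sum():.3f}")
print(f"sd_lambda(A) = {sd_lam:.3f}  (<= d = {d})")
```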

SLIDE 32

Coresets for ℓp Regression with ℓp Regularization

The ℓp regression with ℓp regularization problem is

$$\min_{x \in \mathbb{R}^d} \|Ax - b\|_p^p + \lambda \|x\|_p^p$$

Coresets for ℓp regression are constructed using a well-conditioned basis.

Well-Conditioned Basis [DDH+09]: A matrix $U$ is an $(\alpha, \beta, p)$ well-conditioned basis for $A$ if $\|U\|_p \leq \alpha$ and $\forall x \in \mathbb{R}^d$, $\|x\|_q \leq \beta \|Ux\|_p$, where $\frac{1}{p} + \frac{1}{q} = 1$.
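A quick numerical illustration of the definition for the simplest case $p = 2$ (our own example, not from the slides): an orthonormal basis $U$ of $A$'s column space is a $(\sqrt{d}, 1, 2)$ well-conditioned basis, since its entrywise 2-norm is $\sqrt{d}$ and $\|x\|_2 = \|Ux\|_2$ for all $x$.

```python
import numpy as np

# Numerical check (illustration only): for p = 2, an orthonormal basis U of
# A's column space is (sqrt(d), 1, 2) well-conditioned.
rng = np.random.default_rng(6)
d = 8
A = rng.standard_normal((1000, d))
U, _, _ = np.linalg.svd(A, full_matrices=False)  # orthonormal columns

alpha = np.sqrt(np.sum(U ** 2))                  # entrywise 2-norm = sqrt(d)
x = rng.standard_normal(d)
print(f"alpha = {alpha:.3f} (sqrt(d) = {np.sqrt(d):.3f})")
print(f"||x||_2 = {np.linalg.norm(x):.3f}, ||Ux||_2 = {np.linalg.norm(U @ x):.3f}")
```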

SLIDES 33–34

◮ Sampling using the p-th power of the ℓp norms of the rows of an $(\alpha, \beta, p)$ well-conditioned basis of $[A, b]$, we can obtain a coreset of size $\tilde{O}((\alpha\beta)^p)$ with high probability for ℓp regression (see the sketch below).
◮ For ℓp regression with ℓp regularization we bound the sensitivities by
$$s_i \leq \frac{\beta^p \|u_i\|_p^p}{1 + \frac{\lambda}{\|A'\|_{(p)}^p}} + \frac{1}{n}$$
◮ The sum of sensitivities is bounded by
$$S \leq \frac{(\alpha\beta)^p}{1 + \frac{\lambda}{\|A'\|_{(p)}^p}} + 1$$
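A sketch of this sampling scheme for $p = 1$ (our own simplification with several assumptions: we use $Q = A'R^{-1}$ from a QR factorization as a cheap surrogate for a well-conditioned basis, drop the constant $\beta^p$ factor, and keep the regularization term exact while subsampling only the data term):

```python
import numpy as np

# Sketch for p = 1 (our own simplified variant, not the paper's algorithm):
# sample rows proportionally to sensitivity upper bounds built from a
# QR-based surrogate basis; the regularization term is kept exact.
rng = np.random.default_rng(4)
n, d, m, lam = 20000, 8, 500, 10.0
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

Ab = np.hstack([A, b[:, None]])
Q, _ = np.linalg.qr(Ab)                 # Q = A' R^{-1}, surrogate basis
row_norms = np.abs(Q).sum(axis=1)       # ||u_i||_1 for each row

denom = 1.0 + lam / np.abs(Ab).sum()    # the (1 + lam / ||A'||_(1)) factor
s_bound = row_norms / denom + 1.0 / n   # sensitivity upper bounds
prob = s_bound / s_bound.sum()

idx = rng.choice(n, size=m, replace=True, p=prob)
w = 1.0 / (m * prob[idx])               # rescaling keeps the l1 cost unbiased

x = rng.standard_normal(d)
full = np.abs(A @ x - b).sum() + lam * np.abs(x).sum()
core = (w * np.abs(A[idx] @ x - b[idx])).sum() + lam * np.abs(x).sum()
print(f"relative error: {abs(full - core) / full:.4f}")
```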

SLIDES 35–37

◮ The coreset size is $O\left(\frac{(\alpha\beta)^p \, d \log\frac{1}{\epsilon}}{\left(1 + \frac{\lambda}{\|A'\|_{(p)}^p}\right) \epsilon^2}\right)$ with high probability.
◮ The coreset size is decreasing in $\lambda$ (illustrated below).
◮ Specifically, for the regularized least absolute deviation (RLAD) problem we get a coreset of size $O\left(\frac{d^{5/2} \log\frac{1}{\epsilon}}{\left(1 + \frac{\lambda}{\|A'\|_{(1)}}\right) \epsilon^2}\right)$.
◮ The results also extend to multiple-response regularized regression.
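The $\lambda$-dependence is easy to see numerically (all constants below are hypothetical; we are only evaluating the shape of the bound):

```python
# Shape of the size bound (hypothetical constants): it scales as
# C / (1 + lam / ||A'||_(p)^p), which is decreasing in lam.
A_norm = 1000.0    # hypothetical value of ||A'||_(p)^p
C = 5e4            # hypothetical (alpha * beta)^p * d * log(1/eps) / eps^2
for lam in (0.0, 1.0, 10.0, 100.0, 1000.0):
    print(f"lambda = {lam:7.1f}  ->  size bound ~ {C / (1 + lam / A_norm):8.1f}")
```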
SLIDE 38

Empirical Results

Sparsity Induced by Modified Lasso

[Figure: number of zero coefficients (0–30) versus λ (0.001 to 50) for Modified Lasso, Lasso, and ridge regression.]

SLIDE 39

Comparison with Uniform Sampling

Matrix size: 100000 × 30, a matrix with non-uniform leverage scores [YMM15].

Table 1: Relative error at different coreset sizes for Modified Lasso, λ = 0.5

  Sample Size   Ridge Leverage Score Sampling   Uniform Sampling
  30            0.059                           0.8289
  50            0.044                           0.8289
  100           0.031                           0.8286
  150           0.028                           0.8286
  200           0.013                           0.8287

SLIDE 40

Table 2: Relative error at different coreset sizes for RLAD, λ = 0.5

  Sample Size   Sensitivity-based Sampling   Uniform Sampling
  30            0.69                         385.99
  50            0.65                         112.70
  100           0.34                         98.53
  150           0.19                         96.09
  200           0.17                         27.49
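A sketch in the spirit of these tables (a synthetic stand-in built on our own assumptions, not the paper's data or protocol): compare ridge leverage score sampling against uniform sampling on a matrix with a few heavy rows, for the modified Lasso cost at the full-data ridge solution.

```python
import numpy as np

# Synthetic stand-in for the Table 1 comparison (our own setup): relative
# error of ridge-leverage-score vs uniform sampling for the modified Lasso
# cost; the regularization term is kept exact, only the data term is sampled.
rng = np.random.default_rng(5)
n, d, lam = 100000, 30, 0.5
A = rng.standard_normal((n, d))
A[:50] *= 100.0                                # heavy rows -> non-uniform leverage
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
tau = np.sum(U ** 2 * (S ** 2 / (S ** 2 + lam)), axis=1)   # ridge leverage scores
x = Vt.T @ (S / (S ** 2 + lam) * (U.T @ b))                # ridge solution query
reg = lam * np.abs(x).sum() ** 2
full = np.sum((A @ x - b) ** 2) + reg

def rel_err(p, m):
    idx = rng.choice(n, size=m, replace=True, p=p)
    core = np.sum((A[idx] @ x - b[idx]) ** 2 / (m * p[idx])) + reg
    return abs(full - core) / full

for m in (30, 50, 100, 200):
    print(m, f"leverage: {rel_err(tau / tau.sum(), m):.4f}",
          f"uniform: {rel_err(np.full(n, 1.0 / n), m):.4f}")
```

On such inputs uniform sampling keeps missing the few heavy rows, which mirrors why its relative error stays large in the tables above.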

SLIDE 41

Conclusion and Future Work

◮ We present the first work on coresets for regularized regression under a general p-norm.

Open Questions
◮ Tighter bounds on the sensitivity scores.
◮ Coresets for other models with regularization and/or constraints.

SLIDES 42–44

References

[ACW17] Haim Avron, Kenneth L. Clarkson, and David P. Woodruff. Sharper bounds for regularized data fitting. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.

[BFL16] Vladimir Braverman, Dan Feldman, and Harry Lang. New frameworks for offline and streaming coreset constructions. arXiv preprint arXiv:1612.00889, 2016.

[DDH+09] Anirban Dasgupta, Petros Drineas, Boulos Harb, Ravi Kumar, and Michael W. Mahoney. Sampling algorithms and coresets for ℓp regression. SIAM Journal on Computing 38(5):2060–2078, 2009.

[DMM06] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Sampling algorithms for ℓ2 regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1127–1136. SIAM, 2006.

[FL11] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pp. 569–578. ACM, 2011.

[Hau95] David Haussler. Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A 69(2):217–232, 1995.

[LS10] Michael Langberg and Leonard J. Schulman. Universal ε-approximators for integrals. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 598–607. SIAM, 2010.

[PW15] Mert Pilanci and Martin J. Wainwright. Randomized sketches of convex programs with sharp guarantees. IEEE Transactions on Information Theory 61(9):5096–5115, 2015.

[YMM15] Jiyan Yang, Xiangrui Meng, and Michael W. Mahoney. Implementing randomized matrix algorithms in parallel and distributed environments. Proceedings of the IEEE 104(1):58–92, 2015.
SLIDE 45

More references in the paper. Thank you! We hope to get your feedback and answer your questions at the live chat session. Take care!