Sketched Ridge Regression: Optimization and Statistical Perspectives - PowerPoint PPT Presentation



SLIDE 1

Sketched Ridge Regression: Optimization and Statistical Perspectives

Shusen Wang (UC Berkeley), Alex Gittens (RPI), Michael Mahoney (UC Berkeley)

SLIDE 2

Overview

SLIDE 3

Ridge Regression

Over-determined: $n \gg d$, with $\mathbf{X} \in \mathbb{R}^{n \times d}$.

$$\min_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$$

SLIDE 4

Ridge Regression

$$\min_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$$

  • Efficient and approximate solution?
  • Use only part of the data?
SLIDE 5

Ridge Regression

$$\min_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$$

Matrix sketching:

  • Random selection
  • Random projection
SLIDE 6

Approximate Ridge Regression

$$\min_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$$

  • Sketched solution: $\hat{\mathbf{w}}$
SLIDE 7

Approximate Ridge Regression

$$\min_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$$

  • Sketched solution: $\hat{\mathbf{w}}$
  • Sketch size $s = O(d/\epsilon)$

[Plot: approximation error versus sketch size]
SLIDE 8

Approximate Ridge Regression

$$\min_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$$

  • Sketched solution: $\hat{\mathbf{w}}$
  • Sketch size $s = O(d/\epsilon)$
  • $f(\hat{\mathbf{w}}) \le (1 + \epsilon) \min_{\mathbf{w}} f(\mathbf{w})$

[Plot: approximation error versus sketch size]

Optimization Perspective

SLIDE 9

Approximate Ridge Regression

$$\min_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$$

Statistical Perspective

  • Bias
  • Variance
SLIDE 10

Related Work

  • Least squares regression: $\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$

References

  • Drineas, Mahoney, and Muthukrishnan: Sampling algorithms for ℓ2 regression and applications. In SODA, 2006.
  • Drineas, Mahoney, Muthukrishnan, and Sarlos: Faster least squares approximation. Numerische Mathematik, 2011.
  • Clarkson and Woodruff: Low rank approximation and regression in input sparsity time. In STOC, 2013.
  • Ma, Mahoney, and Yu: A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 2015.
  • Pilanci and Wainwright: Iterative Hessian sketch: fast and accurate solution approximation for constrained least squares. Journal of Machine Learning Research, 2015.
  • Raskutti and Mahoney: A statistical perspective on randomized sketching for ordinary least-squares. Journal of Machine Learning Research, 2016.
  • Etc.
SLIDE 11

Sketched Ridge Regression

SLIDE 12

Matrix Sketching

$$\mathbf{X} \;\longrightarrow\; \mathbf{S}^\top\mathbf{X}$$

  • Turn a big matrix into a smaller one.
  • $\mathbf{X} \in \mathbb{R}^{n \times d} \;\Rightarrow\; \mathbf{S}^\top\mathbf{X} \in \mathbb{R}^{s \times d}$.
  • $\mathbf{S} \in \mathbb{R}^{n \times s}$ is called the sketching matrix, e.g.,
  • Uniform sampling
  • Leverage score sampling
  • Gaussian projection
  • Subsampled randomized Hadamard transform (SRHT)
  • Count sketch (sparse embedding)
  • Etc.
SLIDE 13

Matrix Sketching

$$\mathbf{X} \;\longrightarrow\; \mathbf{S}^\top\mathbf{X}$$

  • Some matrix sketching methods are efficient.
  • Time cost is $o(nds)$: lower than naive matrix multiplication.
  • Examples:
  • Leverage score sampling: $O(nd \log n)$ time
  • SRHT: $O(nd \log s)$ time
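
To make these operations concrete, here is a minimal NumPy sketch of two of the methods above, uniform row sampling and Gaussian projection. The function names and the scaling convention (chosen so that $\mathbb{E}[\mathbf{S}\mathbf{S}^\top] = \mathbf{I}_n$) are our own illustration, not code from the paper:

```python
import numpy as np

def uniform_sampling_sketch(X, y, s, rng):
    """Apply S^T to (X, y), where S selects s rows uniformly at random
    (with replacement), rescaled by sqrt(n/s) so that E[S S^T] = I_n."""
    n = X.shape[0]
    idx = rng.integers(0, n, size=s)
    scale = np.sqrt(n / s)
    return scale * X[idx], scale * y[idx]

def gaussian_sketch(X, y, s, rng):
    """Apply S^T to (X, y), where S has i.i.d. N(0, 1/s) entries.
    Forming S densely costs O(nds); fine for illustration only."""
    n = X.shape[0]
    S = rng.normal(0.0, 1.0 / np.sqrt(s), size=(n, s))
    return S.T @ X, S.T @ y
```
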
SLIDE 14

Ridge Regression

  • Objective function:

$$f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$$

  • Optimal solution:

$$\mathbf{w}^\star = \operatorname*{argmin}_{\mathbf{w}} f(\mathbf{w}) = \left(\mathbf{X}^\top\mathbf{X} + n\gamma\mathbf{I}_d\right)^{-1}\mathbf{X}^\top\mathbf{y}$$

  • Time cost: $O(nd^2)$
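
As a sanity check of this closed form, a minimal NumPy implementation (our own illustration; the function name is hypothetical):

```python
import numpy as np

def ridge_exact(X, y, gamma):
    """Optimal solution w* = (X^T X + n*gamma*I_d)^{-1} X^T y.
    Forming X^T X costs O(n d^2); the solve adds O(d^3)."""
    n, d = X.shape
    A = X.T @ X + n * gamma * np.eye(d)
    return np.linalg.solve(A, X.T @ y)  # solve; never invert explicitly
```
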
SLIDE 15

Sketched Ridge Regression

  • Goal: efficiently and approximately solve

$$\operatorname*{argmin}_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2 .$$

SLIDE 16

Sketched Ridge Regression

  • Goal: efficiently and approximately solve

$$\operatorname*{argmin}_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2 .$$

  • Approach: reduce the size of $\mathbf{X}$ and $\mathbf{y}$ by matrix sketching.
SLIDE 17

Sketched Ridge Regression

  • Sketched solution:

$$\hat{\mathbf{w}} = \operatorname*{argmin}_{\mathbf{w}} \; \frac{1}{n}\|\mathbf{S}^\top\mathbf{X}\mathbf{w} - \mathbf{S}^\top\mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2 = \left(\mathbf{X}^\top\mathbf{S}\mathbf{S}^\top\mathbf{X} + n\gamma\mathbf{I}_d\right)^{-1}\mathbf{X}^\top\mathbf{S}\mathbf{S}^\top\mathbf{y}$$

SLIDE 18

Sketched Ridge Regression

  • Sketched solution:

$$\hat{\mathbf{w}} = \operatorname*{argmin}_{\mathbf{w}} \; \frac{1}{n}\|\mathbf{S}^\top\mathbf{X}\mathbf{w} - \mathbf{S}^\top\mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2 = \left(\mathbf{X}^\top\mathbf{S}\mathbf{S}^\top\mathbf{X} + n\gamma\mathbf{I}_d\right)^{-1}\mathbf{X}^\top\mathbf{S}\mathbf{S}^\top\mathbf{y}$$

  • Time: $O(sd^2) + T_s$
  • $T_s$ is the cost of forming the sketch $\mathbf{S}^\top\mathbf{X}$
  • E.g. $T_s = O(nd \log s)$ for SRHT.
  • E.g. $T_s = O(nd \log n)$ for leverage score sampling.
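
Combining the two previous snippets gives the sketched solver (again our own illustration; it reuses `uniform_sampling_sketch` from above, and note that the regularizer keeps the factor $1/n$, not $1/s$, exactly as in the formula):

```python
import numpy as np

def ridge_sketched(X, y, gamma, s, sketch=uniform_sampling_sketch,
                   rng=np.random.default_rng(0)):
    """Sketched solution
    w_hat = (X^T S S^T X + n*gamma*I_d)^{-1} X^T S S^T y."""
    n, d = X.shape
    Xs, ys = sketch(X, y, s, rng)   # S^T X (s x d) and S^T y (s,)
    A = Xs.T @ Xs + n * gamma * np.eye(d)
    return np.linalg.solve(A, Xs.T @ ys)
```

With uniform sampling, `ridge_sketched(X, y, gamma, s)` only ever touches the $s$ sampled rows of $\mathbf{X}$ after the sketch is drawn.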

SLIDE 19

Theory: Optimization Perspective

SLIDE 20

Optimization Perspective

  • Recall the objective function $f(\mathbf{w}) = \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$.
  • Bound $f(\hat{\mathbf{w}}) - f(\mathbf{w}^\star)$.
  • $\frac{1}{n}\|\mathbf{X}\hat{\mathbf{w}} - \mathbf{X}\mathbf{w}^\star\|_2^2 \le f(\hat{\mathbf{w}}) - f(\mathbf{w}^\star)$, so a bound on the suboptimality also bounds the prediction difference; see the derivation below.
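
The last inequality deserves a one-line justification (our addition; it is implicit on the slide): $f$ is quadratic and $\nabla f(\mathbf{w}^\star) = \mathbf{0}$, so the second-order Taylor expansion around $\mathbf{w}^\star$ is exact:

$$f(\hat{\mathbf{w}}) - f(\mathbf{w}^\star) = \frac{1}{n}\|\mathbf{X}(\hat{\mathbf{w}} - \mathbf{w}^\star)\|_2^2 + \gamma\|\hat{\mathbf{w}} - \mathbf{w}^\star\|_2^2 \;\ge\; \frac{1}{n}\|\mathbf{X}\hat{\mathbf{w}} - \mathbf{X}\mathbf{w}^\star\|_2^2 .$$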

SLIDE 21

Optimization Perspective

For the sketching methods

  • SRHT or leverage sampling with $s = O\!\left(\frac{\beta d}{\epsilon}\right)$,
  • uniform sampling with $s = O\!\left(\frac{\mu \beta d \log d}{\epsilon}\right)$,

$f(\hat{\mathbf{w}}) - f(\mathbf{w}^\star) \le \epsilon\, f(\mathbf{w}^\star)$ holds w.p. 0.9.

  • $\mathbf{X} \in \mathbb{R}^{n \times d}$: the design matrix
  • $\gamma$: the regularization parameter
  • $\beta = \frac{\|\mathbf{X}\|_2^2}{n\gamma + \|\mathbf{X}\|_2^2} \in (0, 1]$
  • $\mu \in \left[1, \frac{n}{d}\right]$: the row coherence of $\mathbf{X}$

SLIDE 22

Optimization Perspective

For the sketching methods

  • SRHT or leverage sampling with $s = O\!\left(\frac{\beta d}{\epsilon}\right)$,
  • uniform sampling with $s = O\!\left(\frac{\mu \beta d \log d}{\epsilon}\right)$,

$f(\hat{\mathbf{w}}) - f(\mathbf{w}^\star) \le \epsilon\, f(\mathbf{w}^\star)$ holds w.p. 0.9, and therefore

$$\frac{1}{n}\|\mathbf{X}\hat{\mathbf{w}} - \mathbf{X}\mathbf{w}^\star\|_2^2 \le \epsilon\, f(\mathbf{w}^\star).$$

  • $\mathbf{X} \in \mathbb{R}^{n \times d}$: the design matrix
  • $\gamma$: the regularization parameter
  • $\beta = \frac{\|\mathbf{X}\|_2^2}{n\gamma + \|\mathbf{X}\|_2^2} \in (0, 1]$
  • $\mu \in \left[1, \frac{n}{d}\right]$: the row coherence of $\mathbf{X}$

SLIDE 23

Theory: Statistical Perspective

SLIDE 24

Statistical Model

  • $\mathbf{X} \in \mathbb{R}^{n \times d}$: fixed design matrix
  • $\mathbf{w}_0 \in \mathbb{R}^d$: the true and unknown model
  • $\mathbf{y} = \mathbf{X}\mathbf{w}_0 + \boldsymbol{\varepsilon}$: observed response vector
  • $\varepsilon_1, \cdots, \varepsilon_n$ are random noise
  • $\mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0}$ and $\mathbb{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^\top] = \xi^2 \mathbf{I}_n$

SLIDE 25

Bias-Variance Decomposition

  • Risk: $R(\mathbf{w}) = \frac{1}{n}\mathbb{E}\|\mathbf{X}\mathbf{w} - \mathbf{X}\mathbf{w}_0\|_2^2$
  • $\mathbb{E}$ is taken w.r.t. the random noise $\boldsymbol{\varepsilon}$.
SLIDE 26

Bias-Variance Decomposition

  • Risk: $R(\mathbf{w}) = \frac{1}{n}\mathbb{E}\|\mathbf{X}\mathbf{w} - \mathbf{X}\mathbf{w}_0\|_2^2$
  • $\mathbb{E}$ is taken w.r.t. the random noise $\boldsymbol{\varepsilon}$.
  • Risk measures prediction error.
SLIDE 27

Bias-Variance Decomposition

  • Risk: $R(\mathbf{w}) = \frac{1}{n}\mathbb{E}\|\mathbf{X}\mathbf{w} - \mathbf{X}\mathbf{w}_0\|_2^2$
  • $R(\mathbf{w}) = \mathrm{bias}^2(\mathbf{w}) + \mathrm{var}(\mathbf{w})$
SLIDE 28

Bias-Variance Decomposition

  • Risk: $R(\mathbf{w}) = \frac{1}{n}\mathbb{E}\|\mathbf{X}\mathbf{w} - \mathbf{X}\mathbf{w}_0\|_2^2$
  • $R(\mathbf{w}) = \mathrm{bias}^2(\mathbf{w}) + \mathrm{var}(\mathbf{w})$

Optimal solution:

  • $\mathrm{bias}(\mathbf{w}^\star) = \sqrt{n}\,\gamma\,\left\|\left(\boldsymbol{\Sigma}^2 + n\gamma\mathbf{I}_d\right)^{-1}\boldsymbol{\Sigma}\mathbf{V}^\top\mathbf{w}_0\right\|_2$,
  • $\mathrm{var}(\mathbf{w}^\star) = \frac{\xi^2}{n}\left\|\left(\mathbf{I}_d + n\gamma\boldsymbol{\Sigma}^{-2}\right)^{-1}\right\|_F^2$.

Sketched solution:

  • $\mathrm{bias}(\hat{\mathbf{w}}) = \sqrt{n}\,\gamma\,\left\|\left(\boldsymbol{\Sigma}\mathbf{U}^\top\mathbf{S}\mathbf{S}^\top\mathbf{U}\boldsymbol{\Sigma} + n\gamma\mathbf{I}_d\right)^{-1}\boldsymbol{\Sigma}\mathbf{V}^\top\mathbf{w}_0\right\|_2$,
  • $\mathrm{var}(\hat{\mathbf{w}}) = \frac{\xi^2}{n}\left\|\left(\mathbf{U}^\top\mathbf{S}\mathbf{S}^\top\mathbf{U} + n\gamma\boldsymbol{\Sigma}^{-2}\right)^{-1}\mathbf{U}^\top\mathbf{S}\mathbf{S}^\top\right\|_F^2$.

  • Here $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ is the SVD.
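
These closed forms are straightforward to check numerically. A minimal NumPy transcription of all four formulas (our own illustration; it works with an explicit dense sketching matrix, which is fine for verification but not for large-scale use):

```python
import numpy as np

def bias_variance(X, w0, gamma, xi, S=None):
    """bias and var of ridge via the SVD X = U diag(sig) V^T.
    S=None -> optimal solution w*;  S (n x s) -> sketched solution w_hat.
    xi**2 is the noise variance."""
    n, d = X.shape
    U, sig, Vt = np.linalg.svd(X, full_matrices=False)
    if S is None:
        C = np.eye(d)   # U^T S S^T U reduces to I_d for the exact solution
        T = np.eye(n)   # S S^T reduces to I_n
    else:
        C = U.T @ S @ S.T @ U
        T = S @ S.T
    M1 = np.diag(sig) @ C @ np.diag(sig) + n * gamma * np.eye(d)
    bias = np.sqrt(n) * gamma * np.linalg.norm(
        np.linalg.solve(M1, sig * (Vt @ w0)))
    M2 = C + n * gamma * np.diag(sig ** -2.0)
    var = (xi ** 2 / n) * np.linalg.norm(
        np.linalg.solve(M2, U.T @ T), 'fro') ** 2
    return bias, var
```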

SLIDE 29

Statistical Perspective

For the sketching methods

  • SRHT or leverage sampling with $s = O\!\left(\frac{d}{\epsilon^2}\right)$,
  • uniform sampling with $s = O\!\left(\frac{\mu d \log d}{\epsilon^2}\right)$,

the following hold w.p. 0.9:

$$1 - \epsilon \;\le\; \frac{\mathrm{bias}(\hat{\mathbf{w}})}{\mathrm{bias}(\mathbf{w}^\star)} \;\le\; 1 + \epsilon, \qquad (1 - \epsilon)\,\frac{n}{s} \;\le\; \frac{\mathrm{var}(\hat{\mathbf{w}})}{\mathrm{var}(\mathbf{w}^\star)} \;\le\; (1 + \epsilon)\,\frac{n}{s}.$$

  • $\mathbf{X} \in \mathbb{R}^{n \times d}$: the design matrix
  • $\mu \in \left[1, \frac{n}{d}\right]$: the row coherence of $\mathbf{X}$

The bias ratio is good; the variance ratio is bad, because $n \gg s$.

SLIDE 30

Statistical Perspective

For the sketching methods

  • SRHT or leverage sampling with $s = O\!\left(\frac{d}{\epsilon^2}\right)$,
  • uniform sampling with $s = O\!\left(\frac{\mu d \log d}{\epsilon^2}\right)$,

the following hold w.p. 0.9:

$$1 - \epsilon \;\le\; \frac{\mathrm{bias}(\hat{\mathbf{w}})}{\mathrm{bias}(\mathbf{w}^\star)} \;\le\; 1 + \epsilon, \qquad (1 - \epsilon)\,\frac{n}{s} \;\le\; \frac{\mathrm{var}(\hat{\mathbf{w}})}{\mathrm{var}(\mathbf{w}^\star)} \;\le\; (1 + \epsilon)\,\frac{n}{s}.$$

  • $\mathbf{X} \in \mathbb{R}^{n \times d}$: the design matrix
  • $\mu \in \left[1, \frac{n}{d}\right]$: the row coherence of $\mathbf{X}$

If $\mathbf{y}$ is noisy, the variance dominates the bias, and $R(\hat{\mathbf{w}}) \gg R(\mathbf{w}^\star)$.

SLIDE 31

Conclusions

  • Use the sketched solution to initialize numerical optimization.
  • $\mathbf{X}\hat{\mathbf{w}}$ is close to $\mathbf{X}\mathbf{w}^\star$.

Optimization Perspective

SLIDE 32

Conclusions

  • Use the sketched solution to initialize numerical optimization.
  • $\mathbf{X}\hat{\mathbf{w}}$ is close to $\mathbf{X}\mathbf{w}^\star$.
  • $\mathbf{w}^{(t)}$: output of the $t$-th iteration of the conjugate gradient (CG) algorithm.

$$\frac{\|\mathbf{X}\mathbf{w}^{(t)} - \mathbf{X}\mathbf{w}^\star\|_2^2}{\|\mathbf{X}\mathbf{w}^{(0)} - \mathbf{X}\mathbf{w}^\star\|_2^2} \;\le\; 2\left(\frac{\kappa(\mathbf{X}^\top\mathbf{X}) - 1}{\kappa(\mathbf{X}^\top\mathbf{X}) + 1}\right)^{t}.$$

  • Initialization is important.

Optimization Perspective
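
For instance, with SciPy's conjugate gradient the warm start is just the `x0` argument (a minimal sketch under our own naming; the slides do not prescribe this exact code):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def ridge_cg(X, y, gamma, w_init=None):
    """Solve (X^T X + n*gamma*I_d) w = X^T y by CG,
    warm-started at w_init (e.g. the sketched solution w_hat)."""
    n, d = X.shape
    A = LinearOperator((d, d),
                       matvec=lambda v: X.T @ (X @ v) + n * gamma * v)
    w, info = cg(A, X.T @ y, x0=w_init)  # info == 0 means converged
    return w
```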

SLIDE 33

Conclusions

  • Use the sketched solution to initialize numerical optimization.
  • $\mathbf{X}\hat{\mathbf{w}}$ is close to $\mathbf{X}\mathbf{w}^\star$.
  • Never use the sketched solution to replace the optimal solution.
  • Much higher variance → bad generalization.

Optimization Perspective / Statistical Perspective

SLIDE 34

Model Averaging

SLIDE 35

Model Averaging

  • Independently draw $\mathbf{S}_1, \cdots, \mathbf{S}_g$.
  • Compute the sketched solutions $\hat{\mathbf{w}}_1, \cdots, \hat{\mathbf{w}}_g$.
  • Model averaging: $\bar{\mathbf{w}} = \frac{1}{g}\sum_{i=1}^{g} \hat{\mathbf{w}}_i$.
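In code, model averaging is one step on top of the sketched solver from Slide 18 (our own illustration, reusing `ridge_sketched` and `uniform_sampling_sketch` from above):

```python
import numpy as np

def ridge_model_averaging(X, y, gamma, s, g,
                          sketch=uniform_sampling_sketch):
    """Draw g independent sketches, solve g sketched problems,
    and return w_bar = (1/g) * sum_i w_hat_i."""
    w_hats = [ridge_sketched(X, y, gamma, s, sketch,
                             np.random.default_rng(i))
              for i in range(g)]
    return np.mean(w_hats, axis=0)
```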

SLIDE 36

Optimization Perspective

  • For sufficiently large $s$,

$$\frac{f(\hat{\mathbf{w}}) - f(\mathbf{w}^\star)}{f(\mathbf{w}^\star)} \;\le\; \epsilon \quad \text{holds w.h.p.} \qquad \text{(without model averaging)}$$

SLIDES 37-38

Optimization Perspective

  • For sufficiently large $s$,

$$\frac{f(\hat{\mathbf{w}}) - f(\mathbf{w}^\star)}{f(\mathbf{w}^\star)} \;\le\; \epsilon \quad \text{holds w.h.p.} \qquad \text{(without model averaging)}$$

  • Using the same matrix sketching and the same $s$,

$$\frac{f(\bar{\mathbf{w}}) - f(\mathbf{w}^\star)}{f(\mathbf{w}^\star)} \;\le\; \frac{\epsilon}{g} + \epsilon^2 \quad \text{holds w.h.p.} \qquad \text{(with model averaging)}$$

SLIDE 39

Optimization Perspective

  • For sufficiently large $s$,

$$\frac{f(\hat{\mathbf{w}}) - f(\mathbf{w}^\star)}{f(\mathbf{w}^\star)} \;\le\; \epsilon \quad \text{holds w.h.p.} \qquad \text{(without model averaging)}$$

  • Using the same matrix sketching and the same $s$,

$$\frac{f(\bar{\mathbf{w}}) - f(\mathbf{w}^\star)}{f(\mathbf{w}^\star)} \;\le\; \frac{\epsilon}{g} + \epsilon^2 \quad \text{holds w.h.p.} \qquad \text{(with model averaging)}$$

  • If $s \gg d$, then $\epsilon^2$ is very small, and the error bound is $\propto \frac{\epsilon}{g}$.
SLIDE 40

Statistical Perspective

  • Risk: $R(\mathbf{w}) = \frac{1}{n}\mathbb{E}\|\mathbf{X}\mathbf{w} - \mathbf{X}\mathbf{w}_0\|_2^2 = \mathrm{bias}^2(\mathbf{w}) + \mathrm{var}(\mathbf{w})$
  • Model averaging:
  • $\mathrm{bias}(\bar{\mathbf{w}}) = \sqrt{n}\,\gamma\,\left\|\frac{1}{g}\sum_{i=1}^{g}\left(\boldsymbol{\Sigma}\mathbf{U}^\top\mathbf{S}_i\mathbf{S}_i^\top\mathbf{U}\boldsymbol{\Sigma} + n\gamma\mathbf{I}_d\right)^{-1}\boldsymbol{\Sigma}\mathbf{V}^\top\mathbf{w}_0\right\|_2$,
  • $\mathrm{var}(\bar{\mathbf{w}}) = \frac{\xi^2}{n}\left\|\frac{1}{g}\sum_{i=1}^{g}\left(\mathbf{U}^\top\mathbf{S}_i\mathbf{S}_i^\top\mathbf{U} + n\gamma\boldsymbol{\Sigma}^{-2}\right)^{-1}\mathbf{U}^\top\mathbf{S}_i\mathbf{S}_i^\top\right\|_F^2$.
  • Here $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ is the SVD.
SLIDES 41-43

Statistical Perspective

  • For sufficiently large $s$, the following hold w.h.p. (without model averaging):

$$\frac{\mathrm{bias}(\hat{\mathbf{w}})}{\mathrm{bias}(\mathbf{w}^\star)} \le 1 + \epsilon \quad \text{and} \quad \frac{\mathrm{var}(\hat{\mathbf{w}})}{\mathrm{var}(\mathbf{w}^\star)} \le \frac{n}{s}\,(1 + \epsilon).$$

  • Using the same sketching methods and the same $s$, the following hold w.h.p. (with model averaging):

$$\frac{\mathrm{bias}(\bar{\mathbf{w}})}{\mathrm{bias}(\mathbf{w}^\star)} \le 1 + \epsilon \quad \text{and} \quad \frac{\mathrm{var}(\bar{\mathbf{w}})}{\mathrm{var}(\mathbf{w}^\star)} \lesssim \frac{n}{s}\left(\frac{1}{g} + \epsilon^2\right).$$

SLIDE 44

Statistical Perspective

  • For sufficiently large $s$, the following hold w.h.p. (without model averaging):

$$\frac{\mathrm{bias}(\hat{\mathbf{w}})}{\mathrm{bias}(\mathbf{w}^\star)} \le 1 + \epsilon \quad \text{and} \quad \frac{\mathrm{var}(\hat{\mathbf{w}})}{\mathrm{var}(\mathbf{w}^\star)} \le \frac{n}{s}\,(1 + \epsilon).$$

  • Using the same sketching methods and the same $s$, the following hold w.h.p. (with model averaging):

$$\frac{\mathrm{bias}(\bar{\mathbf{w}})}{\mathrm{bias}(\mathbf{w}^\star)} \le 1 + \epsilon \quad \text{and} \quad \frac{\mathrm{var}(\bar{\mathbf{w}})}{\mathrm{var}(\mathbf{w}^\star)} \lesssim \frac{n}{s}\left(\frac{1}{g} + \epsilon^2\right).$$

  • If $\epsilon$ is small, then $\mathrm{var}(\bar{\mathbf{w}}) \propto \frac{1}{g}$.
SLIDE 45

Applications to Distributed Optimization

  • $(\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_n, y_n)$ are (randomly) split among $g$ machines.
  • Equivalent to uniform sampling with $s = \frac{n}{g}$.
SLIDE 46

Optimization Perspective

  • Application to distributed optimization:
  • If $s = \frac{n}{g} \gg d$, $\bar{\mathbf{w}}$ is very close to $\mathbf{w}^\star$ (provably).
  • $\bar{\mathbf{w}}$ is a good initialization for distributed optimization algorithms.
SLIDE 47

Statistical Perspective

  • Application to distributed machine learning:
  • If $s = \frac{n}{g} \gg d$, then $R(\bar{\mathbf{w}})$ is comparable to $R(\mathbf{w}^\star)$.
  • If a low-precision solution suffices, then $\bar{\mathbf{w}}$ is a good substitute for $\mathbf{w}^\star$.
  • One-shot solution.
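
A minimal simulation of the one-shot scheme (our own illustration): each machine solves ridge regression on its local $n/g$ rows, which coincides with the uniform-sampling sketched solution at $s = n/g$, and the driver averages the $g$ local solutions:

```python
import numpy as np

def one_shot_distributed_ridge(X, y, gamma, g,
                               rng=np.random.default_rng(0)):
    """Randomly partition the n rows among g 'machines', solve a local
    ridge problem on each, and average the g local solutions."""
    n, d = X.shape
    w_locals = []
    for part in np.array_split(rng.permutation(n), g):
        Xi, yi = X[part], y[part]
        A = Xi.T @ Xi + len(part) * gamma * np.eye(d)  # local n is s = n/g
        w_locals.append(np.linalg.solve(A, Xi.T @ yi))
    return np.mean(w_locals, axis=0)
```
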
SLIDE 48

Thank You!

The paper is at arXiv:1702.04837