SLIDE 1

Less is More: Computational Regularization by Subsampling

Lorenzo Rosasco
University of Genova - Istituto Italiano di Tecnologia
Massachusetts Institute of Technology
lcsl.mit.edu

joint work with Alessandro Rudi, Raffaello Camoriano

Paris

SLIDES 2-4

A Starting Point

Classically: statistics and optimization are distinct steps in algorithm design,
i.e. empirical process theory + optimization.

Large scale: consider the interplay between statistics and optimization!
(Bottou, Bousquet '08)

Computational regularization: computational "tricks" = regularization

SLIDES 5-7

Supervised Learning

Problem: estimate f* given S_n = {(x_1, y_1), ..., (x_n, y_n)}

[Figure: sample points (x_1, y_1), ..., (x_5, y_5) scattered around the target function f*]

The Setting
y_i = f*(x_i) + ε_i,  i ∈ {1, ..., n}

◮ ε_i ∈ R, x_i ∈ R^d random (bounded, but with unknown distribution)
◮ f* unknown

SLIDE 8

Outline

◮ Nonparametric Learning
◮ Data Dependent Subsampling
◮ Data Independent Subsampling

SLIDES 9-14

Non-linear/non-parametric learning

f(x) = Σ_{i=1}^M c_i q(x, w_i)

◮ q non-linear function
◮ w_i ∈ R^d centers
◮ c_i ∈ R coefficients
◮ M = M_n could/should grow with n

Question: how to choose w_i, c_i and M given S_n?

SLIDES 15-16

Learning with Positive Definite Kernels

There is an elegant answer if:

◮ q is symmetric
◮ all the matrices Q_{ij} = q(x_i, x_j) are positive semi-definite¹

Representer Theorem (Kimeldorf, Wahba '70; Schölkopf et al. '01)

◮ M = n
◮ w_i = x_i
◮ c_i by convex optimization!

¹ They have non-negative eigenvalues.

SLIDES 17-18

Kernel Ridge Regression (KRR), a.k.a. Tikhonov Regularization

f̂_λ = argmin_{f ∈ H} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ‖f‖²

where²

H = {f | f(x) = Σ_{i=1}^M c_i q(x, w_i), c_i ∈ R, w_i ∈ R^d (any center!), M ∈ N (any length!)}

Solution

f̂_λ(x) = Σ_{i=1}^n c_i q(x, x_i)  with  c = (Q̂ + λnI)^{−1} ŷ

² The norm is induced by the inner product ⟨f, f′⟩ = Σ_{i,j} c_i c′_j q(x_i, x_j).
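A minimal numerical sketch of this solution, assuming a Gaussian kernel for q (the kernel choice and all names here are illustrative):

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    # Q_ij = exp(-gamma * ||x_i - x_j||^2): symmetric and positive semi-definite
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam):
    # Solve c = (Q + lam * n * I)^{-1} y
    n = X.shape[0]
    Q = gaussian_kernel(X, X)
    return np.linalg.solve(Q + lam * n * np.eye(n), y)

def krr_predict(X_train, c, X_test):
    # f_lam(x) = sum_i c_i q(x, x_i)
    return gaussian_kernel(X_test, X_train) @ c
```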

SLIDES 19-24

KRR: Statistics

Well-understood statistical properties.

Classical Theorem
If f* ∈ H, then the choice λ* = 1/√n gives

E(f̂_{λ*}(x) − f*(x))² ≲ 1/√n

Remarks

1. Optimal nonparametric bound
2. More refined results for smooth kernels:

   λ* = n^{−1/(2s+1)},   E(f̂_{λ*}(x) − f*(x))² ≲ n^{−2s/(2s+1)}

3. Adaptive tuning, e.g. via cross validation
4. Proofs: inverse problems results + random matrices
   (Smale and Zhou + Caponnetto, De Vito, R.)
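One way to read these rates is as a schematic bias-variance trade-off (a standard heuristic sketch, not the actual proof in the references above):

```latex
E\,(\hat f_\lambda(x) - f^*(x))^2
\;\lesssim\;
\underbrace{\lambda^{2s}}_{\text{bias}^2}
\;+\;
\underbrace{\tfrac{1}{n\lambda}}_{\text{variance}},
\qquad
\lambda^{2s} = \tfrac{1}{n\lambda}
\;\Longrightarrow\;
\lambda^* = n^{-\frac{1}{2s+1}},
\quad
E\,(\hat f_{\lambda^*}(x) - f^*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}.
```

Setting s = 1/2 recovers the classical theorem: λ* = n^{−1/2} with rate n^{−1/2}.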

SLIDES 25-26

KRR: Optimization

f̂_λ(x) = Σ_{i=1}^n c_i q(x, x_i)  with  c = (Q̂ + λnI)^{−1} ŷ

Linear system: (Q̂ + λnI) c = ŷ, with Q̂ an n × n matrix.

Complexity
◮ Space O(n²)
◮ Time O(n³)

BIG DATA? Running out of time and space ... Can this be fixed?

SLIDES 27-30

Beyond Tikhonov: Spectral Filtering

(Q̂ + λI)^{−1} is an approximation of Q̂† controlled by λ.

Can we approximate Q̂† while saving computations?

Yes!

Spectral filtering (Engl '96, inverse problems; Rosasco et al. '05, ML):

g_λ(Q̂) ∼ Q̂†

The filter function g_λ defines the form of the approximation.

SLIDES 31-33

Spectral Filtering: Examples

◮ Tikhonov: ridge regression
◮ Truncated SVD: principal component regression
◮ Landweber iteration: GD / L2-boosting
◮ ν-method: accelerated GD / Chebyshev method
◮ ...

Landweber iteration (truncated power series) ...

c_t = g_t(Q̂) ŷ = γ Σ_{r=0}^{t−1} (I − γQ̂)^r ŷ

... it's GD for ERM! For r = 1, ..., t:

c_r = c_{r−1} − γ(Q̂ c_{r−1} − ŷ),   c_0 = 0
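A sketch of this recursion in code (illustrative; Q and y are the kernel matrix and label vector from the KRR sketch above):

```python
import numpy as np

def landweber(Q, y, t, gamma=None):
    # Gradient descent on ERM: c_r = c_{r-1} - gamma * (Q c_{r-1} - y), c_0 = 0,
    # equal to the truncated power series gamma * sum_{r=0}^{t-1} (I - gamma Q)^r y.
    if gamma is None:
        gamma = 1.0 / np.linalg.eigvalsh(Q)[-1]  # step size below 2 / ||Q||
    c = np.zeros(Q.shape[0])
    for _ in range(t):
        c = c - gamma * (Q @ c - y)
    return c
```

The iteration count t plays the role of 1/λ: stopping early regularizes.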

SLIDES 34-35

Statistics and Computations with Spectral Filtering

The different filters achieve essentially the same optimal statistical error!
The difference is in computations:

Filter          Time         Space
Tikhonov        n³           n²
GD              n² λ^{−1}    n²
Accelerated GD  n² λ^{−1/2}  n²
Truncated SVD   n² λ^{−γ}    n²

Note: λ^{−1} = t for iterative methods.

SLIDE 36

Semiconvergence

[Figure: empirical error decreases monotonically with the number of iterations, while the expected error first decreases and then increases]

◮ Iterations control statistics and time complexity
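A small synthetic simulation of this effect (everything here, data and kernel included, is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k = lambda A, B: np.exp(-10.0 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
f_star = lambda X: np.sin(3 * X[:, 0])

X = rng.uniform(-1, 1, (200, 1)); y = f_star(X) + 0.3 * rng.standard_normal(200)
X_te = rng.uniform(-1, 1, (1000, 1)); y_te = f_star(X_te)

Q = k(X, X)
step = 1.0 / np.linalg.eigvalsh(Q)[-1]
c = np.zeros(len(X))
for t in range(1, 2001):
    c -= step * (Q @ c - y)                 # one Landweber/GD step
    if t % 200 == 0:
        train = np.mean((Q @ c - y) ** 2)   # empirical error keeps decreasing
        test = np.mean((k(X_te, X) @ c - y_te) ** 2)  # typically U-shaped
        print(t, round(float(train), 4), round(float(test), 4))
```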

SLIDES 37-39

Computational Regularization

BIG DATA? Running out of ✘time and space✘ ...

Is there a principle to control statistics, time and space complexity?

SLIDE 40

Outline

◮ Nonparametric Learning
◮ Data Dependent Subsampling
◮ Data Independent Subsampling

SLIDES 41-45

Subsampling

1. Pick w_i at random ... from the training set (Smola, Schölkopf '00):

   w̃_1, ..., w̃_M ⊂ {x_1, ..., x_n},   M ≪ n

2. Perform KRR on (see the sketch below)

   H_M = {f | f(x) = Σ_{i=1}^M c_i q(x, w̃_i), c_i ∈ R, ✘w_i ∈ R^d✘, ✘M ∈ N✘}

Linear system: now involves the n × M matrix Q̂_M instead of the n × n matrix Q̂.

Complexity
◮ Space: ✘O(n²)✘ → O(nM)
◮ Time: ✘O(n³)✘ → O(nM²)

What about statistics? What's the price for efficient computations?
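A sketch of step 2, using the standard subset-of-regressors form of Nyström KRR (a common variant; I am assuming the plain KRR objective restricted to H_M, and all names are illustrative):

```python
import numpy as np

def nystrom_krr_fit(X, y, M, lam, kernel, seed=0):
    # 1. pick M centers uniformly at random from the training set
    rng = np.random.default_rng(seed)
    Xm = X[rng.choice(len(X), size=M, replace=False)]
    # 2. KRR restricted to H_M: minimizing over c in R^M
    #    (1/n) ||K_nm c - y||^2 + lam * c^T K_mm c
    # leads to the M x M system (K_nm^T K_nm + lam * n * K_mm) c = K_nm^T y
    n = len(X)
    K_nm = kernel(X, Xm)   # n x M
    K_mm = kernel(Xm, Xm)  # M x M
    c = np.linalg.solve(K_nm.T @ K_nm + lam * n * K_mm, K_nm.T @ y)
    return Xm, c           # f(x) = sum_i c_i q(x, w~_i)
```

Forming K_nm^T K_nm dominates: O(nM²) time and O(nM) space, matching the complexity above.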

SLIDES 46-48

Putting our Result in Context

◮ *Many* different subsampling schemes
  (Smola, Schölkopf '00; Williams, Seeger '01; ... 20+)

◮ Theoretical guarantees mainly on matrix approximation
  (Mahoney and Drineas '09; Cortes et al. '10; Kumar et al. '12; ... 10+)

  ‖Q − Q_M‖ ≲ 1/√M

◮ Statistical guarantees suboptimal or in restricted settings
  (Cortes et al. '10; Jin et al. '11; Bach '13; Alaoui, Mahoney '14)
slide-49
SLIDE 49

Main Result

(Rudi, Camoriano, Rosasco, ’15)

Theorem

If f ∗ ∈ H, then λ∗ = 1 √n , M∗ = 1 λ∗ , E ( fλ∗,M∗(x) − f ∗(x))2 1 √n

slide-50
SLIDE 50

Main Result

(Rudi, Camoriano, Rosasco, ’15)

Theorem

If f ∗ ∈ H, then λ∗ = 1 √n , M∗ = 1 λ∗ , E ( fλ∗,M∗(x) − f ∗(x))2 1 √n Remarks

slide-51
SLIDE 51

Main Result

(Rudi, Camoriano, Rosasco, ’15)

Theorem

If f ∗ ∈ H, then λ∗ = 1 √n , M∗ = 1 λ∗ , E ( fλ∗,M∗(x) − f ∗(x))2 1 √n Remarks

  • 1. Subsampling achives optimal bound. . .
slide-52
SLIDE 52

Main Result

(Rudi, Camoriano, Rosasco, ’15)

Theorem

If f ∗ ∈ H, then λ∗ = 1 √n , M∗ = 1 λ∗ , E ( fλ∗,M∗(x) − f ∗(x))2 1 √n Remarks

  • 1. Subsampling achives optimal bound. . .
  • 2. . . . with M∗ ∼ √n !!
slide-53
SLIDE 53

Main Result

(Rudi, Camoriano, Rosasco, ’15)

Theorem

If f ∗ ∈ H, then λ∗ = 1 √n , M∗ = 1 λ∗ , E ( fλ∗,M∗(x) − f ∗(x))2 1 √n Remarks

  • 1. Subsampling achives optimal bound. . .
  • 2. . . . with M∗ ∼ √n !!
  • 3. More generally,

λ∗ = n−

1 2s+1 ,

M∗ = 1 λ∗ , Ex ( fλ∗,M∗(x) − f ∗(x))2 n−

2s 2s+1

slide-54
SLIDE 54

Main Result

(Rudi, Camoriano, Rosasco, ’15)

Theorem

If f ∗ ∈ H, then λ∗ = 1 √n , M∗ = 1 λ∗ , E ( fλ∗,M∗(x) − f ∗(x))2 1 √n Remarks

  • 1. Subsampling achives optimal bound. . .
  • 2. . . . with M∗ ∼ √n !!
  • 3. More generally,

λ∗ = n−

1 2s+1 ,

M∗ = 1 λ∗ , Ex ( fλ∗,M∗(x) − f ∗(x))2 n−

2s 2s+1

Note: An interesting insight is obtained rewriting the result. . .

SLIDES 55-61

Computational Regularization by Subsampling
(Rudi, Camoriano, Rosasco '15)

A simple idea: "swap" the roles of λ and M ...

Theorem
If f* ∈ H with a smooth kernel, then the choices M* = n^{1/(2s+1)} and λ* = 1/M* give

E_x(f̂_{λ*,M*}(x) − f*(x))² ≲ n^{−2s/(2s+1)}

◮ λ and M play the same role ...
  ... new interpretation: subsampling regularizes!

◮ A new, natural incremental algorithm (sketched in code below):

Algorithm
1. Pick a center + compute solution
2. Pick another center + rank-one update
3. Pick another center ...
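An illustrative sketch of the incremental scheme. Assumption: for clarity this re-solves the M × M system at each step rather than performing the rank-one updates the algorithm actually uses; the kernel and all names are hypothetical:

```python
import numpy as np

def incremental_nystrom_path(X, y, X_val, y_val, lam, kernel, M_max, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))[:M_max]
    errors = []
    for M in range(1, M_max + 1):            # add one random center at a time
        Xm = X[order[:M]]
        K_nm, K_mm = kernel(X, Xm), kernel(Xm, Xm)
        c = np.linalg.solve(K_nm.T @ K_nm + lam * len(X) * K_mm, K_nm.T @ y)
        pred = kernel(X_val, Xm) @ c
        errors.append(np.mean((pred - y_val) ** 2))  # monitor validation error
    return errors  # stop where the error flattens: M itself regularizes
```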
slide-62
SLIDE 62

N¨ ystrom CoRe Illustrated

n, λ are fixed

50 100 150 200 250 300

Validation Error

0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11

Computation controls stability! Time/space requirement tailored to generalization

SLIDE 63

Experiments

Comparable or better performance w.r.t. the state of the art:

Dataset     n_tr     d    Incremental CoRe    Standard KRLS  Standard Nyström  Random Features  Fastfood RF
Ins. Co.    5822     85   0.23180 ± 4 × 10⁻⁵  0.231          0.232             0.266            0.264
CPU         6554     21   2.8466 ± 0.0497     7.271          6.758             7.103            7.366
CT slices   42800    384  7.1106 ± 0.0772     NA             60.683            49.491           43.858
Year Pred.  463715   90   0.10470 ± 5 × 10⁻⁵  NA             0.113             0.123            0.115
Forest      522910   54   0.9638 ± 0.0186     NA             0.837             0.840            0.840

◮ Random Features (Rahimi, Recht '07)
◮ Fastfood (Le et al. '13)

SLIDES 64-66

Summary So Far

◮ Optimal learning with data dependent subsampling
◮ Computational regularization: subsampling regularizes!

A few more questions:

◮ Can one do better than uniform sampling?
  Yes: leverage score sampling ...
◮ What about data independent sampling?
slide-67
SLIDE 67

Outline

Nonparametric Learning Data Dependent Subsampling Data Independent Subsampling

SLIDES 68-71

Random Features

f(x) = Σ_{i=1}^M c_i q(x, w_i)

◮ q a general non-linear function
◮ pick w̃_i at random according to a distribution μ:  w̃_1, ..., w̃_M ∼ μ
◮ perform KRR on

  H_M = {f | f(x) = Σ_{i=1}^M c_i q(x, w̃_i), c_i ∈ R}
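With an explicit feature map, KRR on H_M reduces to linear ridge regression on the M features; a minimal sketch (names illustrative, plain ridge objective assumed):

```python
import numpy as np

def rf_ridge_fit(Z, y, lam):
    # Z[i, j] = q(x_i, w~_j) is the n x M feature matrix; solve the M x M system
    # (Z^T Z + lam * n * I) c = Z^T y, so that f(x) = sum_j c_j q(x, w~_j)
    n, M = Z.shape
    return np.linalg.solve(Z.T @ Z + lam * n * np.eye(M), Z.T @ y)
```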

SLIDES 72-75

Random Fourier Features
(Rahimi, Recht '07)

Consider q(x, w) = e^{i w^T x}, with w ∼ μ = N(0, I). Then

E_w[ q(x, w) q(x′, w)* ] = e^{−γ‖x − x′‖²} = K(x, x′)

(* denotes complex conjugation). By sampling w̃_1, ..., w̃_M we are considering the approximate kernel

(1/M) Σ_{i=1}^M q(x, w̃_i) q(x′, w̃_i)* = K_M(x, x′)
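A quick numerical check of this Monte Carlo identity, using the equivalent real-valued cosine features (for w ∼ N(0, I) the limiting kernel is the Gaussian with γ = 1/2; all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 5, 20000
x, xp = rng.standard_normal(d), rng.standard_normal(d)

W = rng.standard_normal((M, d))     # frequencies w~_i ~ N(0, I)
b = rng.uniform(0, 2 * np.pi, M)    # random phases
z = lambda u: np.sqrt(2.0 / M) * np.cos(W @ u + b)   # feature map

K_M = z(x) @ z(xp)                          # approximate kernel K_M(x, x')
K = np.exp(-np.sum((x - xp) ** 2) / 2.0)    # exact Gaussian kernel
print(K_M, K)   # close for large M; the error decays like 1/sqrt(M)
```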

SLIDE 76

More Random Features

◮ translation invariant kernels K(x, x′) = H(x − x′):
  q(x, w) = e^{i w^T x},  w ∼ μ = F(H)  (the Fourier transform of H)
◮ infinite neural nets kernels (see the sketch below):
  q(x, w) = |w^T x + b|_+,  (w, b) ∼ μ = U[S^d]
◮ infinite dot product kernels
◮ homogeneous additive kernels
◮ group invariant kernels
◮ ...

Note: connections with hashing and sketching techniques.
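A sketch instantiating the neural-net features above: sample (w, b) uniformly on the sphere S^d ⊂ R^{d+1} and use ramp units (this gives a random-feature approximation of an arc-cosine-type kernel; treat the construction details as an assumption):

```python
import numpy as np

def relu_random_features(X, M, seed=0):
    # (w, b) uniform on the unit sphere S^d in R^{d+1}: normalize a Gaussian draw
    rng = np.random.default_rng(seed)
    wb = rng.standard_normal((M, X.shape[1] + 1))
    wb /= np.linalg.norm(wb, axis=1, keepdims=True)
    W, b = wb[:, :-1], wb[:, -1]
    # q(x, w) = |w^T x + b|_+ : a ramp / ReLU unit with random weights
    return np.maximum(X @ W.T + b, 0.0)
```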

SLIDES 77-79

Properties of Random Features

Optimization
◮ Time: ✘O(n³)✘ → O(nM²)
◮ Space: ✘O(n²)✘ → O(nM)

Statistics
As before: do we pay a price for efficient computations?

SLIDES 80-83

Previous Works

◮ *Many* different random features for different kernels
  (Rahimi, Recht '07; Vedaldi, Zisserman; ... 10+)

◮ Theoretical guarantees: mainly kernel approximation
  (Rahimi, Recht '07; ...; Sriperumbudur and Szabo '15)

  |K(x, x′) − K_M(x, x′)| ≲ 1/√M

◮ Statistical guarantees suboptimal or in restricted settings
  (Rahimi, Recht '09; Yang et al. '13; ...; Bach '15)

SLIDES 84-88

Main Result

Let q(x, w) = e^{i w^T x}, with w ∼ μ(w) = c_d (1 + ‖w‖²)^{−(d+1)/2}.

Theorem
If f* ∈ H^s, a Sobolev space, then the choices λ* = n^{−1/(2s+1)} and M* = (1/λ*)^{2s} give

E(f̂_{λ*,M*}(x) − f*(x))² ≲ n^{−2s/(2s+1)}

◮ Random features achieve the optimal bound!
◮ Efficient worst-case subsampling M* ∼ √n, but cannot exploit smoothness.
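For completeness, the density above is (up to the normalizing constant c_d) that of a multivariate Student-t with one degree of freedom, which is easy to sample; a sketch, with that identification stated as an assumption consistent with the displayed formula:

```python
import numpy as np

def sample_mu(M, d, seed=0):
    # A multivariate Student-t with 1 degree of freedom has density
    # proportional to (1 + ||w||^2)^{-(d+1)/2}: draw g ~ N(0, I_d), z ~ N(0, 1),
    # and set w = g / |z|.
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((M, d))
    z = np.abs(rng.standard_normal((M, 1)))
    return g / z
```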

SLIDES 89-91

Remarks & Extensions

Nyström vs Random Features
◮ Both achieve optimal rates
◮ Nyström seems to need fewer samples (random centers)

How tight are the results?

[Figure: test error as a function of log λ and log M, showing that the two parameters trade off against each other]

SLIDES 92-94

Contributions

◮ Optimal bounds for data dependent/independent subsampling
◮ Subsampling: Nyström vs Random Features
◮ Beyond ridge regression: early stopping and multiple-pass SGD (see arXiv)

Some questions:
◮ The quest for the best sampling
◮ Regularization by projection: inverse problems and preconditioning
◮ Beyond randomization: non-convex neural net optimization?

Some perspectives:
◮ Computational regularization: subsampling regularizes
◮ Algorithm design: control stability for good statistics/computations