Less is More: Nyström Computational Regularization
Alessandro Rudi, Raffaello Camoriano, Lorenzo Rosasco
University of Genova - Istituto Italiano di Tecnologia - Massachusetts Institute of Technology
ale_rudi@mit.edu - Dec 10th NIPS 2015


SLIDE 1

Less is More: Nyström Computational Regularization

Alessandro Rudi, Raffaello Camoriano, Lorenzo Rosasco
University of Genova - Istituto Italiano di Tecnologia
Massachusetts Institute of Technology
ale_rudi@mit.edu
Dec 10th NIPS 2015

SLIDE 2

A Starting Point

Classically: Statistics and optimization distinct steps in algorithm design

SLIDE 3

A Starting Point

Classically: Statistics and optimization distinct steps in algorithm design Large Scale: Consider interplay between statistics and optimization!

(Bottou, Bousquet ’08)

SLIDE 4

Supervised Learning Problem: Estimate $f^*$

SLIDE 5

Supervised Learning Problem:

Estimate $f^*$ given $S_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$

[Figure: sample points $(x_1, y_1), \ldots, (x_5, y_5)$ and the target function $f^*$]

SLIDE 6

Supervised Learning Problem:

Estimate $f^*$ given $S_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$

[Figure: sample points $(x_1, y_1), \ldots, (x_5, y_5)$ and the target function $f^*$]

The Setting

$$y_i = f^*(x_i) + \varepsilon_i, \qquad i \in \{1, \ldots, n\}$$

◮ $\varepsilon_i \in \mathbb{R}$, $x_i \in \mathbb{R}^d$ random (with unknown distribution)
◮ $f^*$ unknown
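A minimal sketch (not part of the slides) of drawing a sample $S_n$ from this setting; the specific target $f^*$, the input distribution and the Gaussian noise level are purely illustrative assumptions, since all of them are unknown in the actual learning problem.

```python
import numpy as np

def sample_data(n, d=1, noise_std=0.3, seed=0):
    """Draw S_n = {(x_i, y_i)} from y_i = f*(x_i) + eps_i (illustrative choices)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-3.0, 3.0, size=(n, d))              # x_i in R^d, random
    f_star = lambda Z: np.sin(Z[:, 0]) + 0.5 * Z[:, 0]   # hypothetical target f*
    y = f_star(X) + noise_std * rng.standard_normal(n)   # eps_i in R
    return X, y

X, y = sample_data(n=200)
```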

SLIDE 7

Outline

Learning with kernels
Data Dependent Subsampling

SLIDE 8

Non-linear/non-parametric learning

$$f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$$

SLIDE 9

Non-linear/non-parametric learning

$$f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$$

◮ $q$ non-linear function

SLIDE 10

Non-linear/non-parametric learning

$$f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$$

◮ $q$ non-linear function
◮ $w_i \in \mathbb{R}^d$ centers

SLIDE 11

Non-linear/non-parametric learning

$$f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$$

◮ $q$ non-linear function
◮ $w_i \in \mathbb{R}^d$ centers
◮ $c_i \in \mathbb{R}$ coefficients

SLIDE 12

Non-linear/non-parametric learning

$$f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$$

◮ $q$ non-linear function
◮ $w_i \in \mathbb{R}^d$ centers
◮ $c_i \in \mathbb{R}$ coefficients
◮ $M = M_n$ could/should grow with $n$

SLIDE 13

Non-linear/non-parametric learning

$$f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$$

◮ $q$ non-linear function
◮ $w_i \in \mathbb{R}^d$ centers
◮ $c_i \in \mathbb{R}$ coefficients
◮ $M = M_n$ could/should grow with $n$

Question: How to choose $w_i$, $c_i$ and $M$ given $S_n$?
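A minimal sketch of evaluating such an expansion. Taking $q$ to be a Gaussian and picking the centers and coefficients by hand are illustrative assumptions; the slides keep $q$, $w_i$, $c_i$ generic and answer the question above in the slides that follow.

```python
import numpy as np

def q_gauss(X, W, sigma=1.0):
    """q(x, w) = exp(-||x - w||^2 / (2 sigma^2)), for all pairs of rows of X and W."""
    sq = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def f(X, W, c, sigma=1.0):
    """f(x) = sum_{i=1}^M c_i q(x, w_i), evaluated at each row of X."""
    return q_gauss(X, W, sigma) @ c

# toy usage: M = 3 hand-picked centers in R^2 with arbitrary coefficients
W = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
c = np.array([0.7, -0.2, 1.1])
X = np.random.default_rng(0).standard_normal((5, 2))
print(f(X, W, c))
```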

SLIDE 14

Learning with Positive Definite Kernels

There is an elegant answer if:

◮ $q$ is symmetric
◮ all the matrices $Q_{ij} = q(x_i, x_j)$ are positive semi-definite¹

¹ They have non-negative eigenvalues.

SLIDE 15

Learning with Positive Definite Kernels

There is an elegant answer if:

◮ $q$ is symmetric
◮ all the matrices $Q_{ij} = q(x_i, x_j)$ are positive semi-definite¹

Representer Theorem (Kimeldorf, Wahba ’70; Schölkopf et al. ’01)

◮ $M = n$,
◮ $w_i = x_i$,
◮ $c_i$ by convex optimization!

¹ They have non-negative eigenvalues.

SLIDE 16

Kernel Ridge Regression (KRR) a.k.a. Penalized Least Squares

$$\hat f_\lambda = \operatorname*{argmin}_{f \in \mathcal{H}}\; \frac{1}{n}\sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 + \lambda \|f\|^2$$

SLIDE 17

Kernel Ridge Regression (KRR) a.k.a. Penalized Least Squares

$$\hat f_\lambda = \operatorname*{argmin}_{f \in \mathcal{H}}\; \frac{1}{n}\sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 + \lambda \|f\|^2$$

where

$$\mathcal{H} = \Big\{\, f \;\Big|\; f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i),\; c_i \in \mathbb{R},\; \underbrace{w_i \in \mathbb{R}^d}_{\text{any center!}},\; \underbrace{M \in \mathbb{N}}_{\text{any length!}} \,\Big\}$$

SLIDE 18

Kernel Ridge Regression (KRR) a.k.a. Penalized Least Squares

$$\hat f_\lambda = \operatorname*{argmin}_{f \in \mathcal{H}}\; \frac{1}{n}\sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 + \lambda \|f\|^2$$

where

$$\mathcal{H} = \Big\{\, f \;\Big|\; f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i),\; c_i \in \mathbb{R},\; \underbrace{w_i \in \mathbb{R}^d}_{\text{any center!}},\; \underbrace{M \in \mathbb{N}}_{\text{any length!}} \,\Big\}$$

Solution

$$\hat f_\lambda(x) = \sum_{i=1}^{n} c_i\, q(x, x_i) \qquad \text{with} \qquad c = \big(\hat Q + \lambda n I\big)^{-1} \hat y$$
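A minimal sketch of this closed-form solution; the Gaussian kernel, the synthetic data and the value of $\lambda$ are illustrative assumptions.

```python
import numpy as np

def q_gauss(A, B, sigma=1.0):
    """Kernel matrix with entries q(a_i, b_j); a Gaussian q is an assumed choice."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def krr_fit(X, y, lam, sigma=1.0):
    """Solve c = (Q + lam * n * I)^{-1} y: space O(n^2), time O(n^3)."""
    n = X.shape[0]
    Q = q_gauss(X, X, sigma)
    return np.linalg.solve(Q + lam * n * np.eye(n), y)

def krr_predict(X_new, X_train, c, sigma=1.0):
    """f_lambda(x) = sum_i c_i q(x, x_i)."""
    return q_gauss(X_new, X_train, sigma) @ c

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
c = krr_fit(X, y, lam=1.0 / np.sqrt(len(y)))
print(krr_predict(X[:5], X, c))
```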

SLIDE 19

KRR: Statistics

SLIDE 20

KRR: Statistics

Well understood statistical properties:

Classical Theorem

If $f^* \in \mathcal{H}$, then for $\lambda^* = \frac{1}{\sqrt{n}}$:
$$\mathbb{E}\,\big(\hat f_{\lambda^*}(x) - f^*(x)\big)^2 \lesssim \frac{1}{\sqrt{n}}$$

SLIDE 21

KRR: Statistics

Well understood statistical properties:

Classical Theorem

If $f^* \in \mathcal{H}$, then for $\lambda^* = \frac{1}{\sqrt{n}}$:
$$\mathbb{E}\,\big(\hat f_{\lambda^*}(x) - f^*(x)\big)^2 \lesssim \frac{1}{\sqrt{n}}$$

Remarks

SLIDE 22

KRR: Statistics

Well understood statistical properties:

Classical Theorem

If $f^* \in \mathcal{H}$, then for $\lambda^* = \frac{1}{\sqrt{n}}$:
$$\mathbb{E}\,\big(\hat f_{\lambda^*}(x) - f^*(x)\big)^2 \lesssim \frac{1}{\sqrt{n}}$$

Remarks

1. Optimal nonparametric bound
SLIDE 23

KRR: Statistics

Well understood statistical properties:

Classical Theorem

If $f^* \in \mathcal{H}$, then for $\lambda^* = \frac{1}{\sqrt{n}}$:
$$\mathbb{E}\,\big(\hat f_{\lambda^*}(x) - f^*(x)\big)^2 \lesssim \frac{1}{\sqrt{n}}$$

Remarks

1. Optimal nonparametric bound
2. Results for general kernels (e.g. splines/Sobolev etc.)

$$\lambda^* = n^{-\frac{1}{2s+1}}, \qquad \mathbb{E}\,\big(\hat f_{\lambda^*}(x) - f^*(x)\big)^2 \lesssim n^{-\frac{2s}{2s+1}}$$

SLIDE 24

KRR: Statistics

Well understood statistical properties:

Classical Theorem

If $f^* \in \mathcal{H}$, then for $\lambda^* = \frac{1}{\sqrt{n}}$:
$$\mathbb{E}\,\big(\hat f_{\lambda^*}(x) - f^*(x)\big)^2 \lesssim \frac{1}{\sqrt{n}}$$

Remarks

1. Optimal nonparametric bound
2. Results for general kernels (e.g. splines/Sobolev etc.)

$$\lambda^* = n^{-\frac{1}{2s+1}}, \qquad \mathbb{E}\,\big(\hat f_{\lambda^*}(x) - f^*(x)\big)^2 \lesssim n^{-\frac{2s}{2s+1}}$$

3. Adaptive tuning via cross validation
SLIDE 25

KRR: Optimization

$$\hat f_\lambda(x) = \sum_{i=1}^{n} c_i\, q(x, x_i) \qquad \text{with} \qquad c = \big(\hat Q + \lambda n I\big)^{-1} \hat y$$

Linear System: $\big(\hat Q + \lambda n I\big)\, c = \hat y$

Complexity

◮ Space $O(n^2)$
◮ Time $O(n^3)$

SLIDE 26

KRR: Optimization

$$\hat f_\lambda(x) = \sum_{i=1}^{n} c_i\, q(x, x_i) \qquad \text{with} \qquad c = \big(\hat Q + \lambda n I\big)^{-1} \hat y$$

Linear System: $\big(\hat Q + \lambda n I\big)\, c = \hat y$

Complexity

◮ Space $O(n^2)$
◮ Time $O(n^3)$

BIG DATA?

Running out of space before running out of time... Can this be fixed?

SLIDE 27

Outline

Learning with kernels
Data Dependent Subsampling

SLIDE 28

Subsampling

1. pick $w_i$ at random...
SLIDE 29

Subsampling

1. pick $w_i$ at random... from training set (Smola, Schölkopf ’00)

$$\{\tilde w_1, \ldots, \tilde w_M\} \subset \{x_1, \ldots, x_n\}, \qquad M \ll n$$

SLIDE 30

Subsampling

1. pick $w_i$ at random... from training set (Smola, Schölkopf ’00)

$$\{\tilde w_1, \ldots, \tilde w_M\} \subset \{x_1, \ldots, x_n\}, \qquad M \ll n$$

2. perform KRR on

$$\mathcal{H}_M = \Big\{\, f \;\Big|\; f(x) = \sum_{i=1}^{M} c_i\, q(x, \tilde w_i),\; c_i \in \mathbb{R},\; \cancel{w_i \in \mathbb{R}^d},\; \cancel{M \in \mathbb{N}} \,\Big\}$$

SLIDE 31

Subsampling

1. pick $w_i$ at random... from training set (Smola, Schölkopf ’00)

$$\{\tilde w_1, \ldots, \tilde w_M\} \subset \{x_1, \ldots, x_n\}, \qquad M \ll n$$

2. perform KRR on

$$\mathcal{H}_M = \Big\{\, f \;\Big|\; f(x) = \sum_{i=1}^{M} c_i\, q(x, \tilde w_i),\; c_i \in \mathbb{R},\; \cancel{w_i \in \mathbb{R}^d},\; \cancel{M \in \mathbb{N}} \,\Big\}$$

Linear System: $\hat Q_M\, c = \hat y$

Complexity

◮ Space $O(n^2) \to O(nM)$
◮ Time $O(n^3) \to O(nM^2)$

SLIDE 32

Subsampling

1. pick $w_i$ at random... from training set (Smola, Schölkopf ’00)

$$\{\tilde w_1, \ldots, \tilde w_M\} \subset \{x_1, \ldots, x_n\}, \qquad M \ll n$$

2. perform KRR on

$$\mathcal{H}_M = \Big\{\, f \;\Big|\; f(x) = \sum_{i=1}^{M} c_i\, q(x, \tilde w_i),\; c_i \in \mathbb{R},\; \cancel{w_i \in \mathbb{R}^d},\; \cancel{M \in \mathbb{N}} \,\Big\}$$

Linear System: $\hat Q_M\, c = \hat y$

Complexity

◮ Space $O(n^2) \to O(nM)$
◮ Time $O(n^3) \to O(nM^2)$

What about statistics? What’s the price for efficient computations?
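A minimal sketch of this subsampled variant with uniformly chosen centers (plain Nyström). The Gaussian kernel is an illustrative assumption, and the $M \times M$ system solved below, $(\hat Q_M^\top \hat Q_M + \lambda n\, Q_{MM})\, c = \hat Q_M^\top \hat y$, is one standard way of carrying out step 2 without ever forming the $n \times n$ matrix, consistent with the $O(nM)$ space / $O(nM^2)$ time regime.

```python
import numpy as np

def q_gauss(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def nystrom_krr_fit(X, y, M, lam, sigma=1.0, seed=0):
    """KRR restricted to H_M, with M centers sampled uniformly from the training set.

    Solves (Q_nM^T Q_nM + lam * n * Q_MM) c = Q_nM^T y; the n x n kernel matrix
    is never built, so space is O(nM) and time O(nM^2).
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    W = X[rng.choice(n, size=M, replace=False)]             # the centers w~_1 .. w~_M
    Q_nM = q_gauss(X, W, sigma)                              # n x M
    Q_MM = q_gauss(W, W, sigma)                              # M x M
    A = Q_nM.T @ Q_nM + lam * n * Q_MM
    c = np.linalg.solve(A + 1e-10 * np.eye(M), Q_nM.T @ y)  # small jitter for stability
    return W, c

def nystrom_predict(X_new, W, c, sigma=1.0):
    return q_gauss(X_new, W, sigma) @ c

# toy usage with M ~ sqrt(n) and lambda ~ 1/sqrt(n)
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(2000)
n = len(y)
W, c = nystrom_krr_fit(X, y, M=int(np.sqrt(n)), lam=1.0 / np.sqrt(n))
print(nystrom_predict(X[:5], W, c))
```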

SLIDE 33

Putting our Result in Context

◮ *Many* different subsampling schemes

(Smola, Scholkopf ’00; Williams, Seeger ’01; . . . 20+)

SLIDE 34

Putting our Result in Context

◮ *Many* different subsampling schemes
(Smola, Scholkopf ’00; Williams, Seeger ’01; . . . 20+)

◮ Theoretical guarantees mainly on matrix approximation
(Mahoney and Drineas ’09; Cortes et al. ’10; Kumar et al. ’12; . . . 10+)

$$\|Q - Q_M\| \lesssim \frac{1}{\sqrt{M}}$$

SLIDE 35

Putting our Result in Context

◮ *Many* different subsampling schemes
(Smola, Scholkopf ’00; Williams, Seeger ’01; . . . 20+)

◮ Theoretical guarantees mainly on matrix approximation
(Mahoney and Drineas ’09; Cortes et al. ’10; Kumar et al. ’12; . . . 10+)

$$\|Q - Q_M\| \lesssim \frac{1}{\sqrt{M}}$$

◮ Few prediction guarantees, either suboptimal or in restricted settings
(Cortes et al. ’10; Jin et al. ’11; Bach ’13; Alaoui, Mahoney ’14)

SLIDE 36

Main Result

Theorem

If $f^* \in \mathcal{H}$, then for $\lambda^* = \frac{1}{\sqrt{n}}$ and $M^* = \frac{1}{\lambda^*}$:
$$\mathbb{E}\,\big(\hat f_{\lambda^*, M^*}(x) - f^*(x)\big)^2 \lesssim \frac{1}{\sqrt{n}}$$

SLIDE 37

Main Result

Theorem

If $f^* \in \mathcal{H}$, then for $\lambda^* = \frac{1}{\sqrt{n}}$ and $M^* = \frac{1}{\lambda^*}$:
$$\mathbb{E}\,\big(\hat f_{\lambda^*, M^*}(x) - f^*(x)\big)^2 \lesssim \frac{1}{\sqrt{n}}$$

Remarks

SLIDE 38

Main Result

Theorem

If $f^* \in \mathcal{H}$, then for $\lambda^* = \frac{1}{\sqrt{n}}$ and $M^* = \frac{1}{\lambda^*}$:
$$\mathbb{E}\,\big(\hat f_{\lambda^*, M^*}(x) - f^*(x)\big)^2 \lesssim \frac{1}{\sqrt{n}}$$

Remarks

1. Subsampling achieves the optimal bound. . .
SLIDE 39

Main Result

Theorem

If $f^* \in \mathcal{H}$, then for $\lambda^* = \frac{1}{\sqrt{n}}$ and $M^* = \frac{1}{\lambda^*}$:
$$\mathbb{E}\,\big(\hat f_{\lambda^*, M^*}(x) - f^*(x)\big)^2 \lesssim \frac{1}{\sqrt{n}}$$

Remarks

1. Subsampling achieves the optimal bound. . .
2. . . . with $M^* \sim \sqrt{n}$ !!
SLIDE 40

Main Result

Theorem

If $f^* \in \mathcal{H}$, then for $\lambda^* = \frac{1}{\sqrt{n}}$ and $M^* = \frac{1}{\lambda^*}$:
$$\mathbb{E}\,\big(\hat f_{\lambda^*, M^*}(x) - f^*(x)\big)^2 \lesssim \frac{1}{\sqrt{n}}$$

Remarks

1. Subsampling achieves the optimal bound. . .
2. . . . with $M^* \sim \sqrt{n}$ !!
3. More generally,

$$\lambda^* = n^{-\frac{1}{2s+1}}, \qquad M^* = \frac{1}{\lambda^*}, \qquad \mathbb{E}_x\big(\hat f_{\lambda^*, M^*}(x) - f^*(x)\big)^2 \lesssim n^{-\frac{2s}{2s+1}}$$

SLIDE 41

Main Result

Theorem

If $f^* \in \mathcal{H}$, then for $\lambda^* = \frac{1}{\sqrt{n}}$ and $M^* = \frac{1}{\lambda^*}$:
$$\mathbb{E}\,\big(\hat f_{\lambda^*, M^*}(x) - f^*(x)\big)^2 \lesssim \frac{1}{\sqrt{n}}$$

Remarks

1. Subsampling achieves the optimal bound. . .
2. . . . with $M^* \sim \sqrt{n}$ !!
3. More generally,

$$\lambda^* = n^{-\frac{1}{2s+1}}, \qquad M^* = \frac{1}{\lambda^*}, \qquad \mathbb{E}_x\big(\hat f_{\lambda^*, M^*}(x) - f^*(x)\big)^2 \lesssim n^{-\frac{2s}{2s+1}}$$

Note: An interesting insight is obtained by rewriting the result. . .

SLIDE 42

Computational Regularization (CoRe)

A simple idea: “swap” the role of λ and M. . .

SLIDE 43

Computational Regularization (CoRe)

A simple idea: “swap” the role of λ and M. . .

Theorem

If $f^* \in \mathcal{H}$, then
$$M^* = n^{\frac{1}{2s+1}}, \qquad \lambda^* = \frac{1}{M^*}, \qquad \mathbb{E}_x\big(\hat f_{\lambda^*, M^*}(x) - f^*(x)\big)^2 \lesssim n^{-\frac{2s}{2s+1}}$$

SLIDE 44

Computational Regularization (CoRe)

A simple idea: “swap” the role of λ and M. . .

Theorem

If $f^* \in \mathcal{H}$, then
$$M^* = n^{\frac{1}{2s+1}}, \qquad \lambda^* = \frac{1}{M^*}, \qquad \mathbb{E}_x\big(\hat f_{\lambda^*, M^*}(x) - f^*(x)\big)^2 \lesssim n^{-\frac{2s}{2s+1}}$$

◮ λ and M play the same role. . .

. . . new interpretation: subsampling regularizes!

SLIDE 45

Computational Regularization (CoRe)

A simple idea: “swap” the role of λ and M. . .

Theorem

If $f^* \in \mathcal{H}$, then
$$M^* = n^{\frac{1}{2s+1}}, \qquad \lambda^* = \frac{1}{M^*}, \qquad \mathbb{E}_x\big(\hat f_{\lambda^*, M^*}(x) - f^*(x)\big)^2 \lesssim n^{-\frac{2s}{2s+1}}$$

◮ λ and M play the same role. . .

. . . new interpretation: subsampling regularizes!

◮ New natural incremental algorithm...

Algorithm

SLIDE 46

Computational Regularization (CoRe)

A simple idea: “swap” the role of λ and M. . .

Theorem

If $f^* \in \mathcal{H}$, then
$$M^* = n^{\frac{1}{2s+1}}, \qquad \lambda^* = \frac{1}{M^*}, \qquad \mathbb{E}_x\big(\hat f_{\lambda^*, M^*}(x) - f^*(x)\big)^2 \lesssim n^{-\frac{2s}{2s+1}}$$

◮ λ and M play the same role. . .

. . . new interpretation: subsampling regularizes!

◮ New natural incremental algorithm...

Algorithm

1. Pick a center + compute solution
SLIDE 47

Computational Regularization (CoRe)

A simple idea: “swap” the role of λ and M. . .

Theorem

If $f^* \in \mathcal{H}$, then
$$M^* = n^{\frac{1}{2s+1}}, \qquad \lambda^* = \frac{1}{M^*}, \qquad \mathbb{E}_x\big(\hat f_{\lambda^*, M^*}(x) - f^*(x)\big)^2 \lesssim n^{-\frac{2s}{2s+1}}$$

◮ λ and M play the same role. . .

. . . new interpretation: subsampling regularizes!

◮ New natural incremental algorithm...

Algorithm

1. Pick a center + compute solution
2. Pick another center + rank-one update
SLIDE 48

Computational Regularization (CoRe)

A simple idea: “swap” the role of λ and M. . .

Theorem

If $f^* \in \mathcal{H}$, then
$$M^* = n^{\frac{1}{2s+1}}, \qquad \lambda^* = \frac{1}{M^*}, \qquad \mathbb{E}_x\big(\hat f_{\lambda^*, M^*}(x) - f^*(x)\big)^2 \lesssim n^{-\frac{2s}{2s+1}}$$

◮ λ and M play the same role. . .

. . . new interpretation: subsampling regularizes!

◮ New natural incremental algorithm...

Algorithm

1. Pick a center + compute solution
2. Pick another center + rank-one update
3. Pick another center . . .
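A minimal sketch of one way to realize this incremental scheme: each new center borders the $M \times M$ system matrix $G_M = \hat Q_M^\top \hat Q_M + \lambda n\, Q_{MM}$ by one row/column, and its Cholesky factor is extended in roughly $O(nM + M^2)$ per step instead of being recomputed. The Gaussian kernel, the uniform choice of candidate centers and the details below are illustrative assumptions, not the authors' released implementation (see lcsl.github.io/NystromCoRe for that).

```python
import numpy as np
from scipy.linalg import solve_triangular

def q_gauss(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def incremental_core(X, y, X_val, y_val, max_M, lam, sigma=1.0, seed=0):
    """Add Nystrom centers one at a time, extending the Cholesky factor of
    G_M = Q_nM^T Q_nM + lam * n * Q_MM by one row/column per new center,
    and record the validation error as a function of M."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)[:max_M]                    # candidate centers from the training set
    Q_nM = np.zeros((n, 0)); Q_vM = np.zeros((len(X_val), 0))
    W = np.zeros((0, d)); b = np.zeros(0); L = None
    val_errors = []
    for t, j in enumerate(order):
        w = X[j:j + 1]
        k_n = q_gauss(X, w, sigma)[:, 0]                  # kernel of training points vs new center
        k_M = q_gauss(W, w, sigma)[:, 0]                  # kernel of previous centers vs new center
        g = Q_nM.T @ k_n + lam * n * k_M                  # new off-diagonal block of G
        gamma = k_n @ k_n + lam * n * 1.0                 # new diagonal entry (q(w, w) = 1 here)
        if t == 0:
            L = np.array([[np.sqrt(gamma)]])
        else:                                             # bordered (rank-one style) Cholesky update
            l = solve_triangular(L, g, lower=True)
            dM = np.sqrt(max(gamma - l @ l, 1e-12))
            L = np.block([[L, np.zeros((t, 1))], [l[None, :], np.array([[dM]])]])
        Q_nM = np.column_stack([Q_nM, k_n])
        Q_vM = np.column_stack([Q_vM, q_gauss(X_val, w, sigma)[:, 0]])
        W = np.vstack([W, w]); b = np.append(b, k_n @ y)
        c = solve_triangular(L.T, solve_triangular(L, b, lower=True), lower=False)
        val_errors.append(np.mean((Q_vM @ c - y_val) ** 2))
    return W, c, val_errors

# usage sketch: grow M with n and lambda fixed, and watch the validation error
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (1000, 1)); y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(1000)
Xv = rng.uniform(-3, 3, (300, 1)); yv = np.sin(Xv[:, 0])
W, c, errs = incremental_core(X, y, Xv, yv, max_M=60, lam=1.0 / np.sqrt(1000))
```

Tracking the validation error while $M$ grows gives curves like the one on the next slide: the number of centers, rather than $\lambda$, acts as the regularization knob.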
SLIDE 49

CoRe Illustrated

n, λ are fixed

[Figure: validation error (y-axis, 0.03–0.11) as a function of the number of centers M (x-axis, 50–300)]

Computation controls stability! Time/space requirement tailored to generalization

SLIDE 50

Experiments

comparable/better w.r.t. the state of the art

Dataset    | n_tr   | d   | Incremental CoRe  | Standard KRLS | Standard Nyström | Random Features | Fastfood RF
Ins. Co.   | 5822   | 85  | 0.23180 ± 4×10⁻⁵  | 0.231         | 0.232            | 0.266           | 0.264
CPU        | 6554   | 21  | 2.8466 ± 0.0497   | 7.271         | 6.758            | 7.103           | 7.366
CT slices  | 42800  | 384 | 7.1106 ± 0.0772   | NA            | 60.683           | 49.491          | 43.858
Year Pred. | 463715 | 90  | 0.10470 ± 5×10⁻⁵  | NA            | 0.113            | 0.123           | 0.115
Forest     | 522910 | 54  | 0.9638 ± 0.0186   | NA            | 0.837            | 0.840           | 0.840

◮ Random Features (Rahimi, Recht ’07)
◮ Fastfood (Le et al. ’13)

SLIDE 51

Contributions

◮ Optimal learning with data dependent subsampling
◮ Beyond uniform sampling – come to the poster!

SLIDE 52

Contributions

◮ Optimal learning with data dependent subsampling
◮ Beyond uniform sampling – come to the poster!

Some questions:

◮ Beyond ridge regression – SGD and early stopping
◮ Data independent sampling – random features
◮ Beyond randomization – non convex optimization?

SLIDE 53

Contributions

◮ Optimal learning with data dependent subsampling
◮ Beyond uniform sampling – come to the poster!

Some questions:

◮ Beyond ridge regression – SGD and early stopping
◮ Data independent sampling – random features
◮ Beyond randomization – non convex optimization?

Some perspectives:

◮ Computational regularization: subsampling regularizes!
◮ Algorithm design: Control statistics with computations

SLIDE 54

Thank you!

Come to poster N.63 for the details!!

CODE: lcsl.github.io/NystromCoRe

Alessandro Rudi - ale_rudi@mit.edu
Laboratory for Computational and Statistical Learning - lcsl.mit.edu