

SLIDE 1

Stochastic optimization in Hilbert spaces

Aymeric Dieuleveut

SLIDES 2-7

Outline

  • Learning vs Statistics: tradeoffs of large scale learning; algorithm complexity; ERM?
  • Stochastic optimization: why is SGD so useful in learning?
  • A simple case: least mean squares, finite dimension.
  • Higher dimension? RKHS, non-parametric learning.
  • Lower complexity? Column sampling, feature selection.

SLIDES 8-12

Tradeoffs of Large scale learning - Learning

Statistics vs Machine Learning

  Statistics      | Machine Learning
  Estimation      | Learning
  Classifier      | Hypothesis
  Data point      | Example/Instance
  Regression      | Supervised Learning
  Classification  | Supervised Learning
  Covariate       | Feature
  Response        | Label

Essentially, AI people and math people doing the same kind of work. The main differences:

  • Statisticians are more interested in the model and in drawing conclusions about it.
  • Machine learners are more interested in prediction, with a concern for algorithms that handle high-dimensional data.

  • 1. Table taken from www.quora.com/What-is-the-difference-between-statistics-and-machine-learning

SLIDES 13-15

Tradeoffs of Large scale learning - Learning

Framework

We consider the classical risk minimization problem. Given:
  • a space of input-output pairs (x, y) ∈ X × Y, with probability distribution P(x, y),
  • a loss function ℓ : Y × Y → R,
  • a class of functions F,
the risk of a function f : X → Y is R(f) := E_P[ℓ(f(x), y)]. Our aim is

    min_{f ∈ F} R(f).

R is unknown. Given a sequence of i.i.d. data points (x_i, y_i)_{i=1..n} ∼ P^{⊗n}, we can define the empirical risk

    R_n(f) = \frac{1}{n} \sum_{i=1}^n ℓ(f(x_i), y_i).
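
As a concrete illustration, here is a minimal sketch of the empirical risk with the squared loss (the predictor, loss and data below are illustrative assumptions, not from the slides):

```python
import numpy as np

def empirical_risk(f, X, Y, loss=lambda pred, y: (pred - y) ** 2):
    """Empirical risk R_n(f) = (1/n) * sum_i loss(f(x_i), y_i)."""
    return np.mean([loss(f(x), y) for x, y in zip(X, Y)])

# Toy usage: risk of a linear predictor on a simulated sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta = np.array([1.0, -2.0, 0.5])
Y = X @ theta + 0.1 * rng.normal(size=100)
print(empirical_risk(lambda x: x @ theta, X, Y))
```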

SLIDES 16-20

Tradeoffs of Large scale learning - Learning

The bias-variance tradeoff

a.k.a. the estimation-approximation error decomposition. There are many ways of seeing it:
  • constraint case,
  • penalized case,
  • other regularizations.

Thus a compromise: εapp + εest.

          εapp   εest
  F ր      ց      ր

This is the classical setting.

SLIDES 21-26

Tradeoffs of Large scale learning - Learning

Adding an optimization term

When we face large datasets, it may be impractical, and even useless, to optimize the estimator to high accuracy. We then question the choice of an algorithm from a fixed-time-budget point of view. 2

This raises the following points:
  • up to which precision is it necessary to optimize?
  • which is the limiting factor (time, data points)?

A problem is said to be large scale when time is the limiting factor. For large scale problems:
  • which algorithm?
  • more data, less work? (if time is limiting)

  • 2. Ref: [Shalev-Schwartz and Srebro, 2008, Shalev-Schwartz and K., 2011, Bottou and Bousquet, 2008]

SLIDES 27-31

Tradeoffs of Large scale learning - Learning

Tradeoffs - Large scale learning

How the error terms and the computation time T react when the class F, the sample size n, or the optimization tolerance ε grows:

          F ր    n ր    ε ր
  εapp     ց
  εest     ր      ց
  εopt                    ր
  T        ր      ր      ց

SLIDES 32-33

Tradeoffs of Large scale learning - Learning

Different algorithms

To minimize the empirical risk, a number of algorithms may be considered:
  • gradient descent,
  • second order gradient descent,
  • stochastic gradient descent,
  • fast stochastic algorithms (requiring high memory storage).

Let's compare the first order methods: SGD and GD.

SLIDES 34-37

Tradeoffs of Large scale learning - Learning

Stochastic gradient algorithms

Aim: min_f R(f), where we only access unbiased estimates of R(f) and ∇R(f).

  1. Start at some f_0.
  2. Iterate: get an unbiased gradient estimate g_k, s.t. E[g_k] = ∇R(f_k), and set f_{k+1} ← f_k − γ_k g_k.
  3. Output f_m, or the average \bar{f}_m := \frac{1}{m} \sum_{k=1}^m f_k (averaged SGD).

Gradient descent: same, but with the "true" gradient.
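
A minimal sketch of this procedure in code (the gradient oracle, data stream and step sizes below are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def averaged_sgd(grad_oracle, f0, n_steps, step=lambda k: 1.0 / np.sqrt(k)):
    """Averaged SGD: f_{k+1} = f_k - gamma_k * g_k, where g_k is an unbiased
    estimate of grad R(f_k); returns the average of the iterates."""
    f = np.array(f0, dtype=float)
    avg = np.zeros_like(f)
    for k in range(1, n_steps + 1):
        g = grad_oracle(f)           # E[g] = grad R(f)
        f = f - step(k) * g
        avg += (f - avg) / k         # running average of f_1, ..., f_k
    return avg

# Toy oracle for R(f) = E[(f^T x - y)^2] with a simulated stream of (x, y).
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -1.0])

def grad_oracle(f):
    x = rng.normal(size=2)
    y = x @ theta_star + 0.1 * rng.normal()
    return 2 * (f @ x - y) * x       # unbiased gradient of the quadratic risk

print(averaged_sgd(grad_oracle, f0=np.zeros(2), n_steps=5000))
```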

SLIDES 38-40

Tradeoffs of Large scale learning - Learning

ERM

SGD in ERM, min_{f ∈ F} R_n(f):
  • pick any (x_i, y_i) from the empirical sample,
  • g_k = ∇_f ℓ(f_k, (x_i, y_i)),
  • f_{k+1} ← f_k − γ_k g_k,
  • output \bar{f}_m.
  • R_n(\bar{f}_m) − R_n(f_n^*) ≤ O(1/\sqrt{m}),   sup_{f ∈ F} |R − R_n|(f) ≤ O(1/\sqrt{n}).
  • Cost of one iteration: O(d).

GD in ERM, min_{f ∈ F} R_n(f):
  • g_k = ∇_f \frac{1}{n} \sum_{i=1}^n ℓ(f_k, (x_i, y_i)) = ∇_f R_n(f_k),
  • f_{k+1} ← f_k − γ_k g_k,
  • output f_m.
  • R_n(f_m) − R_n(f_n^*) ≤ O((1 − κ)^m),   sup_{f ∈ F} |R − R_n|(f) ≤ O(1/\sqrt{n}).
  • Cost of one iteration: O(nd).

Combining the optimization and estimation errors for the averaged SGD iterate: R(\bar{f}_m) − R(f^*) ≤ O(1/\sqrt{m}) + O(1/\sqrt{n}), with step-size γ_k proportional to 1/\sqrt{k}.

SLIDES 41-43

Tradeoffs of Large scale learning - Learning

Conclusion

In the large scale setting, it is beneficial to use SGD!

Does more data help? With the global estimation error fixed, it seems that

    T ≃ \frac{1}{R(f_m) − R(f^*) − 1/\sqrt{n}}

is decreasing with n.

Upper bounding R_n − R uniformly is dangerous: we also have to compare to one-pass SGD, which minimizes the true risk R.

SLIDES 44-47

Tradeoffs of Large scale learning - Learning

Expectation minimization

Stochastic gradient descent may also be used to minimize R(f) directly.

SGD in ERM, min_{f ∈ F} R_n(f):
  • pick any (x_i, y_i) from the empirical sample,
  • g_k = ∇_f ℓ(f_k, (x_i, y_i)),
  • f_{k+1} ← f_k − γ_k g_k,
  • output \bar{f}_m.
  • R_n(\bar{f}_m) − R_n(f_n^*) ≤ O(1/\sqrt{m}),   sup_{f ∈ F} |R − R_n|(f) ≤ O(1/\sqrt{n}).
  • Cost of one iteration: O(d).

One-pass SGD, min_{f ∈ F} R(f):
  • pick an independent (x, y),
  • g_k = ∇_f ℓ(f_k, (x, y)),
  • f_{k+1} ← f_k − γ_k g_k,
  • output \bar{f}_k, k ≤ n.
  • R(\bar{f}_k) − R(f^*) ≤ O(1/\sqrt{k}).
  • Cost of one iteration: O(d).

SGD with one pass (early stopping as a regularization) achieves a nearly optimal bias-variance tradeoff with low complexity.

SLIDE 48

Tradeoffs of Large scale learning - Learning

Rate of convergence

We are interested in prediction.
  • Strongly convex objective: 1/(µn).
  • Non strongly convex: 1/\sqrt{n}.

SLIDE 49

A case study - Finite dimension linear least mean squares

LMS [Bach and Moulines, 2013]

We now consider the simple case where X = R^d and the loss ℓ is quadratic. We are interested in linear predictors:

    min_{θ ∈ R^d} E_P[(θ^T x − y)^2].

We assume that the data points are generated according to y_i = θ_*^T x_i + ε_i, and consider the stochastic gradient algorithm: starting from some θ_0,

    θ_{n+1} = θ_n − γ_n (θ_n^T x_n − y_n) x_n.

This recursion may be rewritten as

    θ_{n+1} − θ_* = (I − γ_n x_n x_n^T)(θ_n − θ_*) − γ_n ξ_n.    (1)
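
A minimal sketch of this recursion with iterate averaging (the data generation is illustrative; the constant step of order 1/E[‖x‖²] is used here in the spirit of [Bach and Moulines, 2013], which takes γ = 1/(4R²) with R² a bound on ‖x‖²):

```python
import numpy as np

def averaged_lms(X, Y, gamma):
    """SGD on the quadratic risk: theta_{n+1} = theta_n - gamma*(theta_n.x_n - y_n)*x_n,
    returning the averaged iterate bar(theta)_n."""
    theta = np.zeros(X.shape[1])
    theta_bar = np.zeros_like(theta)
    for n, (x, y) in enumerate(zip(X, Y), start=1):
        theta = theta - gamma * (theta @ x - y) * x
        theta_bar += (theta - theta_bar) / n     # running average of the iterates
    return theta_bar

# Toy data: y_i = theta_*^T x_i + eps_i.
rng = np.random.default_rng(1)
d, n = 5, 20000
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
Y = X @ theta_star + 0.5 * rng.normal(size=n)
R2 = np.mean(np.sum(X ** 2, axis=1))             # estimate of E[||x||^2]
print(averaged_lms(X, Y, gamma=1.0 / (4 * R2)))  # should be close to theta_star
```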

SLIDES 50-53

A case study - Finite dimension linear least mean squares

Rate of convergence, back again!

We are interested in prediction.
  • Strongly convex objective: 1/(µn).
  • Non strongly convex: 1/\sqrt{n}.

We define H = E[xx^T]; we have µ = min Sp(H).

For least mean squares, the statistical rate of the ordinary least-squares estimator is σ^2 d / n: there is still a gap to be bridged!

SLIDES 54-56

A case study - Finite dimension linear least mean squares

A few assumptions

We define H = E[xx^T] and C = E[ξξ^T].
  • Bounded noise variance: we assume C ≼ σ^2 H.
  • Covariance operator: no assumption on the minimal eigenvalue; E[‖x‖^2] ≤ R^2.

SLIDE 57

A case study - Finite dimension linear least mean squares

Result

Theorem.

    E[R(\bar{θ}_n) − R(θ_*)] ≤ \frac{4}{n} (σ^2 d + R^2 ‖θ_0 − θ_*‖^2).

Optimal statistical rate 1/n without strong convexity.

SLIDES 58-61

Non parametric learning

Outline

What if d >> n? Carry the analysis over to a Hilbert space, using reproducing kernel Hilbert spaces.
  • Non parametric regression in RKHS: an interesting problem in itself.
  • Behaviour in finite dimension: adaptivity, tradeoffs.
  • Optimal statistical rates in RKHS: choice of γ.

SLIDES 62-64

Non parametric learning

Reproducing kernel Hilbert space [Dieuleveut and Bach, 2014]

We denote by H_K a Hilbert space of functions, H_K ⊂ R^X, characterized by the kernel function K : X × X → R:
  • for any x, K_x : X → R defined by K_x(x') = K(x, x') is in H_K;
  • reproducing property: for all g ∈ H_K and x ∈ X, g(x) = ⟨g, K_x⟩_K.

Two usages:
  α) a hypothesis space for regression;
  β) mapping data points into a linear space.

SLIDES 65-66

Non parametric learning

α) A hypothesis space for regression

Classical regression setting: (X_i, Y_i) ∼ ρ i.i.d., with (X_i, Y_i) ∈ X × R. Goal: minimize the prediction error

    min_{g ∈ L^2} E[(g(X) − Y)^2].

We look for an estimator \hat{g}_n of g_ρ(X) = E[Y | X], with g_ρ ∈ L^2_{ρ_X}, where

    L^2_{ρ_X} = { f : X → R  such that  ∫ f^2(t) dρ_X(t) < ∞ }.

SLIDE 67

Non parametric learning

β) Mapping data points into a linear space

Linear regression on data mapped into some RKHS:

    arg min_{θ ∈ H} ‖Y − Xθ‖^2.

SLIDES 68-72

Non parametric learning

Two approaches to the regression problem

Link: in general H_K ⊂ L^2_{ρ_X}, and in some cases the closure of H_K for ‖·‖_{L^2_{ρ_X}} is the whole of L^2_{ρ_X}. We then look for an estimator of the regression function in the RKHS.

  • General regression problem: g_ρ ∈ L^2.
  • Linear regression problem in the RKHS.

We look for an estimator of the first problem using the natural algorithms for the second one.


SLIDE 74

Non parametric learning

SGD algorithm in the RKHS

Starting from g_0 ∈ H_K (we often consider g_0 = 0), the iterates can be written as

    g_n = \sum_{i=1}^n a_i K_{x_i},    (2)

with (a_n)_n such that

    a_n := −γ_n (g_{n−1}(x_n) − y_n) = −γ_n \Big( \sum_{i=1}^{n−1} a_i K(x_n, x_i) − y_n \Big).

Indeed, g_n = g_{n−1} − γ_n (g_{n−1}(x_n) − y_n) K_{x_n} = \sum_{i=1}^n a_i K_{x_i} with a_n defined as above, and (g_{n−1}(x_n) − y_n) K_{x_n} is an unbiased estimate of the gradient of \frac{1}{2} E[(⟨K_x, g_{n−1}⟩ − y)^2]. The SGD algorithm in the RKHS thus takes a very simple form.
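
A minimal sketch of this recursion with a Gaussian kernel (the kernel, constant step size and data are illustrative assumptions; the guarantees in the next slides concern the averaged iterate, while this sketch returns the last iterate):

```python
import numpy as np

def gaussian_kernel(x, xp, bandwidth=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * bandwidth ** 2))

def kernel_sgd(X, Y, gamma=0.5, kernel=gaussian_kernel):
    """Kernel SGD: g_n = sum_i a_i K_{x_i}, with a_n = -gamma * (g_{n-1}(x_n) - y_n)."""
    a = []                                         # coefficients a_1, ..., a_n
    for n, (x, y) in enumerate(zip(X, Y)):
        g_prev_xn = sum(a_i * kernel(x, X[i]) for i, a_i in enumerate(a))
        a.append(-gamma * (g_prev_xn - y))         # step n costs n kernel evaluations
    def g(x_new):                                  # the estimator g_n
        return sum(a_i * kernel(x_new, X[i]) for i, a_i in enumerate(a))
    return g

# Toy usage: one pass over noisy observations of a smooth function.
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(400, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=400)
g_hat = kernel_sgd(X, Y)
print(g_hat(np.array([1.0])), np.sin(1.0))         # rough one-pass estimate vs target
```

Note that step n requires n kernel evaluations, which is the quadratic complexity addressed in the last section.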

SLIDE 75

Non parametric learning

Assumptions

Two important points characterize the difficulty of the problem:
  • the regularity of the objective function,
  • the spectrum of the covariance operator.

SLIDES 76-77

Non parametric learning

Covariance operator

We have Σ = E[K_x ⊗ K_x], where K_x ⊗ K_x : g ↦ ⟨K_x, g⟩ K_x = g(x) K_x. The covariance operator is a self-adjoint operator which contains information on the distribution of K_x.

Assumptions:
  • tr(Σ^α) < ∞, for some α ∈ [0; 1];
  • on g_ρ: g_ρ ∈ Σ^r(L^2_{ρ_X}), with r ≥ 0.

SLIDE 78

Non parametric learning

Interpretation

  • The trace condition describes how fast the eigenvalues of Σ decrease.
  • The source condition describes an ellipsoid class of functions (we do not assume g_ρ ∈ H_K).

SLIDES 79-81

Non parametric learning

Result

Theorem. Under a few hidden assumptions:

    E[R(\bar{g}_n) − R(g_ρ)] ≤ O\Big( \frac{σ^2 tr(Σ^α) γ^α}{n^{1−α}} \Big) + O\Big( \frac{‖Σ^{−r} g_ρ‖^2}{(nγ)^{2(r∧1)}} \Big).

  • Bias-variance decomposition.
  • O is a known constant (4 or 8).
  • Finite horizon result here, but it extends to the online setting.
  • Saturation: the bias exponent is capped at r ∧ 1.

SLIDE 82

Non parametric learning

Corollary

Corollary. Assume A1-8. If (1−α)/2 < r < (2−α)/2, then with γ = n^{−(2r+α−1)/(2r+α)} we get the optimal rate

    E[R(\bar{g}_n) − R(g_ρ)] = O( n^{−2r/(2r+α)} ).    (3)
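
The step size in the corollary can be recovered by balancing the two terms of the theorem; a back-of-the-envelope check (assuming r ≤ 1, so that r ∧ 1 = r):

```latex
\[
\text{With } \gamma = n^{-\beta}:\qquad
\underbrace{\frac{\sigma^2\,\mathrm{tr}(\Sigma^\alpha)\,\gamma^\alpha}{n^{1-\alpha}}}_{\text{variance}}
\propto n^{-(\alpha\beta + 1 - \alpha)},
\qquad
\underbrace{\frac{\|\Sigma^{-r}g_\rho\|^2}{(n\gamma)^{2r}}}_{\text{bias}}
\propto n^{-2r(1-\beta)} .
\]
\[
\text{Equating the exponents: }\;
\alpha\beta + 1 - \alpha = 2r(1-\beta)
\;\Longrightarrow\;
\beta = \frac{2r+\alpha-1}{2r+\alpha},
\qquad
2r(1-\beta) = \frac{2r}{2r+\alpha},
\]
which is the rate of the corollary.
```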

SLIDES 83-86

Non parametric learning

Conclusion 1

  • We get the statistically optimal rate of convergence for learning in an RKHS with one-pass SGD.
  • We get insights on how to choose the kernel and the step size.
  • We compare favorably to [Ying and Pontil, 2008, Caponnetto and De Vito, 2007, Tarrès and Yao, 2011].

SLIDES 87-89

Non parametric learning

Conclusion 2

The theorem can be rewritten as

    E[R(\bar{θ}_n) − R(θ_*)] ≤ O\Big( \frac{σ^2 tr(Σ^α) γ^α}{n^{1−α}} \Big) + O\Big( \frac{θ_*^T Σ^{2r−1} θ_*}{(nγ)^{2(r∧1)}} \Big),    (4)

where the ellipsoid condition appears more clearly. Thus SGD:
  • is adaptive to the regularity of the problem;
  • bridges the gap between the different regimes and explains the behaviour when d >> n.

SLIDE 90

The complexity challenge, approximation of the kernel

  1. Tradeoffs of Large scale learning - Learning
  2. A case study - Finite dimension linear least mean squares
  3. Non parametric learning
  4. The complexity challenge, approximation of the kernel

SLIDES 91-93

The complexity challenge, approximation of the kernel

Reducing complexity: sampling methods

However, the complexity of such a method remains quadratic in the number of examples: iteration number n costs n kernel evaluations.

                        Rate     | Complexity
  Finite dimension      d/n      | O(dn)
  Infinite dimension    d_n/n    | O(n^2)

SLIDE 94

The complexity challenge, approximation of the kernel

2 related methods

  • Approximate the kernel matrix: results from [Bach, 2012], extended by [Alaoui and Mahoney, 2014, Rudi et al., 2015].
  • Approximate the kernel: results in this second situation in [Rahimi and Recht, 2008, Dai et al., 2014].

SLIDES 95-96

The complexity challenge, approximation of the kernel

Sharp analysis

We only consider a fixed design setting. We then have to approximate the kernel matrix: instead of computing the whole matrix, we randomly pick a number d_n of its columns. We still get the same estimation errors, leading to:

                        Rate     | Complexity
  Finite dimension      d/n      | O(dn)
  Infinite dimension    d_n/n    | O(n d_n^2)
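
A minimal sketch of column sampling in a fixed-design kernel ridge regression (the ridge formulation, regularization level, kernel and data are illustrative assumptions; [Bach, 2012] contains the actual analysis):

```python
import numpy as np

def nystrom_ridge(K, Y, d_n, lam, rng):
    """Column sampling: keep d_n columns of K and solve a d_n-dimensional
    regularized least-squares problem instead of the full n-dimensional one."""
    n = K.shape[0]
    idx = rng.choice(n, size=d_n, replace=False)   # sampled columns
    K_nm = K[:, idx]                               # n x d_n
    K_mm = K[np.ix_(idx, idx)]                     # d_n x d_n
    # min_a ||Y - K_nm a||^2 + lam * a^T K_mm a ; forming K_nm^T K_nm costs O(n d_n^2)
    a = np.linalg.solve(K_nm.T @ K_nm + lam * K_mm, K_nm.T @ Y)
    return K_nm @ a                                # fitted values on the design points

# Toy usage with a Gaussian kernel matrix on scalar inputs.
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-3, 3, size=500))
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
Y = np.sin(x) + 0.1 * rng.normal(size=500)
fit = nystrom_ridge(K, Y, d_n=50, lam=1e-3, rng=rng)
print(np.mean((fit - np.sin(x)) ** 2))             # small in-sample error
```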

SLIDES 97-98

The complexity challenge, approximation of the kernel

Random feature selection

Many kernels may be represented, thanks to Bochner's theorem, as

    K(x, y) = ∫_W φ(w, x) φ(w, y) dµ(w)

(think of translation-invariant kernels and the Fourier transform). We thus consider the low-rank approximation

    \tilde{K}(x, y) = \frac{1}{d} \sum_{i=1}^{d} φ(x, w_i) φ(y, w_i),   where w_i ∼ µ,

and use this approximation of the kernel in SGD.
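
A minimal sketch of random Fourier features for the Gaussian (translation-invariant) kernel, whose spectral measure µ is Gaussian; the feature map follows [Rahimi and Recht, 2008], while the bandwidth, feature count and data are illustrative:

```python
import numpy as np

def random_fourier_features(X, n_features, bandwidth, rng):
    """phi(w, x) = sqrt(2) * cos(w.x + b), with w ~ N(0, I / bandwidth^2), b ~ U[0, 2*pi].
    Then (1/d) * sum_i phi(w_i, x) * phi(w_i, y) approximates the Gaussian kernel."""
    W = rng.normal(scale=1.0 / bandwidth, size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0) * np.cos(X @ W + b)

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 3))
Phi = random_fourier_features(X, n_features=5000, bandwidth=1.0, rng=rng)
K_approx = Phi @ Phi.T / Phi.shape[1]           # (1/d) * sum_i phi_i(x) phi_i(y)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-sq_dists / 2.0)               # Gaussian kernel, bandwidth 1
print(np.max(np.abs(K_approx - K_exact)))       # small approximation error
```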

SLIDE 99

The complexity challenge, approximation of the kernel

Directions

What I am working on at the moment:
  • random feature selection;
  • tuning the sampling to improve the accuracy of the approximation;
  • acceleration + stochasticity (with Nicolas Flammarion).

SLIDES 100-101

The complexity challenge, approximation of the kernel

Some references

  • Alaoui, A. E. and Mahoney, M. W. (2014). Fast randomized kernel methods with statistical guarantees. CoRR, abs/1411.0306.
  • Bach, F. (2012). Sharp analysis of low-rank kernel matrix approximations. ArXiv e-prints.
  • Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). ArXiv e-prints.
  • Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20.
  • Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368.
  • Dai, B., Xie, B., He, N., Liang, Y., Raj, A., Balcan, M., and Song, L. (2014). Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems 27, pages 3041-3049.
  • Dieuleveut, A. and Bach, F. (2014). Non-parametric stochastic approximation with large step sizes. ArXiv e-prints.
  • Rahimi, A. and Recht, B. (2008). Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems 21, pages 1313-1320.
  • Rudi, A., Camoriano, R., and Rosasco, L. (2015). Less is more: Nyström computational regularization. CoRR, abs/1507.04717.
  • Shalev-Schwartz, S. and K., S. (2011). Theoretical basis for "more data less work".
  • Shalev-Schwartz, S. and Srebro, N. (2008). SVM optimization: inverse dependence on training set size. Proceedings of the International Conference on Machine Learning (ICML).
  • Tarrès, P. and Yao, Y. (2011). Online learning as stochastic approximation of regularization paths. ArXiv e-prints 1103.5538.
  • Ying, Y. and Pontil, M. (2008). Online gradient descent learning algorithms. Foundations of Computational Mathematics, 5.

SLIDE 102

Thank you for your attention!