SLIDE 1

Online Learning

Lorenzo Rosasco

MIT, 9.520

SLIDE 2

About this class

Goal: to introduce theory and algorithms for online learning.

SLIDE 3

Plan

• Different views on online learning
• From batch to online least squares
• Other loss functions
• Theory

SLIDE 4

(Batch) Learning Algorithms

A learning algorithm A is a map from the data space into the hypothesis space, with f_S = A(S), where S = S_n = (x_0, y_0), ..., (x_{n−1}, y_{n−1}). We typically assume that: A is deterministic; A does not depend on the ordering of the points in the training set. Notation: note the (zero-based) numbering of the training set!

SLIDE 5

Online Learning Algorithms

The pure online learning approach is:

let f_1 = init
for n = 1, ...
    f_{n+1} = A(f_n, (x_n, y_n))

The algorithm works sequentially and has a recursive definition.
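A minimal sketch of this loop in Python, with the update rule A passed in as a callable; the names pure_online and update are illustrative placeholders, not from the slides.

```python
from typing import Callable, Iterable, Tuple, TypeVar

F = TypeVar("F")                      # type of the hypothesis, e.g. a weight vector
Sample = Tuple[float, float]          # a single pair (x_n, y_n)

def pure_online(init: F,
                stream: Iterable[Sample],
                update: Callable[[F, Sample], F]) -> F:
    """Pure online learning: f_{n+1} = A(f_n, (x_n, y_n)).

    Only the current hypothesis is kept; past samples are never revisited.
    """
    f = init
    for sample in stream:             # samples arrive one at a time
        f = update(f, sample)         # recursive update A
    return f
```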

SLIDE 6

Online Learning Algorithms (cont.)

A related approach (similar to transductive learning) is:

let f_1 = init
for n = 1, ...
    f_{n+1} = A(f_n, S_n, (x_n, y_n))

Also in this case the algorithm works sequentially and has a recursive definition, but it requires storing the past data S_n.

SLIDE 7

Why Online Learning?

Different motivations/perspectives, which often correspond to different theoretical frameworks:

• Biological plausibility.
• Stochastic approximation.
• Incremental optimization.
• Non-i.i.d. data, game-theoretic view.

SLIDE 8

Online Learning and Stochastic Approximation

Our goal is to minimize the expected risk

I[f] = E_{(x,y)}[V(f(x), y)] = ∫ V(f(x), y) dµ(x, y)

over the hypothesis space H, but the data distribution is not known. The idea is to use the samples to build an approximate solution and to update such a solution as we get more data.

SLIDE 10

Online Learning and Stochastic Approximation (cont.)

More precisely, if we are given samples (x_i, y_i)_i in a sequential fashion, at the n-th step we have an approximation G(f, (x_n, y_n)) of the gradient of I[f]; then we can define a recursion by

let f_0 = init
for n = 0, ...
    f_{n+1} = f_n − γ_n G(f_n, (x_n, y_n))
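A stochastic-approximation sketch of this recursion for the square loss with a linear hypothesis f(x) = w^T x; the data-generating function stands in for the unknown distribution µ, and the step-size schedule is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_point(d=5):
    # stand-in for the unknown distribution µ(x, y) (illustrative assumption)
    x = rng.normal(size=d)
    y = x @ np.ones(d) + 0.1 * rng.normal()
    return x, y

d = 5
w = np.zeros(d)                        # f_0 = init
for n in range(1, 5001):
    x, y = sample_point(d)             # one fresh sample per step
    grad = -2.0 * x * (y - x @ w)      # G(w, (x_n, y_n)): gradient of (y - x^T w)^2
    gamma = 1.0 / (n + 10.0)           # decreasing step size γ_n (illustrative)
    w = w - gamma * grad               # f_{n+1} = f_n - γ_n G(f_n, (x_n, y_n))
```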

SLIDE 12

Incremental Optimization

Here our goal is to minimize the empirical risk I_S[f], or the regularized empirical risk

I_S^λ[f] = I_S[f] + λ||f||^2,

over the hypothesis space H, when the number of points is so big (say n = 10^8 − 10^9) that standard solvers would not be feasible. Memory is the main constraint here.

SLIDE 13

Incremental Optimization (cont.)

In this case we can consider

let f_0 = init
for t = 0, ...
    f_{t+1} = f_t − γ_t G(f_t, (x_{n_t}, y_{n_t}))

where G(f_t, (x_{n_t}, y_{n_t})) is a pointwise estimate of the gradient of I_S or I_S^λ.

Epochs. Note that in this case the number of iterations is decoupled from the index of the training set points, and we can look at the data more than once, that is, consider different epochs.
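A minimal sketch of this incremental scheme on a fixed training set, run for several epochs; the synthetic data, the regularization parameter, and the step-size schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
Y = X @ np.ones(d) + 0.1 * rng.normal(size=n)   # synthetic data (illustrative)

lam = 1e-3                  # λ in the regularized empirical risk (illustrative)
w = np.zeros(d)             # f_0 = init
t = 0
for epoch in range(5):                      # look at the data more than once
    for i in rng.permutation(n):            # n_t: index of the point used at step t
        t += 1
        gamma = 0.05 / np.sqrt(t)           # decreasing step size γ_t (illustrative)
        grad = -2.0 * X[i] * (Y[i] - X[i] @ w) + 2.0 * lam * w   # pointwise gradient of I_S^λ
        w = w - gamma * grad
```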

SLIDE 14

Non i.i.d. data, game theoretic view

If the data are not i.i.d., we can consider a setting where the data form a finite sequence that is disclosed to us in a sequential (possibly adversarial) fashion. Then we can see learning as a two-player game where, at each step, nature chooses a sample (x_i, y_i) and the learner chooses an estimator f_{i+1}. The goal of the learner is to perform as well as if it could view the whole sequence.

SLIDE 16

Plan

• Different views on online learning
• From batch to online least squares
• Other loss functions
• Theory

SLIDE 17

Recalling Least Squares

We start by considering a linear kernel, so that

I_S[f] = (1/n) Σ_{i=0}^{n−1} (y_i − x_i^T w)^2 = (1/n) ||Y − Xw||^2.

Remember that in this case

w_n = (X^T X)^{−1} X^T Y = C_n^{−1} Σ_{i=0}^{n−1} x_i y_i.

(Note that if we regularize, we have (C_n + λI)^{−1} in place of C_n^{−1}.)

Notation: note the (zero-based) numbering of the training set!
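A quick NumPy sketch of the batch solution and its regularized variant; the data and λ are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))                     # rows are x_0^T, ..., x_{n-1}^T
Y = X @ np.ones(d) + 0.1 * rng.normal(size=n)   # synthetic data (illustrative)

C_n = X.T @ X                                   # C_n = Σ_i x_i x_i^T
w_n = np.linalg.solve(C_n, X.T @ Y)             # w_n = (X^T X)^{-1} X^T Y

lam = 1e-2                                      # illustrative λ
w_reg = np.linalg.solve(C_n + lam * np.eye(d), X.T @ Y)   # (C_n + λI)^{-1} X^T Y
```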

SLIDE 19

A Recursive Least Squares Algorithm

Then we can consider

w_{n+1} = w_n + C_{n+1}^{−1} x_n [y_n − x_n^T w_n].

Proof.

w_n = C_n^{−1} (Σ_{i=0}^{n−1} x_i y_i)

w_{n+1} = C_{n+1}^{−1} (Σ_{i=0}^{n−1} x_i y_i + x_n y_n)

w_{n+1} − w_n = C_{n+1}^{−1} (x_n y_n) + C_{n+1}^{−1} (C_n − C_{n+1}) C_n^{−1} Σ_{i=0}^{n−1} x_i y_i

C_{n+1} − C_n = x_n x_n^T.

Substituting C_n − C_{n+1} = −x_n x_n^T and C_n^{−1} Σ_{i=0}^{n−1} x_i y_i = w_n into the last difference gives w_{n+1} − w_n = C_{n+1}^{−1} x_n [y_n − x_n^T w_n], which is the claimed update.

SLIDE 23

A Recursive Least Squares Algorithm (cont.)

We derived the algorithm

w_{n+1} = w_n + C_{n+1}^{−1} x_n [y_n − x_n^T w_n].

The above approach:
• is recursive;
• requires storing all the data;
• requires inverting a matrix C_{n+1} at each step.

SLIDE 24

A Recursive Least Squares Algorithm (cont.)

The following matrix equality allows us to alleviate the computational burden.

Matrix Inversion Lemma

[A + BCD]^{−1} = A^{−1} − A^{−1} B [D A^{−1} B + C^{−1}]^{−1} D A^{−1}

Then

C_{n+1}^{−1} = C_n^{−1} − (C_n^{−1} x_n x_n^T C_n^{−1}) / (1 + x_n^T C_n^{−1} x_n).
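A small sketch of this rank-one inverse update (the Sherman-Morrison special case of the lemma), with a numerical check against direct inversion; all names and the initialization are illustrative.

```python
import numpy as np

def update_inverse(C_inv: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Return (C + x x^T)^{-1} from C^{-1} via the matrix inversion lemma."""
    Cx = C_inv @ x
    return C_inv - np.outer(Cx, Cx) / (1.0 + x @ Cx)

# quick check on random data
rng = np.random.default_rng(0)
d = 4
C = np.eye(d)                 # e.g. start from a regularized C_0 = λI (assumption)
C_inv = np.linalg.inv(C)
for _ in range(10):
    x = rng.normal(size=d)
    C = C + np.outer(x, x)
    C_inv = update_inverse(C_inv, x)
assert np.allclose(C_inv, np.linalg.inv(C))
```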

SLIDE 26

A Recursive Least Squares Algorithm (cont.)

Moreover, since

C_{n+1}^{−1} x_n = C_n^{−1} x_n − (C_n^{−1} x_n x_n^T C_n^{−1} x_n) / (1 + x_n^T C_n^{−1} x_n) = (C_n^{−1} x_n) / (1 + x_n^T C_n^{−1} x_n),

we can derive the algorithm

w_{n+1} = w_n + (C_n^{−1} x_n) / (1 + x_n^T C_n^{−1} x_n) [y_n − x_n^T w_n].

Since the above iteration is equivalent to empirical risk minimization (ERM), the conditions ensuring its convergence – as n → ∞ – are the same as those for ERM.
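Putting the two updates together, here is a sketch of the full recursive least squares iteration; the initialization C_0 = λI is an assumption (a small regularizer) so that the inverse exists from the first step.

```python
import numpy as np

def recursive_least_squares(X: np.ndarray, Y: np.ndarray, lam: float = 1e-3):
    """Recursive LS: w_{n+1} = w_n + C_n^{-1} x_n [y_n - x_n^T w_n] / (1 + x_n^T C_n^{-1} x_n)."""
    n, d = X.shape
    w = np.zeros(d)
    C_inv = np.eye(d) / lam          # C_0^{-1} = (λI)^{-1}, illustrative initialization
    for x, y in zip(X, Y):
        Cx = C_inv @ x
        s = 1.0 + x @ Cx
        w = w + (Cx / s) * (y - x @ w)           # weight update
        C_inv = C_inv - np.outer(Cx, Cx) / s     # inverse update (matrix inversion lemma)
    return w

# usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
Y = X @ np.ones(5) + 0.1 * rng.normal(size=500)
w = recursive_least_squares(X, Y)
```

With this initialization the iterate after n points coincides with the regularized batch solution (λI + Σ_{i<n} x_i x_i^T)^{−1} Σ_{i<n} x_i y_i at every step.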

SLIDE 29

A Recursive Least Squares Algorithm (cont.)

The algorithm

w_{n+1} = w_n + (C_n^{−1} x_n) / (1 + x_n^T C_n^{−1} x_n) [y_n − x_n^T w_n]

is of the form

let f_0 = init
for n = 1, ...
    f_{n+1} = A(f_n, S_n, (x_n, y_n)).

The above approach:
• is recursive;
• requires storing all the data;
• updates the inverse just via matrix/vector multiplications.

SLIDE 31

Recursive Least Squares (cont.)

How about the memory requirement? The main constraint comes from storing and updating the inverse matrix in

w_{n+1} = w_n + (C_n^{−1} x_n) / (1 + x_n^T C_n^{−1} x_n) [y_n − x_n^T w_n].

We can instead consider

w_{n+1} = w_n + γ_n x_n [y_n − x_n^T w_n],

where γ_n is a decreasing sequence.

SLIDE 34

Online Least Squares

The algorithm

w_{n+1} = w_n + γ_n x_n [y_n − x_n^T w_n]

• is recursive;
• does not require storing the data;
• does not require updating an inverse, but only vector/vector multiplications.
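A minimal sketch of this memory-free online least squares update (essentially the LMS/Widrow-Hoff rule); the step-size schedule and the synthetic stream are illustrative assumptions.

```python
import numpy as np

def online_least_squares(stream, d: int, gamma0: float = 0.1):
    """w_{n+1} = w_n + γ_n x_n [y_n - x_n^T w_n], with γ_n decreasing."""
    w = np.zeros(d)
    for n, (x, y) in enumerate(stream, start=1):
        gamma = gamma0 / np.sqrt(n)          # illustrative decreasing step size
        w = w + gamma * x * (y - x @ w)      # rank-one, O(d) update; no data stored
    return w

# usage on a synthetic stream
rng = np.random.default_rng(0)
stream = ((x, x @ np.ones(5) + 0.1 * rng.normal())
          for x in rng.normal(size=(2000, 5)))
w = online_least_squares(stream, d=5)
```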

SLIDE 35

Convergence of Online Least Squares (cont.)

When we consider

w_{n+1} = w_n + γ_n x_n [y_n − x_n^T w_n],

we are no longer guaranteed convergence. In other words: how shall we choose γ_n?

SLIDE 36

Batch vs Online Gradient Descent

Some ideas come from the comparison with batch least squares. The online update is

w_{n+1} = w_n + γ_n x_n [y_n − x_n^T w_n].

Note that −(1/2) ∇_w (y_n − x_n^T w)^2 = x_n [y_n − x_n^T w].

The batch gradient algorithm would instead be

w_{n+1} = w_n + γ_n (1/n) Σ_{i=0}^{n−1} x_i [y_i − x_i^T w_n],

since −(1/2) ∇ I_S(w) = −(1/2) ∇ (1/n) Σ_{i=0}^{n−1} (y_i − x_i^T w)^2 = (1/n) Σ_{i=0}^{n−1} x_i (y_i − x_i^T w).
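For contrast, a sketch of the full-batch gradient iteration on a synthetic data set; the step size and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
Y = X @ np.ones(d) + 0.1 * rng.normal(size=n)   # synthetic data (illustrative)

w = np.zeros(d)
gamma = 0.1                                      # illustrative constant step size
for _ in range(200):
    # full-batch direction: (1/n) Σ_i x_i (y_i - x_i^T w), i.e. minus half the gradient
    w = w + gamma * (X.T @ (Y - X @ w)) / n
```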

SLIDE 38

Convergence of Recursive Least Squares (cont.)

If γ_n decreases too fast, the iterate will get stuck away from the true solution. In general, the convergence of the online iteration is slower than that of recursive least squares, since we use a simpler step size.

Polyak Averaging. If we choose γ_n = n^{−α}, with α ∈ (1/2, 1), and use, at each step, the solution obtained by averaging all previous solutions (namely Polyak averaging), then convergence is ensured and is almost the same as for recursive least squares.
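A sketch of the online update with Polyak averaging, using γ_n = n^{−α}; the running mean of the iterates is returned as the estimator. The data and the value of α are illustrative assumptions.

```python
import numpy as np

def online_ls_polyak(X, Y, alpha: float = 0.75):
    """Online LS with γ_n = n^{-α} and Polyak averaging of the iterates."""
    d = X.shape[1]
    w = np.zeros(d)
    w_bar = np.zeros(d)                      # running average of w_1, ..., w_n
    for n, (x, y) in enumerate(zip(X, Y), start=1):
        gamma = n ** (-alpha)                # γ_n = n^{-α}, α ∈ (1/2, 1)
        w = w + gamma * x * (y - x @ w)
        w_bar += (w - w_bar) / n             # incremental mean of the iterates
    return w_bar

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 5))
Y = X @ np.ones(5) + 0.1 * rng.normal(size=5000)
w_hat = online_ls_polyak(X, Y)
```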

SLIDE 39

Extensions

• Other loss functions
• Other kernels

SLIDE 40

Other Loss functions

The same ideas can be extended to other convex, differentiable loss functions, considering

w_{n+1} = w_n − γ_n ∇_w V(y_n, x_n^T w_n).

If the loss is convex but not differentiable, then more sophisticated methods must be used, e.g. proximal methods or subgradient methods.
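As an example of such an extension, a sketch with the logistic loss V(y, f) = log(1 + exp(−y f)) for labels y ∈ {−1, +1}; the loss choice, step-size schedule, and data are illustrative assumptions.

```python
import numpy as np

def online_logistic(X, Y, gamma0: float = 0.5):
    """Online update w_{n+1} = w_n - γ_n ∇_w V(y_n, x_n^T w_n), logistic loss."""
    d = X.shape[1]
    w = np.zeros(d)
    for n, (x, y) in enumerate(zip(X, Y), start=1):      # y in {-1, +1}
        margin = y * (x @ w)
        grad = -y * x / (1.0 + np.exp(margin))            # ∇_w log(1 + exp(-y x^T w))
        w = w - (gamma0 / np.sqrt(n)) * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
Y = np.sign(X @ np.ones(5) + 0.1 * rng.normal(size=2000))
w = online_logistic(X, Y)
```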

SLIDE 41

Non linear Kernels

If we use different kernels, so that

f(x) = Φ(x)^T w = Σ_{i=0}^{n−1} c_i K(x, x_i),

then we can consider the iteration

c_{n+1}^i = c_n^i,   i = 1, ..., n
c_{n+1}^{n+1} = γ_n [y_n − c_n^T K_n(x_n)],

where K_n(x) = (K(x, x_0), ..., K(x, x_{n−1})) and c_n = (c_n^1, ..., c_n^n).

SLIDE 42

Non linear Kernels (cont.)

Proof.

f_n(x) = w_n^T Φ(x) = K_n(x)^T c_n.

The update w_{n+1} = w_n + γ_n Φ(x_n) [y_n − Φ(x_n)^T w_n] gives, for all x,

f_{n+1}(x) = f_n(x) + γ_n [y_n − K_n(x_n)^T c_n] K(x, x_n),

which can be written as

K_{n+1}(x)^T c_{n+1} = K_n(x)^T c_n + γ_n [y_n − K_n(x_n)^T c_n] K(x, x_n). QED

SLIDE 43

Non linear Kernels

c_{n+1}^i = c_n^i,   i = 1, ..., n
c_{n+1}^{n+1} = γ_n [y_n − c_n^T K_n(x_n)].

The above approach:
• is recursive;
• requires storing all the data;
• does not require updating an inverse, but only vector/vector multiplications – O(n) per step.
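A sketch of this kernelized online update with a Gaussian kernel; the kernel, its width, the step-size schedule, and the data are illustrative assumptions. The coefficient vector and the stored points grow by one at each step, hence the O(n) per-step cost.

```python
import numpy as np

def gaussian_kernel(x, z, sigma: float = 1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def kernel_online_ls(X, Y, gamma0: float = 0.5):
    """Kernelized online update: append c_{n+1}^{n+1} = γ_n [y_n - f_n(x_n)]."""
    points, coeffs = [], []
    for n, (x, y) in enumerate(zip(X, Y), start=1):
        # f_n(x_n) = Σ_i c_i K(x_n, x_i), over the points seen so far
        f_x = sum(c * gaussian_kernel(x, z) for c, z in zip(coeffs, points))
        gamma = gamma0 / np.sqrt(n)          # illustrative decreasing step size
        coeffs.append(gamma * (y - f_x))     # new coefficient; old ones are unchanged
        points.append(x)
    return points, coeffs

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=300)
points, coeffs = kernel_online_ls(X, Y)
```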
