9.54 class 8: Supervised learning, Optimization, regularization, kernels


SLIDE 1

9.54 class 8

Supervised learning: Optimization, regularization, kernels

Shimon Ullman + Tomaso Poggio

Danny Harari + Daniel Zysman + Darren Seibert

9.54, fall semester 2014

SLIDE 2

The Regularization Kingdom

  • Loss functions and empirical risk minimization
  • Basic regularization algorithms
SLIDE 3

Math

SLIDE 4

Given a Training Set

S = (x1, y1), . . . , (xn, yn)

Find

f(x) ∼ y

SLIDE 5

We need a way to measure errors: a loss function

V(f(x), y)

SLIDE 6

  • 0−1 loss: V(f(x), y) = θ(−yf(x)) (θ is the step function)
  • square loss (L2): V(f(x), y) = (f(x) − y)², which equals (1 − yf(x))² for y ∈ {−1, +1}
  • absolute value (L1): V(f(x), y) = |f(x) − y|
  • Vapnik's ε-insensitive loss: V(f(x), y) = (|f(x) − y| − ε)_+
  • hinge loss: V(f(x), y) = (1 − yf(x))_+
  • logistic loss: V(f(x), y) = log(1 + e^{−yf(x)}), logistic regression
  • exponential loss: V(f(x), y) = e^{−yf(x)}
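As a quick illustration (my own sketch, not from the slides), these losses in numpy; the margin-based losses assume labels y ∈ {−1, +1}:

```python
import numpy as np

# f: predicted values f(x); y: targets (y in {-1, +1} for the margin-based losses).

def zero_one_loss(f, y):
    return (y * f <= 0).astype(float)            # theta(-y f(x)), step function

def square_loss(f, y):
    return (f - y) ** 2                          # equals (1 - y f(x))^2 when y = +-1

def absolute_loss(f, y):
    return np.abs(f - y)

def eps_insensitive_loss(f, y, eps=0.1):
    return np.maximum(np.abs(f - y) - eps, 0.0)  # Vapnik's epsilon-insensitive loss

def hinge_loss(f, y):
    return np.maximum(1.0 - y * f, 0.0)

def logistic_loss(f, y):
    return np.log(1.0 + np.exp(-y * f))

def exponential_loss(f, y):
    return np.exp(-y * f)
```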
SLIDE 7

Given a loss function V(f(x), y) we can define the Empirical Error

I_S[f] = (1/n) Σ_{i=1}^n V(f(xi), yi)
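For concreteness, a minimal sketch (mine) of the empirical error of a fixed hypothesis f on a training set S, here scored with the hinge loss:

```python
import numpy as np

def hinge(f, y):
    return np.maximum(1.0 - y * f, 0.0)

def empirical_error(f, S, loss):
    """I_S[f] = (1/n) * sum_i V(f(x_i), y_i) for a training set S = [(x_i, y_i), ...]."""
    xs, ys = zip(*S)
    preds = np.array([f(x) for x in xs])
    return float(np.mean(loss(preds, np.array(ys))))

# Toy example: a fixed linear hypothesis on two labeled points.
S = [(np.array([1.0, 2.0]), 1.0), (np.array([-1.5, 0.5]), -1.0)]
f = lambda x: np.array([0.3, -0.2]) @ x
print(empirical_error(f, S, hinge))
```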

SLIDE 8

"Learning processes do not take place in vacuum."
Cucker and Smale, AMS 2001

We need to fix a Hypotheses Space

H ⊂ F = {f | f : X → Y}

SLIDE 9

  • Linear model: f(x) = Σ_{j=1}^p xj wj
  • Generalized linear models: f(x) = Σ_{j=1}^p Φ(x)j wj
  • Reproducing kernel Hilbert spaces: f(x) = Σ_{j≥1} Φ(x)j wj = Σ_{i≥1} K(x, xi) αi

K(x, x′) is a symmetric positive definite function called a reproducing kernel

parametric non-parametric

H ⊂ F = {f | f : X → Y}

SLIDE 10

  • Linear model: f(x) = Σ_{j=1}^p xj wj
  • Generalized linear models: f(x) = Σ_{j=1}^p Φ(x)j wj
  • Reproducing kernel Hilbert spaces: f(x) = Σ_{j≥1} Φ(x)j wj = Σ_{i≥1} K(x, xi) αi

K(x, x′) is a symmetric positive definite function called a reproducing kernel

parametric semi-parametric

H ⊂ F = {f | f : X → Y}

SLIDE 11

  • Linear model: f(x) = Σ_{j=1}^p xj wj
  • Generalized linear models: f(x) = Σ_{j=1}^p Φ(x)j wj
  • Reproducing kernel Hilbert spaces: f(x) = Σ_{j≥1} Φ(x)j wj = Σ_{i≥1} K(x, xi) αi

K(x, x′) is a symmetric positive definite function called a reproducing kernel

parametric semi-parametric non-parametric

H ⊂ F = {f | f : X → Y}

SLIDE 12

Empirical Risk Minimization (ERM)

min_{f∈H} I_S[f] = min_{f∈H} (1/n) Σ_{i=1}^n V(f(xi), yi)


SLIDE 14

Empirical Risk Minimization (ERM): Which is a good solution?

min_{f∈H} I_S[f] = min_{f∈H} (1/n) Σ_{i=1}^n V(f(xi), yi)

SLIDE 15

[Figure: training set → learning algorithm → hypothesis f; input x (living area of house) → f → predicted y (predicted price)]

S = (x1, y1), . . . , (xn, yn)

The training set is sampled independently and identically (i.i.d.) from a fixed unknown probability distribution p(x, y) = p(x)p(y|x)

SLIDE 16

Learning is an ill-posed problem (Jacques Hadamard)

Ill-posed problems often arise if one tries to infer general laws from few data:
  • the hypothesis space is too large
  • there are not enough data

In general ERM leads to ill-posed solutions because:
  • the solution may be too complex
  • it may not be unique
  • it may change radically when leaving one sample out

Regularization Theory provides results and techniques to restore well-posedness, that is stability (hence generalization)

SLIDE 17

  • Beyond drawings & intuitions (...) there is a deep, rigorous mathematical foundation of regularized learning algorithms (Cucker and Smale, Vapnik and Chervonenkis, ...).
  • The theory of learning is a synthesis of different fields, e.g. Computer Science (Algorithms, Complexity) and Mathematics (Optimization, Probability, Statistics).
  • Central to the theory of machine learning is the problem of understanding the conditions under which ERM can solve

inf_f E(f),   where E(f) = E_{(x,y)}[V(y, f(x))]

SLIDE 18

Algorithms: The Regularization Kingdom

  • loss functions and empirical risk minimization
  • basic regularization algorithms
SLIDE 19

(Tikhonov) Regularization

min_{f∈H} { (1/n) Σ_{i=1}^n V(yi, f(xi)) + λ R(f) } → f_S^λ

λ is the regularization parameter, R is the regularizer

  • The regularizer describes the complexity of the solution: R(f2) is bigger than R(f1) [Figure: a simple function f1 and a more complex function f2]
  • The regularization parameter determines the trade-off between complexity and empirical risk

SLIDE 20

Stability and (Tikhonov) Regularization

Consider f(x) = w^T x = Σ_{j=1}^p wj xj, and R(f) = w^T w.

ERM:

min_{f∈H} (1/n) Σ_{i=1}^n (yi − f(xi))²,   with solution w^T = Y X^T (XX^T)^{−1}

Tikhonov regularization:

min_{f∈H} { (1/n) Σ_{i=1}^n (yi − f(xi))² + λ ‖f‖² },   with solution w^T = Y X^T (XX^T + λI)^{−1}

Math
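To make the stability point concrete, here is a small numpy sketch (my own, under the slide's linear setup, with the xi as columns of X) that computes both solutions and compares how much w changes when one training point is left out; the regularized solution typically moves much less:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 6                                   # nearly as many parameters as samples
X = rng.normal(size=(p, n))                   # columns are the inputs x_i
w_true = rng.normal(size=p)
Y = w_true @ X + 0.1 * rng.normal(size=n)     # row vector of labels

def solve(X, Y, lam):
    """w^T = Y X^T (X X^T + lam*I)^(-1); lam = 0 gives plain ERM / least squares."""
    return Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(X.shape[0]))

for lam in (0.0, 0.1):
    w_all = solve(X, Y, lam)
    w_loo = solve(X[:, 1:], Y[1:], lam)       # leave the first sample out
    print("lambda =", lam, "  change in w:", np.linalg.norm(w_all - w_loo))
```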

SLIDE 21

From Linear to Semi-parametric Models

If instead of a linear model

f(x) = Σ_{j=1}^p xj wj   (linear model)

we have a generalized linear model

f(x) = Σ_{j=1}^p Φ(x)j wj   (generalized linear model)

we simply have to consider the feature matrix Xn with entries (Xn)_{ij} = Φ(xi)j, i = 1, . . . , n, j = 1, . . . , p.
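A minimal sketch of this substitution (my own, with a hypothetical polynomial feature map Φ): the regularized linear solve is unchanged, only the inputs are replaced by Φ(x):

```python
import numpy as np

def phi(x, degree=3):
    """Hypothetical feature map for a scalar input: Phi(x) = (x, x^2, x^3)."""
    return np.array([x ** d for d in range(1, degree + 1)])

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=20)
y = np.sin(3 * x) + 0.1 * rng.normal(size=20)

Xn = np.stack([phi(xi) for xi in x])      # n x p matrix, row i is Phi(x_i)
lam = 1e-2
# Tikhonov-regularized least squares in feature space
# (equivalent (Xn^T Xn + lam I)^(-1) Xn^T y form of the solution on the previous slide).
w = np.linalg.solve(Xn.T @ Xn + lam * np.eye(Xn.shape[1]), Xn.T @ y)

f = lambda xnew: phi(xnew) @ w            # f(x) = sum_j Phi(x)_j w_j
print(f(0.5))
```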

SLIDE 22

From Parametric to Nonparametric Models

Math

Some simple linear algebra shows that

w^T = Y X^T (XX^T)^{−1} = Y (X^T X)^{−1} X^T = C X^T,   since X^T (XX^T)^{−1} = (X^T X)^{−1} X^T

so that

f(x) = w^T x = C X^T x = Σ_{i=1}^n ci xi^T x

We can compute Cn or wn depending on whether n ≤ p. The above result is the most basic form of the Representer Theorem. How about nonparametric models?
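A quick numerical check (mine; the slide states the identity without regularization, and this is the regularized analogue, which holds for any shape of X): the size-p parameterization w and the size-n parameterization C give the same predictor, which is why one can work with whichever of the two is smaller:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, lam = 10, 4, 0.1
X = rng.normal(size=(p, n))     # columns are the training inputs x_i
Y = rng.normal(size=n)

w = Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(p))    # w^T = Y X^T (X X^T + lam I)^(-1)
c = Y @ np.linalg.inv(X.T @ X + lam * np.eye(n))          # C   = Y (X^T X + lam I)^(-1)

xnew = rng.normal(size=p)
print(w @ xnew, sum(ci * (xi @ xnew) for ci, xi in zip(c, X.T)))   # same value
```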

SLIDE 23

From Linear to Nonparametric Models

Note that

f(x) = Σ_{j=1}^p wj xj = Σ_{i=1}^n ci xi^T x,   with xi^T x = Σ_{j=1}^p (xi)_j xj

We can now consider a truly nonparametric model

f(x) = Σ_{j≥1} wj Φ(x)j = Σ_{i=1}^n ci K(x, xi),   where K(x, xi) = Σ_{j≥1} Φ(xi)j Φ(x)j

Math

SLIDE 24

From Linear to Nonparametric Models

We have

Cn = (Xn Xn^T + λn I)^{−1} Yn,   with (Xn Xn^T)_{ij} = xi^T xj

Cn = (Kn + λn I)^{−1} Yn,   with (Kn)_{ij} = K(xi, xj)

We can now consider a truly nonparametric model

f(x) = Σ_{j≥1} wj Φ(x)j = Σ_{i=1}^n ci K(x, xi),   where K(x, xi) = Σ_{j≥1} Φ(xi)j Φ(x)j

Math
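Putting the two displayed equations together, a minimal kernel regularized least squares sketch (my own, using a Gaussian kernel as listed on the next slide): fit c = (Kn + λn I)^{−1} y on the training set, then predict with f(x) = Σi ci K(x, xi):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=0.5):
    return np.exp(-np.linalg.norm(x - z) ** 2 / sigma ** 2)

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(30, 1))                  # training inputs x_i
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=30)

lam = 1e-2
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])   # (K_n)_{ij} = K(x_i, x_j)
c = np.linalg.solve(K + lam * np.eye(len(X)), y)                    # C_n = (K_n + lam I)^(-1) Y_n

def f(x):
    """f(x) = sum_i c_i K(x, x_i)"""
    return sum(ci * gaussian_kernel(x, xi) for ci, xi in zip(c, X))

print(f(np.array([0.3])))
```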

SLIDE 25

Kernels

  • Linear kernel: K(x, x′) = x^T x′
  • Gaussian kernel: K(x, x′) = e^{−‖x − x′‖² / σ²}, σ > 0
  • Polynomial kernel: K(x, x′) = (x^T x′ + 1)^d, d ∈ N
  • Inner product kernel / features: K(x, x′) = Σ_{j=1}^p Φ(x)j Φ(x′)j, with Φ : X → R^p
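A small numpy sketch (mine) of these kernels; it also checks, for the quadratic polynomial kernel in two dimensions, that one explicit feature map Φ reproduces it as an inner product kernel (the particular Φ below is a standard choice, not from the slides):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z                                  # K(x, x') = x^T x'

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / sigma ** 2)

def polynomial_kernel(x, z, d=2):
    return (x @ z + 1.0) ** d                     # K(x, x') = (x^T x' + 1)^d

def phi_quadratic(x):
    """Explicit feature map with Phi(x)^T Phi(x') = (x^T x' + 1)^2 for 2-d inputs."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, z), phi_quadratic(x) @ phi_quadratic(z))   # equal values
```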

SLIDE 26

Reproducing Kernel Hilbert Spaces

Given K, there exists a unique Hilbert space of functions (H, ⟨·, ·⟩) such that

  • Kx := K(x, ·) ∈ H, for all x ∈ X, and
  • f(x) = ⟨f, Kx⟩, for all x ∈ X, f ∈ H.

Note: An RKHS is equivalently defined as a Hilbert space where the evaluation functionals are continuous.

The norm of a function f(x) = Σ_{i=1}^n K(x, xi) ci is given by

‖f‖² = Σ_{i,j=1}^n K(xj, xi) ci cj

and is a natural complexity measure.

Math
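In matrix form this norm is just ‖f‖² = c^T Kn c, with Kn the Gram matrix of the expansion points; a one-line sketch (mine) on a toy Gram matrix:

```python
import numpy as np

K = np.array([[1.0, 0.5],
              [0.5, 1.0]])        # toy Gram matrix, (K_n)_{ij} = K(x_i, x_j)
c = np.array([0.7, -0.2])         # expansion coefficients of f = sum_i c_i K(., x_i)

rkhs_norm_sq = c @ K @ c          # ||f||^2 = sum_{i,j} K(x_j, x_i) c_i c_j
print(rkhs_norm_sq)
```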

SLIDE 27

Extensions: Other Loss Functions

  • V(f(x), y) = (f(x) − y)²   (RLS)
  • V(f(x), y) = (|f(x) − y| − ε)_+   (SVM regression)
  • V(f(x), y) = (1 − yf(x))_+   (SVM classification)
  • V(f(x), y) = log(1 + e^{−yf(x)})   (logistic regression)
  • V(f(x), y) = e^{−yf(x)}   (boosting)

For most loss functions the solution of Tikhonov regularization is of the form

f(x) = Σ_{i=1}^n K(x, xi) ci.

SLIDE 28

Extensions: Other Loss Functions (cont.)

By changing the loss function we change the way we compute the coefficients in the expansion

f(x) = Σ_{i=1}^n K(x, xi) ci.
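To illustrate (a sketch of my own, not an algorithm from the slides): with the square loss the coefficients come from a single linear solve, while with the logistic loss the same kernel expansion can be fit by gradient descent; the predictor keeps the form f(x) = Σi ci K(x, xi) in both cases.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] - X[:, 1])                    # labels in {-1, +1}
K = X @ X.T                                       # linear-kernel Gram matrix
n, lam = len(y), 0.1

# Square loss (RLS): closed form for min (1/n)||y - Kc||^2 + lam * c^T K c.
c_rls = np.linalg.solve(K + lam * n * np.eye(n), y)

# Logistic loss: gradient descent on (1/n) sum_i log(1 + exp(-y_i (Kc)_i)) + lam * c^T K c.
c_log = np.zeros(n)
lr = 1.0 / np.linalg.eigvalsh(K).max()
for _ in range(500):
    f = K @ c_log
    grad = K @ (-y / (1.0 + np.exp(y * f))) / n + 2 * lam * K @ c_log
    c_log -= lr * grad

# Both coefficient vectors define predictors of the form f(x) = sum_i c_i K(x, x_i).
print((np.sign(K @ c_rls) == y).mean(), (np.sign(K @ c_log) == y).mean())
```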

SLIDE 29

  • Regularization avoids overfitting, ensures stability of the solution and generalization
  • There are many different instances of regularization beyond Tikhonov, e.g. early stopping...

min_f { I_S[f] + λ R(f) }

where I_S[f] is the data fit term and R(f) is the complexity/smoothness term
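Early stopping is mentioned but not spelled out on the slide; a minimal sketch (my own) of the idea: run gradient descent on the data-fit term only and stop after a few iterations, so the number of iterations, rather than an explicit λ R(f), controls the complexity of the solution.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 50                                 # more parameters than data: plain ERM overfits
X = rng.normal(size=(n, p))
w_true = np.zeros(p); w_true[:3] = 1.0
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(p)
lr = 1e-3
for t in range(1, 2001):
    grad = X.T @ (X @ w - y) / n              # gradient of the data-fit term only
    w -= lr * grad
    if t in (10, 100, 2000):
        print(t, "iterations, ||w|| =", round(float(np.linalg.norm(w)), 3))
# Stopping early keeps ||w|| small (implicit regularization); running much longer
# lets the iterate fit the noise, which is what explicit regularization also prevents.
```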

SLIDE 30

  • Regularization ensures stability of the solution and generalization
  • There are different instances of regularization beyond Tikhonov, e.g. early stopping

SLIDE 31

Conclusions

  • Regularization Theory provides results and techniques to avoid overfitting (stability is key to generalization)
  • Regularization provides a core set of concepts and techniques to solve a variety of problems
  • Most algorithms can be seen as a form of regularization
SLIDE 67

Hebbian mechanisms can be used for biological supervised learning (Knudsen, 1990)