MACHINE LEARNING: Kernels (2012)


SLIDE 1

MACHINE LEARNING: Kernels

SLIDE 2

Kernels: Intuition

How to separate the red class from the grey class?

[Figure: the data plotted in Cartesian coordinates $(x_1, x_2)$ and in polar coordinates $(r, \theta)$, $\theta \in [0°, 360°]$]

In polar coordinates, the data become linearly separable.

SLIDE 3

Kernels: Intuition

How to separate the red class from the grey class? What is $\phi$?

[Figure: each datapoint $x^i = (x_1^i, x_2^i)$ is mapped to $\phi(x^i)$ in polar coordinates $(r, \theta)$, $\theta \in [0°, 360°]$]

$\phi: x^i \mapsto (r^i, \theta^i)$. Solve for $r^i, \theta^i$ s.t. $x_1^i = r^i \cos\theta^i, \; x_2^i = r^i \sin\theta^i$.

Assume a model (equation) for the transformation. Need at least 3 datapoints to solve.
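This intuition is easy to reproduce numerically. Below is a minimal sketch (assuming NumPy; the ring-shaped data are synthetic) showing that after the polar map the two classes separate with a single threshold on $r$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 100)
# Grey class: inner ring (r ~ 0.5); red class: outer ring (r ~ 2.0).
inner = rng.random(100) < 0.5
r = np.where(inner, 0.5, 2.0) + 0.05 * rng.normal(size=100)
x1, x2 = r * np.cos(theta), r * np.sin(theta)   # not linearly separable

# Polar map phi(x) = (r, theta): the radius alone now separates the classes.
r_hat = np.hypot(x1, x2)
print(np.all((r_hat < 1.25) == inner))           # True
```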

SLIDE 4

Kernels: Intuition

Idea: Send the data X into a feature space H through the nonlinear map $\phi$:

$X = \{x^i\}_{i=1\dots M} \subset \mathbb{R}^N \;\longrightarrow\; \phi(X) = \{\phi(x^1), \dots, \phi(x^M)\} \subset H$

[Figure: original space $(x_1, x_2)$ on the left; feature space H, here the polar coordinates $(r, \theta)$, on the right; each $x^i$ maps to $\phi(x^i)$]

In feature space, computation is simpler (e.g. perform linear classification).

SLIDE 5

Kernels: Intuition

In most cases, determining the transformation $\phi$ beforehand may be difficult. Which representation of the data makes it easy to classify the three groups of datapoints?

SLIDE 6

Kernels: Intuition

In most cases, determining the transformation $\phi$ beforehand may be difficult. What if the groups live in N dimensions, with N >> 1? Grouping may then require separate sets of dimensions and can no longer be visualized.

SLIDE 7

Kernel-Induced Feature Space

Idea: Send the data X into a feature space H through the nonlinear map $\phi$:

$X = \{x^i\}_{i=1\dots M} \subset \mathbb{R}^N \;\longrightarrow\; \phi(X) = \{\phi(x^1), \dots, \phi(x^M)\} \subset H$

While the dimension of the original space is N, the dimension of the feature space may be greater than N! X is "lifted" onto H. Determining $\phi$ is difficult → kernel trick.

[Figure: original space $(x_1, x_2)$ mapped into feature space H]

SLIDE 8

The Kernel Trick

In most cases, determining the transformation $\phi$ may be difficult → kernel trick.

Key idea behind the kernel trick: most algorithms for classification, regression or clustering compute an inner product across pairs of observations to determine the separating line, the fit, or the grouping of datapoints, respectively.

Inner product across two datapoints: $\langle x^i, x^j \rangle$

SLIDE 9

The Kernel Trick

In most cases, determining the transformation $\phi$ may be difficult. → There is no need to compute the transformation $\phi$ if one expresses everything as a function of the inner product in feature space. Proceed as follows:

1) Define a kernel function:

$k: X \times X \to \mathbb{R}, \quad k\big(x^i, x^j\big) = \big\langle \phi(x^i), \phi(x^j) \big\rangle.$

The function k can be used to determine a metric of similarity across datapoints in feature space. It can extract features that are either common to, or that distinguish, groups of datapoints.

2) Use this kernel function to perform classical classification, regression or clustering as in the linear case.
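For step 1, a minimal Python/NumPy sanity check of the identity $k(x^i, x^j) = \langle \phi(x^i), \phi(x^j) \rangle$, using the degree-2 homogeneous polynomial kernel in 2D, whose feature map is known in closed form (the example vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 homogeneous polynomial kernel
    in 2D: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def k(x, y):
    """Kernel trick: <x, y>^2 equals <phi(x), phi(y)> without building phi."""
    return np.dot(x, y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(np.dot(phi(x), phi(y)), k(x, y))  # both equal 1.0
```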

SLIDE 10

Use of Kernels: Example

Key idea: some problems are made simpler if you change the representation of the data. Which representation of the data makes the two groups of datapoints linearly separable?

SLIDE 11

Use of Kernels: Example

Key idea: some problems are made simpler if you change the representation of the data. The data become linearly separable when projected onto the first two principal components of kernel PCA with an RBF kernel (see next lecture).

SLIDE 12

Use of Kernels: Example

Key idea: some problems are made simpler if you change the representation of the data. Which representation of the data makes it easy to assign each of the three groups of datapoints to a different cluster?

SLIDE 13

Use of Kernels: Example

Key idea: some problems are made simpler if you change the representation of the data. The data are correctly clustered when using kernel K-means with an RBF kernel (see next week's lecture).

SLIDE 14

Popular Kernels

• Homogeneous polynomial kernels: $k(x, x') = \langle x, x' \rangle^p, \quad p \in \mathbb{N}$
• Inhomogeneous polynomial kernels: $k(x, x') = \big(\langle x, x' \rangle + c\big)^p, \quad p \in \mathbb{N}, \; c > 0$
• Gaussian / RBF kernel (translation-invariant): $k(x, x') = e^{-\|x - x'\|^2 / 2\sigma^2}$
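A direct Python/NumPy transcription of the three kernels (the function names are mine):

```python
import numpy as np

def k_poly_hom(x, xp, p):
    """Homogeneous polynomial kernel: <x, x'>^p."""
    return np.dot(x, xp) ** p

def k_poly_inhom(x, xp, p, c):
    """Inhomogeneous polynomial kernel: (<x, x'> + c)^p, with c > 0."""
    return (np.dot(x, xp) + c) ** p

def k_rbf(x, xp, sigma):
    """Gaussian / RBF kernel: exp(-||x - x'||^2 / (2 * sigma^2))."""
    d = np.asarray(x) - np.asarray(xp)
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))
```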

SLIDE 15

Kernels: Exercise I

Using the RBF kernel $k(x, x') = e^{-\|x - x'\|^2 / 2\sigma^2}$:

a) Draw the isolines for:
• one datapoint $x^1$, i.e. find all $x$ s.t. $k(x, x^1) = cst$;
• two datapoints $x^1, x^2$: find all $x$ s.t. $k(x, x^1) + k(x, x^2) = cst$;
• three datapoints.

b) Discuss the effect of $\sigma$ on the isolines (see the sketch below).

c) Determine a metric in feature space.
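One way to explore the exercise numerically: evaluate the summed kernel values on a grid and draw contour lines, which are exactly the isolines asked for. A sketch assuming NumPy/Matplotlib (the datapoint positions and $\sigma$ are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

sigma = 0.5                                    # kernel width: vary for part b)
pts = np.array([[0.0, 0.0], [1.0, 0.5]])       # M = 2 datapoints (arbitrary)

gx, gy = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-2, 3, 200))
grid = np.stack([gx, gy], axis=-1)             # shape (200, 200, 2)

# k(x, x^1) + ... + k(x, x^M) evaluated at every grid point.
z = sum(np.exp(-((grid - p) ** 2).sum(axis=-1) / (2 * sigma ** 2)) for p in pts)

plt.contour(gx, gy, z, levels=10)              # the isolines
plt.scatter(pts[:, 0], pts[:, 1], color="k")
plt.show()
```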

SLIDE 16

Kernels: Solution Exercise I

[Figure: RBF kernel; M = 1, i.e. 1 datapoint]

SLIDE 17

Kernels: Solution Exercise I

[Figure: Gaussian kernel; M = 2, i.e. 2 datapoints]

SLIDE 18

Kernels: Solution Exercise I

[Figure: Gaussian kernel; M = 2, i.e. 2 datapoints; panels: small kernel width vs. large kernel width]

SLIDE 19

Kernels: Solution Exercise I

[Figure: Gaussian kernel; M = 3, i.e. 3 datapoints]

SLIDE 20

Kernels: Solution Exercise I

[Figure: Gaussian kernel; M = 3, i.e. 3 datapoints]

SLIDE 21

Kernels: Solution Exercise I

[Figure: Gaussian kernel; M = 3, i.e. 3 datapoints]

SLIDE 22

Kernels: Exercise II

Using the homogeneous polynomial kernel $k(x, x') = \langle x, x' \rangle^p$, $p \in \mathbb{N}$, draw the isolines as in the previous exercise for:
a) one datapoint
b) two datapoints
c) three datapoints
Discuss the effect of $p$ on the isolines.

SLIDE 23

Kernels: Solution Exercise II

[Figure: polynomial kernel; order p = 1, 2, 3; M = 1, i.e. 1 datapoint; panels: p=1, p=2, p=3]

The isolines are lines perpendicular to the vector pointing from the origin to the datapoint. The order p does not change the geometry; it only changes the values of the isolines.

SLIDE 24

Kernels: Solution Exercise II

[Figure: polynomial kernel; order p = 1, 2, 3; M = 2, i.e. 2 datapoints; panels: p=1, p=2, p=3]

For p=1, the isolines are lines perpendicular to the combination of the two datapoint vectors. With p=2 we obtain ellipses; with p=3, hyperbolas. Orders p>3 are similar in concept, with the signs of the isoline values changing depending on whether p is odd or even.

SLIDE 25

Kernels: Solution Exercise II

[Figure: polynomial kernel; order p = 1, 2, 3; M = 3, i.e. 3 datapoints]

Solutions with p>1 present a symmetry around the origin.

SLIDE 26

Kernels: Exercise III

Another relatively popular kernel is the linear kernel: $k(x, x') = x^T x'$.
1) Can you tell what this kernel measures?
2) Find an application where using the linear kernel provides an interesting measure.

SLIDE 27

Kernels: Solution Exercise III

Bags of words: [machine, learning, kernel, rbf, robot, vision, dimension, blue, speed, ...]. You want to group webpages with common groups of words. Set $x \in \mathbb{R}^{1000}$, with each entry of $x$ set to 1 if the corresponding word is present, else zero. E.g. $x^1 = (1, 1, 1, 0, 0, 0, \dots)$ contains the words machine, learning and kernel and nothing else.

Features live in a low-dimensional space (a common group of webpages has a low number of combinations of words):

$k\big(x^i, x^j\big) = \big(x^i\big)^T x^j = x_1^i x_1^j + x_2^i x_2^j + x_3^i x_3^j + x_4^i x_4^j + \dots = \sum_k x_k^i x_k^j$

The isoline $k(x, x^j) = 3$ delineates the set of webpages that share the same set of three keywords as $x^j$.
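A small sketch of this bag-of-words reasoning (the vocabulary and pages are made up; only 6 words instead of 1000):

```python
import numpy as np

vocab = ["machine", "learning", "kernel", "rbf", "robot", "vision"]

def bag(words):
    """Binary bag-of-words vector: entry 1 if the vocabulary word is present."""
    return np.array([1.0 if w in words else 0.0 for w in vocab])

page_a = bag({"machine", "learning", "kernel"})
page_b = bag({"machine", "learning", "kernel", "robot"})
page_c = bag({"rbf", "vision"})

# The linear kernel x^T x' counts the keywords two pages share.
print(page_a @ page_b)   # 3.0 -> the pages share three keywords
print(page_a @ page_c)   # 0.0 -> no keywords in common
```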

SLIDE 28

Kernels: Solution Exercise III

Sequences of strings (e.g. genetic code): [IPTS, VBUV, ...]. We want to group strings with common substrings. Set $x \in \mathbb{R}^{1000}$, with each entry $x_L$ the number of times the substring $L$ (e.g. "QD") appears in the string. Apply the same reasoning as before for grouping.

SLIDE 29

How to choose kernels?

• There is no rule for choosing the right kernel; each kernel must be adapted to a particular problem.
• There now exist methods to learn the kernel (this is one of the proposed topics for the literature survey).
• Kernel design can start from the desired feature space or from the data.
• Some considerations are important:
  – Use the kernel to introduce a priori (domain) knowledge.
  – Be sure to keep some structure in the feature space.

SLIDE 30

Example: The Kernel Trick in Regression

Linear regression: search for a linear mapping between input $x$ and output $y$, parametrized by the slope vector $w$ and the intercept $b$:

$y = f(x; w, b) = w^T x + b, \qquad w, x \in \mathbb{R}^N, \; y, b \in \mathbb{R}$

[Figure: linear fit of y vs. x]

Omit $b$ by centering the data: $y' = y - \bar{y}$ and $x' = x - \bar{x}$, with $\bar{x}, \bar{y}$ the means of $x$ and $y$. Then $y' = w^T x'$, and the least-squares estimate of the intercept is $b^* = \bar{y} - w^T \bar{x}$.

SLIDE 31

The Kernel Trick in Regression

Linear regression: search for a linear mapping between input $x$ and output $y$, parametrized by the slope vector $w$:

$y = f(x; w) = w^T x, \qquad w, x \in \mathbb{R}^N, \; y \in \mathbb{R}$

[Figure: linear fit of y vs. x]

SLIDE 32

The Kernel Trick in Regression

[Figure: linear fit of y vs. x]

The problem consists of minimizing the loss function $L(x, y; w) = \big(y - \langle w, x \rangle\big)^2$: find $w^*$ such that $w^* = \arg\min_w L(x, y; w)$.

SLIDE 33

The Kernel Trick in Regression

[Figure: linear fit of y vs. x, with training points $(x^i, y^i)$]

Pairs of training points $X = [x^1 \dots x^M]$ and $\mathbf{y} = [y^1 \dots y^M]^T$, with $x^i \in \mathbb{R}^N$, $y^i \in \mathbb{R}$.

$L(X, \mathbf{y}; w) = \sum_{i=1}^{M} \big(y^i - \langle w, x^i \rangle\big)^2$

SLIDE 34

The Kernel Trick in Regression

$L(X, \mathbf{y}; w) = \frac{1}{2}\sum_{i=1}^{M}\big(y^i - \langle w, x^i\rangle\big)^2 = \frac{1}{2}\big(\mathbf{y} - X^T w\big)^T \big(\mathbf{y} - X^T w\big)$

with $\mathbf{y} = [y^1 \dots y^M]^T$ and $X$ the matrix with column vectors $x^i$.

The optimal $w^*$ is given by:

$w^* = \arg\min_w \frac{1}{2}\big(\mathbf{y} - X^T w\big)^T\big(\mathbf{y} - X^T w\big)$

This has an analytical solution:

$X X^T w = X\mathbf{y} \;\Rightarrow\; w^* = \big(X X^T\big)^{-1} X \mathbf{y}$
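A minimal NumPy sketch of this primal least-squares solution on synthetic data (dimensions, noise level and the true weights are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 3, 100                       # N dimensions, M datapoints
X = rng.normal(size=(N, M))         # columns are the datapoints x^i
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true + 0.01 * rng.normal(size=M)    # slightly noisy targets

# w* = (X X^T)^{-1} X y; solving the linear system beats an explicit inverse.
w_star = np.linalg.solve(X @ X.T, X @ y)
print(w_star)                       # close to w_true
```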

SLIDE 35

The Kernel Trick in Regression

$w^*$ has the analytical solution $w^* = \big(X X^T\big)^{-1} X \mathbf{y}$.

Generally not too computationally intensive, as $w \in \mathbb{R}^N$ requires at minimum N datapoints to solve, and $N \ll M$. Requires $O(N^3)$ operations!

It has an exact solution if:
a) $X X^T$ is not singular (it is singular when there are not enough datapoints);
b) the data are not noisy (otherwise there is no single $w$ matching every pair $y^i$, $\langle w, x^i \rangle$).

SLIDE 36

The Kernel Trick in Regression

$w^*$ has the analytical solution $w^* = \big(X X^T\big)^{-1} X \mathbf{y}$.

If $X X^T$ is singular, two solutions:
a) approximate $\big(X X^T\big)^{-1}$ with the pseudo-inverse (minimize the norm);
b) trade off the size of the norm against the loss (Ridge Regression).

SLIDE 37

The Kernel Trick in Regression

Ridge Regression:

$\min_w L(X, \mathbf{y}; w) = \min_w \frac{1}{2}\big(\mathbf{y} - X^T w\big)^T\big(\mathbf{y} - X^T w\big) + \lambda \langle w, w \rangle$

Regularization term: introduces a penalty for large weights and reduces the number of solutions.

Take the derivative with respect to $w$ to obtain $w^*$:

$X\big(\mathbf{y} - X^T w\big) = \lambda w \;\Rightarrow\; w^* = \big(X X^T + \lambda I\big)^{-1} X \mathbf{y}$

Always invertible for $\lambda > 0$. Complexity still $O(N^3)$.

SLIDE 38

The Kernel Trick in Regression

Ridge Regression:

$\min_w L(X, \mathbf{y}; w) = \min_w \frac{1}{2}\big(\mathbf{y} - X^T w\big)^T\big(\mathbf{y} - X^T w\big) + \lambda \langle w, w \rangle$

Take the derivative with respect to $w$: $\lambda w = X\big(\mathbf{y} - X^T w\big)$, i.e. $w = X\alpha$ with $\alpha = \lambda^{-1}\big(\mathbf{y} - X^T w\big)$, which gives

$w^* = X\big(X^T X + \lambda I\big)^{-1}\mathbf{y}$: this solution is called the Dual.

$G = X^T X$: Gram matrix, $M \times M$, with M the number of datapoints. Complexity $O(M^3)$.

SLIDE 39

The Kernel Trick in Regression

Ridge Regression:

$\min_w L(X, \mathbf{y}; w) = \min_w \frac{1}{2}\big(\mathbf{y} - X^T w\big)^T\big(\mathbf{y} - X^T w\big) + \lambda \langle w, w \rangle$

Primal (take the derivative with respect to $w$):

$X\big(\mathbf{y} - X^T w\big) = \lambda w \;\Rightarrow\; w^* = \big(X X^T + \lambda I\big)^{-1} X \mathbf{y}$ — complexity $O(N^3)$.

Dual:

$w^* = X\big(X^T X + \lambda I\big)^{-1}\mathbf{y}$ — complexity $O(M^3)$, with $G = X^T X$ the $M \times M$ Gram matrix.

If $M \ll N$, solve with the dual; if $N \ll M$, solve with the primal.
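The equivalence of the two forms can be checked numerically; a sketch with $M \ll N$, where the dual is the cheaper route (the data and $\lambda$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, lam = 50, 10, 0.1             # M << N: the dual inverts a 10x10 matrix
X = rng.normal(size=(N, M))         # columns are the datapoints
y = rng.normal(size=M)

# Primal: w* = (X X^T + lam I_N)^{-1} X y   -- O(N^3)
w_primal = np.linalg.solve(X @ X.T + lam * np.eye(N), X @ y)

# Dual:   w* = X (X^T X + lam I_M)^{-1} y   -- O(M^3)
w_dual = X @ np.linalg.solve(X.T @ X + lam * np.eye(M), y)

print(np.allclose(w_primal, w_dual))  # True: identical solutions
```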

SLIDE 40

The Kernel Trick in Regression

The Gram matrix $G = X^T X$ ($M \times M$, with M the number of datapoints) is composed of inner products between training points: $G_{ij} = \langle x^i, x^j \rangle$ → use the kernel trick to transfer to non-linear regressions.

$w^* = X\big(X^T X + \lambda I\big)^{-1}\mathbf{y}$: this solution is called the Dual.

SLIDE 41

The Kernel Trick in Regression

Estimate a non-linear function $y = f(x; w)$. Assume that there exists a non-linear transformation $\phi$ such that the problem becomes linear: $\exists\, \phi$ s.t. $y = \langle \phi(x), w \rangle$.

If true, the solution with ridge regression is:

$w^* = \phi(X)\big(\phi(X)^T \phi(X) + \lambda I\big)^{-1}\mathbf{y}$, where the columns of $\phi(X)$ are the $\phi(x^i)$.

The predicted output for a query point $x$ is:

$y(x) = \phi(x)^T \phi(X)\big(\phi(X)^T \phi(X) + \lambda I\big)^{-1}\mathbf{y}$

SLIDE 42

The Kernel Trick in Regression

Replace all inner products between training points by the kernel function: $k: X \times X \to \mathbb{R}$, $k(x^i, x^j) = \langle \phi(x^i), \phi(x^j) \rangle$. The kernel function is easier to compute and does not require knowing $\phi$.

The predicted output for a query point $x$,

$y(x) = \phi(x)^T \phi(X)\big(\phi(X)^T \phi(X) + \lambda I\big)^{-1}\mathbf{y},$

then becomes:

$y(x) = \sum_{i=1}^{M} \alpha_i\, k(x, x^i) = k(x)^T\big(K + \lambda I\big)^{-1}\mathbf{y}$

with $K$ the Gram matrix in feature space, $K_{ij} = k(x^i, x^j)$, $[k(x)]_i = k(x, x^i)$ and $\alpha = \big(K + \lambda I\big)^{-1}\mathbf{y}$.
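Putting the pieces together, a compact kernel ridge regression sketch in Python/NumPy with an RBF kernel (the data, $\lambda$ and $\sigma$ are arbitrary):

```python
import numpy as np

def rbf(A, B, sigma):
    """Kernel matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 * sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(2)
X = rng.uniform(-3.0, 3.0, size=(40, 1))         # M = 40 training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)  # noisy nonlinear targets

lam, sigma = 0.1, 0.7
K = rbf(X, X, sigma)                             # Gram matrix in feature space
alpha = np.linalg.solve(K + lam * np.eye(40), y)

Xq = np.linspace(-3.0, 3.0, 200)[:, None]        # query points
y_pred = rbf(Xq, X, sigma) @ alpha               # y(x) = k(x)^T (K + lam I)^{-1} y
```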

SLIDE 43

The Kernel Trick in Regression

See the MATLAB example in the supplementary documentation.

SLIDE 44

Kernels: Summary

• Kernels are continuous, symmetric, positive functions. The most popular is the RBF kernel. It enables grouping of datapoints; the tightness of the grouping depends on the value of the hyperparameter sigma.
• Kernels represent metrics in feature space (the inner product of the projections of the data in feature space).
• Kernels in ML are useful because they allow one to reduce the computation of complex non-linear problems to simpler computations on linear problems.