Machine Learning: Kernel Methods, Hamid R. Rabiee and Mohammad H. Rohban



SLIDE 1

Machine Learning

Kernel Methods

Hamid R. Rabiee

Mohammad H. Rohban

Spring 2015
http://ce.sharif.edu/courses/93-94/2/ce717-1

SLIDE 2

Sharif University of Technology, Computer Engineering Department, Machine Learning Course


Agenda

 Motivations
 Kernel Definition
 Mercer’s Theorem
 Kernel Matrix
 Kernel Construction

SLIDE 3


Motivations

 Learning linear classifiers can be done effectively (SVM, Perceptron, …).

 How can we generalize existing efficient linear classifiers to non-linear ones?

 It may be hard to classify data points in the original feature space.

 Use an appropriate high dimensional non-linear map to change the feature space.

SLIDE 4


Kernel Definition

 Consider data x lying in R^n.
 Use a high-dimensional mapping Φ: R^n → R^N, with N > n.
 Define the kernel function K(x, x') = Φ(x)^T Φ(x').

 That is, the kernel function is the dot product in the new feature space.
 The dot product measures the similarity of two data points, so K(x, x') shows the similarity of x and x'.
 It is efficient to use K instead of Φ when the dimensionality of Φ is high (Why?).

SLIDE 5


Kernel Definition

 A simple example:

 Consider x = (x1, x2) lying in the 2-dimensional plane and Φ: R² → R³ defined by

$$\Phi(x_1, x_2) = (z_1, z_2, z_3) = \left(x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2\right)$$

 A linear classifier in the new space becomes (w' is a vector in the new space):

$$g(x) = w'^T \Phi(x) = w'_1\, x_1^2 + \sqrt{2}\, w'_2\, x_1 x_2 + w'_3\, x_2^2$$

 What will be the shape of the separating curve in the original space?

SLIDE 6


Kernel Definition

 What is the kernel function in the previous example? The dot product in the new space is the square of the dot product in the original space:

$$K(u, v) = \Phi(u)^T \Phi(v) = u_1^2 v_1^2 + 2 u_1 u_2 v_1 v_2 + u_2^2 v_2^2 = (u_1 v_1 + u_2 v_2)^2 = (u^T v)^2$$

 Can we construct an arbitrary conic section in the original feature space this way? Why? We instead use

$$K(u, v) = (u^T v + 1)^2$$
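The identity K(u, v) = Φ(u)ᵀΦ(v) = (uᵀv)² is easy to spot-check numerically. A minimal sketch, where the helper names and test vectors are mine, not from the slides:

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map R^2 -> R^3 from the slides."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k_quad(u, v):
    """Quadratic kernel K(u, v) = (u^T v)^2, computed without the map."""
    return float(u @ v) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

# The dot product in the feature space equals the kernel value.
lhs = phi(u) @ phi(v)
rhs = k_quad(u, v)
assert np.isclose(lhs, rhs)   # both equal (1*3 + 2*(-1))^2 = 1
```

Evaluating `k_quad` costs one n-dimensional dot product, while the explicit map grows quadratically in dimension; this gap is the point of using K instead of Φ.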

SLIDE 7


Kernel Definition

 Some typical kernels include:

 Linear: $K(u, v) = u^T v$
 Polynomial: $K(u, v) = (u^T v + c)^d,\ c \ge 0$
 Sigmoid: $K(u, v) = \tanh(u^T v + c)$
 Gaussian RBF: $K(u, v) = \exp\left(-\|u - v\|^2 / 2\sigma^2\right)$

 Can any function K(u, v) be a valid kernel function?

 That is, does there exist a function Φ with K(u, v) = Φ(u)ᵀΦ(v)?  If K satisfies Mercer’s condition, it is a valid kernel function.
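The four kernels above can be written directly as functions. Note that parameter conventions for the sigmoid kernel vary, and the hyperparameter defaults below are arbitrary illustrations:

```python
import numpy as np

def k_linear(u, v):
    return u @ v

def k_poly(u, v, c=1.0, d=2):
    return (u @ v + c) ** d

def k_sigmoid(u, v, c=-1.0):
    return np.tanh(u @ v + c)

def k_rbf(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma**2))

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
print(k_linear(u, v))   # 0.0: orthogonal vectors have no linear similarity
print(k_poly(u, v))     # (0 + 1)^2 = 1.0
print(k_rbf(u, u))      # 1.0: a point is maximally similar to itself
```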

SLIDE 8


Mercer’s Theorem

 If for any square-integrable function f(·) we have

$$\int_{\mathbb{R}^n} \int_{\mathbb{R}^n} K(x, x')\, f(x)\, f(x')\, dx\, dx' \ge 0,$$

then the function K(x, x') is a valid kernel function.  In this case the components of the corresponding map Φ are proportional to the eigenfunctions of K(x, x'), that is

$$\Phi(x) = \left(\sqrt{\mu_1}\,\varphi_1(x),\ \sqrt{\mu_2}\,\varphi_2(x),\ \dots\right), \qquad \int_{\mathbb{R}^n} K(u, v)\,\varphi_i(v)\, dv = \mu_i\,\varphi_i(u).$$

 In effect, Mercer’s theorem checks that K(x, y) is positive semi-definite, hence all μi ≥ 0.
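On a finite grid the double integral becomes the quadratic form fᵀKf (times the squared grid spacing), so the Mercer condition can be spot-checked numerically. A sketch for the Gaussian kernel, with a grid size and an f chosen by me for illustration:

```python
import numpy as np

x = np.linspace(-5, 5, 200)                       # discretize the domain
dx = x[1] - x[0]
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)   # Gaussian kernel matrix

f = np.sin(3 * x) - 0.5 * x                       # an arbitrary square-integrable f
integral = f @ K @ f * dx**2                      # quadrature of the double integral
assert integral >= 0                              # Mercer condition holds
```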

SLIDE 9


Kernel Matrix

 Restricting the kernel function to a set of points {x1, …, xk}, the kernel function can be represented by a matrix:

$$K = \begin{bmatrix} K(x_1, x_1) & \cdots & K(x_1, x_k) \\ \vdots & \ddots & \vdots \\ K(x_k, x_1) & \cdots & K(x_k, x_k) \end{bmatrix}$$

 A matrix K is a valid kernel matrix if it is a positive semi-definite matrix, that is, all its eigenvalues are greater than or equal to zero.
 The eigenvectors multiplied by the square roots of the eigenvalues are the restrictions of the φi to the set {x1, …, xk}.
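The last point can be verified directly: eigendecompose a kernel matrix, scale each eigenvector by the square root of its eigenvalue, and the resulting finite feature vectors reproduce K. A sketch on random points, where the data and kernel choice are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                  # 6 points in R^2
K = (X @ X.T + 1.0) ** 2                     # kernel matrix of (u^T v + 1)^2

lam, V = np.linalg.eigh(K)                   # K is symmetric PSD
assert lam.min() > -1e-9                     # eigenvalues >= 0 (up to rounding)

Phi = V * np.sqrt(np.clip(lam, 0, None))     # row i: finite feature vector of x_i
assert np.allclose(Phi @ Phi.T, K)           # the features reproduce K exactly
```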

SLIDE 10


Polynomial Kernel

 2nd-degree polynomial:

$$K(u, v) = (u^T v)^2 = u_1^2 v_1^2 + 2 u_1 u_2 v_1 v_2 + u_2^2 v_2^2 = \Phi(u)^T \Phi(v), \qquad \Phi(u) = \left(u_1^2,\ \sqrt{2}\, u_1 u_2,\ u_2^2\right)$$

 Up to 2nd-degree polynomial:

$$K(u, v) = (u^T v + 1)^2 = 1 + 2 u_1 v_1 + 2 u_2 v_2 + u_1^2 v_1^2 + 2 u_1 u_2 v_1 v_2 + u_2^2 v_2^2 = \Phi(u)^T \Phi(v), \qquad \Phi(u) = \left(1,\ \sqrt{2}\, u_1,\ \sqrt{2}\, u_2,\ u_1^2,\ \sqrt{2}\, u_1 u_2,\ u_2^2\right)$$

 The latter can construct any 2nd-order function in the original feature space.
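A quick numerical check of the second expansion, with the 6-dimensional map written out explicitly (the test vectors are mine):

```python
import numpy as np

def phi_full(u):
    """Explicit map for K(u, v) = (u^T v + 1)^2 on R^2, from the expansion above."""
    u1, u2 = u
    s = np.sqrt(2)
    return np.array([1.0, s * u1, s * u2, u1**2, s * u1 * u2, u2**2])

u = np.array([0.5, -2.0])
v = np.array([1.5, 1.0])

# The 6-dimensional dot product reproduces the kernel value.
assert np.isclose(phi_full(u) @ phi_full(v), (u @ v + 1) ** 2)
```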

SLIDE 11


RBF Kernel

 An example

 That is, the input space −5 < u < 5 is mapped to a curve using only 2 dimensions of Φ.

[Figure: the curve traced by (χ1(u), χ2(u)) in the (χ1, χ2) plane as u varies.]

SLIDE 12


RBF Kernel

 An example (cont.)

 Consider the Gaussian kernel

$$K(u, v) = \exp\left(-(u - v)^2 / 2\sigma^2\right),$$

where u lies in a subset of R, −5 < u < 5.
 The eigenfunctions of K are illustrated; Φ = (χ1, …, χ10, …).
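The illustrated eigenfunctions can be approximated by eigendecomposing the kernel matrix on a fine grid over the interval (a Nyström-style discretization). The grid size is my choice, and this is only a sketch of how such plots are produced:

```python
import numpy as np

u = np.linspace(-5, 5, 400)
du = u[1] - u[0]
K = np.exp(-(u[:, None] - u[None, :]) ** 2 / 2)   # Gaussian kernel, sigma = 1

lam, V = np.linalg.eigh(K * du)    # discretized integral operator
lam, V = lam[::-1], V[:, ::-1]     # sort eigenvalues in decreasing order

chi1 = np.sqrt(max(lam[0], 0)) * V[:, 0]   # first two components of Phi,
chi2 = np.sqrt(max(lam[1], 0)) * V[:, 1]   # i.e. the curve (chi1(u), chi2(u))
assert lam[0] >= lam[1] >= 0               # leading eigenvalues are nonnegative
```

Plotting `chi2` against `chi1` reproduces the curve shown for −5 < u < 5.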

SLIDE 13


RBF Kernel

 An example (cont.)

 Consider a linear classifier in the new space.  The corresponding classifier in the u space is clearly non-linear.

[Figure: a linear separator between classes C1 and C2 in the (χ1, χ2) plane, and the corresponding non-linear decision regions along the u axis.]

SLIDE 14


RBF Kernel

 The RBF kernel places a Gaussian around each data point.  A linear discriminant function cuts through the resulting surface in the embedding space.  Therefore any arbitrary set of points can be classified by RBF kernels.  The training error goes to zero as σ → 0.
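The σ → 0 claim can be seen from the kernel matrix: as σ shrinks, every off-diagonal entry vanishes and K approaches the identity, which can interpolate any labeling of distinct points. A sketch with illustrative values:

```python
import numpy as np

X = np.array([0.0, 1.0, 2.5, 4.0])   # distinct 1-D training points

def rbf_matrix(X, sigma):
    return np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * sigma**2))

# For tiny sigma the kernel matrix is numerically the identity, so any
# labels y are interpolated exactly: the training error is zero.
K = rbf_matrix(X, sigma=1e-3)
assert np.allclose(K, np.eye(len(X)))
```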

SLIDE 15


Kernel Construction

 How can we build valid kernels from existing kernels?  According to Mercer’s theorem, if c > 0, k1 and k2 are valid kernels, and ψ is an arbitrary function, then the following functions are also valid kernels:

 K(u, v) = c·k1(u, v)
 K(u, v) = k1(u, v) + k2(u, v)
 K(u, v) = k1(u, v)·k2(u, v)
 K(u, v) = k1(ψ(u), ψ(v))
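These closure rules can be spot-checked on kernel matrices, since each construction must again yield a positive semi-definite matrix (the fourth rule holds trivially, since ψ only transforms the inputs before a valid kernel is applied). A sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))

K1 = X @ X.T                                           # linear kernel matrix
D2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)   # pairwise squared distances
K2 = np.exp(-D2 / 2)                                   # RBF kernel matrix

def is_psd(K):
    return np.linalg.eigvalsh(K).min() > -1e-9

# Scaling, sum, and entrywise (Schur) product all preserve PSD-ness.
for K in (3.0 * K1, K1 + K2, K1 * K2):
    assert is_psd(K)
```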

SLIDE 16


Kernel Construction

 Construct kernels from probabilistic generative models (class-conditional probabilities, HMMs, …) and then use the kernel in a discriminative model (such as an SVM or a linear discriminant function).
 K(x, x') = p(x)p(x') is clearly a valid kernel, which states that x and x' are similar if they both have high probability (Why is it valid?).
 A better kernel can be constructed in the same way:

$$K(u, v) = \sum_{i=1}^{n} p(u \mid c_i)\, p(v \mid c_i)\, p(c_i)$$

 That is, u and v are similar if they have high probabilities under the same classes.
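The class-conditional kernel above is valid because it has the explicit finite feature map Φ(u) = (√p(c1)·p(u|c1), …, √p(cn)·p(u|cn)). A sketch with two Gaussian classes, where all densities and priors are illustrative:

```python
import numpy as np

priors = np.array([0.4, 0.6])                     # p(c_i) for two classes
means = np.array([-1.0, 2.0])
sigmas = np.array([1.0, 0.5])

def cond(u):
    """Vector of class-conditional densities p(u | c_i)."""
    return np.exp(-((u - means) ** 2) / (2 * sigmas**2)) / (sigmas * np.sqrt(2 * np.pi))

def k_gen(u, v):
    """K(u, v) = sum_i p(u|c_i) p(v|c_i) p(c_i)."""
    return float(np.sum(cond(u) * cond(v) * priors))

def phi(u):
    """Explicit feature map: K(u, v) = phi(u) . phi(v)."""
    return np.sqrt(priors) * cond(u)

u, v = 0.3, 1.8
assert np.isclose(k_gen(u, v), phi(u) @ phi(v))   # kernel equals the dot product
assert k_gen(u, u) > 0
```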

SLIDE 17


Kernel Construction

 State-of-the-art methods try to learn the kernel from (possibly many) training points.  The simplest is multiple kernel learning:

 Consider {k1, …, kn} as n valid kernels.
 Find an appropriate kernel k(u, v) from the training data:

$$K(u, v) = \sum_{i=1}^{n} c_i\, k_i(u, v), \qquad c_i \ge 0$$

 Minimize the training loss (MSE) over the ci while simultaneously minimizing the trace of the kernel matrix on the training data, to avoid overfitting.
 Many variations of this algorithm have been developed.
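For any fixed nonnegative weights the combined kernel is valid, by the scaling and sum rules from the previous slide. A sketch combining a linear and an RBF base kernel with fixed (not learned) weights:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 2))

K1 = X @ X.T                                            # linear base kernel
D2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K2 = np.exp(-D2 / 2)                                    # RBF base kernel

c = np.array([0.7, 0.3])                                # c_i >= 0, fixed here;
K = c[0] * K1 + c[1] * K2                               # MKL would learn them

assert np.linalg.eigvalsh(K).min() > -1e-9              # still a valid kernel matrix
```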

SLIDE 18


Example 1

Solution

SLIDE 19


Example 2

Solution

SLIDE 20


Any Question?

End of Lecture 13. Thank you!

Spring 2015

http://ce.sharif.edu/courses/93-94/2/ce717-1