Machine Learning
Kernel Methods
Hamid R. Rabiee
Mohammad H. Rohban Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1
Agenda:
Motivations
Kernel Definition
Mercer's Theorem
Kernel Matrix
Kernel Construction
Sharif University of Technology, Computer Engineering Department, Machine Learning Course
How can we generalize existing efficient linear classifiers to non-linear ones?
Use an appropriate high-dimensional non-linear map to change the feature space.
That is, the kernel function is the dot product in the new feature space. The dot product measures the similarity of two data points: $K(x, x')$ measures the similarity of $x$ and $x'$. It is efficient to use $K$ instead of $\Phi$ when the dimensionality of $\Phi$ is high (why?).
Consider $x = (x_1, x_2)$ lying in the 2-dimensional plane and $\Phi : \mathbb{R}^2 \to \mathbb{R}^3$ with the definition below. A linear classifier in the new space (where $w'$ is a weight vector in that space) then takes the form shown. What will be the shape of the separating curve in the original space?
$\Phi(x_1, x_2) = (z_1, z_2, z_3) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right)$

$g(x) = w'^T \Phi(x) + w_0' = w_1' x_1^2 + \sqrt{2}\, w_2' x_1 x_2 + w_3' x_2^2 + w_0'$
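The identity between the explicit feature map and the kernel can be checked numerically. A minimal sketch (the sample points are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map R^2 -> R^3: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

lhs = phi(u) @ phi(v)   # dot product in the new 3-D space
rhs = (u @ v) ** 2      # kernel evaluated entirely in the original 2-D space

assert np.isclose(lhs, rhs)
```

The right-hand side never forms the 3-dimensional vectors, which is the point of the kernel trick.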
What will be the kernel function in the previous example? The dot product in the new space is the square of the dot product in the original space. Can we construct an arbitrary conic section in the original feature space this way? Why? We instead use:
$K(u, v) = \Phi(u)^T \Phi(v) = u_1^2 v_1^2 + 2 u_1 u_2 v_1 v_2 + u_2^2 v_2^2 = (u_1 v_1 + u_2 v_2)^2 = (u^T v)^2$

$(u^T v + 1)^2$
Examples of common kernels:

Linear: $K(u, v) = u^T v$
Polynomial: $K(u, v) = (u^T v + c)^d,\ c \ge 0$
Sigmoid: $K(u, v) = \tanh(u^T v + c)$
Gaussian RBF: $K(u, v) = \exp\left(-\|u - v\|^2 / 2\sigma^2\right)$

When is a function $K$ a valid kernel? That is, does there exist a function $\Phi$ with $K(u, v) = \Phi(u)^T \Phi(v)$? If $K$ satisfies Mercer's condition, it is a valid kernel function.
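The common kernels above are straightforward to implement; a minimal sketch (the default parameter values here are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def linear(u, v):
    return u @ v

def polynomial(u, v, c=1.0, d=2):
    # c >= 0 controls the weight of lower-order terms
    return (u @ v + c) ** d

def sigmoid(u, v, c=-1.0):
    return np.tanh(u @ v + c)

def gaussian_rbf(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))
```

Note that the Gaussian RBF kernel of any point with itself is 1, its maximum similarity value.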
$\Phi(x) = \left(\sqrt{\lambda_1}\,\varphi_1(x),\ \sqrt{\lambda_2}\,\varphi_2(x),\ \dots\right)$, where $\int_{\mathbb{R}^n} K(u, v)\,\varphi_i(v)\,dv = \lambda_i\,\varphi_i(u)$
That is, all its eigenvalues are greater than or equal to zero. The eigenvectors, multiplied by the square roots of the eigenvalues, are the restrictions of the $\varphi_i$ to the set $\{x_1, \dots, x_k\}$.
$K = \begin{bmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_k) \\ \vdots & & \ddots & \vdots \\ K(x_k, x_1) & \cdots & & K(x_k, x_k) \end{bmatrix}$
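The positive semi-definiteness of a valid kernel's matrix can be verified numerically. A sketch using the Gaussian RBF kernel on random points (the data and kernel choice are arbitrary):

```python
import numpy as np

def kernel_matrix(K, X):
    """k x k Gram matrix with entries K_ij = K(x_i, x_j)."""
    k = len(X)
    return np.array([[K(X[i], X[j]) for j in range(k)] for i in range(k)])

rbf = lambda u, v, s=1.0: np.exp(-np.sum((u - v) ** 2) / (2 * s ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
G = kernel_matrix(rbf, X)

# all eigenvalues >= 0 (up to floating-point round-off)
eigvals = np.linalg.eigvalsh(G)  # eigvalsh: the matrix is symmetric
assert eigvals.min() >= -1e-10
```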
With $(u^T v + 1)^2$, we can construct any 2nd-order function in the original feature space, since it adds linear and constant terms to the purely quadratic ones:
$K(u, v) = (u^T v)^2 = u_1^2 v_1^2 + 2 u_1 u_2 v_1 v_2 + u_2^2 v_2^2$

$K(u, v) = (u^T v + 1)^2 = u_1^2 v_1^2 + 2 u_1 u_2 v_1 v_2 + u_2^2 v_2^2 + 2 u_1 v_1 + 2 u_2 v_2 + 1$
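The expansion of $(u^T v + 1)^2$ corresponds to a 6-dimensional feature map containing quadratic, linear, and constant terms. A minimal sketch verifying this (the feature map is reconstructed from the expansion above, not given explicitly in the slides):

```python
import numpy as np

def phi_affine(x):
    """Feature map in 2-D whose dot product equals (u^T v + 1)^2."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2,   # quadratic terms
                     np.sqrt(2) * x1, np.sqrt(2) * x2,     # linear terms
                     1.0])                                 # constant term

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi_affine(u) @ phi_affine(v), (u @ v + 1) ** 2)
```

The three extra dimensions are what make arbitrary conic sections (not just those through the origin) representable by a linear classifier in the new space.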
That is, the input space $-5 < u < 5$ will be mapped to a curve using only 2 dimensions of $\Phi$.
[Figure: the curve traced by $(\chi_1(u), \chi_2(u))$ as $u$ varies, with axes $\chi_1$ and $\chi_2$]
Consider the Gaussian kernel below, where $u$ lies in a subset of $\mathbb{R}$, $-5 < u < 5$. The eigenfunctions of $K$ are illustrated, and $\Phi = (\chi_1, \dots, \chi_{10}, \dots)$.
$K(u, v) = \exp\left(-(u - v)^2 / 2\sigma^2\right)$
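The eigenfunctions of the Gaussian kernel on $-5 < u < 5$ can be approximated numerically by discretizing the interval and eigendecomposing the resulting Gram matrix (a Nyström-style sketch; the grid size and $\sigma$ are illustrative choices):

```python
import numpy as np

sigma = 1.0
u = np.linspace(-5, 5, 200)   # discretize -5 < u < 5
du = u[1] - u[0]

# Gram matrix of the Gaussian kernel on the grid
G = np.exp(-(u[:, None] - u[None, :]) ** 2 / (2 * sigma ** 2))

# eigenvectors of the scaled Gram matrix sample the eigenfunctions chi_i
vals, vecs = np.linalg.eigh(G * du)
order = np.argsort(vals)[::-1]          # sort eigenvalues descending
vals, vecs = vals[order], vecs[:, order]

# the spectrum decays quickly: a few dimensions of Phi capture most of K
assert vals[0] > vals[10] > 0
```

The rapid eigenvalue decay is why only a couple of dimensions of $\Phi$ suffice to map the interval to a curve, as the figure suggests.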
Consider a linear classifier in the new space. The corresponding classifier in the $u$ space is clearly non-linear.
[Figure: a linear decision boundary in the $(\chi_1, \chi_2)$ space separating classes $C_1$ and $C_2$, inducing non-linear decision regions on the $u$ axis]
$K(u, v) = c\,k_1(u, v),\ c > 0$
$K(u, v) = k_1(u, v) + k_2(u, v)$
$K(u, v) = k_1(u, v)\,k_2(u, v)$
$K(u, v) = k_1(\psi(u), \psi(v))$
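Each of these construction rules preserves positive semi-definiteness of the kernel matrix, which can be spot-checked numerically. A sketch on random data (the base kernels and the map $\psi$ are arbitrary choices):

```python
import numpy as np

def is_psd(G, tol=1e-8):
    """Check positive semi-definiteness up to numerical tolerance."""
    return np.linalg.eigvalsh(G).min() >= -tol

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 3))

K1 = X @ X.T                                           # linear kernel k1
D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, -1)
K2 = np.exp(-D / 2)                                    # Gaussian RBF k2
psi = np.tanh(X)                                       # some fixed map psi

assert is_psd(3.0 * K1)        # c * k1, c > 0
assert is_psd(K1 + K2)         # k1 + k2
assert is_psd(K1 * K2)         # elementwise (Schur) product: k1 * k2
assert is_psd(psi @ psi.T)     # k1(psi(u), psi(v))
```

The product rule relies on the Schur product theorem: the elementwise product of two PSD matrices is PSD.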
That is, $u$ and $v$ are similar if they have high probabilities under the same classes.
$K(u, v) = \sum_{i=1}^{n} p(u \mid c_i)\, p(v \mid c_i)\, p(c_i)$
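This generative kernel can be sketched with a hypothetical two-class 1-D Gaussian model (the class means, variances, and priors below are made-up illustration values):

```python
import numpy as np

def normal_pdf(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

# hypothetical two-class generative model on the real line
mus, sigmas, priors = [-2.0, 2.0], [1.0, 1.0], [0.5, 0.5]

def K(u, v):
    """K(u, v) = sum_i p(u | c_i) p(v | c_i) p(c_i)."""
    return sum(normal_pdf(u, m, s) * normal_pdf(v, m, s) * p
               for m, s, p in zip(mus, sigmas, priors))

# two points likely under the same class are more similar than two
# points likely under different classes
assert K(-2.0, -1.8) > K(-2.0, 1.8)
```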
Consider $\{k_1, \dots, k_n\}$ as $n$ valid kernels. Find an appropriate kernel, $k(u, v)$, from the training data: minimize the training loss (MSE) by changing the $c_i$, and simultaneously minimize the trace of the kernel matrix on the training data to avoid overfitting. Many variations of this algorithm have been developed.
$k(u, v) = \sum_{i=1}^{n} c_i\, k_i(u, v)$
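A loose numerical sketch of one such variant: fit the combination weights $c_i$ by matching the combined Gram matrix to the "ideal" kernel $y y^T$ in a least-squares (MSE-style) sense, with a penalty on the trace of the combined matrix. The specific objective, base kernels, and toy data here are illustrative assumptions, not the exact algorithm from the course:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
y = np.sign(X[:, 0] * X[:, 1])           # toy binary labels

# Gram matrices of the candidate kernels k_1, ..., k_n
G1 = X @ X.T                             # linear
G2 = (X @ X.T + 1.0) ** 2                # polynomial
D = np.sum((X[:, None] - X[None, :]) ** 2, -1)
G3 = np.exp(-D / 2)                      # Gaussian RBF
Gs = [G1, G2, G3]

# least-squares fit of sum_i c_i G_i to y y^T, penalizing the trace
# of the combined Gram matrix (regularization against overfitting)
lam = 0.1
A = np.stack([G.ravel() for G in Gs], axis=1)    # (k*k, n) design matrix
b = np.outer(y, y).ravel()
t = np.array([np.trace(G) for G in Gs])
c = np.linalg.lstsq(A.T @ A + lam * np.outer(t, t),
                    A.T @ b, rcond=None)[0]

K_learned = sum(ci * Gi for ci, Gi in zip(c, Gs))
```

Since the objective is quadratic in the $c_i$, this toy version reduces to a small linear system; the variants mentioned in the slides differ in the loss and constraints used.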