MACHINE LEARNING – 2012 1 MACHINE LEARNING
Kernels: Intuition

How to separate the red class from the grey class?

[Figure: the data shown in Cartesian coordinates (x1, x2) and in polar coordinates (r, θ from 0 to 360°). In polar coordinates the data become linearly separable.]
Kernels: Intuition

How to separate the red class from the grey class? What is φ?

φ: x^i ↦ (r^i, θ^i). Solve for r^i, θ^i s.t. x_1^i = r^i sin θ^i, x_2^i = r^i cos θ^i.

Assume a model (equation) for the transformation. Need at least 3 datapoints to solve.

[Figure: the datapoints x^i in (x1, x2) coordinates and in polar coordinates (r, θ from 0 to 360°).]
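The polar-coordinate trick above can be sketched numerically. This is a minimal illustration (the radii, class names and threshold are made up): two concentric rings are not linearly separable in (x1, x2), but a single threshold on r separates them in polar coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_circle(radius, n=50):
    """Sample n points on a circle of the given radius."""
    theta = rng.uniform(0, 2 * np.pi, n)
    return np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

red = make_circle(1.0)    # inner circle (red class)
grey = make_circle(3.0)   # outer circle (grey class)

def to_polar(X):
    """Map (x1, x2) -> (r, theta)."""
    r = np.linalg.norm(X, axis=1)
    theta = np.arctan2(X[:, 1], X[:, 0])
    return np.column_stack([r, theta])

# In polar coordinates, the line r = 2 separates the two classes.
assert to_polar(red)[:, 0].max() < 2.0 < to_polar(grey)[:, 0].min()
```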
Kernels: Intuition

Idea: Send the data X into a feature space H through the nonlinear map φ.

X = {x^i}_{i=1...M}, x^i ∈ R^N  →  φ(X) = {φ(x^1), ..., φ(x^M)} ⊂ H

[Figure: the datapoints x^i in the original space (x1, x2) and their images φ(x^i) in the feature space H, in polar coordinates (r, θ from 0 to 360°).]

In feature space, computation is simpler (e.g. perform linear classification).
Kernels: Intuition

In most cases, determining the transformation φ beforehand may be difficult. Which representation of the data allows one to classify the three groups of datapoints easily?
Kernels: Intuition

In most cases, determining the transformation φ beforehand may be difficult.

What if the groups live in N dimensions, with N >> 1? Grouping may require separate sets of dimensions and can no longer be visualized.
Kernel-Induced Feature Space

Idea: Send the data X into a feature space H through the nonlinear map φ.

X = {x^i}_{i=1...M}, x^i ∈ R^N  →  φ(X) = {φ(x^1), ..., φ(x^M)} ⊂ H

While the dimension of the original space is N, the dimension of the feature space may be greater than N! X is lifted onto H. Determining φ is difficult → Kernel Trick.
The Kernel Trick

In most cases, determining the transformation φ may be difficult.

Key idea behind the kernel trick: most algorithms for classification, regression or clustering compute an inner product across pairs of observations to determine the separating line, the fit or the grouping of datapoints, respectively.

Inner product across two datapoints: ⟨x^i, x^j⟩
The Kernel Trick

In most cases, determining the transformation φ may be difficult. There is no need to compute the transformation φ if one expresses everything as a function of the inner product in feature space. Proceed as follows:

1) Define a kernel function:

k: X × X → R,  k(x^i, x^j) = ⟨φ(x^i), φ(x^j)⟩.

The function k can be used to determine a metric of similarity across datapoints in feature space. It can extract features that are either common or that distinguish groups of datapoints.

2) Use this transformation to perform classical classification, regression or clustering for the linear case.
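The definition k(x, x') = ⟨φ(x), φ(x')⟩ can be checked on a case where φ is known explicitly. A minimal sketch: for the degree-2 homogeneous polynomial kernel in R², the explicit feature map φ(x) = (x1², √2·x1·x2, x2²) reproduces k(x, x') = ⟨x, x'⟩² exactly.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in R^2."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, y):
    """Degree-2 homogeneous polynomial kernel: <x, y>^2."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# The kernel computes the feature-space inner product without building phi.
assert np.isclose(k(x, y), np.dot(phi(x), phi(y)))
```

The kernel trick pays off because evaluating k costs one N-dimensional inner product, while φ lives in a (possibly much) higher-dimensional space.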
Use of Kernels: Example

Key idea: some problems are made simpler if you change the representation of the data. Which representation of the data allows one to separate the two groups of datapoints linearly?
Use of Kernels: Example

Key idea: some problems are made simpler if you change the representation of the data. The data become linearly separable when projected onto the first two principal components of kernel PCA with an RBF kernel (see next lecture).
Use of Kernels: Example

Key idea: some problems are made simpler if you change the representation of the data. Which representation of the data allows one to classify the three groups of datapoints easily, each in a different cluster?
Use of Kernels: Example

Key idea: some problems are made simpler if you change the representation of the data. The data are correctly clustered when using kernel K-means with an RBF kernel (see next week's lecture).
Popular Kernels

- Homogeneous polynomial kernels:
  k(x, x') = ⟨x, x'⟩^p,  p ∈ N

- Inhomogeneous polynomial kernels:
  k(x, x') = (⟨x, x'⟩ + c)^p,  p ∈ N, c > 0

- Gaussian / RBF kernel (translation-invariant):
  k(x, x') = e^(−‖x − x'‖² / (2σ²)),  σ ∈ R.
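The three kernels above are one-liners. A minimal sketch (parameter defaults are illustrative):

```python
import numpy as np

def poly_homogeneous(x, y, p):
    """Homogeneous polynomial kernel: <x, y>^p."""
    return np.dot(x, y) ** p

def poly_inhomogeneous(x, y, p, c=1.0):
    """Inhomogeneous polynomial kernel: (<x, y> + c)^p, with c > 0."""
    return (np.dot(x, y) + c) ** p

def rbf(x, y, sigma=1.0):
    """Gaussian / RBF kernel: exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

x = np.array([1.0, 0.0])
assert rbf(x, x) == 1.0               # a point is maximally similar to itself
assert poly_homogeneous(x, x, 3) == 1.0
```

Note the RBF kernel depends only on x − x' (translation-invariant), while the polynomial kernels depend on the inner product and hence on the position relative to the origin.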
Kernels: Exercise I

Using the RBF kernel k(x, x') = e^(−‖x − x'‖² / (2σ²)):

a) Draw the isolines for:
- one datapoint x^1, i.e. find all x s.t. k(x, x^1) = cst.
- two datapoints x^1, x^2: find all x s.t. k(x, x^1) + k(x, x^2) = cst.
- three datapoints.
b) Discuss the effect of σ on the isolines.
c) Determine a metric in feature space.
Kernels: Solution Exercise I

[Figure: RBF kernel; M = 1, i.e. 1 datapoint.]

[Figure: Gaussian kernel; M = 2, i.e. 2 datapoints.]

[Figure: Gaussian kernel; M = 2, i.e. 2 datapoints; small kernel width vs. large kernel width.]

[Figures: Gaussian kernel; M = 3, i.e. 3 datapoints (three examples).]
Kernels: Exercise II

Using the homogeneous polynomial kernel k(x, x') = ⟨x, x'⟩^p, p ∈ N, draw the isolines as in the previous exercise for:
a) one datapoint
b) two datapoints
c) three datapoints
Discuss the effect of p on the isolines.
Kernels: Solution Exercise II

[Figure: polynomial kernel; order p = 1, 2, 3; M = 1, i.e. 1 datapoint.]

The isolines are lines perpendicular to the vector pointing from the origin to the datapoint. The order p does not change the geometry; it only changes the values of the isolines.
Kernels: Solution Exercise II

[Figure: polynomial kernel; order p = 1, 2, 3; M = 2, i.e. 2 datapoints.]

For p = 1 the isolines are lines perpendicular to the combination of the two datapoint vectors. With p = 2 we have an ellipse. With p = 3 we have hyperbolas. Orders p > 3 are similar in concept, with changes in the signs of the isoline values depending on whether p is odd or even.
Kernels: Solution Exercise II

[Figure: polynomial kernel; order p = 1, 2, 3; M = 3, i.e. 3 datapoints.]

Solutions with p > 1 present a symmetry around the origin.
Kernels: Exercise III

Another relatively popular kernel is the linear kernel: k(x, x') = x^T x'.
1) Can you tell what this kernel measures?
2) Find an application where using the linear kernel provides an interesting measure.
Kernels: Solution Exercise III

Bags of words: [machine, learning, kernel, rbf, robot, vision, dimension, blue, speed, ...]. You want to group webpages with common groups of words. Set x ∈ {0, 1}^1000, with entry x_i set to 1 if word i is present, else 0. E.g. x = (1, 1, 1, 0, 0, 0, ...) contains the words "machine", "learning" and "kernel" and nothing else. Features live in a low-dimensional space (a common group of webpages has a low number of combinations of words):

k(x^i, x^j) = (x^i)^T x^j = x_1^i x_1^j + x_2^i x_2^j + x_3^i x_3^j + x_4^i x_4^j + ...

The isoline k(x, x^j) = 3 delineates the set of webpages that share the same set of three keywords as x^j.
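The bag-of-words reading of the linear kernel can be sketched directly (a hypothetical toy vocabulary; the page contents are made up): with binary word-presence vectors, x^T x' counts exactly the keywords two pages share.

```python
import numpy as np

# Toy vocabulary (illustrative; the slide's vocabulary has 1000 entries).
vocab = ["machine", "learning", "kernel", "rbf", "robot", "vision"]

def bow(words):
    """Binary bag-of-words vector: entry i is 1 iff vocab[i] is present."""
    return np.array([1.0 if w in words else 0.0 for w in vocab])

page1 = bow({"machine", "learning", "kernel"})
page2 = bow({"machine", "learning", "robot"})

# The linear kernel counts the shared keywords.
k = page1 @ page2
assert k == 2.0   # "machine" and "learning" are shared
```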
Kernels: Solution Exercise III

Sequences of strings (e.g. genetic code): [IPTS, VBUV, ...]. You want to group strings with common substrings. Set x_L to the number of times the substring L (e.g. QD) appears in the string. Apply the same reasoning as before for grouping.
How to choose kernels?

- There is no rule for choosing the right kernel; each kernel must be adapted to a particular problem.
- There now exist methods to learn the kernel (this is one of the proposed topics for the literature survey).
- Kernel design can start from the desired feature space or from the data.
- Some considerations are important:
  - Use the kernel to introduce a priori (domain) knowledge.
  - Be sure to keep some structure in the feature space.
Example: The Kernel Trick in Regression

Linear regression: search a linear mapping between input x and output y,

y = f(x; w, b) = w^T x + b,

parametrized by the slope vector w ∈ R^N and the intercept b, with x ∈ R^N, y ∈ R.

Omit b by centering the data: y' = y − ȳ and x' = x − x̄, where x̄, ȳ are the means of x and y. Then y' = w^T x', with b = ȳ − w^T x̄. Compute the least-square estimate w* from the centered data.
The Kernel Trick in Regression

Linear regression: search a linear mapping between input x and output y,

y = f(x; w) = w^T x,

parametrized by the slope vector w ∈ R^N, with x ∈ R^N, y ∈ R.
The Kernel Trick in Regression

The problem consists of minimizing the loss function:

L(x, y; w) = (y − w^T x)²

Find w* such that w* = argmin_w L(x, y; w).
The Kernel Trick in Regression

Pairs of training points: X = [x^1 ... x^M] and y = [y^1 ... y^M]^T, with x^i ∈ R^N, y^i ∈ R.

L(X, y; w) = Σ_{i=1...M} (y^i − w^T x^i)²
The Kernel Trick in Regression

L(X, y; w) = ½ ‖y − X^T w‖² = ½ (y − X^T w)^T (y − X^T w),

with y = [y^1 ... y^M]^T and X the matrix with column vectors x^i.

The optimal w* is given by:

w* = argmin_w ½ (y − X^T w)^T (y − X^T w)

X X^T w = X y has an analytical solution: w* = (X X^T)^{−1} X y
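The closed-form solution can be checked numerically. A minimal sketch, assuming noiseless data generated from a known slope vector (all values below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 3, 50
w_true = np.array([2.0, -1.0, 0.5])      # hypothetical ground-truth slope
X = rng.normal(size=(N, M))               # columns are the M datapoints
y = X.T @ w_true                          # noiseless linear observations

# Least-squares solution: solve (X X^T) w = X y  instead of inverting.
w_star = np.linalg.solve(X @ X.T, X @ y)

assert np.allclose(w_star, w_true)        # exact recovery on noiseless data
```

Using `solve` on the N×N normal equations rather than forming the inverse explicitly is the standard numerically preferable route.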
The Kernel Trick in Regression

w* = (X X^T)^{−1} X y

Computing w ∈ R^N requires N datapoints at minimum to solve. Generally not too computationally intensive, as N ≤ M.

Requires O(N³) operations!

It has an exact solution if:
a) X X^T is not singular (it is singular with not enough datapoints);
b) the data are not noisy (otherwise there is no single match y^i = ⟨w, x^i⟩).
The Kernel Trick in Regression

w* = (X X^T)^{−1} X y

If X X^T is singular, there are two solutions:
a) Approximate (X X^T)^{−1} with the pseudo-inverse (minimize the norm);
b) Trade off the size of the norm against the loss (Ridge Regression).
The Kernel Trick in Regression

Ridge Regression:

min_w L(X, y; w) = min_w ½ (y − X^T w)^T (y − X^T w) + (λ/2) w^T w

Regularization term: introduces a penalty for large weights and reduces the number of solutions.

Take the derivative for w: X y = (X X^T + λI) w  ⇒  w* = (X X^T + λI)^{−1} X y

X X^T + λI is always invertible for λ > 0.

Complexity is still O(N³).
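The effect of the penalty can be seen in the regime the previous slide warned about: with fewer datapoints than dimensions, X X^T is singular, yet X X^T + λI is invertible. A minimal sketch (dimensions and λ are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, lam = 5, 3, 0.1
X = rng.normal(size=(N, M))      # fewer datapoints (M) than dimensions (N)
y = rng.normal(size=M)

# Plain least squares would fail: X X^T has rank at most M < N.
assert np.linalg.matrix_rank(X @ X.T) < N

# Ridge regression primal solution: w* = (X X^T + lam I)^(-1) X y
A = X @ X.T + lam * np.eye(N)
w_star = np.linalg.solve(A, X @ y)

assert w_star.shape == (N,)      # a unique solution exists despite M < N
```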
The Kernel Trick in Regression

Ridge Regression:

min_w L(X, y; w) = min_w ½ (y − X^T w)^T (y − X^T w) + (λ/2) w^T w

Take the derivative for w: X y = X X^T w + λw, i.e. w = λ^{−1} X (y − X^T w).

w* = X (X^T X + λI)^{−1} y: this solution is called the Dual.

G = X^T X: Gram matrix, M × M, with M the number of datapoints.

Complexity O(M³).
The Kernel Trick in Regression

Ridge Regression:

min_w L(X, y; w) = min_w ½ (y − X^T w)^T (y − X^T w) + (λ/2) w^T w

Primal (take the derivative for w): X y = (X X^T + λI) w  ⇒  w* = (X X^T + λI)^{−1} X y. Complexity O(N³).

Dual: w* = X (X^T X + λI)^{−1} y, with G = X^T X the M × M Gram matrix and M the number of datapoints. Complexity O(M³).

If N >> M, solve with the dual. If M >> N, solve with the primal.
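That the primal and dual formulas return the same weight vector follows from the matrix identity (X X^T + λI) X = X (X^T X + λI). A quick numerical check (sizes and λ are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, lam = 4, 30, 0.5
X = rng.normal(size=(N, M))      # columns are the M datapoints
y = rng.normal(size=M)

# Primal: invert an N x N matrix -- O(N^3).
w_primal = np.linalg.solve(X @ X.T + lam * np.eye(N), X @ y)

# Dual: invert the M x M matrix X^T X + lam I -- O(M^3).
w_dual = X @ np.linalg.solve(X.T @ X + lam * np.eye(M), y)

assert np.allclose(w_primal, w_dual)   # both routes give the same weights
```

Here N < M, so the primal is cheaper; with N >> M the dual wins, exactly as the slide states.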
The Kernel Trick in Regression

The Gram matrix G = X^T X is composed of inner products between training points: G_ij = ⟨x^i, x^j⟩. Use the kernel trick to transfer to non-linear regression.

w* = X (X^T X + λI)^{−1} y: this solution is called the Dual.

G = X^T X: Gram matrix, M × M, with M the number of datapoints.
The Kernel Trick in Regression

Estimate a non-linear function y = f(x). Assume that there exists a non-linear transformation φ such that the problem becomes linear: ∃ w s.t. y = ⟨φ(x), w⟩.

If true, the solution with ridge regression is w* = φ(X) (φ(X)^T φ(X) + λI)^{−1} y, where the columns of φ(X) are the φ(x^i).

The predicted output for a query point x is:

y(x) = φ(x)^T φ(X) (φ(X)^T φ(X) + λI)^{−1} y
The Kernel Trick in Regression

Replace all inner products between training points by the kernel function k: X × X → R, k(x^i, x^j) = ⟨φ(x^i), φ(x^j)⟩. The kernel function is easier to compute and does not require knowing φ.

The predicted output for a query point x was:

y(x) = φ(x)^T φ(X) (φ(X)^T φ(X) + λI)^{−1} y

It becomes:

y(x) = k(x, X) (K + λI)^{−1} y,

with K the Gram matrix in feature space, K_ij = k(x^i, x^j), and k(x, X) = [k(x, x^1), ..., k(x, x^M)].
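The prediction formula above fits in a few lines. A minimal kernel ridge regression sketch with an RBF kernel (the target function, sample sizes, λ and σ are all made up for illustration); note that φ is never computed, only k:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(40, 1))          # M = 40 training inputs
y = np.sin(X[:, 0])                           # hypothetical non-linear target
lam, sigma = 1e-3, 0.5

def rbf(A, B):
    """RBF kernel matrix between row-sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

K = rbf(X, X)                                 # Gram matrix in feature space
# alpha = (K + lam I)^(-1) y ; prediction is y(x) = k(x, X) alpha
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(Xq):
    return rbf(Xq, X) @ alpha

# The fit is close to the training targets for small lambda.
assert np.allclose(predict(X), y, atol=0.05)
```

With a small λ the model nearly interpolates the training data; larger λ trades training fit for smoothness, mirroring the primal/dual ridge trade-off discussed above.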
The Kernel Trick in Regression

See the MATLAB example in the supplementary documentation.
Kernels: Summary

- Kernels are continuous, positive, symmetric functions. The most popular is the RBF kernel. It enables grouping of datapoints; the tightness of the grouping depends on the value of the hyperparameter sigma.
- Kernels represent metrics in feature space (inner products of the projections of the data in feature space).
- Kernels in ML are useful because they allow one to reduce