Kernel Methods
Barnabás Póczos
Outline
- Quick Introduction
- Feature space
- Perceptron in the feature space
- Kernels
- Mercer's theorem
  - Finite domain
  - Arbitrary domain
- Kernel families
- Constructing new kernels
[Figure: a hard one-dimensional dataset along the x-axis, with the positive and negative "planes" marked; taken from Andrew W. Moore.]
By adding new features (mapping the data to a larger feature space) the data might become linearly separable.
m points in an (m-1)-dimensional space are almost always linearly separable by a hyperplane! ⇒ it is good to map the data to high-dimensional spaces. (For example, 4 points in 3D.)
Make up a new feature! Sort of… computed from the original feature(s): $z_k = (x_k, x_k^2)$.
[Figure: the same data plotted in the new 2D feature space; taken from Andrew W. Moore.]
Separable! MAGIC! Now drop this "augmented" data into our linear SVM.
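To make this concrete, here is a minimal sketch (not from the slides) of the same augmentation in code; the toy data, and the use of scikit-learn's LinearSVC as the linear SVM, are my own choices:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 1D data: negatives in the middle, positives on the outside -> not separable in 1D
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

# Make up a new feature computed from the original one: z_k = (x_k, x_k^2)
Z = np.column_stack([x, x ** 2])

# Drop the "augmented" data into a linear SVM
clf = LinearSVC(C=1e6, max_iter=100000).fit(Z, y)
print(clf.score(Z, y))   # 1.0: the data is separable in the (x, x^2) feature space
```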
m points in an (m-1)-dimensional space are almost always linearly separable by a hyperplane! ⇒ it is good to map the data to high-dimensional spaces.
So why don't we simply map the data into a feature space with dimension m-1?
Even if we don't know how many test data we have...
We don't care about that now...
Several algorithms use only the inner products of the features, not the feature values themselves! E.g. the Perceptron, SVM, Gaussian Processes...
Maximize
$$\sum_{k=1}^{R} \alpha_k \;-\; \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R} \alpha_k \alpha_l\, Q_{kl}, \qquad \text{where } Q_{kl} = y_k\, y_l\, (\mathbf{x}_k \cdot \mathbf{x}_l),$$
subject to these constraints:
$$0 \le \alpha_k \le C \;\;\forall k, \qquad \sum_{k=1}^{R} \alpha_k y_k = 0.$$
So we need the inner product between $\Phi(\mathbf{x}_k)$ and $\Phi(\mathbf{x}_l)$. Looks ugly, and needs lots of computation... Can't we just say: let $k(\mathbf{x}_k,\mathbf{x}_l) = \Phi(\mathbf{x}_k)\cdot\Phi(\mathbf{x}_l)$, and specify this function $k$ directly?
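For illustration (my own sketch, with made-up data): the $Q$ matrix of the dual problem only ever needs the values $k(\mathbf{x}_k,\mathbf{x}_l)$, never the feature vectors themselves:

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # training inputs
y = np.array([-1, 1, 1, -1])                                    # labels

k = lambda a, b: (a @ b + 1.0) ** 2      # some kernel; the explicit feature map never appears

# Q_kl = y_k * y_l * k(x_k, x_l)
K = np.array([[k(a, b) for b in X] for a in X])
Q = np.outer(y, y) * K
print(Q.shape, np.allclose(Q, Q.T))      # (4, 4) True
```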
Finite domain: take the $n \times n$ kernel matrix $G \in \mathbb{R}^{n\times n}$, $G_{ij} = k(x_i, x_j)$, and its eigendecomposition $G = U D U^{\top}$ with $U \in \mathbb{R}^{n\times n}$ orthogonal and $D \in \mathbb{R}^{n\times n}$ diagonal with nonnegative entries. Define the feature map from the rows of $U D^{1/2}$: $\Phi(x_i) = (U D^{1/2})_{i,:}$.
Lemma: the constructed features reproduce the kernel: $\langle \Phi(x_i), \Phi(x_j)\rangle = G_{ij} = k(x_i, x_j)$. Proof: $\Phi(x_i)\,\Phi(x_j)^{\top} = (U D^{1/2})_{i,:}\,(U D^{1/2})_{j,:}^{\top} = (U D U^{\top})_{ij} = G_{ij}$.
Example: choose 7 points in 2D and choose a kernel k. [Figure: the 7 points, labeled 1–7.]
The Gram matrix:
G =
  1.0000  0.8131  0.9254  0.9369  0.9630  0.8987  0.9683
  0.8131  1.0000  0.8745  0.9312  0.9102  0.9837  0.9264
  0.9254  0.8745  1.0000  0.8806  0.9851  0.9286  0.9440
  0.9369  0.9312  0.8806  1.0000  0.9457  0.9714  0.9857
  0.9630  0.9102  0.9851  0.9457  1.0000  0.9653  0.9862
  0.8987  0.9837  0.9286  0.9714  0.9653  1.0000  0.9779
  0.9683  0.9264  0.9440  0.9857  0.9862  0.9779  1.0000
Eigendecomposition $G = U D U^{\top}$:
U = (the 7×7 matrix of eigenvectors; entries not shown here)
D =
  6.6315  0       0       0       0       0       0
  0       0.2331  0       0       0       0       0
  0       0       0.1272  0       0       0       0
  0       0       0       0.0066  0       0       0
  0       0       0       0       0.0016  0       0
  0       0       0       0       0       0.0000  0
  0       0       0       0       0       0       0.0000
Mapped points (each column corresponds to one of the 7 points):
   0.2655  -0.3184   0.1452  -0.0681   0.0983  -0.1573   0.0325
   0.1210  -0.0599  -0.2391   0.1998  -0.0802  -0.0170   0.0719
   0.0511   0.0419  -0.0178  -0.0382  -0.0095  -0.0079  -0.0168
   0.0040   0.0077   0.0185   0.0197  -0.0174  -0.0146  -0.0163
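The whole construction is easy to reproduce numerically. Below is a small sketch with arbitrary points and a Gaussian kernel (the slide's actual 7 points and kernel parameters are not given here, so these are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 2))                       # 7 arbitrary 2D points

def k(u, v, sigma=2.0):                           # Gaussian kernel; bandwidth is an assumption
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

G = np.array([[k(u, v) for v in X] for u in X])   # 7x7 Gram matrix

# Eigendecomposition G = U D U^T (G is symmetric PSD, so the eigenvalues are >= 0)
eigvals, U = np.linalg.eigh(G)
D = np.diag(np.clip(eigvals, 0.0, None))

Phi = U @ np.sqrt(D)                              # mapped points: row i is the feature vector of x_i
print(np.allclose(Phi @ Phi.T, G))                # True: the features reproduce the Gram matrix
```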
We need feature maps: implicit (kernel functions) or explicit (feature maps). Several algorithms need only the inner products of the features! It is much easier to use implicit feature maps (kernels). But given a function, how can we tell: is it a kernel function?
Is it a kernel function?
- Finite domain: SVD, eigenvectors, eigenvalues; positive semidefinite matrices; finite-dimensional feature space.
- Arbitrary domain: Mercer's theorem, eigenfunctions, eigenvalues; positive semidefinite integral operators; infinite-dimensional feature space (ℓ₂).
We have to think about the test data as well...
If the kernel is positive semidefinite ⇒ a feature map can be constructed.
Note: the kernel is a function of 2 variables, while the eigenvectors (eigenfunctions) in its expansion are functions of 1 variable.
We want to know which functions are kernels
We will show another way, using an RKHS: what should the inner product be?
What would SVMs do with this data? Not a big surprise:
[Figure: a hard one-dimensional dataset along the x-axis, with the positive and negative "planes" found by the SVM; taken from Andrew W. Moore.]
Doesn't look like slack variables will save us this time…
Make up a new feature! Sort of… computed from the original feature(s): $z_k = (x_k, x_k^2)$. New features are sometimes called basis functions.
[Figure: the same data in the new 2D feature space; taken from Andrew W. Moore.]
Separable! MAGIC! Now drop this "augmented" data into our linear SVM.
[Figure: points labeled O O X X on a line.]
Let us map this point to the 3rd dimension...
We will use linear classifiers in this feature space.
Picture is taken from R. Herbrich
Picture is taken from R. Herbrich
Feature functions
Picture is taken from R. Herbrich
The Dual Algorithm in the feature space
Picture is taken from R. Herbrich
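As a concrete illustration of a dual algorithm in the feature space (my own sketch, not code from the lecture): a kernel perceptron never forms the weight vector explicitly and touches the data only through kernel evaluations:

```python
import numpy as np

def kernel_perceptron(X, y, k, epochs=500):
    """Dual perceptron: w = sum_i alpha_i * y_i * Phi(x_i) is never formed;
    predictions use f(x) = sign(sum_i alpha_i * y_i * k(x_i, x)) only."""
    n = len(X)
    alpha = np.zeros(n)                                  # mistake counts
    K = np.array([[k(a, b) for b in X] for a in X])      # Gram matrix
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * ((alpha * y) @ K[:, i]) <= 0:      # misclassified (or on the boundary)
                alpha[i] += 1                            # dual update: "w += y_i * Phi(x_i)"
                mistakes += 1
        if mistakes == 0:                                # a clean pass -> converged
            break
    return alpha

# XOR-like data: not linearly separable in the input space, but separable with a quadratic kernel
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1, -1, 1, 1])
k = lambda a, b: (a @ b + 1.0) ** 2
alpha = kernel_perceptron(X, y, k)
preds = np.sign([(alpha * y) @ np.array([k(xi, x) for xi in X]) for x in X])
print(np.array_equal(preds, y))                          # True after convergence
```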
Definition (kernel): a function $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is a kernel if there exists a feature map $\Phi:\mathcal{X}\to\mathcal{H}$ into an inner product space $\mathcal{H}$ such that $k(x,y)=\langle\Phi(x),\Phi(y)\rangle$ for all $x,y\in\mathcal{X}$.
Definition (Gram matrix, kernel matrix): for points $x_1,\ldots,x_m$, the Gram (kernel) matrix is $G\in\mathbb{R}^{m\times m}$ with $G_{ij}=k(x_i,x_j)$. Definition (feature space, kernel space): the inner product space $\mathcal{H}$ that the feature map $\Phi$ maps into.
The Gram matrix is a symmetric, positive semidefinite (PSD) matrix. Proof: symmetry follows from $G_{ij}=\langle\Phi(x_i),\Phi(x_j)\rangle=G_{ji}$, and for any $v\in\mathbb{R}^m$, $v^{\top}Gv=\sum_{i,j}v_i v_j\langle\Phi(x_i),\Phi(x_j)\rangle=\big\|\sum_i v_i\Phi(x_i)\big\|^2\ge 0$. Definition: a symmetric function $k$ is called positive semidefinite if every Gram matrix built from it is positive semidefinite.
Key idea: construct an explicit feature map from the eigendecomposition of the Gram matrix.
As before: take the eigendecomposition $G = U D U^{\top}$ of the Gram matrix $G\in\mathbb{R}^{n\times n}$ and define $\Phi(x_i)$ as the $i$-th row of $U D^{1/2}$.
Lemma: this feature map matches the kernel, $\langle\Phi(x_i),\Phi(x_j)\rangle = G_{ij}$. Proof: as before, $(U D^{1/2})(U D^{1/2})^{\top} = U D U^{\top} = G$.
We have seen: if the Gram matrix is symmetric and positive semidefinite, then a feature map can be constructed. Lemma: these conditions are necessary as well.
Proof: … (the corresponding proof is wrong in Herbrich's book…)
Summary: symmetric, positive semidefinite Gram matrices characterize kernels on finite domains. How can we generalize this to general sets?
Definition (integral operator with kernel k(·,·)): $(T_k f)(x) = \int k(x,u)\, f(u)\, du$. Remark:
A vector $v \in \mathbb{R}^n$ is a mapping from the integers $\{1,2,\ldots,n\}$ to $\mathbb{R}$: $v = (v[1], v[2], \ldots, v[n])$.
Similarly, $w = (w[1], w[2], \ldots, w[n], \ldots)$, where $w$ is a mapping from $\{1,2,\ldots\}$ to $\mathbb{R}$.
In this notation, positive semidefiniteness of the Gram matrix reads $\sum_{i,j} v[i]\, G_{ij}\, v[j] \ge 0$.
From the integers we can further extend to arbitrary sets: a function $f:\mathcal{X}\to\mathbb{R}$ is a mapping from $\mathcal{X}$ to $\mathbb{R}$, i.e., a "vector" with one coordinate for each element of $\mathcal{X}$.
Picture is taken from R. Herbrich
Picture is taken from R. Herbrich
Picture is taken from R. Herbrich
Definition (inner product, Hilbert space): an inner product $\langle\cdot,\cdot\rangle$ on a vector space is a symmetric, bilinear, positive definite form; a Hilbert space is an inner product space that is complete in the induced norm.
Definition (eigenvalue, eigenfunction): $\lambda$ is an eigenvalue and $\psi\neq 0$ an eigenfunction of the integral operator $T_k$ if $(T_k\psi)(x) = \int k(x,u)\,\psi(u)\,du = \lambda\,\psi(x)$.
Definition (positive definite operator): the operator $T_k$ is positive (semi)definite if $\int\!\!\int f(x)\,k(x,u)\,f(u)\,dx\,du \ge 0$ for all square-integrable functions $f$.
Note: the kernel $k(x,u)$ is a function of 2 variables, while the eigenfunctions $\psi(x)$ in its Mercer expansion are functions of 1 variable.
Theorem (a nicer kernel characterization): $k$ is a kernel if and only if it is symmetric and every Gram matrix built from it (on any finite set of points) is positive semidefinite.
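A quick numerical use of this characterization (a sketch of mine): check the smallest eigenvalue of Gram matrices built from candidate functions; a genuine kernel always passes, while e.g. $|a-b|$ fails.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.5])                  # any finite set of points

def gram(k):
    return np.array([[k(a, b) for b in x] for a in x])

good = lambda a, b: np.exp(-(a - b) ** 2)           # Gaussian: a kernel
bad  = lambda a, b: abs(a - b)                      # symmetric, but not a kernel

print(np.linalg.eigvalsh(gram(good)).min())         # >= 0 (up to round-off): PSD Gram matrix
print(np.linalg.eigvalsh(gram(bad)).min())          # < 0: ruled out by the characterization
```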
The kernel acts as a similarity measure between objects, and gives a classifier that is nonlinear in the input space:
⇒ Mercer kernel k
⇒ Mercer map Φ
For example, if $k_1$ and $k_2$ are kernels, then $k_1 + k_2$, $c\,k_1$ (for $c \ge 0$), and $k_1 \cdot k_2$ are also kernels.
Picture is taken from R. Herbrich
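A small numerical illustration of these closure rules (my own check, not from the slides): Gram matrices built from sums, products, and nonnegative scalings of kernels stay positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

k1 = lambda a, b: (a @ b + 1.0) ** 2                        # polynomial kernel
k2 = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)       # Gaussian kernel

def gram(k):
    return np.array([[k(a, b) for b in X] for a in X])

G1, G2 = gram(k1), gram(k2)
for G in (G1 + G2, G1 * G2, 3.0 * G1):                      # k1 + k2, k1 * k2, c * k1 (c >= 0)
    print(np.linalg.eigvalsh(G).min() >= -1e-9)             # True: still (numerically) PSD
```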
Picture is taken from R. Herbrich
Note:
Picture is taken from R. Herbrich
The Gaussian (RBF) kernel $k(\mathbf{a},\mathbf{b}) = \exp\!\big(-\|\mathbf{a}-\mathbf{b}\|^2/(2\sigma^2)\big)$ is equivalent to a feature map $\Phi(x)$ of infinite dimensionality!
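One way to see the infinite-dimensional feature map in 1D (a sketch under the assumption that the kernel in question is the Gaussian one, with $\sigma=1$): expanding $e^{ab}$ in a Taylor series gives features $\Phi_n(x) = e^{-x^2/2}\, x^n/\sqrt{n!}$ for $n = 0, 1, 2, \ldots$, and a truncation of this series already matches the kernel value closely:

```python
import numpy as np
from math import factorial

def rbf(a, b):                         # 1D Gaussian kernel, sigma = 1
    return np.exp(-(a - b) ** 2 / 2)

def phi(x, n_terms=30):
    # Truncation of the infinite-dimensional feature map:
    # phi_n(x) = exp(-x^2/2) * x^n / sqrt(n!)
    return np.array([np.exp(-x ** 2 / 2) * x ** n / np.sqrt(factorial(n)) for n in range(n_terms)])

a, b = 0.7, -1.3
print(rbf(a, b), phi(a) @ phi(b))      # the truncated inner product converges to the kernel value
```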
Note: Proof:
Note: Note: Proof:
Recall: we made up a new feature computed from the original one, $z_k = (x_k, x_k^2)$ (such new features are sometimes called basis functions), and the data became separable (taken from Andrew W. Moore).
So far we added only a few new features (e.g. $x^2$, $x_2^2$, …). Why not ALL OF THEM???
For 3 input variables, the full quadratic map is $(1,\; x_1, x_2, x_3,\; x_1^2, x_2^2, x_3^2,\; x_1x_2, x_1x_3, x_2x_3)$.
In general, the dimension of the quadratic map is about $N^2/2$ for $N$ input variables.
(taken from Andrew W. Moore)
Let
$$\Phi(\mathbf{x}) = \Big(1,\;\; \sqrt{2}\,x_1, \sqrt{2}\,x_2, \ldots, \sqrt{2}\,x_N,\;\; x_1^2, x_2^2, \ldots, x_N^2,\;\; \sqrt{2}\,x_1x_2, \sqrt{2}\,x_1x_3, \ldots, \sqrt{2}\,x_{N-1}x_N\Big)$$
(constant term, linear terms, pure quadratic terms, quadratic cross-terms).
What about those $\sqrt{2}$'s? … stay tuned
(taken from Andrew W. Moore)
$$\Phi(\mathbf{a})\cdot\Phi(\mathbf{b}) \;=\; 1 \;+\; 2\sum_{i=1}^{N} a_i b_i \;+\; \sum_{i=1}^{N} a_i^2 b_i^2 \;+\; 2\sum_{i=1}^{N}\sum_{j=i+1}^{N} a_i a_j b_i b_j$$
(taken from Andrew W. Moore)
$$\Phi(\mathbf{a})\cdot\Phi(\mathbf{b}) \;=\; 1 + 2\sum_{i=1}^{N} a_i b_i + \sum_{i=1}^{N} a_i^2 b_i^2 + 2\sum_{i=1}^{N}\sum_{j=i+1}^{N} a_i a_j b_i b_j$$
Now consider another function of $\mathbf{a}$ and $\mathbf{b}$:
$$(\mathbf{a}\cdot\mathbf{b}+1)^2 \;=\; (\mathbf{a}\cdot\mathbf{b})^2 + 2\,\mathbf{a}\cdot\mathbf{b} + 1 \;=\; \Big(\sum_{i=1}^{N} a_i b_i\Big)^2 + 2\sum_{i=1}^{N} a_i b_i + 1$$
$$=\; \sum_{i=1}^{N}\sum_{j=1}^{N} a_i b_i\, a_j b_j + 2\sum_{i=1}^{N} a_i b_i + 1 \;=\; \sum_{i=1}^{N} a_i^2 b_i^2 + 2\sum_{i=1}^{N}\sum_{j=i+1}^{N} a_i a_j b_i b_j + 2\sum_{i=1}^{N} a_i b_i + 1$$
They're the same! And this is only O(N) to compute… not O(N²).
(taken from Andrew W. Moore)
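This identity is easy to check numerically (my own sketch; the dimension and the random inputs are arbitrary):

```python
import numpy as np

def quad_features(x):
    """Explicit quadratic basis functions Phi(x):
    constant, sqrt(2)*linear, pure quadratic, and sqrt(2)*cross terms."""
    N = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(N) for j in range(i + 1, N)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

explicit = quad_features(a) @ quad_features(b)   # via O(N^2) explicit features
sneaky = (a @ b + 1.0) ** 2                      # via one O(N) kernel evaluation
print(np.isclose(explicit, sneaky))              # True: they are the same number
```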
Cost of building the $Q_{kl}$ matrix for $m$ training points with $N$-dimensional inputs:

| Polynomial | Φ(x) | Cost to build Q_kl matrix: traditional | Cost if N = 100 dim inputs | Φ(a)·Φ(b) | Cost to build Q_kl matrix: sneaky | Cost if N = 100 dim inputs |
|---|---|---|---|---|---|---|
| Quadratic | all ~N²/2 terms up to degree 2 | N² m² / 4 | 2,500 m² | (a·b+1)² | N m² / 2 | 50 m² |
| Cubic | all ~N³/6 terms up to degree 3 | N³ m² / 12 | 83,000 m² | (a·b+1)³ | N m² / 2 | 50 m² |
| Quartic | all ~N⁴/24 terms up to degree 4 | N⁴ m² / 48 | 1,960,000 m² | (a·b+1)⁴ | N m² / 2 | 50 m² |

(taken from Andrew W. Moore)
We are going to map these to a larger space
We want to show that this k is a kernel function
Write $k$ as a product of $p$ factors: $k(\mathbf{x},\mathbf{y}) = \underbrace{\langle\mathbf{x},\mathbf{y}\rangle\cdots\langle\mathbf{x},\mathbf{y}\rangle}_{p\text{ factors}}$. We are going to map these to a larger space.
We already know: … We want to get $k$ in this form: $k(\mathbf{x},\mathbf{y}) = \langle\Phi(\mathbf{x}), \Phi(\mathbf{y})\rangle$.
For example
We already know:
⇒ k is really a kernel!
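One way to spell the argument out (a short sketch; the slides' own derivation is only partially legible here) is via the binomial theorem and the closure properties above, assuming $c \ge 0$:
$$(\langle\mathbf{x},\mathbf{y}\rangle + c)^p \;=\; \sum_{j=0}^{p} \binom{p}{j}\, c^{\,p-j}\, \langle\mathbf{x},\mathbf{y}\rangle^{j}.$$
Each term $\langle\mathbf{x},\mathbf{y}\rangle^{j}$ is a $j$-fold product of the linear kernel with itself, hence a kernel; multiplying by the nonnegative constants $\binom{p}{j} c^{\,p-j}$ and summing preserves the kernel property, so the polynomial kernel is indeed a kernel.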
Now, we show another way using RKHS
What objective do we want to optimize?
1. The 1st term is the empirical loss; 2. the 2nd term is the regularization.
3. How can we minimize this objective over functions?
Over an arbitrary function class this is an infinite-dimensional problem (nope, we do not like that...);
in an RKHS it becomes a finite-dimensional optimization only (yummy...).
The Representer theorem will help us here.
Now, we show another way using RKHS.
Completing (closing) a pre-Hilbert space ⇒ Hilbert space.
The inner product: for $f(\cdot)=\sum_{i=1}^{m}\alpha_i\,k(x_i,\cdot)$ and $g(\cdot)=\sum_{j=1}^{m'}\beta_j\,k(x'_j,\cdot)$, define
$$\langle f, g\rangle \;=\; \sum_{i=1}^{m}\sum_{j=1}^{m'} \alpha_i\,\beta_j\,k(x_i, x'_j). \qquad (*)$$
Note: the inner product (*) is well defined, i.e., its value does not depend on the particular expansions chosen for $f$ and $g$. Proof: by (*), $\langle f,g\rangle=\sum_{j}\beta_j\,f(x'_j)=\sum_{i}\alpha_i\,g(x_i)$, which depends only on the functions $f$ and $g$ themselves.
Lemma: (*) defines a valid inner product, so we obtain a pre-Hilbert space.
A pre-Hilbert space is like the Euclidean space with rational scalars only;
after completion (like the Euclidean space with real scalars) we get a Hilbert space. Proof:
Lemma (reproducing property): $\langle k(x,\cdot),\, f\rangle = f(x)$.
Lemma: the constructed features $\Phi(x) = k(x,\cdot)$ match the kernel: $\langle\Phi(x),\Phi(y)\rangle = k(x,y)$.
Huhh...
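Spelled out from the definition (*) (a one-line check added here for completeness): take $g = k(x,\cdot)$, i.e. $m'=1$, $\beta_1=1$, $x'_1=x$; then
$$\langle f,\, k(x,\cdot)\rangle \;=\; \sum_{i=1}^{m} \alpha_i\, k(x_i, x) \;=\; f(x),$$
and choosing $f = k(y,\cdot)$ gives $\langle k(y,\cdot),\, k(x,\cdot)\rangle = k(y,x) = k(x,y)$, so the feature map $\Phi(x) = k(x,\cdot)$ indeed matches the kernel.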
Proof of property 4 (positive definiteness): it uses the Cauchy-Bunyakovsky-Schwarz (CBS) inequality. For CBS we don't need property 4 itself; we only need that $\langle 0, 0\rangle = 0$!
We now have two methods to construct feature maps from kernels. Well, these feature spaces are all isomorphic with each other.
In the perceptron problem we could use the dual algorithm, because we had this representation: $\mathbf{w} = \sum_{i} \alpha_i\, \Phi(\mathbf{x}_i)$.
Theorem (Representer theorem): consider minimizing over the RKHS an objective of the form
$$\sum_{i=1}^{m} \ell\big(f(x_i), y_i\big) \;+\; \Omega\big(\|f\|^2\big),$$
where the 1st term is the empirical loss, the 2nd term is the regularization, and $\Omega$ is nondecreasing. Then there is a minimizer of the form $f(\cdot) = \sum_{i=1}^{m} \alpha_i\, k(x_i, \cdot)$.
Proof of the Representer theorem. Message: optimizing over general function classes is difficult, but in an RKHS it is only a finite, $m$-dimensional problem!
Proof of the Representer theorem: write $f = \sum_{i=1}^{m}\alpha_i\,k(x_i,\cdot) + f_{\perp}$, where $f_{\perp}$ is orthogonal to the span of the $k(x_i,\cdot)$.
The 1st term (empirical loss) does not change, since by the reproducing property $f(x_i) = \langle f, k(x_i,\cdot)\rangle$ is unaffected by $f_{\perp}$; the 2nd term (regularization) can only decrease by setting $f_{\perp} = 0$.
1st term: empirical loss; 2nd term: regularization.
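As a concrete use of the theorem (a sketch of mine, not from the slides): for squared loss with a norm penalty, the $m$-dimensional problem in the coefficients $\alpha$ even has a closed-form solution, $\alpha = (K + \lambda I)^{-1} y$ (kernel ridge regression):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(30, 1))                    # m = 30 training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)         # noisy targets

def k(a, b, sigma=1.0):                                 # Gaussian kernel
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

K = np.array([[k(a, b) for b in X] for a in X])         # m x m Gram matrix
lam = 0.1

# Representer theorem: the minimizer of sum_i (f(x_i)-y_i)^2 + lam*||f||^2 over the whole RKHS
# has the form f(.) = sum_i alpha_i k(x_i, .); for squared loss the optimal alpha solves
# (K + lam*I) alpha = y, an m-dimensional linear system instead of an infinite-dimensional problem.
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

f = lambda x: sum(a_i * k(x_i, x) for a_i, x_i in zip(alpha, X))
print(f(np.array([0.5])), np.sin(0.5))                  # learned prediction vs. noise-free target
```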