SLIDE 1

Kernel Methods

Barnabás Póczos

slide-2
SLIDE 2

2

Outline

  • Quick Introduction
  • Feature space
  • Perceptron in the feature space
  • Kernels
  • Mercer’s theorem
  • Finite domain
  • Arbitrary domain
  • Kernel families
  • Constructing new kernels from kernels
  • Constructing feature maps from kernels
  • Reproducing Kernel Hilbert Spaces (RKHS)
  • The Representer Theorem
slide-3
SLIDE 3

3

Ralf Herbrich: Learning Kernel Classifiers Chapter 2

slide-4
SLIDE 4

Quick Overview

slide-5
SLIDE 5

5

Hard 1-dimensional Dataset

[Figure: a 1-D dataset plotted along the x axis around x = 0, with a positive “plane” and a negative “plane” marked.]

  • If the data set is not linearly separable, then by adding new features (mapping the data to a larger feature space) the data might become linearly separable.
  • m points in general position in an (m-1)-dimensional space are always linearly separable by a hyperplane ⇒ it is good to map the data to high-dimensional spaces. (For example, 4 points in 3D.)

taken from Andrew W. Moore

slide-6
SLIDE 6

6

Hard 1-dimensional Dataset

Make up a new feature! Sort of… computed from the original feature(s):

    $z_k = (x_k, x_k^2)$

Separable! MAGIC! Now drop this “augmented” data into our linear SVM.

taken from Andrew W. Moore
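A minimal numerical sketch (my own, not from the slides) of this augmentation, using a hypothetical toy dataset in which the negative class sits near the origin and the positive class sits further out; after the map $x \mapsto (x, x^2)$ a linear classifier separates the classes.

import numpy as np

# Hypothetical 1-D dataset: no single threshold on x separates the classes.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.4, 2.6, 3.1])
y = np.array([+1, +1, -1, -1, -1, +1, +1])

# Augment each point with its square: z_k = (x_k, x_k^2)
Z = np.column_stack([x, x**2])

# In the augmented space the hyperplane z2 = 1.5 separates the classes.
w, b = np.array([0.0, 1.0]), -1.5
print(np.sign(Z @ w + b) == y)   # all True -> linearly separable after the map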

slide-7
SLIDE 7

7

Feature mapping

  • m points in general position in an (m-1)-dimensional space are always linearly separable by a hyperplane ⇒ it is good to map the data to high-dimensional spaces.
  • Having m training points, is it always enough to map the data into a feature space with dimension m-1?
  • Nope... We have to think about the test data as well! Even if we don’t know how many test points we will have...
  • We might want to map our data to a huge (∞-dimensional) feature space.
  • Overfitting? Generalization error?... We don’t care now...

slide-8
SLIDE 8

8

Feature mapping, but how???


slide-9
SLIDE 9

9

Observation

Several algorithms use only the inner products of the data points, never the individual feature values! E.g. Perceptron, SVM, Gaussian Processes...

slide-10
SLIDE 10

10

The Perceptron

slide-11
SLIDE 11

11

SVM

Maximize over $\alpha$:

    $\sum_{k=1}^{R} \alpha_k \;-\; \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R} \alpha_k \alpha_l Q_{kl}$,   where   $Q_{kl} = y_k y_l \,(x_k \cdot x_l)$

Subject to these constraints:

    $0 \le \alpha_k \le C$ for all $k$,   and   $\sum_{k=1}^{R} \alpha_k y_k = 0$
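The point here is that the training data enters the dual objective only through inner products. A tiny numerical sketch (mine, with hypothetical toy data) of assembling Q:

import numpy as np

# Hypothetical toy data: R = 5 points in 2-D with +/-1 labels.
X = np.array([[0.0, 1.0], [1.0, 0.5], [-1.0, 0.2], [0.3, -1.0], [2.0, 1.0]])
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])

G = X @ X.T             # all inner products x_k . x_l -- the only way the data enters
Q = np.outer(y, y) * G  # Q_kl = y_k y_l (x_k . x_l), as used in the dual objective
print(Q.shape)          # (5, 5)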

slide-12
SLIDE 12

12

Inner products

So we need the inner product between the feature vectors $\phi(x)$ and $\phi(x')$. That looks ugly, and needs lots of computation... Can’t we just say: let this inner product be given directly by a kernel function $k(x, x')$?

slide-13
SLIDE 13

13

Finite example

slide-14
SLIDE 14

14

Finite example

Lemma: Proof:

slide-15
SLIDE 15

15

Finite example

Choose 7 points in 2-D. Choose a kernel k. The Gram matrix on points 1…7:

G =
    1.0000  0.8131  0.9254  0.9369  0.9630  0.8987  0.9683
    0.8131  1.0000  0.8745  0.9312  0.9102  0.9837  0.9264
    0.9254  0.8745  1.0000  0.8806  0.9851  0.9286  0.9440
    0.9369  0.9312  0.8806  1.0000  0.9457  0.9714  0.9857
    0.9630  0.9102  0.9851  0.9457  1.0000  0.9653  0.9862
    0.8987  0.9837  0.9286  0.9714  0.9653  1.0000  0.9779
    0.9683  0.9264  0.9440  0.9857  0.9862  0.9779  1.0000

slide-16
SLIDE 16

16

[U, D] = svd(G),   U D Uᵀ = G,   U Uᵀ = I

U =
    0.3709  0.5499  0.3392  0.6302  0.0992 -0.1844 -0.0633
    0.3670 -0.6596 -0.1679  0.5164  0.1935  0.2972  0.0985
    0.3727  0.3007 -0.6704 -0.2199  0.4635 -0.1529  0.1862
    0.3792 -0.1411  0.5603 -0.4709  0.4938  0.1029 -0.2148
    0.3851  0.2036 -0.2248 -0.1177 -0.4363  0.5162 -0.5377
    0.3834 -0.3259 -0.0477 -0.0971 -0.3677 -0.7421 -0.2217
    0.3870  0.0673  0.2016 -0.2071 -0.4104  0.1628  0.7531

D = diag(6.6315, 0.2331, 0.1272, 0.0066, 0.0016, 0.0000, 0.0000)

slide-17
SLIDE 17

17

Mapped points = sqrt(D) · Uᵀ

Mapped points =
    0.9551 -0.9451 -0.9597 -0.9765 -0.9917 -0.9872 -0.9966
    0.2655 -0.3184  0.1452 -0.0681  0.0983 -0.1573  0.0325
    0.1210 -0.0599 -0.2391  0.1998 -0.0802 -0.0170  0.0719
    0.0511  0.0419 -0.0178 -0.0382 -0.0095 -0.0079 -0.0168
    0.0040  0.0077  0.0185  0.0197 -0.0174 -0.0146 -0.0163
    0.0011  0.0018 -0.0009  0.0006  0.0032 -0.0045  0.0010
    0.0002  0.0004  0.0007 -0.0008 -0.0020 -0.0008  0.0028
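A short numpy sketch (mine, not from the slides) of this construction: take any symmetric positive semi-definite Gram matrix G, factor it with an SVD, and read off explicit feature vectors whose pairwise inner products reproduce G. The Gaussian kernel and the random 2-D points below are just assumptions to produce some G.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((7, 2))            # 7 hypothetical points in 2-D

# Gram matrix for a Gaussian kernel (any positive semi-definite kernel would do)
sq = np.sum(X**2, axis=1)
G = np.exp(-0.1 * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

U, d, _ = np.linalg.svd(G)                 # G = U diag(d) U^T since G is symmetric PSD
Phi = np.sqrt(d)[:, None] * U.T            # mapped points = sqrt(D) * U^T, one column per point

print(np.allclose(Phi.T @ Phi, G))         # True: <phi_i, phi_j> = G_ij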
slide-18
SLIDE 18

18

Roadmap I

We need feature maps:
  • Explicit (feature maps)
  • Implicit (kernel functions)
Several algorithms need the inner products of the features only!
It is much easier to use implicit feature maps (kernels).
But is a given function a kernel function???

slide-19
SLIDE 19

19

Roadmap II

Is it a kernel function???
  • Finite domain: SVD, eigenvectors, eigenvalues; positive semi-definite matrices; finite-dimensional feature space.
  • Arbitrary domain: Mercer’s theorem, eigenfunctions, eigenvalues; positive semi-definite integral operators; infinite-dimensional feature space (l2).

We have to think about the test data as well...

If the kernel is positive semi-definite ⇒ feature map construction.

slide-20
SLIDE 20

20

Mercer’s theorem

(*)

2 variables 1 variable
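The slide’s formula (*) is not reproduced in the extraction; for reference (my summary, not the slide’s exact statement), Mercer’s theorem says that a continuous, symmetric, positive semi-definite kernel can be expanded in eigenfunctions of its integral operator, turning the 2-variable function k into a sum of products of 1-variable functions:

$$k(x, y) \;=\; \sum_{i=1}^{\infty} \lambda_i \, \psi_i(x)\, \psi_i(y), \qquad \lambda_i \ge 0,$$

so that $\phi(x) = (\sqrt{\lambda_1}\,\psi_1(x), \sqrt{\lambda_2}\,\psi_2(x), \dots) \in \ell_2$ is an explicit feature map with $k(x, y) = \langle \phi(x), \phi(y) \rangle$.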

slide-21
SLIDE 21

21

Mercer’s theorem

...

slide-22
SLIDE 22

22

Roadmap III

We want to know which functions are kernels

  • How to make new kernels from old kernels?
  • The polynomial kernel:

We will show another way using RKHS: Inner product=???

slide-23
SLIDE 23

Ready for the details? ;)

slide-24
SLIDE 24

24

Hard 1-dimensional Dataset

What would SVMs do with this data? Not a big surprise

[Figure: a 1-D dataset plotted along the x axis around x = 0, with a positive “plane” and a negative “plane” marked.]

Doesn’t look like slack variables will save us this time…

taken from Andrew W. Moore

slide-25
SLIDE 25

25

Hard 1-dimensional Dataset

Make up a new feature! Sort of… computed from the original feature(s):

    $z_k = (x_k, x_k^2)$

New features are sometimes called basis functions. Separable! MAGIC! Now drop this “augmented” data into our linear SVM.

taken from Andrew W. Moore

slide-26
SLIDE 26

26

Hard 2-dimensional Dataset

[Figure: a 2-D dataset of O and X points.]

Let us map this point to the 3rd dimension...

slide-27
SLIDE 27

27

Kernels and Linear Classifiers

We will use linear classifiers in this feature space.

slide-28
SLIDE 28

28

Picture is taken from R. Herbrich

slide-29
SLIDE 29

29

Picture is taken from R. Herbrich

slide-30
SLIDE 30

30

Kernels and Linear Classifiers

Feature functions

slide-31
SLIDE 31

31

Back to the Perceptron Example

slide-32
SLIDE 32

32

The Perceptron

  • The primal algorithm in the feature space
slide-33
SLIDE 33

33

The primal algorithm in the feature space

Picture is taken from R. Herbrich

slide-34
SLIDE 34

34

The Perceptron

slide-35
SLIDE 35

35

The Perceptron

The Dual Algorithm in the feature space

slide-36
SLIDE 36

36

The Dual Algorithm in the feature space

Picture is taken from R. Herbrich

slide-37
SLIDE 37

37

The Dual Algorithm in the feature space
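The slide’s figure is not reproduced here, so as a stand-in, here is a minimal sketch (my own, following the usual dual/kernel perceptron rather than Herbrich’s exact pseudocode) in which the weight vector is never formed explicitly; only the coefficients alpha and kernel evaluations are used. The toy dataset and quadratic kernel are assumptions.

import numpy as np

def kernel_perceptron(X, y, kernel, epochs=100):
    """Dual perceptron: w = sum_i alpha_i y_i phi(x_i) is kept implicitly via alpha."""
    m = len(y)
    alpha = np.zeros(m)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # Gram matrix
    for _ in range(epochs):
        mistakes = 0
        for i in range(m):
            # f(x_i) = sum_j alpha_j y_j k(x_j, x_i): uses kernel values only
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                alpha[i] += 1.0        # on a mistake, add x_i to the expansion
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

# Usage on a toy, hypothetical 1-D dataset with a quadratic kernel:
X = np.array([[-2.0], [-0.3], [0.1], [0.4], [1.8]])
y = np.array([1.0, -1.0, -1.0, -1.0, 1.0])
alpha = kernel_perceptron(X, y, kernel=lambda a, b: (a @ b + 1.0) ** 2)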

slide-38
SLIDE 38

38

Kernels

Definition (kernel): a function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel if there exists a feature map $\phi: \mathcal{X} \to \mathcal{F}$ into an inner product space $\mathcal{F}$ such that $k(x, x') = \langle \phi(x), \phi(x') \rangle$ for all $x, x' \in \mathcal{X}$.

slide-39
SLIDE 39

39

Kernels

Definition (Gram matrix, kernel matrix): for points $x_1, \dots, x_m$, the $m \times m$ matrix $G$ with $G_{ij} = k(x_i, x_j)$. Definition (feature space, kernel space): the inner product space $\mathcal{F}$ that the feature map $\phi$ maps into.

slide-40
SLIDE 40

40

Kernel technique

Lemma: The Gram matrix is a symmetric, positive semi-definite matrix.

Proof: symmetry is immediate; for any $c \in \mathbb{R}^m$, $c^\top G c = \sum_{i,j} c_i c_j \langle \phi(x_i), \phi(x_j) \rangle = \big\lVert \sum_i c_i \phi(x_i) \big\rVert^2 \ge 0$.

Definition:

slide-41
SLIDE 41

41

Kernel technique

Key idea:

slide-42
SLIDE 42

42

Kernel technique

slide-43
SLIDE 43

43

Finite example

slide-44
SLIDE 44

44

Finite example

Lemma: Proof:

slide-45
SLIDE 45

45

Kernel technique, Finite example

We have seen: Lemma: These conditions are necessary

slide-46
SLIDE 46

46

Kernel technique, Finite example

Proof: ... (note: this step is wrong in Herbrich’s book...)

slide-47
SLIDE 47

47

Kernel technique, Finite example

Summary: How to generalize this to general sets???

slide-48
SLIDE 48

48

Integral operators, eigenfunctions

Definition (integral operator with kernel k(·,·)): $(T_k f)(x) = \int k(x, x')\, f(x')\, dx'$. Remark:

slide-49
SLIDE 49

49

From Vector domain to Functions

  • Observe that each vector v = (v[1], v[2], ..., v[n]) is a mapping from the integers {1, 2, ..., n} to ℝ.
  • We can generalize this easily to an INFINITE domain: w = (w[1], w[2], ..., w[n], ...), where w is a mapping from {1, 2, ...} to ℝ.

slide-50
SLIDE 50

50

From Vector domain to Functions

From integers we can further extend to
  • ℝ, or
  • ℝ^m
  • Strings
  • Graphs
  • Sets
  • Whatever
slide-51
SLIDE 51

51

Lp and lp spaces


Picture is taken from R. Herbrich

slide-52
SLIDE 52

52

Lp and lp spaces

Picture is taken from R. Herbrich

slide-53
SLIDE 53

53

L2 and l2 special cases

Picture is taken from R. Herbrich

slide-54
SLIDE 54

54

Kernels

Definition: inner product, Hilbert spaces

slide-55
SLIDE 55

55

Integral operators, eigenfunctions

Definition (eigenvalue, eigenfunction): $\psi$ is an eigenfunction of the integral operator $T_k$ with eigenvalue $\lambda$ if $(T_k \psi)(x) = \lambda\, \psi(x)$ for all $x$.

slide-56
SLIDE 56

56

Positive (semi) definite operators

Definition: Positive Definite Operator

slide-57
SLIDE 57

57

Mercer’s theorem

(*)

2 variables 1 variable

slide-58
SLIDE 58

58

Mercer’s theorem

...

slide-59
SLIDE 59

59

A nicer characterization

Theorem (nicer kernel characterization): $k$ is a kernel if and only if it is symmetric and every Gram matrix built from it (on any finite set of points) is positive semi-definite.

slide-60
SLIDE 60

60

Kernel Families

  • Kernels have the intuitive meaning of a similarity measure between objects.
  • So far we have seen two ways of making a linear classifier nonlinear in the input space:
  • 1. (explicit) Choosing a mapping φ ⇒ Mercer kernel k
  • 2. (implicit) Choosing a Mercer kernel k ⇒ Mercer map φ

slide-61
SLIDE 61

61

Designing new kernels from kernels

are also kernels.

Picture is taken from R. Herbrich
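The slide’s list of constructions is in the picture; as a stand-in, here is a small numerical sketch (my own, covering the standard closure rules: sum, positive scaling, product, and f(x) k(x,y) f(y)) that checks positive semi-definiteness of the resulting Gram matrices on random points.

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3))                       # hypothetical points

def gram(kernel):
    return np.array([[kernel(a, b) for b in X] for a in X])

def is_psd(G, tol=1e-9):
    return np.all(np.linalg.eigvalsh((G + G.T) / 2) > -tol)

k1 = lambda a, b: a @ b                               # linear kernel
k2 = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2)) # Gaussian kernel
f  = lambda a: np.cos(a[0])                           # arbitrary real-valued function

new_kernels = [
    lambda a, b: k1(a, b) + k2(a, b),                 # sum of kernels
    lambda a, b: 3.0 * k1(a, b),                      # positive scaling
    lambda a, b: k1(a, b) * k2(a, b),                 # product of kernels
    lambda a, b: f(a) * k2(a, b) * f(b),              # f(x) k(x,y) f(y)
]
print([is_psd(gram(k)) for k in new_kernels])         # [True, True, True, True]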

slide-62
SLIDE 62

62

Designing new kernels from kernels

Picture is taken from R. Herbrich

slide-63
SLIDE 63

63

Designing new kernels from kernels

slide-64
SLIDE 64

64

Kernels on inner product spaces

Note:

slide-65
SLIDE 65

65

Picture is taken from R. Herbrich

slide-66
SLIDE 66

66

Common Kernels

  • Polynomials of degree d
  • Polynomials of degree up to d
  • Sigmoid
  • Gaussian kernels

Equivalent to a φ(x) of infinite dimensionality!
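A compact sketch (mine; the exact parameterizations below are assumptions, since the slide’s formulas are not reproduced in the extraction) of the four kernel families listed above. Note the sigmoid kernel is positive semi-definite only for some parameter choices.

import numpy as np

def poly_kernel(x, z, d=3):
    """Polynomials of degree d."""
    return (x @ z) ** d

def poly_upto_kernel(x, z, d=3):
    """Polynomials of degree up to d."""
    return (x @ z + 1.0) ** d

def sigmoid_kernel(x, z, kappa=1.0, theta=0.0):
    """Sigmoid 'kernel' (not PSD for every kappa, theta)."""
    return np.tanh(kappa * (x @ z) + theta)

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel -- corresponds to an infinite-dimensional phi(x)."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_upto_kernel(x, z), gaussian_kernel(x, z))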

slide-67
SLIDE 67

67

The RBF kernel

Note: Proof:

slide-68
SLIDE 68

68

The RBF kernel

Note: Note: Proof:

slide-69
SLIDE 69

69

The Polynomial kernel

slide-70
SLIDE 70

70

Reminder: Hard 1-dimensional Dataset

Make up a new feature! Sort of… computed from the original feature(s):

    $z_k = (x_k, x_k^2)$

New features are sometimes called basis functions. Separable! MAGIC! Now drop this “augmented” data into our linear SVM.

taken from Andrew W. Moore

slide-71
SLIDE 71

71

… New Features from Old …

  • Here: we mapped ℝ → ℝ² by φ: x ↦ [x, x²]
  • Found “extra dimensions” ⇒ linearly separable!
  • In general:
  • Start with a vector x ∈ ℝ^N
  • Want to add in x₁², x₂², …
  • Probably want other terms, e.g. x₂x₇, …
  • Which ones to include? Why not ALL OF THEM???

slide-72
SLIDE 72

72

Special Case

  • x = (x₁, x₂, x₃) ↦ (1, x₁, x₂, x₃, x₁², x₂², x₃², x₁x₂, x₁x₃, x₂x₃)
  • ℝ³ → ℝ¹⁰: N = 3, n = 10

In general, the dimension of the quadratic map:

    $n = 1 + N + N + \frac{N(N-1)}{2} = \frac{(N+2)(N+1)}{2} \approx \frac{N^2}{2}$

taken from Andrew W. Moore

slide-73
SLIDE 73

73

Quadratic Basis Functions

Let

    $\Phi(x) = \big(\, 1,\;\; \sqrt{2}\,x_1, \dots, \sqrt{2}\,x_N,\;\; x_1^2, \dots, x_N^2,\;\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1 x_3, \dots, \sqrt{2}\,x_{N-1} x_N \,\big)$

    Constant Term | Linear Terms | Pure Quadratic Terms | Quadratic Cross-Terms

What about those $\sqrt{2}$ ?? … stay tuned

taken from Andrew W. Moore

slide-74
SLIDE 74

74

Quadratic Dot Products

    $\Phi(a) \cdot \Phi(b) \;=\; 1 \;+\; 2\sum_{i=1}^{N} a_i b_i \;+\; \sum_{i=1}^{N} a_i^2 b_i^2 \;+\; 2\sum_{i=1}^{N}\sum_{j=i+1}^{N} a_i a_j b_i b_j$

(the four terms come from the constant, linear, pure quadratic, and quadratic cross-term entries of $\Phi$)

taken from Andrew W. Moore

slide-75
SLIDE 75

75

Quadratic Dot Products

We have seen:

    $\Phi(a) \cdot \Phi(b) \;=\; 1 + 2\sum_{i=1}^{N} a_i b_i + \sum_{i=1}^{N} a_i^2 b_i^2 + 2\sum_{i=1}^{N}\sum_{j=i+1}^{N} a_i a_j b_i b_j$

Now consider another function of a and b:

    $(a \cdot b + 1)^2 \;=\; (a \cdot b)^2 + 2\,(a \cdot b) + 1$
    $\qquad = \Big(\sum_{i=1}^{N} a_i b_i\Big)^2 + 2\sum_{i=1}^{N} a_i b_i + 1$
    $\qquad = \sum_{i=1}^{N}\sum_{j=1}^{N} a_i b_i\, a_j b_j + 2\sum_{i=1}^{N} a_i b_i + 1$
    $\qquad = \sum_{i=1}^{N} a_i^2 b_i^2 + 2\sum_{i=1}^{N}\sum_{j=i+1}^{N} a_i a_j b_i b_j + 2\sum_{i=1}^{N} a_i b_i + 1$

They’re the same! And this is only O(N) to compute… not O(N²)

taken from Andrew W. Moore
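A quick numerical check (my own sketch, with hypothetical vectors) that the explicit quadratic feature map from the previous slides and the kernel $(a \cdot b + 1)^2$ give exactly the same value:

import numpy as np
from itertools import combinations

def quad_features(x):
    """Phi(x) = (1, sqrt(2) x_i, x_i^2, sqrt(2) x_i x_j for i < j)."""
    s2 = np.sqrt(2.0)
    cross = [s2 * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate(([1.0], s2 * x, x ** 2, cross))

a = np.array([0.5, -1.2, 2.0])
b = np.array([1.0, 0.3, -0.7])

lhs = quad_features(a) @ quad_features(b)    # explicit map: O(N^2) features
rhs = (a @ b + 1.0) ** 2                     # kernel evaluation: O(N)
print(np.isclose(lhs, rhs))                  # True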

slide-76
SLIDE 76

76

Higher Order Polynomials

Recall $Q_{kl} = y_k y_l\,(x_k \cdot x_l)$; with a feature map this becomes $Q_{kl} = y_k y_l\,(\Phi(x_k) \cdot \Phi(x_l))$. Costs for m training points of dimension N:

Polynomial | Φ(x)                             | Cost to build Q_kl: traditional | Cost if N=100 dims | Φ(a)·Φ(b)   | Cost to build Q_kl: sneaky | Cost if N=100 dims
Quadratic  | all ~N²/2 terms up to degree 2   | N² m² / 4                       | 2,500 m²           | (a·b + 1)²  | N m² / 2                   | 50 m²
Cubic      | all ~N³/6 terms up to degree 3   | N³ m² / 12                      | 83,000 m²          | (a·b + 1)³  | N m² / 2                   | 50 m²
Quartic    | all ~N⁴/24 terms up to degree 4  | N⁴ m² / 48                      | 1,960,000 m²       | (a·b + 1)⁴  | N m² / 2                   | 50 m²

taken from Andrew W. Moore

slide-77
SLIDE 77

77

The Polynomial kernel, General case

We are going to map these to a larger space

We want to show that this k is a kernel function

slide-78
SLIDE 78

78

The Polynomial kernel, General case

P factors We are going to map these to a larger space

slide-79
SLIDE 79

79

The Polynomial kernel, General case

We already know: We want to get k in this form:

slide-80
SLIDE 80

80

The Polynomial kernel

For example

We already know:

slide-81
SLIDE 81

81

The Polynomial kernel

slide-82
SLIDE 82

82

The Polynomial kernel

⇒ k is really a kernel!

slide-83
SLIDE 83

83

Reproducing Kernel Hilbert Spaces

slide-84
SLIDE 84

84

RKHS, Motivation

Now, we show another way using RKHS

What objective do we want to optimize?

1., 2.,

slide-85
SLIDE 85

85

RKHS, Motivation

1st term: empirical loss. 2nd term: regularization.

3., How can we minimize this objective over functions???

  • Be PARAMETRIC!!!... (nope, we do not like that...)
  • Use RKHS, and suddenly the problem will be a finite-dimensional optimization only (yummy...)

The Representer theorem will help us here
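For concreteness (my notation; the slide’s exact symbols are not reproduced in the extraction), the objective referred to above is typically of the form

$$\min_{f \in \mathcal{H}} \; \sum_{i=1}^{m} L\big(y_i, f(x_i)\big) \;+\; \lambda\, \lVert f \rVert_{\mathcal{H}}^{2},$$

where the first term is the empirical loss on the training data and the second term regularizes the norm of $f$ in the RKHS $\mathcal{H}$.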

slide-86
SLIDE 86

86

Reproducing Kernel Hilbert Spaces

Now, we show another way using RKHS.

Completing (closing) a pre-Hilbert space ⇒ Hilbert space

slide-87
SLIDE 87

87

Reproducing Kernel Hilbert Spaces

The inner product:

(*)

slide-88
SLIDE 88

88

Reproducing Kernel Hilbert Spaces

Note: Proof:

(*)

slide-89
SLIDE 89

89

Reproducing Kernel Hilbert Spaces

Lemma:

  • Pre-Hilbert space: like the Euclidean space with rational scalars only
  • Hilbert space: like the Euclidean space with real scalars

Proof:

slide-90
SLIDE 90

90

Reproducing Kernel Hilbert Spaces

Lemma: (Reproducing property) Lemma: The constructed features match k

Huhh...
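For reference (my summary of the standard statement, not the slide’s exact formula): the reproducing property says that for every $f \in \mathcal{H}_k$ and every $x$,

$$\langle f,\; k(\cdot, x) \rangle_{\mathcal{H}_k} = f(x),$$

and in particular $\langle k(\cdot, x), k(\cdot, x') \rangle_{\mathcal{H}_k} = k(x, x')$, so the feature map $\phi(x) = k(\cdot, x)$ reproduces the kernel.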

slide-91
SLIDE 91

91

Reproducing Kernel Hilbert Spaces

Proof of property 4.,:

  • reproducing property
  • CBS (Cauchy–Bunyakovsky–Schwarz inequality)

For CBS we don’t need property 4.,; we only need that ⟨0, 0⟩ = 0!

slide-92
SLIDE 92

92

Methods to Construct Feature Spaces

We now have two methods to construct feature maps from kernels. Well, these feature spaces are all isomorphic with each other... ☺
slide-93
SLIDE 93

93

The Representer Theorem

In the perceptron problem we could use the dual algorithm, because we had this representation:

slide-94
SLIDE 94

94

The Representer Theorem

Theorem:

1st term, empirical loss 2nd term, regularization
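For reference (my phrasing of the standard statement, matching the objective above): if $f^\star$ minimizes $\sum_{i=1}^{m} L(y_i, f(x_i)) + \lambda \lVert f \rVert_{\mathcal{H}_k}^2$ over the RKHS $\mathcal{H}_k$, then it can be written as

$$f^\star(\cdot) \;=\; \sum_{i=1}^{m} \alpha_i \, k(x_i, \cdot)$$

for some coefficients $\alpha_1, \dots, \alpha_m \in \mathbb{R}$, i.e. the minimizer lives in the span of the kernel functions centered at the training points.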

slide-95
SLIDE 95

95

The Representer Theorem

Proof of the Representer Theorem: ... Message: Optimizing over general function classes is difficult, but in an RKHS it becomes only a finite (m-dimensional) problem!
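As an illustration (my own sketch, not part of the slides) of the theorem in action: kernel ridge regression with the squared loss has a closed-form solution for the m coefficients alpha, and the learned function is exactly the finite expansion $f(x) = \sum_i \alpha_i k(x_i, x)$. The data and kernel width below are assumptions.

import numpy as np

def gaussian_gram(A, B, sigma=0.3):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

# Hypothetical 1-D regression data
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(30)

lam = 1e-2
K = gaussian_gram(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)   # only an m-dimensional problem

# The minimizer is f(x) = sum_i alpha_i k(x_i, x) -- evaluate it on test points
X_test = np.linspace(-1, 1, 5).reshape(-1, 1)
f_test = gaussian_gram(X_test, X) @ alpha
print(f_test)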

slide-96
SLIDE 96

96

Proof of the Representer Theorem

Proof of Representer Theorem

1st term, empirical loss 2nd term, regularization

slide-97
SLIDE 97

97

1st term, empirical loss 2nd term, regularization

Proof of the Representer Theorem

slide-98
SLIDE 98

98

Still to come

  • Supervised Learning
  • SVM using kernels
  • Gaussian Processes
  • Regression
  • Classification
  • Heteroscedastic case
  • Unsupervised Learning
  • Kernel Principal Component Analysis
  • Kernel Independent Component Analysis
  • Kernel Mutual Information
  • Kernel Generalized Variance
  • Kernel Canonical Correlation Analysis
slide-99
SLIDE 99

99

If we still have time…

  • Automatic Relevance Machines
  • Bayes Point Machines
  • Kernels on other objects
  • Kernels on graphs
  • Kernels on strings
  • Fisher kernels
  • ANOVA kernels
  • Learning kernels
slide-100
SLIDE 100

100

Thanks for the Attention! ☺