SLIDE 1

Kernels & Kernelization

Ken Kreutz-Delgado (Nuno Vasconcelos)

Winter 2012 — UCSD — ECE 174A

SLIDE 2

Inner Product Matrix & PCA

Given the centered data matrix X_c:

  • 1) Construct the inner product matrix K_c = X_c^T X_c
  • 2) Compute its eigendecomposition (Σ^2, M), where K_c = M Σ^2 M^T

PCA: for the covariance matrix (1/n) X_c X_c^T = Γ Λ Γ^T

  • Principal components are given by Γ = X_c M Σ^{-1}
  • Principal values are given by Λ^{1/2} = (1/√n) Σ
  • Projection of the centered data onto the principal components is given by

$$\Gamma^T X_c = \Sigma^{-1} M^T X_c^T X_c = \Sigma^{-1} M^T K_c$$

This allows the computation of the eigenvalues and PCA coefficients when we only have access to the dot-product (inner product) matrix K_c.
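To make the recipe concrete, here is a minimal numpy sketch (an illustration, not part of the original slides) that computes the principal values and the PCA coefficients from the inner product matrix K_c alone, and cross-checks them against PCA computed directly from the covariance matrix. The toy data and variable names are assumptions.

```python
import numpy as np

# Minimal sketch: PCA from the inner product matrix K_c only.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))                 # d x n data matrix: d = 5 features, n = 100 samples
Xc = X - X.mean(axis=1, keepdims=True)        # center each feature
d, n = Xc.shape

Kc = Xc.T @ Xc                                # n x n inner product matrix
eigvals, M = np.linalg.eigh(Kc)               # K_c = M Sigma^2 M^T
eigvals, M = eigvals[::-1], M[:, ::-1]        # sort in descending order
sigma = np.sqrt(np.clip(eigvals[:d], 0, None))

principal_values = sigma / np.sqrt(n)         # Lambda^{1/2} = (1/sqrt(n)) Sigma
proj = np.diag(1.0 / sigma) @ M[:, :d].T @ Kc # Gamma^T X_c = Sigma^{-1} M^T K_c

# Cross-check against PCA computed directly from the covariance matrix (1/n) X_c X_c^T
lam, Gamma = np.linalg.eigh(Xc @ Xc.T / n)
lam, Gamma = lam[::-1], Gamma[:, ::-1]
print(np.allclose(principal_values, np.sqrt(lam)))       # same principal values
print(np.allclose(np.abs(proj), np.abs(Gamma.T @ Xc)))   # same coefficients (up to sign)
```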

SLIDE 3

The Inner Product Form

This turns out to be the case for many learning algorithms. If you manipulate expressions a little bit, you can often write them in "dot product form".

Definition: a learning algorithm is in inner product form if, given a training data set D = {(x_1, y_1), ..., (x_n, y_n)}, it only depends on the points x_i through their inner products

$$\langle x_i, x_j \rangle = x_i^T x_j$$

For example, let’s look at k-means

SLIDE 4

K-means Clustering

We saw that the k-means algorithm iterates between

  • 1) (re-)Classification: $i^*(x) = \arg\min_i \, \lVert x - \mu_i \rVert^2$
  • 2) (re-)Estimation: $\mu_i^{new} = \frac{1}{n_i} \sum_j x_j^{(i)}$

Note that:

$$\lVert x - \mu_i \rVert^2 = (x - \mu_i)^T (x - \mu_i) = x^T x - 2\, x^T \mu_i + \mu_i^T \mu_i$$
SLIDE 5

K-means Clustering

Combining this expansion with the sample mean formula

$$\mu_i = \frac{1}{n_i} \sum_j x_j^{(i)}$$

allows us to write the distance between a data sample x_k and the class center μ_i as a function of the inner products ⟨x_i, x_j⟩ = x_i^T x_j:

$$\lVert x_k - \mu_i \rVert^2 = x_k^T x_k \;-\; \frac{2}{n_i} \sum_j x_k^T x_j^{(i)} \;+\; \frac{1}{n_i^2} \sum_{j,l} x_j^{(i)T} x_l^{(i)}$$
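A quick numerical check of this identity (illustrative, not from the slides): the distance computed purely from inner products agrees with the direct computation.

```python
import numpy as np

# Illustrative check: the inner-product form of ||x_k - mu_i||^2 matches the direct form.
rng = np.random.default_rng(1)
cluster = rng.normal(size=(20, 3))   # the n_i = 20 points currently assigned to cluster i
x_k = rng.normal(size=3)             # the sample to be classified
n_i = len(cluster)

mu_i = cluster.mean(axis=0)
direct = np.sum((x_k - mu_i) ** 2)

inner_product_form = (x_k @ x_k
                      - (2.0 / n_i) * np.sum(cluster @ x_k)
                      + np.sum(cluster @ cluster.T) / n_i ** 2)

print(np.allclose(direct, inner_product_form))   # True
```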


SLIDE 6

“The Kernel Trick”

Why is this interesting? Consider the following transformation of the feature space:

  • Introduce a mapping to a "better" (i.e., linearly separable) feature space, Φ: X → Z, where, generally, dim(Z) > dim(X).
  • If a classification algorithm only depends on the data through inner products then, in the transformed space, it depends on

$$\langle \Phi(x_i), \Phi(x_j) \rangle = \Phi(x_i)^T \Phi(x_j)$$

[Figure: data points x_1, ..., x_n in X and their images under Φ in the transformed space Z]

SLIDE 7

The Inner Product Implementation

In the transformed space, the learning algorithm only requires the inner products

$$\langle \Phi(x_i), \Phi(x_j) \rangle = \Phi(x_j)^T \Phi(x_i)$$

Note that we do not need to store the Φ(x_j), but only the n² (scalar) component values of the inner product matrix.

Interestingly, this holds even if Φ(x) takes its value in an infinite dimensional space.

  • We get a reduction from infinity to n²!
  • There is, however, still one problem: when Φ(x_j) is infinite dimensional, the computation of the inner product ⟨Φ(x_i), Φ(x_j)⟩ looks impossible.

SLIDE 8

“The Kernel Trick”

"Instead of defining Φ(x), then computing Φ(x_i) for each i, and then computing ⟨Φ(x_i), Φ(x_j)⟩ for each pair (i,j), simply define a kernel function

$$K(x,z) \overset{\mathrm{def}}{=} \langle \Phi(x), \Phi(z) \rangle$$

and work with it directly."

K(x,z) is called an inner product or dot-product kernel. Since we only use the kernel, why bother to define Φ(x)? Just define the kernel K(x,z) directly! Then we never have to deal with the complexity of Φ(x). This is usually called "the kernel trick".

SLIDE 9

Important Questions

How do I know that if I pick a bivariate function K(x,z), it is actually equivalent to an inner product?

  • Answer: In fact, in general it is not. (More about this later.)

If it is, how do I know what Φ(x) is?

  • Answer: you may never know. E.g. the Gaussian kernel (shown below) is a very popular choice, but it is not obvious what Φ(x) is. However, on the positive side, we do not need to know how to choose Φ(x). Choosing an admissible kernel K(x,z) is sufficient.

Why is it that using K(x,z) is easier/better?

  • Answer: Complexity management. Let's look at an example.

The Gaussian kernel:

$$\langle \Phi(x), \Phi(z) \rangle = K(x,z) = e^{-\lVert x - z \rVert^2 / \sigma^2}$$

SLIDE 10

Polynomial Kernels

In ℝ^d, consider the square of the inner product between two vectors:

$$(x^T z)^2 = \left( \sum_{i=1}^d x_i z_i \right) \left( \sum_{j=1}^d x_j z_j \right) = \sum_{i=1}^d \sum_{j=1}^d x_i x_j z_i z_j = x_1 x_1 z_1 z_1 + x_1 x_2 z_1 z_2 + \cdots + x_d x_d z_d z_d$$

SLIDE 11

Polynomial Kernels

This can be written as

$$\sum_{i,j} x_i x_j z_i z_j = \begin{bmatrix} x_1 x_1 & x_1 x_2 & \cdots & x_d x_d \end{bmatrix} \begin{bmatrix} z_1 z_1 \\ z_1 z_2 \\ \vdots \\ z_d z_d \end{bmatrix} = \Phi(x)^T \Phi(z)$$

Hence, we have

$$K(x,z) = (x^T z)^2 = \Phi(x)^T \Phi(z) \quad \text{with} \quad \Phi(x) = \left[ x_1 x_1,\; x_1 x_2,\; \ldots,\; x_1 x_d,\; x_2 x_1,\; \ldots,\; x_d x_d \right]^T$$

SLIDE 12

Polynomial Kernels

The point is that:

  • The computation of Φ(x)^T Φ(z) has complexity O(d^2)
  • The direct computation of K(x,z) = (x^T z)^2 has complexity O(d)

Direct evaluation is more efficient by a factor of d. As d goes to infinity, this allows a feasible implementation.

BTW, you just met another kernel family:

  • This implements polynomials of second order
  • In general, the family of polynomial kernels is defined as shown below (a small numerical check follows the formula)
  • I don't even want to think about writing down Φ(x)!

$$K(x,z) = \left( 1 + x^T z \right)^k, \quad k = 1, 2, \ldots$$
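A small numerical check (illustrative; the helper phi is an assumption) that the second-order polynomial kernel really is an inner product in the d²-dimensional feature space:

```python
import numpy as np

# Illustrative check: K(x,z) = (x^T z)^2 equals Phi(x)^T Phi(z), where Phi maps a
# d-vector to the d^2 products x_i * x_j (the feature map written out above).
def phi(v):
    return np.outer(v, v).ravel()

rng = np.random.default_rng(2)
x, z = rng.normal(size=6), rng.normal(size=6)

kernel_value = (x @ z) ** 2              # O(d) work
feature_space_value = phi(x) @ phi(z)    # O(d^2) work
print(np.allclose(kernel_value, feature_space_value))   # True
```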

SLIDE 13

Kernel Summary

1. The data set D is not easy to deal with in X, so apply a feature transformation Φ: X → Z such that dim(Z) >> dim(X).
2. Constructing and computing Φ(x) directly is too expensive:

  • Write your learning algorithm in inner product form
  • Then, instead of Φ(x), we only need ⟨Φ(x_i), Φ(x_j)⟩ for all i and j, which we can compute by defining an "inner product kernel"

$$K(x,z) = \langle \Phi(x), \Phi(z) \rangle$$

    and computing K(x_i, x_j) ∀ i,j directly

  • Note: the matrix $K = \left[ K(x_i, x_j) \right]$ is called the "kernel matrix" or Gram matrix

3. Moral: Forget about Φ(x) and instead use K(x,z) from the start!

SLIDE 14

Question?

What is a good inner product kernel?

  • This is a difficult question (see Prof. Lanckriet's work)

In practice, the usual recipe is:

  • Pick a kernel from a library of known kernels (a small code sketch follows the list). We have already met:
  • the linear kernel K(x,z) = x^T z
  • the Gaussian family $K(x,z) = e^{-\lVert x - z \rVert^2 / \sigma^2}$
  • the polynomial family $K(x,z) = (1 + x^T z)^k, \; k = 1, 2, \ldots$
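For concreteness, a minimal sketch of this small kernel library and of the Gram matrix it induces (parameter names such as sigma and k are assumptions, not from the slides):

```python
import numpy as np

# Sketch of a small kernel "library" and a generic Gram matrix builder.
def linear_kernel(x, z):
    return x @ z

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def polynomial_kernel(x, z, k=2):
    return (1.0 + x @ z) ** k

def gram_matrix(X, kernel):
    """Kernel (Gram) matrix K[i, j] = kernel(x_i, x_j) for the rows x_i of X."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

# Example: 10 random points, Gaussian kernel
X = np.random.default_rng(3).normal(size=(10, 4))
K = gram_matrix(X, lambda a, b: gaussian_kernel(a, b, sigma=2.0))
print(K.shape)   # (10, 10)
```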

SLIDE 15

Inner Product Kernel Families

Why introduce simple, known kernel families?

  • Obtain the benefits of a high-dimensional space without paying a price in complexity (avoid the "curse of dimensionality").
  • The kernel simply adds a few parameters (e.g., σ or k), whereas learning it would imply introducing many parameters (up to n²).

How does one check whether K(x,z) is a kernel?

Definition: a mapping k: X × X → ℝ, (x,y) ↦ k(x,y), is an inner product kernel if and only if k(x,y) = ⟨Φ(x), Φ(y)⟩, where Φ: X → H, H is a vector space, and ⟨·,·⟩ is an inner product in H.

[Figure: data points x_1, ..., x_n in the original space X and their images in the space H under Φ]

SLIDE 16

Positive Definite Matrices

Recall (e.g. Linear Algebra and Its Applications, Strang):

Definition: each of the following is a necessary and sufficient condition for a real symmetric matrix A to be (strictly) positive definite:

  i) x^T A x > 0, ∀ x ≠ 0
  ii) All (real) eigenvalues of A satisfy λ_i > 0
  iii) All upper-left submatrices A_k have strictly positive determinant
  iv) There is a matrix R with independent columns such that A = R^T R

Upper left submatrices:

$$A_1 = a_{1,1}, \qquad A_2 = \begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix}, \qquad A_3 = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix}$$

SLIDE 17

Positive definite matrices

Property (iv) is particularly interesting:

  • In ℝ^d, ⟨x,y⟩ = x^T A y is an inner product kernel if and only if A is positive definite (from the definition of inner product).
  • From iv), this holds iff there is a full column rank R such that A = R^T R
  • Hence

$$\langle x, y \rangle = x^T A y = (Rx)^T (Ry) = \Phi(x)^T \Phi(y) \quad \text{with} \quad \Phi: \mathbb{R}^d \to \mathbb{R}^d, \; x \mapsto Rx$$

I.e. the inner product kernel k(x,z) = x^T A z (A symmetric & positive definite) is the standard inner product in the range space of the mapping Φ(x) = Rx.
SLIDE 18

Positive Definite Kernels

How does one extend this notion of positive definiteness of quadratic forms to general bivariate functions?

Definition: a function k(x,y) is a positive definite kernel on X × X if, ∀ i and ∀ {x_1, ..., x_i}, x_i ∈ X, the Gram matrix

$$K = \left[ k(x_i, x_j) \right]$$

is positive definite.

Like in ℝ^d, this (theoretically) allows us to check that we have a positive definite kernel.
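This definition suggests a simple numerical sanity check (a sketch, not a proof): sample a finite set of points, build the Gram matrix, and inspect its eigenvalues. Passing the check for finitely many sets is only evidence that k is positive definite; failing it for a single set disproves it. The function and data below are assumptions for illustration.

```python
import numpy as np

# Evidence-only check that k behaves like a positive definite kernel on a sample.
def looks_positive_definite(k, X, tol=1e-10):
    K = np.array([[k(xi, xj) for xj in X] for xi in X])   # Gram matrix on the sample
    symmetric = np.allclose(K, K.T)
    return symmetric and np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))

gaussian = lambda x, z: np.exp(-np.sum((x - z) ** 2))
not_a_kernel = lambda x, z: -np.sum((x - z) ** 2)        # negative squared distance

print(looks_positive_definite(gaussian, X))      # True for this sample
print(looks_positive_definite(not_a_kernel, X))  # False: a negative eigenvalue appears
```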

SLIDE 19

Inner Product Kernels

Theorem: k(x,y), x,y ∈ X, is an inner product kernel if and only if it is a positive definite kernel.

In summary, a kernel is an inner product:

  • If and only if the Gram matrix is positive definite for all possible sequences {x_1, ..., x_i}, x_i ∈ X

Does the kernel have to be an inner product kernel?

  • Not necessarily. For example, neural networks can be seen as implementing kernels that are not of this type. However:
  • You lose the parallelism. What you know about the learning machine may no longer hold after you kernelize.
  • Inner product kernels usually lead to convex learning problems. Usually you lose this guarantee otherwise.
SLIDE 20

Clustering

So far, this is mostly theoretical. How does it affect real-life, practical algorithms? Consider, for example, the k-means algorithm:

  • 1) (re-)Classification: $i^*(x) = \arg\min_i \, \lVert x - \mu_i \rVert^2$
  • 2) (re-)Estimation: $\mu_i^{new} = \frac{1}{n_i} \sum_j x_j^{(i)}$

Can we kernelize the classification step?

SLIDE 21

Clustering

Well, we saw that

$$\lVert x_k - \mu_i \rVert^2 = x_k^T x_k \;-\; \frac{2}{n_i} \sum_j x_k^T x_j^{(i)} \;+\; \frac{1}{n_i^2} \sum_{j,l} x_j^{(i)T} x_l^{(i)}$$

This can then be kernelized into

$$\lVert \Phi(x_k) - \mu_i \rVert^2 = \Phi(x_k)^T \Phi(x_k) \;-\; \frac{2}{n_i} \sum_j \Phi(x_k)^T \Phi(x_j^{(i)}) \;+\; \frac{1}{n_i^2} \sum_{j,l} \Phi(x_j^{(i)})^T \Phi(x_l^{(i)})$$

SLIDE 22

Clustering

Furthermore, this can be done with relative efficiency. In

$$\lVert \Phi(x_k) - \mu_i \rVert^2 = \Phi(x_k)^T \Phi(x_k) \;-\; \frac{2}{n_i} \sum_j \Phi(x_k)^T \Phi(x_j^{(i)}) \;+\; \frac{1}{n_i^2} \sum_{j,l} \Phi(x_j^{(i)})^T \Phi(x_l^{(i)})$$

  • the first term, Φ(x_k)^T Φ(x_k), is the kth diagonal entry of the Gram matrix
  • the last term is computed once per cluster, when all points are assigned

The assignment of the point x_k therefore only requires computing

$$\frac{2}{n_i} \sum_j \Phi(x_k)^T \Phi(x_j^{(i)})$$

for each cluster. This is a sum of entries of the Gram matrix.

SLIDE 23

Clustering

Note, however, that generally we cannot explicitly compute the cluster mean

$$\mu_i = \frac{1}{n_i} \sum_j \Phi(x_j^{(i)})$$

This is often infinite dimensional ...

In any case, if we define

  • a Gram matrix K^(i) for each cluster (elements are inner products between points in the cluster)
  • and S^(i), the scaled sum of the entries in this matrix,

$$S^{(i)} = \frac{1}{n_i^2} \sum_{j,l} \Phi(x_j^{(i)})^T \Phi(x_l^{(i)})$$

SLIDE 24

Clustering

Then we obtain the kernel k-means algorithm:

  • 1) (re-)Classification:

$$i^*(x_l) = \arg\min_i \left\{ K_{l,l} \;-\; \frac{2}{n_i} \sum_j \Phi(x_l)^T \Phi(x_j^{(i)}) \;+\; S^{(i)} \right\}$$

  • 2) (re-)Estimation:

$$S^{(i)} = \frac{1}{n_i^2} \sum_{j,l} \Phi(x_j^{(i)})^T \Phi(x_l^{(i)})$$

We no longer have access to the prototype for each cluster.

SLIDE 25

Clustering

With the right kernel this can work significantly better than regular k-means

[Figure: clustering results with k-means and with kernel k-means]
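For reference, a compact sketch of the kernel k-means update from the previous slides, working only with a precomputed Gram matrix K (illustrative: random initialization, empty-cluster handling, and the ring-shaped toy data are assumptions).

```python
import numpy as np

# Compact sketch of kernel k-means driven only by the Gram matrix K.
def kernel_kmeans(K, n_clusters, n_iters=100, seed=0):
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(n_clusters, size=n)
    for _ in range(n_iters):
        dist = np.full((n, n_clusters), np.inf)
        for i in range(n_clusters):
            members = np.flatnonzero(labels == i)
            if members.size == 0:
                continue
            n_i = members.size
            S_i = K[np.ix_(members, members)].sum() / n_i ** 2   # computed once per cluster
            # ||Phi(x_l) - mu_i||^2 = K_ll - (2/n_i) * sum_j K[l, j in cluster] + S^(i)
            dist[:, i] = np.diag(K) - 2.0 * K[:, members].sum(axis=1) / n_i + S_i
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Toy example: two concentric rings, clustered through a Gaussian kernel Gram matrix
rng = np.random.default_rng(5)
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
radii = np.r_[np.ones(100), 3.0 * np.ones(100)] + 0.1 * rng.normal(size=200)
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1) / 0.5)
print(np.bincount(kernel_kmeans(K, n_clusters=2)))       # cluster sizes
```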

SLIDE 26

Clustering

But for other applications, where the prototypes are important, this may be useless. E.g. compression via vector quantization (VQ): we can try replacing the prototype by the closest vector, but this is not necessarily optimal.

SLIDE 27

Kernelization of PCA

Given the centered data matrix X_c:

  • 1) Construct the inner product matrix K_c = X_c^T X_c
  • 2) Compute its eigendecomposition (Σ^2, M), where K_c = M Σ^2 M^T

PCA: for the covariance matrix (1/n) X_c X_c^T = Γ Λ Γ^T

  • Principal components are given by Γ = X_c M Σ^{-1}
  • Principal values are given by Λ^{1/2} = (1/√n) Σ
  • Projection of the centered data onto the principal components is given by

$$\Gamma^T X_c = \Sigma^{-1} M^T X_c^T X_c = \Sigma^{-1} M^T K_c$$

Note that most of this holds when we kernelize: we only have to change the elements of K_c from x_i^T x_j to Φ(x_i)^T Φ(x_j).

  • However, we can no longer access the PCs Γ = X_c M Σ^{-1}.
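A sketch of this kernelized PCA recipe (illustrative, not from the slides). Here the Gram matrix is centered as K_c = H K H with H = I − (1/n) 11^T, the feature-space analogue of using the centered data matrix when Φ(x) is never formed explicitly.

```python
import numpy as np

# Sketch of kernel PCA: everything is done through the (centered) Gram matrix.
def kernel_pca(X, kernel, n_components):
    n = X.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H                                    # centered Gram matrix
    eigvals, M = np.linalg.eigh(Kc)
    eigvals, M = eigvals[::-1], M[:, ::-1]            # descending order
    sigma = np.sqrt(np.clip(eigvals[:n_components], 1e-12, None))
    # PCA coefficients Sigma^{-1} M^T K_c; the PCs themselves stay inaccessible
    return np.diag(1.0 / sigma) @ M[:, :n_components].T @ Kc

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 3))
gaussian = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
coeffs = kernel_pca(X, gaussian, n_components=2)
print(coeffs.shape)   # (2, 50): two kernel-PCA coefficients per sample
```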

SLIDE 28

Kernel Methods

Most learning algorithms can be kernelized

  • Kernel Principal Component Analysis (PCA)
  • Kernel Linear Discriminant Analysis (LDA)
  • Kernel Independent Component Analysis (ICA)
  • Etc.

As in kernelized k-means clustering, sometimes we lose some of the features of the original algorithm. But the performance is frequently better.

The canonical application: the Support Vector Machine (SVM).

SLIDE 29

END