  1. Kernels & Kernelization Ken Kreutz-Delgado (Nuno Vasconcelos) Winter 2012 — UCSD — ECE 174A

  2. Inner Product Matrix & PCA
  Given the centered data matrix $X_c$:
  • 1) Construct the inner product matrix $K_c = X_c^T X_c$
  • 2) Compute its eigendecomposition $(\Sigma^2, M)$, i.e. $K_c = M \Sigma^2 M^T$
  PCA: for the sample covariance $\frac{1}{n} X_c X_c^T$ with eigendecomposition $\Gamma \Lambda \Gamma^T$,
  • the principal components are given by $\Gamma = X_c M \Sigma^{-1}$
  • the principal values are given by $\Lambda^{1/2} = \frac{1}{\sqrt{n}} \Sigma$
  • the projection of the centered data onto the principal components is given by $\Gamma^T X_c = \Sigma^{-1} M^T X_c^T X_c = \Sigma^{-1} M^T K_c$
  This allows the computation of the eigenvalues and PCA coefficients when we only have access to the dot-product (inner product) matrix $K_c$.
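A minimal numpy sketch of this computation, assuming the columns-as-samples convention used above; the variable names mirror the slide and the random data is purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 5, 20
    X = rng.standard_normal((d, n))                 # d x n data matrix, columns are samples
    X_c = X - X.mean(axis=1, keepdims=True)         # centered data matrix

    K_c = X_c.T @ X_c                               # 1) n x n inner product matrix
    evals, M = np.linalg.eigh(K_c)                  # 2) eigendecomposition K_c = M Sigma^2 M^T
    order = np.argsort(evals)[::-1][:d]             # keep the d largest eigenvalues
    evals, M = evals[order], M[:, order]
    Sigma = np.sqrt(evals)

    Gamma = X_c @ M / Sigma                         # principal components Gamma = X_c M Sigma^{-1}
    Lambda_sqrt = Sigma / np.sqrt(n)                # principal values Lambda^{1/2} = Sigma / sqrt(n)
    proj = np.diag(1.0 / Sigma) @ M.T @ K_c         # projections Gamma^T X_c = Sigma^{-1} M^T K_c

    # Sanity check against PCA computed directly from the covariance matrix.
    cov = X_c @ X_c.T / n
    assert np.allclose(np.sort(np.linalg.eigvalsh(cov))[::-1], Lambda_sqrt ** 2)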

  3. The Inner Product Form
  This turns out to be the case for many learning algorithms. If you manipulate expressions a little bit, you can often write them in "dot product form".
  Definition: a learning algorithm is in inner product form if, given a training data set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, it only depends on the points $x_i$ through their inner products $\langle x_i, x_j \rangle = x_i^T x_j$.
  For example, let's look at k-means.

  4. K-means Clustering
  We saw that the k-means algorithm iterates between
  • 1) (re-)Classification: $i^*(x) = \arg\min_i \|x - \mu_i\|^2$
  • 2) (re-)Estimation: $\mu_i^{new} = \frac{1}{n_i} \sum_j x_j^{(i)}$
  Note that
  $\|x - \mu_i\|^2 = (x - \mu_i)^T (x - \mu_i) = x^T x - 2 x^T \mu_i + \mu_i^T \mu_i$

  5. K-means Clustering
  Combining this expansion with the sample mean formula $\mu_i = \frac{1}{n_i} \sum_j x_j^{(i)}$ allows us to write the distance between a data sample $x_k$ and the class center $\mu_i$ as a function of the inner products $\langle x_i, x_j \rangle = x_i^T x_j$:
  $\|x_k - \mu_i\|^2 = x_k^T x_k - \frac{2}{n_i} \sum_j x_k^T x_j^{(i)} + \frac{1}{n_i^2} \sum_{j,l} x_j^{(i)T} x_l^{(i)}$
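A minimal sketch of this distance computation using only a precomputed inner product matrix; the function name and the rows-as-samples layout are illustrative choices, not part of the slides.

    import numpy as np

    def dist2_to_center(K, k, members):
        """||x_k - mu_i||^2 from inner products K[a, b] = <x_a, x_b>; `members` indexes cluster i."""
        n_i = len(members)
        return (K[k, k]
                - 2.0 / n_i * K[k, members].sum()
                + K[np.ix_(members, members)].sum() / n_i ** 2)

    # Sanity check against the explicit computation.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((10, 3))        # 10 samples in R^3, rows as samples
    K = X @ X.T
    members = np.array([1, 4, 7])
    mu = X[members].mean(axis=0)
    assert np.isclose(dist2_to_center(K, 0, members), np.sum((X[0] - mu) ** 2))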

  6. "The Kernel Trick"
  Why is this interesting? Consider the following transformation of the feature space:
  • Introduce a mapping to a "better" (i.e., linearly separable) feature space $\Phi: \mathcal{X} \rightarrow \mathcal{Z}$ where, generally, $\dim(\mathcal{Z}) > \dim(\mathcal{X})$.
  • If a classification algorithm only depends on the data through inner products then, in the transformed space, it depends on $\langle \Phi(x_i), \Phi(x_j) \rangle = \Phi(x_i)^T \Phi(x_j)$
  [Figure: a two-class data set that is not linearly separable in the original space becomes linearly separable after the mapping $\Phi$.]

  7. The Inner Product Implementation
  In the transformed space, the learning algorithm only requires inner products $\langle \Phi(x_i), \Phi(x_j) \rangle = \Phi(x_j)^T \Phi(x_i)$
  Note that we do not need to store the $\Phi(x_j)$, but only the $n^2$ (scalar) component values of the inner product matrix.
  Interestingly, this holds even if $\Phi(x)$ takes its values in an infinite dimensional space.
  • We get a reduction from infinity to $n^2$!
  • There is, however, still one problem:
  • When $\Phi(x_j)$ is infinite dimensional, the computation of the inner product $\langle \Phi(x_i), \Phi(x_j) \rangle$ looks impossible.

  8. "The Kernel Trick"
  "Instead of defining $\Phi(x)$, then computing $\Phi(x_i)$ for each $i$, and then computing $\langle \Phi(x_i), \Phi(x_j) \rangle$ for each pair $(i,j)$, simply define a kernel function
  $K(x,z) \overset{\mathrm{def}}{=} \langle \Phi(x), \Phi(z) \rangle$
  and work with it directly."
  $K(x,z)$ is called an inner product or dot-product kernel.
  Since we only use the kernel, why bother to define $\Phi(x)$? Just define the kernel $K(x,z)$ directly! Then we never have to deal with the complexity of $\Phi(x)$.
  This is usually called "the kernel trick".

  9. Important Questions
  How do I know that, if I pick a bivariate function $K(x,z)$, it is actually equivalent to an inner product?
  • Answer: In fact, in general it is not. (More about this later.)
  If it is, how do I know what $\Phi(x)$ is?
  • Answer: you may never know. E.g. the Gaussian kernel
  $K(x,z) = e^{-\|x-z\|^2/\sigma^2} = \langle \Phi(x), \Phi(z) \rangle$
  is a very popular choice, but it is not obvious what $\Phi(x)$ is. However, on the positive side, we do not need to know how to choose $\Phi(x)$; choosing an admissible kernel $K(x,z)$ is sufficient.
  Why is it that using $K(x,z)$ is easier/better?
  • Answer: Complexity management. Let's look at an example.
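A minimal sketch of the point above: the Gaussian kernel is evaluated directly from $x$ and $z$, and the (infinite dimensional) feature map $\Phi$ never appears; `sigma` is the kernel bandwidth parameter.

    import numpy as np

    def gaussian_kernel(x, z, sigma=1.0):
        """K(x, z) = exp(-||x - z||^2 / sigma^2), computed without ever forming Phi."""
        return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

    x = np.array([1.0, 2.0, -0.5])
    z = np.array([0.3, 1.5, 2.0])
    print(gaussian_kernel(x, z))   # a single scalar, computed in O(d)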

  10. Polynomial Kernels
  In $\mathbb{R}^d$, consider the square of the inner product between two vectors:
  $(x^T z)^2 = \left( \sum_{i=1}^{d} x_i z_i \right)^2 = \left( \sum_{i=1}^{d} x_i z_i \right) \left( \sum_{j=1}^{d} x_j z_j \right) = \sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j z_i z_j$
  $= x_1 x_1 z_1 z_1 + x_1 x_2 z_1 z_2 + \cdots + x_1 x_d z_1 z_d + x_2 x_1 z_2 z_1 + x_2 x_2 z_2 z_2 + \cdots + x_2 x_d z_2 z_d + \cdots + x_d x_1 z_d z_1 + x_d x_2 z_d z_2 + \cdots + x_d x_d z_d z_d$

  11. Polynomial Kernels
  This can be written as
  $K(x,z) = (x^T z)^2 = \Phi(x)^T \Phi(z)$
  with $\Phi: \mathbb{R}^d \rightarrow \mathbb{R}^{d^2}$,
  $\Phi(x) = (x_1 x_1,\ x_1 x_2,\ \ldots,\ x_1 x_d,\ \ldots,\ x_d x_1,\ x_d x_2,\ \ldots,\ x_d x_d)^T$
  Hence, we have
  $(x^T z)^2 = (x_1 x_1,\ x_1 x_2,\ \ldots,\ x_d x_d) \, (z_1 z_1,\ z_1 z_2,\ \ldots,\ z_d z_d)^T = \Phi(x)^T \Phi(z)$
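A minimal numerical check of this identity (the helper name `phi` is illustrative): the $d^2$-dimensional vector of products $x_i x_j$ reproduces $(x^T z)^2$.

    import numpy as np

    def phi(x):
        """Explicit second-order feature map: all products x_i * x_j, stacked into R^{d^2}."""
        return np.outer(x, x).ravel()

    rng = np.random.default_rng(0)
    x, z = rng.standard_normal(4), rng.standard_normal(4)

    # O(d^2) evaluation via the feature map vs. O(d) direct evaluation of the kernel.
    assert np.isclose(phi(x) @ phi(z), (x @ z) ** 2)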

  12. Polynomial Kernels
  The point is that:
  • The computation of $\Phi(x)^T \Phi(z)$ has complexity $O(d^2)$.
  • The direct computation of $K(x,z) = (x^T z)^2$ has complexity $O(d)$.
  Direct evaluation is more efficient by a factor of $d$; as $d$ grows, this can make the difference between a feasible and an infeasible implementation.
  BTW, you just met another kernel family:
  • This implements polynomials of second order.
  • In general, the family of polynomial kernels is defined as $K(x,z) = (1 + x^T z)^k,\ k = 1, 2, \ldots$
  • I don't even want to think about writing down $\Phi(x)$!

  13. Kernel Summary
  1. $D$ is not easy to deal with in $\mathcal{X}$, so apply a feature transformation $\Phi: \mathcal{X} \rightarrow \mathcal{Z}$ such that $\dim(\mathcal{Z}) \gg \dim(\mathcal{X})$.
  2. Constructing and computing $\Phi(x)$ directly is too expensive:
  • Write your learning algorithm in inner product form.
  • Then, instead of $\Phi(x)$, we only need $\langle \Phi(x_i), \Phi(x_j) \rangle$ for all $i$ and $j$, which we can compute by defining an "inner product kernel" $K(x,z) = \langle \Phi(x), \Phi(z) \rangle$ and computing $K(x_i, x_j)\ \forall\, i, j$ directly.
  • Note: the matrix $\mathbf{K} = [K(x_i, x_j)]$ is called the "kernel matrix" or Gram matrix (see the sketch below).
  3. Moral: Forget about $\Phi(x)$ and instead use $K(x,z)$ from the start!
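A minimal sketch of how the Gram matrix can be built from any kernel function; the helper name and the second-order polynomial kernel used for the demonstration are illustrative.

    import numpy as np

    def gram_matrix(X, kernel):
        """K[i, j] = kernel(x_i, x_j) for a data matrix X with rows as samples."""
        n = X.shape[0]
        K = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                K[i, j] = kernel(X[i], X[j])
        return K

    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 3))
    K = gram_matrix(X, kernel=lambda x, z: (1.0 + x @ z) ** 2)   # polynomial kernel, k = 2
    print(K.shape)   # (6, 6): only n^2 scalars are stored, Phi(x) is never formed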

  14. Question?
  What is a good inner product kernel?
  • This is a difficult question (see Prof. Lanckriet's work).
  In practice, the usual recipe is: pick a kernel from a library of known kernels. We have already met
  • the linear kernel $K(x,z) = x^T z$
  • the Gaussian family $K(x,z) = e^{-\|x-z\|^2/\sigma^2}$
  • the polynomial family $K(x,z) = (1 + x^T z)^k,\ k = 1, 2, \ldots$
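Written out as a small kernel library, under the same conventions as the earlier sketches; `sigma` and `k` are the kernel parameters referred to on the next slide.

    import numpy as np

    def linear_kernel(x, z):
        return x @ z                                        # K(x, z) = x^T z

    def gaussian_kernel(x, z, sigma=1.0):
        return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)   # K(x, z) = exp(-||x - z||^2 / sigma^2)

    def polynomial_kernel(x, z, k=2):
        return (1.0 + x @ z) ** k                           # K(x, z) = (1 + x^T z)^k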

  15. Inner Product Kernel Families
  Why introduce simple, known kernel families?
  • Obtain the benefits of a high-dimensional space without paying a price in complexity (avoid the "curse of dimensionality").
  • The kernel simply adds a few parameters (e.g., $\sigma$ or $k$), whereas learning it would imply introducing many parameters (up to $n^2$).
  How does one check whether $K(x,z)$ is a kernel?
  Definition: a mapping
  $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R},\quad (x,y) \mapsto k(x,y)$
  is an inner product kernel if and only if
  $k(x,y) = \langle \Phi(x), \Phi(y) \rangle$
  where $\Phi: \mathcal{X} \rightarrow \mathcal{H}$, $\mathcal{H}$ is a vector space, and $\langle \cdot, \cdot \rangle$ is an inner product in $\mathcal{H}$.
  [Figure: the mapping $\Phi$ takes the original feature space $\mathcal{X}$ into the space $\mathcal{H}$, where the two classes become linearly separable.]
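One practical sanity check, anticipating the next slide: for an admissible kernel, the Gram matrix computed on any data set must be symmetric and positive semi-definite, so its eigenvalues cannot be negative. A minimal sketch with an illustrative Gaussian kernel and random data:

    import numpy as np

    def gaussian_kernel(x, z, sigma=1.0):
        return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

    rng = np.random.default_rng(0)
    X = rng.standard_normal((8, 3))      # 8 samples in R^3, rows as samples
    K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])

    # The Gram matrix of an inner product kernel is symmetric positive semi-definite.
    assert np.allclose(K, K.T)
    assert np.linalg.eigvalsh(K).min() > -1e-10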

  16. Positive Definite Matrices
  Recall (e.g., Linear Algebra and Its Applications, Strang):
  Definition: each of the following is a necessary and sufficient condition for a real symmetric matrix $A$ to be (strictly) positive definite:
  i) $x^T A x > 0,\ \forall x \neq 0$
  ii) All (real) eigenvalues of $A$ satisfy $\lambda_i > 0$
  iii) All upper-left submatrices $A_k$ have strictly positive determinant
  iv) There is a matrix $R$ with independent columns such that $A = R^T R$
  Upper-left submatrices:
  $A_1 = a_{1,1}, \quad A_2 = \begin{pmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{pmatrix}, \quad A_3 = \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix}$
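A minimal numerical illustration of conditions (i)-(iv) for an illustrative matrix: `A` is built as $R^T R$ with independent columns (condition iv), so it should pass the remaining checks.

    import numpy as np

    rng = np.random.default_rng(0)
    R = rng.standard_normal((5, 3))      # 5 x 3, columns are (almost surely) independent
    A = R.T @ R                          # condition iv): A = R^T R, hence positive definite

    # ii) all eigenvalues of A are strictly positive
    assert np.linalg.eigvalsh(A).min() > 0

    # iii) all upper-left submatrices have strictly positive determinant
    assert all(np.linalg.det(A[:k, :k]) > 0 for k in range(1, A.shape[0] + 1))

    # i) x^T A x > 0 for (a random sample of) nonzero vectors x
    for x in rng.standard_normal((100, 3)):
        assert x @ A @ x > 0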

  17. Positive Definite Matrices
  Property (iv) is particularly interesting:
  • In $\mathbb{R}^d$, $\langle x, y \rangle = x^T A y$ is an inner product kernel if and only if $A$ is positive definite (from the definition of an inner product).
  • From (iv), this holds iff there is a full column rank $R$ such that $A = R^T R$.
  • Hence $\langle x, y \rangle = x^T A y = (Rx)^T (Ry) = \Phi(x)^T \Phi(y)$ with $\Phi: \mathbb{R}^d \rightarrow \mathbb{R}^d,\ x \mapsto Rx$.
  I.e., the inner product kernel $k(x,z) = x^T A z$ ($A$ symmetric and positive definite) is the standard inner product in the range space of the mapping $\Phi(x) = Rx$.
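A minimal sketch of this last observation: taking $R$ from a Cholesky factorization of a positive definite $A$ (one valid choice of $R$, here $R = L^T$ with $A = L L^T$), the kernel $x^T A z$ equals the standard inner product of $\Phi(x) = Rx$ and $\Phi(z) = Rz$.

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.standard_normal((4, 4))
    A = B.T @ B + 1e-3 * np.eye(4)       # an illustrative symmetric positive definite matrix

    L = np.linalg.cholesky(A)            # A = L L^T, so R = L^T satisfies A = R^T R
    R = L.T

    x, z = rng.standard_normal(4), rng.standard_normal(4)
    # k(x, z) = x^T A z is the standard inner product of Phi(x) = R x and Phi(z) = R z
    assert np.isclose(x @ A @ z, (R @ x) @ (R @ z))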
