SLIDE 1

Kernels & Kernelization

Ken Kreutz-Delgado (Nuno Vasconcelos)

Winter 2012 — UCSD — ECE 174A

SLIDE 2

Inner Product Matrix & PCA

Given the centered data matrix X_c:

  • 1) Construct the inner product matrix K_c = X_c^T X_c
  • 2) Compute its eigendecomposition (Σ^2, M), where K_c = M Σ^2 M^T

PCA: for the covariance matrix (1/n) X_c X_c^T = Γ Λ Γ^T

  • Principal components are given by Γ = X_c M Σ^{-1}
  • Principal values are given by Λ^{1/2} = (1/√n) Σ
  • Projection of the centered data onto the principal components is given by

$$\Gamma^T X_c = \Sigma^{-1} M^T X_c^T X_c = \Sigma^{-1} M^T K_c$$

This allows the computation of the eigenvalues and PCA coefficients when we only have access to the dot-product (inner product) matrix K_c.
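To make the recipe concrete, here is a minimal numpy sketch (an illustration, not part of the original slides) that computes the principal values and the PCA coefficients from the inner product matrix K_c alone, and cross-checks them against PCA computed directly from the covariance matrix. The toy data and variable names are assumptions.

```python
import numpy as np

# Minimal sketch: PCA from the inner product matrix K_c only.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))                 # d x n data matrix: d = 5 features, n = 100 samples
Xc = X - X.mean(axis=1, keepdims=True)        # center each feature
d, n = Xc.shape

Kc = Xc.T @ Xc                                # n x n inner product matrix
eigvals, M = np.linalg.eigh(Kc)               # K_c = M Sigma^2 M^T
eigvals, M = eigvals[::-1], M[:, ::-1]        # sort in descending order
sigma = np.sqrt(np.clip(eigvals[:d], 0, None))

principal_values = sigma / np.sqrt(n)         # Lambda^{1/2} = (1/sqrt(n)) Sigma
proj = np.diag(1.0 / sigma) @ M[:, :d].T @ Kc # Gamma^T X_c = Sigma^{-1} M^T K_c

# Cross-check against PCA computed directly from the covariance matrix (1/n) X_c X_c^T
lam, Gamma = np.linalg.eigh(Xc @ Xc.T / n)
lam, Gamma = lam[::-1], Gamma[:, ::-1]
print(np.allclose(principal_values, np.sqrt(lam)))       # same principal values
print(np.allclose(np.abs(proj), np.abs(Gamma.T @ Xc)))   # same coefficients (up to sign)
```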

SLIDE 3

The Inner Product Form

This turns out to be the case for many learning algorithms. If you manipulate expressions a little bit, you can often write them in "dot product form".

Definition: a learning algorithm is in inner product form if, given a training data set D = {(x_1, y_1), ..., (x_n, y_n)}, it only depends on the points x_i through their inner products

$$\langle x_i, x_j \rangle = x_i^T x_j$$

For example, let’s look at k-means

SLIDE 4

K-means Clustering

We saw that the k-means algorithm iterates between

  • 1) (re-)Classification: $i^*(x) = \arg\min_i \, \lVert x - \mu_i \rVert^2$
  • 2) (re-)Estimation: $\mu_i^{new} = \frac{1}{n_i} \sum_j x_j^{(i)}$

Note that:

$$\lVert x - \mu_i \rVert^2 = (x - \mu_i)^T (x - \mu_i) = x^T x - 2\, x^T \mu_i + \mu_i^T \mu_i$$
SLIDE 5

K-means Clustering

Combining this expansion with the sample mean formula

$$\mu_i = \frac{1}{n_i} \sum_j x_j^{(i)}$$

allows us to write the distance between a data sample x_k and the class center μ_i as a function of the inner products ⟨x_i, x_j⟩ = x_i^T x_j:

$$\lVert x_k - \mu_i \rVert^2 = x_k^T x_k \;-\; \frac{2}{n_i} \sum_j x_k^T x_j^{(i)} \;+\; \frac{1}{n_i^2} \sum_{j,l} x_j^{(i)T} x_l^{(i)}$$
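A quick numerical check of this identity (illustrative, not from the slides): the distance computed purely from inner products agrees with the direct computation.

```python
import numpy as np

# Illustrative check: the inner-product form of ||x_k - mu_i||^2 matches the direct form.
rng = np.random.default_rng(1)
cluster = rng.normal(size=(20, 3))   # the n_i = 20 points currently assigned to cluster i
x_k = rng.normal(size=3)             # the sample to be classified
n_i = len(cluster)

mu_i = cluster.mean(axis=0)
direct = np.sum((x_k - mu_i) ** 2)

inner_product_form = (x_k @ x_k
                      - (2.0 / n_i) * np.sum(cluster @ x_k)
                      + np.sum(cluster @ cluster.T) / n_i ** 2)

print(np.allclose(direct, inner_product_form))   # True
```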


SLIDE 6

“The Kernel Trick”

Why is this interesting? Consider the following transformation of the feature space:

  • Introduce a mapping to a "better" (i.e., linearly separable) feature space, Φ: X → Z, where, generally, dim(Z) > dim(X).
  • If a classification algorithm only depends on the data through inner products then, in the transformed space, it depends on

$$\langle \Phi(x_i), \Phi(x_j) \rangle = \Phi(x_i)^T \Phi(x_j)$$

[Figure: data points x_1, ..., x_n in X and their images under Φ in the transformed space Z]

SLIDE 7

The Inner Product Implementation

In the transformed space, the learning algorithm only requires the inner products

$$\langle \Phi(x_i), \Phi(x_j) \rangle = \Phi(x_j)^T \Phi(x_i)$$

Note that we do not need to store the Φ(x_j), but only the n² (scalar) component values of the inner product matrix.

Interestingly, this holds even if Φ(x) takes its value in an infinite dimensional space.

  • We get a reduction from infinity to n²!
  • There is, however, still one problem: when Φ(x_j) is infinite dimensional, the computation of the inner product ⟨Φ(x_i), Φ(x_j)⟩ looks impossible.

SLIDE 8

“The Kernel Trick”

"Instead of defining Φ(x), then computing Φ(x_i) for each i, and then computing ⟨Φ(x_i), Φ(x_j)⟩ for each pair (i,j), simply define a kernel function

$$K(x,z) \overset{\mathrm{def}}{=} \langle \Phi(x), \Phi(z) \rangle$$

and work with it directly."

K(x,z) is called an inner product or dot-product kernel. Since we only use the kernel, why bother to define Φ(x)? Just define the kernel K(x,z) directly! Then we never have to deal with the complexity of Φ(x). This is usually called "the kernel trick".

SLIDE 9

Important Questions

How do I know that if I pick a bivariate function K(x,z), it is actually equivalent to an inner product?

  • Answer: In fact, in general it is not. (More about this later.)

If it is, how do I know what Φ(x) is?

  • Answer: you may never know. E.g. the Gaussian kernel (shown below) is a very popular choice, but it is not obvious what Φ(x) is. However, on the positive side, we do not need to know how to choose Φ(x). Choosing an admissible kernel K(x,z) is sufficient.

Why is it that using K(x,z) is easier/better?

  • Answer: Complexity management. Let's look at an example.

The Gaussian kernel:

$$\langle \Phi(x), \Phi(z) \rangle = K(x,z) = e^{-\lVert x - z \rVert^2 / \sigma^2}$$

SLIDE 10

Polynomial Kernels

In ℝ^d, consider the square of the inner product between two vectors:

$$(x^T z)^2 = \left( \sum_{i=1}^d x_i z_i \right) \left( \sum_{j=1}^d x_j z_j \right) = \sum_{i=1}^d \sum_{j=1}^d x_i x_j z_i z_j = x_1 x_1 z_1 z_1 + x_1 x_2 z_1 z_2 + \cdots + x_d x_d z_d z_d$$

SLIDE 11

Polynomial Kernels

This can be written as

$$\sum_{i,j} x_i x_j z_i z_j = \begin{bmatrix} x_1 x_1 & x_1 x_2 & \cdots & x_d x_d \end{bmatrix} \begin{bmatrix} z_1 z_1 \\ z_1 z_2 \\ \vdots \\ z_d z_d \end{bmatrix} = \Phi(x)^T \Phi(z)$$

Hence, we have

$$K(x,z) = (x^T z)^2 = \Phi(x)^T \Phi(z) \quad \text{with} \quad \Phi(x) = \left[ x_1 x_1,\; x_1 x_2,\; \ldots,\; x_1 x_d,\; x_2 x_1,\; \ldots,\; x_d x_d \right]^T$$

SLIDE 12

Polynomial Kernels

The point is that:

  • The computation of Φ(x)^T Φ(z) has complexity O(d^2)
  • The direct computation of K(x,z) = (x^T z)^2 has complexity O(d)

Direct evaluation is more efficient by a factor of d. As d goes to infinity, this allows a feasible implementation.

BTW, you just met another kernel family:

  • This implements polynomials of second order
  • In general, the family of polynomial kernels is defined as shown below (a small numerical check follows the formula)
  • I don't even want to think about writing down Φ(x)!

$$K(x,z) = \left( 1 + x^T z \right)^k, \quad k = 1, 2, \ldots$$
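A small numerical check (illustrative; the helper phi is an assumption) that the second-order polynomial kernel really is an inner product in the d²-dimensional feature space:

```python
import numpy as np

# Illustrative check: K(x,z) = (x^T z)^2 equals Phi(x)^T Phi(z), where Phi maps a
# d-vector to the d^2 products x_i * x_j (the feature map written out above).
def phi(v):
    return np.outer(v, v).ravel()

rng = np.random.default_rng(2)
x, z = rng.normal(size=6), rng.normal(size=6)

kernel_value = (x @ z) ** 2              # O(d) work
feature_space_value = phi(x) @ phi(z)    # O(d^2) work
print(np.allclose(kernel_value, feature_space_value))   # True
```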

SLIDE 13

Kernel Summary

1. The data set D is not easy to deal with in X, so apply a feature transformation Φ: X → Z such that dim(Z) >> dim(X).
2. Constructing and computing Φ(x) directly is too expensive:

  • Write your learning algorithm in inner product form
  • Then, instead of Φ(x), we only need ⟨Φ(x_i), Φ(x_j)⟩ for all i and j, which we can compute by defining an "inner product kernel"

$$K(x,z) = \langle \Phi(x), \Phi(z) \rangle$$

    and computing K(x_i, x_j) ∀ i,j directly

  • Note: the matrix $K = \left[ K(x_i, x_j) \right]$ is called the "kernel matrix" or Gram matrix

3. Moral: Forget about Φ(x) and instead use K(x,z) from the start!

SLIDE 14

Question?

What is a good inner product kernel?

  • This is a difficult question (see Prof. Lanckriet's work)

In practice, the usual recipe is:

  • Pick a kernel from a library of known kernels (a small code sketch follows the list). We have already met:
  • the linear kernel K(x,z) = x^T z
  • the Gaussian family $K(x,z) = e^{-\lVert x - z \rVert^2 / \sigma^2}$
  • the polynomial family $K(x,z) = (1 + x^T z)^k, \; k = 1, 2, \ldots$
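For concreteness, a minimal sketch of this small kernel library and of the Gram matrix it induces (parameter names such as sigma and k are assumptions, not from the slides):

```python
import numpy as np

# Sketch of a small kernel "library" and a generic Gram matrix builder.
def linear_kernel(x, z):
    return x @ z

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def polynomial_kernel(x, z, k=2):
    return (1.0 + x @ z) ** k

def gram_matrix(X, kernel):
    """Kernel (Gram) matrix K[i, j] = kernel(x_i, x_j) for the rows x_i of X."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

# Example: 10 random points, Gaussian kernel
X = np.random.default_rng(3).normal(size=(10, 4))
K = gram_matrix(X, lambda a, b: gaussian_kernel(a, b, sigma=2.0))
print(K.shape)   # (10, 10)
```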

SLIDE 15

Inner Product Kernel Families

Why introduce simple, known kernel families?

  • Obtain the benefits of a high-dimensional space without paying a price in complexity (avoid the "curse of dimensionality").
  • The kernel simply adds a few parameters (e.g., σ or k), whereas learning it would imply introducing many parameters (up to n²).

How does one check whether K(x,z) is a kernel?

Definition: a mapping k: X × X → ℝ, (x,y) ↦ k(x,y), is an inner product kernel if and only if k(x,y) = ⟨Φ(x), Φ(y)⟩, where Φ: X → H, H is a vector space, and ⟨·,·⟩ is an inner product in H.

[Figure: data points x_1, ..., x_n in the original space X and their images in the space H under Φ]

SLIDE 16

Positive Definite Matrices

Recall (e.g. Linear Algebra and Its Applications, Strang):

Definition: each of the following is a necessary and sufficient condition for a real symmetric matrix A to be (strictly) positive definite:

  i) x^T A x > 0, ∀ x ≠ 0
  ii) All (real) eigenvalues of A satisfy λ_i > 0
  iii) All upper-left submatrices A_k have strictly positive determinant
  iv) There is a matrix R with independent columns such that A = R^T R

Upper left submatrices:

$$A_1 = a_{1,1}, \qquad A_2 = \begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix}, \qquad A_3 = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix}$$

SLIDE 17

Positive definite matrices

Property (iv) is particularly interesting:

  • In ℝ^d, ⟨x,y⟩ = x^T A y is an inner product kernel if and only if A is positive definite (from the definition of inner product).
  • From iv), this holds iff there is a full column rank R such that A = R^T R
  • Hence

$$\langle x, y \rangle = x^T A y = (Rx)^T (Ry) = \Phi(x)^T \Phi(y) \quad \text{with} \quad \Phi: \mathbb{R}^d \to \mathbb{R}^d, \; x \mapsto Rx$$

I.e. the inner product kernel k(x,z) = x^T A z (A symmetric & positive definite) is the standard inner product in the range space of the mapping Φ(x) = Rx.
SLIDE 18

Positive Definite Kernels

How does one extend this notion of positive definiteness of quadratic forms to general bivariate functions?

Definition: a function k(x,y) is a positive definite kernel on X × X if, ∀ i and ∀ {x_1, ..., x_i}, x_i ∈ X, the Gram matrix

$$K = \left[ k(x_i, x_j) \right]$$

is positive definite.

Like in ℝ^d, this (theoretically) allows us to check that we have a positive definite kernel.
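This definition suggests a simple numerical sanity check (a sketch, not a proof): sample a finite set of points, build the Gram matrix, and inspect its eigenvalues. Passing the check for finitely many sets is only evidence that k is positive definite; failing it for a single set disproves it. The function and data below are assumptions for illustration.

```python
import numpy as np

# Evidence-only check that k behaves like a positive definite kernel on a sample.
def looks_positive_definite(k, X, tol=1e-10):
    K = np.array([[k(xi, xj) for xj in X] for xi in X])   # Gram matrix on the sample
    symmetric = np.allclose(K, K.T)
    return symmetric and np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))

gaussian = lambda x, z: np.exp(-np.sum((x - z) ** 2))
not_a_kernel = lambda x, z: -np.sum((x - z) ** 2)        # negative squared distance

print(looks_positive_definite(gaussian, X))      # True for this sample
print(looks_positive_definite(not_a_kernel, X))  # False: a negative eigenvalue appears
```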

SLIDE 19

Inner Product Kernels

Theorem: k(x,y), x,y ∈ X, is an inner product kernel if and only if it is a positive definite kernel.

In summary, a kernel is an inner product:

  • If and only if the Gram matrix is positive definite for all possible sequences {x_1, ..., x_i}, x_i ∈ X

Does the kernel have to be an inner product kernel?

  • Not necessarily. For example, neural networks can be seen as implementing kernels that are not of this type. However:
  • You lose the parallelism. What you know about the learning machine may no longer hold after you kernelize.
  • Inner product kernels usually lead to convex learning problems. Usually you lose this guarantee otherwise.
SLIDE 20

Clustering

So far, this is mostly theoretical. How does it affect real-life, practical algorithms? Consider, for example, the k-means algorithm:

  • 1) (re-)Classification: $i^*(x) = \arg\min_i \, \lVert x - \mu_i \rVert^2$
  • 2) (re-)Estimation: $\mu_i^{new} = \frac{1}{n_i} \sum_j x_j^{(i)}$

Can we kernelize the classification step?

SLIDE 21

Clustering

Well, we saw that

$$\lVert x_k - \mu_i \rVert^2 = x_k^T x_k \;-\; \frac{2}{n_i} \sum_j x_k^T x_j^{(i)} \;+\; \frac{1}{n_i^2} \sum_{j,l} x_j^{(i)T} x_l^{(i)}$$

This can then be kernelized into

$$\lVert \Phi(x_k) - \mu_i \rVert^2 = \Phi(x_k)^T \Phi(x_k) \;-\; \frac{2}{n_i} \sum_j \Phi(x_k)^T \Phi(x_j^{(i)}) \;+\; \frac{1}{n_i^2} \sum_{j,l} \Phi(x_j^{(i)})^T \Phi(x_l^{(i)})$$

SLIDE 22

Clustering

Furthermore, this can be done with relative efficiency. In

$$\lVert \Phi(x_k) - \mu_i \rVert^2 = \Phi(x_k)^T \Phi(x_k) \;-\; \frac{2}{n_i} \sum_j \Phi(x_k)^T \Phi(x_j^{(i)}) \;+\; \frac{1}{n_i^2} \sum_{j,l} \Phi(x_j^{(i)})^T \Phi(x_l^{(i)})$$

  • the first term, Φ(x_k)^T Φ(x_k), is the kth diagonal entry of the Gram matrix
  • the last term is computed once per cluster, when all points are assigned

The assignment of the point x_k therefore only requires computing

$$\frac{2}{n_i} \sum_j \Phi(x_k)^T \Phi(x_j^{(i)})$$

for each cluster. This is a sum of entries of the Gram matrix.

SLIDE 23

Clustering

Note, however, that generally we cannot explicitly compute the cluster mean

$$\mu_i = \frac{1}{n_i} \sum_j \Phi(x_j^{(i)})$$

This is often infinite dimensional ...

In any case, if we define

  • a Gram matrix K^(i) for each cluster (elements are inner products between points in the cluster)
  • and S^(i), the scaled sum of the entries in this matrix,

$$S^{(i)} = \frac{1}{n_i^2} \sum_{j,l} \Phi(x_j^{(i)})^T \Phi(x_l^{(i)})$$

SLIDE 24

Clustering

Then we obtain the kernel k-means algorithm:

  • 1) (re-)Classification:

$$i^*(x_l) = \arg\min_i \left\{ K_{l,l} \;-\; \frac{2}{n_i} \sum_j \Phi(x_l)^T \Phi(x_j^{(i)}) \;+\; S^{(i)} \right\}$$

  • 2) (re-)Estimation:

$$S^{(i)} = \frac{1}{n_i^2} \sum_{j,l} \Phi(x_j^{(i)})^T \Phi(x_l^{(i)})$$

We no longer have access to the prototype for each cluster.

SLIDE 25

Clustering

With the right kernel this can work significantly better than regular k-means

[Figure: clustering results with k-means and with kernel k-means]
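For reference, a compact sketch of the kernel k-means update from the previous slides, working only with a precomputed Gram matrix K (illustrative: random initialization, empty-cluster handling, and the ring-shaped toy data are assumptions).

```python
import numpy as np

# Compact sketch of kernel k-means driven only by the Gram matrix K.
def kernel_kmeans(K, n_clusters, n_iters=100, seed=0):
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(n_clusters, size=n)
    for _ in range(n_iters):
        dist = np.full((n, n_clusters), np.inf)
        for i in range(n_clusters):
            members = np.flatnonzero(labels == i)
            if members.size == 0:
                continue
            n_i = members.size
            S_i = K[np.ix_(members, members)].sum() / n_i ** 2   # computed once per cluster
            # ||Phi(x_l) - mu_i||^2 = K_ll - (2/n_i) * sum_j K[l, j in cluster] + S^(i)
            dist[:, i] = np.diag(K) - 2.0 * K[:, members].sum(axis=1) / n_i + S_i
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Toy example: two concentric rings, clustered through a Gaussian kernel Gram matrix
rng = np.random.default_rng(5)
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
radii = np.r_[np.ones(100), 3.0 * np.ones(100)] + 0.1 * rng.normal(size=200)
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1) / 0.5)
print(np.bincount(kernel_kmeans(K, n_clusters=2)))       # cluster sizes
```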

SLIDE 26

Clustering

But for other applications, where the prototypes are important, this may be useless. E.g. compression via vector quantization (VQ): we can try replacing the prototype by the closest vector, but this is not necessarily optimal.

SLIDE 27

Kernelization of PCA

Given the centered data matrix X_c:

  • 1) Construct the inner product matrix K_c = X_c^T X_c
  • 2) Compute its eigendecomposition (Σ^2, M), where K_c = M Σ^2 M^T

PCA: for the covariance matrix (1/n) X_c X_c^T = Γ Λ Γ^T

  • Principal components are given by Γ = X_c M Σ^{-1}
  • Principal values are given by Λ^{1/2} = (1/√n) Σ
  • Projection of the centered data onto the principal components is given by

$$\Gamma^T X_c = \Sigma^{-1} M^T X_c^T X_c = \Sigma^{-1} M^T K_c$$

Note that most of this holds when we kernelize: we only have to change the elements of K_c from x_i^T x_j to Φ(x_i)^T Φ(x_j).

  • However, we can no longer access the PCs Γ = X_c M Σ^{-1}.
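A sketch of this kernelized PCA recipe (illustrative, not from the slides). Here the Gram matrix is centered as K_c = H K H with H = I − (1/n) 11^T, the feature-space analogue of using the centered data matrix when Φ(x) is never formed explicitly.

```python
import numpy as np

# Sketch of kernel PCA: everything is done through the (centered) Gram matrix.
def kernel_pca(X, kernel, n_components):
    n = X.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H                                    # centered Gram matrix
    eigvals, M = np.linalg.eigh(Kc)
    eigvals, M = eigvals[::-1], M[:, ::-1]            # descending order
    sigma = np.sqrt(np.clip(eigvals[:n_components], 1e-12, None))
    # PCA coefficients Sigma^{-1} M^T K_c; the PCs themselves stay inaccessible
    return np.diag(1.0 / sigma) @ M[:, :n_components].T @ Kc

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 3))
gaussian = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
coeffs = kernel_pca(X, gaussian, n_components=2)
print(coeffs.shape)   # (2, 50): two kernel-PCA coefficients per sample
```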

SLIDE 28

Kernel Methods

Most learning algorithms can be kernelized

  • Kernel Principal Component Analysis (PCA)
  • Kernel Linear Discriminant Analysis (LDA)
  • Kernel Independent Component Analysis (ICA)
  • Etc.

As in kernelized k-means clustering, sometimes we lose some of the features of the original algorithm. But the performance is frequently better.

The canonical application: the Support Vector Machine (SVM).

SLIDE 29

END