SLIDE 1

Kernels

Course of Machine Learning, Master Degree in Computer Science. Giorgio Gambosi, a.a. 2018-2019

SLIDE 2

Idea

  • Thus far, we have been assuming that each object we deal with can be represented as a fixed-size feature vector x ∈ ℝ^d
  • For certain kinds of objects (text documents, protein sequences, parse trees, etc.) it is not clear how to best represent them in this way
  • 1. First approach: define a generative model of the data (with latent variables) and represent an object by the inferred values of its latent variables
  • 2. Second approach: do not rely on a vector representation at all, but just assume that a similarity measure between objects is defined

SLIDE 3

Representation by pairwise comparison

Idea

  • Define a comparison function κ : χ × χ → ℝ
  • Represent a set of data items x1, . . . , xn by the n × n Gram matrix G such that

    Gij = κ(xi, xj)

  • G is always an n × n matrix, whatever the nature of the data: the same algorithm will work for any type of data (vectors, strings, …); a sketch follows below
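A minimal sketch of this idea (Python, not from the slides; the helper name gram_matrix is ours): the same Gram-matrix construction works for vectors and for strings, only the comparison function changes.

```python
import numpy as np

def gram_matrix(items, kappa):
    """Build the n x n Gram matrix G with G[i, j] = kappa(items[i], items[j])."""
    n = len(items)
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = kappa(items[i], items[j])
    return G

# The same code works for vectors...
vectors = [np.array([1.0, 2.0]), np.array([0.0, 1.0]), np.array([3.0, 1.0])]
G_vec = gram_matrix(vectors, lambda a, b: float(a @ b))

# ...and for strings, with a different comparison function.
strings = ["GATTACA", "GATCA", "TACCA"]
common_chars = lambda s, t: len(set(s) & set(t))
G_str = gram_matrix(strings, common_chars)
```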

SLIDE 4

Kernel definition

Given a set χ, a function κ : χ² → ℝ is a kernel on χ if there exists a Hilbert space H (essentially, a vector space with a dot product ·) and a map φ : χ → H such that for all x1, x2 ∈ χ we have κ(x1, x2) = φ(x1) · φ(x2)

We shall consider the particular but common case in which H = ℝ^d for some d > 0, φ(x) = (ϕ1(x), . . . , ϕd(x)) and φ(x1) · φ(x2) = φ(x1)^T φ(x2). φ is called a feature map and H a feature space of κ.

SLIDE 5

Kernel definition

Positive semidefiniteness of κ is a relevant property in this framework.

Positive semidefiniteness. Given a set χ, a function κ : χ² → ℝ is positive semidefinite if for all n ∈ ℕ and all (x1, . . . , xn) ∈ χ^n the corresponding Gram matrix G is positive semidefinite, that is, z^T Gz ≥ 0 for all vectors z ∈ ℝ^n

SLIDE 6

Why is positive semidefiniteness relevant?

Let κ : χ × χ → ℝ. Then κ is a kernel iff for all finite sets {x1, x2, . . . , xn} the corresponding Gram matrix G is symmetric and positive semidefinite.

Only if: if Gij = φ(xi)^T φ(xj), then clearly Gij = Gji. Moreover, for any z ∈ ℝ^n,

z^T Gz = ∑_{i=1}^{n} ∑_{j=1}^{n} zi Gij zj
       = ∑_{i=1}^{n} ∑_{j=1}^{n} zi φ(xi)^T φ(xj) zj
       = ∑_{i=1}^{n} ∑_{j=1}^{n} zi ( ∑_{k=1}^{d} ϕk(xi) ϕk(xj) ) zj
       = ∑_{k=1}^{d} ∑_{i=1}^{n} ∑_{j=1}^{n} zi ϕk(xi) ϕk(xj) zj
       = ∑_{k=1}^{d} ( ∑_{i=1}^{n} zi ϕk(xi) )²
       ≥ 0
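As a side illustration (not in the slides), positive semidefiniteness of a Gram matrix can be checked numerically through its eigenvalues:

```python
import numpy as np

def is_psd(G, tol=1e-10):
    """A symmetric matrix is PSD iff all its eigenvalues are >= 0 (up to tolerance)."""
    return np.all(np.linalg.eigvalsh(G) >= -tol)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))          # 5 items with explicit 3-dimensional features
G = Phi @ Phi.T                        # Gram matrix of the linear kernel
print(is_psd(G))                       # True: G = Phi Phi^T is always PSD

M = rng.normal(size=(5, 5))
M = (M + M.T) / 2                      # a generic symmetric matrix
print(is_psd(M))                       # typically False: not every symmetric matrix is a Gram matrix
```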

SLIDE 7

Why are positive definite kernels relevant?

If: given {x1, x2, . . . , xn}, if G is positive definite it is possible to compute an eigendecomposition

G = U^T ΛU

where Λ is the diagonal matrix of the eigenvalues λi > 0 and U is the corresponding orthogonal matrix of eigenvectors. Denoting by ui the i-th column of U, we have

Gij = (Λ^{1/2} ui)^T (Λ^{1/2} uj)

Then, if we define φ(xi) = Λ^{1/2} ui, we get

κ(xi, xj) = φ(xi)^T φ(xj) = Gij

This result is valid only wrt the domain {x1, x2, . . . , xn}. For the general case, consider n → ∞ (as, for example, in Gaussian processes).
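The "if" direction can be illustrated numerically; a sketch (not from the slides, assuming G is built from explicit features so that it is certainly PSD) that recovers vectors whose dot products reproduce G exactly on the given items:

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(4, 2))
G = Phi @ Phi.T                                  # a PSD Gram matrix on 4 items

lam, U = np.linalg.eigh(G)                       # G = U diag(lam) U^T
lam = np.clip(lam, 0.0, None)                    # clip tiny negative values due to round-off
features = U * np.sqrt(lam)                      # row i plays the role of phi(x_i)

print(np.allclose(features @ features.T, G))     # True: phi(x_i)^T phi(x_j) = G_ij
```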

SLIDE 8

Why are positive definite kernels relevant?

Using positive definite kernels makes it possible to apply the kernel trick wherever useful.

Kernel trick. Any algorithm which processes finite-dimensional vectors in such a way that only pairwise dot products are considered can be applied to higher (possibly infinite) dimensional vectors, by replacing each dot product with a suitable application of a positive definite kernel.

  • Many practical applications
  • Vectors in the new space are manipulated only implicitly, through pairwise dot products, computed by evaluating the kernel function on the original pair of vectors

Example: support vector machines. Also, many linear models for regression and classification can be reformulated in terms of a dual representation involving only dot products.

SLIDE 9

Dual representations: example

Regularized sum of squares in regression with a predefined basis function φ(x):

J(w) = (1/2) ∑_{i=1}^{n} ( w^T φ(xi) − ti )² + (λ/2) w^T w = (1/2) (Φw − t)^T (Φw − t) + (λ/2) w^T w

where, by definition of Φ ∈ ℝ^{n×d}, Φij = ϕj(xi).

Setting ∂J(w)/∂w = 0, the resulting solution is

ŵ = (Φ^T Φ + λI_d)^{-1} Φ^T t = Φ^T (ΦΦ^T + λI_n)^{-1} t

since it is possible to prove that for any matrix A ∈ ℝ^{r×c} it holds that (A^T A + λI_c)^{-1} A^T = A^T (AA^T + λI_r)^{-1}
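The matrix identity used in the last step is easy to verify numerically; a quick sketch with a random Φ (Python, illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 6, 3, 0.1
Phi = rng.normal(size=(n, d))
t = rng.normal(size=n)

# Primal form: invert the d x d matrix Phi^T Phi + lam I_d
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ t)
# Dual form: invert the n x n matrix Phi Phi^T + lam I_n
w_dual = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), t)

print(np.allclose(w_primal, w_dual))   # True: the two expressions for w coincide
```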

SLIDE 10

Dual representations: example

If we define the dual variables a = (ΦΦ^T + λI_n)^{-1} t, we get w = Φ^T a. By substituting Φ^T a for w we express the cost function in terms of a instead of w, obtaining a dual formulation of J:

J(a) = (1/2) a^T ΦΦ^T ΦΦ^T a + (1/2) t^T t − a^T ΦΦ^T t + (λ/2) a^T ΦΦ^T a = (1/2) a^T GGa + (1/2) t^T t − a^T Gt + (λ/2) a^T Ga

where G = ΦΦ^T is the Gram matrix, such that by definition

Gij = ∑_{k=1}^{d} ϕk(xi) ϕk(xj) = φ(xi)^T φ(xj)

SLIDE 11

Dual representations: example

Setting the gradient ∂J(a)/∂a = 0, it results that â = (G + λI_n)^{-1} t.

We can use this to make predictions in a different way:

y(x) = w^T φ(x) = a^T Φφ(x) = t^T (G + λI_n)^{-1} Φφ(x) = k(x)^T (G + λI_n)^{-1} t

where

k(x) = Φφ(x) = (φ(x1)^T φ(x), . . . , φ(xn)^T φ(x))^T = (κ(x1, x), . . . , κ(xn, x))^T = (κ1(x), . . . , κn(x))^T

The prediction can thus be expressed in terms of dot products between pairs of mapped vectors φ(·), or equivalently in terms of the kernel function κ(xi, xj) = φ(xi)^T φ(xj)
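Putting the dual formulas together, here is a minimal kernel ridge regression sketch (Python; the Gaussian kernel and the helper names fit_dual and predict are our own choices for illustration, not code from the course):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def fit_dual(X, t, kappa, lam=0.1):
    """Compute a_hat = (G + lam I_n)^{-1} t from the training data."""
    n = len(X)
    G = np.array([[kappa(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.linalg.solve(G + lam * np.eye(n), t)

def predict(x, X, a, kappa):
    """y(x) = k(x)^T a, with k_i(x) = kappa(x_i, x)."""
    k = np.array([kappa(xi, x) for xi in X])
    return k @ a

# Tiny usage example on 1-d inputs
X = [np.array([v]) for v in (-2.0, -1.0, 0.0, 1.0, 2.0)]
t = np.array([4.1, 0.9, 0.1, 1.2, 3.9])          # roughly t = x^2
a = fit_dual(X, t, gaussian_kernel)
print(predict(np.array([1.5]), X, a, gaussian_kernel))
```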

SLIDE 12

Dual representations: another example

  • As is well known, a perceptron is a linear classifier with prediction y(x) = w^T x
  • Its update rule is: if xi is misclassified, that is w^T xi ti < 0, then w := w + ti xi
  • If we assume a zero initial value for all wk, then w is the sum of all items on which the algorithm has made a mistake, each weighted by its target value times the number of times it has been misclassified
  • We may then define a dual formulation by setting w = ∑_{k=1}^{n} ak xk, which results in the prediction y(x) = ∑_{k=1}^{n} ak xk^T x
  • and in the update rule: if xi is misclassified, that is ti ∑_{k=1}^{n} ak xk^T xi < 0, then ai := ai + ti
  • a kernelized perceptron can then be defined with y(x) = ∑_{k=1}^{n} ak φ(xk)^T φ(x), or equivalently y(x) = ∑_{k=1}^{n} ak κ(xk, x), by just using a positive definite kernel κ (a sketch follows below)
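A minimal kernelized perceptron along these lines (Python sketch, assuming targets ti ∈ {−1, +1}; the helper names are ours, and this is an illustration rather than the course's code):

```python
import numpy as np

def train_kernel_perceptron(X, t, kappa, epochs=10):
    """Dual perceptron: a_i accumulates t_i every time x_i is misclassified."""
    n = len(X)
    a = np.zeros(n)
    K = np.array([[kappa(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        for i in range(n):
            y_i = a @ K[:, i]              # y(x_i) = sum_k a_k kappa(x_k, x_i)
            if t[i] * y_i <= 0:            # misclassified (or on the boundary)
                a[i] += t[i]
    return a

def predict(x, X, a, kappa):
    return np.sign(sum(a_k * kappa(x_k, x) for a_k, x_k in zip(a, X)))

# Usage: XOR-like data, separable with a quadratic kernel but not linearly
quad = lambda u, v: (u @ v + 1.0) ** 2
X = [np.array(p, dtype=float) for p in ((0, 0), (1, 1), (0, 1), (1, 0))]
t = np.array([-1, -1, 1, 1])
a = train_kernel_perceptron(X, t, quad, epochs=200)
print([predict(x, X, a, quad) for x in X])   # should reproduce the training labels
```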

SLIDE 13

Kernelization: one more example

  • The nearest-neighbor (k-NN) classifier selects the label of the nearest neighbor(s); assume the Euclidean distance is used:

    ||xi − xj||² = xi^T xi + xj^T xj − 2 xi^T xj

  • We can now replace the dot products by a valid positive definite kernel and we obtain:

    d(xi, xj)² = κ(xi, xi) + κ(xj, xj) − 2κ(xi, xj)

  • This is a kernelized nearest-neighbor classifier
  • We never explicitly compute the feature vectors (see the sketch below)
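A sketch of the kernelized distance and of a 1-nearest-neighbor prediction built on it (Python, illustration only; helper names are ours):

```python
import numpy as np

def kernel_distance_sq(xi, xj, kappa):
    """Squared distance in feature space, computed through the kernel only."""
    return kappa(xi, xi) + kappa(xj, xj) - 2 * kappa(xi, xj)

def nn_predict(x, X, t, kappa):
    """Label of the training item nearest to x in the (implicit) feature space."""
    d2 = [kernel_distance_sq(x, xi, kappa) for xi in X]
    return t[int(np.argmin(d2))]

# Usage with a Gaussian kernel on 1-d points
gauss = lambda u, v: np.exp(-np.sum((u - v) ** 2) / 2.0)
X = [np.array([0.0]), np.array([1.0]), np.array([5.0])]
t = ["a", "a", "b"]
print(nn_predict(np.array([4.0]), X, t, gauss))   # 'b': nearest to 5.0
```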

Why refer to the dual representation?

  • While in the original formulation of linear regression w can be derived by inverting the m × m matrix Φ^T Φ, in the dual formulation computing a requires inverting the n × n matrix G + λI_n
  • Since usually n ≫ m, this seems to lead to a loss of efficiency
  • However, the dual approach makes it possible to refer only to the kernel function, and not to the set of basis functions: this makes it possible to implicitly use a feature space of very high dimension (much larger than m, even infinite)

SLIDE 14

Dealing with kernels

Since not all functions κ : χ × χ → ℝ are positive definite kernels, some method to define them must be applied.

  • the straightforward way is just to define a basis function φ and set κ(x1, x2) = φ(x1)^T φ(x2). κ is then a positive definite kernel, since
  • 1. φ(x1)^T φ(x2) = φ(x2)^T φ(x1)
  • 2. ∑_{i=1}^{n} ∑_{j=1}^{n} ci cj κ(xi, xj) = ∑_{i=1}^{n} ∑_{j=1}^{n} ci cj φ(xi)^T φ(xj) = ||∑_{i=1}^{n} ci φ(xi)||² ≥ 0

SLIDE 15

Dealing with kernels

  • a second method defines a possible kernel function κ directly: in order to ensure that such a function is a valid kernel, apply Mercer's theorem and prove that κ is a positive definite kernel by showing that it is symmetric and that the corresponding Gram matrix G is positive semidefinite for all possible sets of items. In this case we do not define φ

SLIDE 16

A simple positive definite kernel

Let χ = ℝ: the function κ : ℝ² → ℝ defined as κ(x1, x2) = x1 x2 is a positive definite kernel. In fact,

  • x1 x2 = x2 x1
  • ∑_{i=1}^{n} ∑_{j=1}^{n} ci cj κ(xi, xj) = ∑_{i=1}^{n} ∑_{j=1}^{n} ci cj xi xj = ( ∑_{i=1}^{n} ci xi )² ≥ 0

SLIDE 17

Another simple positive definite kernel

Let χ = ℝ^d: the function κ : χ² → ℝ defined as κ(x1, x2) = x1^T x2 is a positive definite kernel. In fact,

  • x1^T x2 = x2^T x1
  • ∑_{i=1}^{n} ∑_{j=1}^{n} ci cj κ(xi, xj) = ∑_{i=1}^{n} ∑_{j=1}^{n} ci cj xi^T xj = ||∑_{i=1}^{n} ci xi||² ≥ 0

SLIDE 18

Dealing with kernels

  • a third method again defines a possible kernel function κ directly: in order to ensure that such a function is a valid kernel, a basis function φ must be found such that κ(x1, x2) = φ(x1)^T φ(x2) for all x1, x2

SLIDE 19

Example

A polynomial kernel in 2 dimensions: for x = (x1, x2), let φ(x) = (x1², √2 x1 x2, x2²). Then, writing x1 = (x11, x12) and x2 = (x21, x22),

κ(x1, x2) = (x11², √2 x11 x12, x12²)^T (x21², √2 x21 x22, x22²)
          = x11² x21² + 2 x11 x12 x21 x22 + x12² x22²
          = (x1^T x2)²

Example. If x1, x2 ∈ ℝ^d, define κ(x1, x2) = (x1 · x2)² = φ(x1)^T φ(x2), where

φ(x) = (x1², . . . , xd², x1 x2, . . . , x1 xd, x2 x1, . . . , xd xd−1)^T

SLIDE 20

Example

κ(x1, x2) = (x1 · x2)² is a valid kernel function, since (for x1, x2 ∈ ℝ²)

κ(x1, x2) = (x11 x21 + x12 x22)² = x11² x21² + x12² x22² + 2 x11 x12 x21 x22
          = (x11², x12², x11 x12, x11 x12) · (x21², x22², x21 x22, x21 x22)
          = φ(x1) · φ(x2)

The corresponding basis function is thus φ(x) = (x1², x2², x1 x2, x1 x2)^T.

SLIDE 21

Example

  • In general, if x1, x2 ∈ ℝ^d, then κ(x1, x2) = (x1 · x2)² = φ(x1)^T φ(x2), where φ(x) = (x1², . . . , xd², x1 x2, . . . , x1 xd, x2 x1, . . . , xd xd−1)^T
  • the d-dimensional input space is mapped onto a space of dimension m = d²
  • observe that computing κ(x1, x2) directly requires time O(d), while deriving it from φ(x1)^T φ(x2) requires O(d²) steps (see the sketch below)
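The equivalence, and the cost difference, are easy to check numerically; a sketch for d-dimensional inputs (Python, illustration only):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map: all products x_i * x_j (d^2 components)."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(3)
x1, x2 = rng.normal(size=5), rng.normal(size=5)

k_direct = (x1 @ x2) ** 2              # O(d) work
k_explicit = phi(x1) @ phi(x2)         # O(d^2) work in the feature space

print(np.allclose(k_direct, k_explicit))   # True
```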

SLIDE 22

Example

κ(x1, x2) = (x1 · x2 + c)² is a kernel function, since

κ(x1, x2) = (x1 · x2 + c)² = ∑_{i=1}^{d} ∑_{j=1}^{d} x1i x1j x2i x2j + ∑_{i=1}^{d} (√(2c) x1i)(√(2c) x2i) + c² = φ(x1)^T φ(x2)

for

φ(x) = (x1², . . . , xd², x1 x2, . . . , x1 xd, x2 x1, . . . , xd xd−1, √(2c) x1, . . . , √(2c) xd, c)^T

This implies a mapping from a d-dimensional to a (d + 1)²-dimensional space.

SLIDE 23

Example

κ(x1, x2) = (x1 · x2 + c)^t is a kernel function corresponding to a mapping from a d-dimensional space to a space of dimension

m = ∑_{i=0}^{t} d^i = (d^{t+1} − 1) / (d − 1)

corresponding to all products xi1 xi2 . . . xil with 0 ≤ l ≤ t. Observe that, even if the feature space has dimension O(d^t), evaluating the kernel function requires just time O(d).

SLIDE 24

Constructing kernels from kernels

More complex kernels can be derived from simpler ones by applying suitable transformation and composition rules. In fact, given kernel functions κ1(x1, x2) and κ2(x1, x2), the function κ(x1, x2) is a kernel in all the following cases (a coded sketch of these rules follows below):

  • κ(x1, x2) = e^{κ1(x1, x2)}
  • κ(x1, x2) = κ1(x1, x2) + κ2(x1, x2)
  • κ(x1, x2) = κ1(x1, x2) κ2(x1, x2)
  • κ(x1, x2) = c κ1(x1, x2), for any c > 0
  • κ(x1, x2) = x1^T A x2, with A positive definite
  • κ(x1, x2) = f(x1) κ1(x1, x2) f(x2), for any f : χ → ℝ
  • κ(x1, x2) = p(κ1(x1, x2)), for any polynomial p : ℝ → ℝ with non-negative coefficients
  • κ(x1, x2) = κ3(φ(x1), φ(x2)), for any vector φ of m functions ϕi : ℝ^n → ℝ and any kernel function κ3(x1, x2) on ℝ^m
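These closure rules translate naturally into code; a small sketch (Python; the helper names scale, add, multiply, exponentiate and conjugate are ours) in which kernels are plain functions and new ones are built by composing them. As a usage example, the Gaussian kernel discussed on the following slides is rebuilt from the rules:

```python
import numpy as np

# Base kernel: the ordinary dot product
linear = lambda x1, x2: float(x1 @ x2)

# Closure rules as higher-order functions
def scale(k, c):        return lambda x1, x2: c * k(x1, x2)               # c * k1, c > 0
def add(k1, k2):        return lambda x1, x2: k1(x1, x2) + k2(x1, x2)     # k1 + k2
def multiply(k1, k2):   return lambda x1, x2: k1(x1, x2) * k2(x1, x2)     # k1 * k2
def exponentiate(k):    return lambda x1, x2: np.exp(k(x1, x2))           # exp(k1)
def conjugate(k, f):    return lambda x1, x2: f(x1) * k(x1, x2) * f(x2)   # f(x1) k1 f(x2)

# Rebuild the Gaussian kernel from the rules (see the derivation two slides ahead)
sigma = 1.0
f = lambda x: np.exp(-float(x @ x) / (2 * sigma ** 2))
gaussian = conjugate(exponentiate(scale(linear, 1 / sigma ** 2)), f)

x1, x2 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(np.isclose(gaussian(x1, x2),
                 np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))))     # True
```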

SLIDE 25

Constructing kernel functions

κ(x1, x2) = (x1 · x2 + c)^d is a kernel function. In fact,

  • 1. x1 · x2 = x1^T x2 is a kernel function, corresponding to the basis functions φ = (ϕ1, . . . , ϕn) with ϕi(x) = xi
  • 2. c is a kernel function, corresponding to the basis functions φ = (ϕ1, . . . , ϕn) with ϕi(x) = √(c/n)
  • 3. x1 · x2 + c is a kernel function, since it is the sum of two kernel functions
  • 4. (x1 · x2 + c)^d is a kernel function, since it is a polynomial with non-negative coefficients (in particular p(z) = z^d) of a kernel function
SLIDE 26

Constructing kernel functions

κ(x1, x2) = e^{−||x1 − x2||² / (2σ²)} is a kernel function. In fact,

  • 1. since ||x1 − x2||² = x1^T x1 + x2^T x2 − 2 x1^T x2, it results that

    κ(x1, x2) = e^{−x1^T x1 / (2σ²)} · e^{−x2^T x2 / (2σ²)} · e^{x1^T x2 / σ²}

  • 2. x1^T x2 is a kernel function (see above)
  • 3. then x1^T x2 / σ² is a kernel function, being the product of a kernel function with a constant c = 1/σ²
  • 4. e^{x1^T x2 / σ²} is the exponential of a kernel function, and as a consequence a kernel function itself
  • 5. e^{−x1^T x1 / (2σ²)} · e^{x1^T x2 / σ²} · e^{−x2^T x2 / (2σ²)} is a kernel function, being of the form f(x1) κ'(x1, x2) f(x2) with κ'(x1, x2) = e^{x1^T x2 / σ²} and f(x) = e^{−x^T x / (2σ²)}

SLIDE 27

Relevant kernel functions

  • Polynomial kernel

    κ(x1, x2) = (x1 · x2 + 1)^d

  • Sigmoidal kernel

    κ(x1, x2) = tanh(c1 x1 · x2 + c2)

  • Gaussian kernel

    κ(x1, x2) = exp( −||x1 − x2||² / (2σ²) )

    where σ ∈ ℝ

Observe that a Gaussian kernel can also be derived starting from a nonlinear kernel function κ(x1, x2) instead of x1^T x2. Implementations of these kernels are sketched below.
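For concreteness, direct implementations of the three kernels (Python sketch; the default parameter values are arbitrary choices, not prescribed by the slides):

```python
import numpy as np

def polynomial_kernel(x1, x2, d=2):
    return (float(x1 @ x2) + 1.0) ** d

def sigmoidal_kernel(x1, x2, c1=1.0, c2=0.0):
    # Note: for general c1, c2 the tanh kernel is not guaranteed to be positive semidefinite
    return np.tanh(c1 * float(x1 @ x2) + c2)

def gaussian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

x1, x2 = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(polynomial_kernel(x1, x2), sigmoidal_kernel(x1, x2), gaussian_kernel(x1, x2))
```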

SLIDE 28

Kernels of structured objects

Kernels are particularly useful when applied to structured objects. Consider the case of strings (for example, sequences of DNA bases or amino acids). Given two strings x1, x2 over the same alphabet A, we can define their similarity to be the number of substrings they have in common. More formally, let ϕs(x) be the number of times substring s occurs in x and let φ(x) be the corresponding vector of such counts over all substrings s: a kernel can then be defined as

κ(x1, x2) = φ(x1)^T φ(x2) = ∑_{s ∈ A*} ws ϕs(x1) ϕs(x2)

where ws ≥ 0 are predefined weights.

SLIDE 29

Kernels of structured objects

If ws = 1 for all considered substrings and we define φ′(x) = φ(x) / ||φ(x)||₂ as a normalized version of φ, we get

κ(x1, x2) = φ(x1)^T φ(x2) / ( ||φ(x1)||₂ ||φ(x2)||₂ )

that is, the well-known cosine similarity measure. Borrowing from information retrieval methods, a better similarity measure can be obtained by defining ϕs(x) through more sophisticated weightings, such as tf-idf, instead of raw occurrence counts.

SLIDE 30

Kernels of structured objects

Special cases:

  • ws = 0 for |s| > 1 gives a bag-of-characters kernel, with ϕc(x) being the number of occurrences of character c in x
  • If only substrings s delimited by white space are considered, we get a bag-of-words kernel
  • If only strings of fixed length |s| = k are considered, we have a k-spectrum kernel (a sketch follows below)

The approach can be extended to the case of trees, in order to deal with, for example, parse or evolutionary trees. More complex kernel construction techniques have also been defined.
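A small sketch of the k-spectrum kernel (Python; it counts every length-k substring, illustration only):

```python
from collections import Counter

def spectrum(x, k):
    """phi(x): counts of every length-k substring of x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def k_spectrum_kernel(x1, x2, k=2):
    """kappa(x1, x2) = sum_s phi_s(x1) * phi_s(x2) over shared length-k substrings."""
    c1, c2 = spectrum(x1, k), spectrum(x2, k)
    return sum(c1[s] * c2[s] for s in c1.keys() & c2.keys())

# Number of shared 2-mer occurrences, weighted by their counts in each string
print(k_spectrum_kernel("GATTACA", "TACCACA", k=2))
```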
