

slide-1
SLIDE 1

Machine learning theory

Kernel methods

Hamid Beigy

Sharif University of Technology

April 20, 2020

slide-2
SLIDE 2

Table of contents

  • 1. Motivation
  • 2. Kernel methods
  • 3. Basic kernel operations in feature space
  • 4. Kernel-based algorithms
  • 5. Summary

1/24

slide-3
SLIDE 3

Motivation

slide-4
SLIDE 4

Introduction

◮ Most learning algorithms are linear and cannot classify data that is not linearly separable.
◮ How do you separate these two classes?
◮ Linear separation is impossible in most problems.
◮ Use a non-linear mapping from the input space to a high-dimensional feature space: φ : X → H.
◮ Generalization ability is independent of dim(H); it depends only on the margin ρ and the sample size m.

2/24

slide-5
SLIDE 5

Kernel methods

slide-6
SLIDE 6

Ideas of kernels

◮ Most datasets are not linearly separable. For example, instances that are not linearly separable in R may become linearly separable in R² by using the mapping

φ(x) = (x, x²).

◮ In this case, we have two options:

◮ Increase the dimensionality of the dataset by introducing a mapping φ (see the sketch below).
◮ Use a more complex model for the classifier.

3/24
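A minimal numpy sketch (not part of the original slides) of the first option: points on the real line labelled by whether they fall inside an interval are not separable by a single threshold, but the map φ(x) = (x, x²) makes them linearly separable in R². The data and threshold below are illustrative choices.

```python
# Sketch (not from the slides): +1 inside [-1, 1], -1 outside is not separable in R,
# but phi(x) = (x, x^2) makes it separable in R^2 by the line x2 = 1.
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.where(np.abs(x) <= 1.0, 1, -1)          # +1 inside the interval, -1 outside

phi = np.column_stack([x, x ** 2])             # map each point to (x, x^2)

# In feature space, the linear rule sign(1 - phi[:, 1]) separates the two classes.
pred = np.where(phi[:, 1] <= 1.0, 1, -1)
print(np.all(pred == y))                       # True
```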

slide-7
SLIDE 7

Ideas of kernels

◮ To classify a non-linearly separable dataset, we use a mapping φ.
◮ For example, let x = (x1, x2)ᵀ, z = (z1, z2, z3)ᵀ, and φ : R² → R³.
◮ If we use the mapping z = φ(x) = (x1², √2 x1x2, x2²)ᵀ, the dataset becomes linearly separable in R³.
◮ Mapping the dataset to higher dimensions has two major problems:

◮ In high dimensions, there is a risk of over-fitting.
◮ In high dimensions, the computational cost is higher.

◮ The generalization capability in the higher dimension is ensured by using large-margin classifiers.
◮ The mapping is implicit, not explicit.

4/24

slide-8
SLIDE 8

Kernels

◮ Kernel methods avoid explicitly transforming each point x in the input space into the mapped point φ(x) in the feature space.
◮ Instead, the inputs are represented via their m × m pairwise similarity values.
◮ The similarity function, called a kernel, is chosen so that it represents a dot product in some high-dimensional feature space.
◮ The kernel can be computed without directly constructing φ.
◮ The pairwise similarity values between the points in S are represented via the m × m kernel matrix, defined as

K = [ K(x1, x1)  K(x1, x2)  ···  K(x1, xm) ]
    [ K(x2, x1)  K(x2, x2)  ···  K(x2, xm) ]
    [     ⋮          ⋮        ⋱       ⋮     ]
    [ K(xm, x1)  K(xm, x2)  ···  K(xm, xm) ]

◮ The function K(xi, xj) is called the kernel function and is defined as follows.

Definition (Kernel) A function K : X × X → R is a kernel if

  • 1. ∃ φ : X → R^N such that K(x, y) = ⟨φ(x), φ(y)⟩.
  • 2. The range of φ is called the feature space.
  • 3. N can be very large.

(The sketch below builds such a kernel matrix for a small sample.)
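A minimal sketch (illustrative, not from the slides) of building the m × m kernel matrix from pairwise kernel evaluations only, never constructing φ explicitly; the sample and the polynomial kernel are arbitrary choices.

```python
# Sketch: build the m x m kernel matrix K from pairwise kernel evaluations only.
import numpy as np

def kernel(x, z):
    """Homogeneous polynomial kernel K(x, z) = <x, z>^2."""
    return np.dot(x, z) ** 2

rng = np.random.default_rng(0)
S = rng.normal(size=(5, 3))                    # m = 5 points in R^3

m = S.shape[0]
K = np.array([[kernel(S[i], S[j]) for j in range(m)] for i in range(m)])
print(K.shape)                                 # (5, 5); K is symmetric and PSD
```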

5/24

slide-9
SLIDE 9

Kernels (example)

◮ Let φ : R² → R³ be defined as φ(x) = (x1², x2², √2 x1x2).
◮ Then ⟨φ(x), φ(z)⟩ equals

⟨φ(x), φ(z)⟩ = ⟨(x1², x2², √2 x1x2), (z1², z2², √2 z1z2)⟩
             = x1²z1² + x2²z2² + 2 x1x2z1z2
             = (x1z1 + x2z2)²
             = ⟨x, z⟩² = K(x, z).

◮ The above mapping can be described by the figure below.

[Figure: the map Φ from the input space (x1, x2) to the feature space (z1, z3), with K(x, z) = ⟨x, z⟩² and Φ(y) = (y1², y2², √2 y1y2); the circles and crosses that are not linearly separable in the input space become linearly separable in the feature space. A numeric check of the identity above is sketched below.]
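A short numeric check (a sketch, not part of the original slides) that the explicit feature map and the kernel agree; the two test points are arbitrary.

```python
# Check that <phi(x), phi(z)> = <x, z>^2 for phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = np.dot(phi(x), phi(z))    # dot product in the feature space
rhs = np.dot(x, z) ** 2         # kernel evaluated in the input space
print(np.isclose(lhs, rhs))     # True
```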

6/24

slide-10
SLIDE 10

Kernels (example)

◮ Let φ1 : R² → R³ be defined as φ1(x) = (x1², x2², √2 x1x2).
◮ Then ⟨φ1(x), φ1(z)⟩ equals

⟨φ1(x), φ1(z)⟩ = ⟨(x1², x2², √2 x1x2), (z1², z2², √2 z1z2)⟩
              = (x1z1 + x2z2)²
              = ⟨x, z⟩² = K(x, z).

◮ Let φ2 : R² → R⁴ be defined as φ2(x) = (x1², x2², x1x2, x2x1).
◮ Then ⟨φ2(x), φ2(z)⟩ equals

⟨φ2(x), φ2(z)⟩ = ⟨(x1², x2², x1x2, x2x1), (z1², z2², z1z2, z2z1)⟩
              = ⟨x, z⟩² = K(x, z).

◮ The feature space can grow really large, really quickly.
◮ Let K be the polynomial kernel K(x, z) = ⟨x, z⟩^d = ⟨φ(x), φ(z)⟩.
◮ The dimension of the feature space equals C(d + n − 1, d) = (d + n − 1)! / (d! (n − 1)!).
◮ For n = 100 and d = 6, there are about 1.6 billion terms (see the sketch below).
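A one-line sketch (not from the slides) of the dimension count for the degree-d homogeneous polynomial kernel on R^n.

```python
# The feature space of K(x, z) = <x, z>^d on R^n has C(d + n - 1, d) coordinates
# (the monomials of degree d in n variables).
from math import comb

n, d = 100, 6
print(comb(d + n - 1, d))   # 1609344100, i.e. about 1.6 billion terms
```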


7/24

slide-11
SLIDE 11

Mercer’s condition

◮ Kernel methods have the following benefits.

Efficiency: K is often more efficient to compute than φ and the dot product in the feature space.
Flexibility: K can be chosen arbitrarily, so long as the existence of φ is guaranteed (Mercer's condition).

Theorem (Mercer's condition)
K is a valid kernel if, for all functions c that are square integrable (i.e., ∫ c(x)² dx < ∞), other than the zero function, the following property holds:

∫∫ c(x) K(x, z) c(z) dx dz ≥ 0.

◮ This theorem states that K : X × X → R is a kernel if the kernel matrix K is positive semi-definite (PSD).
◮ Suppose x, z ∈ Rⁿ and consider the following kernel:

K(x, z) = ⟨x, z⟩²

◮ It is a valid kernel because

K(x, z) = ( ∑_{i=1}^n xi zi ) ( ∑_{j=1}^n xj zj ) = ∑_{i=1}^n ∑_{j=1}^n (xi xj)(zi zj) = ⟨φ(x), φ(z)⟩,

where the mapping φ for n = 2 is φ(x) = (x1x1, x1x2, x2x1, x2x2)ᵀ (see the sketch below).
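A small sketch (an illustration under assumed data, not from the slides): on a finite sample, Mercer's condition reduces to the kernel matrix being positive semi-definite, which can be checked through its eigenvalues.

```python
# Check that the kernel matrix of K(x, z) = <x, z>^2 has no negative eigenvalue.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))

K = (X @ X.T) ** 2                       # K(x, z) = <x, z>^2 on all pairs
eigvals = np.linalg.eigvalsh(K)          # K is symmetric, so eigvalsh applies
print(eigvals.min() >= -1e-9)            # True: PSD up to round-off
```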

8/24

slide-12
SLIDE 12

Polynomial kernels (example)

◮ Consider the polynomial kernel K(x, z) = (⟨x, z⟩ + c)^d for all x, z ∈ Rⁿ.
◮ For n = 2 and d = 2,

K(x, z) = (x1z1 + x2z2 + c)²
        = ⟨(x1², x2², √2 x1x2, √(2c) x1, √(2c) x2, c), (z1², z2², √2 z1z2, √(2c) z1, √(2c) z2, c)⟩.

◮ Using the second-degree polynomial kernel with c = 1, the four points (−1, 1), (1, 1), (1, −1), (−1, −1) in the (x1, x2) input space are mapped to

(−1, 1) ↦ (1, 1, −√2, −√2, +√2, 1)
(1, 1) ↦ (1, 1, +√2, +√2, +√2, 1)
(1, −1) ↦ (1, 1, −√2, +√2, −√2, 1)
(−1, −1) ↦ (1, 1, +√2, −√2, −√2, 1)

◮ The data on the left (input space) is not linearly separable, but the mapped data on the right (feature space) is (see the sketch below).
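A sketch (not from the slides; the XOR-style labelling of the four points is an assumption for illustration) showing that under the degree-2 map above, the single coordinate √2 x1x2 already separates the two classes.

```python
# XOR-like points are not linearly separable in R^2, but become separable under
# phi(x) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1).
import numpy as np

X = np.array([[-1.0, 1.0], [1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
y = np.array([-1, 1, -1, 1])                       # assumed XOR labelling

def phi(x):
    s = np.sqrt(2.0)
    return np.array([x[0]**2, x[1]**2, s*x[0]*x[1], s*x[0], s*x[1], 1.0])

Phi = np.vstack([phi(x) for x in X])
pred = np.sign(Phi[:, 2])                          # threshold on the x1*x2 coordinate
print(np.all(pred == y))                           # True
```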

9/24

slide-13
SLIDE 13

Some valid kernels

◮ Some valid kernel functions:

◮ Polynomial kernel: consider the kernel defined by

K(x, z) = (⟨x, z⟩ + c)^d

where d is the degree of the polynomial, specified by the user, and c is a constant.

◮ Radial basis function (RBF) kernel: consider the kernel defined by

K(x, z) = exp( −‖x − z‖² / (2σ²) )

The width σ is specified by the user. This kernel corresponds to an infinite-dimensional mapping φ.

◮ Sigmoid kernel: consider the kernel defined by

K(x, z) = tanh(β0 ⟨x, z⟩ + β1)

This kernel meets Mercer's condition only for certain values of β0 and β1.

(A minimal sketch of these three kernels follows the homework item below.)

◮ Homework:

Please derive the VC-dimension of the hypothesis classes induced by the above kernels.
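A minimal sketch (not part of the slides) of the three kernels as plain functions; the parameter names c, d, σ, β0, β1 follow the slide, and the default values and test points are arbitrary.

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=2):
    return (np.dot(x, z) + c) ** d

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, beta0=1.0, beta1=0.0):
    # Only PDS for certain values of beta0 and beta1 (Mercer's condition).
    return np.tanh(beta0 * np.dot(x, z) + beta1)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, z), rbf_kernel(x, z), sigmoid_kernel(x, z))
```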

10/24

slide-14
SLIDE 14

Reproducing kernel Hilbert space

◮ We give the crucial property of PDS kernels, which is to induce an inner product in a Hilbert space.

Lemma (Cauchy-Schwarz inequality for PDS kernels)
Let K be a PDS kernel. Then, for any x, z ∈ X,

K(x, z)² ≤ K(x, x) K(z, z).

Theorem (Reproducing kernel Hilbert space (RKHS))
Let K : X × X → R be a PDS kernel. Then, there exists a Hilbert space H and a mapping φ from X to H such that for all x, y ∈ X,

K(x, y) = ⟨φ(x), φ(y)⟩.

◮ This Theorem implies that PDS kernels can be used to implicitly define a feature space.

11/24

slide-15
SLIDE 15

Normalized kernel

◮ For any kernel K, we can associate a normalized kernel Kn defined by

Kn(x, z) = 0 if K(x, x) = 0 or K(z, z) = 0, and Kn(x, z) = K(x, z) / √(K(x, x) K(z, z)) otherwise.

Lemma (Normalized PDS kernels) Let K be a PDS kernel. Then, the normalized kernel Kn associated to K is PDS.

Proof.

  • 1. Let {x1, . . . , xm} ⊆ X and let c be an arbitrary vector in Rᵐ.
  • 2. We will show that ∑_{i,j=1}^m ci cj Kn(xi, xj) ≥ 0.
  • 3. By the Cauchy-Schwarz inequality for PDS kernels, if K(xi, xi) = 0, then K(xi, xj) = 0 and thus Kn(xi, xj) = 0 for all j ∈ {1, 2, . . . , m}.
  • 4. Hence, we can assume that K(xi, xi) > 0 for all i ∈ {1, 2, . . . , m}.
  • 5. Then, the sum can be rewritten as follows:

∑_{i,j=1}^m ci cj Kn(xi, xj) = ∑_{i,j=1}^m ci cj K(xi, xj) / √(K(xi, xi) K(xj, xj))
                             = ∑_{i,j=1}^m ci cj ⟨φ(xi), φ(xj)⟩ / (‖φ(xi)‖_H ‖φ(xj)‖_H)
                             = ‖ ∑_{i=1}^m ci φ(xi) / ‖φ(xi)‖_H ‖²_H ≥ 0.

(A minimal numeric check of this lemma is sketched below.)
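A numeric sketch (not from the slides; the kernel and sample are arbitrary choices) of the lemma: normalizing a PDS kernel matrix keeps it positive semi-definite and puts ones on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 3))
K = (X @ X.T + 1.0) ** 2                        # a PDS kernel: (<x, z> + 1)^2

d = np.sqrt(np.diag(K))                         # sqrt(K(x_i, x_i)); all > 0 here
Kn = K / np.outer(d, d)                         # Kn(x_i, x_j) = K(x_i, x_j) / (d_i d_j)

print(np.allclose(np.diag(Kn), 1.0))            # normalized points have unit norm
print(np.linalg.eigvalsh(Kn).min() >= -1e-9)    # still PSD
```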

12/24

slide-16
SLIDE 16

Closure properties of PDS kernels

◮ The following theorem provides closure guarantees for all of these operations.

Theorem (Closure properties of PDS kernels) PDS kernels are closed under

  • 1. sum,
  • 2. product,
  • 3. tensor product,
  • 4. pointwise limit,
  • 5. composition with a power series ∑_{k=1}^∞ ak x^k with ak ≥ 0 for all k ∈ N.

Proof. We only prove closure under sum. Consider two valid kernel matrices K1 and K2.

  • 1. For any c ∈ Rᵐ, we have cᵀK1c ≥ 0 and cᵀK2c ≥ 0.
  • 2. This implies that cᵀK1c + cᵀK2c ≥ 0.
  • 3. Hence, we have cᵀ(K1 + K2)c ≥ 0.
  • 4. Therefore K = K1 + K2 is a valid kernel (a numeric check is sketched below).

◮ Homework:

Please prove the other closure properties of PDS kernels.
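A numeric sketch (not from the slides) checking two of the closure properties: the sum and the elementwise (Hadamard) product of two PDS kernel matrices remain PSD; the two kernels used are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 4))

K1 = X @ X.T                                                          # linear kernel
K2 = np.exp(-0.5 * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))   # RBF kernel

min_eig = lambda M: np.linalg.eigvalsh(M).min()
print(min_eig(K1 + K2) >= -1e-9)                # closure under sum
print(min_eig(K1 * K2) >= -1e-9)                # closure under (pointwise) product
```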

13/24

slide-17
SLIDE 17

Basic kernel operations in feature space

slide-18
SLIDE 18

Kernel operations in feature space

◮ Norm of a point: we can compute the norm of a point φ(x) in feature space as

‖φ(x)‖² = ⟨φ(x), φ(x)⟩ = K(x, x),

which implies that ‖φ(x)‖ = √K(x, x).

◮ Distance between points: the distance between two points φ(xi) and φ(xj) can be computed as

‖φ(xi) − φ(xj)‖² = ‖φ(xi)‖² + ‖φ(xj)‖² − 2⟨φ(xi), φ(xj)⟩ = K(xi, xi) + K(xj, xj) − 2K(xi, xj),

which implies that ‖φ(xi) − φ(xj)‖ = √(K(xi, xi) + K(xj, xj) − 2K(xi, xj)).

◮ Mean in feature space: the mean of the points in feature space is given as

μφ = (1/m) ∑_{i=1}^m φ(xi).

Since we do not have access to φ(x), we cannot explicitly compute the mean point in feature space, but we can compute the squared norm of the mean as follows:

‖μφ‖² = ⟨μφ, μφ⟩ = ⟨(1/m) ∑_{i=1}^m φ(xi), (1/m) ∑_{i=1}^m φ(xi)⟩ = (1/m²) ∑_{i=1}^m ∑_{j=1}^m ⟨φ(xi), φ(xj)⟩ = (1/m²) ∑_{i=1}^m ∑_{j=1}^m K(xi, xj).

(All three quantities can be computed from K alone; see the sketch below.)
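A short sketch (not from the slides; the sample and kernel are arbitrary) computing the norm, a pairwise distance, and the squared norm of the mean in feature space from the kernel matrix alone.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(8, 3))
K = (X @ X.T + 1.0) ** 2                       # any PDS kernel works here
m = K.shape[0]

norm_0 = np.sqrt(K[0, 0])                              # ||phi(x_0)||
dist_01 = np.sqrt(K[0, 0] + K[1, 1] - 2 * K[0, 1])     # ||phi(x_0) - phi(x_1)||
mean_sq_norm = K.sum() / m ** 2                        # ||mu_phi||^2

print(norm_0, dist_01, mean_sq_norm)
```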

14/24

slide-19
SLIDE 19

Kernel operations in feature space

◮ Total variance in feature space: the squared distance of a point φ(xi) to the mean μφ in feature space is

‖φ(xi) − μφ‖² = ‖φ(xi)‖² − 2⟨φ(xi), μφ⟩ + ‖μφ‖²
              = K(xi, xi) − (2/m) ∑_{j=1}^m K(xi, xj) + (1/m²) ∑_{a=1}^m ∑_{b=1}^m K(xa, xb).

The total variance in feature space is obtained by taking the average squared deviation of points from the mean in feature space:

σφ² = (1/m) ∑_{i=1}^m ‖φ(xi) − μφ‖²
    = (1/m) ∑_{i=1}^m [ K(xi, xi) − (2/m) ∑_{j=1}^m K(xi, xj) + (1/m²) ∑_{a=1}^m ∑_{b=1}^m K(xa, xb) ]
    = (1/m) ∑_{i=1}^m K(xi, xi) − (2/m²) ∑_{i=1}^m ∑_{j=1}^m K(xi, xj) + (1/m²) ∑_{a=1}^m ∑_{b=1}^m K(xa, xb)
    = (1/m) ∑_{i=1}^m K(xi, xi) − (1/m²) ∑_{i=1}^m ∑_{j=1}^m K(xi, xj)
    = (1/m) Tr[K] − ‖μφ‖².

(A numeric check of this identity is sketched below.)
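A sketch (not from the slides; linear kernel and random data chosen for illustration) comparing the closed form σφ² = (1/m) Tr[K] − ‖μφ‖² against the explicit average of squared deviations, both computed from K only.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(8, 3))
K = X @ X.T                                    # linear kernel, for illustration
m = K.shape[0]

mean_sq_norm = K.sum() / m ** 2
var_short = np.trace(K) / m - mean_sq_norm     # the closed form above

# Explicit average of ||phi(x_i) - mu_phi||^2 using only kernel evaluations.
dev = np.diag(K) - 2 * K.mean(axis=1) + mean_sq_norm
print(np.isclose(var_short, dev.mean()))       # True
```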

15/24

slide-20
SLIDE 20

Kernel operations in feature space

◮ Centering in feature space:

◮ We can center each point in feature space by subtracting the mean from it:

φ̂(xi) = φ(xi) − μφ.

◮ We do not have φ(xi) and μφ, hence we cannot explicitly center the points.
◮ However, we can still compute the centered kernel matrix K̂, that is, the kernel matrix over the centered points:

K̂(xi, xj) = ⟨φ̂(xi), φ̂(xj)⟩
          = ⟨φ(xi) − μφ, φ(xj) − μφ⟩
          = ⟨φ(xi), φ(xj)⟩ − ⟨φ(xi), μφ⟩ − ⟨φ(xj), μφ⟩ + ⟨μφ, μφ⟩
          = K(xi, xj) − (1/m) ∑_{k=1}^m ⟨φ(xi), φ(xk)⟩ − (1/m) ∑_{k=1}^m ⟨φ(xj), φ(xk)⟩ + ‖μφ‖²
          = K(xi, xj) − (1/m) ∑_{k=1}^m K(xi, xk) − (1/m) ∑_{k=1}^m K(xj, xk) + ‖μφ‖².

◮ In other words, we can compute the centered kernel matrix using only the kernel function (see the sketch below).

16/24
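A sketch (not from the slides) of centering the kernel matrix with only kernel evaluations; writing 1_m for the m × m matrix with all entries 1/m, the per-entry formula above becomes K̂ = K − 1_m K − K 1_m + 1_m K 1_m. For a linear kernel this must match centering the data itself, which the last line checks.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(8, 3))
K = X @ X.T                                    # linear kernel, so we can cross-check
m = K.shape[0]

ones = np.full((m, m), 1.0 / m)
K_hat = K - ones @ K - K @ ones + ones @ K @ ones

# For the linear kernel, centering in feature space is just centering X itself.
Xc = X - X.mean(axis=0)
print(np.allclose(K_hat, Xc @ Xc.T))           # True
```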

slide-21
SLIDE 21

Kernel operations in feature space

◮ Normalizing in feature space:

◮ A common form of normalization is to ensure that points in feature space have unit length, by replacing φ(x) with the corresponding unit vector φn(x) = φ(x) / ‖φ(x)‖.
◮ The dot product in feature space then corresponds to the cosine of the angle between the two mapped points, because

⟨φn(xi), φn(xj)⟩ = ⟨φ(xi), φ(xj)⟩ / (‖φ(xi)‖ ‖φ(xj)‖) = cos θ.

◮ If the mapped points are both centered and normalized, then a dot product corresponds to the correlation between the two points in feature space.
◮ The normalized kernel function, Kn, can be computed using only the kernel function K, as

Kn(xi, xj) = ⟨φ(xi), φ(xj)⟩ / (‖φ(xi)‖ ‖φ(xj)‖) = K(xi, xj) / √(K(xi, xi) K(xj, xj)).

17/24

slide-22
SLIDE 22

Kernel-based algorithms

slide-23
SLIDE 23

SVMs with PDS Kernels

◮ The optimization problem for the SVM is defined as

minimize (1/2)‖w‖² subject to yk (⟨w, xk⟩ + b) ≥ 1 for all k = 1, 2, . . . , m.

◮ In order to solve this constrained optimization problem, we use the Lagrangian function

L(w, b, α) = (1/2)‖w‖² − ∑_{k=1}^m αk [yk (⟨w, xk⟩ + b) − 1],

where α = (α1, α2, . . . , αm)ᵀ.

◮ Eliminating w and b from L(w, b, α) using the optimality conditions then gives the dual representation of the problem, in which we maximize

ψ(α) = ∑_{k=1}^m αk − (1/2) ∑_{k=1}^m ∑_{j=1}^m αk αj yk yj ⟨xk, xj⟩.

◮ We need to maximize ψ(α) subject to the constraints ∑_{k=1}^m αk yk = 0 and αk ≥ 0 for all k.
◮ For the optimal αk's, we have αk [1 − yk (⟨w, xk⟩ + b)] = 0.
◮ To classify a point x using the trained model, we evaluate the following function:

h(x) = sgn( ∑_{k=1}^m αk yk ⟨xk, x⟩ )

◮ This solution depends only on the dot products between the points xk and x.

18/24

slide-24
SLIDE 24

SVMs with PDS Kernels

◮ By using a kernel K, the dual representation of the problem becomes one in which we maximize

ψ(α) = ∑_{k=1}^m αk − (1/2) ∑_{k=1}^m ∑_{j=1}^m αk αj yk yj K(xk, xj).

◮ To classify a point x using the trained model, we evaluate the following function:

h(x) = sgn( ∑_{k=1}^m αk yk K(xk, x) )

◮ This solution depends only on the kernel evaluations between the points xk and x (see the sketch below).
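A sketch of this idea in practice (it assumes scikit-learn is available and is not part of the slides): an SVM trained on a precomputed kernel matrix, so only values K(xk, xj) ever enter the solver; the data-generating rule and the degree-2 polynomial kernel are arbitrary choices.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # a non-linear concept

K_train = (X @ X.T + 1.0) ** 2                            # polynomial kernel of degree 2

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K_train, y)

x_new = np.array([[0.1, 0.2]])
K_new = (x_new @ X.T + 1.0) ** 2                          # kernel between x_new and the training set
print(clf.predict(K_new))                                 # predicted label for x_new
```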

19/24

slide-25
SLIDE 25

Learning guarantees

Theorem (Rademacher complexity of kernel-based hypotheses)
Let K : X × X → R be a PDS kernel and let φ : X → H be a feature mapping associated to K. Let S ⊆ {x : K(x, x) ≤ r²} be a sample of size m and let H = {x ↦ ⟨w, φ(x)⟩ : ‖w‖_H ≤ Λ} for some Λ ≥ 0. Then

R̂_S(H) ≤ Λ √(Tr[K]) / m ≤ √(r²Λ² / m).

Proof.

R̂_S(H) = (1/m) E_σ [ sup_{‖w‖≤Λ} ∑_{i=1}^m σi ⟨w, φ(xi)⟩ ]
        = (1/m) E_σ [ sup_{‖w‖≤Λ} ⟨w, ∑_{i=1}^m σi φ(xi)⟩ ]
        ≤ (Λ/m) E_σ [ ‖∑_{i=1}^m σi φ(xi)‖_H ]
        ≤ (Λ/m) [ E_σ ‖∑_{i=1}^m σi φ(xi)‖²_H ]^{1/2}
        = (Λ/m) [ E_σ ∑_{i,j=1}^m σi σj ⟨φ(xi), φ(xj)⟩ ]^{1/2}
        = (Λ/m) [ ∑_{i=1}^m ‖φ(xi)‖²_H ]^{1/2}
        = (Λ/m) [ ∑_{i=1}^m K(xi, xi) ]^{1/2}
        = Λ √(Tr[K]) / m ≤ √(r²Λ² / m).

(A numeric sanity check of this bound is sketched below.)
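A sketch (not from the slides; linear kernel, random sample, and Λ = 1 are arbitrary choices) comparing a Monte Carlo estimate of the empirical Rademacher complexity of H with the bound Λ √(Tr[K]) / m.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(30, 3))
K = X @ X.T                                   # linear kernel: phi(x) = x
m, Lam = K.shape[0], 1.0

# For this H, sup over ||w|| <= Lambda of <w, sum_i sigma_i phi(x_i)> equals
# Lambda * ||sum_i sigma_i x_i||, so the sup can be evaluated in closed form.
sigmas = rng.choice([-1.0, 1.0], size=(2000, m))
sup_values = Lam * np.linalg.norm(sigmas @ X, axis=1)
rademacher_est = sup_values.mean() / m

print(rademacher_est <= Lam * np.sqrt(np.trace(K)) / m)   # True (up to Monte Carlo error)
```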

20/24

slide-26
SLIDE 26

Learning guarantees

Theorem (Margin bounds for kernel-based hypotheses)
Let K : X × X → R be a PDS kernel with r² = sup_{x∈X} K(x, x). Let φ : X → H be a feature mapping associated to K and let H = {x ↦ ⟨w, φ(x)⟩ : ‖w‖_H ≤ Λ} for some Λ ≥ 0. Fix ρ > 0. Then, for any δ > 0, each of the following statements holds with probability at least 1 − δ for any h ∈ H:

R(h) ≤ R̂_{S,ρ}(h) + 2 √(r²Λ²/ρ² / m) + √(log(1/δ) / (2m))

R(h) ≤ R̂_{S,ρ}(h) + 2 √(Tr[K] Λ²/ρ²) / m + 3 √(log(2/δ) / (2m))

21/24

slide-27
SLIDE 27

Summary

slide-28
SLIDE 28

Summary

◮ Advantages

◮ The problem has no local minima, and we can find its optimal solution in polynomial time.
◮ The solution is stable, repeatable, and sparse (it only involves the support vectors).
◮ The user must select only a few parameters, such as the penalty term C and the kernel function and its parameters.
◮ The algorithm provides a method to control complexity independently of dimensionality.
◮ SVMs have been shown (theoretically and empirically) to have excellent generalization capabilities.

◮ Disadvantages

◮ There is no principled method for choosing the kernel function and its parameters.
◮ Extending SVMs to multi-class classification is not straightforward.
◮ Predictions from an SVM are not probabilistic.
◮ It has high algorithmic complexity and needs extensive memory in large-scale tasks.

22/24

slide-29
SLIDE 29

Readings

  • 1. Chapter 16 of Shai Shalev-Shwartz and Shai Ben-David [1].
  • 2. Chapter 5 of Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar [2].

[1] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[2] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Second Edition. MIT Press, 2018.

23/24

slide-30
SLIDE 30

References

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Second Edition. MIT Press, 2018.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

24/24

slide-31
SLIDE 31

Questions?

24/24