

1. Kernel-based Methods and Support Vector Machines
   Larry Holder, CptS 570 – Machine Learning
   School of Electrical Engineering and Computer Science, Washington State University

2. References
   • Muller et al., "An Introduction to Kernel-Based Learning Algorithms," IEEE Transactions on Neural Networks, 12(2):181-201, 2001.

3. Learning Problem
   • Estimate a function $f : \mathbb{R}^N \to \{-1, +1\}$ using $n$ training examples $(x_i, y_i)$ sampled from $P(x, y)$.
   • We want the $f$ that minimizes the expected error (risk):
     $R[f] = \int \text{loss}(f(x), y)\, dP(x, y)$
   • $P(x, y)$ is unknown, so we compute the empirical risk instead:
     $R_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} \text{loss}(f(x_i), y_i)$
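The empirical risk is just an average of per-example losses. A minimal sketch, not from the slides, assuming NumPy and the 0/1 loss as the loss function:

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp[f]: the average loss of f over the n training pairs (here, 0/1 loss)."""
    predictions = np.array([f(x) for x in X])
    return np.mean(predictions != y)

# Toy sample: 1-D inputs with labels in {-1, +1} and a simple threshold classifier.
X = np.array([[-2.0], [-1.0], [0.5], [2.0]])
y = np.array([-1, -1, 1, 1])
f = lambda x: 1 if x[0] > 0 else -1
print(empirical_risk(f, X, y))   # 0.0 on this separable toy sample
```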

4. Overfit
   • Using $R_{emp}[f]$ to estimate $R[f]$ for small $n$ may lead to overfitting.

5. Overfit
   • We can restrict the class $F$ from which $f$ is drawn, i.e., restrict the VC dimension $h$ of $F$ (model selection).
   • Find $F$ such that the learned $f \in F$ minimizes the resulting overestimate (upper bound) of $R[f]$.
   • With probability $1 - \delta$ and $n > h$:
     $R[f] \le R_{emp}[f] + \sqrt{\dfrac{h\left(\ln\frac{2n}{h} + 1\right) - \ln(\delta/4)}{n}}$
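To get a feel for the bound, the square-root capacity term can be evaluated directly. A small illustrative sketch, assuming NumPy and made-up values for $n$, $h$, and $\delta$:

```python
import numpy as np

def vc_confidence(n, h, delta):
    """The square-root (capacity) term of the bound on slide 5."""
    return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(delta / 4)) / n)

# The term shrinks as n grows and grows with the VC dimension h (delta = 0.05 here).
for n in (100, 1000, 10000):
    print(n, round(vc_confidence(n, h=10, delta=0.05), 3))
```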

6. Overfit
   • There is a tradeoff between the empirical risk $R_{emp}[f]$ and the uncertainty in the estimate of $R[f]$.
   [Figure: expected risk, empirical risk, and the uncertainty term as functions of the complexity of $F$.]

7. Margins
   • Consider a training sample separable by the hyperplane $f(x) = (w \cdot x) + b$.
   • The margin is the minimal distance of a sample to the decision surface.
   • We can bound the VC dimension of the set of hyperplanes by bounding the margin.
   [Figure: a separating hyperplane with normal vector $w$ and the margin around it.]
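As an illustration, the margin of a given hyperplane $(w, b)$ on a sample is the smallest distance $|w \cdot x_i + b| / \|w\|$ over the training points. A minimal sketch with made-up values for $w$, $b$, and the sample:

```python
import numpy as np

# Distance of each sample to the decision surface f(x) = (w . x) + b,
# and the margin as the minimum of those distances (w, b, X are made-up values).
w = np.array([1.0, 1.0])
b = 0.0
X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])

distances = np.abs(X @ w + b) / np.linalg.norm(w)
print(distances)        # per-sample distance to the hyperplane
print(distances.min())  # the margin of this sample with respect to (w, b)
```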

8. Nonlinear Algorithms
   • Using only hyperplanes in the input space is likely to underfit.
   • But we can map the data to a nonlinear feature space and use hyperplanes there:
     $\Phi : \mathbb{R}^N \to F,\quad x \mapsto \Phi(x)$

9. Curse of Dimensionality
   • The difficulty of learning increases with the dimensionality of the problem, i.e., it is harder to learn with more features.
   • But the difficulty also depends on the complexity of the learning algorithm and the VC dimension of the hypothesis class.
   • Hyperplanes are easy to learn.
   • Still, mapping to extremely high-dimensional spaces makes even hyperplane learning difficult.

10. Kernel Functions
   • For some feature spaces $F$ and mappings $\Phi$ there is a "trick" for efficiently computing scalar products.
   • Kernel functions compute scalar products in $F$ without mapping the data to $F$, or even knowing $\Phi$.

11. Kernel Functions
   • Example: the kernel $k(x, y) = (x \cdot y)^2$ with the mapping
     $\Phi : \mathbb{R}^2 \to \mathbb{R}^3,\quad (x_1, x_2) \mapsto (z_1, z_2, z_3) = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2)$
   • Then
     $(\Phi(x) \cdot \Phi(y)) = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2)(y_1^2,\ \sqrt{2}\, y_1 y_2,\ y_2^2)^\top = ((x_1, x_2)(y_1, y_2)^\top)^2 = (x \cdot y)^2 = k(x, y)$
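A quick numeric check of this identity, assuming NumPy; the vectors x and y are arbitrary examples:

```python
import numpy as np

def phi(x):
    """The explicit feature map Phi : R^2 -> R^3 from slide 11."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, y):
    """The kernel (x . y)^2, computed without ever forming Phi(x)."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])    # arbitrary example vectors
y = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)))   # 1.0
print(k(x, y))                  # 1.0, identical without mapping to R^3
```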

12. Kernel Functions
   • Gaussian RBF: $k(x, y) = \exp\!\left(\dfrac{-\|x - y\|^2}{c}\right)$
   • Polynomial: $k(x, y) = ((x \cdot y) + \theta)^d$
   • Sigmoidal: $k(x, y) = \tanh(\kappa (x \cdot y) + \theta)$
   • Inverse multiquadratic: $k(x, y) = \dfrac{1}{\sqrt{\|x - y\|^2 + c^2}}$
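For concreteness, the four kernels can be written as small functions. A sketch assuming NumPy; the parameter values (c, theta, d, kappa) are illustrative defaults, not values from the slides:

```python
import numpy as np

def gaussian_rbf(x, y, c=1.0):
    return np.exp(-np.sum((x - y) ** 2) / c)

def polynomial(x, y, theta=1.0, d=3):
    return (np.dot(x, y) + theta) ** d

def sigmoidal(x, y, kappa=1.0, theta=0.0):
    return np.tanh(kappa * np.dot(x, y) + theta)

def inverse_multiquadratic(x, y, c=1.0):
    return 1.0 / np.sqrt(np.sum((x - y) ** 2) + c ** 2)

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for kern in (gaussian_rbf, polynomial, sigmoidal, inverse_multiquadratic):
    print(kern.__name__, kern(x, y))
```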

13. Support Vector Machines
   • Supervised learning:
     $y_i((w \cdot x_i) + b) \ge 1,\quad i = 1, \ldots, n$
   • Mapping to a nonlinear space:
     $y_i((w \cdot \Phi(x_i)) + b) \ge 1,\quad i = 1, \ldots, n$   (Eq. 8)
   • Minimize (subject to Eq. 8):
     $\min_{w, b}\ \frac{1}{2}\|w\|^2$

14. Support Vector Machines
   • Problem: $w$ resides in $F$, where computation is difficult.
   • Solution: remove the dependency on $w$:
     - Introduce Lagrange multipliers $\alpha_i \ge 0$, $i = 1, \ldots, n$, one for each constraint in Eq. 8.
     - And use the kernel function.

15. Support Vector Machines
   $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i((w \cdot \Phi(x_i)) + b) - 1 \right)$
   $\dfrac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0$
   $\dfrac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i \Phi(x_i)$
   Substituting the last two equations into the first and replacing $(\Phi(x_i) \cdot \Phi(x_j))$ with the kernel function $k(x_i, x_j)$ …

16. Support Vector Machines
   $\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
   subject to:
   $\alpha_i \ge 0,\quad i = 1, \ldots, n$
   $\sum_{i=1}^{n} \alpha_i y_i = 0$
   This is a quadratic optimization problem.
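Because the dual is a quadratic program with simple constraints, a generic optimizer can solve it on toy data. A minimal sketch, assuming SciPy's SLSQP solver, a linear kernel, and a hypothetical separable data set; a dedicated QP or SMO solver would be used in practice:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical separable 2-D toy data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

def k(a, b):                       # linear kernel; any kernel from slide 12 works
    return np.dot(a, b)

K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
Q = np.outer(y, y) * K             # Q_ij = y_i y_j k(x_i, x_j)

def neg_dual(alpha):               # negate the objective: maximization -> minimization
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * n                               # alpha_i >= 0

res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x
print(np.round(alpha, 4))          # nonzero entries correspond to support vectors
```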

17. Support Vector Machines
   • Once we have $\alpha$, we have $w$ and can perform classification:
     $f(x) = \text{sgn}\!\left( \sum_{i=1}^{n} \alpha_i y_i (\Phi(x_i) \cdot \Phi(x)) + b \right) = \text{sgn}\!\left( \sum_{i=1}^{n} \alpha_i y_i k(x_i, x) + b \right)$, where
     $b = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{n} \alpha_j y_j k(x_i, x_j) \right)$
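The resulting classifier is a kernel expansion over the training points. A sketch using scikit-learn's SVC (which solves the same dual internally) to show that evaluating the expansion $\sum_i \alpha_i y_i k(x_i, x) + b$ from the fitted support vectors, dual coefficients, and intercept reproduces the library's decision function; the data and RBF parameter are made up, and the library computes $b$ its own way rather than with the averaging formula above:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data with labels in {-1, +1}; gamma and C are illustrative.
X = np.array([[2.0, 2.0], [2.0, 3.0], [1.0, 2.5],
              [-2.0, -2.0], [-3.0, -2.0], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1e6).fit(X, y)   # large C approximates a hard margin

def decision(x):
    """Kernel expansion sum_i alpha_i y_i k(x_i, x) + b over the support vectors."""
    k_vals = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return clf.dual_coef_[0] @ k_vals + clf.intercept_[0]

x_new = np.array([1.5, 1.5])
print(np.sign(decision(x_new)), clf.predict([x_new])[0])                 # should agree
print(np.allclose(decision(x_new), clf.decision_function([x_new])[0]))  # True
```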

18. SVMs with Noise
   • Until now we have assumed the problem is linearly separable in some space.
   • But if noise is present, this may be a bad assumption.
   • Solution: introduce noise terms (slack variables $\xi_i$) into the classification constraints:
     $y_i((w \cdot x_i) + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, \ldots, n$

19. SVMs with Noise
   • Now we want to minimize
     $\min_{w, b, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$
   • where $C > 0$ determines the tradeoff between empirical error and hypothesis complexity.

20. SVMs with Noise
   $\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
   subject to:
   $0 \le \alpha_i \le C,\quad i = 1, \ldots, n$
   $\sum_{i=1}^{n} \alpha_i y_i = 0$
   where $C$ limits the size of the Lagrange multipliers $\alpha_i$.
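The effect of the box constraint can be seen by varying $C$. A sketch using scikit-learn's SVC on hypothetical noisy data; the data set, kernel, and $C$ values are illustrative:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Hypothetical noisy, overlapping two-class data; labels mapped to {-1, +1}.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)
y = 2 * y - 1

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", gamma=0.5, C=C).fit(X, y)
    train_err = np.mean(clf.predict(X) != y)
    n_sv = clf.n_support_.sum()
    # Small C tolerates slack (more support vectors, higher training error);
    # large C penalizes slack heavily and fits the training data more closely.
    print(f"C={C:<7} training error={train_err:.3f} support vectors={n_sv}")
```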

21. Sparsity
   • Note that many training examples will lie outside the margin; therefore their optimal $\alpha_i = 0$.
   • In particular:
     $\alpha_i = 0 \;\Rightarrow\; y_i f(x_i) \ge 1 \text{ and } \xi_i = 0$
     $0 < \alpha_i < C \;\Rightarrow\; y_i f(x_i) = 1 \text{ and } \xi_i = 0$
     $\alpha_i = C \;\Rightarrow\; y_i f(x_i) \le 1 \text{ and } \xi_i \ge 0$
   • This reduces the optimization problem from $n$ variables down to the number of examples on or inside the margin.
   [Figure: a separating hyperplane with normal vector $w$ and its margin; examples outside the margin have $\alpha_i = 0$.]
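The sparsity shows up directly in a fitted model: most training examples get $\alpha_i = 0$ and drop out of the expansion. A sketch with scikit-learn on a hypothetical well-separated data set:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Hypothetical well-separated two-class data; labels mapped to {-1, +1}.
X, y = make_blobs(n_samples=500, centers=2, cluster_std=0.8, random_state=1)
y = 2 * y - 1

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(f"{len(clf.support_)} support vectors out of {len(X)} training examples")
# Only these examples (clf.support_vectors_, clf.dual_coef_) enter the expansion
# sum_i alpha_i y_i k(x_i, x) + b; for all other examples alpha_i = 0.
```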

22. Kernel Methods
   • Fisher's linear discriminant: find a linear projection of the feature space such that the classes are well separated.
   • "Well separated" is defined as a large difference between the class means and a small variance along the discriminant.
   • Can be solved using kernel methods to find nonlinear discriminants.

23. Applications
   • Optical pattern and object recognition
     - An invariant SVM achieved the best error rate (0.6%) on the USPS handwritten digit recognition problem, better than humans (2.5%).
   • Text categorization
   • Time-series prediction

24. Applications
   • Gene expression profile analysis
   • DNA and protein analysis
     - An SVM method (13% error) for classifying DNA translation initiation sites outperforms the best neural network (15% error).
     - Virtual SVMs, incorporating prior biological knowledge, reached an 11-12% error rate.

25. Kernel Methods for Unsupervised Learning
   • Principal Components Analysis (PCA) is used in unsupervised learning.
   • PCA is a linear method.
   • Kernel-based PCA can extract non-linear components using standard kernel techniques.
   • Applied to the USPS data for noise reduction, kernel PCA showed a factor-of-8 performance improvement over the linear PCA method.
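A small sketch contrasting linear PCA with kernel PCA, assuming scikit-learn and a synthetic two-circles data set (illustrative only, not the USPS experiment from the slide):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Hypothetical nonlinear data: two concentric circles that no linear projection separates.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear = PCA(n_components=2).fit_transform(X)
kernel = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# Along the first kernel principal component the two circles tend to separate,
# while linear PCA only rotates the data and leaves the classes mixed.
print("linear PCA, mean of component 1 per class:",
      linear[y == 0, 0].mean(), linear[y == 1, 0].mean())
print("kernel PCA, mean of component 1 per class:",
      kernel[y == 0, 0].mean(), kernel[y == 1, 0].mean())
```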

26. Summary
   • (+) Kernel-based methods allow linear-speed learning in non-linear spaces.
   • (+) Support vector machines ignore all but the most differentiating training data (the examples on or inside the margin).
   • (+) Kernel-based methods, and SVMs in particular, are among the best-performing classifiers on many learning problems.
   • (-) Choosing an appropriate kernel can be difficult.
   • (-) The high dimensionality of the original learning problem can still be a computational bottleneck.
