SLIDE 1

Support Vector Machines (II): 
 Non-linear SVMs

LING 572 Advanced Statistical Methods for NLP February 18, 2020

1

Based on F. Xia ‘18

SLIDE 2

Outline

  • Linear SVM
  • Maximizing the margin
  • Soft margin
  • Nonlinear SVM
  • Kernel trick
  • A case study
  • Handling multi-class problems

2

SLIDE 3

Non-linear SVM

3

SLIDE 4

Highlights

  • Problem: Some data are not linearly separable.
  • Intuition: Transform the data into a high-dimensional feature space (see the sketch below).

4

Input space → feature space
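A minimal sketch of this intuition with made-up data (mine, not the slides'): two concentric rings cannot be split by a line in the 2-D input space, but adding one squared-distance feature makes a linear separator trivial in the 3-D feature space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes on concentric circles: not linearly separable in 2-D.
theta = rng.uniform(0, 2 * np.pi, size=200)
radius = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)])  # inner vs. outer ring
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
y = np.concatenate([np.full(100, -1), np.full(100, +1)])

# Map to a 3-D feature space: phi(x) = (x1, x2, x1^2 + x2^2).
Phi = np.column_stack([X, (X ** 2).sum(axis=1)])

# In feature space the classes are separated by the plane x1^2 + x2^2 = 5,
# i.e. a linear classifier with w = (0, 0, 1) and b = -5.
w, b = np.array([0.0, 0.0, 1.0]), -5.0
pred = np.sign(Phi @ w + b)
print("separable in feature space:", np.all(pred == y))
```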

SLIDE 5

Example: Two spirals

5

Separated by a hyperplane in feature space (Gaussian kernels)

SLIDE 6

Feature space

  • Learning a non-linear classifier using SVM:
  • Define ϕ
  • Calculate ϕ(x) for each training example
  • Find a linear SVM in the feature space.
  • Problems:
  • Feature space can be very high-dimensional or even infinite-dimensional.
  • Calculating ϕ(x) explicitly can be very inefficient or even impossible (see the sketch below).
  • Curse of dimensionality

6
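To put a number on that inefficiency (a back-of-the-envelope sketch, not from the slides): a degree-p polynomial map over d input dimensions produces C(d + p, p) explicit monomial features.

```python
from math import comb

# Number of monomials of degree <= p over d variables: C(d + p, p).
for d in (10, 1_000, 100_000):      # e.g., vocabulary-sized feature vectors
    for p in (2, 3):
        print(f"d={d:>7}, degree={p}: {comb(d + p, p):,} explicit features")
# For d = 100,000 and p = 3 this is already ~1.7e14 features,
# while the kernel (x.z + 1)^p still costs only one d-dimensional dot product.
```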

SLIDE 7

Kernels

  • Kernels are similarity functions that return inner products between the images of data points.
  • Kernels can often be computed efficiently even for very high-dimensional spaces.
  • Choosing K is equivalent to choosing ϕ.

➔ the feature space is implicitly defined by K

7
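A quick numeric check of that claim, using the textbook quadratic kernel (my choice of ϕ for illustration; the course's own example appears on the next slides): for 2-D inputs, K(x, z) = <x, z>² equals <ϕ(x), ϕ(z)> with ϕ(x) = (x1², √2·x1·x2, x2²).

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel on 2-D inputs."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def k_quad(x, z):
    """Quadratic kernel computed directly in input space."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(k_quad(x, z))                     # (1*3 + 2*(-1))^2 = 1.0
print(float(np.dot(phi(x), phi(z))))    # same value (up to floating point) via the explicit map
```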

SLIDE 8

An example

8

SLIDE 9

An example**

9

SLIDE 10

10

Credit: Michael Jordan

SLIDE 11

Another example**

11

SLIDE 12

The kernel trick

  • No need to know what ϕ is and what the feature space is.
  • No need to explicitly map the data to the feature space.
  • Define a kernel function K, and replace the dot product <x,z> with a kernel function K(x,z) in both training and testing.

12
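As a code sketch (placeholder names; only αi and b come from the slides' notation): the kernelized classifier is the linear dual classifier with every dot product replaced by a kernel call.

```python
import numpy as np

def decide_linear(x, support_x, support_y, alphas, b):
    """Linear SVM in dual form: sign(sum_i alpha_i y_i <x_i, x> + b)."""
    score = sum(a * y * np.dot(xi, x) for a, y, xi in zip(alphas, support_y, support_x))
    return np.sign(score + b)

def decide_kernelized(x, support_x, support_y, alphas, b, K):
    """Identical except <x_i, x> becomes K(x_i, x); phi(x) is never computed."""
    score = sum(a * y * K(xi, x) for a, y, xi in zip(alphas, support_y, support_x))
    return np.sign(score + b)
```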

SLIDE 13

Training (**)

Maximize:    Σi αi − ½ Σi Σj αi αj yi yj K(xi, xj)
Subject to:  0 ≤ αi ≤ C for all i, and Σi αi yi = 0

(the same dual as the linear soft-margin SVM, with <xi, xj> replaced by K(xi, xj))

13

Non-linear SVM
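This quadratic program is normally handed to an off-the-shelf solver; as one hedged illustration (not the course's setup), scikit-learn's SVC solves the kernelized dual, and the learned αi·yi and b are exposed as dual_coef_ and intercept_.

```python
import numpy as np
from sklearn.svm import SVC

# Toy non-linear problem: XOR-like labels, not separable by any line in 2-D.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

clf = SVC(kernel="rbf", C=10.0, gamma=1.0)   # kernelized soft-margin SVM
clf.fit(X, y)

print(clf.support_vectors_)   # the x_i with alpha_i > 0
print(clf.dual_coef_)         # alpha_i * y_i for each support vector
print(clf.intercept_)         # the bias term b
print(clf.predict(X))         # recovers the XOR labels
```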

SLIDE 14

Decoding

14

Linear SVM (no mapping):  f(x) = sign(Σi αi yi <xi, x> + b)
Non-linear SVM:           f(x) = sign(Σi αi yi K(xi, x) + b)
The implicit weight vector w = Σi αi yi ϕ(xi) could be infinite-dimensional, so it is never computed explicitly.
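A sketch of that decoding rule in code, reusing the hedged scikit-learn example from the training slide: the score only needs kernel evaluations against the support vectors, never w or ϕ(x).

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    """Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    d = np.asarray(x) - np.asarray(z)
    return np.exp(-gamma * np.dot(d, d))

def decode(x, support_x, coef, b, K=rbf):
    """Non-linear SVM decoding: sign(sum_i alpha_i y_i K(x_i, x) + b).

    coef[i] plays the role of alpha_i * y_i (what sklearn stores in dual_coef_).
    """
    score = sum(c * K(xi, x) for c, xi in zip(coef, support_x)) + b
    return int(np.sign(score))

# With the SVC from the previous sketch, this reproduces clf.predict exactly:
# decode(x, clf.support_vectors_, clf.dual_coef_[0], clf.intercept_[0])
```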

SLIDE 15

Kernel vs. features

15

SLIDE 16

A tree kernel

16

SLIDE 17

Common kernel functions

  • Linear: K(x, z) = <x, z>
  • Polynomial: K(x, z) = (<x, z> + 1)^d
  • Radial basis function (RBF): K(x, z) = exp(−γ ||x − z||²)
  • Sigmoid: K(x, z) = tanh(γ <x, z> + r)

17

For the tanh function, see https://www.youtube.com/watch?v=er_tQOBgo-I
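The four kernels as small functions (a sketch; the hyperparameter defaults are illustrative, not values from the course):

```python
import numpy as np

def k_linear(x, z):
    return np.dot(x, z)

def k_poly(x, z, degree=2, coef0=1.0):
    return (np.dot(x, z) + coef0) ** degree

def k_rbf(x, z, gamma=0.5):
    d = x - z
    return np.exp(-gamma * np.dot(d, d))

def k_sigmoid(x, z, gamma=0.01, coef0=0.0):
    return np.tanh(gamma * np.dot(x, z) + coef0)

x, z = np.array([1.0, 2.0, 0.0]), np.array([0.0, 1.0, 1.0])
for name, k in [("linear", k_linear), ("poly", k_poly), ("rbf", k_rbf), ("sigmoid", k_sigmoid)]:
    print(name, k(x, z))
```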

SLIDE 18

18

SLIDE 19

Polynomial kernel

  • Allows us to model feature conjunctions (up to the order of the polynomial).
  • Ex:
  • Original feature: single words
  • Quadratic kernel: word pairs, e.g., “ethnic” and “cleansing”, “Jordan” and “Chicago”

19
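A small illustration (made-up documents, my own example): expanding (<x, z> + 1)² produces terms for every pair of words two documents share, so the quadratic kernel scores feature conjunctions without ever enumerating word pairs as explicit features.

```python
import numpy as np

vocab = ["ethnic", "cleansing", "jordan", "chicago", "river"]

def bow(words):
    """Binary bag-of-words vector over the small vocabulary above."""
    return np.array([1.0 if w in words else 0.0 for w in vocab])

doc_a = bow({"ethnic", "cleansing", "jordan"})
doc_b = bow({"ethnic", "cleansing", "chicago"})

k_quadratic = (np.dot(doc_a, doc_b) + 1.0) ** 2
# <a, b> = 2 shared words, so the kernel value is (2 + 1)^2 = 9.
# Expanding the square yields contributions for single shared words and for
# shared word *pairs* (e.g., "ethnic" with "cleansing"), i.e., feature conjunctions.
print(k_quadratic)
```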

SLIDE 20

RBF Kernel

20

Source: Chris Albon
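A small numeric note on the RBF kernel (my addition, hedged): γ controls how quickly similarity decays with distance, and therefore how local the resulting decision boundary is.

```python
import numpy as np

def rbf(x, z, gamma):
    d = x - z
    return np.exp(-gamma * np.dot(d, d))

x, z = np.array([0.0, 0.0]), np.array([1.0, 1.0])   # squared distance = 2
for gamma in (0.1, 1.0, 10.0):
    print(f"gamma={gamma}: K(x, z) = {rbf(x, z, gamma):.4f}")
# Larger gamma -> similarity falls off faster -> more local, wigglier boundary.
```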

SLIDE 21

Other kernels

  • Kernels for
  • trees
  • sequences
  • sets
  • graphs
  • general structures
  • A tree kernel example in reading #3

21

SLIDE 22

The choice of kernel function

  • Given a function, we can test whether it is a kernel function by using Mercer’s theorem (see “Additional slides”).

  • Different kernel functions could lead to very different results.
  • Need some prior knowledge in order to choose a good kernel.

22

SLIDE 23

Summary so far

  • Find the hyperplane that maximizes the margin.
  • Introduce a soft margin to deal with noisy data.
  • Implicitly map the data to a higher-dimensional space to deal with non-linear problems.
  • The kernel trick allows an infinite number of features and efficient computation of the dot product in the feature space.
  • The choice of the kernel function is important.

23

SLIDE 24

MaxEnt vs. SVM

24

  • Modeling: MaxEnt maximizes P(Y|X, λ); SVM maximizes the margin.
  • Training: MaxEnt learns λi for each feature function; SVM learns αi for each training instance, plus b.
  • Decoding: MaxEnt calculates P(y|x); SVM calculates the sign of f(x), which is not a probability.
  • Things to decide: MaxEnt needs features, regularization, and a training algorithm; SVM needs a kernel, regularization, a training algorithm, and binarization.

SLIDE 25

More info

  • https://en.wikipedia.org/wiki/Kernel_method
  • Tutorials: http://www.svms.org/tutorials/
  • https://medium.com/@zxr.nju/what-is-the-kernel-trick-why-is-it-important-98a98db0961d

25

SLIDE 26

Additional slides

26

SLIDE 27

Linear kernel

  • The map ϕ is linear.
  • The kernel adjusts the weight of the features according to their importance.

27

SLIDE 28

The Kernel Matrix
 (a.k.a. the Gram matrix)

28

K(1,1)   K(1,2)   K(1,3)   …   K(1,m)
K(2,1)   K(2,2)   K(2,3)   …   K(2,m)
…
K(m,1)   K(m,2)   K(m,3)   …   K(m,m)

where xi denotes the i-th training instance and K(i,j) is shorthand for K(xi, xj).
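Computing the Gram matrix for a given kernel takes only a few lines (a sketch with illustrative data):

```python
import numpy as np

def gram_matrix(X, K):
    """m x m kernel (Gram) matrix with entries K(x_i, x_j)."""
    m = len(X)
    G = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            G[i, j] = K(X[i], X[j])
    return G

# Example with a quadratic kernel on three 2-D points (illustrative data).
K = lambda x, z: (np.dot(x, z) + 1.0) ** 2
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
print(gram_matrix(X, K))          # symmetric, as expected of a kernel matrix
```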

SLIDE 29

Mercer’s Theorem

  • The kernel matrix is symmetric and positive semi-definite.
  • Any symmetric, positive semi-definite matrix can be regarded as a kernel matrix; that is, there exists a ϕ such that K(x, z) = <ϕ(x), ϕ(z)>.

29
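A numerical spot-check of the first condition (a sketch, not a proof: it only examines the Gram matrix of one finite sample): the matrix should be symmetric with no eigenvalue meaningfully below zero.

```python
import numpy as np

def looks_like_kernel(K, X, tol=1e-10):
    """Numerical Mercer check on a sample: the Gram matrix should be
    symmetric with no eigenvalue significantly below zero."""
    G = np.array([[K(xi, xj) for xj in X] for xi in X])
    symmetric = np.allclose(G, G.T)
    eigvals = np.linalg.eigvalsh(G)          # eigenvalues of a symmetric matrix
    return symmetric and eigvals.min() >= -tol

X = np.random.default_rng(1).normal(size=(20, 3))
rbf = lambda x, z: np.exp(-np.dot(x - z, x - z))
not_a_kernel = lambda x, z: np.linalg.norm(x - z)        # plain distance
print(looks_like_kernel(rbf, X))           # True
print(looks_like_kernel(not_a_kernel, X))  # False: some eigenvalues are negative
```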

SLIDE 30

Making kernels

  • The set of kernels is closed under some operations. For instance, if K1 and K2 are kernels, so are the following:
  • K1 + K2
  • cK1 and cK2 for c > 0
  • cK1 + dK2 for c > 0 and d > 0
  • One can make complicated kernels from simple ones.

30
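These closure rules translate directly into code (a sketch; the component kernels and weights are arbitrary choices for illustration):

```python
import numpy as np

def k_sum(k1, k2):
    """K1 + K2 is a kernel if K1 and K2 are."""
    return lambda x, z: k1(x, z) + k2(x, z)

def k_scale(c, k):
    """c * K is a kernel for c > 0."""
    assert c > 0
    return lambda x, z: c * k(x, z)

# Building a more complicated kernel from simple ones.
linear = lambda x, z: np.dot(x, z)
rbf = lambda x, z: np.exp(-np.dot(x - z, x - z))
combined = k_sum(k_scale(0.5, linear), k_scale(2.0, rbf))

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(combined(x, z))   # 0.5*<x,z> + 2*exp(-||x-z||^2) = 0 + 2*exp(-2)
```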