

SLIDE 1

Support Vector Machines

Prof. Mike Hughes

Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2019s/

Many ideas/slides attributable to: Dan Sheldon (U.Mass.); James, Witten, Hastie, Tibshirani (ISL/ESL books)
SLIDE 2

SVM Unit Objectives

Big ideas: Support Vector Machine

  • Why maximize margin?
  • What is a support vector?
  • What is hinge loss?
  • Advantages over logistic regression
      • Less sensitive to outliers
      • Advantages from sparsity when using kernels
  • Disadvantages
      • Not probabilistic
      • Less elegant to do multi-class

SLIDE 3

What will we learn?


[Diagram: overview of machine learning paradigms — supervised learning, unsupervised learning, reinforcement learning. Supervised learning uses data/label pairs {x_n, y_n}, n = 1, ..., N, with a performance measure, and proceeds through training, prediction, and evaluation.]

SLIDE 4

Recall: Kernelized Regression

SLIDE 5

Linear Regression

SLIDE 6

Kernel Function


k(x_i, x_j) = φ(x_i)^T φ(x_j)

Interpretation: a similarity function for x_i and x_j.
Properties:
  • Input: any two feature vectors
  • Output: a real scalar; larger values mean more similar
  • Symmetric: k(x_i, x_j) = k(x_j, x_i)
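As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of two kernels with these properties; the bandwidth parameter gamma is an assumed hyperparameter.

```python
import numpy as np

def linear_kernel(x_i, x_j):
    # k(x_i, x_j) = x_i^T x_j  (phi is the identity map)
    return np.dot(x_i, x_j)

def rbf_kernel(x_i, x_j, gamma=1.0):
    # Squared-exponential / Gaussian kernel: larger value = more similar
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

x_i = np.array([1.0, 2.0])
x_j = np.array([1.5, 1.0])
# Both kernels are symmetric: k(x_i, x_j) == k(x_j, x_i)
print(linear_kernel(x_i, x_j), rbf_kernel(x_i, x_j, gamma=0.5))
```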

SLIDE 7

Common Kernels

SLIDE 8
Kernel Matrices

K = [ k(x_1, x_1)  k(x_1, x_2)  ...  k(x_1, x_N)
      k(x_2, x_1)  k(x_2, x_2)  ...  k(x_2, x_N)
      ...
      k(x_N, x_1)  k(x_N, x_2)  ...  k(x_N, x_N) ]

  • K_train : N x N symmetric matrix (training examples vs. training examples)
  • K_test : T x N matrix (test examples vs. training examples)
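As an illustrative sketch (not from the slides), one way to build these matrices with NumPy, assuming an RBF kernel:

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma=1.0):
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

def kernel_matrix(X_a, X_b, kernel=rbf_kernel):
    # Entry (i, j) holds kernel(X_a[i], X_b[j])
    return np.array([[kernel(a, b) for b in X_b] for a in X_a])

X_train = np.random.randn(5, 2)   # N = 5 training examples, 2 features
X_test = np.random.randn(3, 2)    # T = 3 test examples

K_train = kernel_matrix(X_train, X_train)  # N x N, symmetric
K_test = kernel_matrix(X_test, X_train)    # T x N: test rows vs. training columns
```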

SLIDE 9

Linear Kernel Regression

SLIDE 10

Radial basis kernel aka Gaussian aka Squared Exponential

SLIDE 11


Compare: Linear Regression

Prediction: a linear transform of the G-dimensional features

    ŷ(x_i, θ) = θ^T φ(x_i) = Σ_{g=1}^G θ_g φ(x_i)_g

Training: solve a G-dimensional optimization problem

    min_θ Σ_{n=1}^N (y_n − ŷ(x_n, θ))^2   (+ L2 penalty, optional)
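For concreteness, a small NumPy sketch of this featurized linear regression; the quadratic feature map φ and the penalty strength below are illustrative assumptions, not from the slides.

```python
import numpy as np

def phi(x):
    # Hypothetical feature transform: G = 3 polynomial features of a scalar x
    return np.array([1.0, x, x ** 2])

x = np.linspace(-1, 1, 20)
y = 2.0 * x ** 2 - x + 0.1 * np.random.randn(20)

Phi = np.stack([phi(v) for v in x])      # N x G design matrix
lam = 0.1                                # optional L2 penalty strength
# Closed-form solution of min_theta sum_n (y_n - theta^T phi(x_n))^2 + lam * ||theta||^2
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

y_hat = Phi @ theta                      # predictions: linear transform of the features
```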

SLIDE 12

Kernelized Linear Regression


  • Prediction:

    ŷ(x_i, α, {x_n}_{n=1}^N) = Σ_{n=1}^N α_n k(x_n, x_i)

  • Training: solve an N-dimensional optimization problem

    min_α Σ_{n=1}^N (y_n − ŷ(x_n, α, X))^2

where X = {x_n}_{n=1}^N denotes the training examples. Can do all needed operations with access only to the kernel (no feature vectors are ever created in memory).
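A minimal NumPy sketch of this idea; the RBF kernel and the small ridge term added for numerical stability are assumptions, not part of the slides.

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma=1.0):
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

def kernel_matrix(X_a, X_b):
    return np.array([[rbf_kernel(a, b) for b in X_b] for a in X_a])

X_train = np.random.randn(20, 2)
y_train = np.random.randn(20)

# Training: solve an N-dimensional problem for alpha
# (a tiny ridge term keeps the linear solve well conditioned)
K_train = kernel_matrix(X_train, X_train)                      # N x N
alpha = np.linalg.solve(K_train + 1e-6 * np.eye(20), y_train)

# Prediction: y_hat(x_i) = sum_n alpha_n k(x_n, x_i)  ==  K_test @ alpha
X_test = np.random.randn(5, 2)
y_hat = kernel_matrix(X_test, X_train) @ alpha                 # T predictions
```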

SLIDE 13

Math on board

  • Goal: show how kernelized linear prediction reweights each training example

SLIDE 14

Can kernelize any linear model


ŷ(x_i, α, {x_n}_{n=1}^N) = Σ_{n=1}^N α_n k(x_n, x_i)

Regression prediction:

    ŷ(x_i, α, X)

Logistic regression prediction:

    p(Y_i = 1 | x_i) = σ(ŷ(x_i, α, X))
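A small illustrative sketch of the kernelized logistic regression prediction, assuming the RBF kernel from earlier; the alpha values here are stand-ins for weights learned by minimizing the log loss.

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma=1.0):
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x_i, X_train, alpha, gamma=1.0):
    # p(Y_i = 1 | x_i) = sigma( sum_n alpha_n k(x_n, x_i) )
    score = sum(a * rbf_kernel(x_n, x_i, gamma) for a, x_n in zip(alpha, X_train))
    return sigmoid(score)

X_train = np.random.randn(10, 2)
alpha = np.random.randn(10)          # stand-in weights; would be learned from data
print(predict_proba(np.zeros(2), X_train, alpha))
```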

SLIDE 15

Training: Regression vs. Logistic Regression


Regression:

    min_α Σ_{n=1}^N (y_n − ŷ(x_n, α, X))^2

Logistic regression:

    min_α Σ_{n=1}^N log_loss(y_n, σ(ŷ(x_n, α, X)))
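For reference, a small sketch of the two per-example loss terms, assuming labels coded as 0/1 for the log loss:

```python
import numpy as np

def squared_loss(y, y_hat):
    # Regression objective term: (y_n - y_hat_n)^2
    return (y - y_hat) ** 2

def log_loss(y, p, eps=1e-12):
    # Logistic regression objective term, with y in {0, 1} and p = sigma(y_hat)
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

print(squared_loss(1.0, 0.8))   # small error -> small penalty
print(log_loss(1, 0.01))        # confident wrong prediction -> very large penalty
```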

SLIDE 16

Downsides of LR


Log loss means any misclassified example pays a steep price, so logistic regression is fairly sensitive to outliers.

SLIDE 17

Stepping back

Which do we prefer? Why?

SLIDE 18

Idea: Define Regions Separated by Wide Margin

SLIDE 19

w is perpendicular to the boundary

SLIDE 20

Examples that define the margin are called support vectors


[Figure: the nearest negative example and the nearest positive example define the margin.]

SLIDE 21

Observation: Non-support training examples do not influence margin at all


Could perturb these examples without affecting the boundary

SLIDE 22

How wide is the margin?

SLIDE 23

Small margin


Decision rule:

    ŷ = +1 if w^T x + b ≥ 0
    ŷ = −1 if w^T x + b < 0

(the region where w^T x + b > 0 is predicted positive; where w^T x + b < 0, negative)

Margin: distance to the boundary
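A tiny sketch of this decision rule; the weight vector and offset below are made-up values for illustration.

```python
import numpy as np

def predict_label(x, w, b):
    # +1 if w^T x + b >= 0, else -1
    return 1 if np.dot(w, x) + b >= 0 else -1

w = np.array([2.0, -1.0])   # hypothetical weights (perpendicular to the boundary)
b = 0.5                     # hypothetical offset
print(predict_label(np.array([1.0, 1.0]), w, b))   # -> 1
```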

SLIDE 24

Large margin


Decision rule (same as before):

    ŷ = +1 if w^T x + b ≥ 0
    ŷ = −1 if w^T x + b < 0

Margin: distance to the boundary

SLIDE 25

How wide is the margin?


Distance from nearest positive example to nearest negative example along vector w:

SLIDE 26

How wide is the margin?


Distance from nearest positive example to nearest negative example along vector w:

By construction, we assume the boundary is scaled so that the nearest positive example satisfies w^T x + b = +1 and the nearest negative example satisfies w^T x + b = −1; the margin width is then 2 / ||w||.
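A one-line numerical sketch of that width calculation, using a made-up weight vector:

```python
import numpy as np

w = np.array([2.0, -1.0])                 # hypothetical learned weights
margin_width = 2.0 / np.linalg.norm(w)    # width under the scaling w^T x + b = +/-1 at the support vectors
print(margin_width)                       # about 0.894
```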

SLIDE 27

SVM Training Problem Version 1: Hard margin


Requires all training examples to be correctly classified

This is a constrained quadratic optimization problem. There are standard optimization methods as well as methods specially designed for SVMs.

For each n = 1, 2, ..., N

SLIDE 28

SVM Training Problem Version 1: Hard margin


Requires all training examples to be correctly classified

This is a constrained quadratic optimization problem. There are standard optimization methods as well as methods specially designed for SVMs.

Minimizing this is equivalent to maximizing the margin width in feature space.

For each n = 1, 2, ..., N

SLIDE 29

Soft margin: Allow some misclassifications


Slack at example i: distance on the wrong side of the margin
SLIDE 30

Hard vs. soft constraints


HARD: all positive examples must satisfy the margin constraint.
SOFT: want each positive example to satisfy the constraint up to a slack, with the slack as small as possible (minimize its absolute value).

SLIDE 31

Hinge loss for positive example


Assumes the current example has positive label y = +1
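An illustrative sketch of the hinge loss as a function of the real-valued score w^T x + b (not taken from the slides):

```python
import numpy as np

def hinge_loss(y, score):
    # y in {+1, -1}; score = w^T x + b (or its kernelized version)
    # Zero loss once the example is on the correct side with margin >= 1
    return np.maximum(0.0, 1.0 - y * score)

print(hinge_loss(+1, 2.0))    # 0.0 : confidently correct
print(hinge_loss(+1, 0.5))    # 0.5 : correct, but inside the margin
print(hinge_loss(+1, -1.0))   # 2.0 : misclassified; penalty grows only linearly
```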

SLIDE 32


SVM Training Problem Version 2: Soft margin

Tradeoff parameter C controls model complexity.
  • Smaller C: simpler model; encourages a large margin even if we make lots of mistakes
  • Bigger C: avoid mistakes
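A short sketch of how C might be explored with scikit-learn's SVC; the toy data and the particular C values are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Smaller C: wider margin, tolerates more training mistakes (simpler model)
# Bigger C:  tries hard to classify every training example correctly
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, clf.score(X, y), 'num support vectors:', len(clf.support_))
```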

SLIDE 33


SVM vs LR: Compare training

Both require tuning a complexity hyperparameter C to avoid overfitting

SLIDE 34


Loss functions: SVM vs Logistic Regr.

SLIDE 35

SVMs: Prediction


Make binary prediction via hard threshold

SLIDE 36

SVMs and Kernels: Prediction


Make binary prediction via hard threshold

Efficient training algorithms based on modern quadratic programming solve the dual of the soft-margin SVM optimization problem.
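As an illustration (not from the slides), the kernelized prediction can be reproduced by hand from a fitted scikit-learn SVC, whose dual_coef_ stores the non-zero alpha_n * y_n values for the support vectors; the toy data below is assumed.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)

gamma = 0.5
clf = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X, y)

# f(x) = sum_{n in support set} (alpha_n * y_n) k(x_n, x) + b, then hard-threshold at 0
x_new = np.array([0.3, -0.2])
k_vals = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
score = clf.dual_coef_[0] @ k_vals + clf.intercept_[0]
print(np.sign(score), clf.predict(x_new[None, :])[0])   # the two should agree
```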

SLIDE 37

Support vectors are often a small fraction of all examples


[Figure: only the examples nearest the boundary — the nearest negative and nearest positive examples — are support vectors.]

SLIDE 38

Support vectors are defined by non-zero alpha values in the kernel view
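A short sketch of inspecting this sparsity with scikit-learn (toy data assumed):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

clf = SVC(kernel='rbf', C=1.0).fit(X, y)

# Only examples with non-zero alpha (the support vectors) enter the prediction sum
print('support vectors:', len(clf.support_), 'of', len(X), 'training examples')
print('indices of the first few:', clf.support_[:5])
```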

SLIDE 39

SVM + Squared Exponential Kernel

SLIDE 40


                         SVM                                       Logistic Regr.
Loss                     hinge                                     cross entropy (log loss)
Sensitive to outliers    less                                      more
Probabilistic?           no                                        yes
Kernelizable?            Yes, with speed benefits from sparsity    Yes
Multi-class?             Only via a separate one-vs-all            Easy, use softmax
                         classifier for each class

SLIDE 41

Activity

  • KernelRegressionDemo.ipynb
  • Scroll down to SVM vs logistic regression
  • Can you visualize behavior with different C?
  • Try different kernels?
  • Examine alpha vector?
