SLIDE 1

A Section 9: Support Vector Machines
Prepared & Presented by Will Claybaugh

CS109A Introduction to Data Science
Pavlos Protopapas and Kevin Rader

SLIDE 2

What do you get when you cross an elephant and a rhino?

Q: What does logistic regression think of LDA/QDA?

SLIDE 4

A: You're modelling too much

  • LDA/QDA tell the complete story of how the data came to be
  • Correspondingly, they make heavy assumptions, and much can go wrong
  • Logistic regression doesn't care how the X data came to be; it only tells the story of the Y data
  • Since there are fewer assumptions, the math is more advanced and the method is slower

SLIDE 5

Anyone take the old SATs?

SVM : Logistic Regression :: Logistic Regression : QDA

SLIDE 6

Less is More

SVMs:
  • Only predict the final class, not the probability of each class
  • Make no assumptions about the data
  • Still work well with large numbers of features

SLIDE 7

Our Path

  • I: Get comfy with the key expressions and concepts
    Bundles, signed distance, class-based distance
  • II: Extract the highlights of SVMs from the loss function
    Only certain observations matter; effects of the C parameter
  • III: Derivation of the primal and dual problems, fulfilling the promises from Part II
    Lagrangian, Primal/Dual games, KKT conditions as souped-up "derivative = 0"
  • IV: Interpret the dual problem and see SVMs in a new way
    SVMs can be seen as an advanced neighbors-style algorithm

SLIDE 8

REVIEW

Part I

SLIDE 9

Act I: Setting

  • Like logistic regression, SVMs set three parameters: a weight on each feature (w1 and w2) and an intercept (b)
  • This is MORE than we need to define a line
  • So what are we really defining?

SLIDE 10

Key Concept #1

  • Via w^T x + b, the weights w and intercept b define an output at each point of input space
  • This is our first key quantity, and it will live in our 'reminder corner'
  • w^T x + b gives us:
    – The rule to classify test points: if w^T x + b is + classify as +; if − classify as −
    – A new measure of distance [from the decision boundary, in units of 1/‖w‖]
    – We [arbitrarily] define +1 and −1 as the margin for a given (w, b) bundle

w^T x + b : Signed distance
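
To make the reminder-corner quantity concrete, here is a minimal sketch in Python (the toy weights w and intercept b are illustrative, not from the notebook):

```python
import numpy as np

# Toy bundle: weights w = (w1, w2) and intercept b (illustrative values).
w = np.array([2.0, 1.0])
b = -1.0

def signed_distance(x, w, b):
    """The key quantity w^T x + b, measured in units of 1/||w||."""
    return w @ x + b

def euclidean_distance_to_boundary(x, w, b):
    """Convert the signed distance into an actual Euclidean distance."""
    return signed_distance(x, w, b) / np.linalg.norm(w)

x_test = np.array([1.0, 3.0])
print(signed_distance(x_test, w, b))                 # 4.0 -> classify as +
print(euclidean_distance_to_boundary(x_test, w, b))  # 4 / sqrt(5) ~= 1.79
```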

SLIDE 11

Live Demo

DEMO: In the notebook, we manipulate w1, w2, and b to see how they affect the bundle produced.

Conclusions:
  • w1 and w2 control the slope of the bundle, and the larger the norm ‖w‖, the more tightly packed the bundle is
  • b controls the height of the bundle, but its effect depends on the magnitude of w1 and w2

SLIDE 12

Key Concept #2

  • The expression y_i(w^T x_i + b) occurs a ton with SVMs
  • It takes the signed distance function and multiplies it by an observation's class
  • We're calling it "class-based distance"

Example values of y_i(w^T x_i + b): 2, −2, 1, −1, 3

y_i(w^T x_i + b)
  – is 0 at the decision boundary
  – is above 1 if you are safely beyond your margin
  – is 1 (or less) if you are crowding the margin or misclassified
  – is negative if you're really messing up

w^T x + b : Signed distance
y_i(w^T x_i + b) : Class-based distance

SLIDE 13

A table of the key quantities at each point

[Figure: points A–F plotted around the decision boundary and its margins]

Point | Class | Signed distance | Class-based distance | Loss
  A   |   −   |       −3        |          3           | None
  B   |   −   |       −1        |          1           | Marginal
  C   |   +   |        2        |          2           | None
  D   |   −   |        2        |         −2           | Misclass
  E   |   +   |       −1        |         −1           | Misclass
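
A quick sketch that reproduces the table's last two columns from its Class and Signed distance columns (the points' coordinates themselves aren't shown on the slide):

```python
points = {            # point: (class y, signed distance w^T x + b)
    "A": (-1, -3.0),
    "B": (-1, -1.0),
    "C": (+1,  2.0),
    "D": (-1,  2.0),
    "E": (+1, -1.0),
}

def status(class_based):
    """Label used in the table's Loss column."""
    if class_based > 1:
        return "None"
    if class_based == 1:
        return "Marginal"
    return "Misclass" if class_based <= 0 else "Crowding the margin"

for name, (y, signed) in points.items():
    class_based = y * signed          # y_i (w^T x_i + b)
    print(name, class_based, status(class_based))
```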

SLIDE 14

Kernels

The same 'signed distance' concepts apply to kernels, although:
  1. The lines get wavy
  2. The way we measure distance is less clear

Later on, we'll learn:
  • What kind of distance is used for kernels
  • Standard distance isn't what you think

SLIDE 15

Recap

  • We're picking a best bundle (a set of weights w and an intercept b)
  • The bundle implies a signed 'distance' w^T x + b over the space, where 0 is the decision boundary
  • Class-based distance y_i(w^T x_i + b) is directly related to how sad we are about a training point
  • Kernels put a wavy set of lines over the input space, instead of level ones

SLIDE 16

LOSS FUNCTIONS

Part II

SLIDE 17

Hinge Loss

We saw 1 was a critical value for y_i(w^T x_i + b):
  • Above 1 means you're safely beyond your margin
  • Below 1 means you're crowding the margin
  • Below 0 means you're misclassified

Make it a loss function:
  • Negate, so bigger values are worse, not better
  • Add 1, so points exactly on their margin get loss 0 instead of −1
  • If the loss would be negative, record 0 instead

Loss = max(1 − y_i(w^T x_i + b), 0)

w^T x + b : Signed distance
y_i(w^T x_i + b) : Class-based distance
max(1 − y_i(w^T x_i + b), 0) : Loss
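
A minimal sketch of the hinge loss, reusing the signed distances and classes of points A–E from Slide 13:

```python
import numpy as np

def hinge_loss(y, s):
    """Per-point hinge loss max(1 - y*(w^T x + b), 0)."""
    return np.maximum(1.0 - y * s, 0.0)

y = np.array([-1, -1, +1, -1, +1])       # classes of points A-E
s = np.array([-3., -1., 2., 2., -1.])    # their signed distances w^T x + b
print(hinge_loss(y, s))                  # [0. 0. 0. 3. 2.]
```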

SLIDE 18

Loss

  • Which do you like best?

SLIDE 19

Act II: Loss

  • Which do you like best?

SLIDE 20

The Loss Function

  • A tradeoff exists between wanting wider margins and discomfort with points inside the margins

  • View A: minimize hinge loss, with L2 regularization

    Loss(w, b, train data) = Σ_{i=1}^{N} max(1 − y_i(w^T x_i + b), 0) + λ‖w‖²

  • View B: maximize the margin, but pay a price for points inside the margin (or misclassified)

    Loss(w, b, train data) = ‖w‖² + C Σ_{i=1}^{N} max(1 − y_i(w^T x_i + b), 0)

w^T x + b : Signed distance
‖w‖² + C Σ_{i=1}^{N} max(1 − y_i(w^T x_i + b), 0) : Loss (margin + invasion)
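
Here is a hedged sketch of View B's objective evaluated for a candidate bundle on a toy dataset (the data and the candidate (w, b) are made up for illustration):

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """View B: ||w||^2 + C * sum_i max(1 - y_i (w^T x_i + b), 0)."""
    hinge = np.maximum(1.0 - y * (X @ w + b), 0.0)
    return w @ w + C * hinge.sum()

X = np.array([[1., 2.], [2., 1.], [-1., -1.], [-2., 0.]])
y = np.array([+1, +1, -1, -1])
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))  # 0.5 here
```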

SLIDE 21

Live Demo

DEMO: In the notebook, we manipulate C and see how the solution found by the SVM changes.

Conclusions:
  • Big C: we do anything to reduce invasion losses
    – If separable: finds a separating plane
    – If not: lumps the non-separable points into the margin, separates the rest
  • Small C: we stop caring about invasion (or even misclassification); just grow the margin
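
A hedged stand-in for that demo using sklearn (synthetic data, not the notebook's dataset): as C grows, the margin narrows and fewer points end up as support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin width = {margin_width:.2f}, "
          f"support vectors = {len(clf.support_)}")
```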

SLIDE 22

Observations

Observations from the SVM loss:
  1. Hinge loss is zero for most points
     – most points are behind the margin
  2. Moving/deleting these points wouldn't change the solution
  3. The outcome for a test point only depends on a handful of training points
     – We should be able to write the output value as a combination of (−2,1) and (1,2)

  • Key question: HOW can we determine a test point's class using the few important training points?
  • Leads to re-casting as a fancified neighbors algorithm

SLIDE 23

What to watch for

Our reward for sitting through the math:
  1. A recipe for the most important training points
  2. A way to make decisions while throwing out most of the training data
  3. A new and more powerful view of what SVMs do

Like studying linear regression's loss minimization via calculus, but with a harder target and more advanced math.

SLIDE 24

MATH

Part III

Ideas: http://cs229.stanford.edu/notes/cs229-notes3.pdf
Soft-margin derivation: http://www.ccs.neu.edu/home/vip/teach/MLcourse/6_SVM_kernels/lecture_notes/svm/svm.pdf

SLIDE 25

Author's Proof

Outline of the proof steps:
  1. Re-cast the loss function as a convex optimization
  2. Re-write the one-player game into a two-player game (Primal)
  3. Re-write the two-player game into an equivalent game with opposite turn order (Dual)
  4. Observe that assigning (mostly-zero) importance scores to each training point is equivalent to solving the original optimization (KKT)
  5. Observe that our original SVM formulation was using a very counter-intuitive definition of distance, and we can do better

SLIDE 26

Optimization

Our goal:

  min_{w,b}  ‖w‖² + C Σ_{i=1}^{N} max(1 − y_i(w^T x_i + b), 0)

First, re-write it as an optimization problem with constraints:

  min_{w,b,ξ_i}  (1/2)‖w‖² + C Σ_{i=1}^{N} ξ_i
  such that ξ_i ≥ 1 − y_i(w^T x_i + b) and ξ_i ≥ 0 for all i

Basically, we delete the loss and introduce some ξ_i variables that you get to set.

Why is this the same problem?
  • ξ_i must be at least as big as the loss: you'd be dumb to set them to anything bigger than the loss
  • Now you're back to minimizing norm + loss

Objective: (1/2)‖w‖² + C Σ_{i=1}^{N} ξ_i
Constraints: ξ_i ≥ 1 − y_i(w^T x_i + b), ξ_i ≥ 0
w^T x + b : Signed distance
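
For reference, here's a hedged sketch of this constrained form written out with cvxpy (a generic convex solver, not the course's tooling; toy data):

```python
import numpy as np
import cvxpy as cp

X = np.array([[1., 2.], [2., 1.], [-1., -1.], [-2., 0.]])
y = np.array([+1., +1., -1., -1.])
N, d = X.shape
C = 1.0

w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(N)            # slack variables, one per training point

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)        # at the optimum, xi_i equals the hinge loss of point i
```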

SLIDE 27

  • You're trying to plan your week
  • You have to choose how much time you allocate to study, work, etc.
  • There are hard constraints
  • What if the constraints were flexible?
    – You'd know what the cost of each late day is
    – And how about a reward for getting work done early!
      Yeah, 'hypothetically'…

Hard constraints (per week):
  • Max 20 hours of work-study
  • CS109 maximum 1 day late
  • CS207 due by Thursday
  • Minimum 4 hours of sleep
  • No a-sec OH on Wednesday
  • Job interview Thursday
  • Project meeting by Monday
  • Meal budget: $20
  • Call your mom
  • Brush your teeth
  • TODO
  • TODO

SLIDE 28

Lagrange Multipliers

  • This brings us to the Lagrangian
  • It takes all the mandatory requirements and attaches costs to them
  • For ξ_i ≥ 1 − y_i(w^T x_i + b), we attach a cost α_i for each unit that 1 − y_i(w^T x_i + b) exceeds ξ_i
  • Likewise, a cost β_i for ξ_i ≥ 0
  • Overall, we get

  (1/2)‖w‖² + C Σ_{i=1}^{N} ξ_i  +  Σ_{i=1}^{N} α_i[1 − y_i(w^T x_i + b) − ξ_i]  +  Σ_{i=1}^{N} (−β_i ξ_i)

  (original objective)  +  (cost or benefit from the ξ_i ≥ 1 − y_i(w^T x_i + b) "objective")  +  (cost or benefit from the ξ_i ≥ 0 "objective")

Objective: (1/2)‖w‖² + C Σ_{i=1}^{N} ξ_i
Constraints: ξ_i ≥ 1 − y_i(w^T x_i + b), ξ_i ≥ 0
Lagrangian: (1/2)‖w‖² + C Σ_{i=1}^{N} ξ_i + Σ_{i=1}^{N} α_i[1 − y_i(w^T x_i + b) − ξ_i] − Σ_{i=1}^{N} β_i ξ_i
w^T x + b : Signed distance

SLIDE 29

Doom

  • But there aren't actually penalties for each day late, or rewards for being early…
  • What you need: a demon
  • The demon takes any plan you make and manipulates the α_i and β_i costs
  • So you better present a plan that actually meets the constraints

["You shall not pass" — the Balrog of Khazad-dûm is a demon]

SLIDE 30

Primal scream

  • We have a two-player game (you and the demon) equivalent to the original hard-constraint problem:

  min_{w,b,ξ_i}  max_{α_i ≥ 0, β_i ≥ 0}  (1/2)‖w‖² + C Σ_{i=1}^{N} ξ_i + Σ_{i=1}^{N} α_i[1 − y_i(w^T x_i + b) − ξ_i] − Σ_{i=1}^{N} β_i ξ_i

  (You choose the parameters; then the demon chooses the costs)

  • The demon will try to screw you, so you'll only propose points that meet all constraints
  • And you'll try to minimize the original objective
  • Level one complete: we wrote the "Primal Problem"
  • Now, like Gandalf and the Balrog, there's a Duel

SLIDE 31

  • Still pondering how to set your schedule, you run into an Econ 101 student
  • "The free market solves everything"
  • "What about companies polluting?"
  • "Well, charge them for each ton of carbon they emit. If you get the prices right, they'll stop"

  • Could you set the costs/rewards yourself and let the free market minimize the objective?
  • Can you set costs that guide the market to the same solution as the original?

SLIDE 32

The Dual

Reversing the turn order (the min and the max), we get:

  max_{α_i ≥ 0, β_i ≥ 0}  min_{w,b,ξ_i}  (1/2)‖w‖² + C Σ_{i=1}^{N} ξ_i + Σ_{i=1}^{N} α_i[1 − y_i(w^T x_i + b) − ξ_i] − Σ_{i=1}^{N} β_i ξ_i

  (You set the costs/rewards; the market tries to minimize the objective, including any costs/subsidies you offer)

  • There's no guarantee there are prices that force the free market to obey the constraints and give the same solution as the original
  • However, because 1) the objective is convex, 2) the constraints are convex, and 3) there is some solution that fits the constraints (pick ξ large)…
  • …there are such prices, even if they're hard to write down

SLIDE 33

KKT Conditions

IF the dual can be rigged to give the same solution as the original,
THEN we get helpful facts about the solution, called the KKT conditions.

  • KKT can be used to check a candidate solution, or to derive facts about the eventual solution

  1. Derivative of the Lagrangian (w.r.t. any parameter) is 0
  2. Derivative of the Lagrangian (w.r.t. the cost of violating an equality) is 0
  3. Constraint function ≤ 0 (i.e. the constraints are satisfied)
  4. Cost × constraint = 0 (only binding constraints get non-zero costs)
  5. Cost ≥ 0 (costs are positive)

SLIDE 34

That was… mostly Econ

Let's recap:
  • Wrote our optimization problem (minimize loss)
  • Massaged it into a convex optimization problem
    – Hooray for hinge loss and convexity
  • Applied the Lagrangian to make progress on the constrained optimization
    – Costs instead of mandates, but a demon controls the costs
  • Convexity let us study the dual problem instead
    – We control the costs, but it's not always possible to set them well
  • KKT gave us a bunch of useful properties we're about to apply

SLIDE 36

THE LAST MATH

Part III: Part II

SLIDE 37

The first rule of KKT is

Rule 1 (for w) says that

  ∇_w [ (1/2)‖w‖² + C Σ_{i=1}^{N} ξ_i + Σ_{i=1}^{N} α_i[1 − y_i(w^T x_i + b) − ξ_i] − Σ_{i=1}^{N} β_i ξ_i ] = 0

The derivative is practically trivial:

  w + Σ_{i=1}^{N} α_i(−y_i x_i) = 0    ⟹    w = Σ_{i=1}^{N} α_i y_i x_i

Conclusions:
  • The weights w are just a weighted sum of the training points
  • If we know the α_i, we can make classification decisions using only the x and y data

Fact 1) w = Σ_{i=1}^{N} α_i y_i x_i
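
Fact 1 can be checked directly in sklearn: for a linear kernel, dual_coef_ stores the products α_i·y_i for the support vectors (every other training point has α_i = 0), so multiplying it into the support vectors should recover the learned w. A hedged sketch on synthetic data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # sum_i (alpha_i y_i) x_i
print(np.allclose(w_from_dual, clf.coef_))            # True
```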

SLIDE 38

The second rule of KKT is

Rule 1 (for b) says that

  ∇_b [ (1/2)‖w‖² + C Σ_{i=1}^{N} ξ_i + Σ_{i=1}^{N} α_i[1 − y_i(w^T x_i + b) − ξ_i] − Σ_{i=1}^{N} β_i ξ_i ] = 0

The derivative is again trivial:

  Σ_{i=1}^{N} α_i(−y_i) = 0    ⟹    Σ_{i=1}^{N} α_i y_i = 0

Conclusions:
  • The α_i assigned to the positive class cancel with the α_i assigned to the negative class
  • Not very insightful, but it allows a simplification later

Fact 1) w = Σ_{i=1}^{N} α_i y_i x_i
Fact 2) Σ_{i=1}^{N} α_i y_i = 0

SLIDE 39

The third rule of KKT is

Rule 1 (for ξ_j) says that

  ∇_{ξ_j} [ (1/2)‖w‖² + C Σ_{i=1}^{N} ξ_i + Σ_{i=1}^{N} α_i[1 − y_i(w^T x_i + b) − ξ_i] − Σ_{i=1}^{N} β_i ξ_i ] = 0

The derivative is again trivial:

  C − α_j − β_j = 0    ⟹    C = α_j + β_j, for all j

Conclusions:
  • The cost of setting ξ_j above the loss and the cost of setting ξ_j below 0 add up to the total cost associated with ξ_j
  • Again, not terribly informative, but useful on the next slide

Fact 1) w = Σ_{i=1}^{N} α_i y_i x_i
Fact 2) Σ_{i=1}^{N} α_i y_i = 0
Fact 3) C = α_j + β_j

SLIDE 40

That's all the facts we need. Let's simplify the dual.

SLIDE 41

The Hardest Part… Is Cleaning Up

Starting from the Lagrangian:

  (1/2) w^T w + C Σ_{i=1}^{N} ξ_i + Σ_{i=1}^{N} α_i[1 − y_i(w^T x_i + b) − ξ_i] + Σ_{i=1}^{N} (−β_i ξ_i)

Use Fact 3 (C − α_j − β_j = 0 for all j) to kill C.

Rearrange:

  (1/2) w^T w + Σ_{i=1}^{N} C ξ_i + Σ_{i=1}^{N} (−β_i ξ_i) + Σ_{i=1}^{N} (−α_i ξ_i) + Σ_{i=1}^{N} α_i[1 − y_i(w^T x_i + b)]

Apply the fact:

  (1/2) w^T w + Σ_{i=1}^{N} α_i[1 − y_i(w^T x_i + b)]

http://www.ccs.neu.edu/home/vip/teach/MLcourse/6_SVM_kernels/lecture_notes/svm/svm.pdf

SLIDE 42

The Hardest Part… Is Cleaning Up

Copy over from the slide above:

  (1/2) w^T w + Σ_{i=1}^{N} α_i[1 − y_i(w^T x_i + b)]

Use Fact 2 (Σ_{i=1}^{N} α_i y_i = 0) to kill b.

Rearrange:

  (1/2) w^T w + Σ_{i=1}^{N} α_i[1 − y_i w^T x_i] − b Σ_{i=1}^{N} α_i y_i

Apply the fact:

  (1/2) w^T w + Σ_{i=1}^{N} α_i[1 − y_i w^T x_i]

http://www.ccs.neu.edu/home/vip/teach/MLcourse/6_SVM_kernels/lecture_notes/svm/svm.pdf

SLIDE 43

The Hardest Part… Is Cleaning Up

Copy over from the slide above:

  (1/2) w^T w + Σ_{i=1}^{N} α_i[1 − y_i w^T x_i]

Use Fact 1 (w = Σ_{j=1}^{N} α_j y_j x_j) to kill w.

Rearrange:

  (1/2) w^T w − Σ_{i=1}^{N} α_i y_i w^T x_i + Σ_{i=1}^{N} α_i

Apply the fact:

  (1/2) (Σ_{i=1}^{N} α_i y_i x_i^T)(Σ_{j=1}^{N} α_j y_j x_j) − Σ_{i=1}^{N} α_i y_i (Σ_{j=1}^{N} α_j y_j x_j^T) x_i + Σ_{i=1}^{N} α_i

http://www.ccs.neu.edu/home/vip/teach/MLcourse/6_SVM_kernels/lecture_notes/svm/svm.pdf

Σ_{j=1}^{N} α_j y_j x_j^T x + b : Signed distance

SLIDE 44

The Hardest Part… Is Cleaning Up

Copy:

  (1/2) (Σ_{i=1}^{N} α_i y_i x_i^T)(Σ_{j=1}^{N} α_j y_j x_j) − Σ_{i=1}^{N} α_i y_i (Σ_{j=1}^{N} α_j y_j x_j^T) x_i + Σ_{i=1}^{N} α_i

Simplify:

  (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_i^T x_j − Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_i^T x_j + Σ_{i=1}^{N} α_i

Final form:

  max_α  Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_i^T x_j
  such that 0 ≤ α_i ≤ C for all i, and Σ_{i=1}^{N} α_i y_i = 0

Simplified Dual: max_α Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_i^T x_j
Σ_{j=1}^{N} α_j y_j x_j^T x + b : Signed distance
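
The final form is a quadratic program in α alone. A hedged cvxpy sketch on toy data (writing the double sum as ‖Σ_i α_i y_i x_i‖² to keep the code short):

```python
import numpy as np
import cvxpy as cp

X = np.array([[1., 2.], [2., 1.], [-1., -1.], [-2., 0.]])
y = np.array([+1., +1., -1., -1.])
N, C = len(y), 1.0

alpha = cp.Variable(N)
yx = y[:, None] * X                       # rows are y_i x_i
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(yx.T @ alpha))
constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

w = yx.T @ alpha.value                    # Fact 1: w = sum_i alpha_i y_i x_i
print(np.round(alpha.value, 3), np.round(w, 3))
```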

SLIDE 45

WHAT IT MEANS

Part IV

SLIDE 46

Tune Back In Now

Simplified Dual:

  max_α  Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_i^T x_j
  such that 0 ≤ α_i ≤ C for all i, and Σ_{i=1}^{N} α_i y_i = 0

Interpretation time: what the heck are the α_i?
  1. Lagrangian view: the cost associated with each point; how much the objective would improve if we got to move that point
  2. New view: the raw importance of each point

Explanation:
  • The first goal is to maximize the alphas, but there's a second term punishing big alphas
  • When is that term big?

SLIDE 47

What Support Looks Like

Large α_i hurt us when they're associated with observations that are
  1) From the same class
  2) Pointing in the same direction

Large α_i help us when they're associated with observations that are
  1) From different classes
  2) Pointing in the same direction

max_α Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_i^T x_j

http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf

SLIDE 48

Further, our predictions depend on the α_j:

  Decision = w^T x + b = Σ_{j=1}^{N} α_j y_j x_j^T x + b

  • We make our decision by
    – Measuring the test point x's similarity to each training point x_j
    – Weighting by the training point's overall importance (α_j)
    – Summing over all training points, comparing the + score against the − score (set by y_j)

  • SVMs are an intelligent form of nearest neighbors!!!
    – We consider how similar our new point is to each training point
    – In addition, each training point has a raw importance score
  • (What does KNN think about SVMs?)

Σ_{j=1}^{N} α_j y_j x_j^T x + b : Signed distance
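
sklearn exposes exactly the pieces of this weighted-similarity view, so the decision function can be rebuilt by hand. A hedged sketch on synthetic data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
y = np.array([-1] * 30 + [+1] * 30)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_test = np.array([0.5, -0.2])
similarity = clf.support_vectors_ @ x_test                        # x_j^T x
decision = clf.dual_coef_[0] @ similarity + clf.intercept_[0]     # sum_j alpha_j y_j x_j^T x + b
print(np.isclose(decision, clf.decision_function([x_test])[0]))   # True
```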

SLIDE 49

Example: classify P = (1,0)

  Decision = Σ_{j=1}^{N} α_j y_j x_j^T P + b

[Figure: P and the four support vectors, labeled with α = .03, .1, .1, .03]

Contributions (α_j)(y_j)(x_j^T P):
  – (.03)(−1)(−2) = .06
  – (.1)(−1)(−2) = .2
  – (.1)(1)(1) = .1
  – (.03)(1)(2) = .06
  – b = .16

Total: .58 → classify as +
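
A quick check of the arithmetic above (the α_j, y_j, and inner products x_j^T P are read off the slide):

```python
contributions = [
    (0.03, -1, -2),   # (alpha_j, y_j, x_j^T P)
    (0.10, -1, -2),
    (0.10, +1,  1),
    (0.03, +1,  2),
]
b = 0.16
decision = sum(a * y * sim for a, y, sim in contributions) + b
print(round(decision, 2))   # 0.58 -> positive, so classify P as +
```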

SLIDE 50

Kernels

There's something weird about our calculation:
  • Our vector (1,0) is as similar to (2,0) as it is to (2,20)
  • Is there a more meaningful measure of similarity?

SLIDE 51

KERNELS

Part IV: Part II

SLIDE 52

Kernels

Maximum margin view:
  • Kernels map to a larger space where the classes can be separated by a plane
  • Want to pick the plane with the most margin

Neighbors view:
  • Kernels define a measure of similarity between observations
  • Classify based on the test point's similarity to the training points, and the importance of the training points

Simplified Dual (kernelized): max_α Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j k(x_i, x_j)
Σ_{j=1}^{N} α_j y_j k(x_j, x) + b : Signed distance

SLIDE 53

Example kernel: RBF

RBF kernel: rbf(x, x′) = e^{−(‖x − x′‖/δ)²}

  • Based on the actual distance between points
  • Similarity decreases rapidly because of the e^{−dist}
  • δ determines a 'cliff' because of the ( )²
    – if x and x′ are within δ of each other, the fraction is < 1 → they are more similar than you think
  • It's like a fishbowl lens
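
A minimal sketch of that similarity measure (δ is the length-scale from the slide; the exact form of the exponent is reconstructed from the bullets, so treat it as illustrative). It also answers Slide 50's complaint: (2,0) and (2,20) are no longer equally similar to (1,0).

```python
import numpy as np

def rbf(x, x_prime, delta=1.0):
    """RBF similarity: exp(-(||x - x'|| / delta)^2)."""
    return np.exp(-(np.linalg.norm(x - x_prime) / delta) ** 2)

p = np.array([1., 0.])
print(rbf(p, np.array([2., 0.])))    # ~0.37: a nearby point stays similar
print(rbf(p, np.array([2., 20.])))   # ~0.0 : a faraway point is now very dissimilar
```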

SLIDE 54

Kurtz, Sanders, Mustard, and Mustang

The RBF kernel has a geographic character to it: it uses literal Euclidean distance.

Other kernels (similarity measures) exist for:
  • Documents
  • Points in graphs
  • Randomly adding polynomial terms
  • Geostatistics
  • Images
  • Sound
  • Many more

http://crsouza.com/2010/03/17/kernel-functions-for-machine-learning-applications/

SLIDE 55

What makes a valid kernel?

  a) Think of a set of features and compute the inner product post-transformation
  b) Find a function so that no matter what points x you feed in, the matrix you build is Positive Semi-Definite (all eigenvalues ≥ 0)

  a) This is a Reproducing Kernel Hilbert Space
  b) Don't ask. …Or take ES 201 : )

SLIDE 56

What if I need to use these?

Practical kernel advice:
  • Consider domain-specific kernels
  • If you have more features than observations, you probably want linear
  • If you have more observations than features, try RBF, but it may be slow

Other practical advice:
  • sklearn points out that its kernel implementation is too slow for more than 5-10K observations/features
  • LinearSVC scales to millions, though no kernels are allowed
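
A hedged sketch of that advice in sklearn (synthetic data; the scale here is small enough that either estimator is fine, the point is the two interfaces):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

kernel_clf = SVC(kernel="rbf", C=1.0).fit(X, y)   # kernels allowed, slower at scale
linear_clf = LinearSVC(C=1.0).fit(X, y)           # no kernels, scales to millions
print(kernel_clf.score(X, y), linear_clf.score(X, y))
```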

SLIDE 57

REVIEW

SLIDE 58

Summary

In toto, here's what should stick:
  • SVMs define bundles, not boundaries
  • Convex optimization (here) is a better version of derivative = 0
    – Lagrangian, Primal, Dual
    – Costs, Demons, Capitalism
  • SVMs are BOTH
    – drawing the maximum-margin plane
    – measuring similarity to, and importance of, neighbors
  • Kernels are how we define custom similarity

SLIDE 59

Q: What does logistic regression think of LDA/QDA?
A: What does KNN think of SVMs?