SLIDE 1

Support vector machines (SVMs) Lecture 6

David Sontag New York University

Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin

SLIDE 2

Pegasos vs. Perceptron

Pegasos Algorithm
  Initialize: w_1 = 0, t = 0
  For iter = 1, 2, …, 20
    For j = 1, 2, …, |data|
      t = t + 1
      η_t = 1/(t·λ)
      If y_j (w_t · x_j) < 1:
        w_{t+1} = (1 − η_t λ) w_t + η_t y_j x_j
      Else:
        w_{t+1} = (1 − η_t λ) w_t
  Output: w_{t+1}
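A minimal NumPy sketch of the update above (illustrative names; assumes labels y[j] ∈ {−1, +1} and the fixed 20-pass schedule shown on the slide):

```python
import numpy as np

def pegasos(X, y, lam=0.01, n_iters=20):
    """Pegasos: stochastic subgradient descent on the regularized hinge loss,
    following the update rule on the slide. Assumes y[j] in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(n_iters):
        for j in range(n):
            t += 1
            eta = 1.0 / (t * lam)              # step size eta_t = 1 / (t * lambda)
            if y[j] * np.dot(w, X[j]) < 1:     # margin violated: hinge loss is active
                w = (1 - eta * lam) * w + eta * y[j] * X[j]
            else:                              # regularization-only (shrinkage) step
                w = (1 - eta * lam) * w
    return w
```

Prediction is then ŷ = sign(w · x).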

SLIDE 3

Pegasos vs. Perceptron

Perceptron Algorithm
  Initialize: w_1 = 0, t = 0
  For iter = 1, 2, …, 20
    For j = 1, 2, …, |data|
      t = t + 1
      If y_j (w_t · x_j) ≤ 0:
        w_{t+1} = w_t + y_j x_j
      Else:
        w_{t+1} = w_t
  Output: w_{t+1}
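For comparison, the same loop with the perceptron update (a sketch under the same data conventions; whether the mistake condition is strict is a minor convention choice):

```python
import numpy as np

def perceptron(X, y, n_iters=20):
    """Perceptron pass: no regularization, no step-size schedule, and an update
    only on sign mistakes (margin 0) rather than on margin violations (margin 1)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        for j in range(n):
            if y[j] * np.dot(w, X[j]) <= 0:   # misclassified (or on the boundary)
                w = w + y[j] * X[j]
    return w
```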

SLIDE 4

Much faster than previous methods

  • 3 datasets (provided by Joachims)

– Reuters CCAT (800K examples, 47K features)
– Physics ArXiv (62K examples, 100K features)
– Covertype (581K examples, 54 features)

Training Time (in seconds):

                 Pegasos   SVM-Perf   SVM-Light
  Reuters            2         77       20,075
  Covertype          6         85       25,514
  Astro-Physics      2          5           80

SLIDE 5

Running time guarantee

Error Decomposition

  • Approximation error:

– Best error achievable by a large-margin predictor
– Error of the population minimizer: w0 = argmin_w E[f(w)] = argmin_w λ‖w‖² + E_{x,y}[loss(⟨w, x⟩; y)]

  • Estimation error:

– Extra error due to replacing E[loss] with the empirical loss: w* = argmin_w f_n(w)

  • Optimization error:

– Extra error due to only optimizing to within finite precision

[Figure: prediction error decomposed into err(w0), err(w*), and err(w)]

[Shalev-Shwartz, Srebro ’08]

Note: w0 here denotes the population minimizer defined above; it does not refer to the initial weight vector of the algorithm.
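One way to make the decomposition explicit, writing w̃ for the weight vector the optimizer actually returns (a telescoping identity; the notation w̃ is mine):

```latex
\mathrm{err}(\tilde{w})
  \;=\; \underbrace{\mathrm{err}(w_0)}_{\text{approximation}}
  \;+\; \underbrace{\mathrm{err}(w^{*}) - \mathrm{err}(w_0)}_{\text{estimation}}
  \;+\; \underbrace{\mathrm{err}(\tilde{w}) - \mathrm{err}(w^{*})}_{\text{optimization}}
```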

SLIDE 6

Running time guarantee

Error decomposition (as on the previous slide), plus the Pegasos guarantee:

After T = Õ(1/ε) updates: err(w_T) < err(w0) + ε, with probability 1 − δ.

[Shalev-Shwartz, Srebro ’08]

SLIDE 7

Extending to multi-class classification

SLIDE 8

One versus all classification

Learn 3 classifiers:

  • - vs {o,+}, weights w-
  • + vs {o,-}, weights w+
  • o vs {+,-}, weights wo

Predict label using: ŷ = argmax_y (w_y · x)

[Figure: the three weight vectors w+, w−, wo, shown on a 1-D dataset]

Any problems? Could we learn this (1-D) dataset?
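A minimal sketch of this prediction rule (hypothetical names; assumes the three weight vectors have already been trained one-vs-rest, e.g. with Pegasos):

```python
import numpy as np

def ova_predict(x, weights):
    """weights: dict mapping class label -> weight vector.
    Return the class whose one-vs-all classifier scores highest."""
    return max(weights, key=lambda label: np.dot(weights[label], x))

# Example with three classes '-', '+', 'o' in 1-D:
weights = {'-': np.array([-1.0]), '+': np.array([1.0]), 'o': np.array([0.1])}
print(ova_predict(np.array([2.0]), weights))  # '+'
```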

SLIDE 9

Multi-class SVM

  • Simultaneously learn 3 sets of weights: w+, w−, wo
  • How do we guarantee the correct labels?
  • Need new constraints!

The “score” of the correct class must be better than the “score” of the wrong classes: w_{y_j} · x_j > w_{y'} · x_j for all y' ≠ y_j

SLIDE 10

Multi-class SVM

As for the binary SVM, we introduce slack variables and maximize the margin.

To predict, we use: ŷ = argmax_y (w_y · x)

Now can we learn the 1-D dataset from before?
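A standard way to write this optimization problem (one common convention; the exact scaling of the regularizer and slack penalty varies across presentations):

```latex
\min_{w,\;\xi \ge 0} \;\; \lambda \sum_{y} \|w_y\|^2 \;+\; \sum_{j} \xi_j
\qquad \text{s.t.} \qquad
w_{y_j} \cdot x_j \;\ge\; w_{y'} \cdot x_j + 1 - \xi_j
\quad \forall j,\; \forall y' \ne y_j
```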

SLIDE 11
How to deal with imbalanced data?

  • In many practical applications we may have imbalanced data sets
  • We may want errors to be equally distributed between the positive and negative classes
  • A slight modification to the SVM objective does the trick:

Class-specific weighting of the slack variables
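A sketch of what this modification looks like, with separate penalties C_pos and C_neg on the slack of positive and negative examples (hypothetical helper; scikit-learn's SVC exposes the same idea via its class_weight argument):

```python
import numpy as np

def weighted_hinge_objective(w, X, y, lam, C_pos, C_neg):
    """Binary SVM objective with class-specific slack weights:
    lam * ||w||^2 + sum_j C_{y_j} * max(0, 1 - y_j <w, x_j>)."""
    margins = y * (X @ w)
    slacks = np.maximum(0.0, 1.0 - margins)   # hinge losses = optimal slack values
    weights = np.where(y > 0, C_pos, C_neg)   # per-example penalty chosen by class
    return lam * np.dot(w, w) + np.sum(weights * slacks)
```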

SLIDE 12

What if the data is not linearly separable?

Use features of features, of features of features, …

Feature space can get really large really quickly!

φ(x) = ( x(1), …, x(n), x(1)x(2), x(1)x(3), …, e^{x(1)}, … )
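To get a feel for how quickly the feature space grows, a quick count of monomial features (illustrative snippet; uses math.comb from Python 3.8+):

```python
from math import comb

def num_monomials_up_to_degree(n, d):
    """Number of monomials of degree <= d in n variables: C(n + d, d)."""
    return comb(n + d, d)

print(num_monomials_up_to_degree(100, 2))  # 5,151
print(num_monomials_up_to_degree(100, 4))  # 4,598,126
```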

SLIDE 13

Key idea #3: the kernel trick

  • High dimensional feature spaces at no extra cost!
  • After every update (of Pegasos), the weight vector can be written in the form:

    w = Σ_i α_i y_i φ(x_i)

  • As a result, prediction can be performed with:

    ŷ ← sign(w · φ(x))
       = sign( (Σ_i α_i y_i φ(x_i)) · φ(x) )
       = sign( Σ_i α_i y_i (φ(x_i) · φ(x)) )
       = sign( Σ_i α_i y_i K(x_i, x) ),   where K(x, x′) = φ(x) · φ(x′)
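A minimal sketch of prediction in this kernelized form (hypothetical names; assumes the coefficients α_i were accumulated during training, e.g. by a kernelized version of Pegasos or the perceptron):

```python
import numpy as np

def kernel_predict(x, support_x, alpha, y, kernel):
    """Compute sign( sum_i alpha_i * y_i * K(x_i, x) ) without ever forming phi(x)."""
    score = sum(a * yi * kernel(xi, x) for xi, a, yi in zip(support_x, alpha, y))
    return np.sign(score)

# e.g. a degree-2 polynomial kernel K(u, v) = (u . v)^2:
poly2 = lambda u, v: np.dot(u, v) ** 2
```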

SLIDE 14

Common kernels

  • Polynomials of degree exactly d
  • Polynomials of degree up to d
  • Gaussian kernels
  • Sigmoid
  • And many others: very active area of research!
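For concreteness, common forms of these kernels as code (parameter names c, sigma, kappa are my own; conventions vary):

```python
import numpy as np

def poly_exact(u, v, d):                 # polynomials of degree exactly d
    return np.dot(u, v) ** d

def poly_up_to(u, v, d, c=1.0):          # polynomials of degree up to d
    return (np.dot(u, v) + c) ** d

def gaussian(u, v, sigma=1.0):           # Gaussian / RBF kernel
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def sigmoid(u, v, kappa=1.0, c=0.0):     # sigmoid kernel (not PSD for all parameters)
    return np.tanh(kappa * np.dot(u, v) + c)
```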
SLIDE 15

Polynomial kernel

Polynomials of degree exactly d

d = 1:

  φ(u) · φ(v) = (u1, u2) · (v1, v2) = u1v1 + u2v2 = u · v

d = 2:

  φ(u) · φ(v) = (u1², u1u2, u2u1, u2²) · (v1², v1v2, v2v1, v2²)
              = u1²v1² + 2 u1v1 u2v2 + u2²v2²
              = (u1v1 + u2v2)²
              = (u · v)²

For any d (we will skip the proof): φ(u) · φ(v) = (u · v)^d
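A quick numerical check of the d = 2 identity above, with made-up vectors:

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

phi = lambda x: np.array([x[0]**2, x[0]*x[1], x[1]*x[0], x[1]**2])

print(np.dot(phi(u), phi(v)))   # explicit feature map: 1.0
print(np.dot(u, v) ** 2)        # kernel shortcut:      1.0
```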

SLIDE 16

Quadratic kernel

[Tommi Jaakkola]

SLIDE 17

Gaussian kernel

[Figures: Cynthia Rudin, mblondel.org — support vectors, and level sets of the decision function, i.e. points where the score equals some r]

SLIDE 18

Kernel algebra

[Justin Domke] Q: How would you prove that the “Gaussian kernel” is a valid kernel? A: Expand the Euclidean norm as follows: ‖u − v‖² = ‖u‖² − 2 u·v + ‖v‖², so that exp(−‖u − v‖²) = exp(−‖u‖²) · exp(2 u·v) · exp(−‖v‖²). Then, apply (e) from above.

To see that this is a kernel, use the Taylor series expansion of the exponential, together with repeated application of (a), (b), and (c):

The feature mapping is infinite dimensional!
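A small empirical sanity check (not a proof): the Gaussian kernel matrix of random points should be positive semidefinite up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # 50 random points in R^3

# Gaussian kernel matrix K[i, j] = exp(-||x_i - x_j||^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min())                    # >= 0 up to floating-point error
```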

SLIDE 19

Dual SVM interpretation: Sparsity

[Figure: separating hyperplane w·x + b = 0 with margin hyperplanes w·x + b = +1 and w·x + b = −1]

Support vectors:

  • α_j > 0

Non-support vectors:

  • α_j = 0
  • moving them will not change w

The final solution tends to be sparse:

  • α_j = 0 for most j
  • we don't need to store these points to compute w or make predictions
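This sparsity is easy to observe with an off-the-shelf solver; for example, scikit-learn's SVC reports which training points it kept as support vectors (illustrative script; requires scikit-learn):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated Gaussian blobs: most points end up far from the margin.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(len(clf.support_), "support vectors out of", len(X), "training points")
```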

SLIDE 20

Overfitting?

  • Huge feature space with kernels: should we worry about overfitting?

    – The SVM objective seeks a solution with a large margin
      • Theory says that a large margin leads to good generalization (we will see this in a couple of lectures)
    – But everything overfits sometimes!
    – We can control it by:
      • Setting C
      • Choosing a better kernel
      • Varying the parameters of the kernel (width of the Gaussian, etc.)
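A sketch of turning these knobs in practice with cross-validation over C and the Gaussian kernel width gamma (illustrative values; requires scikit-learn):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Grid over the slack penalty C and the RBF kernel width (gamma ~ 1 / (2 * sigma^2)).
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print(search.best_params_, search.best_score_)
```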