CS446 Introduction to Machine Learning (Spring 2015) University of Illinois at Urbana-Champaign
http://courses.engr.illinois.edu/cs446
Prof. Julia Hockenmaier
juliahmr@illinois.edu

Lecture 11: Soft SVMs

Midterm (Thursday, March 5, in class)
Closed book exam (during class):
– You are not allowed to use any cheat sheets, computers, calculators, phones, etc. (you shouldn't have to anyway).
– Only the material covered in lectures is examined (the assignments have gone beyond what's covered in class).
– Bring a pen (black/blue).
What is n-fold cross-validation, and what is its advantage over standard evaluation?
Good solution:
– Standard evaluation: split the data into training and test data (optionally also a validation set).
– n-fold cross-validation: split the data set into n parts and run n experiments, each using a different part as the test set and the remainder as training data.
– Advantage of n-fold cross-validation: because we can report expected accuracy as well as variance/standard deviation across folds, we get better estimates of the performance of a classifier.
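As an illustration (not part of the lecture), a minimal numpy sketch of n-fold cross-validation; X and y are assumed to be numpy arrays, and train_fn / accuracy_fn are hypothetical placeholders for whatever learner and metric are being evaluated:

```python
import numpy as np

def n_fold_cross_validation(X, y, n_folds, train_fn, accuracy_fn, seed=0):
    """Run n experiments; each fold serves once as the test set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    accuracies = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = train_fn(X[train_idx], y[train_idx])
        accuracies.append(accuracy_fn(model, X[test_idx], y[test_idx]))
    # Report the expected accuracy and its spread across folds.
    return float(np.mean(accuracies)), float(np.std(accuracies))
```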
– Define X: provide a mathematical/formal definition of X.
– Explain what X is/does: use plain English to say what X is/does.
– Compute X: return X; show the steps required to calculate it.
– Show/Prove that X is true/false/…: this requires a (typically very simple) proof.
Large margin classifiers:
– Why do we care about the margin?
– Perceptron with margin
– Support Vector Machines
– Review of SVMs
– Dealing with outliers: soft margins
– Soft margin SVMs and regularization
– SGD for soft margin SVMs
Some decision boundaries are very close to some items in the training data: they have small margins, and minor changes in the data could lead to different decision boundaries. A decision boundary that is as far away from any training item as possible has a large margin: minor changes in the data result in (roughly) the same decision boundary.
If the dataset is linearly separable, the Euclidean (geometric) distance of x(i) to the hyperplane wx + b = 0 is

|wx(i) + b| / ||w|| = y(i)(wx(i) + b) / ||w|| = y(i)(∑n wn xn(i) + b) / √(∑n wn wn)

The Euclidean distance of the data to the decision boundary will depend on the dataset.
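For concreteness, a small numpy sketch (my example, not from the slides) that evaluates this geometric distance for a made-up hyperplane and point:

```python
import numpy as np

def geometric_distance(w, b, x, y):
    """y * (w.x + b) / ||w||: Euclidean distance of a correctly classified (x, y) to w.x + b = 0."""
    return y * (np.dot(w, x) + b) / np.linalg.norm(w)

# Hyperplane 3*x1 + 4*x2 - 5 = 0 and the point (2, 1) with label +1:
print(geometric_distance(np.array([3.0, 4.0]), -5.0, np.array([2.0, 1.0]), +1))  # 1.0
```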
Distance of the training example x(i) from the decision boundary wx + b = 0:

y(i)(wx(i) + b) / ||w||

Learning an SVM = find parameters w, b such that the decision boundary wx + b = 0 is furthest away from the training examples closest to it, i.e. find the boundary wx + b = 0 with maximal distance to the data:

argmax_{w,b} [ (1/||w||) · min_n y(n)(wx(n) + b) ]

(The inner term min_n y(n)(wx(n) + b) is the functional distance to the closest training examples.)
Functional distance of a training example (x(k), y(k)) from the decision boundary:

y(k) f(x(k)) = y(k)(wx(k) + b) = γ

Support vectors: the training examples (x(k), y(k)) that have a functional distance of 1:

y(k) f(x(k)) = y(k)(wx(k) + b) = 1

All other examples are further away from the decision boundary.

Rescaling w and b by a factor k to kw and kb changes the functional distance of the data but does not affect geometric distances (see last lecture). We can therefore decide to fix the functional margin (the functional distance of the closest points to the decision boundary) to 1, regardless of their Euclidean distances.
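A quick numerical check of this (my illustration; the hyperplane and point are made up): rescaling (w, b) to (kw, kb) scales the functional distance by k while the geometric distance stays the same.

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -5.0
x, y = np.array([2.0, 1.0]), +1
for k in (1.0, 2.0, 10.0):
    functional = y * (np.dot(k * w, x) + k * b)     # grows with k: 5, 10, 50
    geometric = functional / np.linalg.norm(k * w)  # unchanged: always 1.0
    print(k, functional, geometric)
```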
Learn w in an SVM = maximize the margin:

argmax_{w,b} [ (1/||w||) · min_n y(n)(wx(n) + b) ]

Easier equivalent problem:
– We can always rescale w and b without affecting Euclidean distances.
– This allows us to set the functional margin to 1: min_n y(n)(wx(n) + b) = 1
Learn w in an SVM = maximize the margin:

argmax_{w,b} [ (1/||w||) · min_n y(n)(wx(n) + b) ]

Easier equivalent problem: a quadratic program
– Setting min_n y(n)(wx(n) + b) = 1 implies y(n)(wx(n) + b) ≥ 1 for all n
– argmax 1/(w⋅w) = argmin (w⋅w) = argmin (½ w⋅w)

argmin_{w,b}  ½ w⋅w   subject to  yi(w⋅xi + b) ≥ 1  ∀i
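Spelled out as one chain of equivalences (my restatement of the step above, once the functional margin is fixed to 1):

```latex
\operatorname*{argmax}_{\mathbf{w},b}\;\frac{1}{\lVert\mathbf{w}\rVert}
\;=\;\operatorname*{argmin}_{\mathbf{w},b}\;\lVert\mathbf{w}\rVert
\;=\;\operatorname*{argmin}_{\mathbf{w},b}\;\tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w}
\qquad\text{subject to}\quad y^{(n)}(\mathbf{w}\cdot\mathbf{x}^{(n)}+b)\ \ge\ 1\quad\forall n.
```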
[Figure: the separating hyperplane f(x) = 0 with margin m; the support vectors xi, xj, xk lie on the margin boundaries where y·f(x) = 1.]
The name "Support Vector Machine" stems from the fact that w* is supported by (i.e. lies in the linear span of) the examples that are exactly at a distance 1/||w*|| from the separating hyperplane. These vectors are therefore called support vectors.

Theorem: Let w* be the minimizer of the SVM optimization problem for S = {(xi, yi)}. Let I = {i : yi(w*⋅xi + b) = 1}. Then there exist coefficients αi > 0 such that w* = ∑_{i∈I} αi yi xi.

Support vectors = the set of data points xj with non-zero weights αj.
If the training data is linearly separable, there will be a decision boundary wx + b = 0 that perfectly separates it and where all items have a functional distance of at least 1: y(i)(wx(i) + b) ≥ 1. We can find w and b with a quadratic program:
argmin_{w,b}  ½ w⋅w   subject to  yi(w⋅xi + b) ≥ 1  ∀i
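A sketch of solving this quadratic program numerically, using scipy's general-purpose SLSQP solver on made-up, linearly separable toy data (in practice one would use a dedicated QP or SVM solver; this is only an illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
d = X.shape[1]

def objective(params):                 # params = (w_1, ..., w_d, b)
    w = params[:d]
    return 0.5 * np.dot(w, w)          # (1/2) w.w; b only appears in the constraints

def margin_constraints(params):
    w, b = params[:d], params[d]
    return y * (X @ w + b) - 1.0       # each entry must be >= 0: y_i (w.x_i + b) >= 1

result = minimize(objective, x0=np.zeros(d + 1), method="SLSQP",
                  constraints=[{"type": "ineq", "fun": margin_constraints}])
w_opt, b_opt = result.x[:d], result.x[d]
print(w_opt, b_opt, y * (X @ w_opt + b_opt))   # all functional margins should be >= 1
```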
Not every dataset is linearly separable. There may be outliers:
Associate each (x(i), y(i)) with a slack variable ξi that measures by how much it fails to achieve the desired margin δ
If x(i) is on the correct side of the margin, wx(i) + b ≥ 1: ξi = 0
If x(i) is on the wrong side of the margin, wx(i) + b < 1: ξi > 0
If x(i) is on the decision boundary, wx(i) + b = 0: ξi = 1
Hence, we will now assume that wx(i) + b ≥ 1 − ξi
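A tiny numerical illustration of these cases (the hyperplane and points are made up), computing the slack ξi = max(0, 1 − (wx(i) + b)) for a positive example:

```python
import numpy as np

w, b = np.array([1.0, 1.0]), -1.0     # made-up boundary: x1 + x2 - 1 = 0
cases = {
    "correct side of the margin": np.array([3.0, 2.0]),   # w.x + b = 4   -> slack 0
    "on the decision boundary":   np.array([0.5, 0.5]),   # w.x + b = 0   -> slack 1
    "wrong side of the margin":   np.array([0.8, 0.5]),   # w.x + b = 0.3 -> slack 0.7
}
for name, x in cases.items():
    slack = max(0.0, 1.0 - (np.dot(w, x) + b))             # slack for a positive example
    print(name, slack)
```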
Lhinge(y(n), f(x(n))) = max(0, 1 − y(n)f(x(n)))
[Plot: hinge loss as a function of y·f(x).]

Case 0: yf(x) = 1 (x is a support vector): hinge loss = 0
Case 1: yf(x) > 1 (x is outside the margin): hinge loss = 0
Case 2: 0 < yf(x) < 1 (x is inside the margin): hinge loss = 1 − yf(x)
Case 3: yf(x) < 0 (x is misclassified): hinge loss = 1 − yf(x)
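The same case analysis in a vectorized hinge-loss function (an illustrative sketch):

```python
import numpy as np

def hinge_loss(y, f_x):
    """max(0, 1 - y*f(x)), applied elementwise."""
    return np.maximum(0.0, 1.0 - y * f_x)

y   = np.array([1.0, 1.0, 1.0, 1.0])
f_x = np.array([2.0, 1.0, 0.5, -1.0])   # outside margin, on margin, inside margin, misclassified
print(hinge_loss(y, f_x))                # [0.  0.  0.5  2.]
```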
Replace y(n)(wx(n) + b) ≥ 1 (hard margin) with y(n)(wx(n) + b) ≥ 1 − ξ(n) (soft margin).

y(n)(wx(n) + b) ≥ 1 − ξ(n) is the same as ξ(n) ≥ 1 − y(n)(wx(n) + b).

Since ξ(n) > 0 only if x(n) is on the wrong side of the margin, i.e. if y(n)(wx(n) + b) < 1, the smallest admissible slack is the same as the hinge loss: Lhinge(y(n), f(x(n))) = max(0, 1 − y(n)f(x(n)))
ξi (slack): how far off is xi from the margin?
C (cost): how much do we have to pay for misclassifying xi?
We want to minimize C·∑i ξi and maximize the margin.
C controls the tradeoff between margin and training error.
argmin_{w,b,ξ}  ½ w⋅w + C ∑_{i=1..n} ξi   subject to  ξi ≥ 0 ∀i  and  yi(w⋅xi + b) ≥ 1 − ξi  ∀i
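This soft-margin program is what standard SVM packages solve; as a sketch (assuming scikit-learn is installed, with made-up data containing one outlier), the cost parameter C can be varied to see the tradeoff:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data: mostly separable, with one negative outlier inside the positive region.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [-1.0, -1.0], [-2.0, -1.0], [2.0, 1.8]])
y = np.array([1, 1, 1, -1, -1, -1])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: large margin, the outlier is absorbed by slack.
    # Large C: slack is expensive, so the boundary contorts to fit the outlier.
    print(C, clf.coef_, clf.intercept_, clf.n_support_)
```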
We can rewrite this as:

argmin_{w,b}  ½ w⋅w + C ∑n Lhinge(y(n), x(n))
= argmin_{w,b}  ½ w⋅w + C ∑n max(0, 1 − y(n)(wx(n) + b))

The parameter C controls the tradeoff between choosing a large margin (small ||w||) and choosing a small hinge loss.
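The rewritten unconstrained objective can be evaluated directly; a minimal sketch (function and variable names are mine):

```python
import numpy as np

def soft_svm_objective(w, b, X, y, C):
    """(1/2) w.w + C * sum_n max(0, 1 - y_n (w.x_n + b))."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)
```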
We minimize both the L2-norm of the weight vector ||w|| = √(w⋅w) and the hinge loss. Minimizing the norm of w is called regularization.
argmin_{w,b}  ½ w⋅w + C ∑n Lhinge(y(n), x(n))
Empirical loss minimization: argmin_w L(D), where L(D) = ∑i L(y(i), x(i)) is the loss of w on the training data D.

Regularized loss minimization: include a regularizer R(w) that constrains w, e.g. L2-regularization R(w) = λ‖w‖²:

argmin_w ( L(D) + R(w) )

λ controls the tradeoff between empirical loss and regularization.
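The soft-SVM objective above is an instance of this: dividing it by C (which does not change the minimizer) gives the hinge loss plus an L2 regularizer with λ = 1/(2C), i.e.

```latex
\operatorname*{argmin}_{\mathbf{w},b}\ \tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w}
 + C\sum_{n} L_{\mathrm{hinge}}\!\left(y^{(n)}, f(\mathbf{x}^{(n)})\right)
\;=\;
\operatorname*{argmin}_{\mathbf{w},b}\ \sum_{n} L_{\mathrm{hinge}}\!\left(y^{(n)}, f(\mathbf{x}^{(n)})\right)
 + \frac{1}{2C}\,\lVert\mathbf{w}\rVert^{2}.
```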
Traditional approach: solve the quadratic program.
– This is very slow (it scales poorly to large training sets).
Current approaches: use variants of stochastic gradient descent.
Lhinge(y(n), f(x(n))) = max(0, 1 − y(n)f(x(n)))

(Sub)gradient with respect to w:
If y(n)f(x(n)) ≥ 1: set the gradient to 0
If y(n)f(x(n)) < 1: set the gradient to −y(n)x(n)
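As a sketch, the corresponding (sub)gradient with respect to w and b for a single example (names are mine):

```python
import numpy as np

def hinge_subgradient(w, b, x, y):
    """(Sub)gradient of max(0, 1 - y*(w.x + b)) with respect to (w, b)."""
    if y * (np.dot(w, x) + b) >= 1.0:
        return np.zeros_like(w), 0.0   # the loss is 0 and flat here, so the gradient is 0
    return -y * x, -y                   # gradient of 1 - y*(w.x + b)
```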
Minimizing the regularized hinge loss:
If y(n)f(x(n)) < 1: θ(t+1) = θ(t) + y(n)x(n)
w(t+1) = θ(t+1)/(λ(t+1))
Dividing θ by λt is a projection step.
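A compact sketch of this kind of stochastic (sub)gradient descent for the soft SVM (in the spirit of Pegasos, without a bias term; the step schedule shown is one common choice and the data is made up, so treat it as illustrative rather than the exact algorithm from the lecture):

```python
import numpy as np

def sgd_soft_svm(X, y, lam=0.1, n_steps=500, seed=0):
    """Minimize (lam/2)||w||^2 + (1/m) sum_n max(0, 1 - y_n w.x_n), with no bias term."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)                      # accumulated update term
    for t in range(1, n_steps + 1):
        w = theta / (lam * t)                # scaling ("projection") step
        i = rng.integers(m)                  # pick one training example at random
        if y[i] * np.dot(w, X[i]) < 1.0:     # inside the margin or misclassified
            theta = theta + y[i] * X[i]
    return theta / (lam * n_steps)

# Made-up usage:
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = sgd_soft_svm(X, y)
print(w, y * (X @ w))   # functional margins on the training data
```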
Hinge loss: penalizes misclassified items as well as items inside the margin.
Hard SVMs assume linear separability.
– Learning hard SVMs = minimizing ½ w⋅w subject to every training item incurring zero hinge loss (functional margin ≥ 1).
Soft SVMs allow for outliers.
– Each outlier is associated with a slack variable.
– Learning soft SVMs = minimizing the regularized hinge loss.