LECTURE 10: LARGE MARGIN CLASSIFIERS
Prof. Julia Hockenmaier


SLIDE 1

CS446 Introduction to Machine Learning (Spring 2015) University of Illinois at Urbana-Champaign

http://courses.engr.illinois.edu/cs446

  • Prof. Julia Hockenmaier

juliahmr@illinois.edu

LECTURE 10:

LARGE MARGIN CLASSIFIERS

SLIDE 2

Today’s class

Large margin classifiers:
– Why do we care about the margin?
– Perceptron with margin
– Support Vector Machines

Dealing with outliers:
– Soft margins


SLIDE 3

Large margin classifiers

SLIDE 4

What’s the best separating hyperplane?

[Figure: positive (+) and negative (−) training points with a candidate separating hyperplane]

SLIDE 5

What’s the best separating hyperplane?

[Figure: the same points with another candidate separating hyperplane]

SLIDE 6

What’s the best separating hyperplane?

[Figure: the same points with a separating hyperplane; the margin m is marked]

SLIDE 7

Maximum margin classifiers


These decision boundaries are very close to some items in the training data. They have small margins. Minor changes in the data could lead to different decision boundaries.

This decision boundary is as far away from any training items as possible. It has a large margin. Minor changes in the data result in (roughly) the same decision boundary.

SLIDE 8

Maximum margin classifier

Margin = the distance of the decision boundary to the closest items in the training data.

We want to find a classifier whose decision boundary is furthest away from the nearest data points. (This classifier has the largest margin.)

This additional requirement (bias) reduces the variance (i.e. reduces overfitting).


SLIDE 9

Margins

SLIDE 10

Margins

Decision boundary: the hyperplane with f(x) = 0, i.e. wx + b = 0

Distance of the hyperplane wx + b = 0 to the origin:

−b / ‖w‖

Absolute distance of a point x to the hyperplane wx + b = 0:

|wx + b| / ‖w‖

[Figure: the hyperplane wx + b = 0, a point x, and the weight vector w]

SLIDE 11

Margin

If the data are linearly separable, y(i)(wx(i) + b) > 0.

Euclidean distance of x(i) to the decision boundary:

y(i) f(x(i)) / ‖w‖ = y(i)(wx(i) + b) / ‖w‖
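To make the distance formula concrete, here is a small sketch (NumPy and the toy values are my own choices, not from the lecture) that computes the geometric margin of each training item and of the whole dataset:

```python
import numpy as np

# A minimal sketch (assumed toy values): the geometric margin of a labeled point
# (x, y) w.r.t. the hyperplane wx + b = 0 is y * (w @ x + b) / ||w||; the margin
# of the dataset is the minimum over all points.

def geometric_margins(X, y, w, b):
    """Signed Euclidean distances y(i)(w x(i) + b) / ||w|| for every row of X."""
    return y * (X @ w + b) / np.linalg.norm(w)

# Toy data: two positive and two negative points in 2D
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

margins = geometric_margins(X, y, w, b)
print(margins)          # per-point distances; all positive => data separated by this hyperplane
print(margins.min())    # the geometric margin of this hyperplane on the data
```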

SLIDE 12

Functional vs. Geometric margin

Geometric margin (Euclidean distance) of hyperplane wx + b = 0 to point x(i):

y(i) f(x(i)) / ‖w‖ = y(i)(wx(i) + b) / ‖w‖

Functional margin of hyperplane wx + b = 0 to point x(i):

γ = y(i) f(x(i)), i.e. γ = y(i)(wx(i) + b)

SLIDE 13

Rescaling w and b

Rescaling w and b by a factor k to kw and kb does not change the geometric margin (Euclidean distance):

y(i)(wx(i) + b) / ‖w‖                              (geometric margin of x(i) to wx + b = 0)
= y(i)(∑n wn xn(i) + b) / √(∑n wn wn)              …spell out wx and ‖w‖…
= k · y(i)(∑n wn xn(i) + b) / (k · √(∑n wn wn))    …multiply by k/k…
= y(i)(∑n k·wn xn(i) + k·b) / √(∑n k·wn · k·wn)    …move k inside…
= y(i)(kwx(i) + kb) / ‖kw‖                         (geometric margin of x(i) to kwx + kb = 0)
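A quick numerical check of this invariance (toy values assumed), which also previews the next slide: the functional margin scales with k while the geometric margin stays the same:

```python
import numpy as np

# A minimal sketch (toy values assumed): rescaling (w, b) by a factor k changes
# the functional margin y(wx + b) but not the geometric margin y(wx + b) / ||w||.

x, y = np.array([2.0, 1.0]), +1
w, b = np.array([1.0, -1.0]), 0.5
k = 10.0

functional = y * (w @ x + b)
geometric = functional / np.linalg.norm(w)

functional_scaled = y * ((k * w) @ x + k * b)
geometric_scaled = functional_scaled / np.linalg.norm(k * w)

print(functional, functional_scaled)   # differs by the factor k
print(geometric, geometric_scaled)     # identical
```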

SLIDE 14

Rescaling w and b

Rescaling w and b by a factor k does change the functional margin γ by a factor k:

γ = y(i)(wx(i) + b)
kγ = y(i)(kwx(i) + kb)

The point that is closest to the decision boundary has functional margin γmin
– w and b can be rescaled so that γmin = 1
– When learning w and b, we can set γmin = 1

(and still get the same decision boundary)


SLIDE 15

The maximum margin decision boundary

[Figure: positive and negative points with the decision boundary wx = 0 and two parallel hyperplanes through the closest points, wxi = +1 = yi and wxj = −1 = yj; the margin m lies between them]

SLIDE 16

Hinge loss

L(y, f(x)) = max(0, 1 − yf(x))


Loss as a function of yf(x):
– Case 1: yf(x) ≥ 1: x outside of the margin; hinge loss = 0
– Case 2: 0 < yf(x) < 1: x inside the margin; hinge loss = 1 − yf(x)
– Case 3: yf(x) < 0: x misclassified; hinge loss = 1 − yf(x)
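A small sketch of the hinge loss with one example value for each of the three cases (values chosen for illustration):

```python
import numpy as np

# A minimal sketch: hinge loss L(y, f(x)) = max(0, 1 - y f(x)),
# evaluated for the three cases on this slide.

def hinge_loss(y, fx):
    return np.maximum(0.0, 1.0 - y * fx)

print(hinge_loss(+1, 2.0))   # y f(x) >= 1: outside the margin, loss 0
print(hinge_loss(+1, 0.4))   # 0 < y f(x) < 1: inside the margin, loss 0.6
print(hinge_loss(+1, -0.5))  # y f(x) < 0: misclassified, loss 1.5
```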

SLIDE 17

Perceptron with margin


SLIDE 18

Perceptron with Margin

Standard Perceptron update:
– Update w if ym·w·xm < 0

Perceptron with Margin update:
– Define a functional margin γ > 0
– Update w if ym·w·xm < γ
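A minimal sketch of this training rule (my own toy implementation; the lecture only states the update condition, and the additive update w ← w + η·ym·xm used here is the standard perceptron update):

```python
import numpy as np

# A rough sketch of the perceptron-with-margin rule: the weight vector is also
# updated when an example is classified correctly but with functional margin
# below gamma. Data and hyperparameters are assumed for illustration.

def perceptron_with_margin(X, y, gamma=1.0, lr=1.0, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xm, ym in zip(X, y):
            if ym * (w @ xm) < gamma:   # the standard perceptron uses "< 0" here
                w += lr * ym * xm
    return w

# Toy linearly separable data (bias folded in as a constant feature of 1)
X = np.array([[2.0, 2.0, 1.0], [3.0, 1.0, 1.0], [-1.0, -1.0, 1.0], [-2.0, 0.0, 1.0]])
y = np.array([+1, +1, -1, -1])
w = perceptron_with_margin(X, y, gamma=1.0)
print(w, np.sign(X @ w))   # learned weights and training predictions
```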


SLIDE 19

Support Vector Machines


SLIDE 20

The maximum margin decision boundary

[Figure: the same maximum-margin picture as slide 15: decision boundary wx = 0, boundary hyperplanes wxi = +1 and wxj = −1, margin m]

SLIDE 21

The maximum margin decision boundary…

… is defined by two parallel hyperplanes:
– one that goes through the positive data points (yj = +1) that are closest to the decision boundary, and
– one that goes through the negative data points (yj = −1) that are closest to the decision boundary.


SLIDE 22

Support vectors

We can express the separating hyperplane in terms of the data points xj closest to the decision boundary. These data points are called the support vectors.


SLIDE 23

Support vectors

[Figure: the maximum-margin picture again; the points lying on the hyperplanes wxi = +1 and wxj = −1 are the support vectors]

SLIDE 24

Perceptrons and SVMs: Differences in notation

Perceptrons:
– Weight vector has bias term w0 (x0 = dummy value 1)
– Decision boundary: wx = 0

SVMs/Large Margin classifiers:
– Explicit bias term b; weight vector w = (w1…wn)
– Decision boundary: wx + b = 0


SLIDE 25

Support Vector Machines

The functional margin of the data for (w, b) is determined by the points closest to the hyperplane:

γmin = minn [ y(n)(wx(n) + b) ]

Distance of x(n) to the hyperplane wx + b = 0:

(wx(n) + b) / ‖w‖

Learn w in an SVM = maximize the margin:

argmaxw,b { (1/‖w‖) · minn [ y(n)(wx(n) + b) ] }

SLIDE 26

Support Vector Machines

Learn w in an SVM = maximize the margin:

argmaxw,b { (1/‖w‖) · minn [ y(n)(wx(n) + b) ] }

This is difficult to optimize. Let's convert it to an equivalent problem that is easier.

SLIDE 27

Support Vector Machines

Learn w in an SVM = maximize the margin:

argmaxw,b { (1/‖w‖) · minn [ y(n)(wx(n) + b) ] }

Easier equivalent problem:
– We can always rescale w and b without affecting Euclidean distances.
– This allows us to set the functional margin to 1: minn [ y(n)(wx(n) + b) ] = 1

SLIDE 28

Support Vector Machines

Learn w in an SVM = maximize the margin:

argmaxw,b { (1/‖w‖) · minn [ y(n)(wx(n) + b) ] }

Easier equivalent problem: a quadratic program
– Setting minn [ y(n)(wx(n) + b) ] = 1 implies y(n)(wx(n) + b) ≥ 1 for all n
– argmax 1/‖w‖ = argmin w⋅w = argmin ½·w⋅w

argminw ½ w⋅w subject to yi(w⋅xi + b) ≥ 1 ∀i
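As an illustration, this quadratic program can be written almost verbatim with a convex-optimization modeling library; the use of cvxpy and the toy data here are my own choices, not part of the lecture:

```python
import numpy as np
import cvxpy as cp

# A minimal sketch of the hard-margin SVM quadratic program
# (cvxpy and the toy data are assumptions, not from the lecture):
#   minimize   1/2 * w.w
#   subject to y_i (w.x_i + b) >= 1  for all i

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)                 # the maximum-margin hyperplane
print(1.0 / np.linalg.norm(w.value))    # the resulting geometric margin
```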

SLIDE 29

Support Vector Machines

The name “Support Vector Machine” stems from the fact that w* is supported by (i.e. lies in the linear span of) the examples that are exactly at a distance 1/‖w*‖ from the separating hyperplane. These vectors are therefore called support vectors.

Theorem: Let w* be the minimizer of the SVM optimization problem for S = {(xi, yi)}. Let I = {i: yi(w*xi + b) = 1}. Then there exist coefficients αi > 0 such that w* = ∑i∈I αi yi xi.
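One way to see the theorem in practice is via scikit-learn's linear SVC (an assumed tool, not used in the lecture), whose dual_coef_ attribute stores the products αi·yi for the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# A small check of w* = sum_{i in I} alpha_i y_i x_i using scikit-learn
# (toy data assumed; a very large C approximates a hard margin).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

w_from_duals = clf.dual_coef_ @ clf.support_vectors_   # dual_coef_ holds alpha_i * y_i
print(clf.support_vectors_)                            # the support vectors
print(w_from_duals, clf.coef_)                         # the two should match
```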


SLIDE 30

The primal representation

The data items x = (x1…xn) have n features.
The weight vector w = (w1…wn) has n elements.

Learning: Find a weight wj for each feature xj.
Classification: Evaluate wx.


SLIDE 31

The dual representation

Learning: Find a weight αj ( ≥ 0) for each data point xj

This requires computing the inner product xixj between all pairs of data items xi and xj

Support vectors = the set of data points xj with non-zero weights αj


w = ∑j αj xj

SLIDE 32

Classifying test data with SVM

In the primal: Compute the inner product between the weight vector and the test item:

wx = 〈w, x〉

In the dual: Compute inner products between the support vectors and the test item:

wx = 〈w, x〉 = 〈∑j αj xj, x〉 = ∑j αj 〈xj, x〉
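A small sketch of both scoring routes on assumed toy support vectors and coefficients (here the labels yj are folded into the coefficients, as in the theorem on slide 29):

```python
import numpy as np

# A minimal sketch: scoring a test point in the primal (w.x) and in the dual
# (sum_j alpha_j <x_j, x>), using assumed toy support vectors and coefficients.
support_vectors = np.array([[2.0, 2.0], [-1.0, -1.0]])
alphas = np.array([0.25, -0.25])     # labels y_j folded into the coefficients
w = alphas @ support_vectors         # w = sum_j alpha_j x_j

x_test = np.array([1.0, 3.0])
primal_score = w @ x_test
dual_score = np.sum(alphas * (support_vectors @ x_test))
print(primal_score, dual_score)      # identical scores
```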


SLIDE 33

Dealing with outliers: Soft margins

SLIDE 34

Dealing with outliers: Slack variables ξi

ξi measures by how much example (xi, yi) fails to achieve margin δ


SLIDE 35

Dealing with outliers: Slack variables ξi

If xi is on the correct side of the margin: ξi = 0
Otherwise: ξi = |yi − wxi|

If ξi = 1: xi is on the decision boundary wxi = 0
If ξi > 1: xi is misclassified

Replace y(n)(wx(n) + b) ≥ 1 (hard margin)
with y(n)(wx(n) + b) ≥ 1 − ξ(n) (soft margin)


SLIDE 36

Soft margins

ξi (slack): how far off is xi from the margin?
C (cost): how much do we have to pay for misclassifying xi?

We want to minimize C∑i ξi and maximize the margin.
C controls the tradeoff between margin and training error.


argminw ½ w⋅w + C ∑i=1..n ξi

subject to ξi ≥ 0 ∀i
and yi(w⋅xi + b) ≥ 1 − ξi ∀i

SLIDE 37

Soft SVMs

Now the optimization problem becomes:

minw ½‖w‖² + C ∑(x,y)∈S max(0, 1 − y·wx)

where the parameter C controls the tradeoff between choosing a large margin (small ‖w‖) and choosing a small hinge loss.
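To make the objective concrete, here is a short sketch that evaluates it for two candidate weight vectors on assumed toy data (the bias b is included in the score, following the earlier slides):

```python
import numpy as np

# A minimal sketch: evaluate the soft-SVM objective
#   1/2 ||w||^2 + C * sum max(0, 1 - y (w.x + b))
# on assumed toy data, for a given (w, b).

def soft_svm_objective(w, b, X, y, C):
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * w @ w + C * hinge.sum()

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

print(soft_svm_objective(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))
print(soft_svm_objective(np.array([5.0, 5.0]), 0.0, X, y, C=1.0))  # larger ||w||, still zero hinge loss
```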


SLIDE 38

Training SVMs

Traditional approach: Solve the quadratic program.
– This is very slow.

Current approaches: Use variants of stochastic gradient descent or coordinate descent (a rough sketch follows below).
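Here is a rough sketch of one such approach: stochastic sub-gradient descent on the regularized hinge-loss objective, in the style of Pegasos. The step-size schedule, the toy data, and folding the bias into a constant feature are my own simplifications, not details from the lecture:

```python
import numpy as np

# A rough sketch of stochastic sub-gradient descent on
#   lambda/2 ||w||^2 + (1/m) sum max(0, 1 - y w.x)
# (Pegasos-style; hyperparameters and data are assumed for illustration).

def sgd_svm(X, y, lam=0.1, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(m):
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            if y[i] * (w @ X[i]) < 1:      # hinge loss active: sub-gradient has a data term
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                          # only the regularizer contributes
                w = (1 - eta * lam) * w
    return w

X = np.array([[2.0, 2.0, 1.0], [3.0, 1.0, 1.0], [-1.0, -1.0, 1.0], [-2.0, 0.0, 1.0]])
y = np.array([+1, +1, -1, -1])
w = sgd_svm(X, y)
print(w, np.sign(X @ w))   # learned weights and training predictions
```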

More on Tuesday!
