
Non-Parametric Methods and Support Vector Machines

Shan-Hung Wu

shwu@cs.nthu.edu.tw

Department of Computer Science, National Tsing Hua University, Taiwan

Machine Learning


Outline

1. Non-Parametric Methods: K-NN, Parzen Windows, Local Models

2. Support Vector Machines: SVC, Slacks, Nonlinear SVC, Dual Problem, Kernel Trick



K-NN Methods I

The K-nearest neighbor (K-NN) methods are a straightforward, but fundamentally different, way to predict the label of a data point x:

1. Choose the number K and a distance metric
2. Find the K nearest neighbors of the given point x
3. Predict the label of x by the majority vote (in classification) or the average (in regression) of the neighbors' labels

Distance metric? E.g., the Euclidean distance $d(x^{(i)}, x) = \|x^{(i)} - x\|$

Training algorithm? Simply "remember" the training set X in storage
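As a minimal illustration of these three steps (my own sketch, not from the slides; `knn_predict`, `X_train`, and `y_train` are hypothetical names), a NumPy version of lazy K-NN prediction with the Euclidean distance might look like this:

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=5, classify=True):
    """Lazy K-NN prediction for a single query point x (illustrative sketch)."""
    # Steps 1-2: Euclidean distances to all stored examples, then the K nearest.
    dists = np.linalg.norm(X_train - x, axis=1)
    nn_labels = y_train[np.argsort(dists)[:k]]
    if classify:
        # Step 3 (classification): majority vote among the K neighbors.
        values, counts = np.unique(nn_labels, return_counts=True)
        return values[np.argmax(counts)]
    # Step 3 (regression): average of the neighbors' labels.
    return nn_labels.mean()

# "Training" is just remembering the data:
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([2.5, 2.5]), X_train, y_train, k=3))  # -> 1
```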

K-NN Methods II

The resulting decision boundary can be very complex. K is a hyperparameter controlling the model complexity.


Non-Parametric Methods

The K-NN method is a special case of non-parametric (or memory-based) methods.

Non-parametric in the sense that f is not described by only a few parameters. Memory-based in that all data (rather than just parameters) need to be memorized during the training process.

K-NN is also a lazy method, since the prediction function f is obtained only right before the prediction.

This motivates the development of other local models.


Pros & Cons

Pros:

Almost no assumption on f other than smoothness
High capacity/complexity; high accuracy given a large training set
Supports online training (by simply memorizing)
Readily extensible to multi-class and regression problems

Cons:

Storage demanding
Sensitive to outliers
Sensitive to irrelevant data features (vs. decision trees)
Needs to deal with missing data (e.g., via special distances)
Computationally expensive: O(ND) time for making each prediction; can be sped up with indexing and/or approximation



Parzen Windows and Kernels

Binary K-NN classifier: $f(x) = \mathrm{sign}\big(\sum_{i:\, x^{(i)} \in \mathrm{KNN}(x)} y^{(i)}\big)$

The "radius" of the voter boundary depends on the input x.

We can instead use the Parzen window with a fixed radius: $f(x) = \mathrm{sign}\big(\sum_i y^{(i)}\, \mathbb{1}(\|x^{(i)} - x\| \le R)\big)$

Parzen windows can also replace the hard boundary with a soft one: $f(x) = \mathrm{sign}\big(\sum_i y^{(i)}\, k(x^{(i)}, x)\big)$, where $k(x^{(i)}, x)$ is a radial basis function (RBF) kernel whose value decreases radially outward from x.
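A NumPy sketch contrasting the two classifiers above (my own illustration; function names and labels in {-1, +1} are assumptions):

```python
import numpy as np

def parzen_hard(x, X, y, R=1.0):
    # Hard window: only examples within radius R of x get to vote.
    inside = np.linalg.norm(X - x, axis=1) <= R
    return np.sign(np.sum(y[inside]))

def parzen_soft(x, X, y, gamma=1.0):
    # Soft window: every example votes, weighted by a Gaussian RBF kernel.
    weights = np.exp(-gamma * np.sum((X - x) ** 2, axis=1))
    return np.sign(np.sum(y * weights))
```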



Common RBF Kernels

How to act like a soft K-NN? Gaussian RBF kernel: $k(x^{(i)}, x) = \mathcal{N}(x^{(i)} - x;\, 0, \sigma^2 I)$, or simply $k(x^{(i)}, x) = \exp\big(-\gamma \|x^{(i)} - x\|^2\big)$.

$\gamma \ge 0$ (or $\sigma^2$) is a hyperparameter controlling the smoothness of f.
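A small numeric check (my own, with hypothetical values) that the exponential form is the Gaussian density up to a normalizing constant when $\gamma = 1/(2\sigma^2)$, so both forms induce the same smoothness behavior:

```python
import numpy as np
from scipy.stats import multivariate_normal

a, b = np.array([1.0, 2.0]), np.array([0.5, -1.0])
sigma2 = 0.8
gamma = 1.0 / (2 * sigma2)   # exp-form parameter matching the Gaussian form

k_exp = np.exp(-gamma * np.sum((a - b) ** 2))
k_pdf = multivariate_normal.pdf(a - b, mean=np.zeros(2), cov=sigma2 * np.eye(2))

# The two forms agree up to the 2-D normalizing constant 1 / (2*pi*sigma2):
print(np.isclose(k_pdf, k_exp / (2 * np.pi * sigma2)))  # True
```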



Locally Weighted Linear Regression

In addition to majority voting and averaging, we can define local models for lazy predictions.

E.g., in (eager) linear regression, we find $w \in \mathbb{R}^{D+1}$ that minimizes the SSE: $\arg\min_w \sum_i (y^{(i)} - w^\top x^{(i)})^2$

Local model: find the w minimizing the SSE local to the point x we want to predict: $\arg\min_w \sum_i k(x^{(i)}, x)\,(y^{(i)} - w^\top x^{(i)})^2$, where $k(\cdot,\cdot) \in \mathbb{R}$ is an RBF kernel.
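A sketch of this local model using the closed-form weighted least-squares solution (the closed form is my addition; the slide only states the objective, and all names are hypothetical):

```python
import numpy as np

def lwlr_predict(x, X, y, gamma=1.0):
    """Fit a linear model weighted around the query x, then predict at x."""
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])   # augment with a bias column
    xa = np.concatenate([[1.0], x])
    # RBF weights: examples near x count more in the local SSE.
    k = np.exp(-gamma * np.sum((X - x) ** 2, axis=1))
    W = np.diag(k)
    # Weighted least squares: w = (Xa^T W Xa)^{-1} Xa^T W y
    w = np.linalg.solve(Xa.T @ W @ Xa, Xa.T @ W @ y)
    return xa @ w
```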



Kernel Machines

Kernel machines: $f(x) = \sum_{i=1}^{N} c_i\, k(x^{(i)}, x) + c_0$

For example:
Parzen windows: $c_i = y^{(i)}$ and $c_0 = 0$
Locally weighted linear regression: $c_i = (y^{(i)} - w^\top x^{(i)})^2$ and $c_0 = 0$

The variable $c \in \mathbb{R}^N$ can be learned in either an eager or a lazy manner. Pros: complex, but highly accurate if regularized well.
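A generic predictor in this form (a sketch of mine; the coefficients c and c0 are placeholders to be set by whichever method is in use):

```python
import numpy as np

def kernel_machine(x, X, c, c0, kernel):
    """f(x) = sum_i c_i * k(x_i, x) + c_0 for any kernel function k."""
    return sum(ci * kernel(xi, x) for ci, xi in zip(c, X)) + c0

rbf = lambda a, b, gamma=1.0: np.exp(-gamma * np.sum((a - b) ** 2))

# Parzen-window classifier as a kernel machine: c_i = y_i, c_0 = 0.
X = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0])
print(np.sign(kernel_machine(np.array([0.9, 0.9]), X, c=y, c0=0.0, kernel=rbf)))
```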


Sparse Kernel Machines

To make a prediction, we need to store all examples. This may be infeasible due to a large dataset (N), a time limit, or a space limit.

Can we make c sparse? I.e., make $c_i \ne 0$ for only a small fraction of the examples, called the support vectors. How?



Separating Hyperplane I

Model: $\mathcal{F} = \{f : f(x; w, b) = w^\top x + b\}$, a collection of hyperplanes

Prediction: $\hat{y} = \mathrm{sign}(f(x))$

Training: find w and b such that
$w^\top x^{(i)} + b \ge 0$ if $y^{(i)} = 1$, and
$w^\top x^{(i)} + b \le 0$ if $y^{(i)} = -1$;
or simply $y^{(i)}(w^\top x^{(i)} + b) \ge 0$.

Separating Hyperplane II

There are many feasible w's and b's when the classes are linearly separable. Which hyperplane is the best?


Support Vector Classification

The support vector classifier (SVC) picks the hyperplane with the largest margin:

$y^{(i)}(w^\top x^{(i)} + b) \ge a$ for all i. Margin: $2a/\|w\|$ [Homework]

Without loss of generality, we let a = 1 and solve the problem:
$\arg\min_{w,b}\ \tfrac{1}{2}\|w\|^2$
subject to $y^{(i)}(w^\top x^{(i)} + b) \ge 1, \forall i$
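A sketch of this hard-margin problem solved with CVXPY (my choice of general-purpose solver, not part of the lecture); the toy data and all names are assumptions, and the classes must be linearly separable for the problem to be feasible:

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))            # minimize ||w||^2 / 2
constraints = [cp.multiply(y, X @ w + b) >= 1]               # y_i (w^T x_i + b) >= 1
cp.Problem(objective, constraints).solve()

print(w.value, b.value)   # maximum-margin separating hyperplane
```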



Overlapping Classes

In practice, classes may overlap, due to, e.g., noise or outliers.

The problem
$\arg\min_{w,b}\ \tfrac{1}{2}\|w\|^2$
subject to $y^{(i)}(w^\top x^{(i)} + b) \ge 1, \forall i$
has no solution in this case. How to fix this?

Slacks

SVC tolerates slacks: examples that fall outside the regions where they ought to be. Problem:
$\arg\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i$
subject to $y^{(i)}(w^\top x^{(i)} + b) \ge 1 - \xi_i$ and $\xi_i \ge 0, \forall i$

This favors a large margin but also fewer slacks.


Hyperparameter C

$\arg\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i$

The hyperparameter C controls the tradeoff between maximizing the margin and minimizing the number of slacks.

It also provides a geometric explanation of weight decay (see the restatement below).
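To make the weight-decay connection explicit (my own restatement, not on the slide): at the optimum each slack equals the hinge loss, $\xi_i = \max(0,\ 1 - y^{(i)}(w^\top x^{(i)} + b))$, so the soft-margin problem can be rewritten as the unconstrained problem

$\min_{w,b}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} \max\big(0,\ 1 - y^{(i)}(w^\top x^{(i)} + b)\big),$

i.e., hinge loss plus an L2 penalty on w: a small C acts like strong weight decay (a wide margin), while a large C emphasizes fitting the training data.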



Nonlinearly Separable Classes

In practice, classes may be nonlinearly separable. SVC (with slacks) then gives "bad" hyperplanes due to underfitting. How to make it nonlinear?


Feature Augmentation

Recall that in polynomial regression, we augment the data features to make a linear regressor nonlinear. We can define a function $\Phi(\cdot)$ that maps each data point to a high-dimensional space:
$\arg\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i$
subject to $y^{(i)}(w^\top \Phi(x^{(i)}) + b) \ge 1 - \xi_i$ and $\xi_i \ge 0, \forall i$
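A sketch of one explicit degree-2 feature map for $x \in \mathbb{R}^2$ (the helper name is hypothetical); a hyperplane in this 6-dimensional space corresponds to a quadratic boundary in the original space:

```python
import numpy as np

def phi_degree2(x):
    """Explicit degree-2 polynomial feature map for x in R^2 (illustrative)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

print(phi_degree2(np.array([1.0, 2.0])))
```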



Time Complexity

Nonlinear SVC:
$\arg\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i$
subject to $y^{(i)}(w^\top \Phi(x^{(i)}) + b) \ge 1 - \xi_i$ and $\xi_i \ge 0, \forall i$

The higher the augmented feature dimension, the more variables in w to solve for. Can we solve for w in a time complexity that is independent of the mapped dimension?


Dual Problem

Primal problem:
$\arg\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i$
subject to $y^{(i)}(w^\top \Phi(x^{(i)}) + b) \ge 1 - \xi_i$ and $\xi_i \ge 0, \forall i$

Dual problem: $\arg\max_{\alpha,\beta} \min_{w,b,\xi} L(w,b,\xi,\alpha,\beta)$ subject to $\alpha \ge 0, \beta \ge 0$, where
$L(w,b,\xi,\alpha,\beta) = \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i + \sum_i \alpha_i\big(1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i\big) + \sum_i \beta_i(-\xi_i)$

The primal problem is convex, so strong duality holds.


Solving Dual Problem I

$L(w,b,\xi,\alpha,\beta) = \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i + \sum_i \alpha_i\big(1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i\big) + \sum_i \beta_i(-\xi_i)$

The inner problem $\min_{w,b,\xi} L(w,b,\xi,\alpha,\beta)$ is convex in terms of w, b, and $\xi$. Let's solve it analytically:

$\partial L/\partial w = w - \sum_i \alpha_i y^{(i)} \Phi(x^{(i)}) = 0 \;\Rightarrow\; w = \sum_i \alpha_i y^{(i)} \Phi(x^{(i)})$
$\partial L/\partial b = -\sum_i \alpha_i y^{(i)} = 0$
$\partial L/\partial \xi_i = C - \alpha_i - \beta_i = 0 \;\Rightarrow\; \beta_i = C - \alpha_i$


Solving Dual Problem II

$L(w,b,\xi,\alpha,\beta) = \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i + \sum_i \alpha_i\big(1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i\big) + \sum_i \beta_i(-\xi_i)$

Substituting $w = \sum_i \alpha_i y^{(i)} \Phi(x^{(i)})$ and $\beta_i = C - \alpha_i$ into $L(w,b,\xi,\alpha,\beta)$:
$L(w,b,\xi,\alpha,\beta) = \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \Phi(x^{(i)})^\top \Phi(x^{(j)}) - b\sum_i \alpha_i y^{(i)}$,

$\min_{w,b,\xi} L(w,b,\xi,\alpha,\beta) = \begin{cases} \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \Phi(x^{(i)})^\top \Phi(x^{(j)}), & \text{if } \sum_i \alpha_i y^{(i)} = 0,\\ -\infty, & \text{otherwise} \end{cases}$

Outer maximization problem:
$\arg\max_\alpha\ \mathbf{1}^\top \alpha - \tfrac{1}{2}\alpha^\top K \alpha$
subject to $0 \le \alpha \le C\mathbf{1}$ and $y^\top \alpha = 0$,
where $K_{i,j} = y^{(i)} y^{(j)} \Phi(x^{(i)})^\top \Phi(x^{(j)})$

$\beta_i = C - \alpha_i \ge 0$ implies $\alpha_i \le C$.


Solving Dual Problem II

Dual minimization problem of SVC:
$\arg\min_\alpha\ \tfrac{1}{2}\alpha^\top K \alpha - \mathbf{1}^\top \alpha$
subject to $0 \le \alpha \le C\mathbf{1}$ and $y^\top \alpha = 0$

Number of variables to solve for? N, instead of the augmented feature dimension.

In practice, this problem is solved by specialized solvers such as sequential minimal optimization (SMO) [3], as K is usually ill-conditioned.
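A sketch solving this dual QP with CVXPY as a general-purpose stand-in for SMO (my own choice, not the lecture's); K, y, and C are assumed given, and a tiny ridge is added to K because it is often ill-conditioned:

```python
import cvxpy as cp
import numpy as np

def solve_svc_dual(K, y, C):
    """Solve the SVC dual: min 0.5*a^T K a - 1^T a, s.t. 0 <= a <= C, y^T a = 0."""
    N = K.shape[0]
    K_reg = K + 1e-8 * np.eye(N)      # small ridge for numerical stability
    alpha = cp.Variable(N)
    objective = cp.Minimize(0.5 * cp.quad_form(alpha, K_reg) - cp.sum(alpha))
    constraints = [alpha >= 0, alpha <= C,
                   cp.sum(cp.multiply(y, alpha)) == 0]
    cp.Problem(objective, constraints).solve()
    return alpha.value
```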


Making Predictions

Prediction: $\hat{y} = \mathrm{sign}(f(x)) = \mathrm{sign}(w^\top \Phi(x) + b)$

We have $w = \sum_i \alpha_i y^{(i)} \Phi(x^{(i)})$. How to obtain b?

By the complementary slackness of the KKT conditions, we have
$\alpha_i\big(1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i\big) = 0$ and $\beta_i \xi_i = 0$.

For any $x^{(i)}$ with $0 < \alpha_i < C$, we have $\beta_i = C - \alpha_i > 0 \Rightarrow \xi_i = 0$, so $1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i = 0 \Rightarrow b = y^{(i)} - w^\top \Phi(x^{(i)})$.

In practice, we usually take the average over all $x^{(i)}$'s with $0 < \alpha_i < C$ to avoid numeric error.
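A sketch (hypothetical names) of recovering b by averaging over the free support vectors, given the solved alphas and a kernel function:

```python
import numpy as np

def recover_bias(alpha, X, y, C, kernel, tol=1e-6):
    """b = average over free SVs (0 < alpha_i < C) of y_i - sum_j alpha_j y_j k(x_j, x_i)."""
    free = np.where((alpha > tol) & (alpha < C - tol))[0]
    bs = []
    for i in free:
        wx = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(len(y)))
        bs.append(y[i] - wx)
    return np.mean(bs)
```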



Kernel as Inner Product

We need to evaluate $\Phi(x^{(i)})^\top \Phi(x^{(j)})$ when
solving the dual problem of SVC, where $K_{i,j} = y^{(i)} y^{(j)} \Phi(x^{(i)})^\top \Phi(x^{(j)})$, and
making a prediction, where $f(x) = w^\top \Phi(x) + b = \sum_i \alpha_i y^{(i)} \Phi(x^{(i)})^\top \Phi(x) + b$.

Time complexity? If we choose $\Phi$ carefully, we can evaluate $\Phi(x^{(i)})^\top \Phi(x) = k(x^{(i)}, x)$ efficiently.

Polynomial kernel: $k(a,b) = (a^\top b/\alpha + \beta)^\gamma$
E.g., let $\alpha = 1$, $\beta = 1$, $\gamma = 2$, and $a \in \mathbb{R}^2$; then $\Phi(a) = [1, \sqrt{2}a_1, \sqrt{2}a_2, a_1^2, a_2^2, \sqrt{2}a_1 a_2]^\top \in \mathbb{R}^6$

Gaussian RBF kernel: $k(a,b) = \exp(-\gamma\|a-b\|^2)$, $\gamma \ge 0$
$k(a,b) = \exp(-\gamma\|a\|^2 + 2\gamma a^\top b - \gamma\|b\|^2) = \exp(-\gamma\|a\|^2 - \gamma\|b\|^2)\Big(1 + \frac{2\gamma a^\top b}{1!} + \frac{(2\gamma a^\top b)^2}{2!} + \cdots\Big)$
Let $a \in \mathbb{R}^2$; then $\Phi(a) = \exp(-\gamma\|a\|^2)\Big[1,\ \sqrt{\tfrac{2\gamma}{1!}}a_1,\ \sqrt{\tfrac{2\gamma}{1!}}a_2,\ \sqrt{\tfrac{(2\gamma)^2}{2!}}a_1^2,\ \sqrt{\tfrac{(2\gamma)^2}{2!}}a_2^2,\ \sqrt{\tfrac{2(2\gamma)^2}{2!}}a_1 a_2,\ \cdots\Big]^\top \in \mathbb{R}^\infty$
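A quick numeric check (my own) that the degree-2 polynomial kernel with $\alpha = \beta = 1$ equals the inner product of the explicit 6-dimensional map given above:

```python
import numpy as np

a, b = np.array([1.0, 2.0]), np.array([-0.5, 3.0])

def phi(v):
    v1, v2 = v
    return np.array([1, np.sqrt(2)*v1, np.sqrt(2)*v2, v1**2, v2**2, np.sqrt(2)*v1*v2])

k_direct = (a @ b / 1.0 + 1.0) ** 2   # polynomial kernel, alpha = beta = 1, gamma = 2
k_mapped = phi(a) @ phi(b)            # explicit feature-map inner product
print(np.isclose(k_direct, k_mapped)) # True: O(D) kernel evaluation vs O(D^2) features
```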



Kernel Trick

If we choose a $\Phi$ induced by a polynomial or Gaussian RBF kernel, then $K_{i,j} = y^{(i)} y^{(j)} k(x^{(i)}, x^{(j)})$ takes only O(D) time to evaluate, and $f(x) = \sum_i \alpha_i y^{(i)} k(x^{(i)}, x) + b$ takes O(ND) time.

Both are independent of the augmented feature dimension. $\alpha$, $\beta$, and $\gamma$ are new hyperparameters.


Sparse Kernel Machines

SVC is a kernel machine: $f(x) = \sum_i \alpha_i y^{(i)} k(x^{(i)}, x) + b$. It is perhaps surprising that SVC works like K-NN in some sense.

However, SVC is a sparse kernel machine: only a small set of examples, the support vectors ($\alpha_i > 0$), contribute to f.


KKT Conditions and Types of SVs

By the KKT conditions, we have:
Primal feasibility: $y^{(i)}(w^\top \Phi(x^{(i)}) + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$
Complementary slackness: $\alpha_i\big(1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i\big) = 0$ and $\beta_i \xi_i = 0$

Depending on the value of $\alpha_i$, each example $x^{(i)}$ can be:

Non-SV ($\alpha_i = 0$): $y^{(i)}(w^\top \Phi(x^{(i)}) + b) \ge 1$ (usually strict)
$1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i \le 0$; since $\beta_i = C - \alpha_i \ne 0$, we have $\xi_i = 0$

Free SV ($0 < \alpha_i < C$): $y^{(i)}(w^\top \Phi(x^{(i)}) + b) = 1$
$1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i = 0$; since $\beta_i = C - \alpha_i \ne 0$, we have $\xi_i = 0$

Bounded SV ($\alpha_i = C$): $y^{(i)}(w^\top \Phi(x^{(i)}) + b) \le 1$ (usually strict)
$1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i = 0$; since $\beta_i = 0$, we have $\xi_i \ge 0$
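A small helper (hypothetical, with a tolerance because solvers return approximate alphas) that buckets examples into the three types described above from their solved $\alpha_i$ values:

```python
import numpy as np

def sv_types(alpha, C, tol=1e-6):
    """Classify each example by alpha_i: non-SV, free SV, or bounded SV."""
    types = np.empty(len(alpha), dtype=object)
    types[alpha <= tol] = "non-SV"                        # alpha_i = 0
    types[(alpha > tol) & (alpha < C - tol)] = "free SV"  # 0 < alpha_i < C (on the margin)
    types[alpha >= C - tol] = "bounded SV"                # alpha_i = C (may violate the margin)
    return types
```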


Remarks I

Pros of SVC:

Global optimality (convex problem)
Works with different kernels (linear, polynomial, Gaussian RBF, etc.)
Works well with a small training set

Cons:

Nonlinear SVC is not scalable to large tasks: it takes O(N^2) to O(N^3) time to train using SMO in LIBSVM [1]; linear SVC, on the other hand, takes O(ND) time
The kernel matrix K requires O(N^2) space; in practice, we cache only a small portion of K in memory
Sensitive to irrelevant data features (vs. decision trees)
Non-trivial hyperparameter tuning: the effect of a (C, γ) combination is unknown in advance, so tuning is usually done by grid search (see the sketch after this list)
Separates only 2 classes; usually wrapped by the 1-vs-1 technique for multi-class classification
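A sketch of the usual grid search over (C, γ) using scikit-learn (an external library, not part of the lecture); the toy dataset and grid values are placeholders:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # toy nonlinear data

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # the (C, gamma) combination chosen by cross-validation
```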


Remarks II

Does nonlinear SVC always perform better than linear SVC? No. Choose linear SVC (e.g., LIBLINEAR [2]) when
N is large (since nonlinear SVC does not scale), or
D is large (since the classes may already be linearly separable).

Reference I

[1] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

[2] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874, 2008.

[3] John Platt et al. Sequential minimal optimization: A fast algorithm for training support vector machines. 1998.