Class #11: Kernel Functions & SVMs, II

Machine Learning (COMP 135): M. Allen, 09 Oct. 19

Review: Support Vector Machines (SVMs)

1. Start with a labeled data set:

   {(x1, y1), (x2, y2), . . . , (xn, yn)}   [ ∀i, yi ∈ {+1, −1} ]

2. Solve the constrained quadratic optimization problem:

   Maximize:   W(α) = Σi αi − (1/2) Σi,j αi αj yi yj (xi · xj)

   while satisfying the constraints:   ∀i, αi ≥ 0   and   Σi αi yi = 0

3. Derive the necessary weights and bias for the decision separator when and if needed:

   w = Σi αi yi xi

   b = −(1/2) ( max{i | yi = −1} (w · xi) + min{j | yj = +1} (w · xj) )
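As a concrete illustration, here is a minimal sketch of steps 1–3 using scikit-learn's SVC (an assumption; the lecture prescribes no particular library). The fitted model exposes the dual solution directly: dual_coef_ holds the products αi yi for the support vectors, from which w is recovered.

```python
# Minimal sketch of the SVM pipeline above, assuming scikit-learn is available.
import numpy as np
from sklearn.svm import SVC

# 1. A toy labeled data set with yi in {+1, -1}
X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 0.5],
              [4.0, 4.5], [5.0, 4.0], [4.5, 5.5]])
y = np.array([-1, -1, -1, +1, +1, +1])

# 2. Solve the quadratic optimization problem (done internally by fit);
#    a large C approximates the hard-margin problem shown above.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# 3. Recover the weights: dual_coef_ stores alpha_i * y_i for the
#    support vectors only, so w = sum_i alpha_i yi xi becomes:
w = clf.dual_coef_ @ clf.support_vectors_
print("w =", w.ravel(), " b =", clf.intercept_[0])
```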

Retaining the Support Vectors

} After computing the various optimizing 𝛽 values, the SVM

typically ends up with:

1.

A large number of data points xi with 𝛽i = 0

2.

A few special data points xj with 𝛽j ≠ 0

} These special points, the support vectors, can be used by

themselves to compute necessary weights and biases

} Often, the SVM keeps a list of these vectors, for computation

  • f later classification functions, rather than the weights defining

the classification boundary directly

Wednesday, 9 Oct. 2019 Machine Learning (COMP 135) 3

Large amount of data Large amount of data

Why Retain the Support Vectors?

• The αi values are 0 everywhere except at the support vectors (the points closest to the separator).

• Retaining only the support vectors comes in handy if we ever want to update the decision boundary as new data comes in for classification.

[Figure: a large amount of data plotted over axes x1 and x2; αi = 0 for the many points far from the separator, while αj ≠ 0 only for the support vectors nearest it.]


Why Retain the Support Vectors?

• If the new data remains close to the old boundary, then we can compute the new α-values using only the new data and the previous support vectors.

[Figure: the same data set over x1 and x2, with new points (marked "??") arriving near the old boundary; only the previous support vectors (αj ≠ 0) and the new points need be considered.]

Why Retain the Support Vectors?

• In such scenarios, the data for which α = 0 before remains that way, and never needs to be reconsidered when solving the (compute-intensive) optimization step of the SVM.

[Figure: over x1 and x2, only three points (the previous support vectors and the new arrivals) have their α-values re-computed.]
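A hedged sketch of this incremental idea, again assuming scikit-learn; the function name and the refit-on-support-vectors heuristic are illustrative, not an exact online-SVM algorithm:

```python
# Illustrative sketch (not an exact online-SVM algorithm): refit using only
# the previous support vectors plus newly arrived points, since every other
# old point had alpha = 0 in the previous solution.
import numpy as np
from sklearn.svm import SVC

def incremental_refit(old_clf, X_old, y_old, X_new, y_new):
    sv_idx = old_clf.support_                    # indices of old support vectors
    X_fit = np.vstack([X_old[sv_idx], X_new])
    y_fit = np.concatenate([y_old[sv_idx], y_new])
    new_clf = SVC(kernel=old_clf.kernel, C=old_clf.C, gamma=old_clf.gamma)
    new_clf.fit(X_fit, y_fit)                    # QP over far fewer points
    return new_clf
```

If the new data stays close to the old boundary, this refit recovers (approximately) the same separator as re-solving over the full data set, at much lower cost.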

Why Retain the Support Vectors?

• Another reason to retain vectors rather than weights is that SVMs are often used with kernel functions that:

  1. Transform the data
  2. Compute the necessary dot-products of points

  k(x, z) = ϕ(x) · ϕ(z)   (ϕ : Rⁿ → Rᵐ)

• Furthermore, there are some popular such functions where the data transform translates n-dimensional data to m-dimensional data, with n << m.

• In such cases, storing the original n-dimensional data, and then computing the transformation when necessary, can be much more efficient than trying to store the m-dimensional weight information.

• This is especially true in cases where m = ∞ (!!)
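A sketch of why the support vectors suffice at classification time: the decision value can be written entirely in kernel evaluations against the support vectors, f(x) = Σj αj yj k(xj, x) + b, so the m-dimensional (possibly infinite-dimensional) w is never formed. The array names and the choice of the Gaussian RBF below are illustrative assumptions.

```python
# Sketch: classify with retained support vectors and a kernel, never
# forming w explicitly. Assumes alpha_y[j] = alpha_j * y_j and b come
# from a previously solved dual problem (names are illustrative).
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # Gaussian RBF: k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    x, z = np.asarray(x, float), np.asarray(z, float)
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

def classify(x, support_vectors, alpha_y, b, sigma=1.0):
    # f(x) = sum_j alpha_j y_j k(x_j, x) + b, over support vectors only
    f = sum(ay * rbf_kernel(sv, x, sigma)
            for sv, ay in zip(support_vectors, alpha_y)) + b
    return +1 if f >= 0 else -1
```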

Pros and Cons of SVMs

• [+] Compared to linear classifiers like logistic regression, SVMs:

  1. Are insensitive to outliers in the data (extreme class examples)
  2. Give a robust boundary for separable classes
  3. Can handle high-dimensional data, via transformation
  4. Can find optimal α-values, with no local maxima

• [–] Compared to linear classifiers like logistic regression, SVMs:

  1. Are less applicable in multi-class (c > 2) instances
  2. Require more complex tuning, via hyper-parameter selection
  3. May require some deep thinking or experimentation in order to select the appropriate kernel functions

Gaussian Radial Basis Function (RBF)

• A popular kernel with many uses is the Gaussian RBF:

  k(x, z) = e^(−||x − z||² / 2σ²)

[Image source: https://www.cs.toronto.edu/~duvenaud/cookbook/]

• The RBF is based on a distance from a central focal point, z.

• The distance can be measured in a variety of ways, but is often Euclidean:

  ||x − z|| = √( Σi=1..n (xi − zi)² )
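A tiny numeric check of the two formulas above (the points and σ are arbitrary illustrative values):

```python
# Tiny numeric check of the RBF formulas above; x, z, sigma are arbitrary.
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])
sigma = 1.0

dist = np.sqrt(np.sum((x - z) ** 2))       # ||x - z|| = sqrt(1 + 4) ≈ 2.236
k = np.exp(-dist ** 2 / (2 * sigma ** 2))  # e^(-5/2) ≈ 0.082
print(dist, k)
```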

Gaussian Radial Basis Function

• The value of the function is highest at point z itself:

  ||x − z|| = 0   ⟹   k(x, z) = e⁰ = 1

• The value drops to 0 as we get further from z:

  ||x − z|| → ∞   ⟹   k(x, z) → e^(−∞) = 0

Gaussian Radial Basis Function

• The tuning parameter σ controls the diameter of the non-zero area.

• If σ gets larger, the non-zero area will become wider:

  σ → ∞   ⟹   k(x, z) → e⁰ = 1

• If σ gets smaller, the non-zero area will become narrower:

  σ → 0   ⟹   k(x, z) → e^(−∞) = 0

• The radius around the focal point z at which the function becomes (effectively) 0 corresponds to the decision boundary in our data.

[Figure: a circular decision region over axes x1 and x2, centered on the focal point z.]
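For intuition in code: scikit-learn parameterizes the RBF as exp(−γ||x − z||²), so γ = 1/(2σ²). The sketch below (illustrative σ values; the fit call is left commented since X_train/y_train are placeholders) shows the mapping:

```python
# How sigma maps onto scikit-learn's RBF parameter gamma:
# sklearn computes k(x, z) = exp(-gamma * ||x - z||^2), i.e. gamma = 1/(2 sigma^2).
# Larger sigma -> smaller gamma -> wider non-zero regions.
from sklearn.svm import SVC

for sigma in (0.1, 1.0, 10.0):
    gamma = 1.0 / (2.0 * sigma ** 2)
    clf = SVC(kernel="rbf", gamma=gamma)
    # clf.fit(X_train, y_train)  # X_train/y_train: your data (placeholders)
    print(f"sigma = {sigma:5.1f}  ->  gamma = {gamma:8.3f}")
```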

Gaussian Radial Basis Function

• We can deal with multiple clusters in the data by using a combination of multiple RBFs:

  k(x, z1, z2, z3) = Σj=1..3 e^(−||x − zj||² / 2σ²)

[Figure: over axes x1 and x2, one circular decision region per focal point zj.]
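A hedged sketch of that multi-focal-point combination (the centers and σ below are illustrative):

```python
# Sum of Gaussian bumps, one per focal point; centers and sigma are illustrative.
import numpy as np

def multi_rbf(x, centers, sigma=1.0):
    x = np.asarray(x, dtype=float)
    return sum(np.exp(-np.linalg.norm(x - np.asarray(z)) ** 2 / (2 * sigma ** 2))
               for z in centers)

centers = [[0.0, 0.0], [3.0, 3.0], [6.0, 0.0]]
print(multi_rbf([0.1, 0.1], centers))    # near the first center  -> close to 1
print(multi_rbf([10.0, 10.0], centers))  # far from every center  -> close to 0
```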

Next Week

• Topics: SVMs and Feature Engineering
• Meetings: Tuesday and Wednesday, usual time
• Readings: Linked from class website schedule page
  • Includes original paper (Brown, et al.) for discussion
• Homework 03: due Wednesday, 16 October, 9:00 AM
• Project 01: out Tuesday; due Monday, 04 November, 9:00 AM
• Office Hours: 237 Halligan, Tuesday, 11:00 AM – 1:00 PM
  • TA hours can be found on class website as well