
Class #13: Support Vector Machines (SVMs) and Kernel Functions

Machine Learning (COMP 135): M. Allen, 02 Mar. 20


Data Separation

- Linear classification with a perceptron or logistic function looks for a dividing line in the data (or a plane, or some other linearly defined structure)
- Often multiple such lines are possible
- Essentially, the algorithms are indifferent: they don’t care which line we pick
- In the example seen here, either classification line separates the data perfectly well

[Figure: two different separating lines over the same data; axes x1 and x2]

“Fragile” Separation

- As more data comes in, these classifiers may start to fail
- A separator that is too close to one cluster or the other now makes mistakes
- This may happen even if the new data follows the same distribution seen in the training set

[Figure: the same separating lines after new data arrives; one now misclassifies some of the new points; axes x1 and x2]

“Robust” Separation

- What we want is a large margin separator: a separation that has the largest distance possible from each part of our data-set
- This will often give much better performance when used on new data

[Figure: a large margin separator between the two clusters; axes x1 and x2]


Large Margin Separation

- A new learning problem: find the separator with the largest margin
- This will be measured from the data points, on opposite sides, that are closest together

This is sometimes called the “widest road” approach. A support vector machine (SVM) is a technique that finds this road. The points that define the edges of the road are known as the support vectors.

[Figure: the widest-road separator, with the support vectors lying on its edges; axes x1 and x2]

Linear Classifiers and SVMs

Linear classifier:
  Weight equation:    w · x = w0 + w1x1 + w2x2 + · · · + wnxn
  Threshold function: hw(x) = 1 if w · x ≥ 0, else 0

SVM:
  Weight equation:    w · x + b = (w1x1 + w2x2 + · · · + wnxn) + b
  Threshold function: hw(x) = +1 if w · x ≥ 0, else −1
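To make the comparison concrete, here is a minimal numpy sketch of the two threshold functions; the weights, bias, and data point are illustrative values, not anything from the lecture:

```python
import numpy as np

def linear_threshold(w, x):
    # Perceptron / logistic-style classifier: w[0] acts as the bias w0,
    # so a constant 1 is prepended to the feature vector x.
    return 1 if np.dot(w, np.concatenate(([1.0], x))) >= 0 else 0

def svm_threshold(w, b, x):
    # SVM-style classifier: bias b is kept separate, labels are {+1, -1}.
    return +1 if np.dot(w, x) + b >= 0 else -1

x = np.array([2.0, 0.5])                                  # toy data point
print(linear_threshold(np.array([0.25, 0.5, -1.0]), x))   # prints 1
print(svm_threshold(np.array([0.5, -1.0]), 0.25, x))      # prints 1 (the +1 class)
```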

Large Margin Separation

- Like a linear classifier, the SVM separates at the line where its learned vector of weights is zero:

  w · x + b = 0

- A key difference: the SVM is going to do this without learning and remembering the weight vector w. Instead, it will use features of the data-items themselves.

[Figure: the separating line w · x + b = 0, with the +1 class on one side and the −1 class on the other; axes x1 and x2]
Mathematics of SVMs

- It turns out that the weight-vector w for the largest margin separator has some important properties relative to the closest data-points on each side (x+ and x−):

  w · x+ + b = +1
  w · x− + b = −1
  w · (x+ − x−) = 2
  (w / ||w||) · (x+ − x−) = 2 / ||w||

  where ||w|| = √(w1² + w2² + · · · + wn²)

[Figure: the closest points x+ and x− on opposite sides of the margin; axes x1 and x2]
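A quick numeric check of these relations, using a hypothetical separator (w, b) and closest points x+ and x− chosen so the two margin equations hold exactly:

```python
import numpy as np

w = np.array([1.0, 1.0])        # hypothetical weight vector
b = -3.0                        # hypothetical bias
x_pos = np.array([2.0, 2.0])    # closest point on the +1 side
x_neg = np.array([1.0, 1.0])    # closest point on the -1 side

print(np.dot(w, x_pos) + b)      # +1.0
print(np.dot(w, x_neg) + b)      # -1.0
print(np.dot(w, x_pos - x_neg))  # 2.0

margin = np.dot(w / np.linalg.norm(w), x_pos - x_neg)
print(margin, 2 / np.linalg.norm(w))   # both equal 2 / ||w|| (about 1.414 here)
```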


Mathematics of SVMs

- Through the magic of mathematics (Lagrangian multipliers, to be specific), we can derive a quadratic programming problem

1. We start with our data-set:

   {(x1, y1), (x2, y2), . . . , (xn, yn)}    [ ∀i, yi ∈ {+1, −1} ]

2. We then solve a constrained optimization problem:

   W(α) = Σi αi − (1/2) Σi,j αi αj yi yj (xi · xj)

   subject to:  ∀i, αi ≥ 0    and    Σi αi yi = 0

The goal: based on the known values (xi, yi), find the values we don’t know (the αi) that:
1. Maximize the value of W(α)
2. Satisfy the two numerical constraints
(A small numpy illustration of W(α) and its constraints follows.)
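As a concrete (if tiny) illustration, the dual objective W(α) and its two constraints can be written in a few lines of numpy; actually maximizing W(α) would be handed to a quadratic programming solver, which is not shown here. All data values below are made up for the example:

```python
import numpy as np

def dual_objective(alpha, X, y):
    # W(alpha) = sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j (x_i . x_j)
    G = (X @ X.T) * np.outer(y, y)      # G[i, j] = y_i * y_j * (x_i . x_j)
    return alpha.sum() - 0.5 * alpha @ G @ alpha

def constraints_satisfied(alpha, y, tol=1e-8):
    # Constraint 1: alpha_i >= 0 for all i.  Constraint 2: sum_i alpha_i y_i = 0.
    return bool(np.all(alpha >= -tol) and abs(alpha @ y) < tol)

X = np.array([[1.0, 1.0], [2.0, 2.0], [2.0, 0.0]])   # toy data-points
y = np.array([-1.0, +1.0, +1.0])                     # their labels
alpha = np.array([1.0, 0.5, 0.5])                    # one candidate set of multipliers

print(dual_objective(alpha, X, y))
print(constraints_satisfied(alpha, y))               # True for this candidate
```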

Mathematics of SVMs

- Although complex, a constrained optimization problem like this can be solved algorithmically to get the αi values we want:

   W(α) = Σi αi − (1/2) Σi,j αi αj yi yj (xi · xj)
   subject to:  ∀i, αi ≥ 0    and    Σi αi yi = 0

- Once done, we can find the weight-vector and bias term if we want:

   w = Σi αi yi xi
   b = −(1/2) ( max{i | yi = −1} w · xi  +  min{j | yj = +1} w · xj )

A note about notation: these equations involve two different, necessary products:
1. The usual application of weights to points:  w · xi = w1 xi,1 + w2 xi,2 + · · · + wn xi,n
2. Products of points with other points:  xi · xj = xi,1 xj,1 + xi,2 xj,2 + · · · + xi,n xj,n
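Given the αi from the solver, both formulas above are essentially one line of numpy each; this is a sketch with illustrative variable names, reusing the toy values from before:

```python
import numpy as np

def recover_w_and_b(alpha, X, y):
    # w = sum_i alpha_i * y_i * x_i
    w = (alpha * y) @ X
    # b = -1/2 * ( max over {i : y_i = -1} of w . x_i  +  min over {j : y_j = +1} of w . x_j )
    b = -0.5 * (np.max(X[y == -1] @ w) + np.min(X[y == +1] @ w))
    return w, b

# Toy values only; in practice alpha comes from solving the quadratic program above.
X = np.array([[1.0, 1.0], [2.0, 2.0], [2.0, 0.0]])
y = np.array([-1.0, +1.0, +1.0])
alpha = np.array([1.0, 0.5, 0.5])
print(recover_w_and_b(alpha, X, y))
```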

The Dual Formulation

- It turns out that we don’t need to use the weights at all
- Instead, we can simply use the αi values directly:

   w · xi + b = Σj αj yj (xi · xj) + b

What we usually look for in a parametric method: the weights w and offset b defining the classifier. What we can use instead: we compute an equivalent result based upon the α parameters, the outputs y, and products between data-points themselves (along with the standard offset).

The Dual Formulation

- Now, if we had to sum over every data-point, as on the right-hand side of this equation, this would look very bad for a large data-set:

   w · xi + b = Σj αj yj (xi · xj) + b

- It turns out that these αi values have a special property, however, that makes it feasible to use them as part of our classification function…


Sparseness of SVMs

- The αi values are 0 everywhere except at the support vectors (the points closest to the separator)

So, when we do the calculation

   Σj αj yj (xi · xj) + b

we only have to sum over points xj that are in the set of support vectors, ignoring all others, since the related α values are all 0. Thus, an SVM need only remember and use the values for a few support vectors, not those for all the rest of the data.

[Figure: only the support vectors (marked α+ and α−) have nonzero multipliers; for every other point αi = 0; axes x1 and x2]
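sklearn's SVC exposes exactly this sparsity after fitting: it stores only the support vectors and their associated coefficients (the αi yi values), not the whole training set. A sketch on a small made-up data set with two well-separated clusters:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),    # cluster for class -1
               rng.normal(+2.0, 0.5, size=(50, 2))])   # cluster for class +1
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(len(clf.support_vectors_), "support vectors out of", len(X), "points")
print(clf.dual_coef_)   # the nonzero alpha_i * y_i values, one per support vector
```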

Hard and Soft Margins

- We have slightly simplified one detail of how most SVMs actually work
- It is not always true that the support vectors lie on the margins, with nothing else in between them
  - This is only true in the hard margin case
- SVMs can have soft margins instead (and usually do) to deal with noisier data
  - We weaken the requirement that no points lie between the margins
  - All points within the margins then become the support vectors for classification

Hard and Soft Margins

- Next lecture, we will see how to vary the margin strength in sklearn
- SVM models come with a regularization parameter (C, as in the case of other classifiers) that can be (see the sketch below):
  1. Increased to enforce a harder margin
  2. Decreased to allow a softer margin
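Here is a minimal sklearn sketch of that knob on made-up, overlapping data; the exact support-vector counts will vary, but a smaller C (softer margin) generally keeps more of the data as support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])   # noisy, overlapping classes
y = np.array([0] * 100 + [1] * 100)

for C in (0.01, 1.0, 100.0):                 # small C = softer margin, large C = harder margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C = {C}: {clf.n_support_.sum()} support vectors")
```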

Another Nice Trick

- The calculation uses dot-products of data-points with each other (instead of with weights):

   Σj αj yj (xi · xj) + b

- This will allow us to deal with data that is not linearly separable

Using a kernel “trick”, we can find a function that transforms the data into another form, where it is actually possible to separate it in a linear manner.

[Figure: data that no straight line can separate; axes x1 and x2]


Transforming Non-Separable Data

- If data is not linearly separable, we can transform it
- We change the features used to represent our data
- Really, we don’t care what the data features are, so long as we can get classification to work

A transformation function ϕ(x), with ϕ : Rn → Rm, maps data-vectors to new vectors of either the same dimensionality (m = n) or a different one (m ≠ n).
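A sketch of one such transformation, the quadratic map ϕ(x1, x2) = (x1², x2², √2 x1x2) used on the following slides: data separated by a circle in the original two features becomes linearly separable in the three transformed features, because the squared radius x1² + x2² is just the sum of the first two of them. The points below are made up for the example:

```python
import numpy as np

def phi(x):
    # Map R^2 -> R^3: (x1, x2) -> (x1^2, x2^2, sqrt(2) * x1 * x2)
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

inside  = np.array([[0.3, 0.4], [-0.2, 0.5], [0.1, -0.6]])   # inside the unit circle
outside = np.array([[1.5, 0.2], [-1.2, 1.1], [0.3, -1.8]])   # outside the unit circle

# After the transform, the plane z1 + z2 = 1 (i.e. x1^2 + x2^2 = 1) separates the groups.
print(np.array([phi(p) for p in inside]))
print(np.array([phi(p) for p in outside]))
```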

Transforming Non-Separable Data

Applying such a transformation ϕ : Rn → Rm to every data-point, the SVM calculation becomes:

   Σj αj yj (ϕ(xi) · ϕ(xj)) + b

[Figure: the original data-points and their transformed images ϕ(·); axes x1 and x2]

The “Kernel Trick”

   ϕ(x1, x2) = (x1², x2², √2 x1x2)

[Figure: (a) data in the original (x1, x2) space, where it is not linearly separable; (b) the same data plotted over the transformed features x1², x2², and √2 x1x2, where a linear separator exists]

Image source: Russell & Norvig, AI: A Modern Approach (Prentice Hall, 2010)

Simplifying the Transformation Function

- We can derive a simpler (2-dimensional) equation, equivalent to the dot-product needed when doing SVM computations in the transformed (3-dimensional) space:

   ϕ(x) · ϕ(z) = (x1², x2², √2 x1x2) · (z1², z2², √2 z1z2)
               = x1² z1² + x2² z2² + √2 x1x2 √2 z1z2
               = x1² z1² + x2² z2² + 2 x1x2 z1z2
               = (x1z1 + x2z2)²
               = (x · z)²

   Needed: 10 multiplications and 2 additions (working in the transformed space)
   Used instead: 3 multiplications and 1 addition (the (x · z)² form)
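The algebra above is easy to check numerically; this sketch compares the "needed" route (transform both points with ϕ, then take the 3-dimensional dot product) against the cheaper (x · z)² form, for arbitrary made-up points:

```python
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([0.7, -1.3])
z = np.array([2.0, 0.4])

via_transform = np.dot(phi(x), phi(z))   # dot product in the 3-D transformed space
via_kernel = np.dot(x, z) ** 2           # (x . z)^2, computed in the original 2-D space
print(via_transform, via_kernel)         # the two values agree (0.7744 for these points)
```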


The Kernel Function

- This final function (the right-hand side below) is what the SVM will actually use to compute dot-products in its equations:

   k(x, z) = ϕ(x) · ϕ(z) = (x · z)²

- This is called the kernel function
- To make SVMs really useful, we look for a kernel that (a short sklearn sketch follows):
  1. Separates the data usefully
  2. Is relatively efficient to calculate
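In sklearn this particular kernel is the degree-2 polynomial kernel (with gamma = 1 and coef0 = 0); it can also be supplied directly as a callable that returns the matrix of pairwise kernel values. A sketch on a made-up "inner disc vs. outer ring" data set, which no straight line can separate:

```python
import numpy as np
from sklearn.svm import SVC

def quadratic_kernel(A, B):
    # k(x, z) = (x . z)^2, computed for every pair of rows of A and B (the Gram matrix)
    return (A @ B.T) ** 2

rng = np.random.default_rng(2)
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
radii = np.concatenate([rng.uniform(0.0, 1.0, 100),     # inner disc
                        rng.uniform(2.0, 3.0, 100)])    # outer ring
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([0] * 100 + [1] * 100)

clf = SVC(kernel=quadratic_kernel).fit(X, y)   # equivalent to kernel="poly", degree=2, gamma=1, coef0=0
print(clf.score(X, y))                         # training accuracy; near 1.0 for this data
```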

Another Reason to Use Kernel Functions

- The SVM formulation generally uses dot-products of data-points (perhaps run through some kernel) rather than the standard product of features and weights:

   w · xi + b = Σj αj yj (ϕ(xi) · ϕ(xj)) + b

- There are cases where the kernel-data approach is possible, but the weights-based one is not
  - Some useful kernels that are easy to compute correspond to weight-equations applied to very high dimensional transforms of the original data
  - In some common cases, the equivalent weight vectors (in the transformed space) are infinite-dimensional, and simply cannot be used in computation
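The standard example of that last point is the RBF (Gaussian) kernel, k(x, z) = exp(−γ ||x − z||²): its implicit feature map is infinite-dimensional, so the weight vector can never be written out, yet the kernel value itself is cheap to compute and is sklearn's default. A sketch on made-up data with a circular decision boundary:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # label by distance from the origin

# kernel="rbf" is the default; no explicit weight vector w exists for it,
# so classification relies entirely on kernel values with the support vectors.
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print(clf.score(X, y))
```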

This Week & Next

- Topics: SVMs and kernels
- Readings: linked from the class website schedule page
- Assignments:
  - Project 01: due Monday, 09 March, 5:00 PM
    - Feature engineering and classification for image data
  - Midterm Exam: Wednesday, 11 March
    - Practice exam distributed by end of this week
    - Review session in class, Monday, 09 March
- Office Hours: 237 Halligan
  - Monday, 10:30 AM – Noon
  - Tuesday, 9:00 AM – 1:00 PM
  - TA hours can be found on the class website