Class #13: Support Vector Machines (SVMs) and Kernel Functions
Machine Learning (COMP 135): M. Allen, 2 Mar. 2020


Data Separation
- Linear classification with a perceptron or logistic function looks for a dividing line in the data (or a plane, or some other linearly defined structure).
- Often multiple lines are possible.
- Essentially, the algorithms are indifferent: they don't care which line we pick.
- In the example seen here, either classification line separates the data perfectly well.

[Figure: two classes plotted on axes x_1 and x_2, with two different lines that each separate them.]

"Fragile" vs. "Robust" Separation
- As more data comes in, these classifiers may start to fail: a separator that is too close to one cluster or the other now makes mistakes.
- This may happen even if the new data follows the same distribution seen in the training set.
- What we want is a large margin separator: a separation that has the largest distance possible from each part of our data-set.
- This will often give much better performance when used on new data.

[Figure: the same two classes with new data added; the "fragile" separator, close to one cluster, misclassifies some of the new points, while the "robust" large-margin separator does not.]
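To make the "indifference" point concrete, here is a minimal Python sketch with a made-up 2-D toy set: two quite different linear separators both classify the training data perfectly, and nothing in a plain threshold-at-zero decision rule prefers one over the other.

```python
import numpy as np

# Made-up 2-D toy data: two well-separated clusters, labeled +1 and -1.
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],         # class +1
              [-1.0, -1.0], [-1.5, -0.5], [-2.0, -1.5]])  # class -1
y = np.array([+1, +1, +1, -1, -1, -1])

def linear_classify(w, b, X):
    """Threshold w . x + b at zero, returning +1 / -1 labels."""
    return np.where(X @ w + b >= 0, 1, -1)

# Two quite different separating lines: both fit the training set perfectly,
# so a perceptron-style learner has no reason to prefer one over the other.
for w, b in [(np.array([1.0, 0.0]), 0.0),    # near-vertical boundary
             (np.array([0.3, 1.0]), 0.0)]:   # tilted boundary
    acc = np.mean(linear_classify(w, b, X) == y)
    print("w =", w, " b =", b, " training accuracy:", acc)
```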

Large Margin Separation
- A new learning problem: find the separator with the largest margin.
- The margin is measured from the data points, on opposite sides, that are closest together.
- This is sometimes called the "widest road" approach.
- A support vector machine (SVM) is a technique that finds this road. The points that define the edges of the road are known as the support vectors.

[Figure: two classes on axes x_1 and x_2, with the maximum-margin separator and the "road" whose edges pass through the support vectors.]

Linear Classifiers and SVMs
- Linear classifier:
      Weight equation:     w · x = w_0 + w_1 x_1 + w_2 x_2 + · · · + w_n x_n
      Threshold function:  h_w(x) = 1 if w · x ≥ 0, and 0 if w · x < 0
- SVM:
      Weight equation:     w · x + b = (w_1 x_1 + w_2 x_2 + · · · + w_n x_n) + b
      Threshold function:  h_w(x) = +1 if w · x + b ≥ 0, and −1 if w · x + b < 0

Large Margin Separation (continued)
- Like a linear classifier, the SVM separates at the line where its learned vector of weights is zero: w · x + b = 0.
- A key difference: the SVM is going to do this without learning and remembering the weight vector w. Instead, it will use features of the data-items themselves.

[Figure: the separating line w · x + b = 0, with the closest point x+ on the +1 side and the closest point x− on the −1 side.]

Mathematics of SVMs
- It turns out that the weight-vector w for the largest margin separator has some important properties relative to the closest data-points on each side (x+ and x−):

      w · x+ + b = +1
      w · x− + b = −1
      w · (x+ − x−) = 2
      (w / ||w||) · (x+ − x−) = 2 / ||w||

  where ||w|| = √(w_1² + w_2² + · · · + w_n²), so the width of the margin is 2 / ||w||.
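The two threshold rules and the 2 / ||w|| margin width translate directly into code. The following is a minimal sketch; the weight vector w and offset b here are made-up values used only to exercise the formulas, not learned parameters.

```python
import numpy as np

def h_linear(w, x):
    """Perceptron-style threshold from the slides: 1 if w . x >= 0, else 0."""
    return 1 if np.dot(w, x) >= 0 else 0

def h_svm(w, b, x):
    """SVM-style threshold: +1 if w . x + b >= 0, else -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

def margin_width(w):
    """Width of the 'road' between w . x + b = +1 and w . x + b = -1, i.e. 2 / ||w||."""
    return 2.0 / np.linalg.norm(w)

# Hypothetical parameters, just to exercise the formulas.
w = np.array([0.5, -0.5])
b = 0.25
print(h_svm(w, b, np.array([2.0, 1.0])))  # +1, since 0.5*2 - 0.5*1 + 0.25 = 0.75 >= 0
print(margin_width(w))                    # 2 / sqrt(0.5), roughly 2.83
```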

Mathematics of SVMs (continued)
- Through the magic of mathematics (Lagrangian multipliers, to be specific), we can derive a quadratic programming problem.

1. We start with our data-set:

      {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)}    [∀i, y_i ∈ {+1, −1}]

2. We then solve a constrained optimization problem:

      W(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)

      subject to:  ∀i, α_i ≥ 0    and    Σ_i α_i y_i = 0

The goal: based on the known values (x_i, y_i), find the values we don't know (the α_i) that:
1. Maximize the value of the margin objective W(α)
2. Satisfy the two numerical constraints

A note about notation: these equations involve two different, necessary products:
1. The usual application of weights to points:
      w · x_i = w_1 x_{i,1} + w_2 x_{i,2} + · · · + w_n x_{i,n}
2. Products of points with other points:
      x_i · x_j = x_{i,1} x_{j,1} + x_{i,2} x_{j,2} + · · · + x_{i,n} x_{j,n}

- Although complex, a constrained optimization problem like this can be solved algorithmically to get the α_i values we want.
- Once done, we can find the weight-vector and bias term if we want:

      w = Σ_i α_i y_i x_i
      b = −(1/2) ( max_{i | y_i = −1} w · x_i  +  min_{j | y_j = +1} w · x_j )
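As a rough illustration, the sketch below hands the objective W(α) and its two constraints to SciPy's general-purpose SLSQP solver; real SVM packages use specialized quadratic-programming or SMO solvers instead, so this is only a didactic stand-in. The toy data are made up, and w and b are recovered with the formulas from the slide.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny, linearly separable toy data (made up), with labels in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
n = len(y)

# G[i, j] = y_i y_j (x_i . x_j), the matrix inside the quadratic term of W(alpha).
G = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_W(alpha):
    # SciPy minimizes, so we minimize -W(alpha) in order to maximize W(alpha).
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

constraints = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * n                              # alpha_i >= 0 for all i
result = minimize(neg_W, x0=np.zeros(n), bounds=bounds,
                  constraints=constraints, method="SLSQP")
alpha = result.x

# Recover the weight-vector and bias term as on the slide.
w = (alpha * y) @ X
b = -0.5 * ((X[y == -1] @ w).max() + (X[y == +1] @ w).min())
print("alpha:", np.round(alpha, 3), " w:", w, " b:", b)
```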
The Dual Formulation
- It turns out that we don't need to use the weights at all. Instead, we can simply use the α_i values directly:

      w · x_i + b = Σ_j α_j y_j (x_j · x_i) + b

- What we usually look for in a parametric method: the weights, w, and offset, b, defining the classifier.
- What we can use instead: an equivalent result computed from the α parameters, the outputs y, and products between data-points themselves (along with the standard offset).
- Now, if we had to sum over every data-point, as on the right-hand side of this equation, this would look very bad for a large data-set.
- It turns out that these α_i values have a special property, however, that makes it feasible to use them as part of our classification function...
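The equivalence of the two forms is easy to check: if w is built as Σ_i α_i y_i x_i, then w · x + b and Σ_j α_j y_j (x_j · x) + b give the same number for any x. In the sketch below the α values are merely illustrative (they satisfy the two constraints, but are not claimed to be the optimum for this data), and the offset b is simply assumed to be 0.

```python
import numpy as np

# Same toy data as above; the alpha values are illustrative only.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
alpha = np.array([0.125, 0.0, 0.125, 0.0])
b = 0.0  # assumed offset, for the demonstration only

def primal_decision(x_new, w, b):
    """Parametric form: w . x + b."""
    return w @ x_new + b

def dual_decision(x_new, alpha, y, X, b):
    """Dual form from the slide: sum_j alpha_j y_j (x_j . x_new) + b.
    Only points with alpha_j > 0 contribute to the sum."""
    sv = alpha > 0
    return np.sum(alpha[sv] * y[sv] * (X[sv] @ x_new)) + b

w = (alpha * y) @ X            # w = sum_i alpha_i y_i x_i
x_new = np.array([1.0, 0.5])
print(primal_decision(x_new, w, b))           # both lines print the same value,
print(dual_decision(x_new, alpha, y, X, b))   # since w is built from the alphas
```

The special property hinted at on the slide is that, at the optimum, α_j is nonzero only for the support vectors, so the dual-form sum stays short even when the training set is large.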
