Support Vector Manifold Learning for Solving Regression Problems via Clustering


  1. Support Vector Manifold Learning for Solving Regression Problems via Clustering Marcin Orchel AGH University of Science and Technology in Poland 1 / 29

  2. 2 / 29

  3. [Figure: panel (a), data points plotted in the (x, y) plane.] 3 / 29

  4. New kernel:
  (φ(x_i) + y_i t φ(c))^T (φ(x_j) + y_j t φ(c))   (1)
  = φ(x_i)^T φ(x_j) + y_j t φ(x_i)^T φ(c) + y_i t φ(c)^T φ(x_j) + y_i y_j t^2 φ(c)^T φ(c)   (2)-(3)
  = K(x_i, x_j) + y_j t K(x_i, c) + y_i t K(c, x_j) + y_i y_j t^2 K(c, c).   (4)-(5)
  Cross kernel:
  (φ(x_i) + y_i t φ(c))^T φ(x) = φ(x_i)^T φ(x) + y_i t φ(c)^T φ(x),   (6)-(7)
  so K(x_i, x) + y_i t K(c, x).   (8)
  The number of support vectors is at most n + 1. 4 / 29
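As an illustration, here is a minimal NumPy sketch of the shifted ("new") kernel (4)-(5) and the cross kernel (8). The RBF base kernel and the toy values of c, t and the points are illustrative choices, not taken from the slides.

```python
# Sketch of the shifted kernel (4)-(5) and the cross kernel (8); rbf, c, t and
# the toy points below are illustrative assumptions.
import numpy as np

def rbf(a, b, sigma=1.0):
    """Base RBF kernel K(a, b) between two vectors."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def shifted_kernel(xi, xj, yi, yj, c, t, kernel=rbf):
    """Kernel between points shifted by y*t*phi(c) in feature space, eqs. (4)-(5)."""
    return (kernel(xi, xj)
            + yj * t * kernel(xi, c)
            + yi * t * kernel(c, xj)
            + yi * yj * t ** 2 * kernel(c, c))

def cross_kernel(xi, x, yi, c, t, kernel=rbf):
    """Kernel between a shifted training point and an unshifted test point, eq. (8)."""
    return kernel(xi, x) + yi * t * kernel(c, x)

# toy usage
xi, xj, x = np.array([0.1, 0.2]), np.array([0.3, 0.4]), np.array([0.0, 0.0])
c, t = np.array([0.1, 0.0]), 0.005
print(shifted_kernel(xi, xj, +1, -1, c, t))
print(cross_kernel(xi, x, +1, c, t))
```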

  5. Proposition 1. Shifting a hyperplane by any vector c gives a new hyperplane that differs from the original only in the free term b.
  Lemma 1. After duplicating an (n − 1)-dimensional hyperplane constrained by an n-dimensional hypersphere and shifting it, the maximal distance from the original center of the hypersphere to any point of the shifted hyperplane is attained at a point whose projection onto the (n − 1)-dimensional hyperplane (before the shift) gives a vector from 0 that is parallel to the vector from 0 to the projected center of one of the new hyperspheres (the shifted hyperplane). 5 / 29

  6. Proposition 2. The radius R_n of the minimal hypersphere containing all points after shifting is
  R_n = ‖c + R c_m / ‖c_m‖‖,   (9)
  where c_m is defined as
  c_m = c − ((b + w · c) / ‖w‖^2) w   (10)
  and ‖c_m‖ ≠ 0. For ‖c_m‖ = 0, we get R_n = ‖c‖ = ‖c_p‖. 6 / 29
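A small numeric sketch of Proposition 2, assuming concrete illustrative values for c, w, b and the original radius R; it computes c_m from (10) and R_n from (9), including the special case ‖c_m‖ = 0.

```python
# Sketch of Proposition 2; c, w, b and R below are illustrative values.
import numpy as np

def shifted_radius(c, w, b, R):
    c_m = c - ((b + w @ c) / (w @ w)) * w          # eq. (10)
    norm_cm = np.linalg.norm(c_m)
    if norm_cm == 0.0:                             # special case ||c_m|| = 0
        return np.linalg.norm(c)                   # R_n = ||c|| = ||c_p||
    return np.linalg.norm(c + R * c_m / norm_cm)   # eq. (9)

c = np.array([0.5, 0.5])
w = np.array([1.0, -1.0])
b, R = 0.1, 1.0
print(shifted_radius(c, w, b, R))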

  7. Proposition 3. Consider hyperplanes w_c · x = 0, where w_c is normalized so that they are in canonical form with respect to a set of points A = {x_1, ..., x_n}, that is,
  min_i |w_c · x_i| = 1.   (11)
  The set of decision functions f_w(x) = sgn(x · w_c) defined on A and satisfying the constraint ‖w_c‖ ≤ D has a Vapnik-Chervonenkis (VC) dimension h satisfying
  h ≤ min(R^2 D^2, m) + 1,   (12)
  where R is the radius of the smallest sphere centered at the origin and containing A. 7 / 29
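A minimal sketch of how the bound (12) can be evaluated. The toy point set, the direction w, and taking m as the input dimension are assumptions for illustration only; w is rescaled so that the canonical-form condition (11) holds.

```python
# Sketch of the VC-dimension bound of Proposition 3: h <= min(R^2 D^2, m) + 1.
import numpy as np

A = np.array([[1.0, 0.5], [-0.5, 1.5], [2.0, -1.0]])   # toy point set
w = np.array([0.8, -0.3])                               # toy direction

w_c = w / np.min(np.abs(A @ w))          # canonical form: min_i |w_c . x_i| = 1, eq. (11)
D = np.linalg.norm(w_c)                  # take D = ||w_c|| so the constraint is tight
R = np.max(np.linalg.norm(A, axis=1))    # smallest origin-centered sphere containing A
m = A.shape[1]                           # m taken here as the input dimension (assumption)

print(min(R ** 2 * D ** 2, m) + 1)       # upper bound on h, eq. (12)
```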

  8. We can improve the generalization bound when
  D^2 ‖c + R c_m / ‖c_m‖‖^2 / (1 + D ‖c_p‖)^2 ≤ R^2 D^2,   (13)
  that is, when
  ‖c + R c_m / ‖c_m‖‖^2 / (1 + D ‖c_p‖)^2 ≤ R^2.   (14)
  For the special case ‖c_m‖ = 0, we get
  ‖c_p‖^2 / (1 + D ‖c_p‖)^2 ≤ R^2.   (15)
  8 / 29
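A tiny check of the special-case condition (15), with illustrative values for c_p (the projection of the shift point), D and R.

```python
# Sketch: evaluate the special-case improvement condition (15).
import numpy as np

c_p = np.array([0.4, 0.3])   # illustrative projected shift point
D, R = 2.0, 1.0              # illustrative bound constant and radius

lhs = np.linalg.norm(c_p) ** 2 / (1.0 + D * np.linalg.norm(c_p)) ** 2
print(lhs <= R ** 2)         # True -> the generalization bound is improved
```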

  9. Performance measure. For OCSVM, the distance between a point r and the minimal hypersphere in a kernel-induced feature space can be computed as
  R − (∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j K(x_i, x_j) − 2 ∑_{j=1}^{n} α_j K(x_j, r) + K(r, r))^{1/2}.   (16)-(17)
  For kernels for which K(x, x) is constant, such as the radial basis function (RBF) kernel, the radius R can be computed as
  R = (K(x, x) + ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j K(x_i, x_j) + 2 b*)^{1/2}.   (18)
  9 / 29
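A minimal NumPy sketch of the OCSVM measure (16)-(18). The data, the dual coefficients alphas and the offset b_star are placeholders rather than values obtained from an actual OCSVM training run.

```python
# Sketch of the OCSVM performance measure, eqs. (16)-(18), written purely in
# kernel values; X, alphas, r and b_star are illustrative placeholders.
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def sphere_distance(X, alphas, r, b_star, kernel=rbf):
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    aKa = alphas @ K @ alphas
    # eq. (18): radius for kernels with constant K(x, x), e.g. the RBF kernel
    R = np.sqrt(kernel(X[0], X[0]) + aKa + 2.0 * b_star)
    # eqs. (16)-(17): distance of r from the sphere surface
    k_r = np.array([kernel(xj, r) for xj in X])
    return R - np.sqrt(aKa - 2.0 * alphas @ k_r + kernel(r, r))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
alphas = np.array([0.5, 0.25, 0.25])   # would come from the OCSVM dual
print(sphere_distance(X, alphas, r=np.array([0.2, 0.2]), b_star=-0.4))
```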

  10. Performance measure. For SVML, the distance between a point r and the hyperplane in a kernel-induced feature space can be computed as
  |w_c · r + b_c| / ‖w_c‖   (19)
  = |∑_{i=1}^{n_c} y_i α*_i K(x_i, r) + b_c| / (∑_{i=1}^{n_c} ∑_{j=1}^{n_c} y_i y_j α*_i α*_j K(x_i, x_j))^{1/2}.   (20)
  10 / 29
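A matching sketch for the SVML measure (19)-(20), again with placeholder data, labels and dual coefficients.

```python
# Sketch of the SVML performance measure, eqs. (19)-(20): distance of a point r
# from the decision hyperplane in feature space, expressed through kernel values.
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def hyperplane_distance(X, y, alphas, b_c, r, kernel=rbf):
    k_r = np.array([kernel(xi, r) for xi in X])
    numerator = abs(np.sum(y * alphas * k_r) + b_c)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    norm_w = np.sqrt((y * alphas) @ K @ (y * alphas))   # ||w_c|| in feature space
    return numerator / norm_w

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alphas = np.array([0.7, 0.3, 0.6, 0.4])   # would come from the SVC dual
print(hyperplane_distance(X, y, alphas, b_c=0.05, r=np.array([0.5, 0.2])))
```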

  11. [Figure: two panels (a) and (b), data plotted in the (x, y) plane.] 11 / 29

  12. Fig. 3: Clustering based on curve learning. Points: examples. (a) Solid line: solution of SVCL. (b) Solid line: solution of SVMLC. (c) Solid line: solution of KPCA. 12 / 29

  13. Fig. 4: Regression via clustering. (a) Clustering with SVCL. (b) Clustering with SVMLC. (c) Corresponding two regression functions for (a). (d) Corresponding two regression functions for (b). 13 / 29

  14. Fig. 5: Regression via clustering. (a) Clustering with SVCL. (b) Clustering with SVMLC. (c) Corresponding two regression functions for (a). (d) Corresponding two regression functions for (b). 14 / 29

  15. Goal
  - Develop a method for dimensionality reduction based on support vector machines (SVM).
  - Reduce dimensionality by fitting a curve to data given as plain vectors (neither classification nor regression data).
  - It can be seen as a generalization of regression: regression fits a function to data, curve fitting fits a curve to data.
  - Idea: duplicate the points, shift them in a kernel space, and use support vector classification (SVC).
  - Use recursive dimensionality reduction for a linear decision boundary in kernel space: project the points onto the solution curve and repeat all steps.
  - It could also be used for clustering, similarly to self-organizing maps.
  - It could be used for visualization.
  15 / 29
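A sketch of the procedure listed above, under the assumption that the shifted kernel (4)-(5) is plugged into a standard SVC with a precomputed Gram matrix. scikit-learn, the noisy-circle data and the parameter values are my illustrative choices, not the authors' implementation.

```python
# Outline: duplicate points, shift the duplicates via the shifted kernel of
# eqs. (4)-(5), train SVC on the precomputed kernel; the zero level set of the
# decision function approximates the fitted curve.
import numpy as np
from sklearn.svm import SVC

def rbf_matrix(A, B, sigma=1.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_curve(X, c, t=0.005, C=100.0, sigma=1.5):
    n = len(X)
    Xd = np.vstack([X, X])                      # duplicate every point
    y = np.hstack([np.ones(n), -np.ones(n)])    # opposite labels for the copies
    Kb = rbf_matrix(Xd, Xd, sigma)
    kc = rbf_matrix(Xd, c[None, :], sigma).ravel()
    kcc = rbf_matrix(c[None, :], c[None, :], sigma)[0, 0]
    # shifted kernel: K(xi,xj) + yj*t*K(xi,c) + yi*t*K(c,xj) + yi*yj*t^2*K(c,c)
    K = Kb + t * np.outer(kc, y) + t * np.outer(y, kc) + (t ** 2) * kcc * np.outer(y, y)
    clf = SVC(C=C, kernel="precomputed").fit(K, y)
    return clf, Xd, y

def curve_values(clf, Xd, y, X_test, c, t=0.005, sigma=1.5):
    # cross kernel: K(xi,x) + yi*t*K(c,x), eq. (8)
    Kb = rbf_matrix(X_test, Xd, sigma)
    kc_test = rbf_matrix(X_test, c[None, :], sigma).ravel()
    return clf.decision_function(Kb + t * np.outer(kc_test, y))

rng = np.random.default_rng(0)                  # toy data: a noisy circle
theta = rng.uniform(0.0, 2.0 * np.pi, 40)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((40, 2))
clf, Xd, y = fit_curve(X, c=np.array([0.1, 0.0]))
print(curve_values(clf, Xd, y, X[:5], c=np.array([0.1, 0.0])))
```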

  16. Shifting in kernel space
  Shifting points in kernel space:
  (φ(x_i) + y_i t φ(c))^T (φ(x_j) + y_j t φ(c)) = φ(x_i)^T φ(x_j) + y_j t φ(x_i)^T φ(c) + y_i t φ(c)^T φ(x_j) + y_i y_j t^2 φ(c)^T φ(c),   (21)-(22)
  where t is a translation parameter, c is a shifting point, and φ(·) is the feature map of some symmetric kernel.
  Cross kernel:
  (φ(x_i) + y_i t φ(c))^T φ(x) = φ(x_i)^T φ(x) + y_i t φ(c)^T φ(x).   (23)
  We preserve sparsity: for two duplicated points, with y_i = 1 and y_{i+size} = −1,
  y_i α_i (φ(x_i)^T φ(x) + t φ(c)^T φ(x)) + y_{i+size} α_{i+size} (φ(x_i)^T φ(x) + y_{i+size} t φ(c)^T φ(x))   (24)-(25)
  = (y_i α_i + y_{i+size} α_{i+size}) φ(x_i)^T φ(x) + (y_i α_i + α_{i+size}) t φ(c)^T φ(x).   (26)
  16 / 29

  17. Shifting in a kernel space, cont.
  The second term can be summed over all i:
  ∑_i (y_i α_i + α_{i+size}) t φ(c)^T φ(x).   (27)
  When α_i = α_{i+size} = C, this becomes
  2 C t φ(c)^T φ(x),   (28)
  so it is like adding an artificial point c to the solution curve with the coefficient 2Ct; the terms can be summed for multiple points.
  17 / 29
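A small sketch of the sparsity argument from (24)-(28): the two dual coefficients of a duplicated pair collapse into one coefficient multiplying K(x_i, x) and one shared shift term multiplying K(c, x), which equals 2Ct K(c, x) when both coefficients sit at C. All numbers are illustrative.

```python
# Sketch: contribution of one duplicated pair to the decision function.
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def decision_term(xi, x, c, t, alpha_i, alpha_dup, y_i=1.0, y_dup=-1.0):
    point_coef = y_i * alpha_i + y_dup * alpha_dup   # multiplies K(xi, x), eq. (26)
    shift_coef = (y_i * alpha_i + alpha_dup) * t     # multiplies K(c, x), eq. (26)
    return point_coef * rbf(xi, x) + shift_coef * rbf(c, x)

xi, x, c = np.array([0.2, 0.1]), np.array([0.0, 0.0]), np.array([0.1, 0.0])
C, t = 100.0, 0.005
print(decision_term(xi, x, c, t, alpha_i=C, alpha_dup=C))   # shift part = 2*C*t*K(c, x)
```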

  18. Shifting in kernel space
  - When the kernel is linear, we get δ support vector regression (δ-SVR).
  - Hypothesis: due to the linear decision boundary in kernel space, it does not matter how we choose the shifting point; for example, we can shift in only one direction, e.g. c = (0.0, 0.0, 1.0) in three dimensions.
  - The shifting strategy has already been tested in input space for regression in δ-SVR and works well.
  18 / 29

  19. Dimensionality reduction
  The parametric form of a straight line through the point φ(x_1) in the direction of w is l = φ(x_1) + t w. The point l must belong to the hyperplane, so after substituting,
  w^T l + b = 0,   (29)
  w^T (φ(x_1) + t w) + b = 0.   (30)
  We need to compute t:
  t = −(b + w^T φ(x_1)) / ‖w‖^2.   (31)
  After substituting t, we get the projected point
  z = φ(x_1) − ((b + w^T φ(x_1)) / ‖w‖^2) w.   (32)
  19 / 29
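A minimal sketch of the projection (29)-(32) for the linear case (φ the identity), with an illustrative w, b and x_1; the second printed value confirms that the projected point lies on the hyperplane.

```python
# Sketch of eqs. (29)-(32): project a point onto the hyperplane w.x + b = 0 along w.
import numpy as np

def project_onto_hyperplane(x1, w, b):
    t = -(b + w @ x1) / (w @ w)   # eq. (31)
    return x1 + t * w             # eq. (32): projected point z

w = np.array([1.0, 2.0])          # illustrative hyperplane normal
b = -1.0                          # illustrative free term
x1 = np.array([3.0, 0.5])         # illustrative point
z = project_onto_hyperplane(x1, w, b)
print(z, w @ z + b)               # the second value is ~0, so z lies on the hyperplane
```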

  20. Dimensionality reduction
  z_1 and z_2 are new points in the kernel space, so to compute a kernel we just compute their dot product:
  z_1^T z_2 = (φ(x_1) − ((b + w^T φ(x_1)) / ‖w‖^2) w)^T (φ(x_2) − ((b + w^T φ(x_2)) / ‖w‖^2) w)   (33)
  = φ(x_1)^T φ(x_2) − ((b + w^T φ(x_1)) / ‖w‖^2) w^T φ(x_2) − ((b + w^T φ(x_2)) / ‖w‖^2) φ(x_1)^T w + (b + w^T φ(x_1))(b + w^T φ(x_2)) / ‖w‖^2.   (34)-(35)
  20 / 29

  21. Dimensionality reduction
  Expanding and collecting terms,
  z_1^T z_2 = φ(x_1)^T φ(x_2) − (b w^T φ(x_1) + b w^T φ(x_2) + 2 w^T φ(x_1) w^T φ(x_2)) / ‖w‖^2 + (b^2 + b w^T φ(x_1) + b w^T φ(x_2) + w^T φ(x_1) w^T φ(x_2)) / ‖w‖^2   (36)-(39)
  = φ(x_1)^T φ(x_2) + (b^2 − w^T φ(x_1) w^T φ(x_2)) / ‖w‖^2.
  We use this iteratively: in the next reduction we use the kernel values from the previous reduction, and in the first iteration we use the shift kernel. The new w will be perpendicular to the previously computed w.
  21 / 29
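A sketch of the projected-point kernel, using the collapsed form derived above and expressing g(x) = w^T φ(x) and ‖w‖^2 through dual coefficients. The support vectors, labels, coefficients and b below are placeholders, not values from a trained model.

```python
# Sketch of the projected-point kernel: with g(x) = w.phi(x),
#   z1.z2 = K(x1, x2) + (b^2 - g(x1)*g(x2)) / ||w||^2   (collapsed eqs. (36)-(39)).
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def projected_kernel(x1, x2, SV, y, alphas, b, kernel=rbf):
    g = lambda x: np.sum(y * alphas * np.array([kernel(s, x) for s in SV]))
    K_sv = np.array([[kernel(si, sj) for sj in SV] for si in SV])
    w_norm2 = (y * alphas) @ K_sv @ (y * alphas)   # ||w||^2 from the dual expansion
    return kernel(x1, x2) + (b ** 2 - g(x1) * g(x2)) / w_norm2

SV = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # placeholder support vectors
y = np.array([1.0, -1.0, 1.0])
alphas = np.array([0.5, 0.8, 0.3])
print(projected_kernel(np.array([0.2, 0.3]), np.array([0.7, 0.1]), SV, y, alphas, b=0.1))
```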

  22. Example of the proposed curve fitting for the folium of Descartes
  Fig. 6: Prediction of the folium of Descartes. Parameters: RBF kernel, σ = 1.5, C = 100.0, t = 0.005, c = (0.1, 0.0). 22 / 29

  23. Example of the proposed curve fitting for the folium of Descartes
  Fig. 7: Prediction of the folium of Descartes. Parameters: RBF kernel, σ = 1.5, C = 10000.0, t = 0.005, c = (0.1, 0.0). 23 / 29
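For context, a sketch of how sample points of the folium of Descartes might be generated via its standard rational parametrization x = 3as/(1 + s^3), y = 3as^2/(1 + s^3). The leaf parameter a, the parameter range and the noise level are assumptions, not taken from the slides; such points could then be fed to the curve-fitting procedure with the parameters listed in Figs. 6 and 7.

```python
# Sketch: sample points of the folium of Descartes x^3 + y^3 = 3*a*x*y.
import numpy as np

def folium_points(n=200, a=1.0, noise=0.0, seed=0):
    rng = np.random.default_rng(seed)
    s = np.linspace(-0.55, 6.0, n)            # avoid the singularity at s = -1
    x = 3.0 * a * s / (1.0 + s ** 3)
    y = 3.0 * a * s ** 2 / (1.0 + s ** 3)
    return np.c_[x, y] + noise * rng.standard_normal((n, 2))

print(folium_points(noise=0.02)[:3])
```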
