Support Vector Manifold Learning for Solving Regression Problems via Clustering (PowerPoint Presentation)


SLIDE 1

Support Vector Manifold Learning for Solving Regression Problems via Clustering

Marcin Orchel

AGH University of Science and Technology in Poland

1 / 29

SLIDE 2

2 / 29

SLIDE 3
[Figure (a): plot of example data points over x and y]

3 / 29

SLIDE 4

New kernel:

$$(\varphi(x_i) + y_i t\,\varphi(c))^T (\varphi(x_j) + y_j t\,\varphi(c)) = \varphi(x_i)^T \varphi(x_j) + y_j t\,\varphi(x_i)^T \varphi(c) + y_i t\,\varphi(c)^T \varphi(x_j) + y_i y_j t^2\, \varphi(c)^T \varphi(c) \quad (1)-(3)$$

$$= K(x_i, x_j) + y_j t\,K(x_i, c) + y_i t\,K(c, x_j) + y_i y_j t^2\, K(c, c). \quad (4)-(5)$$

Cross kernel:

$$(\varphi(x_i) + y_i t\,\varphi(c))^T \varphi(x) = \varphi(x_i)^T \varphi(x) + y_i t\,\varphi(c)^T \varphi(x), \quad (6)-(7)$$

so

$$K(x_i, x) + y_i t\,K(c, x). \quad (8)$$

The number of support vectors is at most n + 1.
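A minimal sketch of how the new kernel (4)-(5) and the cross kernel (8) could be evaluated, assuming an RBF base kernel; the names `rbf_kernel`, `shifted_kernel`, and `cross_kernel` are illustrative, not from the slides.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Base kernel K(a, b); for the RBF kernel K(x, x) is constant."""
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / (2.0 * sigma ** 2))

def shifted_kernel(xi, xj, yi, yj, c, t, K=rbf_kernel):
    """Kernel between the shifted images phi(x_i) + y_i*t*phi(c) and
    phi(x_j) + y_j*t*phi(c), expanded in terms of the base kernel (Eqs. (4)-(5))."""
    return (K(xi, xj) + yj * t * K(xi, c)
            + yi * t * K(c, xj) + yi * yj * t ** 2 * K(c, c))

def cross_kernel(xi, x, yi, c, t, K=rbf_kernel):
    """Kernel between a shifted training image and an unshifted point phi(x) (Eq. (8))."""
    return K(xi, x) + yi * t * K(c, x)
```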

4 / 29

SLIDE 5

Proposition 1

Shifting a hyperplane by any vector c gives a new hyperplane that differs from the original only in the free term b.

Lemma 1

After duplicating and shifting an (n − 1)-dimensional hyperplane constrained by an n-dimensional hypersphere, the maximal distance from the original center of the hypersphere to any point of the shifted hyperplane is attained at a point such that, after projecting this point onto the (n − 1)-dimensional hyperplane (before the shift), the vector from 0 to this point is parallel to the vector from 0 to the projected center of one of the new hyperspheres (a shifted hyperplane).

5 / 29

SLIDE 6

Proposition 2

The radius R_n of the minimal hypersphere containing all points after shifting is equal to

$$R_n = \left\| c + R\, \frac{c_m}{\|c_m\|} \right\|, \quad (9)$$

where c_m is defined as

$$c_m = c - \frac{b + w \cdot c}{\|w\|^2}\, w \quad (10)$$

and c_m ≠ 0. For c_m = 0, we get R_n = ‖c‖ = c_p.
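A small sketch of Eqs. (9)-(10) as code; `shifted_radius` is an illustrative name, and the vector/scalar argument layout is assumed.

```python
import numpy as np

def shifted_radius(c, w, b, R):
    """Radius R_n of the minimal hypersphere containing all shifted points (Eqs. (9)-(10));
    c and w are vectors, b and R scalars."""
    c_m = c - ((b + w @ c) / (w @ w)) * w          # Eq. (10)
    norm_cm = np.linalg.norm(c_m)
    if norm_cm == 0.0:                              # special case c_m = 0
        return np.linalg.norm(c)                    # R_n = ||c|| = c_p
    return np.linalg.norm(c + R * c_m / norm_cm)    # Eq. (9)
```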

6 / 29

SLIDE 7

Proposition 3

Consider hyperplanes w_c · x = 0, where w_c is normalized so that the hyperplanes are in canonical form with respect to a set of points A = {x_1, ..., x_n}, that is,

$$\min_i |w_c \cdot x_i| = 1. \quad (11)$$

The set of decision functions f_w(x) = sgn(x · w_c) defined on A and satisfying the constraint ‖w_c‖ ≤ D has a Vapnik-Chervonenkis (VC) dimension satisfying

$$h \leq \min\left( R^2 D^2,\; m \right) + 1, \quad (12)$$

where R is the radius of the smallest sphere centered at the origin and containing A.

7 / 29

SLIDE 8

We can improve the generalization bounds when

$$\left\| c + R\, \frac{c_m}{\|c_m\|} \right\|^2 D^2\, (1 + D c_p)^2 \leq R^2 D^2, \quad (13)$$

that is,

$$\left\| c + R\, \frac{c_m}{\|c_m\|} \right\|^2 (1 + D c_p)^2 \leq R^2. \quad (14)$$

For the special case c_m = 0, we get

$$c_p^2\, (1 + D c_p)^2 \leq R^2. \quad (15)$$
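A worked check of inequality (14), reusing the radius from Eqs. (9)-(10); treating c_p as ‖c‖ is my assumption, and `bound_improves` is an illustrative name.

```python
import numpy as np

def bound_improves(c, w, b, R, D):
    """Return True when inequality (14) holds, i.e. shifting does not worsen the bound.
    Assumes c_p = ||c||; R_n follows Eqs. (9)-(10)."""
    c_p = np.linalg.norm(c)
    c_m = c - ((b + w @ c) / (w @ w)) * w
    norm_cm = np.linalg.norm(c_m)
    R_n = c_p if norm_cm == 0.0 else np.linalg.norm(c + R * c_m / norm_cm)
    return R_n ** 2 * (1.0 + D * c_p) ** 2 <= R ** 2
```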

8 / 29

SLIDE 9

Performance measure

For OCSVM, the distance between a point r and the minimal hypersphere in a kernel-induced feature space can be computed as

$$R - \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j) - 2 \sum_{j=1}^{n} \alpha_j K(x_j, r) + K(r, r) \right)^{1/2}. \quad (16)-(17)$$

For kernels for which K(x, x) is constant, such as the radial basis function (RBF) kernel, the radius R can be computed as

$$R = \left( K(x, x) + \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j) + 2 b^* \right)^{1/2}. \quad (18)$$
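A sketch of the OCSVM performance measure (16)-(17) given the dual coefficients and a base kernel; the function name and argument layout are assumptions, not from the slides.

```python
import numpy as np

def ocsvm_distance_to_sphere(alphas, X, r, R, K):
    """Distance between a point r and the minimal hypersphere of radius R in the
    kernel-induced feature space (Eqs. (16)-(17)); alphas are the OCSVM dual coefficients."""
    center_sq = sum(ai * aj * K(xi, xj)
                    for ai, xi in zip(alphas, X)
                    for aj, xj in zip(alphas, X))
    cross = sum(aj * K(xj, r) for aj, xj in zip(alphas, X))
    return R - np.sqrt(center_sq - 2.0 * cross + K(r, r))
```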

9 / 29

SLIDE 10

Performance measure

For SVML, the distance between a point r and the hyperplane in a kernel-induced feature space can be computed as

$$\frac{|w_c \cdot \varphi(r) + b_c|}{\|w_c\|} = \frac{\left| \sum_{i=1}^{n} y_i^c \alpha_i^* K(x_i, r) + b_c \right|}{\sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} y_i^c y_j^c \alpha_i^* \alpha_j^* K(x_i, x_j)}}. \quad (19)-(20)$$
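A corresponding sketch of the SVML performance measure (19)-(20); `svml_distance_to_hyperplane` and its argument names are illustrative.

```python
import numpy as np

def svml_distance_to_hyperplane(alphas, y_c, X, b_c, r, K):
    """Distance between a point r and the SVML hyperplane in the kernel-induced feature
    space (Eqs. (19)-(20)); alphas are the dual coefficients, y_c the labels, b_c the free term."""
    numerator = abs(sum(yi * ai * K(xi, r) for yi, ai, xi in zip(y_c, alphas, X)) + b_c)
    w_norm_sq = sum(yi * yj * ai * aj * K(xi, xj)
                    for yi, ai, xi in zip(y_c, alphas, X)
                    for yj, aj, xj in zip(y_c, alphas, X))
    return numerator / np.sqrt(w_norm_sq)
```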

10 / 29

SLIDE 11

[Figure: two panels, (a) and (b), plots of data over x and y]

11 / 29

SLIDE 12

Fig. 3: Clustering based on curve learning. Points: examples. (a) Solid line: solution of SVCL. (b) Solid line: solution of SVMLC. (c) Solid line: solution of KPCA.

12 / 29

SLIDE 13

Fig. 4: Regression via clustering. (a) Clustering with SVCL. (b) Clustering with SVMLC. (c) Corresponding two regression functions for (a). (d) Corresponding two regression functions for (b).

13 / 29

SLIDE 14

Fig. 5: Regression via clustering. (a) Clustering with SVCL. (b) Clustering with SVMLC. (c) Corresponding two regression functions for (a). (d) Corresponding two regression functions for (b).

14 / 29

SLIDE 15

Goal

  • develop a method for dimensionality reduction based on support vector machines (SVM)
  • reduce dimensionality by fitting a curve to data given as vectors (not classification and not regression data)
  • it might be seen as a generalization of regression: regression fits a function to data, curve fitting fits a curve to data
  • idea: duplicate the points, shift them in a kernel space, and use support vector classification (SVC)
  • use recursive dimensionality reduction for a linear decision boundary in kernel space: project the points onto the solution curve and repeat all steps
  • we could also use it for clustering, similarly to self-organizing maps
  • we could use it for visualization

15 / 29

SLIDE 16

Shifting in kernel space

Shifting points in kernel space:

$$(\varphi(x_i) + y_i t\,\varphi(c))^T (\varphi(x_j) + y_j t\,\varphi(c)) = \varphi(x_i)^T \varphi(x_j) + y_j t\,\varphi(x_i)^T \varphi(c) + y_i t\,\varphi(c)^T \varphi(x_j) + y_i y_j t^2\, \varphi(c)^T \varphi(c), \quad (21)-(22)$$

where t is a translation parameter, c is a shifting point, and ϕ(·) is the feature map of some symmetric kernel.

Cross kernel:

$$(\varphi(x_i) + y_i t\,\varphi(c))^T \varphi(x) = \varphi(x_i)^T \varphi(x) + y_i t\,\varphi(c)^T \varphi(x). \quad (23)$$

We preserve sparsity: for two duplicated points, where y_i = 1 and y_{i+size} = −1,

$$y_i \alpha_i \left( \varphi(x_i)^T \varphi(x) + t\,\varphi(c)^T \varphi(x) \right) + y_{i+size}\, \alpha_{i+size} \left( \varphi(x_i)^T \varphi(x) + y_{i+size}\, t\,\varphi(c)^T \varphi(x) \right) \quad (24)-(25)$$

$$= (y_i \alpha_i + y_{i+size}\, \alpha_{i+size})\, \varphi(x_i)^T \varphi(x) + (y_i \alpha_i + \alpha_{i+size})\, t\,\varphi(c)^T \varphi(x). \quad (26)$$
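A small sketch of how the collapsed coefficients in Eq. (26) could be used when evaluating the decision function for one duplicated pair; the helper name and arguments are mine.

```python
def combined_pair_term(alpha_i, alpha_dup, K_xi_x, K_c_x, t):
    """Contribution of a duplicated pair (y_i = +1, y_{i+size} = -1) to the decision
    function, using the collapsed coefficients of Eq. (26); K_xi_x = K(x_i, x) and
    K_c_x = K(c, x) are precomputed kernel values."""
    coef_xi = alpha_i - alpha_dup        # y_i*alpha_i + y_{i+size}*alpha_{i+size}
    coef_c = (alpha_i + alpha_dup) * t   # (y_i*alpha_i + alpha_{i+size}) * t, with y_i = 1
    return coef_xi * K_xi_x + coef_c * K_c_x
```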

16 / 29

SLIDE 17

Shifting in a kernel space, cont.

The second term can be summed over all i:

$$\sum_i (y_i \alpha_i + \alpha_{i+size})\, t\,\varphi(c)^T \varphi(x). \quad (27)$$

When α_i = α_{i+size} = C, this becomes

$$2Ct\,\varphi(c)^T \varphi(x), \quad (28)$$

so it is like adding an artificial point c to the solution curve with the parameter 2Ct; we can sum these contributions for multiple points.

17 / 29

SLIDE 18

Shifting in kernel space

  • when ϕ is a linear kernel, we get δ support vector regression (δ-SVR)
  • hypothesis: it does not matter how we choose the shifting point, due to the linear decision boundary in kernel space; for example, we can shift only in one direction, e.g. c = (0.0, 0.0, 1.0) in three dimensions
  • the shifting strategy has already been tested in the input space for regression in δ-SVR and works fine

18 / 29

SLIDE 19

Dimensionality reduction

The parametric form of a straight line through the point ϕ(x_1) in the direction of w is l = ϕ(x_1) + t w. The point l must belong to the hyperplane, so after substituting into

$$w^T l + b = 0 \quad (29)$$

we get

$$w^T (\varphi(x_1) + t w) + b = 0. \quad (30)$$

We need to compute t, so

$$t = \frac{-b - w^T \varphi(x_1)}{\|w\|^2}. \quad (31)$$

After substituting t we get the projected point

$$z = \varphi(x_1) - \frac{b + w^T \varphi(x_1)}{\|w\|^2}\, w. \quad (32)$$
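A sketch of Eq. (31) evaluated purely from kernel values, under the assumption that w has a dual expansion w = Σ_k β_k ϕ(x_k); that representation and the name `projection_coefficient` are mine.

```python
def projection_coefficient(beta, X_sv, b, x1, K):
    """t from Eq. (31): the step along w that projects phi(x1) onto the hyperplane
    w^T z + b = 0, with w = sum_k beta_k * phi(x_k) expanded through the kernel K."""
    w_dot_phi_x1 = sum(bk * K(xk, x1) for bk, xk in zip(beta, X_sv))
    w_norm_sq = sum(bk * bl * K(xk, xl)
                    for bk, xk in zip(beta, X_sv)
                    for bl, xl in zip(beta, X_sv))
    return (-b - w_dot_phi_x1) / w_norm_sq
```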

19 / 29

SLIDE 20

Dimensionality reduction

z_1 and z_2 are new points in a kernel space, so in order to compute a kernel we just compute their dot product:

$$z_1^T z_2 = \left( \varphi(x_1) - \frac{b + w^T \varphi(x_1)}{\|w\|^2}\, w \right)^T \left( \varphi(x_2) - \frac{b + w^T \varphi(x_2)}{\|w\|^2}\, w \right) \quad (33)$$

$$= \varphi(x_1)^T \varphi(x_2) - \frac{b + w^T \varphi(x_1)}{\|w\|^2}\, w^T \varphi(x_2) - \varphi(x_1)^T\, \frac{b + w^T \varphi(x_2)}{\|w\|^2}\, w + \frac{b + w^T \varphi(x_1)}{\|w\|^2}\, w^T\, \frac{b + w^T \varphi(x_2)}{\|w\|^2}\, w. \quad (34)-(35)$$

20 / 29

SLIDE 21

Dimensionality reduction

$$z_1^T z_2 = \varphi(x_1)^T \varphi(x_2) - \frac{b + w^T \varphi(x_1)}{\|w\|^2}\, w^T \varphi(x_2) - \varphi(x_1)^T\, \frac{b + w^T \varphi(x_2)}{\|w\|^2}\, w + \frac{(b + w^T \varphi(x_1))(b + w^T \varphi(x_2))}{\|w\|^2} \quad (36)-(37)$$

$$z_1^T z_2 = \varphi(x_1)^T \varphi(x_2) - \frac{b\, w^T \varphi(x_2) + 2\, w^T \varphi(x_1)\, w^T \varphi(x_2) + b\, \varphi(x_1)^T w}{\|w\|^2} + \frac{b^2 + b\, w^T \varphi(x_2) + b\, w^T \varphi(x_1) + w^T \varphi(x_1)\, w^T \varphi(x_2)}{\|w\|^2}. \quad (38)-(39)$$

We use this iteratively: in the next reduction we will use kernel values from the previous reduction, in the first iteration we use the shift kernel, and w will be perpendicular to the previously computed w.
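A sketch of Eqs. (36)-(37) as a function of kernel-level quantities; the scalar inputs (w^T ϕ(x_k), ‖w‖²) would themselves come from kernel expansions, and the name `projected_kernel` is illustrative.

```python
def projected_kernel(k12, w_phi_x1, w_phi_x2, w_norm_sq, b):
    """Kernel z1^T z2 between the projections of phi(x1) and phi(x2) onto the hyperplane
    w^T z + b = 0, following Eqs. (36)-(37); k12 = K(x1, x2), w_phi_xk = w^T phi(x_k)."""
    return (k12
            - (b + w_phi_x1) * w_phi_x2 / w_norm_sq
            - (b + w_phi_x2) * w_phi_x1 / w_norm_sq
            + (b + w_phi_x1) * (b + w_phi_x2) / w_norm_sq)
```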

21 / 29

SLIDE 22

Example of proposed curve fitting for folium of Descartes

Fig. 6: Prediction of the folium of Descartes. Parameters: RBF kernel, σ = 1.5, C = 100.0, t = 0.005, c = (0.1, 0.0).

22 / 29

SLIDE 23

Example of proposed curve fitting for folium of Descartes

Fig. 7: Prediction of the folium of Descartes. Parameters: RBF kernel, σ = 1.5, C = 10000.0, t = 0.005, c = (0.1, 0.0).

23 / 29

SLIDE 24

Example of proposed curve fitting for folium of Descartes

Fig. 8: Prediction of the folium of Descartes. Parameters: RBF kernel, σ = 1.5, C = 10000.0, t = 0.005, c = (0.0, 0.1).

24 / 29

SLIDE 25

Example of proposed curve fitting for folium of Descartes

Fig. 9: Prediction of the folium of Descartes. Parameters: RBF kernel, σ = 1.5, C = 10000.0, t = 0.01, c = (0.0, 0.1).

25 / 29

SLIDE 26

Clustering based on curve fitting

  • the shapes are convex in kernel space, so we can check whether there is any point between any two points that has zero function value; if not, they belong to separate curves
  • consider whether the kernel transformation preserves closeness of curves: is it a continuous transformation?

26 / 29

SLIDE 27

Clustering derivation

The projected vector is

$$z = \varphi(x_1) - \frac{b + w^T \varphi(x_1)}{\|w\|^2}\, w. \quad (40)$$

We can check a projected vector with any result kernel:

$$\varphi(x_i)^T \left( \varphi(x_1) - \frac{b + w^T \varphi(x_1)}{\|w\|^2}\, w \right) \quad (41)$$

$$= \varphi(x_i)^T \varphi(x_1) - \varphi(x_i)^T w\, \frac{b + w^T \varphi(x_1)}{\|w\|^2}. \quad (42)$$

We check the value for a point between two points, that is

$$\mathbf{b} = \frac{\varphi(x_1) - \frac{b + w^T \varphi(x_1)}{\|w\|^2}\, w + \mu \left( \varphi(x_2) - \frac{b + w^T \varphi(x_2)}{\|w\|^2}\, w \right)}{1 + \mu}. \quad (43)$$

27 / 29

SLIDE 28

Clustering derivation

We can check a projected vector with any cross kernel:

$$\varphi(x_i)^T \left( \frac{\varphi(x_1) - \frac{b + w^T \varphi(x_1)}{\|w\|^2}\, w + \mu \left( \varphi(x_2) - \frac{b + w^T \varphi(x_2)}{\|w\|^2}\, w \right)}{1 + \mu} \right). \quad (44)$$

28 / 29

SLIDE 29

Clustering derivation

Regression with shifting in kernel space:

$$\varphi(x_i)^T \left( \frac{\varphi(x_1) - \frac{b + w^T \varphi(x_1)}{\|w\|^2}\, w + \mu \left( \varphi(x_2) - \frac{b + w^T \varphi(x_2)}{\|w\|^2}\, w \right)}{1 + \mu} \right). \quad (45)$$

The value of the function at a point between, in kernel space:

$$\sum_i \varphi(x_i)^T \left( \frac{\varphi(x_1) + \mu\, \varphi(x_2)}{1 + \mu} \right) + b. \quad (46)$$
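A sketch evaluating Eq. (46) through the base kernel; the slide writes the sum without explicit dual coefficients, so this helper follows that form, and its name and arguments are mine.

```python
def value_between(X_sv, x1, x2, mu, b, K):
    """Value of the function at the point (phi(x1) + mu*phi(x2)) / (1 + mu) in kernel
    space (Eq. (46)); if the expansion carries dual coefficients, they would multiply
    each summand."""
    return sum((K(xi, x1) + mu * K(xi, x2)) / (1.0 + mu) for xi in X_sv) + b
```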

29 / 29