SLIDE 1

CptS 570 – Machine Learning, School of EECS, Washington State University

SLIDE 2

• Or, support vector machine (SVM)
• Discriminant-based method
  • Learns class boundaries
• The support vectors are the examples closest to the boundary
• A kernel computes the similarity between examples
  • Maps the instance space to a higher-dimensional space where (hopefully) linear models suffice
• Choosing the right kernel is crucial
• Kernel machines are among the best-performing learners

SLIDE 3

• Likely to underfit using only hyperplanes
• But we can map the data to a nonlinear space and use hyperplanes there
  • Φ: Rd → F
  • x → Φ(x)

SLIDE 4

• Note we want ≥ +1, not ≥ 0
• We want instances to lie some distance from the hyperplane

$$
X = \{\mathbf{x}^t, r^t\}, \quad \text{where } r^t =
\begin{cases}
+1 & \text{if } \mathbf{x}^t \in C_1 \\
-1 & \text{if } \mathbf{x}^t \in C_2
\end{cases}
$$

Find $\mathbf{w}$ and $w_0$ such that

$$
\mathbf{w}^T\mathbf{x}^t + w_0 \ge +1 \;\text{ for } r^t = +1, \qquad
\mathbf{w}^T\mathbf{x}^t + w_0 \le -1 \;\text{ for } r^t = -1,
$$

which can be rewritten as

$$
r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) \ge +1
$$

SLIDE 5

• Distance from instance xt to the hyperplane
• The distance from the hyperplane to the closest instances is the margin

$$
\frac{\left|\mathbf{w}^T\mathbf{x}^t + w_0\right|}{\lVert\mathbf{w}\rVert},
\qquad
\frac{r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right)}{\lVert\mathbf{w}\rVert}
$$
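To make the geometry concrete, here is a minimal NumPy sketch that computes these distances and the resulting margin; the weight vector, bias, and instances below are made up for illustration, not taken from the lecture.

```python
import numpy as np

# Illustrative hyperplane w^T x + w0 and labeled instances (assumed values)
w = np.array([2.0, -1.0])          # weight vector
w0 = 0.5                           # bias term
X = np.array([[1.0, 0.0],          # instances x^t (rows)
              [0.0, 2.0],
              [-1.0, -1.0]])
r = np.array([+1, -1, -1])         # labels r^t in {+1, -1}

# Distance of each instance to the hyperplane: |w^T x^t + w0| / ||w||
dist = np.abs(X @ w + w0) / np.linalg.norm(w)

# Signed version using the label: r^t (w^T x^t + w0) / ||w||
# (positive when the instance lies on the correct side)
signed_dist = r * (X @ w + w0) / np.linalg.norm(w)

# The margin is the distance from the hyperplane to the closest instance
margin = signed_dist.min()
print(dist, signed_dist, margin)
```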

SLIDE 6

• The optimal separating hyperplane is the one maximizing the margin
• We want to choose w maximizing ρ such that

$$
\frac{r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right)}{\lVert\mathbf{w}\rVert} \ge \rho, \quad \forall t
$$

• There is an infinite number of solutions obtained by scaling w
• Thus, we fix ρ‖w‖ = 1 and choose the solution minimizing ‖w‖:

$$
\min \; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2
\quad \text{subject to} \quad
r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) \ge +1, \;\; \forall t
$$

SLIDE 7

• This is a quadratic optimization problem with complexity polynomial in d (the number of features)
• The kernel will eventually map the d-dimensional space to a higher-dimensional space
• So we prefer a complexity that does not depend on the number of dimensions

$$
\min \; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2
\quad \text{subject to} \quad
r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) \ge +1, \;\; \forall t
$$

SLIDE 8

• Convert the optimization problem to depend on the number of training examples N (not d)
  • Still polynomial in N
• But the optimization will depend only on the closest examples (the support vectors)
  • Typically ≪ N

SLIDE 9

• Rewrite the quadratic optimization problem using Lagrange multipliers αt, 1 ≤ t ≤ N
• Minimize Lp

$$
\min \; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2
\quad \text{subject to} \quad
r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) \ge +1, \;\; \forall t
$$

$$
L_p = \tfrac{1}{2}\lVert\mathbf{w}\rVert^2
      - \sum_{t=1}^{N} \alpha^t \left[ r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) - 1 \right]
    = \tfrac{1}{2}\lVert\mathbf{w}\rVert^2
      - \sum_{t=1}^{N} \alpha^t r^t \left(\mathbf{w}^T\mathbf{x}^t + w_0\right)
      + \sum_{t=1}^{N} \alpha^t
$$

SLIDE 10

• Equivalently, we can maximize Lp subject to the constraints:

$$
\frac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{t=1}^{N} \alpha^t r^t \mathbf{x}^t
\qquad
\frac{\partial L_p}{\partial w_0} = 0 \;\Rightarrow\; \sum_{t=1}^{N} \alpha^t r^t = 0
$$

• Plugging these into Lp …

SLIDE 11

• Maximize Ld with respect to αt only
• Complexity is O(N3)

$$
L_d = \tfrac{1}{2}\mathbf{w}^T\mathbf{w}
      - \mathbf{w}^T\sum_t \alpha^t r^t \mathbf{x}^t
      - w_0 \sum_t \alpha^t r^t
      + \sum_t \alpha^t
    = -\tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s \left(\mathbf{x}^t\right)^T\mathbf{x}^s
      + \sum_t \alpha^t
$$

$$
\text{subject to} \quad \sum_t \alpha^t r^t = 0 \quad \text{and} \quad \alpha^t \ge 0, \;\; \forall t
$$
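As an illustration only (the tiny dataset and the use of a generic off-the-shelf solver are assumptions, not the course's method), the dual above can be handed directly to a quadratic-programming routine; the sketch on the next slide then recovers w and w0 from the resulting α values.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable dataset (assumed for illustration)
X = np.array([[2.0, 2.0], [2.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
r = np.array([+1.0, +1.0, -1.0, -1.0])
N = len(r)

# Q[t, s] = r^t r^s (x^t)^T x^s, so that L_d = sum(alpha) - 0.5 * alpha^T Q alpha
Q = (r[:, None] * X) @ (r[:, None] * X).T

neg_Ld = lambda a: 0.5 * a @ Q @ a - a.sum()          # minimize -L_d
constraints = {"type": "eq", "fun": lambda a: a @ r}  # sum_t alpha^t r^t = 0
bounds = [(0.0, None)] * N                            # alpha^t >= 0

res = minimize(neg_Ld, np.zeros(N), bounds=bounds, constraints=constraints)
alpha = res.x
print("alpha:", np.round(alpha, 3))
```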

SLIDE 12

• Most αt = 0
  • I.e., rt(wTxt + w0) > 1 (xt lies outside the margin)
• Support vectors: xt such that αt > 0
  • I.e., rt(wTxt + w0) = 1 (xt lies on the margin)
• w = Σt αt rt xt
• w0 = rt – wTxt for any support vector xt
  • Typically averaged over all support vectors (see the sketch below)
• The resulting discriminant is called the support vector machine (SVM)
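Continuing the illustrative sketch from the previous slide (it assumes the arrays X, r, and alpha from that block), w and w0 can be recovered from the α values exactly as described above:

```python
# Recover w and w0 from the dual solution of the previous sketch
sv = alpha > 1e-6                       # support vectors have alpha^t > 0
w = (alpha[sv] * r[sv]) @ X[sv]         # w = sum_t alpha^t r^t x^t
w0 = np.mean(r[sv] - X[sv] @ w)         # average r^t - w^T x^t over support vectors

# Discriminant: g(x) = w^T x + w0; classify by its sign
g = lambda x: x @ w + w0
print("support vectors:", np.where(sv)[0], " w:", np.round(w, 3), " w0:", round(w0, 3))
```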

SLIDE 13

[Figure: linearly separable data with the separating hyperplane and margin; circled points (O) are the support vectors lying on the margin]

SLIDE 14

• Data not linearly separable
• Find the hyperplane with the least error
• Define slack variables ξt ≥ 0 storing the deviation from the margin

$$
r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) \ge 1 - \xi^t
$$

SLIDE 15

• (a) Correctly classified example far from the margin (ξt = 0)
• (b) Correctly classified example on the margin (ξt = 0)
• (c) Correctly classified example, but inside the margin (0 < ξt < 1)
• (d) Incorrectly classified example (ξt ≥ 1)
• Soft error = Σt ξt

SLIDE 16

[Figure: non-separable data with the separating hyperplane and margin; circled points (O) are the support vectors]

SLIDE 17

• Lagrangian equation with slack variables
• C is the penalty factor
• μt ≥ 0 are a new set of Lagrange multipliers
• We want to minimize Lp

$$
L_p = \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_t \xi^t
      - \sum_t \alpha^t \left[ r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) - 1 + \xi^t \right]
      - \sum_t \mu^t \xi^t
$$

SLIDE 18

• Minimize Lp by setting the derivatives to zero
• Plugging these into Lp yields the dual Ld
• Maximize Ld with respect to αt

$$
\frac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{t=1}^{N} \alpha^t r^t \mathbf{x}^t
\qquad
\frac{\partial L_p}{\partial w_0} = 0 \;\Rightarrow\; \sum_{t=1}^{N} \alpha^t r^t = 0
\qquad
\frac{\partial L_p}{\partial \xi^t} = 0 \;\Rightarrow\; C - \alpha^t - \mu^t = 0
$$

SLIDE 19

• Quadratic optimization problem
• Support vectors have αt > 0
  • Examples on the margin: αt < C
  • Examples inside the margin or misclassified: αt = C

$$
L_d = \sum_t \alpha^t
      - \tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s \left(\mathbf{x}^t\right)^T\mathbf{x}^s
\quad \text{subject to} \quad
\sum_t \alpha^t r^t = 0 \;\text{ and }\; 0 \le \alpha^t \le C, \;\; \forall t
$$

SLIDE 20

• C is a regularization parameter
  • High C → high penalty for non-separable examples (may overfit)
  • Low C → less penalty (may underfit)
  • Determine using a validation set (C = 1 is typical)

$$
L_d = \sum_t \alpha^t
      - \tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s \left(\mathbf{x}^t\right)^T\mathbf{x}^s
\quad \text{subject to} \quad
\sum_t \alpha^t r^t = 0 \;\text{ and }\; 0 \le \alpha^t \le C, \;\; \forall t
$$
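A small scikit-learn sketch of tuning C on a validation set; the synthetic dataset, the linear kernel, and the specific C values are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data with some overlap (assumed for illustration)
X, y = make_blobs(n_samples=300, centers=2, cluster_std=2.5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X_tr, y_tr)
    # Low C tolerates margin violations (risk of underfitting);
    # high C penalizes them heavily (risk of overfitting).
    print(f"C={C:>6}: validation accuracy = {clf.score(X_val, y_val):.3f}, "
          f"#support vectors = {clf.n_support_.sum()}")
```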

SLIDE 21

• To use the previous approaches, the data must be nearly linearly separable
• If not, perhaps a transformation φ(x) will help
• The φj(x) are basis functions

SLIDE 22

• Transform the d-dimensional x space to a k-dimensional z space using basis functions φ(x)
• z = φ(x) where zj = φj(x), j = 1, …, k
• Instead of w0, assume z1 = φ1(x) ≡ 1

$$
g(\mathbf{x}) = \mathbf{w}^T\mathbf{z} = \mathbf{w}^T\boldsymbol{\varphi}(\mathbf{x})
             = \sum_{j=1}^{k} w_j\,\varphi_j(\mathbf{x})
$$

SLIDE 23

• Replace the inner product of basis functions φ(xt)Tφ(xs) with a kernel function K(xt, xs)

$$
L_p = \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_t \xi^t
      - \sum_t \alpha^t \left[ r^t\,\mathbf{w}^T\boldsymbol{\varphi}(\mathbf{x}^t) - 1 + \xi^t \right]
      - \sum_t \mu^t \xi^t
$$

$$
L_d = \sum_t \alpha^t
      - \tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s\,
        \boldsymbol{\varphi}(\mathbf{x}^t)^T\boldsymbol{\varphi}(\mathbf{x}^s)
\quad \text{subject to} \quad
\sum_t \alpha^t r^t = 0 \;\text{ and }\; 0 \le \alpha^t \le C, \;\; \forall t
$$

$$
L_d = \sum_t \alpha^t
      - \tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s\, K(\mathbf{x}^t, \mathbf{x}^s)
$$

SLIDE 24

• The kernel K(xt, xs) computes the z-space product φ(xt)Tφ(xs) in x-space
• The matrix of kernel values K, where Kts = K(xt, xs), is called the Gram matrix
• K should be symmetric and positive semidefinite

$$
\mathbf{w} = \sum_t \alpha^t r^t \mathbf{z}^t = \sum_t \alpha^t r^t \boldsymbol{\varphi}(\mathbf{x}^t)
$$

$$
g(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\varphi}(\mathbf{x})
             = \sum_t \alpha^t r^t\, \boldsymbol{\varphi}(\mathbf{x}^t)^T\boldsymbol{\varphi}(\mathbf{x})
             = \sum_t \alpha^t r^t\, K(\mathbf{x}^t, \mathbf{x})
$$
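A small NumPy sketch that builds a Gram matrix and checks the two properties just listed; the random data and the choice of an RBF kernel with radius s = 1 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                    # 5 examples, 3 features (assumed data)

def rbf_kernel(a, b, s=1.0):
    # Gaussian kernel exp(-||a - b||^2 / (2 s^2))
    return np.exp(-np.sum((a - b) ** 2) / (2 * s ** 2))

N = len(X)
K = np.array([[rbf_kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

print("symmetric:", np.allclose(K, K.T))
# Positive semidefinite: all eigenvalues >= 0 (up to numerical tolerance)
print("PSD:", np.all(np.linalg.eigvalsh(K) >= -1e-10))
```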

SLIDE 25

• Polynomial kernel of degree q
• If q = 1, then the original features are used
• For example, when q = 2 and d = 2

$$
K(\mathbf{x}^t, \mathbf{x}) = \left(\mathbf{x}^T\mathbf{x}^t + 1\right)^q
$$

$$
K(\mathbf{x}, \mathbf{y}) = \left(\mathbf{x}^T\mathbf{y} + 1\right)^2
= \left(x_1 y_1 + x_2 y_2 + 1\right)^2
= 1 + 2x_1 y_1 + 2x_2 y_2 + 2x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2
$$

which corresponds to the basis functions

$$
\boldsymbol{\varphi}(\mathbf{x}) = \left[\, 1,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; \sqrt{2}\,x_1 x_2,\; x_1^2,\; x_2^2 \,\right]^T
$$
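A quick numerical check (the test points are chosen arbitrarily) that the degree-2 polynomial kernel computes the same value as the inner product of the explicit expansion above:

```python
import numpy as np

def poly2_kernel(x, y):
    # Degree-2 polynomial kernel (x^T y + 1)^2
    return (x @ y + 1) ** 2

def phi(x):
    # Explicit basis expansion for d = 2 shown on this slide
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

x = np.array([0.7, -1.2])   # arbitrary test points
y = np.array([2.0, 0.5])
print(poly2_kernel(x, y), phi(x) @ phi(y))   # the two values should match
```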

SLIDE 26

• Polynomial kernel of degree 2

[Figure: decision boundary of a degree-2 polynomial kernel SVM; circled points (O) are the support vectors on the margin]

SLIDE 27

• Radial basis functions (Gaussian kernel)
• xt is the center
• s is the radius
• Larger s implies smoother boundaries

$$
K(\mathbf{x}^t, \mathbf{x}) = \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{x}^t \rVert^2}{2s^2} \right)
$$
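A sketch of how the radius s affects boundary smoothness; the synthetic dataset and the s values are assumptions, and note that scikit-learn parameterizes the RBF kernel by gamma, which relates to the radius above as gamma = 1 / (2 s²).

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Synthetic non-linearly-separable data (assumed for illustration)
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for s in [0.1, 1.0, 5.0]:
    clf = SVC(kernel="rbf", gamma=1.0 / (2 * s ** 2), C=1.0).fit(X, y)
    # Larger s (smaller gamma) -> smoother boundary, usually fewer support vectors
    print(f"s={s}: training accuracy = {clf.score(X, y):.3f}, "
          f"#support vectors = {clf.n_support_.sum()}")
```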

SLIDE 28

[Figure]

SLIDE 29

• Sigmoidal functions

$$
K(\mathbf{x}^t, \mathbf{x}) = \tanh\!\left( 2\,\mathbf{x}^T\mathbf{x}^t + 1 \right)
$$

SLIDE 30

• Kernel K(x, y) increases with the similarity between x and y
• Prior knowledge can be included in the kernel function
• E.g., training examples are documents
  • K(D1, D2) = number of shared words (toy sketches below)
• E.g., training examples are strings (e.g., DNA)
  • K(S1, S2) = 1 / edit distance between S1 and S2
  • Edit distance is the number of insertions, deletions, and/or substitutions needed to transform S1 into S2
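Toy Python sketches of the two kernels named above, written only to make the idea concrete; practical string and document kernels are usually more sophisticated than this.

```python
def shared_word_kernel(d1: str, d2: str) -> int:
    """K(D1, D2) = number of distinct words appearing in both documents."""
    return len(set(d1.lower().split()) & set(d2.lower().split()))

def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (c1 != c2)))  # substitution
        prev = curr
    return prev[-1]

def string_kernel(s1: str, s2: str) -> float:
    """K(S1, S2) = 1 / edit distance (guard the identical-string case)."""
    d = edit_distance(s1, s2)
    return float("inf") if d == 0 else 1.0 / d

print(shared_word_kernel("the cat sat", "the dog sat"))   # 2 shared words
print(string_kernel("GATTACA", "GACTATA"))
```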

SLIDE 31

• E.g., training examples are nodes in a graph (e.g., a social network)
• K(N1, N2) = 1 / length of the shortest path connecting the nodes
• K(N1, N2) = number of paths connecting the nodes
• Diffusion kernel

SLIDE 32

• Training examples are graphs, not feature vectors
  • E.g., carcinogenic vs. non-carcinogenic chemical structures
• Compare substructures of the graphs
  • E.g., walks, paths, cycles, trees, subgraphs
• K(G1, G2) = number of identical random walks in both graphs
• K(G1, G2) = number of subgraphs shared by both graphs

SLIDE 33

• Training data from multiple modalities (e.g., biometrics, social network, audio/visual)
• Construct new kernels by combining simpler kernels
• If K1(x, y) and K2(x, y) are valid kernels, and c > 0 is a constant, then the following are valid kernels:

$$
K(\mathbf{x}, \mathbf{y}) = c\,K_1(\mathbf{x}, \mathbf{y}), \qquad
K(\mathbf{x}, \mathbf{y}) = K_1(\mathbf{x}, \mathbf{y}) + K_2(\mathbf{x}, \mathbf{y}), \qquad
K(\mathbf{x}, \mathbf{y}) = K_1(\mathbf{x}, \mathbf{y})\,K_2(\mathbf{x}, \mathbf{y})
$$
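A numerical illustration (random data and c = 0.5 are assumed) that Gram matrices built from these combinations remain symmetric and positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))   # illustrative data

def gram(kernel):
    # Gram matrix K_ts = kernel(x^t, x^s) over the rows of X
    return np.array([[kernel(a, b) for b in X] for a in X])

K1 = gram(lambda a, b: (a @ b + 1) ** 2)                      # polynomial kernel
K2 = gram(lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0))   # Gaussian kernel

for name, K in [("c*K1", 0.5 * K1), ("K1+K2", K1 + K2), ("K1*K2", K1 * K2)]:
    psd = np.all(np.linalg.eigvalsh(K) >= -1e-8)
    print(name, "symmetric:", np.allclose(K, K.T), "PSD:", psd)
```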

SLIDE 34

• Adaptive kernel combination
• Learn both the αt and the ηi

$$
K(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{m} \eta_i\, K_i(\mathbf{x}, \mathbf{y})
$$

$$
L_d = \sum_t \alpha^t
      - \tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s \sum_i \eta_i\, K_i(\mathbf{x}^t, \mathbf{x}^s),
\qquad
g(\mathbf{x}) = \sum_t \alpha^t r^t \sum_i \eta_i\, K_i(\mathbf{x}^t, \mathbf{x})
$$
slide-35
SLIDE 35

 Learn K different kernel

machines gi(x)

  • Each uses one class as

positive, remaining classes as negative

  • Choose class i such that

i=argmaxj gj(x)

  • Best approach in practice

CptS 570 - Machine Learning 35

SLIDE 36

• Learn K(K−1)/2 kernel machines
  • Each uses one class as positive and another class as negative
  • Easier (faster) learning per kernel machine (a sketch of both strategies follows below)
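A scikit-learn sketch of the two multiclass strategies from the last two slides; the iris dataset, the RBF kernel, and the wrapper classes are illustrative choices, not prescribed by the lecture.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3-class dataset (assumed for illustration)

ovr = OneVsRestClassifier(SVC(kernel="rbf"))   # K machines: one class vs. the rest
ovo = OneVsOneClassifier(SVC(kernel="rbf"))    # K(K-1)/2 machines: each pair of classes

for name, clf in [("one-vs-all", ovr), ("one-vs-one", ovo)]:
    clf.fit(X, y)
    print(name, "training accuracy:", round(clf.score(X, y), 3),
          "number of machines:", len(clf.estimators_))
```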

SLIDE 37

• Learn all margins at once
  • zt is the class index of xt
• K·N variables to optimize (expensive)

$$
\min \; \tfrac{1}{2}\sum_{i=1}^{K}\lVert\mathbf{w}_i\rVert^2 + C\sum_i\sum_t \xi_i^t
\quad \text{subject to} \quad
\mathbf{w}_{z^t}^T\mathbf{x}^t + w_{z^t 0} \ge \mathbf{w}_i^T\mathbf{x}^t + w_{i0} + 2 - \xi_i^t,
\;\; \forall i \ne z^t, \qquad \xi_i^t \ge 0
$$

SLIDE 38

• Normally, we would use squared error
• For SVM regression, we use the ε-sensitive loss
  • Tolerates errors up to ε
  • Errors beyond ε have only a linear effect

$$
f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0, \qquad
e\!\left(r^t, f(\mathbf{x}^t)\right) = \left[ r^t - f(\mathbf{x}^t) \right]^2
$$

$$
e_\varepsilon\!\left(r^t, f(\mathbf{x}^t)\right) =
\begin{cases}
0 & \text{if } \left| r^t - f(\mathbf{x}^t) \right| < \varepsilon \\
\left| r^t - f(\mathbf{x}^t) \right| - \varepsilon & \text{otherwise}
\end{cases}
$$
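A minimal sketch of the two loss functions above; the ε value and the example residuals are arbitrary choices for illustration.

```python
import numpy as np

def squared_error(r, fx):
    # Usual regression loss: grows quadratically with the residual
    return (r - fx) ** 2

def eps_insensitive(r, fx, eps=0.5):
    # Zero inside the eps-tube, linear outside it
    return np.maximum(np.abs(r - fx) - eps, 0.0)

r, fx = 3.0, np.array([2.8, 2.2, 0.5])   # one target, three predictions (assumed)
print(squared_error(r, fx))
print(eps_insensitive(r, fx))
```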

SLIDE 39

• Use slack variables to account for deviations beyond ε
  • ξ+t for positive deviations
  • ξ−t for negative deviations

$$
\min \; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_t \left( \xi_+^t + \xi_-^t \right)
$$

subject to

$$
r^t - \left(\mathbf{w}^T\mathbf{x}^t + w_0\right) \le \varepsilon + \xi_+^t, \qquad
\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) - r^t \le \varepsilon + \xi_-^t, \qquad
\xi_+^t, \xi_-^t \ge 0
$$

SLIDE 40

• Non-support vectors (inside the margin): α+t = α−t = 0
• Support vectors
  • ⊗ on the margin: 0 < α+t < C or 0 < α−t < C
  • ⊠ outside the margin (outlier): α+t = C or α−t = C

$$
L_d = -\tfrac{1}{2}\sum_t\sum_s \left( \alpha_+^t - \alpha_-^t \right)\left( \alpha_+^s - \alpha_-^s \right)\left(\mathbf{x}^t\right)^T\mathbf{x}^s
      - \varepsilon \sum_t \left( \alpha_+^t + \alpha_-^t \right)
      + \sum_t r^t \left( \alpha_+^t - \alpha_-^t \right)
$$

$$
\text{subject to} \quad 0 \le \alpha_+^t \le C, \quad 0 \le \alpha_-^t \le C, \quad \sum_t \left( \alpha_+^t - \alpha_-^t \right) = 0
$$

SLIDE 41

[Figure: support vector regression fit with the ε-tube; ⊗ marks support vectors on the margin and ⊠ marks those outside it]

SLIDE 42

• Fitted line
• f(x) is a weighted sum of the support vectors
• Average w0 over the instances with:

$$
r^t = \mathbf{w}^T\mathbf{x}^t + w_0 + \varepsilon \;\;\text{if } 0 < \alpha_+^t < C,
\qquad
r^t = \mathbf{w}^T\mathbf{x}^t + w_0 - \varepsilon \;\;\text{if } 0 < \alpha_-^t < C
$$

$$
\mathbf{w} = \sum_t \left( \alpha_+^t - \alpha_-^t \right)\mathbf{x}^t,
\qquad
f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0 = \sum_t \left( \alpha_+^t - \alpha_-^t \right)\left(\mathbf{x}^t\right)^T\mathbf{x} + w_0
$$

SLIDE 43

$$
L_d = -\tfrac{1}{2}\sum_t\sum_s \left( \alpha_+^t - \alpha_-^t \right)\left( \alpha_+^s - \alpha_-^s \right) K(\mathbf{x}^t, \mathbf{x}^s)
      - \varepsilon \sum_t \left( \alpha_+^t + \alpha_-^t \right)
      + \sum_t r^t \left( \alpha_+^t - \alpha_-^t \right)
$$

$$
\text{subject to} \quad 0 \le \alpha_+^t \le C, \quad 0 \le \alpha_-^t \le C, \quad \sum_t \left( \alpha_+^t - \alpha_-^t \right) = 0
$$

$$
f(\mathbf{x}) = \sum_t \left( \alpha_+^t - \alpha_-^t \right) K(\mathbf{x}^t, \mathbf{x}) + w_0
$$
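A scikit-learn sketch of kernelized support vector regression; the synthetic 1-D data, the RBF kernel, and the C and ε values are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy samples of a sine curve (assumed for illustration)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
r = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, r)

# dual_coef_ holds (alpha_+^t - alpha_-^t) for the support vectors, so the
# prediction is exactly the weighted kernel sum from this slide.
print("#support vectors:", len(svr.support_))
print("prediction at x=2.5:", svr.predict([[2.5]])[0])
```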

SLIDE 44

• Polynomial (quadratic) kernel
• Gaussian kernel

SLIDE 45

• Classification: SMO
• Regression: SMOreg
• Sequential Minimal Optimization (SMO)
• Kernels
  • Polynomial
  • RBF
  • String

SLIDE 46

• Seek the optimal separating hyperplane
• The support vector machine (SVM) finds this hyperplane using only the closest examples
• The kernel function allows the SVM to operate in higher dimensions
• Kernel regression
• Choosing the correct kernel is crucial
• Kernel machines are among the best-performing learners