CptS 570 – Machine Learning School of EECS Washington State University
Kernel machines, also known as support vector machines (SVMs)
Discriminant-based method: learn the class boundaries directly
The support vectors are the training examples closest to the boundary
A kernel computes the similarity between examples
Kernels map the data to a new space where (hopefully) linear models suffice
Choosing the right kernel is crucial
Kernel machines are among the best-performing learning methods
Using only hyperplanes, we are likely to underfit
But we can map the data to a new space via a nonlinear transformation x → Φ(x) and separate it with a hyperplane there
Note we want ≥ +1, not ≥ 0: instances should lie some distance from the hyperplane, not merely on the correct side
Given a sample $X = \{(x^t, r^t)\}$ where $r^t = +1$ if $x^t \in C_1$ and $r^t = -1$ if $x^t \in C_2$, find $w$ and $w_0$ such that

$$w^T x^t + w_0 \ge +1 \text{ for } r^t = +1$$
$$w^T x^t + w_0 \le -1 \text{ for } r^t = -1$$

which can be rewritten as

$$r^t (w^T x^t + w_0) \ge +1$$
Distance from instance xt to the hyperplane (below; the absolute value can be dropped using the label rt ∈ {−1, +1})
The distance from the hyperplane to the closest instances is the margin ρ
$$\frac{|w^T x^t + w_0|}{\|w\|} = \frac{r^t (w^T x^t + w_0)}{\|w\|}$$
The optimal separating hyperplane is the one that maximizes the margin
We want to choose w maximizing ρ such that the constraint below holds; since scaling w gives an infinite number of solutions, we fix ρ‖w‖ = 1 and choose the solution minimizing ‖w‖
$$\frac{r^t (w^T x^t + w_0)}{\|w\|} \ge \rho, \quad \forall t$$

$$\min \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad r^t (w^T x^t + w_0) \ge +1, \; \forall t$$
This is a quadratic optimization problem whose complexity depends on d, the number of input dimensions
The kernel will eventually map the d-dimensional input to a much higher-dimensional space
So we prefer a formulation whose complexity is not based on the number of dimensions
Convert the optimization problem to depend on N, the number of instances, instead of d
The optimization will then depend only on the instances closest to the hyperplane (the support vectors)
Rewrite the quadratic optimization problem using Lagrange multipliers αt ≥ 0
Minimize the primal Lp:
$$\min \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad r^t (w^T x^t + w_0) \ge +1, \; \forall t$$

$$L_p = \frac{1}{2}\|w\|^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t (w^T x^t + w_0) - 1 \right] = \frac{1}{2}\|w\|^2 - \sum_{t=1}^{N} \alpha^t r^t (w^T x^t + w_0) + \sum_{t=1}^{N} \alpha^t$$
Equivalently, we can maximize Lp subject to its derivatives with respect to w and w0 being zero (below); plugging the resulting conditions back into Lp yields the dual
$$\frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_{t=1}^{N} \alpha^t r^t x^t$$
$$\frac{\partial L_p}{\partial w_0} = 0 \;\Rightarrow\; \sum_{t=1}^{N} \alpha^t r^t = 0$$
Maximize the dual Ld with respect to the αt only
Complexity is O(N³)
$$L_d = \frac{1}{2} w^T w - w^T \sum_t \alpha^t r^t x^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s + \sum_t \alpha^t$$

subject to $\sum_t \alpha^t r^t = 0$ and $\alpha^t \ge 0, \; \forall t$
Most αt = 0
Support vectors: the xt such that αt > 0
w = Σt αt rt xt, and w0 = rt − wT xt for any support vector xt
The resulting discriminant is called the support vector machine
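To make the dual concrete, here is a minimal sketch (my own illustration, not the course's code), assuming scipy is available and using a made-up four-point dataset: it solves the dual with a generic QP solver and recovers w and w0 from the resulting αt.

```python
# A minimal sketch (not the lecture's code): solve the hard-margin dual
#   max_a  sum_t a_t - 1/2 sum_t sum_s a_t a_s r_t r_s (x_t . x_s)
#   s.t.   a_t >= 0,  sum_t a_t r_t = 0
# on a tiny toy dataset, then recover w and w0 from the support vectors.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])  # toy data
r = np.array([1.0, 1.0, -1.0, -1.0])                                # labels
G = (r[:, None] * r[None, :]) * (X @ X.T)       # G_ts = r_t r_s x_t . x_s

def neg_dual(a):                                # minimize -L_d
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, np.zeros(len(r)),
               bounds=[(0, None)] * len(r),                  # a_t >= 0
               constraints={"type": "eq", "fun": lambda a: a @ r})
a = res.x
w = (a * r) @ X                                 # w = sum_t a_t r_t x_t
sv = int(np.argmax(a))                          # index of a support vector
w0 = r[sv] - w @ X[sv]                          # w0 = r_t - w^T x_t
print("alpha:", a.round(3), " w:", w.round(3), " w0:", round(w0, 3))
```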
[Figure: maximum-margin hyperplane; circled points (O) are the support vectors defining the margin]
When the data are not linearly separable, find the hyperplane with the least error
Define slack variables ξt ≥ 0 storing each instance's deviation from the margin:
$$r^t (w^T x^t + w_0) \ge 1 - \xi^t$$
(a) Correctly classified example far from the margin (ξt = 0)
(b) Correctly classified example on the margin (ξt = 0)
(c) Correctly classified example, but inside the margin (0 < ξt < 1)
(d) Incorrectly classified example (ξt ≥ 1)
Soft error = Σt ξt
[Figure: soft-margin hyperplane; circled points (O) are the support vectors defining the margin]
Lagrangian with slack variables:
C is the penalty factor
μt ≥ 0 are a new set of Lagrange multipliers
We want to minimize Lp:
$$L_p = \frac{1}{2}\|w\|^2 + C \sum_t \xi^t - \sum_t \alpha^t \left[ r^t (w^T x^t + w_0) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t$$
Minimize Lp by setting its derivatives to zero
Plugging these back into Lp yields the dual Ld
Maximize Ld with respect to the αt
$$\frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_{t=1}^{N} \alpha^t r^t x^t$$
$$\frac{\partial L_p}{\partial w_0} = 0 \;\Rightarrow\; \sum_{t=1}^{N} \alpha^t r^t = 0$$
$$\frac{\partial L_p}{\partial \xi^t} = 0 \;\Rightarrow\; C - \alpha^t - \mu^t = 0$$
This is again a quadratic optimization problem
Support vectors are the instances with αt > 0
$$L_d = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s + \sum_t \alpha^t$$

subject to $\sum_t \alpha^t r^t = 0$ and $0 \le \alpha^t \le C, \; \forall t$
C is a regularization parameter: too large and the model fits the noise (overfit); too small and the margin constraints dominate (underfit)
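A brief sketch of C's effect (scikit-learn and a made-up noisy dataset assumed, not part of the lecture):

```python
# A brief sketch of the role of C on toy data: small C tolerates slack
# (wider margin, more support vectors); large C fits the data harder.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.8, size=60) > 0).astype(int)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # typically: fewer support vectors and higher training accuracy as C grows
    print(f"C={C}: {len(clf.support_)} SVs, train acc {clf.score(X, y):.2f}")
```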
To use the previous approaches, the data must be (mostly) linearly separable
If not, perhaps a transformation φ(x) to a new space will make them so
The φ(x) are called basis functions
Transform the d-dimensional x space to a k-dimensional z space: z = φ(x), i.e., zj = φj(x) for j = 1, …, k
Instead of w0, assume z1 = φ1(x) ≡ 1
$$g(x) = w^T z = w^T \phi(x) = \sum_{j=1}^{k} w_j \phi_j(x)$$
Replace the inner product of basis functions φ(xt)Tφ(xs) with a kernel function K(xt, xs):
$$L_p = \frac{1}{2}\|w\|^2 + C \sum_t \xi^t - \sum_t \alpha^t \left[ r^t w^T \phi(x^t) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t$$

$$L_d = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s \phi(x^t)^T \phi(x^s) + \sum_t \alpha^t$$

subject to $\sum_t \alpha^t r^t = 0$ and $0 \le \alpha^t \le C, \; \forall t$

Using the kernel:

$$L_d = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s K(x^t, x^s) + \sum_t \alpha^t$$
The kernel K(xt, xs) computes the z-space product φ(xt)Tφ(xs) directly in x-space
The matrix of kernel values K, where Kts = K(xt, xs), is called the Gram matrix
K should be symmetric and positive semidefinite
$$w = \sum_t \alpha^t r^t z^t = \sum_t \alpha^t r^t \phi(x^t)$$
$$g(x) = w^T \phi(x) = \sum_t \alpha^t r^t \phi(x^t)^T \phi(x) = \sum_t \alpha^t r^t K(x^t, x)$$
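As a concrete illustration, a tiny sketch (my own, with made-up support vectors and α values) of evaluating this discriminant and taking its sign:

```python
# A tiny sketch of evaluating the kernel discriminant
# g(x) = sum_t alpha_t r_t K(x_t, x); classify by the sign of g(x).
import numpy as np

def g(x, svs, labels, alphas, K):
    return sum(a * rt * K(xt, x) for a, rt, xt in zip(alphas, labels, svs))

linear = lambda u, v: float(u @ v)                      # K(u, v) = u^T v
svs = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]    # hypothetical SVs
print(np.sign(g(np.array([2.0, 0.5]), svs, [1, -1], [0.25, 0.25], linear)))
```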
Polynomial kernel of degree q (below); if q = 1, we just use the original features
For example, when q = 2 and d = 2, the kernel expands as shown below
$$K(x^t, x) = (x^T x^t + 1)^q$$
$$K(x, y) = (x^T y + 1)^2 = (x_1 y_1 + x_2 y_2 + 1)^2$$
$$= 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$$

which corresponds to the basis functions

$$\phi(x) = \left[ 1, \; \sqrt{2}\,x_1, \; \sqrt{2}\,x_2, \; \sqrt{2}\,x_1 x_2, \; x_1^2, \; x_2^2 \right]^T$$
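A quick numeric check of this identity (my own sketch, with arbitrary test vectors): the kernel computed in x-space matches the inner product of the expanded basis vectors.

```python
# Check that the q=2 polynomial kernel equals an inner product in phi space.
import numpy as np

def phi(x):                        # basis expansion for q = 2, d = 2
    x1, x2 = x
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     np.sqrt(2)*x1*x2, x1**2, x2**2])

x, y = np.array([0.5, -1.0]), np.array([2.0, 3.0])
print((x @ y + 1) ** 2, phi(x) @ phi(y))   # both print 1.0
```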
[Figure: decision boundary and margin for a polynomial kernel of degree 2; circled points (O) are the support vectors]
Radial basis functions (Gaussian kernel, below)
xt is the center and s is the radius
Larger s implies smoother boundaries
$$K(x^t, x) = \exp\left( -\frac{\|x - x^t\|^2}{2 s^2} \right)$$
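A minimal sketch (my own, toy vectors) showing how the similarity value grows with s for a fixed pair of points:

```python
# Gaussian (RBF) kernel: larger s makes similarity decay more slowly with
# distance, hence smoother decision boundaries.
import numpy as np

def rbf(x, xt, s):
    return np.exp(-np.sum((x - xt) ** 2) / (2 * s ** 2))

x, xt = np.array([0.0, 0.0]), np.array([1.0, 1.0])
for s in (0.5, 1.0, 2.0):
    print(s, round(rbf(x, xt, s), 4))   # 0.0183, 0.3679, 0.7788
```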
Sigmoidal functions
$$K(x^t, x) = \tanh(2 x^T x^t + 1)$$
The kernel K(x, y) increases with the similarity of x and y
Prior, application-specific knowledge can be included in the kernel
E.g., training examples are documents: define K from a document similarity measure (e.g., number of shared words)
E.g., training examples are strings (e.g., DNA): define K from the edit distance, the number of insertions, deletions, and/or substitutions needed to transform S1 into S2 (see the sketch below)
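For illustration, a minimal edit-distance computation of the kind such a string kernel could build on; turning the distance into a similarity, e.g. K = exp(−distance), is my illustration, not the lecture's definition.

```python
# Edit distance (insertions, deletions, substitutions) by dynamic programming.
def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1): d[i][0] = i        # delete all of s1's prefix
    for j in range(n + 1): d[0][j] = j        # insert all of s2's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,      # deletion
                          d[i][j-1] + 1,      # insertion
                          d[i-1][j-1] + cost) # substitution or match
    return d[m][n]

print(edit_distance("GATTACA", "GACTATA"))    # 2 substitutions
```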
E.g., training examples are nodes in a graph; possible kernels:
K(N1, N2) = 1 / (length of the shortest path between N1 and N2)
K(N1, N2) = number of paths connecting N1 and N2
Diffusion kernel
Training examples are graphs, not feature vectors; we compare the examples as structures
Compare substructures of the graphs:
K(G1, G2) = number of identical random walks in G1 and G2
K(G1, G2) = number of subgraphs shared by G1 and G2
Training data may come from multiple modalities
Construct new kernels by combining simpler ones
If K1(x, y) and K2(x, y) are valid kernels and c > 0 is a constant, then the following are also valid kernels:
$$K(x, y) = c\,K_1(x, y)$$
$$K(x, y) = K_1(x, y) + K_2(x, y)$$
$$K(x, y) = K_1(x, y)\,K_2(x, y)$$
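A small numeric sanity check (my own sketch on random toy data) that these combinations keep the Gram matrix positive semidefinite; the product kernel corresponds to the elementwise product of the two Gram matrices.

```python
# Verify PSD of scaled, summed, and (elementwise) multiplied Gram matrices.
import numpy as np

X = np.random.default_rng(0).normal(size=(5, 2))
K1 = X @ X.T                       # linear kernel Gram matrix
K2 = (X @ X.T + 1) ** 2            # polynomial (q=2) kernel Gram matrix
for K in (3.0 * K1, K1 + K2, K1 * K2):
    print(np.linalg.eigvalsh(K).min() >= -1e-9)   # True: PSD up to round-off
```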
Adaptive kernel combination: learn both the αt and the kernel weights ηi
$$K(x, y) = \sum_{i=1}^{m} \eta_i K_i(x, y)$$

$$L_d = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s \sum_i \eta_i K_i(x^t, x^s) + \sum_t \alpha^t$$

$$g(x) = \sum_t \alpha^t r^t \sum_i \eta_i K_i(x^t, x)$$
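A sketch (my own, with made-up weights) of a fixed-weight kernel combination; adaptive multiple kernel learning would additionally learn the ηi, which is not shown here.

```python
# Fixed-weight kernel combination K = sum_i eta_i * K_i(x, y).
import numpy as np

def combined_kernel(x, y, kernels, etas):
    return sum(eta * K(x, y) for K, eta in zip(kernels, etas))

lin = lambda u, v: float(u @ v)
poly = lambda u, v: (float(u @ v) + 1) ** 2
x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(combined_kernel(x, y, [lin, poly], [0.7, 0.3]))   # 0.7*(-1.5) + 0.3*0.25
```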
One-vs-all: learn K different kernel machines, each treating one class as positive and the remaining classes as negative
Classify a new instance x as class i = argmaxj gj(x)
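A sketch of this decision rule, assuming K per-class discriminants gj have already been trained; the toy scorers below are hypothetical stand-ins.

```python
# One-vs-all prediction: pick the class whose discriminant scores highest.
import numpy as np

def one_vs_all_predict(x, g_list):
    scores = [g(x) for g in g_list]    # g_j(x) for each class j
    return int(np.argmax(scores))      # i = argmax_j g_j(x)

g_list = [lambda x: -x[0],             # class 0 scorer (made up)
          lambda x: x[0] - 1.0,        # class 1 scorer (made up)
          lambda x: x[0] + x[1]]       # class 2 scorer (made up)
print(one_vs_all_predict(np.array([2.0, 1.0]), g_list))   # prints 2
```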
One-vs-one: learn K(K−1)/2 kernel machines, each treating one class as positive and another class as negative
Classify a new instance by combining the outputs of every pairwise kernel machine (e.g., by majority vote)
Alternatively, learn all the margins at once in a single optimization (below)
This gives K·N variables to optimize (expensive)
$$\min \frac{1}{2} \sum_{i=1}^{K} \|w_i\|^2 + C \sum_t \sum_{i \ne z^t} \xi_i^t$$

subject to

$$w_{z^t}^T x^t + w_{0 z^t} \ge w_i^T x^t + w_{0 i} + 2 - \xi_i^t, \quad \xi_i^t \ge 0, \; \forall t, \; i \ne z^t$$

where $z^t$ is the index of the class of $x^t$
Normally we would use squared error; for SVM regression, we use the ε-sensitive loss (below), which ignores errors smaller than ε
$$e(r^t, f(x^t)) = [r^t - f(x^t)]^2 \qquad \text{(squared error)}$$

$$e_\epsilon(r^t, f(x^t)) = \begin{cases} 0 & \text{if } |r^t - f(x^t)| < \epsilon \\ |r^t - f(x^t)| - \epsilon & \text{otherwise} \end{cases}$$
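A direct transcription of this loss as a small function (my own sketch):

```python
# Epsilon-sensitive loss: deviations smaller than epsilon cost nothing;
# larger ones are penalized linearly.
def eps_loss(r, f, eps):
    return max(0.0, abs(r - f) - eps)

print(eps_loss(1.0, 1.05, 0.1))   # 0.0  (inside the epsilon-tube)
print(eps_loss(1.0, 1.50, 0.1))   # 0.4  (outside the tube)
```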
Use slack variables to account for deviations larger than ε
ξ+t for positive deviations, ξ−t for negative deviations
$$\min \frac{1}{2}\|w\|^2 + C \sum_t (\xi_+^t + \xi_-^t)$$

subject to

$$r^t - (w^T x^t + w_0) \le \epsilon + \xi_+^t$$
$$(w^T x^t + w_0) - r^t \le \epsilon + \xi_-^t$$
$$\xi_+^t, \; \xi_-^t \ge 0$$
Maximizing the dual Ld with respect to α+t and α−t:

$$L_d = \sum_t r^t (\alpha_+^t - \alpha_-^t) - \epsilon \sum_t (\alpha_+^t + \alpha_-^t) - \frac{1}{2} \sum_t \sum_s (\alpha_+^t - \alpha_-^t)(\alpha_+^s - \alpha_-^s)(x^t)^T x^s$$

Non-support vectors (instances inside the ε-tube): α+t = α−t = 0
Support vectors on the boundary of the tube: 0 < α+t < C or 0 < α−t < C
Support vectors outside the tube: α+t = C or α−t = C
The fitted line f(x) is a weighted sum of the support vectors:

$$w = \sum_t (\alpha_+^t - \alpha_-^t)\, x^t$$
$$f(x) = w^T x + w_0 = \sum_t (\alpha_+^t - \alpha_-^t)\,(x^t)^T x + w_0$$

Average w0 over the support vectors on the boundary of the tube
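A sketch of ε-SVR in practice (scikit-learn and a toy 1-D dataset assumed, not the lecture's code), showing that the fit depends only on a subset of the training points:

```python
# Epsilon-SVR on toy data: the fitted f(x) uses only the support vectors.
import numpy as np
from sklearn.svm import SVR

X = np.linspace(0, 6, 40).reshape(-1, 1)
y = np.sin(X).ravel()
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(len(model.support_), "support vectors out of", len(X), "points")
print(model.predict([[1.5]]))     # close to sin(1.5) ~ 0.997
```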
Kernelized regression: replace (xt)T xs with a kernel K(xt, xs):

$$L_d = \sum_t r^t (\alpha_+^t - \alpha_-^t) - \epsilon \sum_t (\alpha_+^t + \alpha_-^t) - \frac{1}{2} \sum_t \sum_s (\alpha_+^t - \alpha_-^t)(\alpha_+^s - \alpha_-^s)\, K(x^t, x^s)$$

$$f(x) = \sum_t (\alpha_+^t - \alpha_-^t)\, K(x^t, x) + w_0$$

[Figures: SVM regression fits with a polynomial kernel and with a Gaussian kernel]
Classification: SMO; regression: SMOreg
Both are based on Sequential Minimal Optimization (SMO) and support a choice of the kernels described above
The support vector machine (SVM) seeks the optimal separating hyperplane, the one that maximizes the margin
The kernel function allows the SVM to operate in a new, higher-dimensional space without computing the basis functions explicitly
The same machinery extends to regression (kernel regression)
Choosing the correct kernel is crucial
Kernel machines are among the best-performing learning methods