Kernel Methods
CE-717: Machine Learning
Sharif University of Technology
Fall 2019
Soleymani
Not linearly separable data
2
} Noisy data or overlapping classes
  } (we discussed this earlier: soft margin)
} Nearly linearly separable
} Non-linear decision surface
} Transform to a new feature space
[Figure: data in the $(x_1, x_2)$ space, before and after transformation to a new feature space]
Nonlinear SVM
3
} Assume a transformation $\phi: \mathbb{R}^d \to \mathbb{R}^k$ on the feature space
  } $\mathbf{x} \to \phi(\mathbf{x})$
} Find a hyperplane in the transformed feature space:
$\mathbf{w}^T\phi(\mathbf{x}) + w_0 = 0$
$\{\phi_1(\mathbf{x}), \dots, \phi_k(\mathbf{x})\}$: set of basis functions (or features), $\phi_i(\mathbf{x}): \mathbb{R}^d \to \mathbb{R}$
$\phi(\mathbf{x}) = [\phi_1(\mathbf{x}), \dots, \phi_k(\mathbf{x})]$
[Figure: a nonlinear boundary in the $(x_1, x_2)$ space maps to a linear one in the $(\phi_1(\mathbf{x}), \phi_2(\mathbf{x}))$ space]
Soft-margin SVM in a transformed space: Primal problem
4
} Primal problem:
$\min_{\mathbf{w},\, w_0,\, \boldsymbol{\xi}} \quad \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i$
s.t. $y^{(i)}\left(\mathbf{w}^T\phi(\mathbf{x}^{(i)}) + w_0\right) \ge 1 - \xi_i, \quad i = 1, \dots, N$
$\qquad \xi_i \ge 0$
} $\mathbf{w} \in \mathbb{R}^k$: the weights that must be found
} If $k \gg d$ (very high-dimensional feature space), there are many more parameters to learn
Soft-margin SVM in a transformed space: Dual problem
5
} Optimization problem:
$\max_{\boldsymbol{\alpha}} \quad \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j\, y^{(i)}y^{(j)}\,\phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})$
} Subject to $\sum_{i=1}^{N}\alpha_i y^{(i)} = 0$
} $\quad 0 \le \alpha_i \le C, \quad i = 1, \dots, N$
} If we have the inner products $\phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})$, only $\boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_N]$ needs to be learned
  } it is not necessary to learn the $k$ parameters of $\mathbf{w}$, as opposed to the primal problem
Classifying a new data
6
$\hat{y} = \mathrm{sign}\left(w_0 + \mathbf{w}^T\phi(\mathbf{x})\right)$ where $\mathbf{w} = \sum_{\alpha_i \ne 0} \alpha_i y^{(i)}\phi(\mathbf{x}^{(i)})$
and $w_0 = y^{(s)} - \mathbf{w}^T\phi(\mathbf{x}^{(s)})$ for a support vector $\mathbf{x}^{(s)}$
Kernel SVM
7
} Learns linear decision boundary in a high dimension space
without explicitly working on the mapped data
} Let $\phi(\mathbf{x})^T\phi(\mathbf{x}') = k(\mathbf{x}, \mathbf{x}')$ (the kernel function)
} Example: $\mathbf{x} = [x_1, x_2]$ and second-order $\phi$:
$\phi(\mathbf{x}) = \left[1, x_1, x_2, x_1^2, x_2^2, x_1 x_2\right]$
$k(\mathbf{x}, \mathbf{x}') = 1 + x_1 x_1' + x_2 x_2' + x_1^2 x_1'^2 + x_2^2 x_2'^2 + x_1 x_2 x_1' x_2'$
Kernel trick
8
} Compute $k(\mathbf{x}, \mathbf{x}')$ without transforming $\mathbf{x}$ and $\mathbf{x}'$
} Example: consider $k(\mathbf{x}, \mathbf{x}') = \left(1 + \mathbf{x}^T\mathbf{x}'\right)^2 = \left(1 + x_1 x_1' + x_2 x_2'\right)^2$
$= 1 + 2x_1 x_1' + 2x_2 x_2' + x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2x_1 x_1' x_2 x_2'$
This is an inner product in:
$\phi(\mathbf{x}) = \left[1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, x_1^2, x_2^2, \sqrt{2}\,x_1 x_2\right]$
$\phi(\mathbf{x}') = \left[1, \sqrt{2}\,x_1', \sqrt{2}\,x_2', x_1'^2, x_2'^2, \sqrt{2}\,x_1' x_2'\right]$
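A quick numerical check of this identity (a minimal sketch in NumPy; the helper names phi and k are ours, not from the slides):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map matching the expansion above."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k(x, xp):
    """Degree-2 polynomial kernel (1 + x.x')^2, computed without phi."""
    return (1.0 + x @ xp) ** 2

x  = np.array([0.5, -1.0])
xp = np.array([2.0,  0.3])
assert np.isclose(phi(x) @ phi(xp), k(x, xp))  # same value, no explicit mapping
```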
Polynomial kernel: Degree two
9
} We instead use $k(\mathbf{x}, \mathbf{x}') = \left(\mathbf{x}^T\mathbf{x}' + 1\right)^2$, which corresponds to:
$\phi(\mathbf{x}) = \left[1, \sqrt{2}\,x_1, \dots, \sqrt{2}\,x_d, x_1^2, \dots, x_d^2, \sqrt{2}\,x_1 x_2, \dots, \sqrt{2}\,x_1 x_d, \sqrt{2}\,x_2 x_3, \dots, \sqrt{2}\,x_{d-1} x_d\right]^T$
for the $d$-dimensional input $\mathbf{x} = [x_1, \dots, x_d]^T$
Polynomial kernel
10
} This similarly generalizes to $d$-dimensional $\mathbf{x}$ and polynomials of order $M$:
$k(\mathbf{x}, \mathbf{x}') = \left(1 + \mathbf{x}^T\mathbf{x}'\right)^M = \left(1 + x_1 x_1' + x_2 x_2' + \dots + x_d x_d'\right)^M$
} Example: SVM boundary for a polynomial kernel (a code sketch follows this list)
  } $w_0 + \mathbf{w}^T\phi(\mathbf{x}) = 0$
  } $\Rightarrow w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)}\,\phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}) = 0$
  } $\Rightarrow w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)}\,k(\mathbf{x}^{(i)}, \mathbf{x}) = 0$
  } $\Rightarrow w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)}\left(1 + \mathbf{x}^{(i)T}\mathbf{x}\right)^M = 0$
The boundary is a polynomial of order $M$
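A minimal sketch of fitting such a boundary, assuming scikit-learn is available (the library, the synthetic dataset, and the parameter values are our choices, not the slides'):

```python
# A degree-2 polynomial-kernel SVM on data with a circular (degree-2) boundary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 > 1.0, 1, -1)  # circular decision rule

# scikit-learn's poly kernel is k(x, x') = (gamma * x.x' + coef0)^degree
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1.0)
clf.fit(X, y)
print(clf.score(X, y), "support vectors per class:", clf.n_support_)
```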
Why kernel?
11
} Kernel functions $k$ can indeed be computed efficiently, with a cost proportional to $d$ (the dimensionality of the input) instead of $k$ (the dimensionality of the feature space)
} Example: consider the second-order polynomial transform:
$\phi(\mathbf{x}) = \left[1, x_1, \dots, x_d, x_1^2, x_1 x_2, \dots, x_d x_d\right]^T$
$\phi(\mathbf{x})^T\phi(\mathbf{x}') = 1 + \sum_{i=1}^{d} x_i x_i' + \sum_{i=1}^{d}\sum_{j=1}^{d} x_i x_j x_i' x_j'$
Since $\sum_{i=1}^{d} x_i x_i' \times \sum_{j=1}^{d} x_j x_j' = \left(\mathbf{x}^T\mathbf{x}'\right)^2$, this simplifies to
$\phi(\mathbf{x})^T\phi(\mathbf{x}') = 1 + \mathbf{x}^T\mathbf{x}' + \left(\mathbf{x}^T\mathbf{x}'\right)^2$
Here $\phi(\mathbf{x})$ has $k = 1 + d + d^2$ components, so the explicit inner product costs $O(d^2)$ while the kernel form costs $O(d)$
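A small sanity check of this equivalence, and of the size gap it avoids (a sketch; the dimension and the random inputs are arbitrary choices of ours):

```python
import numpy as np

d = 500
rng = np.random.default_rng(1)
x, xp = rng.normal(size=d), rng.normal(size=d)

def phi(v):
    # [1, v_1..v_d, all products v_i * v_j] -> 1 + d + d^2 features
    return np.concatenate(([1.0], v, np.outer(v, v).ravel()))

explicit = phi(x) @ phi(xp)            # builds 1 + d + d^2 = 250,501 features
kernel   = 1.0 + x @ xp + (x @ xp)**2  # never leaves R^d
assert np.isclose(explicit, kernel)
```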
Gaussian or RBF kernel
12
} If $k(\mathbf{x}, \mathbf{x}')$ is an inner product in some transformed space of $\mathbf{x}$, it is a valid kernel
} $k(\mathbf{x}, \mathbf{x}') = \exp\left(-\dfrac{\|\mathbf{x} - \mathbf{x}'\|^2}{\sigma^2}\right)$
} Take the one-dimensional case with $\sigma = 1$:
$k(x, x') = \exp\left(-(x - x')^2\right) = \exp(-x^2)\exp(-x'^2)\exp(2xx') = \exp(-x^2)\exp(-x'^2)\sum_{k=0}^{\infty}\frac{2^k x^k x'^k}{k!}$
so the corresponding feature space is infinite-dimensional
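The series gives an explicit (truncated) feature map; the sketch below shows how quickly it matches the closed form. The truncation level K and the test points are our choices:

```python
import numpy as np
from math import factorial

def rbf(x, xp):
    return np.exp(-(x - xp) ** 2)

def phi_truncated(x, K=20):
    # phi_k(x) = exp(-x^2) * sqrt(2^k / k!) * x^k, for k = 0..K-1
    ks = np.arange(K)
    norms = np.sqrt(2.0**ks / np.array([factorial(k) for k in ks]))
    return np.exp(-x**2) * norms * x**ks

x, xp = 0.7, -0.4
approx = phi_truncated(x) @ phi_truncated(xp)
print(rbf(x, xp), approx)  # nearly identical for K = 20
```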
Some common kernel functions
13
} Linear: $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T\mathbf{x}'$
} Polynomial: $k(\mathbf{x}, \mathbf{x}') = \left(\mathbf{x}^T\mathbf{x}' + 1\right)^M$
} Gaussian: $k(\mathbf{x}, \mathbf{x}') = \exp\left(-\dfrac{\|\mathbf{x} - \mathbf{x}'\|^2}{\sigma^2}\right)$
} Sigmoid: $k(\mathbf{x}, \mathbf{x}') = \tanh\left(a\,\mathbf{x}^T\mathbf{x}' + b\right)$
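The same four kernels as plain NumPy functions (a minimal sketch; the default parameter values are ours):

```python
import numpy as np

def linear(x, xp):
    return x @ xp

def polynomial(x, xp, M=2):
    return (x @ xp + 1.0) ** M

def gaussian(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / sigma**2)

def sigmoid(x, xp, a=1.0, b=0.0):
    # Note: this one is a valid (PSD) kernel only for some choices of a and b.
    return np.tanh(a * (x @ xp) + b)
```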
Kernel formulation of SVM
14
} Optimization problem:
$\max_{\boldsymbol{\alpha}} \quad \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j\, y^{(i)}y^{(j)}\,k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$
} Subject to $\sum_{i=1}^{N}\alpha_i y^{(i)} = 0$
} $\quad 0 \le \alpha_i \le C, \quad i = 1, \dots, N$
The quadratic term is built from the matrix
$\mathbf{Q} = \begin{bmatrix} y^{(1)}y^{(1)}k(\mathbf{x}^{(1)}, \mathbf{x}^{(1)}) & \cdots & y^{(1)}y^{(N)}k(\mathbf{x}^{(1)}, \mathbf{x}^{(N)}) \\ \vdots & \ddots & \vdots \\ y^{(N)}y^{(1)}k(\mathbf{x}^{(N)}, \mathbf{x}^{(1)}) & \cdots & y^{(N)}y^{(N)}k(\mathbf{x}^{(N)}, \mathbf{x}^{(N)}) \end{bmatrix}$
Classifying a new data
15
$\hat{y} = \mathrm{sign}\left(w_0 + \mathbf{w}^T\phi(\mathbf{x})\right)$ where $\mathbf{w} = \sum_{\alpha_i \ne 0} \alpha_i y^{(i)}\phi(\mathbf{x}^{(i)})$
and $w_0 = y^{(s)} - \mathbf{w}^T\phi(\mathbf{x}^{(s)})$
In kernel form:
$\hat{y} = \mathrm{sign}\left(w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)}\,k(\mathbf{x}^{(i)}, \mathbf{x})\right)$
$w_0 = y^{(s)} - \sum_{\alpha_i \ne 0} \alpha_i y^{(i)}\,k(\mathbf{x}^{(i)}, \mathbf{x}^{(s)})$
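A minimal sketch of this prediction rule; the trained quantities (alphas, ys, the support vectors Xs, w0) and the kernel function are assumed given:

```python
import numpy as np

def predict(x, Xs, ys, alphas, w0, kernel):
    # Only support vectors and kernel evaluations are needed -- never phi(x).
    s = w0 + sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, ys, Xs))
    return np.sign(s)
```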
Gaussian kernel
16
} Example: SVM boundary for a Gaussian kernel
} Places a Gaussian function around each support vector:
} $w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)}\exp\left(-\dfrac{\|\mathbf{x} - \mathbf{x}^{(i)}\|^2}{\sigma^2}\right) = 0$
} An SVM with a Gaussian kernel can classify any arbitrary training set
  } training error goes to zero as $\sigma \to 0$
  } all samples become support vectors (likely overfitting)
Hard-margin example
17
} For a narrow Gaussian (small $\sigma$), even the protection of a large margin cannot suppress overfitting.
[Y. Abu-Mostafa et al., 2012]
SVM Gaussian kernel: Example
18
This example has been adapted from Zisserman's slides.
$f(\mathbf{x}) = w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)}\exp\left(-\dfrac{\|\mathbf{x} - \mathbf{x}^{(i)}\|^2}{2\sigma^2}\right)$
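A sketch of the overfitting regime described above, assuming scikit-learn (whose RBF kernel is parameterized by $\gamma$, which grows as $\sigma$ shrinks); the data and parameter values are ours:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=100) > 0, 1, -1)  # noisy labels

# Shrinking sigma (growing gamma) drives training error to zero while
# turning most samples into support vectors.
for gamma in [0.1, 1.0, 100.0]:
    clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)
    print(f"gamma={gamma:6.1f}  train acc={clf.score(X, y):.2f}  "
          f"#SV={clf.support_.size}")
```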
SVM Gaussian kernel: Example
19-24
[Figures only on these slides: decision boundaries of the Gaussian-kernel SVM for different parameter settings; adapted from Zisserman's slides]
Kernel trick: Idea
25
} Kernel trick → extension of many well-known algorithms to kernel-based ones
} by substituting the dot product with the kernel function:
} $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')$
} $k(\mathbf{x}, \mathbf{x}')$ is the dot product of $\mathbf{x}$ and $\mathbf{x}'$ in the transformed space
} Idea: when the input vectors appear only in the form of dot products, we can use the kernel trick
  } solving the problem without explicitly mapping the data
  } explicit mapping is expensive if $\phi(\mathbf{x})$ is very high-dimensional
Kernel trick: Idea (Contโd)
26
} Instead of using a mapping $\phi: \mathcal{X} \to \mathcal{F}$ to represent $\mathbf{x} \in \mathcal{X}$ by $\phi(\mathbf{x}) \in \mathcal{F}$, a kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is used.
} We specify only an inner product function between points in the
transformed space (not their coordinates)
} In many cases, the inner product in the embedding space can be
computed efficiently.
Constructing kernels
27
} Construct kernel functions directly
} Ensure that it is a valid kernel
} Corresponds to an inner product in some feature space.
} Example: $k(\mathbf{x}, \mathbf{x}') = \left(\mathbf{x}^T\mathbf{x}'\right)^2$
  } Corresponding mapping: $\phi(\mathbf{x}) = \left[x_1^2, \sqrt{2}\,x_1 x_2, x_2^2\right]^T$ for $\mathbf{x} = [x_1, x_2]^T$
} We need a way to test whether a kernel is valid without
having to construct $\phi(\mathbf{x})$
Construct Valid Kernels
28
Given valid kernels $k_1(\mathbf{x}, \mathbf{x}')$ and $k_2(\mathbf{x}, \mathbf{x}')$, the following are also valid kernels:
} $k(\mathbf{x}, \mathbf{x}') = c\,k_1(\mathbf{x}, \mathbf{x}')$, where $c > 0$
} $k(\mathbf{x}, \mathbf{x}') = f(\mathbf{x})\,k_1(\mathbf{x}, \mathbf{x}')\,f(\mathbf{x}')$, where $f(\cdot)$ is any function
} $k(\mathbf{x}, \mathbf{x}') = q\left(k_1(\mathbf{x}, \mathbf{x}')\right)$, where $q(\cdot)$ is a polynomial with coefficients $\ge 0$
} $k(\mathbf{x}, \mathbf{x}') = \exp\left(k_1(\mathbf{x}, \mathbf{x}')\right)$
} $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$
} $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}')\,k_2(\mathbf{x}, \mathbf{x}')$
} $k(\mathbf{x}, \mathbf{x}') = k_3\left(\phi(\mathbf{x}), \phi(\mathbf{x}')\right)$, where $\phi(\mathbf{x})$ is a function from $\mathbf{x}$ to $\mathbb{R}^M$ and $k_3(\cdot, \cdot)$ is a valid kernel in $\mathbb{R}^M$
} $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T\mathbf{A}\mathbf{x}'$, where $\mathbf{A}$ is a symmetric positive semi-definite matrix
} $k(\mathbf{x}, \mathbf{x}') = k_a(\mathbf{x}_a, \mathbf{x}_a') + k_b(\mathbf{x}_b, \mathbf{x}_b')$ and $k(\mathbf{x}, \mathbf{x}') = k_a(\mathbf{x}_a, \mathbf{x}_a')\,k_b(\mathbf{x}_b, \mathbf{x}_b')$, where $\mathbf{x}_a$ and $\mathbf{x}_b$ are variables (not necessarily disjoint) with $\mathbf{x} = (\mathbf{x}_a, \mathbf{x}_b)$, and $k_a$ and $k_b$ are valid kernel functions over their respective spaces. [Bishop]
Valid kernel: Necessary & sufficient conditions
29
} Gram matrix $\mathbf{K}_{N \times N}$: $K_{ij} = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$
} Restricting the kernel function to a set of points $\{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(N)}\}$:
$\mathbf{K} = \begin{bmatrix} k(\mathbf{x}^{(1)}, \mathbf{x}^{(1)}) & \cdots & k(\mathbf{x}^{(1)}, \mathbf{x}^{(N)}) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}^{(N)}, \mathbf{x}^{(1)}) & \cdots & k(\mathbf{x}^{(N)}, \mathbf{x}^{(N)}) \end{bmatrix}$
} Mercer's theorem: the kernel is valid if and only if the kernel matrix is symmetric positive semi-definite, for any choice of data points
} Any symmetric positive semi-definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space
[Shawe-Taylor & Cristianini 2004]
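An empirical check of this condition on a finite sample (a sketch; note it can refute validity on that sample, but a single passing sample does not prove validity in general):

```python
import numpy as np

def gram(kernel, X):
    """Gram matrix K_ij = k(x_i, x_j) for the rows of X."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

K = gram(lambda a, b: np.exp(-np.sum((a - b)**2)), X)  # Gaussian kernel
assert np.allclose(K, K.T)                              # symmetric
print("min eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to round-off
```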
Extending linear methods to kernelized ones
30
} Kernelized versions of linear methods
} Linear methods are well studied:
  } unique optimal solutions, faster learning algorithms, and better analysis
} However, we often require nonlinear methods in real-world problems, so we can use kernel-based versions of these linear algorithms
} Replacing inner products with kernels in linear algorithms โ
very flexible methods
} We can operate in the mapped space without ever computing the
coordinates of the data in that space
Example: kernelized minimum distance classifier
31
} If $\|\mathbf{x} - \boldsymbol{\mu}_1\| < \|\mathbf{x} - \boldsymbol{\mu}_2\|$, then assign $\mathbf{x}$ to class $\mathcal{C}_1$:
$(\mathbf{x} - \boldsymbol{\mu}_1)^T(\mathbf{x} - \boldsymbol{\mu}_1) < (\mathbf{x} - \boldsymbol{\mu}_2)^T(\mathbf{x} - \boldsymbol{\mu}_2)$
$-2\mathbf{x}^T\boldsymbol{\mu}_1 + \boldsymbol{\mu}_1^T\boldsymbol{\mu}_1 < -2\mathbf{x}^T\boldsymbol{\mu}_2 + \boldsymbol{\mu}_2^T\boldsymbol{\mu}_2$
With each mean expressed through its class samples, $\boldsymbol{\mu}_c = \frac{1}{N_c}\sum_{i \in \mathcal{C}_c}\mathbf{x}^{(i)}$:
$-\frac{2}{N_1}\sum_{i \in \mathcal{C}_1}\mathbf{x}^T\mathbf{x}^{(i)} + \frac{1}{N_1^2}\sum_{i \in \mathcal{C}_1}\sum_{j \in \mathcal{C}_1}\mathbf{x}^{(i)T}\mathbf{x}^{(j)} < -\frac{2}{N_2}\sum_{i \in \mathcal{C}_2}\mathbf{x}^T\mathbf{x}^{(i)} + \frac{1}{N_2^2}\sum_{i \in \mathcal{C}_2}\sum_{j \in \mathcal{C}_2}\mathbf{x}^{(i)T}\mathbf{x}^{(j)}$
Replacing every inner product with the kernel:
$-\frac{2}{N_1}\sum_{i \in \mathcal{C}_1}k(\mathbf{x}, \mathbf{x}^{(i)}) + \frac{1}{N_1^2}\sum_{i \in \mathcal{C}_1}\sum_{j \in \mathcal{C}_1}k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) < -\frac{2}{N_2}\sum_{i \in \mathcal{C}_2}k(\mathbf{x}, \mathbf{x}^{(i)}) + \frac{1}{N_2^2}\sum_{i \in \mathcal{C}_2}\sum_{j \in \mathcal{C}_2}k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$
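A direct implementation of this rule (a minimal sketch; the function and argument names are ours). The common term $k(\mathbf{x}, \mathbf{x})$ cancels from both sides, so it is omitted:

```python
def min_dist_classify(x, X1, X2, kernel):
    """Assign x to class 1 or 2 by distance to the class mean in feature space,
    computed entirely through kernel evaluations."""
    def score(Xc):
        N = len(Xc)
        cross  = sum(kernel(x, xi) for xi in Xc)
        within = sum(kernel(xi, xj) for xi in Xc for xj in Xc)
        return -2.0 / N * cross + within / N**2
    return 1 if score(X1) < score(X2) else 2
```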
Which information can be obtained from kernel?
32
} Example: we know all pairwise distances in the feature space:
$\left\|\phi(\mathbf{x}) - \phi(\mathbf{z})\right\|^2 = k(\mathbf{x}, \mathbf{x}) + k(\mathbf{z}, \mathbf{z}) - 2k(\mathbf{x}, \mathbf{z})$
} Therefore, we also know the distance of points from the center of mass of a set
} Many dimensionality reduction, clustering, and classification methods can be described in terms of pairwise distances
} This allows us to introduce kernelized versions of them
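A short sketch recovering all pairwise squared feature-space distances from the Gram matrix alone:

```python
import numpy as np

def pairwise_sq_dists(K):
    # ||phi(x_i) - phi(x_j)||^2 = K_ii + K_jj - 2 K_ij
    d = np.diag(K)
    return d[:, None] + d[None, :] - 2.0 * K
```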
Example: Kernel ridge regression
33
$\min_{\mathbf{w}} \quad \sum_{i=1}^{N}\left(\mathbf{w}^T\mathbf{x}^{(i)} - y^{(i)}\right)^2 + \lambda\,\mathbf{w}^T\mathbf{w}$
Setting the gradient to zero:
$\sum_{i=1}^{N} 2\mathbf{x}^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} - y^{(i)}\right) + 2\lambda\mathbf{w} = \mathbf{0} \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{N}\alpha_i\mathbf{x}^{(i)}, \quad \alpha_i = -\frac{1}{\lambda}\left(\mathbf{w}^T\mathbf{x}^{(i)} - y^{(i)}\right)$
so the optimal weights are a linear combination of the training points
Example: Kernel ridge regression (Contโd)
34
$\min_{\mathbf{w}} \quad \sum_{i=1}^{N}\left(\mathbf{w}^T\phi(\mathbf{x}^{(i)}) - y^{(i)}\right)^2 + \lambda\,\mathbf{w}^T\mathbf{w}$
} Dual representation, with $\mathbf{w} = \boldsymbol{\Phi}^T\boldsymbol{\alpha}$ and Gram matrix $\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^T$:
$J(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^T\boldsymbol{\Phi}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\boldsymbol{\Phi}^T\boldsymbol{\alpha} - 2\boldsymbol{\alpha}^T\boldsymbol{\Phi}\boldsymbol{\Phi}^T\mathbf{y} + \mathbf{y}^T\mathbf{y} + \lambda\,\boldsymbol{\alpha}^T\boldsymbol{\Phi}\boldsymbol{\Phi}^T\boldsymbol{\alpha}$
$J(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^T\mathbf{K}\mathbf{K}\boldsymbol{\alpha} - 2\boldsymbol{\alpha}^T\mathbf{K}\mathbf{y} + \mathbf{y}^T\mathbf{y} + \lambda\,\boldsymbol{\alpha}^T\mathbf{K}\boldsymbol{\alpha}$
$\nabla_{\boldsymbol{\alpha}}J(\boldsymbol{\alpha}) = \mathbf{0} \;\Rightarrow\; \boldsymbol{\alpha} = \left(\mathbf{K} + \lambda\mathbf{I}_N\right)^{-1}\mathbf{y}$
$\mathbf{w} = \sum_{i=1}^{N}\alpha_i\,\phi(\mathbf{x}^{(i)})$
Example: Kernel ridge regression (Contโd)
35
} Prediction for a new $\mathbf{x}$:
$f(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x}) = \boldsymbol{\alpha}^T\boldsymbol{\Phi}\,\phi(\mathbf{x}) = \begin{bmatrix} k(\mathbf{x}^{(1)}, \mathbf{x}) \\ \vdots \\ k(\mathbf{x}^{(N)}, \mathbf{x}) \end{bmatrix}^T \left(\mathbf{K} + \lambda\mathbf{I}_N\right)^{-1}\mathbf{y}$
where $\mathbf{w} = \boldsymbol{\Phi}^T\boldsymbol{\alpha}$
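The closed form above in a few lines of NumPy (a sketch; the function and variable names are ours):

```python
import numpy as np

def fit(X, y, kernel, lam=1.0):
    """Kernel ridge regression: alpha = (K + lam I)^-1 y."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(x, X, alpha, kernel):
    kx = np.array([kernel(xi, x) for xi in X])  # vector [k(x^(i), x)]_i
    return kx @ alpha
```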
Kernels for structured data
36
} Kernels can also be defined on general types of data
} Kernel functions do not need to be defined over vectors:
  } we just need a symmetric positive semi-definite Gram matrix
} Thus, many algorithms can work with general (non-vectorial) data
} Kernels exist to embed strings, trees, graphs, …
} This may be even more important than nonlinearity:
  } kernel-based versions of classical learning algorithms for recognition of structured data
Kernel function for objects
37
} Sets: an example of a kernel function for sets:
$k(A_1, A_2) = 2^{|A_1 \cap A_2|}$
} Strings: the inner product of the feature vectors for two strings can be defined as,
  } e.g., a sum over all common subsequences, weighted according to their frequency of occurrence and lengths
[Figure: two example character strings with common subsequences highlighted]
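A sketch of the set kernel above. It equals the number of subsets shared by the two sets, i.e., an inner product over subset-indicator features, so it is a valid kernel:

```python
def set_kernel(A1: set, A2: set) -> int:
    # Each feature indicates "contains subset U"; the inner product counts
    # subsets of both A1 and A2, which is 2^{|A1 & A2|}.
    return 2 ** len(A1 & A2)

print(set_kernel({"a", "b", "c"}, {"b", "c", "d"}))  # 2^2 = 4
```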
Kernel trick advantages: summary
38
} Operating in the mapped space without ever computing the
coordinates of the data in that space
} Besides vectors, we can introduce kernel functions for
structured data (graphs, strings, etc.)
} Much of the geometry of the data in the embedding space is
contained in all pairwise dot products
} In many cases, inner product in the embedding space can be
computed efficiently.
Resources
39
} C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
} J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
} Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning From Data, AMLBook, 2012.