Machine learning theory
Kernel methods
Hamid Beigy
Sharif University of Technology
April 20, 2020

Table of contents
1. Motivation
2. Kernel methods
3. Basic kernel operations in feature space
4. Kernel-based algorithms
5. Summary
◮ Most learning algorithms are linear and cannot classify non-linearly-separable data.
◮ How do you separate these two classes?
◮ Linear separation is impossible in most problems.
◮ Non-linear mapping from the input space to a high-dimensional feature space: φ : X → H.
◮ Generalization ability: independent of dim(H); it depends only on the margin ρ and the sample size m.
◮ Most datasets are not linearly separable; for example:
◮ Instances that are not linearly separable in ℝ may become linearly separable in ℝ² by using a mapping.
◮ In this case, we have two solutions:
  ◮ Increase the dimensionality of the dataset by introducing a mapping φ.
  ◮ Use a more complex model for the classifier.
◮ To classify the non-linearly-separable dataset, we use a mapping φ.
◮ For example, let $x = (x_1, x_2)^T$, $z = (z_1, z_2, z_3)^T$, and φ : ℝ² → ℝ³.
◮ If we use the mapping $z = \phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)^T$, the dataset becomes linearly separable in ℝ³.
◮ Mapping a dataset to higher dimensions has two major problems:
  ◮ In high dimensions, there is a risk of over-fitting.
  ◮ In high dimensions, the computational cost is higher.
◮ The generalization capability in higher dimensions is ensured by using large-margin classifiers.
◮ The mapping is an implicit mapping, not an explicit one.
◮ Kernel methods avoid explicitly transforming each point x in the input space into the mapped feature space.
◮ Instead, the inputs are represented via their m × m pairwise similarity values.
◮ The similarity function, called a kernel, is chosen so that it represents a dot product in some high-dimensional feature space.
◮ The kernel can be computed without directly constructing φ.
◮ The pairwise similarity values between the points in S are represented via the m × m kernel matrix K, whose (i, j)-th entry is $K(x_i, x_j)$.
◮ The function $K(x_i, x_j)$ is called the kernel function and is defined as
  $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle.$
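As a quick illustration (not from the slides), a minimal NumPy sketch that builds the m × m kernel matrix from pairwise kernel evaluations only; the function name kernel_matrix and the toy data are illustrative choices:

    import numpy as np

    def kernel_matrix(X, kernel):
        # Build the m x m matrix K with K[i, j] = kernel(x_i, x_j).
        m = X.shape[0]
        K = np.empty((m, m))
        for i in range(m):
            for j in range(m):
                K[i, j] = kernel(X[i], X[j])
        return K

    # Toy data and the quadratic kernel K(x, z) = <x, z>^2.
    X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
    K = kernel_matrix(X, lambda x, z: np.dot(x, z) ** 2)
    print(K)  # 3 x 3, symmetric, with K[i, i] = ||x_i||^4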
◮ Let φ : ℝ² → ℝ³ be defined as $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)^T$.
◮ Then $\langle \phi(x), \phi(z) \rangle$ equals
  $\langle (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)^T, (z_1^2, z_2^2, \sqrt{2}\, z_1 z_2)^T \rangle = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = (x_1 z_1 + x_2 z_2)^2 = \langle x, z \rangle^2.$
◮ The above mapping can therefore be described by the kernel function $K(x, z) = \langle x, z \rangle^2$, evaluated directly in the input space.
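A small numerical check of this identity (a sketch; the points x and z are arbitrary choices of ours):

    import numpy as np

    def phi(x):
        # Explicit feature map phi : R^2 -> R^3 for the quadratic kernel.
        return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])

    print(np.dot(phi(x), phi(z)))  # inner product in feature space: 1.0
    print(np.dot(x, z) ** 2)       # kernel evaluated in input space: 1.0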
◮ Let φ₁ : ℝ² → ℝ³ be defined as $\phi_1(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)^T$.
◮ Then $\langle \phi_1(x), \phi_1(z) \rangle$ equals
  $\langle (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2), (z_1^2, z_2^2, \sqrt{2}\, z_1 z_2) \rangle = (x_1 z_1 + x_2 z_2)^2 = \langle x, z \rangle^2.$
◮ Let φ₂ : ℝ² → ℝ⁴ be defined as $\phi_2(x) = (x_1^2, x_2^2, x_1 x_2, x_2 x_1)^T$.
◮ Then $\langle \phi_2(x), \phi_2(z) \rangle$ equals
  $\langle (x_1^2, x_2^2, x_1 x_2, x_2 x_1), (z_1^2, z_2^2, z_1 z_2, z_2 z_1) \rangle = (x_1 z_1 + x_2 z_2)^2 = \langle x, z \rangle^2.$
◮ Hence two different feature maps can induce the same kernel.
◮ The feature space can grow really large really quickly.
◮ Let K be the kernel $K(x, z) = \langle x, z \rangle^d = \langle \phi(x), \phi(z) \rangle$ for $x, z \in \mathbb{R}^n$.
◮ The dimension of the feature space (the number of degree-d monomials in n variables) equals
  $\binom{n + d - 1}{d}.$
◮ For n = 100 and d = 6, there are about 1.6 billion terms.
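A quick check of this count (a sketch; math.comb requires Python 3.8+). Note that evaluating $K(x, z) = \langle x, z \rangle^d$ directly only costs O(n), regardless of this dimension:

    from math import comb

    n, d = 100, 6
    print(comb(n + d - 1, d))  # 1609344100 distinct degree-6 monomials, about 1.6 billion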
◮ The kernel methods have the following benefits.
◮ This theorem states that $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a valid kernel if and only if, for every finite set of points, the kernel matrix K is symmetric positive semi-definite (PSD).
◮ Suppose $x, z \in \mathbb{R}^n$ and consider the following kernel:
  $K(x, z) = \langle x, z \rangle^2.$
◮ It is a valid kernel because
  $K(x, z) = \Big(\sum_{i=1}^{n} x_i z_i\Big)\Big(\sum_{j=1}^{n} x_j z_j\Big) = \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i x_j)(z_i z_j) = \langle \phi(x), \phi(z) \rangle,$
  where φ(x) is the vector whose entries are $x_i x_j$ for all $i, j \in \{1, \dots, n\}$.
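An empirical version of the PSD condition (a sketch with our own random data; the tolerance accounts for floating-point round-off): build the kernel matrix of the quadratic kernel and check that its eigenvalues are nonnegative.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 5))      # 20 random points in R^5

    K = (X @ X.T) ** 2                # kernel matrix of K(x, z) = <x, z>^2
    eigvals = np.linalg.eigvalsh(K)   # eigenvalues of the symmetric matrix K
    print(eigvals.min() >= -1e-8)     # True: K is (numerically) positive semi-definite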
◮ Consider the polynomial kernel $K(x, z) = (\langle x, z \rangle + c)^d$ for all $x, z \in \mathbb{R}^n$.
◮ For n = 2 and d = 2,
  $K(x, z) = (x_1 z_1 + x_2 z_2 + c)^2 = \big\langle (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2, \sqrt{2c}\, x_1, \sqrt{2c}\, x_2, c),\ (z_1^2, z_2^2, \sqrt{2}\, z_1 z_2, \sqrt{2c}\, z_1, \sqrt{2c}\, z_2, c) \big\rangle.$
◮ The original data (left) is not linearly separable, but the mapped data (right) is.
◮ Some valid kernel functions (see the sketch after this list):
◮ Polynomial kernels: consider the kernel defined by $K(x, z) = (\langle x, z \rangle + c)^d$.
◮ Radial basis function (RBF) kernels: consider the kernel defined by $K(x, z) = \exp\!\left(-\dfrac{\|x - z\|^2}{2\sigma^2}\right)$.
◮ Sigmoid kernel: consider the kernel defined by $K(x, z) = \tanh(a \langle x, z \rangle + b)$.
◮ Homework:
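A minimal sketch of these three kernel functions (the default parameter values c, d, σ, a, b below are our own illustrative choices):

    import numpy as np

    def polynomial_kernel(x, z, c=1.0, d=3):
        return (np.dot(x, z) + c) ** d

    def rbf_kernel(x, z, sigma=1.0):
        return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

    def sigmoid_kernel(x, z, a=1.0, b=0.0):
        # Note: the sigmoid kernel is not PSD for every choice of a and b.
        return np.tanh(a * np.dot(x, z) + b)

    x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    print(polynomial_kernel(x, z), rbf_kernel(x, z), sigmoid_kernel(x, z))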
◮ We give the crucial property of PDS kernels, which is to induce an inner product in a Hilbert space.
◮ This theorem implies that PDS kernels can be used to implicitly define a feature space.
◮ For any kernel K, we can associate a normalized kernel $K_n$ defined by
  $K_n(x, x') = \begin{cases} 0 & \text{if } K(x, x) = 0 \text{ or } K(x', x') = 0, \\ \dfrac{K(x, x')}{\sqrt{K(x, x)\, K(x', x')}} & \text{otherwise.} \end{cases}$
◮ If K is PDS, then $K_n$ is PDS: for any $c_1, \dots, c_m \in \mathbb{R}$ we have $\sum_{i,j=1}^{m} c_i c_j K_n(x_i, x_j) \geq 0$, since
  $\sum_{i,j=1}^{m} c_i c_j K_n(x_i, x_j) = \sum_{i,j=1}^{m} \frac{c_i c_j \langle \Phi(x_i), \Phi(x_j) \rangle_{\mathbb{H}}}{\|\Phi(x_i)\|_{\mathbb{H}} \|\Phi(x_j)\|_{\mathbb{H}}} = \left\| \sum_{i=1}^{m} \frac{c_i \Phi(x_i)}{\|\Phi(x_i)\|_{\mathbb{H}}} \right\|_{\mathbb{H}}^2 \geq 0.$
◮ The following theorem provides closure guarantees for all of these operations: PDS kernels are closed under sum, product, tensor product, pointwise limit, and composition with a power series $\sum_{k=1}^{\infty} a_k x^k$ with $a_k \geq 0$ for all $k \in \mathbb{N}$ (see the sketch after this list).
◮ Homework:
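An empirical illustration of two of these closures (a sketch with our own toy data; the tolerance accounts for floating-point round-off): the sum and the elementwise (Schur) product of two PSD kernel matrices are again PSD.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(15, 4))

    K1 = X @ X.T                                           # linear kernel matrix (PSD)
    sq_dists = np.square(X[:, None, :] - X[None, :, :]).sum(axis=-1)
    K2 = np.exp(-0.5 * sq_dists)                           # Gaussian kernel matrix (PSD)

    for K in (K1 + K2, K1 * K2):                           # sum and elementwise (Schur) product
        print(np.linalg.eigvalsh(K).min() >= -1e-8)        # True for both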
◮ Norm of a point: we can compute the norm of a point φ(x) in feature space as
  $\|\phi(x)\|^2 = \langle \phi(x), \phi(x) \rangle = K(x, x), \qquad \|\phi(x)\| = \sqrt{K(x, x)}.$
◮ Distance between points: the distance between two points $\phi(x_i)$ and $\phi(x_j)$ can be computed as
  $\|\phi(x_i) - \phi(x_j)\|^2 = \|\phi(x_i)\|^2 + \|\phi(x_j)\|^2 - 2\langle \phi(x_i), \phi(x_j) \rangle = K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j).$
◮ Mean in feature space: the mean of the points in feature space is given as
  $\mu_\phi = \frac{1}{m} \sum_{i=1}^{m} \phi(x_i),$
  and although $\mu_\phi$ cannot be computed explicitly, its squared norm is
  $\|\mu_\phi\|^2 = \langle \mu_\phi, \mu_\phi \rangle = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} K(x_i, x_j).$
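These three operations can be read directly off the kernel matrix; a minimal sketch (the RBF kernel and the random data are our own choices):

    import numpy as np

    def rbf(x, z, sigma=1.0):
        return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

    X = np.random.default_rng(2).normal(size=(10, 3))
    m = X.shape[0]
    K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

    norm_x0 = np.sqrt(K[0, 0])                          # ||phi(x_0)||
    dist_01 = np.sqrt(K[0, 0] + K[1, 1] - 2 * K[0, 1])  # ||phi(x_0) - phi(x_1)||
    mean_norm_sq = K.sum() / m ** 2                     # ||mu_phi||^2
    print(norm_x0, dist_01, mean_norm_sq)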
◮ Total variance in feature space: the squared distance of a point $\phi(x_i)$ to the mean $\mu_\phi$ in feature space is
  $\|\phi(x_i) - \mu_\phi\|^2 = K(x_i, x_i) - \frac{2}{m} \sum_{j=1}^{m} K(x_i, x_j) + \frac{1}{m^2} \sum_{j=1}^{m} \sum_{k=1}^{m} K(x_j, x_k).$
◮ The total variance is the average of these squared distances:
  $\sigma_\phi^2 = \frac{1}{m} \sum_{i=1}^{m} \|\phi(x_i) - \mu_\phi\|^2 = \frac{1}{m} \sum_{i=1}^{m} K(x_i, x_i) - \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} K(x_i, x_j).$
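The final expression uses only kernel-matrix entries; a self-contained check with a small hand-made PSD kernel matrix (the numbers are ours):

    import numpy as np

    K = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.4],
                  [0.2, 0.4, 1.0]])      # any m x m kernel matrix
    m = K.shape[0]

    total_var = np.trace(K) / m - K.sum() / m ** 2
    print(total_var)                     # nonnegative, here about 0.42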
◮ Centering in feature space:
◮ We can center each point in feature space by subtracting the mean from it: $\hat{\phi}(x_i) = \phi(x_i) - \mu_\phi$.
◮ We do not have $\phi(x_i)$ and $\mu_\phi$ explicitly, hence we cannot explicitly center the points.
◮ However, we can still compute the centered kernel matrix $\hat{K}$:
  $\hat{K}(x_i, x_j) = \langle \phi(x_i) - \mu_\phi, \phi(x_j) - \mu_\phi \rangle = K(x_i, x_j) - \frac{1}{m} \sum_{k=1}^{m} K(x_i, x_k) - \frac{1}{m} \sum_{k=1}^{m} K(x_j, x_k) + \frac{1}{m^2} \sum_{k=1}^{m} \sum_{l=1}^{m} K(x_k, x_l).$
◮ In matrix form, $\hat{K} = \left(I - \frac{1}{m}\mathbf{1}\mathbf{1}^T\right) K \left(I - \frac{1}{m}\mathbf{1}\mathbf{1}^T\right).$
◮ In other words, we can compute the centered kernel matrix using only the kernel function.
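A minimal sketch of this double-centering (the function name and the small kernel matrix are our own illustrative choices):

    import numpy as np

    def center_kernel_matrix(K):
        # Double-centering: K_hat = (I - 1/m * 1 1^T) K (I - 1/m * 1 1^T).
        m = K.shape[0]
        C = np.eye(m) - np.ones((m, m)) / m
        return C @ K @ C

    K = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.4],
                  [0.2, 0.4, 1.0]])
    K_hat = center_kernel_matrix(K)
    print(K_hat.sum(axis=0))  # every row/column of the centered matrix sums to (numerically) zero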
◮ Normalizing in feature space:
◮ A common form of normalization is to ensure that points in feature space have unit length, by replacing $\phi(x_i)$ with $\phi(x_i)/\|\phi(x_i)\|$.
◮ The dot product in feature space then corresponds to the cosine of the angle between the two mapped points.
◮ If the mapped points are both centered and normalized, then the dot product corresponds to the correlation between the two points.
◮ The normalized kernel function, $K_n$, can be computed using only the kernel function K, as
  $K_n(x_i, x_j) = \frac{K(x_i, x_j)}{\sqrt{K(x_i, x_i)\, K(x_j, x_j)}}.$
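A minimal sketch of normalizing a kernel matrix row- and column-wise by its diagonal (the function name and the 2 × 2 example are ours):

    import numpy as np

    def normalize_kernel_matrix(K):
        # K_n[i, j] = K[i, j] / sqrt(K[i, i] * K[j, j])
        d = np.sqrt(np.diag(K))
        return K / np.outer(d, d)

    K = np.array([[4.0, 2.0],
                  [2.0, 9.0]])
    print(normalize_kernel_matrix(K))  # diagonal becomes 1, off-diagonal becomes 2 / sqrt(4 * 9) = 1/3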
◮ The optimization problem for SVM is defined as
  $\min_{w, b}\ \frac{1}{2}\|w\|^2 \quad \text{subject to } y_k(\langle w, x_k \rangle + b) \geq 1,\ k = 1, \dots, m.$
◮ In order to solve this constrained optimization problem, we use the Lagrangian function
  $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{k=1}^{m} \alpha_k \left[ y_k(\langle w, x_k \rangle + b) - 1 \right].$
◮ Setting the derivatives of L with respect to w and b to zero gives $w = \sum_{k=1}^{m} \alpha_k y_k x_k$ and $\sum_{k=1}^{m} \alpha_k y_k = 0$.
◮ Eliminating w and b from $L(w, b, \alpha)$ using these conditions then gives the dual representation of the problem, in which we maximize
  $\psi(\alpha) = \sum_{k=1}^{m} \alpha_k - \frac{1}{2} \sum_{j=1}^{m} \sum_{k=1}^{m} \alpha_j \alpha_k y_j y_k \langle x_j, x_k \rangle.$
◮ We need to maximize $\psi(\alpha)$ subject to the constraints $\sum_{k=1}^{m} \alpha_k y_k = 0$ and $\alpha_k \geq 0$ for all k.
◮ For the optimal $\alpha_k$'s, we have $\alpha_k \left[ 1 - y_k(\langle w, x_k \rangle + b) \right] = 0$.
◮ To classify a point x using the trained model, we evaluate the function
  $f(x) = \operatorname{sign}\!\left( \sum_{k=1}^{m} \alpha_k y_k \langle x_k, x \rangle + b \right).$
◮ By using a kernel K, the dual representation becomes the problem in which we maximize
  $\psi(\alpha) = \sum_{k=1}^{m} \alpha_k - \frac{1}{2} \sum_{j=1}^{m} \sum_{k=1}^{m} \alpha_j \alpha_k y_j y_k K(x_j, x_k)$
  subject to $\sum_{k=1}^{m} \alpha_k y_k = 0$ and $\alpha_k \geq 0$ for all k.
◮ To classify a point x using the trained model, we evaluate the function
  $f(x) = \operatorname{sign}\!\left( \sum_{k=1}^{m} \alpha_k y_k K(x_k, x) + b \right).$
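A minimal scikit-learn sketch of a kernel SVM (assuming scikit-learn is available; the dataset and the parameter choices gamma, factor, noise are ours, not from the slides): two concentric circles that a linear SVM cannot separate become separable with an RBF kernel.

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    # Two concentric circles: not linearly separable in the input space.
    X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

    linear_svm = SVC(kernel="linear").fit(X, y)
    rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

    print(linear_svm.score(X, y))  # poor, close to chance level
    print(rbf_svm.score(X, y))     # close to 1.0: the RBF kernel separates the circles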
◮ Consider the kernel-based hypothesis set
  $H = \{ x \mapsto \langle w, \Phi(x) \rangle : \|w\|_{\mathbb{H}} \leq \Lambda \}.$
◮ For a sample S of size m with $K(x, x) \leq r^2$ for all $x \in S$, the empirical Rademacher complexity of H is bounded as
  $\widehat{\mathfrak{R}}_S(H) = \frac{1}{m} \mathbb{E}_{\sigma}\!\left[ \sup_{\|w\|_{\mathbb{H}} \leq \Lambda} \left\langle w, \sum_{i=1}^{m} \sigma_i \Phi(x_i) \right\rangle \right] = \frac{\Lambda}{m}\, \mathbb{E}_{\sigma}\!\left[ \left\| \sum_{i=1}^{m} \sigma_i \Phi(x_i) \right\|_{\mathbb{H}} \right] \leq \frac{\Lambda \sqrt{\operatorname{Tr}[K]}}{m} \leq \sqrt{\frac{r^2 \Lambda^2}{m}}.$
◮ Combined with margin bounds, this shows that the generalization error of kernel-based large-margin classifiers depends only on the margin ρ and the sample size m, not on the dimension of the feature space.
◮ Advantages
◮ The problem has no local minima, and we can find its optimal solution in polynomial time.
◮ The solution is stable, repeatable, and sparse (it only involves the support vectors).
◮ The user needs to select only a few parameters, such as the penalty term C and the kernel function and its parameters.
◮ The algorithm provides a way to control complexity independently of dimensionality.
◮ SVMs have been shown (theoretically and empirically) to have excellent generalization capabilities.
◮ Disadvantages
◮ There is no method for choosing the kernel function and its parameters.
◮ It is not straightforward to extend SVMs to multi-class classification.
◮ Predictions from an SVM are not probabilistic.
◮ It has high algorithmic complexity and needs extensive memory for large-scale tasks.
1. Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
2. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Second Edition. MIT Press, 2018.