Introduction to Machine Learning
Non-linear prediction with kernels
- Prof. Andreas Krause
Learning and Adaptive Systems (las.ethz.ch)
Solving non-linear classification tasks
How can we find nonlinear classification boundaries? As in regression, we can use non-linear transformations of the feature vectors, followed by linear classification.
Need O(d^k) dimensions to represent (multivariate) polynomials of degree k on d features. Example: d = 10,000, k = 2 ⇒ need ~100M dimensions. In the following, we will see how we can efficiently operate implicitly in such high-dimensional feature spaces (i.e., without ever explicitly computing the transformation).
The "kernel trick":
Express the problem such that it only depends on inner products
Replace inner products by kernels
Example: the Perceptron. We will see further examples later.
Kernelized Perceptron optimization problem:
$$\hat{\alpha}_{1:n} = \arg\min_{\alpha_{1:n}} \frac{1}{n}\sum_{i=1}^{n}\max\Big\{0,\; -y_i \sum_{j=1}^{n}\alpha_j y_j\, x_i^T x_j\Big\}$$
Replacing the inner products $x_i^T x_j$ by a kernel $k(x_i, x_j)$:
$$\hat{\alpha}_{1:n} = \arg\min_{\alpha_{1:n}} \frac{1}{n}\sum_{i=1}^{n}\max\Big\{0,\; -y_i \sum_{j=1}^{n}\alpha_j y_j\, k(x_i, x_j)\Big\}$$
Kernelized Perceptron
Training:
Initialize $\alpha_1 = \dots = \alpha_n = 0$
For $t = 1, 2, \dots$:
  Pick a data point $(x_i, y_i)$ uniformly at random
  Predict $\hat{y} = \mathrm{sign}\big(\sum_{j=1}^{n} \alpha_j y_j\, k(x_j, x_i)\big)$
  If $\hat{y} \neq y_i$, set $\alpha_i \leftarrow \alpha_i + \eta_t$
Prediction: for a new point $x$, predict $\hat{y} = \mathrm{sign}\big(\sum_{j=1}^{n} \alpha_j y_j\, k(x_j, x)\big)$
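A minimal NumPy sketch of this procedure (my own illustration, not the lecture's code; the function and parameter names, the Gaussian kernel choice, and the constant step size in place of $\eta_t$ are assumptions):

```python
import numpy as np

def gaussian_kernel(x, z, h=1.0):
    # k(x, x') = exp(-||x - x'||_2^2 / h^2)
    return np.exp(-np.sum((x - z) ** 2) / h ** 2)

def train_kernel_perceptron(X, y, kernel=gaussian_kernel, steps=1000, eta=1.0, seed=0):
    """X: (n, d) data matrix, y: (n,) labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n = len(y)
    alpha = np.zeros(n)                                   # alpha_1 = ... = alpha_n = 0
    K = np.array([[kernel(a, b) for b in X] for a in X])  # precomputed Gram matrix
    for _ in range(steps):
        i = rng.integers(n)                               # pick (x_i, y_i) uniformly at random
        y_hat = np.sign(np.sum(alpha * y * K[:, i]))      # sign(sum_j alpha_j y_j k(x_j, x_i))
        if y_hat != y[i]:                                 # on a mistake, increase alpha_i
            alpha[i] += eta
    return alpha

def kernel_perceptron_predict(alpha, X, y, x_new, kernel=gaussian_kernel):
    # y_hat = sign(sum_j alpha_j y_j k(x_j, x_new))
    return np.sign(sum(a * yj * kernel(xj, x_new) for a, yj, xj in zip(alpha, y, X)))
```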
What are valid kernels?
How can we select a good kernel for our problem?
Can we use kernels beyond the perceptron?
Kernels work in very high-dimensional spaces. Doesn't this lead to overfitting?
Definition: let $X$ be the data space. A kernel is a function $k : X \times X \to \mathbb{R}$ satisfying
1) Symmetry: for any $x, x' \in X$ it must hold that $k(x, x') = k(x', x)$
2) Positive semi-definiteness: for any $n$ and any set $S = \{x_1, \dots, x_n\} \subseteq X$, the kernel (Gram) matrix
$$K = \begin{pmatrix} k(x_1, x_1) & \dots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \dots & k(x_n, x_n) \end{pmatrix}$$
must be positive semi-definite
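As a quick numerical illustration (my own addition, not from the slides): build the Gram matrix of a candidate kernel on a random sample and check symmetry and that its eigenvalues are non-negative.

```python
import numpy as np

def gram_matrix(kernel, X):
    """Kernel (Gram) matrix K with K[i, j] = k(x_i, x_j)."""
    return np.array([[kernel(a, b) for b in X] for a in X])

rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2))    # Gaussian kernel with h = 1
X = np.random.default_rng(0).normal(size=(20, 3))   # a random sample from the data space
K = gram_matrix(rbf, X)

print(np.allclose(K, K.T))                          # symmetry
print(np.linalg.eigvalsh(K).min())                  # >= 0 (up to numerical error) for a valid kernel
```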
Examples of kernels:
Linear kernel: $k(x, x') = x^T x'$
Polynomial kernel: $k(x, x') = (x^T x' + 1)^d$
Gaussian (RBF, squared exponential) kernel: $k(x, x') = \exp(-\|x - x'\|_2^2 / h^2)$
Laplacian kernel: $k(x, x') = \exp(-\|x - x'\|_1 / h)$
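Written directly in NumPy (a sketch; the parameter names follow the formulas above):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z                                       # x^T x'

def polynomial_kernel(x, z, d=3):
    return (x @ z + 1.0) ** d                          # (x^T x' + 1)^d

def gaussian_kernel(x, z, h=1.0):
    return np.exp(-np.sum((x - z) ** 2) / h ** 2)      # exp(-||x - x'||_2^2 / h^2)

def laplacian_kernel(x, z, h=1.0):
    return np.exp(-np.sum(np.abs(x - z)) / h)          # exp(-||x - x'||_1 / h)
```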
Given a kernel $k$, predictors (for kernelized classification) have the form
$$\hat{y} = \mathrm{sign}\Big(\sum_{j=1}^{n} \alpha_j y_j\, k(x_j, x)\Big)$$
[Figures: effect of the bandwidth $h$ of the Gaussian kernel $k(x, x') = \exp(-\|x - x'\|_2^2 / h^2)$ on the predictor $f(x) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x)$, shown for $h = 0.1$, $h = 0.3$, and $h = 1$]
Can define kernels on a variety of objects: sequence kernels, graph kernels, diffusion kernels, kernels on probability distributions, ...
Can define a kernel for measuring similarity between graphs by comparing random walks on both graphs (not further defined here)
[Borgwardt et al.]
Can measure similarity among nodes in a graph via diffusion kernels (not defined here)
Composition rules: suppose we have two kernels $k_1 : X \times X \to \mathbb{R}$ and $k_2 : X \times X \to \mathbb{R}$ defined on data space $X$. Then the following functions are valid kernels:
$k(x, x') = k_1(x, x') + k_2(x, x')$
$k(x, x') = c\, k_1(x, x')$ for $c > 0$
$k(x, x') = k_1(x, x')\, k_2(x, x')$
$k(x, x') = f(k_1(x, x'))$, where $f$ is a polynomial with positive coefficients or the exponential function
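A small numerical sanity check of these rules (my own illustration): combine two valid Gram matrices by sum, positive scaling, elementwise product, and elementwise exp, and verify that the smallest eigenvalue stays non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 2))

k1 = lambda x, z: x @ z                                # linear kernel
k2 = lambda x, z: np.exp(-np.sum((x - z) ** 2))        # Gaussian kernel, h = 1
gram = lambda k: np.array([[k(a, b) for b in X] for a in X])

K1, K2 = gram(k1), gram(k2)
for K in (K1 + K2, 3.0 * K1, K1 * K2, np.exp(K1)):     # k1+k2, c*k1 (c>0), k1*k2, f = exp
    print(np.linalg.eigvalsh(K).min())                 # all >= 0 (up to numerical error)
```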
[Figures: example payoff functions plotted over actions × contexts]
May want to use kernels to model pairwise data (users x products; genes x patients; ...)
Recap: we've seen how to kernelize the perceptron, and discussed properties and examples of kernels. Next questions:
What kind of predictors / decision boundaries do kernel methods entail?
Can we use the kernel trick beyond the perceptron?
Recall the Perceptron (and SVM) classification rule $\hat{y} = \mathrm{sign}\big(\sum_{i=1}^{n} \alpha_i y_i\, k(x_i, x)\big)$, and consider the Gaussian kernel
$$k(x, x') = \exp(-\|x - x'\|_2^2 / h^2)$$
k-Nearest Neighbor classifier: for a data point $x$, predict the majority of the labels of the $k$ nearest neighbors,
$$\hat{y} = \mathrm{sign}\Big(\sum_{i=1}^{n} y_i\, [x_i \text{ among } k \text{ nearest neighbors of } x]\Big)$$
How to choose $k$?
Comparison:
k-Nearest Neighbor: $\hat{y} = \mathrm{sign}\big(\sum_{i=1}^{n} y_i\, [x_i \text{ among } k \text{ nearest neighbors of } x]\big)$
Kernel Perceptron: $\hat{y} = \mathrm{sign}\big(\sum_{i=1}^{n} \alpha_i y_i\, k(x_i, x)\big)$
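A minimal sketch of this k-NN rule in NumPy (my own illustration; Euclidean distance and the names are assumptions):

```python
import numpy as np

def knn_predict(X, y, x_new, k=5):
    """Majority vote over the k nearest neighbors of x_new.
    X: (n, d) training points, y: (n,) labels in {-1, +1}."""
    dists = np.linalg.norm(X - x_new, axis=1)    # distances to all training points
    nearest = np.argsort(dists)[:k]              # indices of the k nearest neighbors
    return np.sign(np.sum(y[nearest]))           # majority vote, written as a sign of a sum
```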
Comparison of methods:
k-NN — Advantages: no training necessary. Disadvantages: prediction depends on all of the data ⇒ inefficient.
Kernelized Perceptron — Advantages: optimized weights can lead to improved performance; can capture "global trends" with suitable kernels; depends on "wrongly classified" examples only. Disadvantages: requires training.
Parametric models have a finite set of parameters. Examples: linear regression, the linear Perceptron, ...
Nonparametric models grow in complexity with the size of the data: potentially much more expressive, but also more computationally complex – why? Examples: the kernelized Perceptron, k-NN, ...
Kernels provide a principled way of deriving nonparametric models from parametric ones.
Recap: we've seen how to kernelize the perceptron, and discussed properties and examples of kernels. Next question:
Can we use the kernel trick beyond the perceptron?
The support vector machine can also be kernelized.
Original (parametric) SVM optimization problem:
$$\hat{w} = \arg\min_{w} \frac{1}{n}\sum_{i=1}^{n}\max\{0,\; 1 - y_i\, w^T x_i\} + \lambda \|w\|_2^2$$
As for the perceptron, the optimal $w$ lies in the span of the data, $w = \sum_{i=1}^{n} \alpha_i y_i x_i$; substituting this and replacing the inner products $x_i^T x_j$ by $k(x_i, x_j)$ yields the kernelized SVM.
Kernelized SVM
Learning: solve the optimization problem
SVM: $\hat{\alpha} = \arg\min_{\alpha} \frac{1}{n}\sum_{i=1}^{n}\max\{0,\; 1 - y_i\, \alpha^T k_i\} + \lambda\, \alpha^T D_y K D_y \alpha$
Perceptron (for comparison): $\hat{\alpha} = \arg\min_{\alpha} \frac{1}{n}\sum_{i=1}^{n}\max\{0,\; -y_i\, \alpha^T k_i\}$
where $k_i = \big(y_1 k(x_i, x_1), \dots, y_n k(x_i, x_n)\big)^T$ and $D_y = \mathrm{diag}(y_1, \dots, y_n)$.
Prediction: for a data point $x$, predict the label $y$ as $\hat{y} = \mathrm{sign}\big(\sum_{j=1}^{n} \alpha_j y_j\, k(x_j, x)\big)$
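One simple way to solve this problem is (sub)gradient descent on the objective. The sketch below is my own minimal illustration (full-batch subgradient steps with a fixed step size), not the lecture's solver; it takes a precomputed kernel matrix K and labels y, and uses the explicit-sum form $f_i = \sum_j \alpha_j y_j k(x_i, x_j)$.

```python
import numpy as np

def train_kernel_svm(K, y, lam=0.1, lr=0.01, iters=1000):
    """Subgradient descent on
    (1/n) * sum_i max{0, 1 - y_i f_i} + lam * alpha^T Dy K Dy alpha,
    where f_i = sum_j alpha_j y_j K[i, j]."""
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(iters):
        f = K @ (alpha * y)                           # f_i = sum_j alpha_j y_j k(x_i, x_j)
        violated = (1.0 - y * f) > 0.0                # margin violations (hinge loss is active)
        grad_hinge = -(y * (K @ (y * violated))) / n  # subgradient of the averaged hinge term
        grad_reg = 2.0 * lam * y * (K @ (y * alpha))  # gradient of lam * alpha^T Dy K Dy alpha
        alpha -= lr * (grad_hinge + grad_reg)
    return alpha

def kernel_svm_predict(alpha, y, k_new):
    """k_new[j] = k(x_j, x_new); predict sign(sum_j alpha_j y_j k(x_j, x_new))."""
    return np.sign(np.sum(alpha * y * k_new))
```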
From linear to nonlinear regression: we can also kernelize linear regression. The predictor has the form
$$f(x) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x)$$
[Figures: linear vs. nonlinear fit $f(x)$ through a scatter of data points]
Original (parametric) linear optimization problem (ridge regression):
$$\hat{w} = \arg\min_{w} \frac{1}{n}\sum_{i=1}^{n}\big(w^T x_i - y_i\big)^2 + \lambda \|w\|_2^2$$
As for the perceptron, the optimal $w$ lies in the span of the data:
$$\hat{w} = \sum_{i=1}^{n} \alpha_i x_i$$
Kernelized optimization problem:
$$\hat{\alpha} = \arg\min_{\alpha_{1:n}} \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{n}\alpha_j\, k(x_i, x_j) - y_i\Big)^2 + \lambda\, \alpha^T K \alpha$$
with the kernel (Gram) matrix
$$K = \begin{pmatrix} k(x_1, x_1) & \dots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \dots & k(x_n, x_n) \end{pmatrix}$$
Kernelized linear regression
Learning: solve the least squares problem
$$\hat{\alpha} = \arg\min_{\alpha} \frac{1}{n}\,\|K\alpha - y\|_2^2 + \lambda\, \alpha^T K \alpha$$
Closed-form solution: $\hat{\alpha} = (K + n\lambda I)^{-1} y$
Prediction: for a data point $x$, predict the response $y$ as $\hat{y} = f(x) = \sum_{i=1}^{n}\hat{\alpha}_i\, k(x_i, x)$
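A compact NumPy sketch of this closed-form solution (my own illustration; the formula $\hat{\alpha} = (K + n\lambda I)^{-1} y$ follows from setting the gradient of the objective to zero, assuming $K$ is invertible):

```python
import numpy as np

def fit_kernel_ridge(K, y, lam=0.1):
    """Solve (1/n)||K alpha - y||_2^2 + lam * alpha^T K alpha in closed form:
    alpha_hat = (K + n * lam * I)^{-1} y."""
    n = len(y)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def kernel_ridge_predict(alpha, k_new):
    """k_new[i] = k(x_i, x_new); f(x) = sum_i alpha_i k(x_i, x)."""
    return alpha @ k_new
```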
What if $k(x, x') = x^T x'$? With the linear kernel, kernelized regression reduces to ordinary (ridge) linear regression.
Often, parametric models are too "rigid", and nonparametric models fail to extrapolate. Solution: use an additive combination of a linear and a non-linear kernel function, e.g.
$$k(x, x') = c_1 \exp\!\big(-\|x - x'\|_2^2 / h^2\big) + c_2\, x^T x'$$
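Such a combined kernel is straightforward to write down; this is an illustration, with $c_1$, $c_2$, and $h$ as tuning parameters chosen for the example:

```python
import numpy as np

def combined_kernel(x, z, c1=1.0, c2=1.0, h=1.0):
    """Additive combination of a Gaussian part (captures local, non-linear structure)
    and a linear part (lets the predictor extrapolate global linear trends)."""
    rbf = np.exp(-np.sum((x - z) ** 2) / h ** 2)   # c1 * exp(-||x - x'||^2 / h^2)
    linear = x @ z                                 # c2 * x^T x'
    return c1 * rbf + c2 * linear
```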
Application example: protein engineering [with Phil Romero, Frances Arnold, PNAS '13]
[Figures: parent sequences and candidate chimera designs; measured thermostability of candidate sequences]
Identification of new thermostable P450 chimeras: 5.3°C more stable than the best published sequence!
How should we select suitable kernels? And for a given kernel, how should we choose its parameters?
Domain knowledge (dependent on the data type)
"Brute force" (or heuristic) search
Use cross-validation (a sketch follows below)
Learning kernels: much research on automatically selecting good kernels (Multiple Kernel Learning; Hyperkernels; etc.)
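For instance, the bandwidth $h$ of the Gaussian kernel can be chosen by k-fold cross-validation. The sketch below is my own illustration (kernel ridge regression as the underlying method, squared validation error as the score); the candidate grid and fold count are arbitrary choices.

```python
import numpy as np

def cv_select_bandwidth(X, y, candidates=(0.1, 0.3, 1.0, 3.0), folds=5, lam=0.1):
    """Return the bandwidth h with the lowest k-fold cross-validation error."""
    rng = np.random.default_rng(0)
    splits = np.array_split(rng.permutation(len(y)), folds)

    def gram(A, B, h):                                   # K[i, j] = exp(-||a_i - b_j||^2 / h^2)
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / h ** 2)

    scores = {}
    for h in candidates:
        errs = []
        for f in range(folds):
            val = splits[f]
            tr = np.concatenate([splits[g] for g in range(folds) if g != f])
            K = gram(X[tr], X[tr], h)
            alpha = np.linalg.solve(K + len(tr) * lam * np.eye(len(tr)), y[tr])
            pred = gram(X[val], X[tr], h) @ alpha        # f(x) = sum_i alpha_i k(x_i, x)
            errs.append(np.mean((pred - y[val]) ** 2))
        scores[h] = float(np.mean(errs))
    return min(scores, key=scores.get), scores
```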
Kernels map to (very) high-dimensional spaces. Why do we hope to be able to learn?
First attempt at an answer: (typically) # parameters << # dimensions. Why? The number of parameters equals the number of data points ("non-parametric learning").
Second attempt at an answer: overfitting can of course happen (if we choose poor parameters), but we can combat overfitting by regularization.
This is already built into kernelized linear regression (and SVMs), but not the kernelized Perceptron
KLR: $\hat{\alpha} = \arg\min_{\alpha} \frac{1}{n}\,\|K\alpha - y\|_2^2 + \lambda\, \alpha^T K \alpha$
SVM: $\hat{\alpha} = \arg\min_{\alpha} \frac{1}{n}\sum_{i=1}^{n}\max\{0,\; 1 - y_i\, \alpha^T k_i\} + \lambda\, \alpha^T D_y K D_y \alpha$
Summary:
Kernels are (efficient, implicit) inner products; positive (semi-)definite functions; many examples (linear, polynomial, Gaussian/RBF, ...)
The "kernel trick": reformulate the learning algorithm so that inner products appear, then replace inner products by kernels
k-Nearest Neighbor classifier (and its relation to the Perceptron)
How to choose kernels (kernel engineering etc.)
Applications: kernelized Perceptron / SVM; kernelized linear regression
[Overview diagram relating the methods:
Least squares regression —(l2-regularizer)→ Ridge regression —(kernels)→ Kernelized regression; —(l1-regularizer)→ Lasso
Perceptron —(l2-regularizer)→ Linear SVM —(kernels)→ Kernelized SVM; —(l1-regularizer)→ l1-SVM; Perceptron —(kernels)→ Kernelized Perceptron, with k-NN as a "special case"
Regression and classification methods are connected by the choice of loss function]
Model / loss function + regularization: squared loss, 0/1 loss, Perceptron loss, hinge loss; L2 norm, L1 norm
Method: exact solution, gradient descent, (mini-batch) SGD, convex programming, …
Model selection: k-fold cross-validation, Monte Carlo CV
Representation / features: linear hypotheses; nonlinear hypotheses with nonlinear feature transforms; kernels
Evaluation metric: mean squared error, accuracy