CS446 Introduction to Machine Learning (Fall 2015) University of Illinois at Urbana-Champaign
http://courses.engr.illinois.edu/cs446
- Prof. Julia Hockenmaier
juliahmr@illinois.edu
LECTURE 8: DUAL AND KERNELS
Admin
Everybody is allowed a total of two late days for the semester. If you have exhausted your allotment of late days, we will deduct 20% per late day. We don't accept assignments more than two days after their due date. Let us know if there are any special circumstances (family, health, etc.).
What does it mean for w to have converged?
– Define a convergence threshold τ (e.g. 10⁻³)
– Compute Δw, the difference between w_old and w_new: Δw = w_old − w_new
– w has converged when ‖Δw‖ < τ
How often do I check for convergence?
Batch learning:
– w_old = w before seeing the current batch
– w_new = w after seeing the current batch
Assuming your batch is large enough, this works well.
How often do I check for convergence?
Online learning:
– Problem: a single example may lead to only very small changes in w
– Solution: only check for convergence after every k examples (or updates; it doesn't matter which):
  w_old = w after n·k examples/updates
  w_new = w after (n+1)·k examples/updates
What we've seen so far is not the whole story:
– We've assumed that the data are linearly separable.
– We've ignored the fact that the perceptron just finds some decision boundary, not necessarily an optimal one.
Two reasons why data may not be linearly separable: noise/outliers, or a target function that is not linear in x.
Kernel trick: dealing with target functions that are not linear in the original features (i.e. data that are not linearly separable in the original feature space). This requires us to move to the dual representation.
Recall the Perceptron update rule: if x_m is misclassified, add y_m·x_m to w:
  if y_m·f(x_m) = y_m·(w·x_m) < 0:
    w := w + y_m·x_m
Dual representation: write w as a weighted sum of training items:
  w = ∑_n α_n y_n x_n
  α_n: how often was x_n misclassified?
  f(x) = w·x = ∑_n α_n y_n (x_n·x)
Primal Perceptron update rule: if x_m is misclassified, add y_m·x_m to w:
  if y_m·f(x_m) = y_m·(w·x_m) < 0:
    w := w + y_m·x_m
Dual Perceptron update rule: if x_m is misclassified, add 1 to α_m:
  if y_m·∑_d α_d y_d (x_d·x_m) < 0:
    α_m := α_m + 1
Classifying x in the primal:
  f(x) = w·x
  w = feature weights (to be learned)
  w·x = dot product between w and x
Classifying x in the dual:
  f(x) = ∑_n α_n y_n (x_n·x)
  α_n = weight of the n-th training example (to be learned)
  x_n·x = dot product between x_n and x
The dual representation is advantageous when #training examples ≪ #features (fewer parameters to learn).
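To make the dual classification rule concrete, here is a minimal dual perceptron sketch in Python/NumPy. It is an illustration, not the course's reference code: the function names, the epochs parameter, and the use of ≤ 0 (so that the all-zero initialization still triggers updates) are assumptions.

```python
import numpy as np

def dual_perceptron_train(X, y, epochs=10):
    """Learn alpha[n] = number of times x_n was misclassified (dual perceptron)."""
    n = X.shape[0]
    alpha = np.zeros(n)
    G = X @ X.T                                   # Gram matrix: G[d, m] = x_d . x_m
    for _ in range(epochs):
        for m in range(n):
            # misclassified if y_m * sum_d alpha_d y_d (x_d . x_m) <= 0
            if y[m] * np.sum(alpha * y * G[:, m]) <= 0:
                alpha[m] += 1                     # dual update: count the mistake
    return alpha

def dual_perceptron_predict(alpha, X_train, y_train, x):
    """f(x) = sum_n alpha_n y_n (x_n . x); only dot products with training items are needed."""
    return np.sign(np.sum(alpha * y_train * (X_train @ x)))
```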
[Figure: the same data plotted in the original feature space (x1, x2) and in the transformed feature space (x1², x2²)]
Original feature space: f(x) = 1 iff x1² + x2² ≤ 1 (a circular decision boundary).
Transform the data: x = (x1, x2) ⇒ x' = (x1², x2²)
Transformed feature space: f(x') = 1 iff x'1 + x'2 ≤ 1 (a linear decision boundary).
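A small sketch of this transformation on synthetic data; the data generation below is illustrative and not from the lecture:

```python
import numpy as np

# Synthetic data: label +1 inside the unit circle, -1 outside (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 <= 1.0, 1, -1)

# Not linearly separable in (x1, x2); after x -> (x1^2, x2^2) the boundary
# x1^2 + x2^2 = 1 becomes the linear boundary x'1 + x'2 = 1.
X_prime = X ** 2
```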
These data aren't linearly separable in the x1 space, but adding a second dimension x2 = x1² makes them linearly separable in 〈x1, x2〉.
[Figure: 1D data along x1, and the same data plotted in the (x1, x1²) plane]
It is common for data to be not linearly separable in the original feature space. We can often introduce new features to make the data linearly separable in the new space:
– transform the original features (e.g. x → x²)
– include transformed features in addition to the original features
– capture interactions between features (e.g. x3 = x1·x2)
But this may blow up the number of features.
We may need to introduce a lot of new features to learn the target function.
Problem for the primal representation: w now has a lot of elements, and we might not have enough data to learn it.
The dual representation is not affected: there is still one parameter α_n per training example.
– Define a feature function φ(x) which maps items x into a higher-dimensional space.
– The kernel function K(x_i, x_j) computes the inner product between φ(x_i) and φ(x_j): K(x_i, x_j) = φ(x_i)·φ(x_j)
– Dual representation: we don't need to learn w in this higher-dimensional space. It is sufficient to evaluate K(x_i, x_j).
Original features: x = (a, b)
Transformed features: φ(x) = (a², b², √2·ab)
Dot product in transformed space:
  φ(x1)·φ(x2) = a1²a2² + b1²b2² + 2·a1b1a2b2 = (a1a2 + b1b2)² = (x1·x2)²
Kernel: K(x1, x2) = (x1·x2)² = φ(x1)·φ(x2)
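A quick numeric check of this identity, using arbitrary example vectors:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-d inputs: (a^2, b^2, sqrt(2)*a*b)."""
    a, b = x
    return np.array([a**2, b**2, np.sqrt(2) * a * b])

x1 = np.array([1.0, 2.0])        # arbitrary example vectors
x2 = np.array([3.0, -1.0])

lhs = np.dot(x1, x2) ** 2        # kernel: (x1 . x2)^2
rhs = np.dot(phi(x1), phi(x2))   # explicit dot product in transformed space
print(lhs, rhs)                  # both equal 1.0
```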
Polynomial kernel of degree p:
– Basic form: K(x_i, x_j) = (x_i·x_j)^p
– Standard form (captures all lower-order terms): K(x_i, x_j) = (x_i·x_j + 1)^p
Dual Perceptron:
  f(x) = ∑_d α_d y_d (x_d·x)
  Update: if x_m is misclassified, add 1 to α_m:
    if y_m·∑_d α_d y_d (x_d·x_m) < 0:
      α_m := α_m + 1
Kernel Perceptron:
  f(x) = ∑_d α_d y_d φ(x_d)·φ(x) = ∑_d α_d y_d K(x_d, x)
  Update: if x_m is misclassified, add 1 to α_m:
    if y_m·∑_d α_d y_d K(x_d, x_m) < 0:
      α_m := α_m + 1
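The same sketch as before, with the dot product replaced by a caller-supplied kernel function; again a minimal illustration under the same assumptions, not the course's reference implementation:

```python
import numpy as np

def kernel_perceptron_train(X, y, kernel, epochs=10):
    """Same as the dual perceptron, but with K(x_d, x_m) in place of x_d . x_m."""
    n = X.shape[0]
    alpha = np.zeros(n)
    K = np.array([[kernel(X[d], X[m]) for m in range(n)] for d in range(n)])
    for _ in range(epochs):
        for m in range(n):
            if y[m] * np.sum(alpha * y * K[:, m]) <= 0:
                alpha[m] += 1
    return alpha

def kernel_perceptron_predict(alpha, X_train, y_train, kernel, x):
    k = np.array([kernel(x_d, x) for x_d in X_train])
    return np.sign(np.sum(alpha * y_train * k))

# e.g. with the quadratic kernel from the previous slides:
quadratic = lambda u, v: np.dot(u, v) ** 2
```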
Linear classifier (primal representation):
  w defines weights of the features of x
  f(x) = w·x
Linear classifier (dual representation):
  rewrite w as a weighted sum of training items:
  w = ∑_n α_n y_n x_n
  f(x) = w·x = ∑_n α_n y_n (x_n·x)
– Define a feature function φ(x) which maps items x into a higher-dimensional space.
– The kernel function K(x_i, x_j) computes the inner product between φ(x_i) and φ(x_j): K(x_i, x_j) = φ(x_i)·φ(x_j)
– Dual representation: we don't need to learn w in this higher-dimensional space. It is sufficient to evaluate K(x_i, x_j).
The kernel matrix of a data set D = {x1, …, xn}, defined by a kernel function k(x, z) = φ(x)·φ(z), is the n×n matrix K with K_ij = k(x_i, x_j).
You'll also find the term 'Gram matrix' used:
– The Gram matrix of a set of n vectors S = {x1, …, xn} is the n×n matrix G with G_ij = x_i·x_j
– The kernel matrix is the Gram matrix of {φ(x1), …, φ(xn)}
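A short sketch of building such a kernel matrix (using the quadratic kernel from earlier; the data below is a placeholder):

```python
import numpy as np

def kernel_matrix(X, kernel):
    """n x n matrix with K[i, j] = kernel(x_i, x_j)."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

X = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]])    # placeholder data
K = kernel_matrix(X, lambda u, v: np.dot(u, v) ** 2)   # quadratic kernel
```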
K is symmetric: K_ij = k(x_i, x_j) = φ(x_i)·φ(x_j) = k(x_j, x_i) = K_ji
K is positive semi-definite (for all vectors v: vᵀKv ≥ 0). Proof:
  vᵀKv = ∑_{i=1}^{D} ∑_{j=1}^{D} v_i v_j K_ij
       = ∑_{i=1}^{D} ∑_{j=1}^{D} v_i v_j ⟨φ(x_i), φ(x_j)⟩
       = ∑_{i=1}^{D} ∑_{j=1}^{D} v_i v_j ∑_{k=1}^{N} φ_k(x_i)·φ_k(x_j)
       = ∑_{k=1}^{N} ( ∑_{i=1}^{D} v_i φ_k(x_i) )·( ∑_{j=1}^{D} v_j φ_k(x_j) )
       = ∑_{k=1}^{N} ( ∑_{i=1}^{D} v_i φ_k(x_i) )² ≥ 0
(Here D is the number of items and N is the dimensionality of the feature space φ.)
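A small numeric sanity check of symmetry and positive semi-definiteness on arbitrary data; this only illustrates the property and is not a substitute for the proof:

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(20, 3))   # arbitrary data
K = (X @ X.T) ** 2                                   # quadratic-kernel matrix: K[i,j] = (x_i . x_j)^2

print(np.allclose(K, K.T))                           # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-8)          # all eigenvalues >= 0 up to rounding
```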
K(x, z) = (x·z)²
This corresponds to a feature space which contains only terms of degree 2 (products of two features); for x = (x1, x2) in R², these are x1x1, x1x2, x2x2.
For x = (x1, x2), z = (z1, z2):
  K(x, z) = (x·z)² = x1²z1² + 2·x1z1x2z2 + x2²z2² = φ(x)·φ(z)
Hence, φ(x) = (x1², √2·x1x2, x2²)
K(x, z) = (x·z + c)²
This corresponds to a feature space which contains a constant, linear terms (the original features), as well as terms of degree 2 (products of two features); for x = (x1, x2) in R²: x1, x2, x1x1, x1x2, x2x2.
– Linear kernel: k(x, z) = x·z
– Polynomial kernel of degree d (only d-th-order interactions): k(x, z) = (x·z)^d
– Polynomial kernel up to degree d (all interactions of order d or lower): k(x, z) = (x·z + c)^d with c > 0
You can construct new kernels k'(x, x') from a valid kernel k(x, x') by:
– multiplying k(x, x') by a positive constant c: k'(x, x') = c·k(x, x')
– multiplying k(x, x') by a function f applied to x and x': k'(x, x') = f(x)·k(x, x')·f(x')
– applying a polynomial (with non-negative coefficients) to k(x, x'): k'(x, x') = P(k(x, x')) with P(z) = ∑_i a_i z^i and a_i ≥ 0
– exponentiating k(x, x'): k'(x, x') = exp(k(x, x'))
You can construct k'(x, x') from two valid kernels k1(x, x') and k2(x, x') by:
– adding them: k'(x, x') = k1(x, x') + k2(x, x')
– multiplying them: k'(x, x') = k1(x, x')·k2(x, x')
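An empirical illustration of these closure rules, checking that the combined kernel matrices remain positive semi-definite on random placeholder data (a sanity check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 4))

K1 = X @ X.T                       # linear kernel matrix
K2 = (X @ X.T + 1.0) ** 2          # polynomial kernel matrix, degree 2

for K in (K1 + K2, K1 * K2, np.exp(K1)):    # sum, elementwise product, exponentiation
    print(np.linalg.eigvalsh(K).min() >= -1e-6)   # stays PSD up to rounding
```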
– If φ(x) ∈ R^m and k_m(z, z') is a valid kernel in R^m, then k(x, x') = k_m(φ(x), φ(x')) is also a valid kernel
– If A is a symmetric positive semi-definite matrix, then k(x, x') = xᵀAx' is also a valid kernel
Kernel normalization:
  k'(x, z) = k(x, z) / √(k(x, x)·k(z, z))
           = φ(x)·φ(z) / (√(φ(x)·φ(x)) · √(φ(z)·φ(z)))
           = φ(x)·φ(z) / (‖φ(x)‖·‖φ(z)‖)
           = ψ(x)·ψ(z)   with ψ(x) = φ(x)/‖φ(x)‖
Recall: you can normalize any vector x (transform it into a unit vector that has the same direction as x) by
  x̂ = x/‖x‖ = x/√(x1² + … + xN²)
Gaussian (RBF) kernel: k(x, z) = exp(−‖x − z‖²/c)
– ‖x − z‖²: squared Euclidean distance between x and z
– c: a free parameter (often written in terms of σ², as c = 2σ²)
– very small c: K ≈ identity matrix (every item looks different)
– very large c: K ≈ the all-ones matrix (all items look the same)
– k(x, z) ≈ 1 when x and z are close
– k(x, z) ≈ 0 when x and z are dissimilar
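A minimal sketch of this kernel and of how the kernel matrix changes with c; the data and the c values below are arbitrary:

```python
import numpy as np

def gaussian_kernel(x, z, c=1.0):
    """k(x, z) = exp(-||x - z||^2 / c)"""
    return np.exp(-np.sum((x - z) ** 2) / c)

X = np.random.default_rng(3).normal(size=(5, 2))
for c in (1e-3, 1.0, 1e3):   # small c -> ~identity matrix, large c -> ~all-ones matrix
    K = np.array([[gaussian_kernel(xi, xj, c) for xj in X] for xi in X])
    print(c, np.round(K, 2))
```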
k(x, z) = exp(−‖x − z‖²/c) is a valid kernel because (writing c = 2σ²):
  k(x, z) = exp(−‖x − z‖²/2σ²)
          = exp(−(x·x + z·z − 2x·z)/2σ²)
          = exp(−x·x/2σ²)·exp(x·z/σ²)·exp(−z·z/2σ²)
          = f(x)·exp(x·z/σ²)·f(z)
exp(x·z/σ²) is a valid kernel:
– x·z is the linear kernel
– we can multiply kernels by constants (1/σ²)
– we can exponentiate kernels
X, Z: subsets of a finite set D with |D| elements.
k(X, Z) = |X∩Z| (the number of elements shared by X and Z) is a valid kernel:
  k(X, Z) = φ(X)·φ(Z), where φ(X) maps X to a bit vector of length |D| (i-th bit: does X contain the i-th element of D?).
k(X, Z) = 2^|X∩Z| (the number of subsets shared by X and Z) is a valid kernel:
  φ(X) maps X to a bit vector of length 2^|D| (i-th bit: does X contain the i-th subset of D?).
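A small sketch of the first of these set kernels, showing that the intersection count equals the dot product of the bit-vector features; the universe D below is a made-up example:

```python
import numpy as np

D = ['a', 'b', 'c', 'd', 'e']                  # hypothetical finite universe

def phi(S):
    """Map a subset S of D to a bit vector of length |D|."""
    return np.array([1 if item in S else 0 for item in D])

X, Z = {'a', 'b', 'c'}, {'b', 'c', 'e'}
print(len(X & Z))                  # kernel value |X ∩ Z| = 2
print(int(phi(X) @ phi(Z)))        # same value via the explicit feature map
```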