

SLIDE 1

LECTURE 8: DUAL AND KERNELS

CS446 Introduction to Machine Learning (Fall 2015)
University of Illinois at Urbana-Champaign
http://courses.engr.illinois.edu/cs446

Prof. Julia Hockenmaier
juliahmr@illinois.edu

SLIDE 2

Admin

SLIDE 3

Reminder: Homework Late Policy

Everybody is allowed a total of two late days for the semester. If you have exhausted your allotment of late days, we will subtract 20% per late day. We don't accept assignments more than two days after their due date. Let us know if there are any special circumstances (family, health, etc.).

SLIDE 4

Convergence checks

What does it mean for w to have converged?
– Define a convergence threshold τ (e.g. τ = 10⁻³)
– Compute Δw, the difference between w_old and w_new: Δw = w_old − w_new
– w has converged when ‖Δw‖ < τ
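A minimal sketch of this check in Python (using NumPy; the helper name has_converged and the array representation of w are our assumptions, not from the slides):

```python
import numpy as np

def has_converged(w_old, w_new, tau=1e-3):
    """Return True once w has stopped moving, i.e. when ‖Δw‖ < τ."""
    delta_w = w_old - w_new              # Δw = w_old − w_new
    return np.linalg.norm(delta_w) < tau # convergence threshold τ, e.g. 10⁻³
```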


SLIDE 5

Convergence checks

How often do I check for convergence?
Batch learning:
– w_old = w before seeing the current batch
– w_new = w after seeing the current batch
Assuming your batch is large enough, this works well.


SLIDE 6

Convergence checks

How often do I check for convergence?
Online learning:
– Problem: a single example may only lead to very small changes in w.
– Solution: only check for convergence after every k examples (or updates; it doesn't matter which).
    w_old = w after n·k examples/updates
    w_new = w after (n+1)·k examples/updates
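A sketch of this check-every-k schedule (a perceptron-style update loop of our own construction; names and the mistake test are assumptions, not from the slides):

```python
import numpy as np

def online_train(examples, w, k=100, tau=1e-3):
    """Online training, checking convergence only every k examples."""
    w_old = w.copy()
    for i, (x, y) in enumerate(examples, start=1):
        if y * np.dot(w, x) <= 0:      # mistake on this example?
            w = w + y * x              # perceptron update
        if i % k == 0:                 # check only every k examples/updates
            if np.linalg.norm(w_old - w) < tau:
                break                  # ‖Δw‖ < τ: converged
            w_old = w.copy()           # start the next window of k
    return w
```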


SLIDE 7

Back to linear classifiers…

SLIDE 8

Linear classifiers so far…

What we've seen so far is not the whole story:
– We've assumed that the data are linearly separable.
– We've ignored the fact that the perceptron just finds some decision boundary, but not necessarily an optimal decision boundary.


SLIDE 9

Data are not linearly separable

[Figure: two examples of non-separable data: noise/outliers, and a target function that is not linear in X.]

SLIDE 10

Today's key concepts

Kernel trick: dealing with target functions that are not linear (i.e. with data that are not linearly separable). This requires us to move to the dual representation.


SLIDE 11

Dual representation of linear classifiers

SLIDE 12

Dual representation

Recall the Perceptron update rule:
If x_m is misclassified, i.e. if y_m·f(x_m) = y_m·(w·x_m) < 0, add y_m·x_m to w:

    w := w + y_m·x_m

Dual representation: write w as a weighted sum of training items:

    w = ∑_n α_n y_n x_n        (α_n: how often was x_n misclassified?)

    f(x) = w·x = ∑_n α_n y_n (x_n·x)


SLIDE 13

Dual representation

Primal Perceptron update rule:
If x_m is misclassified, i.e. if y_m·f(x_m) = y_m·(w·x_m) < 0, add y_m·x_m to w:

    w := w + y_m·x_m

Dual Perceptron update rule:
If x_m is misclassified, i.e. if y_m·∑_d α_d y_d (x_d·x_m) < 0, add 1 to α_m:

    α_m := α_m + 1


SLIDE 14

Dual representation

Classifying x in the primal:

    f(x) = w·x

– w: the feature weights (to be learned)
– w·x: the dot product between w and x

Classifying x in the dual:

    f(x) = ∑_n α_n y_n (x_n·x)

– α_n: the weight of the n-th training example (to be learned)
– x_n·x: the dot product between x_n and x

The dual representation is advantageous when #training examples ≪ #features (it requires fewer parameters to learn).
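As a concrete illustration, here is a minimal dual perceptron in Python (NumPy; all function and variable names are ours, and we treat f(x_m) = 0 as a mistake so training can start from α = 0):

```python
import numpy as np

def train_dual_perceptron(X, y, epochs=10):
    """X: (n, d) array of training items; y: (n,) array of ±1 labels.
    Learns one α_n per training example instead of one weight per feature."""
    n = X.shape[0]
    alpha = np.zeros(n)                 # α_n: mistake counts
    G = X @ X.T                         # all pairwise dot products x_n·x_m
    for _ in range(epochs):
        for m in range(n):
            # f(x_m) = Σ_n α_n y_n (x_n·x_m)
            if y[m] * np.sum(alpha * y * G[:, m]) <= 0:
                alpha[m] += 1           # dual update: α_m := α_m + 1
    return alpha

def predict_dual(X_train, y_train, alpha, x):
    """Classify a new item x using the dual representation."""
    return np.sign(np.sum(alpha * y_train * (X_train @ x)))
```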


SLIDE 15
SLIDE 15

Kernels

SLIDE 16

Making data linearly separable

[Figure: the original feature space (x1, x2), axes from 0.5 to 2; the positive region is the unit disk.]

    f(x) = 1 iff x1² + x2² ≤ 1

SLIDE 17

Making data linearly separable

[Figure: the transformed feature space (x1·x1, x2·x2), axes from 0.5 to 2; the decision boundary is now linear.]

Transform the data: x = (x1, x2) ⇒ x' = (x1², x2²)

    f(x') = 1 iff x'1 + x'2 ≤ 1
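A small sketch of this transformation (the toy data are our own; any point with x1² + x2² ≤ 1 is labeled positive):

```python
import numpy as np

# Toy items: inside vs. outside the unit circle
X = np.array([[0.3, 0.4], [0.5, 0.5], [1.5, 0.2], [1.0, 1.0]])
y = np.where(X[:, 0]**2 + X[:, 1]**2 <= 1, 1, -1)

# Transform x = (x1, x2) into x' = (x1², x2²):
X_prime = X**2

# In the new space the rule f(x') = 1 iff x'1 + x'2 <= 1 is linear:
print(np.where(X_prime.sum(axis=1) <= 1, 1, -1))   # matches y
```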

SLIDE 18

Making data linearly separable

These data aren't linearly separable in the x1 space, but adding a second dimension with x2 = x1² makes them linearly separable in ⟨x1, x2⟩.

[Figure: 1-D data on the x1 axis, and the same data lifted onto the parabola (x1, x1²).]

SLIDE 19

Making data linearly separable

It is common for data to be not linearly separable in the original feature space. We can often introduce new features to make the data linearly separable in the new space:
– transform the original features (e.g. x → x²)
– include transformed features in addition to the original features
– capture interactions between features (e.g. x3 = x1·x2)
But this may blow up the number of features.


SLIDE 20

Making data linearly separable

We need to introduce a lot of new features to learn the target function.
Problem for the primal representation: w now has a lot of elements, and we might not have enough data to learn w.
The dual representation is not affected.


SLIDE 21

The kernel trick

– Define a feature function φ(x) which maps items x into a higher-dimensional space.
– The kernel function K(x_i, x_j) computes the inner product between φ(x_i) and φ(x_j):
    K(x_i, x_j) = φ(x_i)·φ(x_j)
– Dual representation: we don't need to learn w in this higher-dimensional space. It is sufficient to evaluate K(x_i, x_j).


SLIDE 22

Quadratic kernel

Original features: x = (a, b)
Transformed features: φ(x) = (a², b², √2·ab)

Dot product in the transformed space:

    φ(x1)·φ(x2) = a1²a2² + b1²b2² + 2·a1b1a2b2
                = (x1·x2)²

Kernel: K(x1, x2) = (x1·x2)² = φ(x1)·φ(x2)
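A quick numeric check of this identity (the toy vectors are our own choice):

```python
import numpy as np

def phi(x):
    """Feature map for the quadratic kernel: (a, b) -> (a², b², √2·ab)."""
    a, b = x
    return np.array([a**2, b**2, np.sqrt(2) * a * b])

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 0.5])

lhs = np.dot(x1, x2) ** 2          # K(x1, x2) = (x1·x2)²
rhs = np.dot(phi(x1), phi(x2))     # φ(x1)·φ(x2)
print(lhs, rhs)                    # both 16.0
```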


SLIDE 23

Polynomial kernels

Polynomial kernel of degree p:
– Basic form: K(x_i, x_j) = (x_i·x_j)^p
– Standard form (captures all lower-order terms): K(x_i, x_j) = (x_i·x_j + 1)^p


SLIDE 24

From dual to kernel perceptron

Dual Perceptron:

    f(x_m) = ∑_d α_d y_d (x_d·x_m)

Update: if x_m is misclassified, i.e. if y_m·∑_d α_d y_d (x_d·x_m) < 0, add 1 to α_m:

    α_m := α_m + 1

Kernel Perceptron:

    f(x_m) = ∑_d α_d y_d (φ(x_d)·φ(x_m)) = ∑_d α_d y_d K(x_d, x_m)

Update: if x_m is misclassified, i.e. if y_m·∑_d α_d y_d K(x_d, x_m) < 0, add 1 to α_m:

    α_m := α_m + 1
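A minimal kernel perceptron sketch (Python/NumPy, our own naming; the only change from the dual perceptron above is swapping the dot product for a kernel function k):

```python
import numpy as np

def train_kernel_perceptron(X, y, k, epochs=10):
    """Kernel perceptron: the dual perceptron with k(x_d, x_m)
    in place of the dot product x_d·x_m."""
    n = X.shape[0]
    alpha = np.zeros(n)
    K = np.array([[k(xd, xm) for xm in X] for xd in X])  # K[d, m] = k(x_d, x_m)
    for _ in range(epochs):
        for m in range(n):
            # Mistake test: y_m · Σ_d α_d y_d K(x_d, x_m) ≤ 0 (0 counts as a mistake)
            if y[m] * np.sum(alpha * y * K[:, m]) <= 0:
                alpha[m] += 1        # α_m := α_m + 1
    return alpha

def quadratic(x, z):
    """The quadratic kernel K(x, z) = (x·z)² from slide 22."""
    return np.dot(x, z) ** 2
```

Any of the kernels from the following slides can be passed in as k without touching the training loop.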


SLIDE 25

Primal and dual representation

Linear classifier (primal representation):
w defines the weights of the features of x:

    f(x) = w·x

Linear classifier (dual representation):
Rewrite w as a (weighted) sum of training items:

    w = ∑_n α_n y_n x_n
    f(x) = w·x = ∑_n α_n y_n (x_n·x)


SLIDE 26

The kernel trick

– Define a feature function φ(x) which maps items x into a higher-dimensional space.
– The kernel function K(x_i, x_j) computes the inner product between φ(x_i) and φ(x_j):
    K(x_i, x_j) = φ(x_i)·φ(x_j)
– Dual representation: we don't need to learn w in this higher-dimensional space. It is sufficient to evaluate K(x_i, x_j).


SLIDE 27

The kernel matrix

The kernel matrix of a data set D = {x1, …, xn}, defined by a kernel function k(x, z) = φ(x)·φ(z), is the n×n matrix K with K_ij = k(x_i, x_j).

You'll also find the term 'Gram matrix' used:
– The Gram matrix of a set of n vectors S = {x1, …, xn} is the n×n matrix G with G_ij = x_i·x_j.
– The kernel matrix is the Gram matrix of {φ(x1), …, φ(xn)}.
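A small sketch of both matrices (the toy data are our own):

```python
import numpy as np

def kernel_matrix(X, k):
    """Return the n×n matrix K with K_ij = k(x_i, x_j)."""
    return np.array([[k(xi, xj) for xj in X] for xi in X])

X = np.array([[1.0, 2.0], [3.0, 0.5], [0.0, 1.0]])
G = kernel_matrix(X, np.dot)                               # Gram matrix: G_ij = x_i·x_j
K = kernel_matrix(X, lambda x, z: np.dot(x, z) ** 2)       # kernel matrix: Gram matrix of the φ(x_i)
```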


SLIDE 28

Properties of the kernel matrix K

K is symmetric:

    K_ij = k(x_i, x_j) = φ(x_i)·φ(x_j) = k(x_j, x_i) = K_ji

K is positive semi-definite (for all vectors v: vᵀKv ≥ 0). Proof:

    vᵀKv = ∑_{i=1..D} ∑_{j=1..D} v_i v_j K_ij
         = ∑_{i=1..D} ∑_{j=1..D} v_i v_j ⟨φ(x_i), φ(x_j)⟩
         = ∑_{i=1..D} ∑_{j=1..D} v_i v_j ∑_{k=1..N} φ_k(x_i)·φ_k(x_j)
         = ∑_{k=1..N} ( ∑_{i=1..D} v_i φ_k(x_i) ) ( ∑_{j=1..D} v_j φ_k(x_j) )
         = ∑_{k=1..N} ( ∑_{i=1..D} v_i φ_k(x_i) )² ≥ 0
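A numeric sanity check of both properties (our own toy data; for a symmetric matrix, all eigenvalues being ≥ 0 is equivalent to positive semi-definiteness):

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 0.5], [0.0, 1.0]])
k = lambda x, z: (np.dot(x, z) + 1) ** 2       # polynomial kernel of degree 2

K = np.array([[k(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))              # symmetric: True
print(np.linalg.eigvalsh(K) >= -1e-9)   # all eigenvalues ≥ 0 (up to rounding)
```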

SLIDE 29

Quadratic kernel (1)

K(x, z) = (x·z)²

This corresponds to a feature space which contains only terms of degree 2 (products of two features); for x = (x1, x2) in R², these are x1x1, x1x2, x2x2.

For x = (x1, x2), z = (z1, z2):

    K(x, z) = (x·z)² = x1²z1² + 2·x1z1x2z2 + x2²z2²
            = φ(x)·φ(z)

Hence φ(x) = (x1², √2·x1x2, x2²).


SLIDE 30

Quadratic kernel (2)

K(x, z) = (x·z + c)²

This corresponds to a feature space which contains constants, linear terms (the original features), as well as terms of degree 2 (products of two features); for x = (x1, x2) in R²: x1, x2, x1x1, x1x2, x2x2.


SLIDE 31

Polynomial kernels

– Linear kernel: k(x, z) = x·z
– Polynomial kernel of degree d (only dth-order interactions): k(x, z) = (x·z)^d
– Polynomial kernel up to degree d (all interactions of order d or lower): k(x, z) = (x·z + c)^d with c > 0
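The same three kernels as one-line Python functions (a sketch; the parameter names are ours):

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def poly_kernel(x, z, d):
    """Only dth-order interactions."""
    return np.dot(x, z) ** d

def poly_kernel_upto(x, z, d, c=1.0):
    """All interactions of order d or lower (requires c > 0)."""
    return (np.dot(x, z) + c) ** d
```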


SLIDE 32

Constructing new kernels from one existing kernel k(x, x')

You can construct new kernels k'(x, x') from k(x, x') by:
– Multiplying k(x, x') by a constant c: k'(x, x') = c·k(x, x')
– Multiplying k(x, x') by a function f applied to x and x': k'(x, x') = f(x)·k(x, x')·f(x')
– Applying a polynomial (with non-negative coefficients) to k(x, x'): k'(x, x') = P(k(x, x')), with P(z) = ∑_i a_i z^i and a_i ≥ 0
– Exponentiating k(x, x'): k'(x, x') = exp(k(x, x'))


SLIDE 33

Constructing new kernels by combining two kernels k1(x, x'), k2(x, x')

You can construct k'(x, x') from k1(x, x') and k2(x, x') by:
– Adding them: k'(x, x') = k1(x, x') + k2(x, x')
– Multiplying them: k'(x, x') = k1(x, x')·k2(x, x')


SLIDE 34

Constructing new kernels

– If φ(x) ∈ R^m and k_m(z, z') is a valid kernel in R^m, then k(x, x') = k_m(φ(x), φ(x')) is also a valid kernel.
– If A is a symmetric positive semi-definite matrix, then k(x, x') = xᵀAx' is also a valid kernel.


SLIDE 35

Normalizing a kernel

    k'(x, z) = k(x, z) / √( k(x, x)·k(z, z) )
             = (φ(x)·φ(z)) / √( (φ(x)·φ(x))·(φ(z)·φ(z)) )
             = (φ(x)·φ(z)) / (‖φ(x)‖·‖φ(z)‖)
             = ψ(x)·ψ(z)    with ψ(x) = φ(x)/‖φ(x)‖

Recall: you can normalize any vector x (transform it into a unit vector that has the same direction as x) by

    x̂ = x/‖x‖ = x / √(x1² + … + xN²)
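A sketch of kernel normalization as a function transformer (our own naming):

```python
import numpy as np

def normalize_kernel(k):
    """Turn k into k'(x, z) = k(x, z) / sqrt(k(x, x)·k(z, z))."""
    return lambda x, z: k(x, z) / np.sqrt(k(x, x) * k(z, z))

k = lambda x, z: (np.dot(x, z) + 1) ** 2
k_norm = normalize_kernel(k)

x = np.array([1.0, 2.0])
print(k_norm(x, x))   # 1.0: every item has unit norm under the normalized kernel
```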

SLIDE 36

Gaussian kernel (aka radial basis function kernel)

    k(x, z) = exp( −‖x − z‖²/c )

– ‖x − z‖²: squared Euclidean distance between x and z
– c (often written as 2σ²): a free parameter
– very small c: K ≈ the identity matrix (every item is different)
– very large c: K ≈ the all-ones matrix (all items are the same)
– k(x, z) ≈ 1 when x and z are close
– k(x, z) ≈ 0 when x and z are dissimilar
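A minimal Gaussian (RBF) kernel in Python, with the bandwidth parameter c as on the slide:

```python
import numpy as np

def gaussian_kernel(x, z, c):
    """k(x, z) = exp(−‖x − z‖² / c)."""
    diff = x - z
    return np.exp(-np.dot(diff, diff) / c)

x, z = np.array([1.0, 2.0]), np.array([1.1, 2.1])
print(gaussian_kernel(x, z, c=1.0))          # close items: k ≈ 1
print(gaussian_kernel(x, z * 100, c=1.0))    # distant items: k ≈ 0
```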


SLIDE 37

Gaussian kernel (aka radial basis function kernel)

    k(x, z) = exp( −‖x − z‖²/c )

This is a valid kernel because:

    k(x, z) = exp( −‖x − z‖²/2σ² )
            = exp( −(x·x + z·z − 2x·z)/2σ² )
            = exp(−x·x/2σ²) · exp(x·z/σ²) · exp(−z·z/2σ²)
            = f(x) · exp(x·z/σ²) · f(z)

exp(x·z/σ²) is a valid kernel:
– x·z is the linear kernel;
– we can multiply kernels by constants (1/σ²);
– we can exponentiate kernels.


SLIDE 38

Kernels over (finite) sets

X, Z: subsets of a finite set D with |D| elements.

k(X, Z) = |X∩Z| (the number of elements in both X and Z) is a valid kernel:

    k(X, Z) = φ(X)·φ(Z), where φ(X) maps X to a bit vector of length |D|
    (i-th bit: does X contain the i-th element of D?)

k(X, Z) = 2^|X∩Z| (the number of subsets shared by X and Z) is a valid kernel:

    φ(X) maps X to a bit vector of length 2^|D|
    (i-th bit: does X contain the i-th subset of D?)
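A quick check of both set kernels using Python sets (the toy sets are our own; the bit-vector φ stays implicit in the intersection):

```python
def set_kernel(X, Z):
    """k(X, Z) = |X ∩ Z|: the number of elements both sets share."""
    return len(X & Z)

X = {"a", "b", "c"}
Z = {"b", "c", "d"}
print(set_kernel(X, Z))        # 2 = |{b, c}|
print(2 ** set_kernel(X, Z))   # 4 = 2^|X∩Z|, the number of shared subsets
```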
