

SLIDE 1

Machine Learning

Fall 2017

Professor Liang Huang

Kernels

(Kernels, Kernelized Perceptron and SVM)

(Chap. 12 of CIML)

SLIDE 2

Nonlinear Features

  • Concatenated (combined) features
  • XOR: x = (x1, x2, x1x2)
  • income: add “degree + major”
  • Perceptron
  • Map data into feature space: x → φ(x)
  • Solution in span of φ(xi)

[Figure: XOR data points x1: +1, x2: −1, x3: +1, x4: −1 mapped into feature space]
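A quick numerical check of the XOR bullet, taking the points at (±1, ±1): with the product feature x1x2 appended, the four XOR points become linearly separable. A minimal sketch; the separator w = (0, 0, 1) is an illustrative choice, not from the slides.

```python
# XOR points (x1, x2) with labels y; phi appends the product feature x1*x2
points = {(1, 1): +1, (-1, -1): +1, (1, -1): -1, (-1, 1): -1}

def phi(x1, x2):
    # concatenated (combined) features from the slide
    return (x1, x2, x1 * x2)

w = (0, 0, 1)  # in feature space, sign(x1*x2) labels XOR perfectly
for (x1, x2), y in points.items():
    f = sum(wi * fi for wi, fi in zip(w, phi(x1, x2)))
    assert (1 if f > 0 else -1) == y  # all four points correctly classified
```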

SLIDE 3

Quadratic Features

  • Separating surfaces are circles, hyperbolae, parabolae

SLIDE 4

Kernels as dot products

Problem: Extracting features can sometimes be very costly. Example: second-order features in 1000 dimensions. This leads to 5 · 10^5 numbers. For higher-order polynomial features it is much worse.

Solution: Don't compute the features; try to compute dot products implicitly. For some features this works . . .

Definition: A kernel function k : X × X → ℝ is a symmetric function in its arguments for which the following property holds: k(x, x′) = ⟨Φ(x), Φ(x′)⟩ for some feature map Φ. If k(x, x′) is much cheaper to compute than Φ(x) . . .

SLIDE 5

Quadratic Kernel

Quadratic Features in ℝ²:
  Φ(x) := (x1², √2 x1x2, x2²)

Dot Product:
  ⟨Φ(x), Φ(x′)⟩ = ⟨(x1², √2 x1x2, x2²), (x1′², √2 x1′x2′, x2′²)⟩ = ⟨x, x′⟩² = k(x, x′)

Insight: the trick works for any polynomial of order d via ⟨x, x′⟩^d.

For x ∈ ℝⁿ with the quadratic φ, the naive route costs O(n²) to compute φ(x) and O(n²) for φ(x)·φ(x′); the kernel k(x, x′) costs only O(n).
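A numerical check of the identity ⟨Φ(x), Φ(x′)⟩ = ⟨x, x′⟩², as a minimal sketch for x ∈ ℝ²:

```python
import numpy as np

def phi(x):
    # explicit quadratic feature map (O(n^2) features in general)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, xp):
    # quadratic kernel: one dot product, O(n)
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(xp)), k(x, xp))  # both print 1.0
```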

SLIDE 6

The Perceptron on features

  • Nothing happens if classified correctly
  • Weight vector is a linear combination of the φ(xi)
  • Classifier is (implicitly) a linear combination of inner products

initialize w = 0, b = 0
repeat
  pick (xi, yi) from data
  if yi(w · Φ(xi) + b) ≤ 0 then
    w ← w + yi Φ(xi)
    b ← b + yi
until yi(w · Φ(xi) + b) > 0 for all i

w = Σ_{i∈I} αi φ(xi)        f(x) = Σ_{i∈I} αi ⟨φ(xi), φ(x)⟩
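The pseudocode above as a runnable sketch; the feature map phi and the data are placeholders to be supplied.

```python
import numpy as np

def perceptron_primal(X, y, phi, max_epochs=100):
    """Perceptron on explicit features: w and b live in feature space."""
    F = np.array([phi(x) for x in X])        # precompute Phi(x_i) for all examples
    w, b = np.zeros(F.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for fi, yi in zip(F, y):
            if yi * (w @ fi + b) <= 0:       # misclassified (or on the boundary)
                w += yi * fi                 # w <- w + y_i Phi(x_i)
                b += yi                      # b <- b + y_i
                mistakes += 1
        if mistakes == 0:                    # all margins positive: converged
            break
    return w, b
```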

SLIDE 7

Kernelized Perceptron (Functional Form)

  • instead of updating w, now update αi
  • weight vector is a linear combination: w = Σ_{i∈I} αi φ(xi)
  • classifier is a linear combination of inner products:
    f(x) = Σ_{i∈I} αi ⟨φ(xi), φ(x)⟩ = Σ_{i∈I} αi k(xi, x)

initialize f = 0
repeat
  pick (xi, yi) from data
  if yi f(xi) ≤ 0 then
    f(·) ← f(·) + yi k(xi, ·) + yi
until yi f(xi) > 0 for all i

update: αi ← αi + yi (increase its vote by 1)
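The same algorithm in dual form, as a sketch: we store α and the Gram matrix instead of w (function and variable names are mine).

```python
import numpy as np

def perceptron_dual(X, y, k, max_epochs=100):
    """Kernelized perceptron: update alpha_i instead of w."""
    n = len(X)
    alpha, b = np.zeros(n), 0.0
    # Gram matrix K[i][j] = k(x_i, x_j), computed once
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            f = alpha @ K[:, i] + b          # f(x_i) = sum_j alpha_j k(x_j, x_i) + b
            if y[i] * f <= 0:                # mistake on x_i
                alpha[i] += y[i]             # one more "vote" for x_i
                b += y[i]
                mistakes += 1
        if mistakes == 0:                    # converged
            break
    return alpha, b
```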

SLIDE 8

Kernelized Perceptron

  • Nothing happens if classified correctly
  • Weight vector is a linear combination
  • Classifier is a linear combination of inner products

Primal Form
  update weights: w ← w + yi φ(xi)
  classify: f(x) = w · φ(x)

Dual Form (implicitly equivalent)
  update linear coefficients: αi ← αi + yi
  implicitly: w = Σ_{i∈I} αi φ(xi)
  classify: f(x) = Σ_{i∈I} αi ⟨φ(xi), φ(x)⟩ = Σ_{i∈I} αi k(xi, x)

SLIDE 9

Kernelized Perceptron

Primal Form
  update weights: w ← w + yi φ(xi)
  classify: f(x) = w · φ(x)

Dual Form (implicitly equivalent)
  update linear coefficients: αi ← αi + yi
  implicitly: w = Σ_{i∈I} αi φ(xi)
  classify: f(x) = w · φ(x) = [Σ_{i∈I} αi φ(xi)] · φ(x) = Σ_{i∈I} αi ⟨φ(xi), φ(x)⟩ = Σ_{i∈I} αi k(xi, x)

(evaluating via ⟨φ(xi), φ(x)⟩ is slow, O(d²); via k(xi, x) it is fast, O(d))

SLIDE 10

Kernelized Perceptron (Dual Form)

initialize αi = 0 for all i
repeat
  pick (xi, yi) from data
  if yi f(xi) ≤ 0 then
    αi ← αi + yi        (update linear coefficients)
until yi f(xi) > 0 for all i

implicitly: w = Σ_{i∈I} αi φ(xi)
classify: f(x) = w · φ(x) = [Σ_{i∈I} αi φ(xi)] · φ(x) = Σ_{i∈I} αi ⟨φ(xi), φ(x)⟩ = Σ_{i∈I} αi k(xi, x)
(evaluating via ⟨φ(xi), φ(x)⟩ is slow, O(d²); via k(xi, x) it is fast, O(d))

If #features >> #examples, the dual is easier; otherwise the primal is easier.

SLIDE 11

Kernelized Perceptron

Primal Perceptron
  update weights: w ← w + yi φ(xi)
  classify: f(x) = w · φ(x)

Dual Perceptron
  update linear coefficients: αi ← αi + yi
  implicitly: w = Σ_{i∈I} αi φ(xi)

If #features >> #examples, the dual is easier; otherwise the primal is easier.

Q: when is #features >> #examples?
A: with higher-order polynomial kernels, or exponential kernels (infinite-dimensional).
SLIDE 12

Kernelized Perceptron: Pros/Cons of the Kernel in the Dual

  • pros:
    • no need to compute φ(x) (time)
    • no need to store φ(x) and w (memory)
  • cons:
    • sum over all misclassified training examples at test time
    • need to store all misclassified training examples (memory)
    • this set is called the “support vector set”
    • SVM will minimize this set!

dual update: αi ← αi + yi
implicitly: w = Σ_{i∈I} αi φ(xi)
classify: f(x) = w · φ(x) = [Σ_{i∈I} αi φ(xi)] · φ(x) = Σ_{i∈I} αi ⟨φ(xi), φ(x)⟩ = Σ_{i∈I} αi k(xi, x)
(evaluating via ⟨φ(xi), φ(x)⟩ is slow, O(d²); via k(xi, x) it is fast, O(d))

SLIDE 13

Kernelized Perceptron: Dual vs. Primal Example

Data (linear kernel, i.e. identity feature map):
  x1 = (0, 1): −1
  x2 = (2, 1): +1
  x3 = (0, −1): +1

Primal Perceptron: update w on each new mistake
  x1: −1  ⇒  w = (0, −1)
  x2: +1  ⇒  w = (2, 0)
  x3: +1  ⇒  w = (2, −1)

Dual Perceptron: update α on each new mistake (w implicit)
  x1: −1  ⇒  α = (−1, 0, 0)   w = −x1
  x2: +1  ⇒  α = (−1, 1, 0)   w = −x1 + x2
  x3: +1  ⇒  α = (−1, 1, 1)   w = −x1 + x2 + x3

final implicit w = (2, −1)

Geometric interpretation of dual classification: the summed dot products with x2 & x3 must exceed the dot product with x1 (agreement with positives > agreement with negatives).
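A standalone check of this trace: one pass of the dual perceptron with the linear kernel (no bias term, as in the slide) recovers α = (−1, 1, 1) and the implicit w = (2, −1).

```python
import numpy as np

X = np.array([[0.0, 1.0], [2.0, 1.0], [0.0, -1.0]])
y = np.array([-1, 1, 1])

# one pass of the dual perceptron, linear kernel k(x, x') = x . x'
alpha, K = np.zeros(3), X @ X.T
for i in range(3):
    if y[i] * (alpha @ K[:, i]) <= 0:   # mistake on x_i
        alpha[i] += y[i]
print(alpha)                            # [-1.  1.  1.]
print(alpha @ X)                        # implicit w = -x1 + x2 + x3 = [ 2. -1.]
```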

SLIDE 14

XOR Example

(XOR data: x1: +1, x2: −1, x3: +1, x4: −1)

Dual Perceptron: update α on each new mistake (w implicit)
  x1: +1  ⇒  α = (+1, 0, 0, 0)    w = φ(x1)
  x2: −1  ⇒  α = (+1, −1, 0, 0)   w = φ(x1) − φ(x2)

classification rule in dual/geometry:
  (x · x1)² > (x · x2)²  ⇒  cos²θ1 > cos²θ2  ⇒  |cos θ1| > |cos θ2|

in dual/algebra, writing the test point as x = (x1, x2) and taking the data points x1 = (1, 1), x2 = (1, −1):
  (x · x1)² > (x · x2)²  ⇒  (x1 + x2)² > (x1 − x2)²  ⇒  x1x2 > 0

also verify in the primal:
  k(x, x′) = (x · x′)²  ⇔  φ(x) = (x1², x2², √2 x1x2)
  w = φ(x1) − φ(x2) = (0, 0, 2√2)
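Verifying numerically, with the XOR points taken at (±1, ±1): after the two updates the dual classifier f(x) = k(x1, x) − k(x2, x) gets all four points right.

```python
import numpy as np

k = lambda x, xp: np.dot(x, xp) ** 2          # quadratic kernel

x1, x2 = np.array([1, 1]), np.array([1, -1])  # the two updated examples
data = {(1, 1): +1, (-1, -1): +1, (1, -1): -1, (-1, 1): -1}

for p, label in data.items():
    x = np.array(p)
    f = k(x1, x) - k(x2, x)                   # alpha = (+1, -1, 0, 0)
    print(p, label, "f(x) =", f)              # sign(f) matches every label
```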

SLIDE 15

Circle Example??

Dual Perceptron: update α on each new mistake (w implicit)
  x1: +1  ⇒  α = (+1, 0, 0, 0)    w = φ(x1)
  x2: −1  ⇒  α = (+1, −1, 0, 0)   w = φ(x1) − φ(x2)

k(x, x′) = (x · x′)²  ⇔  φ(x) = (x1², x2², √2 x1x2)

SLIDE 16

Polynomial Kernels

Idea: We want to extend k(x, x′) = ⟨x, x′⟩² to k(x, x′) = (⟨x, x′⟩ + c)^d where c > 0 and d ∈ ℕ. Prove that such a kernel corresponds to a dot product.

Proof strategy: Simple and straightforward: compute the explicit sum given by the kernel, i.e.
  k(x, x′) = (⟨x, x′⟩ + c)^d = Σ_{i=0}^{d} (d choose i) ⟨x, x′⟩^i c^{d−i}
Individual terms ⟨x, x′⟩^i are dot products for some Φi(x).

+c just augments the space. Simpler proof: add a constant feature x0 = √c to each example, so ⟨x, x′⟩ + c is an ordinary dot product in the augmented space.

SLIDE 17

Circle Example

Dual Perceptron: update α on each new mistake (w implicit)
  x1: +1  ⇒  α = (+1, 0, 0, 0, 0)     w = φ(x1)
  x2: −1  ⇒  α = (+1, −1, 0, 0, 0)    w = φ(x1) − φ(x2)
  x3: −1  ⇒  α = (+1, −1, −1, 0, 0)   w = φ(x1) − φ(x2) − φ(x3)

k(x, x′) = (x · x′)²      ⇔  φ(x) = (x1², x2², √2 x1x2)
k(x, x′) = (x · x′ + 1)²  ⇔  φ(x) = ?

[Figure: five points x1 … x5 with labels ±1]
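One answer to “φ(x) = ?”, for x ∈ ℝ² and c = 1: a sketch using the augment-by-√c construction from Slide 16, easy to check numerically.

```python
import numpy as np

def phi(x):
    # explicit map for k(x, x') = (x . x' + 1)^2 in R^2:
    # quadratic terms, linear terms scaled by sqrt(2), and a constant 1
    s = np.sqrt(2)
    return np.array([x[0]**2, x[1]**2, s*x[0]*x[1], s*x[0], s*x[1], 1.0])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(xp)))   # 4.0
print((np.dot(x, xp) + 1) ** 2)  # 4.0, identical
```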

SLIDE 18

Examples

Examples of kernels k(x, x′):
  Linear:            ⟨x, x′⟩
  Laplacian RBF:     exp(−λ‖x − x′‖)
  Gaussian RBF:      exp(−λ‖x − x′‖²)
  Polynomial:        (⟨x, x′⟩ + c)^d,  c ≥ 0, d ∈ ℕ
  B-Spline:          B_{2n+1}(x − x′)
  Cond. Expectation: E_c[p(x|c) p(x′|c)]

Simple trick for checking Mercer’s condition: compute the Fourier transform of the kernel and check that it is nonnegative.

You only need to know the polynomial and Gaussian kernels.

(RBF kernels distort distance; polynomial kernels distort angle.)
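The first four kernels from the table, written out as plain functions (parameter names are mine):

```python
import numpy as np

def linear(x, xp):
    return np.dot(x, xp)

def polynomial(x, xp, c=1.0, d=3):
    return (np.dot(x, xp) + c) ** d

def gaussian_rbf(x, xp, lam=1.0):
    return np.exp(-lam * np.sum((x - xp) ** 2))

def laplacian_rbf(x, xp, lam=1.0):
    return np.exp(-lam * np.linalg.norm(x - xp))
```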

SLIDE 19

Kernel Summary

  • For a feature map φ, find a magic function k such that:
  • the dot product φ(x)·φ(x′) = k(x, x′)
  • this k(x, x′) should be much faster to compute than φ(x)
  • k(x, x′) should be computable in O(n) if x ∈ ℝⁿ
  • φ(x) is much slower: O(n^d) for a polynomial of degree d, more for Gaussian
  • But for any function k, is there a φ s.t. φ(x)·φ(x′) = k(x, x′)?

SLIDE 20

Mercer’s Theorem

The Theorem: For any symmetric function k : X × X → ℝ which is square integrable in X × X and which satisfies
  ∫∫_{X×X} k(x, x′) f(x) f(x′) dx dx′ ≥ 0 for all f ∈ L²(X),
there exist φi : X → ℝ and numbers λi ≥ 0 such that
  k(x, x′) = Σ_i λi φi(x) φi(x′) for all x, x′ ∈ X.

Interpretation: The double integral is the continuous version of a vector-matrix-vector multiplication. For positive semidefinite matrices we have
  Σ_i Σ_j k(xi, xj) αi αj ≥ 0.
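The matrix version of Mercer's condition is easy to check numerically: on any finite sample, the Gram matrix must have no negative eigenvalues. A sketch:

```python
import numpy as np

def is_psd_gram(k, X, tol=1e-9):
    """Discrete Mercer check: all eigenvalues of the Gram matrix K are >= 0."""
    n = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

X = [np.random.randn(3) for _ in range(20)]
print(is_psd_gram(lambda x, xp: (np.dot(x, xp) + 1) ** 2, X))  # True
print(is_psd_gram(lambda x, xp: -np.dot(x, xp), X))            # False: not a kernel
```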

SLIDE 21

Properties

Distance in Feature Space: distances between points in feature space via
  d(x, x′)² := ‖Φ(x) − Φ(x′)‖² = ⟨Φ(x), Φ(x)⟩ − 2⟨Φ(x), Φ(x′)⟩ + ⟨Φ(x′), Φ(x′)⟩ = k(x, x) + k(x′, x′) − 2k(x, x′)

Kernel Matrix: To compare observations we compute dot products, so we study the matrix K given by Kij = ⟨Φ(xi), Φ(xj)⟩ = k(xi, xj), where the xi are the training patterns.

Similarity Measure: The entries Kij tell us the overlap between Φ(xi) and Φ(xj), so k(xi, xj) is a similarity measure.
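The distance formula means we can measure feature-space distances without ever forming Φ. A sketch with the Gaussian RBF kernel, for which k(x, x) = 1 and so d(x, x′)² = 2 − 2k(x, x′):

```python
import numpy as np

def kernel_dist(k, x, xp):
    # d(x, x')^2 = k(x, x) + k(x', x') - 2 k(x, x')
    return np.sqrt(k(x, x) + k(xp, xp) - 2 * k(x, xp))

rbf = lambda x, xp: np.exp(-np.sum((x - xp) ** 2))
x, xp = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(kernel_dist(rbf, x, xp))  # ~sqrt(2): distant points saturate at sqrt(2)
```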

SLIDE 22

Kernelized Pegasos for SVM

For HW2, you don’t need to randomly choose training examples: just go over all training examples in the original order, and call that an epoch (same as HW1).
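A minimal sketch of the kernelized Pegasos update under the HW2 convention above (fixed visiting order, one pass = one epoch). The α-counting update follows the published dual formulation of Pegasos (Shalev-Shwartz et al.); the exact HW2 interface may differ.

```python
import numpy as np

def kernelized_pegasos(X, y, k, lam=0.01, epochs=5):
    """Kernelized Pegasos: alpha[i] counts how many updates x_i triggered."""
    n = len(X)
    alpha = np.zeros(n)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    t = 0
    for _ in range(epochs):                       # one ordered pass = one epoch
        for i in range(n):
            t += 1
            f = (alpha * y) @ K[:, i] / (lam * t)  # current decision value at x_i
            if y[i] * f < 1:                       # margin violated
                alpha[i] += 1                      # x_i gets one more vote
    return alpha
```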

SLIDE 23

[Figure: Gaussian RBF SVM decision surface, σ = 1.0, C = ∞, with contours f(x) = 1, f(x) = 0, f(x) = −1]

f(x) = Σ_{i=1}^{N} αi yi exp(−‖x − xi‖² / 2σ²) + b

Gaussian RBF kernel (default in sklearn)

SLIDE 24

[Figure: same data, σ = 1.0, C = 100]

Decreasing C gives a wider (soft) margin.

SLIDE 25

[Figure: same data, σ = 1.0, C = 10]

SLIDE 26

[Figure: same data, σ = 1.0, C = ∞]

SLIDE 27

[Figure: same data, σ = 0.25, C = ∞]

Decreasing σ moves towards a nearest-neighbour classifier.

SLIDE 28

[Figure: same data, σ = 0.1, C = ∞]
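Plots like these can be reproduced with sklearn's SVC. Note that sklearn parametrizes the RBF kernel as exp(−γ‖x − x′‖²), so γ = 1/(2σ²), and C = ∞ is approximated by a very large C; the toy dataset below is a stand-in, not the slides' data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # circular true boundary

sigma, C = 1.0, 1e6                # huge C approximates the hard-margin C = "infinity"
clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=C).fit(X, y)
print(clf.n_support_)              # number of support vectors per class
# decreasing sigma (i.e. increasing gamma) bends the boundary toward nearest-neighbor
```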

SLIDE 29

Polynomial Kernels

This is in contrast with C: smaller C ⇒ wider margin (underfitting); larger C ⇒ narrower margin (overfitting).

SLIDE 30

Overfitting vs. Underfitting

SLIDE 31

From SVM to Nearest Neighbor

  • for each test example x, decide its label by the training example closest to x
  • decision boundary highly non-linear (Voronoi)
  • k-nearest neighbor (k-NN): smoother boundaries (see the sketch below)
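A bare-bones version of this classifier: k = 1 is exactly “closest training example wins”; larger k takes a majority vote over labels in {−1, +1}.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Label x by majority vote among its k nearest training examples."""
    d = np.sum((X_train - x) ** 2, axis=1)           # squared Euclidean distances
    nearest = np.argsort(d)[:k]                      # indices of the k closest points
    return 1 if np.sum(y_train[nearest]) > 0 else -1  # labels assumed in {-1, +1}
```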
SLIDE 32

[Figure: k-NN decision boundaries on training data (left) and testing data (right), for K = 1, 3, 7, 21]

  K = 1:  training error = 0.0,    testing error = 0.15
  K = 3:  training error = 0.0760, testing error = 0.1340
  K = 7:  training error = 0.1320, testing error = 0.1110
  K = 21: training error = 0.1120, testing error = 0.0920

SLIDE 33

[Figure: same k-NN plots as Slide 32]

  K = 1:  training error = 0.0,    testing error = 0.15
  K = 3:  training error = 0.0760, testing error = 0.1340
  K = 7:  training error = 0.1320, testing error = 0.1110
  K = 21: training error = 0.1120, testing error = 0.0920

small k: overfitting; large k: underfitting
what about k = N?

SLIDE 34

SVM vs. Nearest Neighbor

  support vectors: few (SVM) vs. all (nearest neighbor)

SLIDE 35

SLIDE 36

[Figure: points labeled a, b, c, d, e, f]