

SLIDE 1

Kernel Methods

for regression and classification


Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/

Many ideas/slides attributable to: Dan Sheldon (U.Mass.) James, Witten, Hastie, Tibshirani (ISL/ESL books)

  • Prof. Mike Hughes
SLIDE 2

Objectives for Day 19: Kernels

Big idea: Use kernel functions (similarity functions with special properties) to obtain flexible high-dimensional feature transformations without ever computing the explicit features

  • From linear regression (LR) to kernelized LR
  • What is a kernel function?
  • Basic properties
  • Example: Polynomial kernel
  • Example: Squared Exponential kernel
  • Kernels for classification
  • Logistic Regression
  • SVMs

SLIDE 3


Task: Regression & Classification

[Diagram: supervised learning (x mapped to y) alongside unsupervised and reinforcement learning. In regression, y is a numeric variable, e.g. sales in $$.]

SLIDE 4

Keys to Regression Success

  • Feature transformation + linear model
  • Penalized weights to avoid overfitting

SLIDE 5

Can fit linear functions to nonlinear features


\hat{y}(x_i) = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \theta_3 x_i^3

A nonlinear function of x can be written as a linear function of \phi(x_i):

\phi(x_i) = [\,1 \;\; x_i \;\; x_i^2 \;\; x_i^3\,]

\hat{y}(x_i) = \sum_{g=1}^{4} \theta_g \, \phi_g(x_i) = \theta^T \phi(x_i)

"Linear regression" means linear in the parameters (weights, biases). The features can be arbitrary transforms of the raw data.
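To make this concrete, here is a minimal sketch of fitting a cubic-feature linear regression (assuming numpy and scikit-learn; the synthetic data is illustrative, not from the slides):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical 1-D data, for illustration only (not the slides' dataset)
rng = np.random.RandomState(0)
x = rng.uniform(-2, 2, size=100)
y = 0.5 * x**3 - x + rng.normal(scale=0.3, size=100)

# Explicit cubic feature transform: phi(x) = [1, x, x^2, x^3]
Phi = np.stack([np.ones_like(x), x, x**2, x**3], axis=1)

# Linear in the parameters theta, even though the fitted curve is nonlinear in x
model = LinearRegression(fit_intercept=False).fit(Phi, y)
print(model.coef_)  # estimated [theta_0, theta_1, theta_2, theta_3]
```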

SLIDE 6

What feature transform to use?

  • Anything that works for your data!
  • sin / cos for periodic data
  • polynomials for high-order dependencies
  • interactions between feature dimensions
  • Many other choices possible


\phi(x_i) = [\,1 \;\; x_i \;\; x_i^2 \;\; \ldots\,]

\phi(x_i) = [\,1 \;\; x_{i1} x_{i2} \;\; x_{i3} x_{i4} \;\; \ldots\,]

SLIDE 7


Review: Linear Regression

Prediction: a linear transform of G-dimensional features

\hat{y}(x_i, \theta) = \theta^T \phi(x_i) = \sum_{g=1}^{G} \theta_g \, \phi(x_i)_g

Training: solve the optimization problem

\min_{\theta} \; \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, \theta) \right)^2 \quad (+ \text{ L2 penalty, optional})
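A minimal sketch of one standard way to solve this training problem in closed form (the slides do not prescribe a solver; the normal-equations approach and optional L2 penalty are assumptions of the sketch):

```python
import numpy as np

def fit_linear_regression(Phi, y, l2_penalty=0.0):
    # Solve min_theta sum_n (y_n - theta^T phi(x_n))^2 + l2_penalty * ||theta||^2
    # via the normal equations: (Phi^T Phi + lambda I) theta = Phi^T y
    G = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + l2_penalty * np.eye(G), Phi.T @ y)

def predict(Phi, theta):
    # Prediction: linear transform of the G-dimensional features
    return Phi @ theta
```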

SLIDE 8

Problems with high-dim features

  • Feature transformation + linear model


How expensive is this transformation? (Runtime and storage)

SLIDE 9

Thought Experiment

  • Suppose that the optimal weight vector can be exactly constructed as a linear combination of the training set feature vectors


\theta^* = \alpha_1 \phi(x_1) + \alpha_2 \phi(x_2) + \ldots + \alpha_N \phi(x_N)

Each \alpha_n is a scalar; each feature vector \phi(x_n) is a vector of size G.

SLIDE 10

Justification?

Is optimal theta a linear combo of feature vectors?

Stochastic gradient descent, with 1 example per batch, can be seen as building up the optimal weight vector in exactly this form

  • Starting with all zero vector
  • In each step, adding a weight * feature vector

Each update step:


\theta_t \leftarrow \theta_{t-1} - \eta \cdot \frac{d}{d\theta} \, \text{loss}\!\left( y_n, \theta^T \phi(x_n) \right)

Let’s simplify this via chain rule!

SLIDE 11

Justification?

Stochastic gradient descent, with 1 example per batch, can be seen as building up the optimal weight vector in exactly this form

  • Starting with all zero vector
  • In each step, adding a weight * feature vector

Each update step:


\theta_t \leftarrow \theta_{t-1} - \eta \cdot \frac{d}{da} \, \text{loss}(y_n, a) \cdot \frac{d}{d\theta} \, \theta^T \phi(x_n)

(Here \eta and \frac{d}{da}\,\text{loss}(y_n, a) are scalars; \frac{d}{d\theta}\,\theta^T \phi(x_n) is a vector of size G.)

SLIDE 12

Justification?

Stochastic gradient descent, with 1 example per batch, can be seen as building up the optimal weight vector in exactly this form

  • Starting with all zero vector
  • In each step, adding a weight * feature vector

Each update step:


\theta_t \leftarrow \theta_{t-1} - \eta \cdot \frac{d}{da} \, \text{loss}(y_n, a) \cdot \phi(x_n)

(Simplified, since \frac{d}{d\theta}\,\theta^T \phi(x_n) = \phi(x_n): each step adds a scalar multiple of one training feature vector.)
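A minimal sketch of this view for squared-error loss, tracking only the scalar weight on each training feature vector (one example per SGD step; names and hyperparameters are illustrative):

```python
import numpy as np

def sgd_in_alpha_form(Phi, y, lr=0.01, n_epochs=10):
    """SGD for least squares where theta is never stored explicitly.

    At every step, theta equals sum_n alpha[n] * Phi[n].
    """
    N, G = Phi.shape
    alpha = np.zeros(N)
    for _ in range(n_epochs):
        for n in range(N):
            # Current prediction: theta^T phi(x_n) = sum_m alpha[m] * (Phi[m] @ Phi[n])
            y_hat = sum(alpha[m] * Phi[m] @ Phi[n] for m in range(N))
            # d/da of squared-error loss (y_n - a)^2, evaluated at a = y_hat
            dloss = 2.0 * (y_hat - y[n])
            # The update theta <- theta - lr * dloss * phi(x_n) only changes alpha[n]
            alpha[n] -= lr * dloss
    return alpha
```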

SLIDE 13

How to Predict in this thought experiment


\theta^* = \alpha_1 \phi(x_1) + \alpha_2 \phi(x_2) + \ldots + \alpha_N \phi(x_N)

Prediction:

\hat{y}(x_i, \theta) = \theta^T \phi(x_i)

\hat{y}(x_i, \theta^*) = \left( \sum_{n=1}^{N} \alpha_n \phi(x_n) \right)^{\!T} \phi(x_i)

SLIDE 14

How to Predict in this thought experiment


\theta^* = \alpha_1 \phi(x_1) + \alpha_2 \phi(x_2) + \ldots + \alpha_N \phi(x_N)

Prediction:

\hat{y}(x_i, \theta) = \theta^T \phi(x_i)

\hat{y}(x_i, \theta^*) = \sum_{n=1}^{N} \alpha_n \, \phi(x_n)^T \phi(x_i)

Inner product of the test feature vector with each training feature vector!

SLIDE 15

Kernel Function


k(x_i, x_j) = \phi(x_i)^T \phi(x_j)

Input: any two vectors x_i and x_j
Output: a real-valued scalar
Interpretation: a similarity function for x_i and x_j
Properties: larger output values mean x_i and x_j are more similar; symmetric, so k(x_i, x_j) = k(x_j, x_i)
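A tiny sketch of one concrete kernel and its symmetry property (the polynomial kernel and the example vectors are illustrative choices, not from the slides):

```python
import numpy as np

def polynomial_kernel(x_i, x_j, degree=2):
    # One concrete kernel: k(x_i, x_j) = (1 + x_i^T x_j)^degree
    return (1.0 + np.dot(x_i, x_j)) ** degree

x_i = np.array([1.0, 2.0])
x_j = np.array([0.5, -1.0])
print(polynomial_kernel(x_i, x_j))                                  # scalar similarity score
print(polynomial_kernel(x_i, x_j) == polynomial_kernel(x_j, x_i))   # symmetric: True
```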

SLIDE 16

Kernelized Linear Regression


  • Prediction:

\hat{y}(x_i, \alpha, \{x_n\}_{n=1}^{N}) = \sum_{n=1}^{N} \alpha_n \, k(x_n, x_i)

  • Training (writing X for the training set \{x_n\}_{n=1}^{N}):

\min_{\alpha} \; \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, \alpha, X) \right)^2

Can do all needed operations with access only to the kernel (no feature vectors needed).
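A minimal sketch of one common way to solve this problem in alpha-space, kernel ridge regression (the specific kernel and the small L2 penalty are assumptions of the sketch, not part of the slide):

```python
import numpy as np

def sq_exp_kernel(x, z):
    # Squared exponential kernel, k(x, z) = exp(-||x - z||^2)
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff))

def fit_kernelized_regression(X_train, y_train, kernel=sq_exp_kernel, l2_penalty=1e-3):
    # Kernel ridge regression: solve (K + lambda I) alpha = y for the N weights alpha
    N = len(X_train)
    K = np.array([[kernel(X_train[i], X_train[j]) for j in range(N)] for i in range(N)])
    return np.linalg.solve(K + l2_penalty * np.eye(N), y_train)

def predict_kernelized_regression(x_new, X_train, alpha, kernel=sq_exp_kernel):
    # Prediction: y_hat(x) = sum_n alpha_n k(x_n, x); only kernel evaluations, no phi
    return sum(a_n * kernel(x_n, x_new) for a_n, x_n in zip(alpha, X_train))
```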

SLIDE 17


Compare: Linear Regression

Prediction: a linear transform of G-dimensional features

\hat{y}(x_i, \theta) = \theta^T \phi(x_i) = \sum_{g=1}^{G} \theta_g \, \phi(x_i)_g

Training: solve the optimization problem

\min_{\theta} \; \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, \theta) \right)^2 \quad (+ \text{ L2 penalty, optional})

SLIDE 18

Why is the kernel trick a good idea?

Before: the training problem optimizes a vector of size G; prediction cost scales linearly with G (the number of high-dimensional features).

After: the training problem optimizes a vector of size N; prediction cost scales linearly with N (the number of training examples) and requires N kernel evaluations.

So we get savings in runtime/storage if G is bigger than N AND we can compute k faster than the explicit inner product.

SLIDE 19

Example: From Features to Kernels


x = [x_1 \;\; x_2], \qquad z = [z_1 \;\; z_2]

\phi(x) = [\,1 \;\; x_1^2 \;\; x_2^2 \;\; \sqrt{2}\,x_1 \;\; \sqrt{2}\,x_2 \;\; \sqrt{2}\,x_1 x_2\,]

k(x, z) = (1 + x_1 z_1 + x_2 z_2)^2

Compare: what is the relationship between k(x, z) and \phi(x)^T \phi(z)?

SLIDE 20


Example: From Features to Kernels

x = [x_1 \;\; x_2], \qquad z = [z_1 \;\; z_2]

\phi(x) = [\,1 \;\; x_1^2 \;\; x_2^2 \;\; \sqrt{2}\,x_1 \;\; \sqrt{2}\,x_2 \;\; \sqrt{2}\,x_1 x_2\,]

k(x, z) = (1 + x_1 z_1 + x_2 z_2)^2
= 1 + x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 z_1 + 2 x_2 z_2 + 2 x_1 x_2 z_1 z_2
= \phi(x)^T \phi(z)

Punchline: we can sometimes find faster ways to compute a high-dimensional inner product.
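A quick numeric check of this identity (a throwaway sketch; the particular numbers are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit 6-dimensional feature map for the degree-2 polynomial kernel
    x1, x2 = x
    return np.array([1.0, x1**2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2, np.sqrt(2)*x1*x2])

def k(x, z):
    # Kernel form: the same value, computed directly in the original 2-D space
    return (1.0 + x[0]*z[0] + x[1]*z[1]) ** 2

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.7])
print(k(x, z), phi(x) @ phi(z))  # identical up to floating-point rounding
```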

SLIDE 21

Cost comparison


x = [x_1 \;\; x_2], \qquad z = [z_1 \;\; z_2]

\phi(x) = [\,1 \;\; x_1^2 \;\; x_2^2 \;\; \sqrt{2}\,x_1 \;\; \sqrt{2}\,x_2 \;\; \sqrt{2}\,x_1 x_2\,]

Compare: the number of add and multiply operations needed to compute k(x, z) = (1 + x_1 z_1 + x_2 z_2)^2 versus the number needed to compute \phi(x)^T \phi(z).

SLIDE 22

Example: Kernel cheaper than inner product


x = [x_1 \;\; x_2], \qquad z = [z_1 \;\; z_2]

\phi(x) = [\,1 \;\; x_1^2 \;\; x_2^2 \;\; \sqrt{2}\,x_1 \;\; \sqrt{2}\,x_2 \;\; \sqrt{2}\,x_1 x_2\,]

Computing \phi(x)^T \phi(z): 6 multiplies and 5 adds (for the 6-dimensional dot product alone).

Computing k(x, z) = (1 + x_1 z_1 + x_2 z_2)^2: 3 multiplies (including the square) and 2 adds.

SLIDE 23

Squared Exponential Kernel


k(x, z) = e^{-(x - z)^2}

Assume x is a scalar. Also called the "radial basis function (RBF)" kernel.

[Plot: k(x, z) as a function of x for fixed z; the maximum is at x = z.]

SLIDE 24

Squared Exponential Kernel


k(x, z) = e^{-(x - z)^2} = e^{-x^2 - z^2 + 2xz} = e^{-x^2} \, e^{-z^2} \, e^{2xz}

Assume x is a scalar

SLIDE 25

Recall: Taylor series for e^x


e^x = \sum_{k=0}^{\infty} \frac{1}{k!} x^k = 1 + x + \frac{1}{2} x^2 + \ldots

e^{2xz} = \sum_{k=0}^{\infty} \frac{2^k}{k!} x^k z^k

SLIDE 26

Squared Exponential Kernel


k(x, z) = e^{-(x - z)^2} = e^{-x^2 - z^2 + 2xz}

= e^{-x^2} \, e^{-z^2} \sum_{k=0}^{\infty} \frac{2^k}{k!} x^k z^k

= \sum_{k=0}^{\infty} \left( \sqrt{\tfrac{2^k}{k!}} \, x^k e^{-x^2} \right) \left( \sqrt{\tfrac{2^k}{k!}} \, z^k e^{-z^2} \right) = \phi(x)^T \phi(z)

\phi(x) = \left[ \; \sqrt{\tfrac{2^0}{0!}} \, x^0 e^{-x^2} \;\; \sqrt{\tfrac{2^1}{1!}} \, x^1 e^{-x^2} \;\; \ldots \;\; \sqrt{\tfrac{2^k}{k!}} \, x^k e^{-x^2} \;\; \ldots \; \right]

Corresponds to an INFINITE-DIMENSIONAL feature vector.
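A quick sketch that truncating this infinite feature vector already reproduces the kernel value numerically (the number of terms and the test points are arbitrary choices):

```python
import numpy as np
from math import factorial

def phi_truncated(x, n_terms=20):
    # First n_terms entries of the infinite feature vector phi(x) above
    return np.array([np.sqrt(2.0**k / factorial(k)) * x**k * np.exp(-x**2)
                     for k in range(n_terms)])

x, z = 0.8, -0.3
print(np.exp(-(x - z)**2))                   # exact kernel value k(x, z)
print(phi_truncated(x) @ phi_truncated(z))   # truncated phi(x)^T phi(z): nearly identical
```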

SLIDE 27

Kernelized Regression Demo

SLIDE 28

Linear Regression

SLIDE 29
Kernel Matrix for the training set

  • K : N x N symmetric matrix

K = \begin{bmatrix}
k(x_1, x_1) & k(x_1, x_2) & \ldots & k(x_1, x_N) \\
k(x_2, x_1) & k(x_2, x_2) & \ldots & k(x_2, x_N) \\
\vdots      &             & \ddots & \vdots      \\
k(x_N, x_1) & k(x_N, x_2) & \ldots & k(x_N, x_N)
\end{bmatrix}
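A minimal sketch of building this matrix, assuming a kernel(x_i, x_j) callable like those sketched earlier:

```python
import numpy as np

def kernel_matrix(X_train, kernel):
    # Build the N x N matrix K with K[i, j] = k(x_i, x_j) over the training set
    N = len(X_train)
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = kernel(X_train[i], X_train[j])
    return K

# For any valid kernel, K should be symmetric: np.allclose(K, K.T) is True
```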

SLIDE 30


Linear Regression with Kernel

100 training examples in x_train; 505 test examples in x_test

SLIDE 31

Linear Regression with Kernel

SLIDE 32

Polynomial Kernel, deg. 5

SLIDE 33

Polynomial Kernel, deg. 12

SLIDE 34

Gaussian kernel (aka sq. exp.)

SLIDE 35

Kernel Regression in sklearn

Demo will use kernel='precomputed'
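The slides do not name the estimator; one scikit-learn option that accepts kernel='precomputed' is KernelRidge. A sketch with synthetic data shaped like the demo's x_train / x_test (the data and hyperparameters are illustrative assumptions):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

# Hypothetical stand-ins for the demo's arrays: shapes (100, 1) and (505, 1)
rng = np.random.RandomState(0)
x_train = rng.uniform(-3, 3, size=(100, 1))
y_train = np.sin(x_train).ravel()
x_test = np.linspace(-3, 3, 505).reshape(-1, 1)

# Precompute kernel matrices: train-vs-train for fitting, test-vs-train for prediction
K_train = rbf_kernel(x_train, x_train)
K_test = rbf_kernel(x_test, x_train)

model = KernelRidge(kernel='precomputed', alpha=0.01)
model.fit(K_train, y_train)
y_pred = model.predict(K_test)   # one prediction per test example
```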

SLIDE 36

Can kernelize any linear model


Regression prediction:

\hat{y}(x_i, \alpha, \{x_n\}_{n=1}^{N}) = \sum_{n=1}^{N} \alpha_n \, k(x_n, x_i)

Logistic Regression prediction:

p(Y_i = 1 \mid x_i) = \sigma\!\left( \hat{y}(x_i, \alpha, X) \right)
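A minimal sketch of the kernelized logistic regression prediction rule (assumes alpha has already been trained and a kernel callable is supplied; names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba_kernel_logreg(x_new, X_train, alpha, kernel):
    # p(y = 1 | x) = sigmoid( sum_n alpha_n k(x_n, x) )
    score = sum(a_n * kernel(x_n, x_new) for a_n, x_n in zip(alpha, X_train))
    return sigmoid(score)
```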

SLIDE 37

Training for kernelized versions of:
  • Linear Regression
  • Logistic Regression


Linear Regression:

\min_{\alpha} \; \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, \alpha, X) \right)^2

Logistic Regression:

\min_{\alpha} \; \sum_{n=1}^{N} \text{log\_loss}\!\left( y_n, \; \sigma(\hat{y}(x_n, \alpha, X)) \right)

SLIDE 38

SVMs: Prediction


\hat{y}(x_i) = w^T x_i + b

Make a binary prediction via a hard threshold: predict +1 if \hat{y}(x_i) \ge 0, and -1 otherwise.

SLIDE 39

SVMs and Kernels: Prediction


\hat{y}(x_i) = \sum_{n=1}^{N} \alpha_n \, k(x_n, x_i)

Make a binary prediction via a hard threshold: predict +1 if \hat{y}(x_i) \ge 0, and -1 otherwise.

Efficient training algorithms use modern quadratic programming to solve the dual of the SVM soft-margin optimization problem.
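One way to see this in practice is scikit-learn's SVC, whose fit() solves the dual problem internally (a sketch with hypothetical toy data, not the slides' dataset):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D toy data: two noisy blobs, just to illustrate the API
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)), rng.normal(loc=1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# SVC with the squared exponential (RBF) kernel
clf = SVC(kernel='rbf', C=1.0, gamma=1.0).fit(X, y)

print(len(clf.support_), "support vectors out of", len(X))  # typically a small fraction
print(clf.dual_coef_)  # the nonzero (signed) alpha values, one per support vector
```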

SLIDE 40

Support vectors are often a small fraction of all examples


[Diagram: max-margin decision boundary, with the nearest positive example x+ and the nearest negative example x− marked.]

SLIDE 41

Support vectors are defined by the non-zero alpha values in the kernel view

SLIDE 42

SVM + Squared Exponential Kernel

SLIDE 43

Kernel Unit Objectives


Big idea: Use kernel functions (similarity functions with special properties) to obtain flexible high-dimensional feature transformations without ever computing the explicit features

  • From linear regression (LR) to kernelized LR
  • What is a kernel function?
  • Basic properties
  • Example: Polynomial kernel
  • Example: Squared Exponential kernel
  • Kernels for classification
  • Logistic Regression
  • SVMs