

SLIDE 1

Introduction to Machine Learning

Non-linear prediction with kernels

  • Prof. Andreas Krause

Learning and Adaptive Systems (las.ethz.ch)

SLIDE 2

Solving non-linear classification tasks

How can we find non-linear classification boundaries? As in regression, we can use non-linear transformations of the feature vectors, followed by linear classification.

SLIDE 3

Avoiding the feature explosion

Representing (multivariate) polynomials of degree k on d features requires O(d^k) dimensions. Example: d = 10,000, k = 2 → need ~100M dimensions. In the following, we will see how we can efficiently and implicitly operate in such high-dimensional feature spaces (i.e., without ever explicitly computing the transformation).

SLIDE 4

The "Kernel Trick"

  • Express the problem such that it only depends on inner products
  • Replace inner products by kernels
  • Example: the Perceptron; we will see further examples later

For the Perceptron, the training objective

$$\hat{\alpha} = \arg\min_{\alpha_{1:n}} \frac{1}{n} \sum_{i=1}^{n} \max\Big\{0,\; -\sum_{j=1}^{n} \alpha_j y_i y_j\, x_i^\top x_j\Big\}$$

becomes, after replacing the inner products $x_i^\top x_j$ by a kernel,

$$\hat{\alpha} = \arg\min_{\alpha_{1:n}} \frac{1}{n} \sum_{i=1}^{n} \max\Big\{0,\; -\sum_{j=1}^{n} \alpha_j y_i y_j\, k(x_i, x_j)\Big\}$$

SLIDE 5

Kernelized Perceptron

Training:
  • Initialize $\alpha_1 = \cdots = \alpha_n = 0$
  • For $t = 1, 2, \ldots$:
      • Pick a data point $(x_i, y_i)$ uniformly at random
      • Predict $\hat{y} = \operatorname{sign}\big(\sum_{j=1}^{n} \alpha_j y_j k(x_j, x_i)\big)$
      • If $\hat{y} \neq y_i$, set $\alpha_i \leftarrow \alpha_i + \eta_t$

Prediction: for a new point $x$, predict

$$\hat{y} = \operatorname{sign}\Big(\sum_{j=1}^{n} \alpha_j y_j k(x_j, x)\Big)$$
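As an illustration, here is a minimal numpy sketch of this training loop (a sketch under assumptions: the RBF kernel, a constant step size, and all helper names are illustrative choices, not fixed by the slides):

```python
import numpy as np

def rbf_kernel(a, b, h=1.0):
    # Gaussian/RBF kernel k(a, b) = exp(-||a - b||^2 / h^2)
    return np.exp(-np.sum((a - b) ** 2) / h ** 2)

def train_kernel_perceptron(X, y, kernel=rbf_kernel, steps=5000, eta=1.0, seed=0):
    """X: (n, d) array; y: (n,) array of labels in {-1, +1}."""
    n = len(y)
    alpha = np.zeros(n)  # one coefficient per training point
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # Gram matrix
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = rng.integers(n)  # pick a data point uniformly at random
        y_hat = np.sign(np.sum(alpha * y * K[:, i]))
        if y_hat != y[i]:  # on a mistake, upweight point i
            alpha[i] += eta
    return alpha

def predict(x, X, y, alpha, kernel=rbf_kernel):
    # sign( sum_j alpha_j y_j k(x_j, x) )
    return np.sign(sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)))
```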

SLIDE 6

Questions

  • What are valid kernels?
  • How can we select a good kernel for our problem?
  • Can we use kernels beyond the perceptron?
  • Kernels work in very high-dimensional spaces. Doesn't this lead to overfitting?

SLIDE 7

Definition: kernel functions

Let $\mathcal{X}$ be the data space. A kernel is a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfying:

1) Symmetry: for any $x, x' \in \mathcal{X}$ it must hold that $k(x, x') = k(x', x)$.
2) Positive semi-definiteness: for any $n$ and any set $S = \{x_1, \ldots, x_n\} \subseteq \mathcal{X}$, the kernel (Gram) matrix

$$K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix}$$

must be positive semi-definite.
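These two conditions can be sanity-checked numerically on a sample of points; a small sketch (the eigenvalue tolerance is an assumption to absorb floating-point error):

```python
import numpy as np

def gram_matrix(kernel, X):
    # K[i, j] = k(x_i, x_j)
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def looks_like_valid_kernel(kernel, X, tol=1e-9):
    K = gram_matrix(kernel, X)
    if not np.allclose(K, K.T):
        return False  # violates symmetry
    # PSD: all eigenvalues of the symmetric Gram matrix must be >= 0
    return np.min(np.linalg.eigvalsh(K)) >= -tol

X = np.linspace(-2, 2, 20)
print(looks_like_valid_kernel(lambda a, b: np.exp(-abs(a - b)), X))    # True (Laplacian)
print(looks_like_valid_kernel(lambda a, b: np.sin(a) * np.cos(b), X))  # False (asymmetric)
```

Note this can only refute validity on the sampled points; positive semi-definiteness must hold for every finite set.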

SLIDE 8

Examples of kernels on $\mathbb{R}^d$

  • Linear kernel: $k(x, x') = x^\top x'$
  • Polynomial kernel: $k(x, x') = (x^\top x' + 1)^d$
  • Gaussian (RBF, squared exponential) kernel: $k(x, x') = \exp(-\|x - x'\|_2^2 / h^2)$
  • Laplacian kernel: $k(x, x') = \exp(-\|x - x'\|_1 / h)$
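For vector inputs, all four are one-liners (a sketch; `x` and `z` are 1-D numpy arrays, and `deg`/`h` are the free parameters):

```python
import numpy as np

linear    = lambda x, z: x @ z
poly      = lambda x, z, deg=2: (x @ z + 1) ** deg
gaussian  = lambda x, z, h=1.0: np.exp(-np.sum((x - z) ** 2) / h ** 2)
laplacian = lambda x, z, h=1.0: np.exp(-np.sum(np.abs(x - z)) / h)
```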

SLIDE 9

Examples of (non-)kernels

  • $k(x, x') = \sin(x)\cos(x')$: not a valid kernel, since it is not symmetric.
  • $k(x, x') = x^\top M x'$: a valid kernel if and only if $M$ is symmetric and positive semi-definite.

SLIDE 10

Effect of kernel on function class

Given a kernel $k$, predictors (for kernelized classification) have the form

$$\hat{y} = \operatorname{sign}\Big(\sum_{j=1}^{n} \alpha_j y_j k(x_j, x)\Big)$$

SLIDE 11

Example: Gaussian kernel

[Figure: functions of the form $f(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x)$ for the Gaussian kernel $k(x, x') = \exp(-\|x - x'\|_2^2 / h^2)$, shown for bandwidths $h = 0.1$ and $h = 0.3$.]

SLIDE 12

Example: Laplace/exponential kernel

[Figure: functions of the form $f(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x)$ for the Laplacian kernel, shown for bandwidths $h = 1$ and $h = 0.3$.]

SLIDE 13

Demo: Effect on decision boundary

SLIDE 14

Kernels beyond $\mathbb{R}^d$

Kernels can be defined on a variety of objects:
  • Sequence kernels
  • Graph kernels
  • Diffusion kernels
  • Kernels on probability distributions
  • ...

SLIDE 15

Example: Graph kernels

Can define a kernel for measuring similarity between graphs by comparing random walks on both graphs (not further defined here). [Borgwardt et al.]

SLIDE 16

Example: Diffusion kernels on graphs

Can measure similarity among nodes in a graph via diffusion kernels (not defined here):

$$K = \exp(-\beta L)$$

where $L$ is the graph Laplacian.

[Figure: example graph with nodes $s_1, \ldots, s_{12}$.]
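For a concrete graph, the diffusion kernel is just a matrix exponential of the Laplacian; a small sketch (the 4-node path graph and β = 0.5 are illustrative, not the slide's 12-node example):

```python
import numpy as np
from scipy.linalg import expm

# Adjacency matrix of a 4-node path graph: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

L = np.diag(A.sum(axis=1)) - A  # graph Laplacian L = D - A
K = expm(-0.5 * L)              # diffusion kernel K = exp(-beta * L), beta = 0.5

# K[i, j] is larger for node pairs connected by many short paths
print(np.round(K, 3))
```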

SLIDE 17

Kernel engineering (composition rules)

Suppose $k_1 : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $k_2 : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ are kernels on data space $\mathcal{X}$. Then the following functions are also valid kernels:

  • $k(x, x') = k_1(x, x') + k_2(x, x')$
  • $k(x, x') = c\, k_1(x, x')$ for $c > 0$
  • $k(x, x') = k_1(x, x')\, k_2(x, x')$
  • $k(x, x') = f(k_1(x, x'))$, where $f$ is a polynomial with positive coefficients or the exponential function
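These rules translate directly into code; for instance, starting from the linear kernel one can rebuild the polynomial kernel and other compositions (a sketch; the constants are illustrative):

```python
import numpy as np

k_lin = lambda x, z: x @ z  # base kernel k1

k_scaled = lambda x, z: 3.0 * k_lin(x, z)                 # c * k1 with c > 0
k_prod   = lambda x, z: k_lin(x, z) * k_lin(x, z)         # product k1 * k2
k_poly   = lambda x, z, deg=2: (k_lin(x, z) + 1) ** deg   # f(k1), positive-coeff. polynomial
k_exp    = lambda x, z: np.exp(k_lin(x, z))               # f(k1) with f = exp
```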

SLIDE 18

Example: ANOVA kernel

SLIDE 19

Example: Modeling pairwise data

May want to use kernels to model pairwise data (users × products; genes × patients; ...).

[Figure: two example payoff surfaces over actions × contexts.]

SLIDE 20

Where are we?

  • We've seen how to kernelize the perceptron
  • Discussed properties of kernels, and seen examples

Next questions:
  • What kind of predictors / decision boundaries do kernel methods entail?
  • Can we use the kernel trick beyond the perceptron?

SLIDE 21

Kernels as similarity functions

Recall the Perceptron (and SVM) classification rule:

$$y = \operatorname{sign}\Big(\sum_{i=1}^{n} \alpha_i y_i k(x_i, x)\Big)$$

Consider the Gaussian kernel

$$k(x, x') = \exp(-\|x - x'\|_2^2 / h^2)$$

Here $k(x_i, x)$ is large exactly when $x$ is close to $x_i$, so the rule classifies $x$ by a weighted vote of nearby training points.

SLIDE 22

Side note: Nearest-neighbor classifiers

For a data point x, predict the majority label of its k nearest neighbors:

$$y = \operatorname{sign}\Big(\sum_{i=1}^{n} y_i\, [x_i \text{ among } k \text{ nearest neighbors of } x]\Big)$$
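A minimal numpy sketch of this rule for labels in {-1, +1} (Euclidean distance and the helper name are assumptions):

```python
import numpy as np

def knn_predict(x, X, y, k=5):
    """Predict the majority label among the k training points nearest to x."""
    dists = np.linalg.norm(X - x, axis=1)  # Euclidean distance to each x_i
    nearest = np.argsort(dists)[:k]        # indices of the k nearest neighbors
    return np.sign(np.sum(y[nearest]))     # majority vote for labels in {-1, +1}
```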

SLIDE 23

Demo: k-NN

SLIDE 24

Nearest-neighbor classifiers

For a data point x, predict the majority label of its k nearest neighbors:

$$y = \operatorname{sign}\Big(\sum_{i=1}^{n} y_i\, [x_i \text{ among } k \text{ nearest neighbors of } x]\Big)$$

How to choose k? Cross-validation!

SLIDE 25

K-NN vs. Kernel Perceptron

k-Nearest Neighbor:

$$y = \operatorname{sign}\Big(\sum_{i=1}^{n} y_i\, [x_i \text{ among } k \text{ nearest neighbors of } x]\Big)$$

Kernel Perceptron:

$$y = \operatorname{sign}\Big(\sum_{i=1}^{n} y_i \alpha_i k(x_i, x)\Big)$$

SLIDE 26

Comparison: k-NN vs Kernelized Perceptron

k-NN
  • Advantages: no training necessary
  • Disadvantages: depends on all data → inefficient

Kernelized Perceptron
  • Advantages: optimized weights can lead to improved performance; can capture "global trends" with suitable kernels; depends only on "wrongly classified" examples
  • Disadvantages: training requires optimization
SLIDE 27

Parametric vs nonparametric learning

  • Parametric models have a finite set of parameters. Example: linear regression, linear Perceptron, ...
  • Nonparametric models grow in complexity with the size of the data:
      • Potentially much more expressive
      • But also more computationally complex – why?
      • Example: kernelized Perceptron, k-NN, ...
  • Kernels provide a principled way of deriving nonparametric models from parametric ones.

SLIDE 28

Where are we?

  • We've seen how to kernelize the perceptron
  • Discussed properties of kernels, and seen examples

Next question:
  • Can we use the kernel trick beyond the perceptron?

SLIDE 29

Kernelized SVM

The support vector machine can also be kernelized:

$$\hat{w} = \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} \max\{0,\, 1 - y_i w^\top x_i\} + \lambda \|w\|_2^2$$

SLIDE 30

How to kernelize the objective?

$$\hat{w} = \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} \max\{0,\, 1 - y_i w^\top x_i\} + \lambda \|w\|_2^2$$

Using the ansatz $w = \sum_j \alpha_j y_j x_j$, the margin term becomes $y_i w^\top x_i = y_i \sum_j \alpha_j y_j\, x_j^\top x_i$, which depends on the data only through inner products.

SLIDE 31

How to kernelize the regularizer?

$$\hat{w} = \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} \max\{0,\, 1 - y_i w^\top x_i\} + \lambda \|w\|_2^2$$

With the same ansatz, $\|w\|_2^2 = \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j = \alpha^\top D_y K D_y \alpha$, where $D_y = \mathrm{diag}(y_1, \ldots, y_n)$.

SLIDE 32

Learning & prediction with kernel classifier

Learning: solve the problem (with $k_i = [y_1 k(x_i, x_1), \ldots, y_n k(x_i, x_n)]$)

Perceptron:

$$\arg\min_{\alpha} \frac{1}{n} \sum_{i=1}^{n} \max\{0,\, -y_i \alpha^\top k_i\}$$

SVM:

$$\arg\min_{\alpha} \frac{1}{n} \sum_{i=1}^{n} \max\{0,\, 1 - y_i \alpha^\top k_i\} + \lambda\, \alpha^\top D_y K D_y \alpha$$

Prediction: for a data point $x$, predict label $y$ as

$$\hat{y} = \operatorname{sign}\Big(\sum_{i=1}^{n} \alpha_i y_i k(x_i, x)\Big)$$
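In practice one rarely implements this optimization by hand. A sketch using scikit-learn's SVC with a precomputed Gram matrix (an illustration of the interface, not the slide's exact objective; regularization is controlled by C rather than λ, and the circular toy data is made up):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)  # non-linear (circular) boundary

def rbf_gram(A, B, h=1.0):
    # K[i, j] = exp(-||A_i - B_j||^2 / h^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / h ** 2)

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(rbf_gram(X, X), y)  # training uses the n x n Gram matrix

X_test = rng.normal(size=(5, 2))
print(clf.predict(rbf_gram(X_test, X)))  # kernel between test and training points
```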

SLIDE 33

Demo: Kernelized SVM

SLIDE 34

Kernelized Linear Regression

From linear to nonlinear regression: linear regression can also be kernelized. The predictor has the form

$$f(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x)$$

[Figure: a linear fit vs. a non-linear kernelized fit to the same data.]

SLIDE 35

Example: Kernelized linear regression

Original (parametric) linear optimization problem:

$$\hat{w} = \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} \big(w^\top x_i - y_i\big)^2 + \lambda \|w\|_2^2$$

As with the perceptron, the optimal $\hat{w}$ lies in the span of the data:

$$\hat{w} = \sum_{i=1}^{n} \alpha_i x_i$$

SLIDE 36

Kernelizing linear regression

Substituting $\hat{w} = \sum_{i=1}^{n} \alpha_i x_i$ into

$$\hat{w} = \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} \big(w^\top x_i - y_i\big)^2 + \lambda \|w\|_2^2$$

expresses both the loss and the regularizer through inner products $x_i^\top x_j$, which can then be replaced by $k(x_i, x_j)$.

SLIDE 37

Kernelized linear regression

This yields the kernelized objective

$$\hat{\alpha} = \arg\min_{\alpha_{1:n}} \frac{1}{n} \sum_{i=1}^{n} \Big(\sum_{j=1}^{n} \alpha_j k(x_i, x_j) - y_i\Big)^2 + \lambda\, \alpha^\top K \alpha$$

where $K$ is the kernel (Gram) matrix, $K_{ij} = k(x_i, x_j)$.

SLIDE 38

Learning & Predicting with KLR

Learning: solve the least-squares problem

$$\hat{\alpha} = \arg\min_{\alpha} \frac{1}{n} \|K\alpha - y\|_2^2 + \lambda\, \alpha^\top K \alpha$$

Closed-form solution:

$$\hat{\alpha} = (K + n\lambda I)^{-1} y$$

Prediction: for a data point $x$, predict the response $y$ as

$$\hat{y} = \sum_{i=1}^{n} \hat{\alpha}_i k(x_i, x)$$
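Putting learning and prediction together, a compact numpy sketch of kernelized (ridge) regression with the Gaussian kernel (bandwidth h, λ, and the toy data are illustrative):

```python
import numpy as np

def rbf_gram(A, B, h=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / h ** 2)

def fit_klr(X, y, lam=0.1, h=0.5):
    n = len(y)
    K = rbf_gram(X, X, h)
    # Closed form: alpha = (K + n*lambda*I)^{-1} y
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict_klr(X_new, X, alpha, h=0.5):
    # f(x) = sum_i alpha_i k(x_i, x)
    return rbf_gram(X_new, X, h) @ alpha

# Example: fit a noisy sine curve
X = np.linspace(0, 2 * np.pi, 50)[:, None]
y = np.sin(X[:, 0]) + 0.1 * np.random.default_rng(0).normal(size=50)
alpha = fit_klr(X, y)
print(predict_klr(X[:5], X, alpha))
```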

SLIDE 39

Demo: Kernelized linear regression

SLIDE 40

KLR for the linear kernel

What if $k(x, x') = x^\top x'$? Then $K = XX^\top$, and kernelized linear regression makes exactly the same predictions as ridge regression, since $X^\top (XX^\top + n\lambda I)^{-1} = (X^\top X + n\lambda I)^{-1} X^\top$.
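This equivalence is easy to confirm numerically (a sketch; the random data and λ = 0.1 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(30, 3)), rng.normal(size=30)
n, lam = 30, 0.1

# Kernelized linear regression with K = X X^T
alpha = np.linalg.solve(X @ X.T + n * lam * np.eye(n), y)
f_klr = X @ (X.T @ alpha)  # predictions on the training points

# Ridge regression solved in weight space
w = np.linalg.solve(X.T @ X + n * lam * np.eye(3), X.T @ y)
f_ridge = X @ w

print(np.allclose(f_klr, f_ridge))  # True: identical predictions
```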

SLIDE 41

Application: semi-parametric regression

Often, parametric models are too "rigid", while nonparametric models fail to extrapolate. Solution: use an additive combination of a linear and a non-linear kernel, e.g.

$$k(x, x') = c_1 \exp(-\|x - x'\|_2^2 / h^2) + c_2\, x^\top x'$$
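With the kernelized-regression sketch from earlier, using this semi-parametric model only requires swapping in the composed Gram matrix (c1, c2, and h are illustrative free parameters):

```python
import numpy as np

def semiparametric_gram(A, B, c1=1.0, c2=1.0, h=0.5):
    # Gaussian part models local non-linear structure;
    # linear part lets the fit extrapolate a global trend.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return c1 * np.exp(-sq / h ** 2) + c2 * (A @ B.T)
```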

SLIDE 42

Demo: Semi-parametric KLR

SLIDE 43

Example

SLIDE 44

Example fits

SLIDE 45

Application: Designing P450s chimeras

[with Phil Romero, Frances Arnold PNAS‘13]

SLIDE 46

Design space

[Figure: parent sequences A, B, C are recombined into candidate designs 1, 2, 3, ..., n.]

SLIDE 47

Protein Fitness Landscape

[Figure: protein fitness (thermostability) landscape over sequence space, with measured sequences marked.]

SLIDE 48

Application: Protein Engineering

[with Romero, Arnold, PNAS ‘13]

SLIDE 49

Wet-lab results

Identification of a new thermostable P450 chimera, 5.3 °C more stable than the best published sequence! [with Romero, Arnold, PNAS '13]

SLIDE 50

Choosing kernels

For a given kernel, how should we choose its parameters? Cross-validation!

How should we select suitable kernels?
  • Domain knowledge (dependent on data type)
  • "Brute force" (or heuristic) search
  • Cross-validation

Learning kernels: much research exists on automatically selecting good kernels (Multiple Kernel Learning; Hyperkernels; etc.).
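As an example of parameter selection, a minimal K-fold cross-validation sketch for choosing the Gaussian bandwidth h in kernelized regression (reuses the hypothetical fit_klr/predict_klr helpers sketched earlier; the candidate grid is arbitrary):

```python
import numpy as np

def cv_bandwidth(X, y, candidates=(0.1, 0.3, 1.0, 3.0), folds=5, lam=0.1):
    idx = np.random.default_rng(0).permutation(len(y))
    best_h, best_err = None, np.inf
    for h in candidates:
        errs = []
        for f in range(folds):
            val = idx[f::folds]          # every folds-th point is validation
            tr = np.setdiff1d(idx, val)  # the rest is training
            alpha = fit_klr(X[tr], y[tr], lam=lam, h=h)
            pred = predict_klr(X[val], X[tr], alpha, h=h)
            errs.append(np.mean((pred - y[val]) ** 2))
        if np.mean(errs) < best_err:     # keep the bandwidth with lowest CV error
            best_h, best_err = h, np.mean(errs)
    return best_h
```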

SLIDE 51

Parameter demo

SLIDE 52

What about overfitting?

Kernels map to (very) high-dimensional spaces. Why do we hope to be able to learn? First attempt at an answer: typically, # parameters << # dimensions. Why? The number of parameters equals the number of data points ("non-parametric learning").

SLIDE 53

What about overfitting?

Kernels map to (very) high-dimensional spaces. Why do we hope to be able to learn? Second attempt at an answer: overfitting can of course happen (if we choose poor parameters). We can combat overfitting by regularization. This is already built into kernelized linear regression (and SVMs), but not into the kernelized Perceptron.

KLR:

$$\hat{\alpha} = \arg\min_{\alpha} \frac{1}{n} \|K\alpha - y\|_2^2 + \lambda\, \alpha^\top K \alpha$$

SVM:

$$\hat{\alpha} = \arg\min_{\alpha} \frac{1}{n} \sum_{i=1}^{n} \max\{0,\, 1 - y_i \alpha^\top k_i\} + \lambda\, \alpha^\top D_y K D_y \alpha$$

SLIDE 54

What you need to know

Kernels are:
  • (efficient, implicit) inner products
  • positive (semi-)definite functions
  • many examples (linear, polynomial, Gaussian/RBF, ...)

The "kernel trick":
  • reformulate the learning algorithm so that inner products appear
  • replace inner products by kernels

Also:
  • k-Nearest Neighbor classifier (and its relation to the Perceptron)
  • how to choose kernels (kernel engineering etc.)
  • applications: kernelized Perceptron / SVM; kernelized linear regression

SLIDE 55

Supervised learning big picture so far

[Diagram: least squares regression and the Perceptron as starting points. Adding an ℓ2-regularizer yields ridge regression and the linear SVM; adding kernels yields kernelized regression, the kernelized SVM, and the kernelized Perceptron (with k-NN as a "special case"). Adding an ℓ1-regularizer instead yields the Lasso and the ℓ1-SVM. The regression and classification branches differ in their loss functions.]

SLIDE 56

Supervised learning summary so far

  • Model/objective: loss function + regularization (squared loss, 0/1 loss, Perceptron loss, hinge loss; L2 norm, L1 norm)
  • Method: exact solution, gradient descent, (mini-batch) SGD, convex programming, ...
  • Model selection: k-fold cross-validation, Monte Carlo CV
  • Representation/features: linear hypotheses; nonlinear hypotheses with nonlinear feature transforms; kernels
  • Evaluation metric: mean squared error, accuracy