BBM406 Fundamentals of Machine Learning, Lecture 17: Kernel Trick for SVMs, Risk and Loss, Support Vector Regression



slide-1
SLIDE 1

BBM406 Fundamentals of Machine Learning

Lecture 17: Kernel Trick for SVMs, Risk and Loss, Support Vector Regression

Aykut Erdem // Hacettepe University // Fall 2019

Photo by Arthur Gretton, CMU Machine Learning Protestors at G20

slide-2
SLIDE 2

Administrative

  • Project progress reports are due soon!

Due: December 22, 2019 (11:59pm). Each group should submit a project progress report by December 22, 2019. The report should be 3-4 pages and should describe the following points as clearly as possible:

  • Problem to be addressed. Give a short description of the problem that you will explore. Explain why you find it interesting.
  • Related work. Briefly review the major works related to your research topic.
  • Methodology to be employed. Describe the neural architecture that is expected to form the basis of the project. State whether you will extend an existing method or you are going to devise your own approach.
  • Experimental evaluation. Briefly explain how you will evaluate your results. State which dataset(s) you will employ in your evaluation. Provide your preliminary results (if any).

2

Deadlines are much closer than they appear

  • On the syllabus
slide-3
SLIDE 3

⟨w, x⟩ + b ≥ 1    ⟨w, x⟩ + b ≤ −1

Theorem (Minsky & Papert)
 Finding the minimum error separating hyperplane is NP hard

minimum error separator is impossible

Last time… Soft-margin Classifier

slide by Alex Smola
slide-4
SLIDE 4

⟨w, x⟩ + b ≤ −1 + ξ

Convex optimization problem

minimize amount of slack

⟨w, x⟩ + b ≥ 1 − ξ

Last time… Adding Slack Variables

slide by Alex Smola

ξi ≥ 0

slide-5
SLIDE 5

⟨w, x⟩ + b ≤ −1 + ξ

Convex optimization problem

minimize amount of slack

⟨w, x⟩ + b ≥ 1 − ξ

Last time… Adding Slack Variables

  • for 0 < ξ ≤ 1 the point is between the margin and correctly classified
  • for ξ > 1 the point is misclassified

ξᵢ ≥ 0

adopted from Andrew Zisserman
slide-6
SLIDE 6
  • Hard margin problem

    minimize_{w,b}  ½‖w‖²  subject to  yᵢ(⟨w, xᵢ⟩ + b) ≥ 1

  • With slack variables

    minimize_{w,b}  ½‖w‖² + C Σᵢ ξᵢ  subject to  yᵢ(⟨w, xᵢ⟩ + b) ≥ 1 − ξᵢ  and  ξᵢ ≥ 0

Problem is always feasible. Proof: take w = 0, b = 0 and ξᵢ = 1 (this also yields an upper bound on the objective).

Last time… Adding Slack Variables

slide by Alex Smola
slide-7
SLIDE 7
  • Optimization problem:



 


    minimize_{w,b}  ½‖w‖² + C Σᵢ ξᵢ  subject to  yᵢ(⟨w, xᵢ⟩ + b) ≥ 1 − ξᵢ  and  ξᵢ ≥ 0

Soft-margin classifier

adopted from Andrew Zisserman

C is a regularization parameter:

  • small C allows constraints to be easily ignored 


→ large margin

  • large C makes constraints hard to ignore


→ narrow margin

  • C = ∞ enforces all constraints: hard margin
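To make the effect of C concrete, here is a minimal sketch of training soft-margin SVMs for several values of C (this assumes scikit-learn is available; the synthetic data and the particular C values are illustrative, not from the slides):

```python
# Illustrative sketch: effect of the regularization parameter C on a soft-margin SVM.
# Assumes scikit-learn and NumPy; the data is synthetic, not from the lecture.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])  # two overlapping blobs
y = np.array([-1] * 50 + [1] * 50)

for C in [0.1, 1, 10, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)        # geometric margin width
    print(f"C={C:>5}: {clf.n_support_.sum()} support vectors, margin width {margin:.2f}")
```

Small C gives a wide margin with many violated constraints; large C approaches the hard-margin behaviour, matching the bullets above.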
slide-8
SLIDE 8

Last time… Multi-class SVM

8

  • Simultaneously learn 3 sets of weights: w₊, w₋, w₀
  • How do we guarantee the correct labels?
  • Need new constraints!

The “score” of the correct class must be better than the “score” of the wrong classes.

slide by Eric Xing
slide-9
SLIDE 9

Last time… Multi-class SVM

9

  • As for the SVM, we introduce slack variables and maximize the margin.

To predict, we use:

Now can we learn it?

slide by Eric Xing
slide-10
SLIDE 10

Last time… Kernels

  • Original data
  • Data in feature space (implicit)
  • Solve in feature space using kernels

10

slide by Alex Smola
slide-11
SLIDE 11

Last time… Quadratic Features

11

Quadratic Features in ℝ²

    Φ(x) := (x₁², √2·x₁x₂, x₂²)

Dot Product

    ⟨Φ(x), Φ(x′)⟩ = ⟨(x₁², √2·x₁x₂, x₂²), (x′₁², √2·x′₁x′₂, x′₂²)⟩ = ⟨x, x′⟩²

Insight

    The trick works for any polynomial of order d via ⟨x, x′⟩ᵈ.

slide by Alex Smola
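A quick numerical check of this identity (a minimal NumPy sketch; the two vectors are arbitrary examples, not from the slides):

```python
# Verify that the explicit quadratic feature map reproduces the squared dot product.
import numpy as np

def phi(x):
    # Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) for x in R^2
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = phi(x) @ phi(xp)   # <Phi(x), Phi(x')>
rhs = (x @ xp) ** 2      # <x, x'>^2
print(lhs, rhs)          # both equal 1.0 for these vectors
```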
slide-12
SLIDE 12

Computational Efficiency

12

Problem
    Extracting features can sometimes be very costly. Example: second order features in 1000 dimensions. This leads to about 5 · 10⁵ numbers. For higher order polynomial features it is much worse.

Solution
    Don’t compute the features; try to compute dot products implicitly. For some features this works . . .

Definition
    A kernel function k : X × X → ℝ is a symmetric function in its arguments for which the following property holds: k(x, x′) = ⟨Φ(x), Φ(x′)⟩ for some feature map Φ. If k(x, x′) is much cheaper to compute than Φ(x) . . .

slide by Alex Smola
slide-13
SLIDE 13

Last time.. Example kernels

13

Examples of kernels k(x, x′):

    Linear              ⟨x, x′⟩
    Laplacian RBF       exp(−λ‖x − x′‖)
    Gaussian RBF        exp(−λ‖x − x′‖²)
    Polynomial          (⟨x, x′⟩ + c)ᵈ,  c ≥ 0, d ∈ ℕ
    B-Spline            B₂ₙ₊₁(x − x′)
    Cond. Expectation   E_c[p(x|c) p(x′|c)]

Simple trick for checking Mercer’s condition: compute the Fourier transform of the kernel and check that it is nonnegative.

slide by Alex Smola
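A few of these kernels written out as plain functions (a minimal NumPy sketch; the hyperparameter values λ, c, d and the test vectors are illustrative):

```python
# Some kernels from the table above, as functions of two vectors.
import numpy as np

def linear(x, xp):
    return x @ xp

def laplacian_rbf(x, xp, lam=1.0):
    return np.exp(-lam * np.linalg.norm(x - xp))

def gaussian_rbf(x, xp, lam=1.0):
    return np.exp(-lam * np.linalg.norm(x - xp) ** 2)

def polynomial(x, xp, c=1.0, d=3):
    return (x @ xp + c) ** d

x, xp = np.array([1.0, 0.5]), np.array([0.2, -1.0])
print(linear(x, xp), laplacian_rbf(x, xp), gaussian_rbf(x, xp), polynomial(x, xp))
```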
slide-14
SLIDE 14

Today

  • The Kernel Trick for SVMs
  • Risk and Loss
  • Support Vector Regression

14

slide-15
SLIDE 15

The Kernel Trick for SVMs

slide by Alex Smola
slide-16
SLIDE 16
  • Linear soft margin problem

    minimize_{w,b}  ½‖w‖² + C Σᵢ ξᵢ  subject to  yᵢ(⟨w, xᵢ⟩ + b) ≥ 1 − ξᵢ  and  ξᵢ ≥ 0

  • Dual problem

    maximize_α  −½ Σᵢ,ⱼ αᵢαⱼ yᵢyⱼ ⟨xᵢ, xⱼ⟩ + Σᵢ αᵢ  subject to  Σᵢ αᵢyᵢ = 0  and  αᵢ ∈ [0, C]

  • Support vector expansion

    f(x) = Σᵢ αᵢyᵢ ⟨xᵢ, x⟩ + b

The Kernel Trick for SVMs

slide by Alex Smola
slide-17
SLIDE 17

  • Linear soft margin problem (in feature space)

    minimize_{w,b}  ½‖w‖² + C Σᵢ ξᵢ  subject to  yᵢ(⟨w, φ(xᵢ)⟩ + b) ≥ 1 − ξᵢ  and  ξᵢ ≥ 0

  • Dual problem

    maximize_α  −½ Σᵢ,ⱼ αᵢαⱼ yᵢyⱼ k(xᵢ, xⱼ) + Σᵢ αᵢ  subject to  Σᵢ αᵢyᵢ = 0  and  αᵢ ∈ [0, C]

  • Support vector expansion

    f(x) = Σᵢ αᵢyᵢ k(xᵢ, x) + b

The Kernel Trick for SVMs

slide by Alex Smola
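The support vector expansion above is exactly what a trained kernel SVM stores. A minimal sketch that reconstructs f(x) from a fitted model (assuming scikit-learn, whose dual_coef_ holds the products αᵢyᵢ; the data and parameters are illustrative):

```python
# Reconstruct f(x) = sum_i alpha_i y_i k(x_i, x) + b from a fitted RBF-kernel SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # nonlinearly separable labels

gamma = 0.5
clf = SVC(kernel="rbf", C=10.0, gamma=gamma).fit(X, y)

def decision(x):
    # k(x_i, x) for every support vector x_i, then the support vector expansion
    k = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return clf.dual_coef_[0] @ k + clf.intercept_[0]

x_test = np.array([0.3, 0.8])
print(decision(x_test), clf.decision_function([x_test])[0])   # the two should match
```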
slide-18
SLIDE 18

C=1

slide by Alex Smola
slide-19
SLIDE 19

C=1

slide by Alex Smola

[Figure: decision boundary y = 0 and margins y = 1, y = −1, with support vectors marked]

slide-20
SLIDE 20

C=2

slide by Alex Smola
slide-21
SLIDE 21

C=5

slide by Alex Smola
slide-22
SLIDE 22

C=10

slide by Alex Smola
slide-23
SLIDE 23

C=20

slide by Alex Smola
slide-24
SLIDE 24

C=50

slide by Alex Smola
slide-25
SLIDE 25

C=100

slide by Alex Smola
slide-26
SLIDE 26

C=1

slide by Alex Smola
slide-27
SLIDE 27

C=2

slide by Alex Smola
slide-28
SLIDE 28

C=5

slide by Alex Smola
slide-29
SLIDE 29

C=10

slide by Alex Smola
slide-30
SLIDE 30

C=20

slide by Alex Smola
slide-31
SLIDE 31

C=50

slide by Alex Smola
slide-32
SLIDE 32

C=100

slide by Alex Smola
slide-33
SLIDE 33

C=1

slide by Alex Smola
slide-34
SLIDE 34

C=2

slide by Alex Smola
slide-35
SLIDE 35

C=5

slide by Alex Smola
slide-36
SLIDE 36

C=10

slide by Alex Smola
slide-37
SLIDE 37

C=20

slide by Alex Smola
slide-38
SLIDE 38

C=50

slide by Alex Smola
slide-39
SLIDE 39

C=100

slide by Alex Smola
slide-40
SLIDE 40

C=1

slide by Alex Smola
slide-41
SLIDE 41

C=2

slide by Alex Smola
slide-42
SLIDE 42

C=5

slide by Alex Smola
slide-43
SLIDE 43

C=10

slide by Alex Smola
slide-44
SLIDE 44

C=20

slide by Alex Smola
slide-45
SLIDE 45

C=50

slide by Alex Smola
slide-46
SLIDE 46

C=100

slide by Alex Smola
slide-47
SLIDE 47

And now with a narrower kernel

slide by Alex Smola
slide-48
SLIDE 48 slide by Alex Smola
slide-49
SLIDE 49 slide by Alex Smola
slide-50
SLIDE 50 slide by Alex Smola
slide-51
SLIDE 51 slide by Alex Smola
slide-52
SLIDE 52

And now with a very wide kernel

slide by Alex Smola
slide-53
SLIDE 53 slide by Alex Smola
slide-54
SLIDE 54
  • Increasing C allows for more nonlinearities
  • Decreases number of errors
  • SV boundary need not be contiguous
  • Kernel width adjusts function class

Nonlinear Separation

slide by Alex Smola
slide-55
SLIDE 55

Overfitting?

  • Huge feature space with kernels: should we worry about overfitting?
  • SVM objective seeks a solution with large margin
  • Theory says that large margin leads to good generalization (we will see this in a couple of lectures)
  • But everything overfits sometimes!!!
  • Can control by (a cross-validation sketch follows below):
  • Setting C
  • Choosing a better kernel
  • Varying parameters of the kernel (width of Gaussian, etc.)

55

slide by Alex Smola
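In practice, C and the kernel width are usually picked by cross-validation. A minimal sketch of that tuning loop (assuming scikit-learn; the grid values and synthetic data are illustrative):

```python
# Cross-validated search over C and the RBF kernel width (gamma).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)      # XOR-like synthetic labels

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```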
slide-56
SLIDE 56

Risk and Loss

56

slide by Alex Smola
slide-57
SLIDE 57
  • Constrained quadratic program

    minimize_{w,b}  ½‖w‖² + C Σᵢ ξᵢ  subject to  yᵢ(⟨w, xᵢ⟩ + b) ≥ 1 − ξᵢ  and  ξᵢ ≥ 0

  • Risk minimization setting

    minimize_{w,b}  ½‖w‖² + C Σᵢ max(0, 1 − yᵢ(⟨w, xᵢ⟩ + b))

    The sum is the empirical risk. The equivalence follows from finding the minimal slack variable for a given (w, b) pair.

Loss function point of view

slide by Alex Smola
slide-58
SLIDE 58
  • Soft margin loss: max(0, 1 − y f(x))  (a convex upper bound on the binary loss)
  • Binary loss: 1{y f(x) < 0}

[Figure: both loss functions plotted as a function of the margin y f(x)]

Soft margin as proxy for binary

slide by Alex Smola
slide-59
SLIDE 59
  • Logistic loss:  log(1 + e^(−f(x)))   (asymptotically linear as f(x) → −∞, asymptotically 0 as f(x) → +∞)

  • Huberized loss:
        0                  if f(x) > 1
        ½ (1 − f(x))²      if f(x) ∈ [0, 1]
        ½ − f(x)           if f(x) < 0

  • Soft margin loss:  max(0, 1 − f(x))

More loss functions

slide by Alex Smola
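For concreteness, the three losses on this slide as plain functions of the score f(x) (a minimal NumPy sketch, assuming the label is y = +1; the evaluation grid is illustrative):

```python
# Classification losses as functions of the score f(x), for label y = +1.
import numpy as np

def logistic_loss(f):
    return np.log1p(np.exp(-f))

def soft_margin_loss(f):
    return np.maximum(0.0, 1.0 - f)

def huberized_loss(f):
    # piecewise definition from the slide: 0 above 1, quadratic on [0, 1], linear below 0
    return np.where(f > 1, 0.0,
           np.where(f >= 0, 0.5 * (1.0 - f) ** 2, 0.5 - f))

f = np.linspace(-2, 2, 9)
print(np.c_[f, logistic_loss(f), huberized_loss(f), soft_margin_loss(f)])
```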
slide-60
SLIDE 60

Risk minimization view

  • Find function f minimizing the classification error

    R[f] := E_{x,y∼p(x,y)} [1{y f(x) < 0}]

  • Compute the empirical average

    R_emp[f] := (1/m) Σᵢ₌₁ᵐ 1{yᵢ f(xᵢ) < 0}

    − Minimization is nonconvex
    − Overfitting as we minimize the empirical error

  • Compute a convex upper bound on the loss
  • Add regularization for capacity control

    R_reg[f] := (1/m) Σᵢ₌₁ᵐ max(0, 1 − yᵢ f(xᵢ)) + λΩ[f]

    (λΩ[f] is the regularization term; λ controls how strongly it is enforced)



 


slide by Alex Smola
slide-61
SLIDE 61

Support Vector 
 Regression

61

slide-62
SLIDE 62

Regression Estimation

  • Find function f minimizing the regression error

    R[f] := E_{x,y∼p(x,y)} [l(y, f(x))]

  • Compute the empirical average

    R_emp[f] := (1/m) Σᵢ₌₁ᵐ l(yᵢ, f(xᵢ))

    Overfitting as we minimize the empirical error

  • Add regularization for capacity control

    R_reg[f] := (1/m) Σᵢ₌₁ᵐ l(yᵢ, f(xᵢ)) + λΩ[f]

62

slide by Alex Smola
slide-63
SLIDE 63

Squared loss

63

l(y, f(x)) = 1 2(y − f(x))2

slide by Alex Smola
slide-64
SLIDE 64

l1 loss

64

l(y, f(x)) = |y − f(x)|

slide by Alex Smola
slide-65
SLIDE 65

ε-insensitive Loss

65

l(y, f(x)) = max(0, |y − f(x)| − ε)

slide by Alex Smola

allow some deviation without a penalty
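As a small sketch of this loss (NumPy; the targets, predictions, and ε value are illustrative):

```python
# epsilon-insensitive loss: zero inside the tube, linear outside.
import numpy as np

def eps_insensitive(y, f, eps=0.1):
    return np.maximum(0.0, np.abs(y - f) - eps)

y, f = np.array([1.0, 1.0, 1.0]), np.array([1.05, 1.3, 0.6])
print(eps_insensitive(y, f))   # [0.  0.2  0.3]
```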

slide-66
SLIDE 66

Penalized least mean squares

  • Optimization problem

    minimize_w  (1/2m) Σᵢ₌₁ᵐ (yᵢ − ⟨xᵢ, w⟩)² + (λ/2)‖w‖²

  • Solution

    ∂_w [. . .] = (1/m) Σᵢ₌₁ᵐ [xᵢxᵢᵀ w − xᵢyᵢ] + λw = ((1/m) XXᵀ + λ·1) w − (1/m) Xy = 0

    hence  w = (XXᵀ + λm·1)⁻¹ Xy

    (XXᵀ is the outer product matrix in X; solve e.g. with Conjugate Gradient or Sherman-Morrison-Woodbury)

66

slide by Alex Smola
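The closed-form solution translates directly into a few lines of NumPy (a sketch; following the slide's convention, X stores one example per column, and the data and λ are illustrative):

```python
# Penalized least mean squares: w = (X X^T + lambda*m*I)^{-1} X y.
# Convention as on the slide: X is d x m, one example per column.
import numpy as np

rng = np.random.RandomState(0)
d, m, lam = 3, 50, 0.1
X = rng.randn(d, m)
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true + 0.01 * rng.randn(m)

w = np.linalg.solve(X @ X.T + lam * m * np.eye(d), X @ y)
print(w)   # close to w_true for small noise and small lambda
```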
slide-67
SLIDE 67

Penalized least mean squares ... now with kernels

  • Optimization problem

    minimize_w  (1/2m) Σᵢ₌₁ᵐ (yᵢ − ⟨φ(xᵢ), w⟩)² + (λ/2)‖w‖²

  • Representer Theorem (Kimeldorf & Wahba, 1971)

    Decompose w = w∥ + w⊥ into a part in the span of the data and an orthogonal part; then ‖w‖² = ‖w∥‖² + ‖w⊥‖², and the empirical risk depends only on w∥.

67

slide by Alex Smola
slide-68
SLIDE 68
  • Optimization problem

    minimize_w  (1/2m) Σᵢ₌₁ᵐ (yᵢ − ⟨φ(xᵢ), w⟩)² + (λ/2)‖w‖²

  • Representer Theorem (Kimeldorf & Wahba, 1971): the optimal solution is in the span of the data

    w = Σᵢ αᵢ φ(xᵢ)

  • Proof: the risk term depends on the data only via φ(xᵢ); regularization ensures that the orthogonal part is 0
  • Optimization problem in terms of α

    minimize_α  (1/2m) Σᵢ₌₁ᵐ (yᵢ − Σⱼ Kᵢⱼαⱼ)² + (λ/2) Σᵢ,ⱼ αᵢαⱼKᵢⱼ

    Solve for α as a linear system:  α = (K + mλ·1)⁻¹ y

Penalized least mean squares ... now with kernels
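The kernelized solution α = (K + mλ·1)⁻¹ y is equally short in NumPy (a sketch with a Gaussian RBF kernel; the kernel width, regularization value, and sinc data are illustrative):

```python
# Kernel ridge regression: alpha = (K + m*lambda*I)^{-1} y, f(x) = sum_i alpha_i k(x_i, x).
import numpy as np

def rbf_kernel(A, B, width=10.0):
    # pairwise squared distances, then Gaussian RBF
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-width * d2)

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)
y = np.sinc(X[:, 0]) + 0.05 * rng.randn(40)

m, reg = len(X), 1e-3
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + m * reg * np.eye(m), y)

X_test = np.linspace(-3, 3, 5)[:, None]
print(rbf_kernel(X_test, X) @ alpha)   # predictions at new points
```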

slide-69
SLIDE 69
(Same content as Slide 68, repeated.)

slide-70
SLIDE 70

70

[Figure: regression fit with an ε-insensitive tube (+ε / −ε); points outside the tube incur slack ξ; right panel: the loss as a function of y − f(x)]

don’t care about deviations within the tube

slide by Alex Smola

SVM Regression (ϵ-insensitive loss)

slide-71
SLIDE 71
  • Optimization problem (as a constrained QP)

    minimize_{w,b}  ½‖w‖² + C Σᵢ₌₁ᵐ [ξᵢ + ξᵢ*]
    subject to  ⟨w, xᵢ⟩ + b ≤ yᵢ + ε + ξᵢ   and  ξᵢ ≥ 0
                ⟨w, xᵢ⟩ + b ≥ yᵢ − ε − ξᵢ*  and  ξᵢ* ≥ 0

  • Lagrange function

    L = ½‖w‖² + C Σᵢ₌₁ᵐ [ξᵢ + ξᵢ*] − Σᵢ₌₁ᵐ [ηᵢξᵢ + ηᵢ*ξᵢ*]
        + Σᵢ₌₁ᵐ αᵢ [⟨w, xᵢ⟩ + b − yᵢ − ε − ξᵢ] + Σᵢ₌₁ᵐ αᵢ* [yᵢ − ε − ξᵢ* − ⟨w, xᵢ⟩ − b]

71

slide by Alex Smola

SVM Regression (ϵ-insensitive loss)

slide-72
SLIDE 72
  • First order conditions

    ∂_w L = 0 = w + Σᵢ [αᵢ − αᵢ*] xᵢ
    ∂_b L = 0 = Σᵢ [αᵢ − αᵢ*]
    ∂_ξᵢ L = 0 = C − ηᵢ − αᵢ
    ∂_ξᵢ* L = 0 = C − ηᵢ* − αᵢ*

  • Dual problem

    minimize_{α,α*}  ½ (α − α*)ᵀ K (α − α*) + ε · 1ᵀ(α + α*) + yᵀ(α − α*)
    subject to  1ᵀ(α − α*) = 0  and  αᵢ, αᵢ* ∈ [0, C]

72

slide by Alex Smola

SVM Regression (ϵ-insensitive loss)
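To tie the pieces together, a minimal sketch of ε-insensitive support vector regression on noisy sinc data, echoing the examples on the following slides (assuming scikit-learn; the parameter values are illustrative):

```python
# epsilon-insensitive SVR on noisy sinc data; a wider tube leaves fewer support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = np.sinc(X[:, 0]) + 0.1 * rng.randn(100)

for eps in [0.1, 0.2, 0.5]:
    svr = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: {len(svr.support_)} support vectors")
```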

slide-73
SLIDE 73

Properties

  • Ignores ‘typical’ instances with small error
  • Only the upper or the lower bound is active at any time
  • QP in 2n variables, as cheap as the SVM problem
  • Robustness with respect to outliers
  • l1 loss yields the same problem without epsilon
  • Huber’s robust loss yields a similar problem, but with an added quadratic penalty on the coefficients

73

slide by Alex Smola
slide-74
SLIDE 74

Regression example

74

sinc x + 0.1 sinc x - 0.1 approximation

slide by Alex Smola
slide-75
SLIDE 75

Regression example

75

sinc x + 0.2 sinc x - 0.2 approximation

slide by Alex Smola
slide-76
SLIDE 76

Regression example

76

sinc x + 0.5 sinc x - 0.5 approximation

slide by Alex Smola
slide-77
SLIDE 77

Regression example

77

Support Vectors

slide by Alex Smola
slide-78
SLIDE 78

Huber’s robust loss

78

l(y, f(x)) =  ½ (y − f(x))²     if |y − f(x)| < 1
              |y − f(x)| − ½    otherwise

quadratic near zero, linear in the tails (trimmed mean estimator)

slide by Alex Smola
slide-79
SLIDE 79

Summary

  • Advantages:
  • Kernels allow very flexible hypotheses
  • Poly-time exact optimization methods rather than approximate methods
  • Soft-margin extension permits mis-classified examples
  • Variable-sized hypothesis space
  • Excellent results (1.1% error rate on handwritten digits vs. LeNet’s 0.9%)

  • Disadvantages:
  • Must choose kernel parameters
  • Very large problems computationally intractable
  • Batch algorithm

79

slide by Sanja Fidler
slide-80
SLIDE 80

Software

  • SVMlight: one of the most widely used SVM packages. Fast optimization, can handle very large datasets, C++ code.
  • LIBSVM
  • Both of these handle multi-class, weighted SVM for unbalanced data, etc.
  • There are several new approaches to solving the SVM objective that can be much faster:
  • Stochastic subgradient method (discussed in a few lectures)
  • Distributed computation (also to be discussed)
  • See http://mloss.org, “machine learning open source software”

80

slide by Alex Smola
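For quick experiments, scikit-learn's SVC and SVR estimators are built on LIBSVM, so the models in this lecture can be reproduced in a few lines (a sketch; the synthetic dataset and parameters are illustrative):

```python
# Quick start with the LIBSVM-backed estimators in scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = np.where(np.linalg.norm(X, axis=1) > 1.2, 1, -1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```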
slide-81
SLIDE 81

Next Lecture:

Decision Trees

81