

SLIDE 1

CS 6316 Machine Learning

Support Vector Machines and Kernel Methods

Yangfeng Ji

Department of Computer Science, University of Virginia

SLIDE 2

About Online Lectures


SLIDE 4

Course Information Update

◮ Record the lectures and upload the videos on Collab
◮ By default, turn off the video and mute yourself
◮ If you have a question
  ◮ Unmute yourself and chime in anytime
  ◮ Use the raise hand feature
  ◮ Send me a private message
◮ Slack: a stable communication channel to
  ◮ send out instant messages if my network connection is unreliable
  ◮ hold online discussions


SLIDE 7

Course Information Update

◮ Homework
  ◮ Subject to change
◮ Final project
  ◮ I will send out my feedback later this week
  ◮ Continue your collaboration with your teammates
  ◮ Presentation: record a presentation video and share it with me
◮ Office hour
  ◮ Wednesday 11 AM: I will be on Zoom
  ◮ You can also send me an email or Slack message anytime

SLIDE 8

Separable Cases

SLIDE 9

Geometric Margin

The geometric margin of a linear binary classifier h(x) = ⟨w, x⟩ + b at a point x is its distance to the hyper-plane ⟨w, x⟩ + b = 0:

  ρ_h(x) = |⟨w, x⟩ + b| / ‖w‖_2    (1)

SLIDE 10

Geometric Margin (II)

The geometric margin of h(x) for a set of examples T = {x_1, . . . , x_m} is the minimal distance over these examples:

  ρ_h(T) = min_{x′ ∈ T} ρ_h(x′)    (2)

[Mohri et al., 2018, Page 80]
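To make equations 1 and 2 concrete, here is a small NumPy sketch (not from the slides; the hyper-plane and the points are made up) that computes the geometric margin of (w, b) over a set of examples.

```python
import numpy as np

def geometric_margin(w, b, X):
    """Minimal distance from the rows of X to the hyper-plane <w, x> + b = 0."""
    distances = np.abs(X @ w + b) / np.linalg.norm(w)  # equation (1), one value per point
    return distances.min()                             # equation (2), minimum over the set

# made-up hyper-plane and points
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[1.0, 0.0], [2.0, 1.0], [0.0, -1.0]])
print(geometric_margin(w, b, X))
```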

SLIDE 11

Half-Space Hypothesis Space

◮ Training set S = {(x_1, y_1), . . . , (x_m, y_m)} with x_i ∈ ℝ^d and y_i ∈ {+1, −1}
◮ If the training set is linearly separable

  y_i(⟨w, x_i⟩ + b) > 0, ∀i ∈ [m]    (3)

◮ Linearly separable cases
  ◮ A solution (w, b) of equation 3 exists
  ◮ All halfspace predictors that satisfy the condition in equation 3 are ERM hypotheses


SLIDE 13

Which Hypothesis is Better?

◮ Intuitively, a hypothesis with a larger margin is better, because it is more robust to noise
◮ The final definition of margin will be provided later

[Shalev-Shwartz and Ben-David, 2014, Page 203]


SLIDE 16

Hard SVM/Separable Cases

The mathematical formulation of the previous idea:

  ρ = max_{(w,b)} min_{i∈[m]} |⟨w, x_i⟩ + b| / ‖w‖_2    (4)
  s.t. y_i(⟨w, x_i⟩ + b) > 0, ∀i    (5)

◮ y_i(⟨w, x_i⟩ + b) > 0, ∀i: guarantees that (w, b) is an ERM hypothesis
◮ min_{i∈[m]}: calculates the margin between a hyper-plane and a set of examples
◮ max_{(w,b)}: maximizes the margin

SLIDE 17

Illustration

Original form:

  ρ = max_{(w,b)} min_{i∈[m]} |⟨w, x_i⟩ + b| / ‖w‖_2    (6)
  s.t. y_i(⟨w, x_i⟩ + b) > 0, ∀i    (7)


SLIDE 20

Alternative Forms

◮ Original form

  ρ = max_{(w,b)} min_{i∈[m]} |⟨w, x_i⟩ + b| / ‖w‖_2    (8)
  s.t. y_i(⟨w, x_i⟩ + b) > 0, ∀i    (9)

◮ Alternative form 1

  ρ = max_{(w,b)} min_{i∈[m]} y_i(⟨w, x_i⟩ + b) / ‖w‖_2    (10)

◮ Alternative form 2

  ρ = max_{(w,b): min_{i∈[m]} y_i(⟨w, x_i⟩ + b) = 1} 1 / ‖w‖_2    (11)
    = max_{(w,b): y_i(⟨w, x_i⟩ + b) ≥ 1} 1 / ‖w‖_2    (12)


SLIDE 22

Alternative Forms (II)

◮ Alternative form 2

  ρ = max_{(w,b): y_i(⟨w, x_i⟩ + b) ≥ 1} 1 / ‖w‖_2    (13)

◮ Alternative form 3: quadratic programming (QP)

  min_{(w,b)} (1/2)‖w‖_2²
  s.t. y_i(⟨w, x_i⟩ + b) ≥ 1, ∀i ∈ [m]    (14)

  which is a constrained optimization problem that can be solved by standard QP packages

◮ Exercise: solve an SVM problem with quadratic programming (see the sketch below)
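As one possible take on the exercise, the sketch below hands the QP in equation 14 to cvxpy; the toy training set is made up and assumed to be linearly separable, and cvxpy is just one of many standard solvers.

```python
import cvxpy as cp
import numpy as np

# made-up, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# equation (14): minimize (1/2)||w||^2 subject to y_i(<w, x_i> + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```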

SLIDE 23

Unconstrained Optimization Problem

The quadratic programming problem with constraints can be converted to an unconstrained optimization problem with the Lagrangian method:

  L(w, b, α) = (1/2)‖w‖_2² − Σ_{i=1}^m α_i (y_i(⟨w, x_i⟩ + b) − 1)    (15)

where
◮ α = {α_1, . . . , α_m} are the Lagrange multipliers, and
◮ α_i ≥ 0 is associated with the i-th training example

SLIDE 24

Constrained Optimization Problems


SLIDE 26

Constrained Optimization Problems: Definition

Assume
◮ X ⊆ ℝ^d and
◮ f, g_i : X → ℝ, ∀i ∈ [m]

Then, a constrained optimization problem is defined in the form of

  min_{x∈X} f(x)    (16)
  s.t. g_i(x) ≤ 0, ∀i ∈ [m]    (17)

Comments
◮ In the general definition, x is the target variable for optimization
◮ Special cases of g_i(x): (1) g_i(x) = 0, (2) g_i(x) ≥ 0, and (3) g_i(x) ≤ b

SLIDE 27

Lagrangian

The Lagrangian associated to the general constrained optimization problem defined in equations 16 – 17 is the function defined over X × ℝ_+^m as

  L(x, α) = f(x) + Σ_{i=1}^m α_i g_i(x)    (18)

where
◮ α = (α_1, . . . , α_m) ∈ ℝ_+^m
◮ α_i ≥ 0 for any i ∈ [m]

SLIDE 28

Karush-Kuhn-Tucker’s Theorem

Assume that f, g_i : X → ℝ, ∀i ∈ [m] are convex and differentiable and that the constraints are qualified. Then x′ is a solution of the constrained problem if and only if there exists α′ ≥ 0 such that

  ∇_x L(x′, α′) = ∇_x f(x′) + α′ · ∇_x g(x′) = 0    (19)
  ∇_α L(x′, α′) = g(x′) ≤ 0    (20)
  α′ · g(x′) = Σ_{i=1}^m α′_i g_i(x′) = 0    (21)

Equations 19 – 21 are called the KKT conditions [Mohri et al., 2018, Thm B.30]


SLIDE 31

KKT in SVM

Apply the KKT conditions to the SVM problem

  L(w, b, α) = (1/2)‖w‖_2² − Σ_{i=1}^m α_i (y_i(⟨w, x_i⟩ + b) − 1)    (22)

We have

  ∇_w L = w − Σ_{i=1}^m α_i y_i x_i = 0  ⇒  w = Σ_{i=1}^m α_i y_i x_i

  ∇_b L = −Σ_{i=1}^m α_i y_i = 0  ⇒  Σ_{i=1}^m α_i y_i = 0

  ∀i, α_i (y_i(⟨w, x_i⟩ + b) − 1) = 0  ⇒  α_i = 0 or y_i(⟨w, x_i⟩ + b) = 1


SLIDE 34

Support Vectors

Consider the implication of the last equation on the previous page: ∀i,

◮ α_i > 0 and y_i(⟨w, x_i⟩ + b) = 1, or
◮ α_i = 0 and y_i(⟨w, x_i⟩ + b) ≥ 1

  w = Σ_{i=1}^m α_i y_i x_i    (23)

◮ Examples with α_i > 0 are called support vectors
◮ In ℝ^d, d + 1 examples are sufficient to define a hyper-plane

SLIDE 35

Non-separable Cases


SLIDE 37

Non-separable Cases

Recall the separable case:

  min_{(w,b)} (1/2)‖w‖_2²
  s.t. y_i(⟨w, x_i⟩ + b) ≥ 1, ∀i ∈ [m]    (24)

For non-separable cases, there always exists an x_i such that

  y_i(⟨w, x_i⟩ + b) < 1    (25)

Or, we can formulate it as

  y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i    (26)

with ξ_i ≥ 0

SLIDE 38

Geometric Meaning of ξ_i

Consider the relaxed constraint

  y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i    (27)

and three cases of ξ_i:
◮ ξ_i = 0: the original margin constraint is satisfied
◮ 0 < ξ_i < 1: x_i falls inside the margin but is still correctly classified
◮ ξ_i ≥ 1: x_i is misclassified

SLIDE 39

Non-separable Cases (II)

In general, the SVM problem for non-separable cases can be formulated as

  min_{(w,b)} (1/2)‖w‖_2² + C Σ_{i=1}^m ξ_i^p
  s.t. y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i, ∀i ∈ [m]
       ξ_i ≥ 0    (28)

where C ≥ 0, p ≥ 1, and {ξ_i}_{i=1}^m ≥ 0 are known as slack variables and are commonly used in optimization to define relaxed versions of constraints.
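A hedged sketch of how the relaxed problem in equation 28 (with p = 1) could be written down directly with cvxpy; the data, the value of C, and the use of cvxpy are illustrative assumptions, not part of the slides.

```python
import cvxpy as cp
import numpy as np

# made-up toy data that is NOT linearly separable
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0], [0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
m, d = X.shape
C = 1.0  # trade-off between margin size and total slack

w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(m)  # slack variables, one per example

# equation (28) with p = 1
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value, "slack =", xi.value)
```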


SLIDE 41

Lagrangian

Following the same procedure as in the separable case, the Lagrangian is defined as

  L(w, b, ξ, α, β) = (1/2)‖w‖_2² + C Σ_{i=1}^m ξ_i − Σ_{i=1}^m α_i (y_i(⟨w, x_i⟩ + b) − 1 + ξ_i) − Σ_{i=1}^m β_i ξ_i    (29)

with α_i, β_i ≥ 0

Exercise: show the KKT conditions of equation 29


SLIDE 43

Support Vectors

The first two equations in the KKT conditions are similar to the separable case, and the rest are

  α_i + β_i = C    (30)
  α_i = 0 or y_i(⟨w, x_i⟩ + b) = 1 − ξ_i    (31)
  β_i = 0 or ξ_i = 0    (32)

Depending on the value of ξ_i, there are two types of support vectors

◮ ξ_i = 0: β_i ≥ 0 and 0 < α_i ≤ C
  ◮ x_i may lie on the marginal hyper-planes (as in the separable case)
◮ ξ_i > 0: β_i = 0 and α_i = C
  ◮ x_i is an outlier

SLIDE 44

Support Vectors (II)

Two types of support vectors

◮ α_i = C: x_i is an outlier
◮ 0 < α_i < C: x_i lies on the marginal hyper-planes
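One way to see these two types in practice (not from the slides): fit scikit-learn's SVC on made-up, overlapping data and inspect the dual coefficients; SVC stores α_i·y_i for each support vector, so the magnitude recovers α_i. The data and the value of C are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# made-up toy data with one point on the "wrong" side of its class
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0], [1.5, 1.5]])
y = np.array([1, 1, -1, -1, -1])
C = 1.0

clf = SVC(kernel="linear", C=C).fit(X, y)
alphas = np.abs(clf.dual_coef_[0])  # dual_coef_ holds alpha_i * y_i for the support vectors

for idx, a in zip(clf.support_, alphas):
    kind = "outlier (alpha_i = C)" if np.isclose(a, C) else "on a marginal hyper-plane (0 < alpha_i < C)"
    print(f"x_{idx}: alpha_i = {a:.3f} -> {kind}")
```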

SLIDE 45

Dual Optimization Problem


SLIDE 47

Lagrangian

Combine the Lagrangian

  L = (1/2)‖w‖_2² − Σ_{i=1}^m α_i [y_i(⟨w, x_i⟩ + b) − 1]
    = (1/2)‖w‖_2² − Σ_{i=1}^m α_i y_i ⟨w, x_i⟩ − b Σ_{i=1}^m α_i y_i + Σ_{i=1}^m α_i

with some of the KKT conditions

  w = Σ_{i=1}^m α_i y_i x_i    (33)
  Σ_{i=1}^m α_i y_i = 0    (34)

we have ...


SLIDE 49

Dual Problem

  L = (1/2)‖Σ_{i=1}^m α_i y_i x_i‖_2² − Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j ⟨x_i, x_j⟩ − b Σ_{i=1}^m α_i y_i + Σ_{i=1}^m α_i    (35)

Given ‖Σ_{i=1}^m α_i y_i x_i‖_2² = Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j ⟨x_i, x_j⟩, we have

  L = −(1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_{i=1}^m α_i    (36)


SLIDE 52

Dual Problem (II)

The dual optimization problem for SVMs in the separable case is

  max_α Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j ⟨x_i, x_j⟩    (37)
  s.t. α_i ≥ 0    (38)
       Σ_{i=1}^m α_i y_i = 0, ∀i ∈ [m]    (39)

◮ The Lagrange multipliers α are also called dual variables
◮ This is an optimization problem only about α
◮ The dual problem is defined on the inner products ⟨x_i, x_j⟩
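A minimal sketch (not from the slides) of solving the dual problem in equations 37 – 39 with cvxpy; the quadratic term is written as ‖Σ_i α_i y_i x_i‖², the same identity used for equation 36, so the objective stays in an explicitly concave form. The toy data are made up.

```python
import cvxpy as cp
import numpy as np

# made-up, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

Z = y[:, None] * X       # rows are y_i * x_i
alpha = cp.Variable(m)

# equations (37)-(39); sum_{i,j} alpha_i alpha_j y_i y_j <x_i, x_j> = ||Z^T alpha||^2
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Z.T @ alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

print("alpha =", alpha.value)
```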

SLIDE 53

Primal and Dual Problem

◮ Primal problem

  min_{(w,b)} (1/2)‖w‖_2²
  s.t. y_i(⟨w, x_i⟩ + b) ≥ 1, ∀i ∈ [m]    (40)

◮ Dual problem

  max_α Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j ⟨x_i, x_j⟩
  s.t. Σ_{i=1}^m α_i y_i = 0 and α_i ≥ 0, ∀i ∈ [m]    (41)

◮ These two problems are equivalent

[Boyd and Vandenberghe, 2004, Chapter 5]


SLIDE 57

SVM Hypothesis, revisited

Once we solve the dual problem for α, we have the solution of w as

  w = Σ_{i=1}^m α_i y_i x_i    (42)

and the hypothesis h(x) as

  h(x) = sign(⟨w, x⟩ + b)    (43)
       = sign(⟨Σ_{i=1}^m α_i y_i x_i, x⟩ + b)    (44)
       = sign(Σ_{i=1}^m α_i y_i ⟨x_i, x⟩ + b)    (45)

Exercise: prove b = y_i − Σ_{j=1}^m α_j y_j ⟨x_j, x_i⟩ for any x_i with α_i > 0
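To connect equations 42 – 45 to code, the sketch below fits scikit-learn's SVC with a linear kernel and a very large C (a stand-in for the hard-margin SVM) and rebuilds the prediction from the dual coefficients; the data are made up, and SVC is just one convenient solver.

```python
import numpy as np
from sklearn.svm import SVC

# made-up, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])

# a very large C approximates the separable (hard-margin) case
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors, so equation (42) becomes:
w = clf.dual_coef_ @ clf.support_vectors_
b = clf.intercept_
print("w =", w, "b =", b)

# equation (45): prediction from inner products with the support vectors
x_new = np.array([2.5, 2.5])
score = clf.dual_coef_ @ (clf.support_vectors_ @ x_new) + b
print(np.sign(score), clf.predict([x_new]))  # the two should agree
```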

SLIDE 58

Kernel Methods


SLIDE 60

Properties of Inner Product

In the solution of SVMs

  h(x) = sign(Σ_{i=1}^m α_i y_i ⟨x_i, x⟩ + b)
  b = y_i − Σ_{j=1}^m α_j y_j ⟨x_j, x_i⟩    (46)

Extend the capacity of SVMs by replacing the inner product ⟨x_i, x⟩ with a kernel function

  K(x_i, x) = ⟨Φ(x_i), Φ(x)⟩    (47)

where Φ(·) is a nonlinear mapping function.


SLIDE 62

Examples: Polynomial Kernels

For any constant c > 0, a polynomial kernel of degree d ∈ ℕ is the kernel K defined over ℝ^n by

  K(x, x′) = (⟨x, x′⟩ + c)^d, ∀x, x′ ∈ ℝ^n    (48)

Special cases
◮ d = 1: K(x, x′) = ⟨x, x′⟩ + c
◮ d = 2: K(x, x′) = (⟨x, x′⟩ + c)²


SLIDE 65

Examples: Polynomial Kernels (II)

For the special case with d = 2, assume x, x′ ∈ ℝ²

  K(x, x′) = (⟨x, x′⟩ + c)²    (49)
           = (x_1 x′_1 + x_2 x′_2 + c)²    (50)
           = x_1² x′_1² + x_1 x_2 x′_1 x′_2 + c x_1 x′_1 + x_1 x_2 x′_1 x′_2 + x_2² x′_2² + c x_2 x′_2 + c x_1 x′_1 + c x_2 x′_2 + c²    (51)
           = x_1² x′_1² + x_2² x′_2² + 2 x_1 x′_1 x_2 x′_2    (52)
             + 2c x_1 x′_1 + 2c x_2 x′_2 + c²    (53)
           = [x_1², x_2², √2 x_1 x_2, √(2c) x_1, √(2c) x_2, c] · [x′_1², x′_2², √2 x′_1 x′_2, √(2c) x′_1, √(2c) x′_2, c]^T


SLIDE 68

Examples: Polynomial Kernels (III)

Let K(x, x′) = ⟨Φ(x), Φ(x′)⟩, then

  Φ(x) = [x_1², x_2², √2 x_1 x_2, √(2c) x_1, √(2c) x_2, c]    (54)

which maps a 2-D data point x into a 6-D space as Φ(x)

Recall the XOR problem
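A quick numerical check (made-up points, not from the slides) that the feature map in equation 54 reproduces the degree-2 polynomial kernel, i.e. ⟨Φ(x), Φ(x′)⟩ = (⟨x, x′⟩ + c)².

```python
import numpy as np

def phi(x, c):
    """Degree-2 polynomial feature map from equation (54)."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

def poly_kernel(x, xp, c):
    """Degree-2 polynomial kernel from equation (48)."""
    return (x @ xp + c) ** 2

# made-up points; the two printed values should agree up to floating-point error
x, xp, c = np.array([1.0, -2.0]), np.array([0.5, 3.0]), 1.0
print(poly_kernel(x, xp, c), phi(x, c) @ phi(xp, c))
```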


SLIDE 70

Gaussian Kernels

For any constant σ > 0, a Gaussian kernel or radial basis function (RBF) is the kernel K defined over ℝ^d by

  K(x, x′) = exp(−‖x′ − x‖_2² / (2σ²))    (55)

Question: what does Φ(x) look like in this case?


SLIDE 72

SVMs with Kernel Functions

◮ Problem definition

  max_α Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j K(x_i, x_j)
  s.t. α_i ≥ 0 and Σ_{i=1}^m α_i y_i = 0, i ∈ [m]    (56)

◮ Solution: separable case

  h(x) = sign(Σ_{i=1}^m α_i y_i K(x_i, x) + b)    (57)

  with b = y_i − Σ_{j=1}^m α_j y_j K(x_j, x_i) for any x_i with α_i > 0
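As an illustration of equations 56 – 57 (not from the slides), the sketch below fits a degree-2 polynomial-kernel SVM on the XOR problem mentioned earlier; scikit-learn's SVC is an assumed, convenient solver, its coef0 plays the role of c in equation 48, and a large C approximates the separable case.

```python
import numpy as np
from sklearn.svm import SVC

# the XOR problem: not linearly separable in the original 2-D space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1e6).fit(X, y)
print(clf.predict(X))  # expected: [-1  1  1 -1]
```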


SLIDE 75

The Choice of Kernels

◮ The choice of K(x, x′) can be arbitrary, as long as the existence of Φ(·) is guaranteed
◮ For many cases, Φ(·) cannot be found explicitly
◮ Alternatively, we only need to make sure K(x, x′) is positive definite symmetric (PDS)
◮ A kernel K is PDS if for any {x_1, . . . , x_m} the matrix K is symmetric positive semi-definite

  K = [K(x_i, x_j)]_{i,j} ∈ ℝ^{m×m}    (58)

◮ A symmetric positive semi-definite matrix satisfies, for any c ∈ ℝ^m,

  cᵀ K c ≥ 0    (59)

[Mohri et al., 2018, Section 6.1 - 6.2]
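A small sketch (made-up sample points) that builds the Gaussian-kernel Gram matrix from equation 55 and checks the PSD condition of equations 58 – 59 through its eigenvalues.

```python
import numpy as np

def rbf_gram(X, sigma):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)), per equation (55)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

# made-up sample points
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = rbf_gram(X, sigma=1.0)

# equations (58)-(59): a symmetric PSD matrix has no negative eigenvalues (up to numerical error)
print(np.linalg.eigvalsh(K) >= -1e-10)
```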

SLIDE 76

Reference

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of Machine Learning. MIT Press.

Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.