SLIDE 1

Machine Learning: Chenhao Tan

University of Colorado Boulder

LECTURE 10

Slides adapted from Jordan Boyd-Graber, Chris Ketelsen

SLIDE 2

Roadmap

  • Last time: linear SVM formulation when data is linearly separable
  • This time:
    • Introduce duality
    • Make linear SVM work when data is not linearly separable
    • Introduce an efficient algorithm for finding the weights
  • Next time: Kernel trick

SLIDE 3

Overview

  • Duality
  • Slack variables
  • Sequential Minimal Optimization
  • Recap

SLIDE 4

Duality

Outline

  • Duality
  • Slack variables
  • Sequential Minimal Optimization
  • Recap

SLIDE 5

Duality

Binary classification

Given: training examples $S_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{m}$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$

Goal: find a hypothesis function $h : X \to Y$

Linear SVM: learn a linear decision rule of the form $w \cdot x + b$

SLIDE 6

Duality

Optimizing the objective function

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 \qquad (1)$$

subject to $y_i(w \cdot x_i + b) \ge 1, \; i \in [1, m]$

SLIDE 7

Duality

Optimizing Constrained Functions

The Method of Lagrange Multipliers

Constrained problem (Primal problem)

$$\min_{x} f(x) \quad \text{s.t.} \; g_i(x) \ge 0, \; i \in [1, n]$$

Lagrange Multiplier

$$\mathcal{L}(x, \alpha) = f(x) - \sum_{i=1}^{n} \alpha_i g_i(x), \quad \alpha_i \ge 0, \; i \in [1, n]$$

SLIDES 8-9

Duality

Lagrange Multiplier

$p^*$: the optimal value of the primal problem. We claim that

$$p^* = \min_{x} \max_{\alpha} \mathcal{L}(x, \alpha) = \min_{x} \max_{\alpha} \left[ f(x) - \sum_{i=1}^{n} \alpha_i g_i(x) \right]$$

This is because, for each constraint,

$$\max_{\alpha \ge 0} \; -\alpha y = \begin{cases} 0 & \text{if } y \ge 0 \\ +\infty & \text{otherwise} \end{cases}$$

so the inner max returns $f(x)$ whenever $x$ is feasible and $+\infty$ otherwise, and the outer min therefore recovers the constrained optimum.

SLIDES 10-11

Duality

Lagrange Multiplier

What happens if we reverse min and max?

$$\max_{\alpha} \min_{x} \mathcal{L}(x, \alpha) \;\le\; \min_{x} \max_{\alpha} \mathcal{L}(x, \alpha)$$

This inequality is known as weak duality; the left-hand side leads to the dual problem.

SLIDES 12-14

Duality

Primal vs. Dual

Primal problem:

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \; y_i(w \cdot x_i + b) \ge 1, \; i \in [1, m]$$

To derive the dual, replace $w$ and $b$ using the stationarity conditions.

Dual problem:

$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t.} \; \alpha_i \ge 0, \; i \in [1, m], \quad \sum_i \alpha_i y_i = 0$$
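To make the primal-dual relationship concrete, here is a minimal sketch that solves this dual numerically with an off-the-shelf solver (scipy's SLSQP; the toy data is the six-point set used later in the SMO example, and the variable names are mine, not from the lecture):

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: three positive and three negative points (used again in the SMO example)
X = np.array([[-2, 2], [0, 4], [2, 1], [-2, -3], [0, -1], [2, -3]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
m = len(y)
G = np.outer(y, y) * (X @ X.T)         # G_ij = y_i y_j (x_i . x_j)

# Dual: maximize sum(alpha) - 0.5 alpha' G alpha  <=>  minimize the negation
res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(),
               np.zeros(m),
               bounds=[(0, None)] * m,                              # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y},  # sum alpha_i y_i = 0
               method="SLSQP")
alpha = res.x
w = (alpha * y) @ X                    # stationarity, Eq. (3)
sv = np.argmax(alpha)                  # a support vector has a tight margin
b = y[sv] - w @ X[sv]
print(alpha.round(3), w.round(3), round(b, 3))
```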

SLIDES 15-17

Duality

Karush-Kuhn-Tucker (KKT) conditions

Primal and dual feasibility:

$$y_i(w \cdot x_i + b) \ge 1, \quad \alpha_i \ge 0 \qquad (2)$$

Stationarity:

$$w = \sum_{i=1}^{m} \alpha_i y_i x_i, \quad \sum_{i=1}^{m} \alpha_i y_i = 0 \qquad (3)$$

Complementary slackness:

$$\alpha_i = 0 \;\lor\; y_i(w \cdot x_i + b) = 1 \qquad (4)$$

SLIDE 18

Slack variables

Outline

  • Duality
  • Slack variables
  • Sequential Minimal Optimization
  • Recap

SLIDE 19

Slack variables

Old objective function

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 \qquad (5)$$

subject to $y_i(w \cdot x_i + b) \ge 1, \; i \in [1, m]$

SLIDES 20-21

Slack variables

Can SVMs Work Here?

Not every dataset is linearly separable, so no $(w, b)$ can satisfy every constraint

$$y_i(w \cdot x_i + b) \ge 1 \qquad (6)$$

SLIDE 22

Slack variables

Trick: Allow for a few bad apples

SLIDE 23

Slack variables

Old objective function

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 \qquad (7)$$

subject to $y_i(w \cdot x_i + b) \ge 1, \; i \in [1, m]$

SLIDE 24

Slack variables

Relaxing the constraint

$$y_i(w \cdot x_i + b) \ge 1 - \xi_i$$

  • $\xi_i = 0$: at least one full margin on the correct side of the decision boundary
  • $\xi_i = 1/2$: at least one-half margin on the correct side of the decision boundary
  • $\xi_i = 2$: at least one margin on the wrong side of the decision boundary

SLIDES 25-29

Slack variables

New objective function

$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i^p \qquad (8)$$

subject to $y_i(w \cdot x_i + b) \ge 1 - \xi_i \;\wedge\; \xi_i \ge 0, \; i \in [1, m]$

The pieces:

  • $\frac{1}{2}\|w\|^2$: the standard margin
  • $\xi_i$: how wrong a point is (slack variables)
  • $C$: tradeoff between margin and slack variables
  • $p$: how bad wrongness scales

SLIDES 30-34

Slack variables

Aside: Loss Functions

  • Losses measure how bad a mistake is
  • Important for slack as well

[Figure: 0/1 loss, linear hinge, and quadratic hinge plotted as functions of the margin x]

We'll focus on the linear hinge loss, i.e., set p = 1
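To see concretely how these losses scale "wrongness", here is a small sketch that evaluates each of them on the functional margin z = y f(x) (pure NumPy; the function names are mine, not from the lecture):

```python
import numpy as np

def zero_one_loss(z):
    # 1 if misclassified (margin z = y * f(x) is non-positive), else 0
    return (z <= 0).astype(float)

def linear_hinge(z):
    # penalty grows linearly once a point is inside the margin (z < 1); p = 1
    return np.maximum(0.0, 1.0 - z)

def quadratic_hinge(z):
    # same threshold, but wrongness scales quadratically; p = 2
    return np.maximum(0.0, 1.0 - z) ** 2

z = np.array([2.0, 1.0, 0.5, 0.0, -1.0])   # from "satisfies margin" to "very wrong"
print(zero_one_loss(z))    # [0. 0. 0. 1. 1.]
print(linear_hinge(z))     # [0. 0. 0.5 1. 2.]
print(quadratic_hinge(z))  # [0. 0. 0.25 1. 4.]
```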

SLIDE 35

Slack variables

What is the role of C?

$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \qquad (9)$$

subject to $y_i(w \cdot x_i + b) \ge 1 - \xi_i \;\wedge\; \xi_i \ge 0, \; i \in [1, m]$

  • A. C ↑⇒ low bias, low variance
  • B. C ↑⇒ low bias, high variance
  • C. C ↑⇒ high bias, low variance
  • D. C ↑⇒ high bias, high variance
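One way to check your answer empirically: fit the same data with several values of C and watch the weights and the number of support vectors change. A hedged sketch using scikit-learn's SVC, whose linear kernel stands in for the lecture's linear SVM (the dataset is the toy one used later in the SMO example):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[-2, 2], [0, 4], [2, 1], [-2, -3], [0, -1], [2, -3]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # small C -> wide margin, more slack allowed; large C -> hard-margin-like fit
    print(C, clf.coef_, clf.intercept_, clf.n_support_)
```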

SLIDES 36-40

Slack variables

New Lagrangian

$$\mathcal{L}(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \qquad (10)$$
$$- \sum_{i=1}^{m} \alpha_i \left[ y_i(w \cdot x_i + b) - 1 + \xi_i \right] \qquad (11)$$
$$- \sum_{i=1}^{m} \beta_i \xi_i \qquad (12)$$

Taking the gradients ($\nabla_w \mathcal{L}$, $\nabla_b \mathcal{L}$, $\nabla_{\xi_i} \mathcal{L}$) and solving for zero gives us

$$w = \sum_{i=1}^{m} \alpha_i y_i x_i \qquad (13)$$
$$\sum_{i=1}^{m} \alpha_i y_i = 0 \qquad (14)$$
$$\alpha_i + \beta_i = C \qquad (15)$$

SLIDES 41-42

Slack variables

Simplifying the dual objective

Substitute the stationarity conditions

$$w = \sum_{i=1}^{m} \alpha_i y_i x_i, \quad \sum_{i=1}^{m} \alpha_i y_i = 0, \quad \alpha_i + \beta_i = C$$

into the Lagrangian

$$\mathcal{L}(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y_i(w \cdot x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{m} \beta_i \xi_i$$

The $\xi_i$ terms cancel because $C - \alpha_i - \beta_i = 0$, and the $b$ term vanishes because $\sum_i \alpha_i y_i = 0$.

SLIDES 43-44

Slack variables

Dual Problem

$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t.} \; C \ge \alpha_i \ge 0, \; i \in [1, m], \quad \sum_i \alpha_i y_i = 0$$

The only change from the separable case is the box constraint $\alpha_i \le C$.

SLIDES 45-47

Slack variables

Karush-Kuhn-Tucker (KKT) conditions

Primal and dual feasibility:

$$y_i(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad C \ge \alpha_i \ge 0, \quad \beta_i \ge 0 \qquad (16)$$

Stationarity:

$$w = \sum_{i=1}^{m} \alpha_i y_i x_i, \quad \sum_{i=1}^{m} \alpha_i y_i = 0, \quad \alpha_i + \beta_i = C \qquad (17)$$

Complementary slackness:

$$\alpha_i \left[ y_i(w \cdot x_i + b) - 1 + \xi_i \right] = 0, \quad \beta_i \xi_i = 0 \qquad (18)$$

SLIDES 48-50

Slack variables

More on Complementary Slackness

$$\alpha_i \left[ y_i(w \cdot x_i + b) - 1 + \xi_i \right] = 0, \quad \beta_i \xi_i = 0 \qquad (19)$$

  • $x_i$ satisfies the margin, $y_i(w \cdot x_i + b) > 1 \Rightarrow \alpha_i = 0$
  • $x_i$ does not satisfy the margin, $y_i(w \cdot x_i + b) < 1 \Rightarrow \alpha_i = C$
  • $x_i$ is on the margin, $y_i(w \cdot x_i + b) = 1 \Rightarrow 0 \le \alpha_i \le C$
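These three cases can be checked on a fitted model. A hedged sketch with scikit-learn's SVC (its dual_coef_ stores y_i α_i for the support vectors, so points with 0 < α_i < C sit exactly on the margin):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[-2, 2], [0, 4], [2, 1], [-2, -3], [0, -1], [2, -3]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
alpha = np.zeros(len(y))
alpha[clf.support_] = np.abs(clf.dual_coef_.ravel())  # alpha_i = |y_i * alpha_i|

margins = y * (X @ clf.coef_.ravel() + clf.intercept_)
for a, m in zip(alpha, margins):
    # alpha = 0 beyond the margin; alpha = C for violators; in between: on the margin
    print(f"alpha={a:.3f}  y*f(x)={m:.3f}")
```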

SLIDE 51

Sequential Minimal Optimization

Outline

  • Duality
  • Slack variables
  • Sequential Minimal Optimization
  • Recap

SLIDE 52

Sequential Minimal Optimization

Sequential Minimal Optimization

Trivia

  • Invented by John Platt in 1998 at Microsoft Research
  • Called "Minimal" because it solves the smallest possible sub-problems (two α's at a time)

SLIDE 53

Sequential Minimal Optimization

Dual problem

$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t.} \; C \ge \alpha_i \ge 0, \; i \in [1, m], \quad \sum_i \alpha_i y_i = 0$$

SLIDES 54-55

Sequential Minimal Optimization

Brief Interlude: Coordinate Ascent

$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t.} \; C \ge \alpha_i \ge 0, \; i \in [1, m], \quad \sum_i \alpha_i y_i = 0$$

Loop over each training example and change $\alpha_i$ to maximize the above function.

Although coordinate ascent works OK for lots of problems, here we have the constraint $\sum_i \alpha_i y_i = 0$: changing a single $\alpha_i$ while holding the others fixed would break it, so we must move at least two $\alpha$'s at a time.

SLIDE 56

Sequential Minimal Optimization

Outline for SVM Optimization (SMO)

  1. Select two examples i, j
  2. Update αj, αi to maximize the dual objective

SLIDE 57

Sequential Minimal Optimization

Karush-Kuhn-Tucker (KKT) conditions

Primal and dual feasibility:

$$y_i(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad C \ge \alpha_i \ge 0, \quad \beta_i \ge 0 \qquad (20)$$

Stationarity:

$$w = \sum_{i=1}^{m} \alpha_i y_i x_i, \quad \sum_{i=1}^{m} \alpha_i y_i = 0, \quad \alpha_i + \beta_i = C \qquad (21)$$

Complementary slackness:

$$\alpha_i \left[ y_i(w \cdot x_i + b) - 1 + \xi_i \right] = 0, \quad \beta_i \xi_i = 0 \qquad (22)$$

SLIDE 58

Sequential Minimal Optimization

Outline for SVM Optimization (SMO)

Since only the pair $(\alpha_i, \alpha_j)$ changes and $\sum_k \alpha_k y_k = 0$ must keep holding, the update preserves

$$y_i \alpha_i + y_j \alpha_j = y_i \alpha_i^{\text{old}} + y_j \alpha_j^{\text{old}} = \gamma$$
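A short piece of standard SMO algebra, implicit on the slide, shows why this constraint makes the two-variable sub-problem easy:

```latex
y_i \alpha_i + y_j \alpha_j = \gamma
  \;\Rightarrow\;
  \alpha_i = y_i \left( \gamma - y_j \alpha_j \right)
  \qquad (\text{using } y_i^2 = 1)
```

Substituting this into the dual objective leaves a one-variable quadratic in $\alpha_j$; its second derivative is the $\eta$ of Eq. (28) below, and maximizing it gives the closed-form update of Eq. (29).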

SLIDES 59-60

Sequential Minimal Optimization

Step 2: Optimize αj

1. Compute upper (H) and lower (L) bounds that ensure $0 \le \alpha_j \le C$.

If $y_i \ne y_j$:

$$L = \max(0, \alpha_j - \alpha_i) \qquad (23)$$
$$H = \min(C, C + \alpha_j - \alpha_i) \qquad (24)$$

If $y_i = y_j$:

$$L = \max(0, \alpha_i + \alpha_j - C) \qquad (25)$$
$$H = \min(C, \alpha_j + \alpha_i) \qquad (26)$$

This is because the update for $\alpha_i$ depends on $y_i y_j$ (the sign matters). See the helper sketch below.
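In code, the two cases collapse into a tiny helper (a sketch; the name alpha_j_bounds is mine, not from the lecture):

```python
def alpha_j_bounds(a_i, a_j, y_i, y_j, C):
    # Box [L, H] that keeps both updated alphas in [0, C] while the
    # pair constraint y_i*a_i + y_j*a_j = gamma is preserved
    if y_i != y_j:
        return max(0.0, a_j - a_i), min(C, C + a_j - a_i)   # Eqs. (23)-(24)
    return max(0.0, a_i + a_j - C), min(C, a_i + a_j)       # Eqs. (25)-(26)
```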

SLIDE 61

Sequential Minimal Optimization

Step 2: Optimize αj

Compute the errors for i and j:

$$E_k \equiv f(x_k) - y_k \qquad (27)$$

$$\eta = 2\, x_i \cdot x_j - x_i \cdot x_i - x_j \cdot x_j \qquad (28)$$

Then the new value for $\alpha_j$ (clipped to $[L, H]$ from step 1) is

$$\alpha_j^{*} = \alpha_j^{(\text{old})} - \frac{y_j (E_i - E_j)}{\eta} \qquad (29)$$
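As a code sketch of Eqs. (27)-(29), reusing alpha_j_bounds from above (the function and variable names are mine, and clipping to [L, H] is exactly the role the step-1 bounds play):

```python
import numpy as np

def smo_step_alpha_j(X, y, alpha, b, i, j, C):
    f = lambda k: np.sum(alpha * y * (X @ X[k])) + b    # f(x_k) = w . x_k + b
    E_i, E_j = f(i) - y[i], f(j) - y[j]                 # Eq. (27)
    eta = 2 * X[i] @ X[j] - X[i] @ X[i] - X[j] @ X[j]   # Eq. (28), <= 0
    L, H = alpha_j_bounds(alpha[i], alpha[j], y[i], y[j], C)
    return float(np.clip(alpha[j] - y[j] * (E_i - E_j) / eta, L, H))  # Eq. (29)
```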

SLIDES 62-63

Sequential Minimal Optimization

Step 3: Optimize αi

Set $\alpha_i$:

$$\alpha_i^{*} = \alpha_i^{(\text{old})} + y_i y_j \left( \alpha_j^{(\text{old})} - \alpha_j^{*} \right) \qquad (30)$$

This balances out the move that we made for $\alpha_j$.

SLIDE 64

Sequential Minimal Optimization

Overall algorithm

Repeat until the KKT conditions are met:
  Iterate over i = 1, ..., m
    Choose j randomly from the m − 1 other options
    Update αi, αj
Find w, b based on the stationarity conditions

A runnable sketch follows.
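Putting steps 1-3 and the skip rules together, here is a compact sketch of the algorithm as described on these slides. The tolerance-based KKT check, the minimum step size, and the two-sided bias update are conventions of the common "simplified SMO" teaching variant, i.e. assumptions on my part rather than details given in the lecture:

```python
import numpy as np

def smo_simplified(X, y, C=1.0, tol=1e-3, max_passes=10, rng=None):
    """Simplified SMO for the linear soft-margin SVM dual (a sketch)."""
    rng = np.random.default_rng(rng)
    m = len(y)
    alpha, b = np.zeros(m), 0.0
    f = lambda k: np.sum(alpha * y * (X @ X[k])) + b      # f(x_k) = w . x_k + b
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(m):
            E_i = f(i) - y[i]
            # skip i unless it violates the KKT conditions (within tol)
            if not ((y[i] * E_i < -tol and alpha[i] < C) or
                    (y[i] * E_i > tol and alpha[i] > 0)):
                continue
            j = int(rng.choice([k for k in range(m) if k != i]))  # random partner
            E_j = f(j) - y[j]
            a_i_old, a_j_old = alpha[i], alpha[j]
            if y[i] != y[j]:                                      # Eqs. (23)-(24)
                L, H = max(0, a_j_old - a_i_old), min(C, C + a_j_old - a_i_old)
            else:                                                 # Eqs. (25)-(26)
                L, H = max(0, a_i_old + a_j_old - C), min(C, a_i_old + a_j_old)
            if L == H:
                continue
            eta = 2 * X[i] @ X[j] - X[i] @ X[i] - X[j] @ X[j]     # Eq. (28)
            if eta >= 0:                                          # skip (instability)
                continue
            alpha[j] = np.clip(a_j_old - y[j] * (E_i - E_j) / eta, L, H)  # Eq. (29)
            if abs(alpha[j] - a_j_old) < 1e-5:
                continue
            alpha[i] = a_i_old + y[i] * y[j] * (a_j_old - alpha[j])       # Eq. (30)
            # bias: Eq. (36) written from i's and from j's point of view
            b1 = b - E_i - y[i] * (alpha[i] - a_i_old) * (X[i] @ X[i]) \
                         - y[j] * (alpha[j] - a_j_old) * (X[i] @ X[j])
            b2 = b - E_j - y[i] * (alpha[i] - a_i_old) * (X[i] @ X[j]) \
                         - y[j] * (alpha[j] - a_j_old) * (X[j] @ X[j])
            b = b1 if 0 < alpha[i] < C else b2 if 0 < alpha[j] < C else (b1 + b2) / 2
            changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (alpha * y) @ X                                           # Eq. (13)
    return alpha, w, b
```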

SLIDES 65-68

Sequential Minimal Optimization

Iterations / Details

  • What if i doesn't violate the KKT conditions? Skip it!
  • What if η ≥ 0? Skip it! (should not happen except for numerical instability)
  • When do we stop? When we make a full pass through the α's without changing anything

SLIDES 69-71

Sequential Minimal Optimization

SMO Algorithm

[Figure: the six training points, positive and negative]

Positive points: (−2, 2), (0, 4), (2, 1)

Negative points: (−2, −3), (0, −1), (2, −3)

  • Initially, all alphas are zero: α = ⟨0, 0, 0, 0, 0, 0⟩
  • Intercept b is also zero
  • Capacity C = π
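To follow the next slides along numerically, this toy problem plugs straight into the smo_simplified sketch given after Slide 64 (assuming that sketch; C = π as on the slide, though the exact alphas will vary with the random pairing):

```python
import numpy as np

X = np.array([[-2, 2], [0, 4], [2, 1], [-2, -3], [0, -1], [2, -3]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

alpha, w, b = smo_simplified(X, y, C=np.pi, rng=0)  # sketch from Slide 64's notes
print(alpha.round(3), w.round(3), round(b, 3))
```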

SLIDES 72-77

Sequential Minimal Optimization

SMO Optimization for i = 0, j = 4: Predictions and Step

  • Prediction: f(x0) = 0
  • Prediction: f(x4) = 0
  • Error: E0 = f(x0) − y0 = 0 − 1 = −1
  • Error: E4 = 0 − (−1) = +1

$$\eta = 2\, x_0 \cdot x_4 - x_0 \cdot x_0 - x_4 \cdot x_4 = 2 \cdot (-2) - 8 - 1 = -13$$

SLIDES 78-80

Sequential Minimal Optimization

SMO Optimization for i = 0, j = 4: Bounds

Lower and upper bounds for $\alpha_j$ (here $y_0 \ne y_4$):

$$L = \max(0, \alpha_j - \alpha_i) = 0 \qquad (31)$$
$$H = \min(C, C + \alpha_j - \alpha_i) = \pi \qquad (32)$$

SLIDES 81-85

Sequential Minimal Optimization

SMO Optimization for i = 0, j = 4: α update

New value for $\alpha_j$:

$$\alpha_j^{*} = \alpha_j - \frac{y_j (E_i - E_j)}{\eta} = -\frac{2}{\eta} = \frac{2}{13} \qquad (33)$$

New value for $\alpha_i$:

$$\alpha_i^{*} = \alpha_i + y_i y_j \left( \alpha_j^{(\text{old})} - \alpha_j^{*} \right) = \alpha_j^{*} = \frac{2}{13} \qquad (34)$$

SLIDE 86

Sequential Minimal Optimization

Margin

SLIDES 87-90

Sequential Minimal Optimization

Find weight vector and bias

  • Weight vector

$$w = \sum_{i}^{m} \alpha_i y_i x_i = \frac{2}{13} \begin{pmatrix} -2 \\ 2 \end{pmatrix} - \frac{2}{13} \begin{pmatrix} 0 \\ -1 \end{pmatrix} = \begin{pmatrix} -4/13 \\ 6/13 \end{pmatrix} \qquad (35)$$

  • Bias

$$b = b^{(\text{old})} - E_i - y_i (\alpha_i^{*} - \alpha_i^{(\text{old})})\, x_i \cdot x_i - y_j (\alpha_j^{*} - \alpha_j^{(\text{old})})\, x_i \cdot x_j \qquad (36)$$

$$= 1 - \frac{2}{13} \cdot 8 + \frac{2}{13} \cdot (-2) = -0.54 \qquad (37)$$
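A quick numerical check of Eqs. (35)-(37) in plain NumPy (indices follow this example, with the points numbered 0-5):

```python
import numpy as np

X = np.array([[-2, 2], [0, 4], [2, 1], [-2, -3], [0, -1], [2, -3]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

alpha = np.zeros(6)
alpha[0] = alpha[4] = 2 / 13            # the pair just updated (i = 0, j = 4)

w = (alpha * y) @ X                     # Eq. (35): [-4/13, 6/13]
i, j, E_i = 0, 4, -1.0
b = 0.0 - E_i - y[i] * (2 / 13) * (X[i] @ X[i]) \
          - y[j] * (2 / 13) * (X[i] @ X[j])   # Eq. (36): -7/13
print(w.round(4), round(b, 4))          # [-0.3077  0.4615] -0.5385
```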

SLIDES 91-94

Sequential Minimal Optimization

SMO Optimization for i = 2, j = 4

Let's skip the boring stuff:

  • E2 = −1.69
  • E4 = 0.00
  • η = −8
  • $\alpha_4 = \alpha_j^{(\text{old})} - \frac{y_j (E_i - E_j)}{\eta} = 0.15 + \frac{-1.69}{-8} = 0.37$
  • $\alpha_2 = \alpha_i^{(\text{old})} + y_i y_j \left( \alpha_j^{(\text{old})} - \alpha_j \right) = 0 - (0.15 - 0.37) = 0.21$

SLIDE 95

Sequential Minimal Optimization

Margin

SLIDES 96-97

Sequential Minimal Optimization

Weight vector and bias

  • Bias b = −0.12
  • Weight vector

$$w = \sum_{i}^{m} \alpha_i y_i x_i = \begin{pmatrix} 0.12 \\ 0.88 \end{pmatrix} \qquad (38)$$

SLIDE 98

Sequential Minimal Optimization

Another Iteration (i = 0, j = 2)

SLIDE 99

Sequential Minimal Optimization

SMO Algorithm

  • Convenient approach for solving the vanilla, slack, and kernel formulations
  • Convex problem
  • Scalable to large datasets (implemented in scikit-learn)
  • What we didn't do:
    • Check the KKT conditions
    • Randomly choose indices

SLIDE 100

Recap

Outline

  • Duality
  • Slack variables
  • Sequential Minimal Optimization
  • Recap

SLIDE 101

Recap

Recap

  • Duality
  • Slack variables

SLIDE 102

Recap

Recap

  • SMO: optimize the objective function for two data points at a time
  • Convex problem: will converge
  • Relatively fast
  • Gives good performance

SLIDE 103

Recap

Wrapup

  • Adding slack variables doesn't break the SVM problem
  • Very popular algorithm; implementations include:
    • SVMLight (many options)
    • Libsvm / Liblinear (very fast)
    • Weka (friendly)
    • pyml (Python focused, from Colorado)
  • Next up: the kernel trick