

SLIDE 1

Optimization for Machine Learning
Lecture 2: Support Vector Machine Training

S.V.N. (vishy) Vishwanathan
Purdue University
vishy@purdue.edu

July 11, 2012

SLIDE 2

Linear Support Vector Machines

Outline
1. Linear Support Vector Machines
2. Stochastic Optimization
3. Implicit Updates
4. Dual Problem

SLIDE 3

Linear Support Vector Machines

Binary Classification

[Figure, built up over several animation steps: training points from two classes, labeled $y_i = -1$ and $y_i = +1$]

SLIDE 6

Linear Support Vector Machines

Binary Classification

[Figure: separating hyperplane with margin; the two classes are labeled $y_i = -1$ and $y_i = +1$]

The separating hyperplane and the two margin hyperplanes are
$$\{x \mid \langle w, x\rangle + b = 0\}, \qquad \{x \mid \langle w, x\rangle + b = -1\}, \qquad \{x \mid \langle w, x\rangle + b = +1\}.$$
For points $x_1$ and $x_2$ on the two margins,
$$\langle w, x_1\rangle + b = +1, \qquad \langle w, x_2\rangle + b = -1, \qquad \text{so} \quad \langle w, x_1 - x_2\rangle = 2,$$
and therefore the margin width is
$$\left\langle \frac{w}{\|w\|},\, x_1 - x_2 \right\rangle = \frac{2}{\|w\|}.$$
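A quick numerical check of the margin formula (a minimal sketch, not from the slides; the hyperplane values below are made up for illustration):

```python
import numpy as np

# Hypothetical separating hyperplane <w, x> + b = 0 (values chosen for illustration).
w = np.array([3.0, 4.0])
b = -2.0

# Points on the two margin hyperplanes <w, x> + b = +1 and <w, x> + b = -1.
x1 = (1.0 - b) * w / np.dot(w, w)   # satisfies <w, x1> + b = +1
x2 = (-1.0 - b) * w / np.dot(w, w)  # satisfies <w, x2> + b = -1

margin = np.dot(w / np.linalg.norm(w), x1 - x2)
print(margin, 2.0 / np.linalg.norm(w))  # both equal 0.4
```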

SLIDE 7

Linear Support Vector Machines

Optimization Problem
$$\min_{w, b, \xi} \;\; \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i \;\;\text{for all } i, \qquad \xi_i \ge 0.$$

SLIDE 8

Linear Support Vector Machines

Optimization Problem

Eliminating the slacks $\xi_i$ gives the equivalent unconstrained (hinge-loss) form
$$\min_{w, b} \;\; \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m} \max\big(0,\; 1 - y_i(\langle w, x_i\rangle + b)\big).$$

SLIDE 9

Linear Support Vector Machines

Optimization Problem
$$\min_{w, b} \;\; \underbrace{\frac{\lambda}{2}\|w\|^2}_{\lambda\,\Omega(w)} \;+\; \underbrace{\frac{1}{m}\sum_{i=1}^{m} \max\big(0,\; 1 - y_i(\langle w, x_i\rangle + b)\big)}_{R_{\mathrm{emp}}(w)}.$$
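A direct translation of this objective into code (a minimal sketch; the toy data and the value of $\lambda$ below are placeholders, not from the slides):

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """Regularized hinge loss: lam/2 * ||w||^2 + mean_i max(0, 1 - y_i (<w, x_i> + b))."""
    margins = y * (X @ w + b)
    reg = 0.5 * lam * np.dot(w, w)                   # lambda * Omega(w)
    remp = np.mean(np.maximum(0.0, 1.0 - margins))   # R_emp(w)
    return reg + remp

# Toy usage with made-up data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))
print(svm_objective(np.zeros(5), 0.0, X, y, lam=0.1))  # equals 1.0 at w = 0, b = 0
```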

SLIDE 10

Stochastic Optimization

Outline
1. Linear Support Vector Machines
2. Stochastic Optimization
3. Implicit Updates
4. Dual Problem

SLIDE 11

Stochastic Optimization

Stochastic Optimization Algorithms

Optimization Problem (with no bias)
$$\min_{w} \;\; \underbrace{\frac{\lambda}{2}\|w\|^2}_{\Omega(w)} \;+\; \underbrace{\frac{1}{m}\sum_{i=1}^{m} \max\big(0,\; 1 - y_i \langle w, x_i\rangle\big)}_{R_{\mathrm{emp}}(w)}$$

Unconstrained, nonsmooth, convex.

SLIDE 12

Stochastic Optimization

Pegasos: Stochastic Gradient Descent

Require: T
1: $w_0 \leftarrow 0$
2: for $t = 1, \ldots, T$ do
3: &nbsp;&nbsp; $\eta_t \leftarrow \frac{1}{\lambda t}$
4: &nbsp;&nbsp; if $y_t \langle w_t, x_t\rangle < 1$ then
5: &nbsp;&nbsp;&nbsp;&nbsp; $w'_t \leftarrow (1 - \eta_t\lambda)\, w_t + \eta_t y_t x_t$
6: &nbsp;&nbsp; else
7: &nbsp;&nbsp;&nbsp;&nbsp; $w'_t \leftarrow (1 - \eta_t\lambda)\, w_t$
8: &nbsp;&nbsp; end if
9: &nbsp;&nbsp; $w_{t+1} \leftarrow \min\!\left\{1,\; \frac{1/\sqrt{\lambda}}{\|w'_t\|}\right\} w'_t$
10: end for
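The same algorithm as a runnable sketch (my own Python rendering of the pseudocode above; the toy data, $\lambda$, and $T$ are placeholders, and I sample one random example per step):

```python
import numpy as np

def pegasos(X, y, lam, T, seed=0):
    """Pegasos SGD for the bias-free linear SVM: one random example per step,
    step size eta_t = 1/(lambda*t), then projection onto the ball of radius 1/sqrt(lambda)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)                       # pick a random example
        eta = 1.0 / (lam * t)
        if y[i] * (X[i] @ w) < 1:                 # hinge loss is active
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1 - eta * lam) * w
        # Project onto B = {w : ||w|| <= 1/sqrt(lam)}
        norm = np.linalg.norm(w)
        if norm > 1.0 / np.sqrt(lam):
            w *= (1.0 / np.sqrt(lam)) / norm
    return w

# Toy usage with made-up separable data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]))
w = pegasos(X, y, lam=0.1, T=2000)
print(np.mean(np.sign(X @ w) == y))   # training accuracy, should be close to 1
```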

SLIDE 13

Stochastic Optimization

Understanding Pegasos

Objective Function Revisited
$$J(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m} \max\big(0,\; 1 - y_i\langle w, x_i\rangle\big)$$

Subgradient: if $y_t\langle w, x_t\rangle < 1$ then $\partial_w J_t(w) = \lambda w - y_t x_t$, else $\partial_w J_t(w) = \lambda w$.

SLIDE 14

Stochastic Optimization

Understanding Pegasos

Objective Function Revisited
$$J(w) \approx J_t(w) = \frac{\lambda}{2}\|w\|^2 + \max\big(0,\; 1 - y_t\langle w, x_t\rangle\big)$$

Subgradient: if $y_t\langle w, x_t\rangle < 1$ then $\partial_w J_t(w) = \lambda w - y_t x_t$, else $\partial_w J_t(w) = \lambda w$.

SLIDE 16

Stochastic Optimization

Understanding Pegasos

Explicit Update: if $y_t\langle w_t, x_t\rangle < 1$ then
$$w'_t = w_t - \eta_t\, \partial_w J_t(w_t) = (1 - \lambda\eta_t)\, w_t + \eta_t y_t x_t,$$
else
$$w'_t = w_t - \eta_t\, \partial_w J_t(w_t) = (1 - \lambda\eta_t)\, w_t.$$

Projection: project $w'_t$ onto the set $B = \{w \;\text{s.t.}\; \|w\| \le 1/\sqrt{\lambda}\}$.

SLIDE 17

Stochastic Optimization

Motivating Stochastic Gradient Descent

How are the Updates Derived? Minimize the following objective function:
$$w_{t+1} = \operatorname*{argmin}_{w} \;\; \frac{1}{2}\|w - w_t\|^2 + \eta_t\, J_t(w).$$

SLIDE 18

Stochastic Optimization

Motivating Stochastic Gradient Descent

Minimizing this objective gives us
$$w_{t+1} = w_t - \eta_t\, \partial_w J_t(w_{t+1}).$$

SLIDE 20

Stochastic Optimization

Motivating Stochastic Gradient Descent

Approximating $w_{t+1}$ by $w_t$ inside the subgradient gives the explicit update
$$w_{t+1} \approx w_t - \eta_t\, \partial_w J_t(w_t).$$

SLIDE 21

Implicit Updates

Outline
1. Linear Support Vector Machines
2. Stochastic Optimization
3. Implicit Updates
4. Dual Problem

SLIDE 22

Implicit Updates

What if we did not approximate $\partial_w J_t(w_{t+1})$?
$$w_{t+1} = w_t - \eta_t\, \partial_w J_t(w_{t+1})$$

Subgradient: $\partial_w J_t(w) = \lambda w - \gamma y_t x_t$, where
- if $y_t\langle w, x_t\rangle < 1$ then $\gamma = 1$
- if $y_t\langle w, x_t\rangle = 1$ then $\gamma \in [0, 1]$
- if $y_t\langle w, x_t\rangle > 1$ then $\gamma = 0$

SLIDE 24

Implicit Updates

Substituting the subgradient into the update:
$$w_{t+1} = w_t - \eta_t\lambda\, w_{t+1} + \gamma\eta_t y_t x_t$$

SLIDE 25

Implicit Updates

Collecting the $w_{t+1}$ terms:
$$(1 + \eta_t\lambda)\, w_{t+1} = w_t + \gamma\eta_t y_t x_t$$

SLIDE 26

Implicit Updates

$$w_{t+1} = \frac{1}{1 + \eta_t\lambda}\left[w_t + \gamma\eta_t y_t x_t\right]$$

SLIDE 27

Implicit Updates: Case 1

The Implicit Update Condition
$$w_{t+1} = \frac{1}{1 + \eta_t\lambda}\left[w_t + \gamma\eta_t y_t x_t\right]$$

Case 1: suppose $1 + \eta_t\lambda < y_t\langle w_t, x_t\rangle$. Set
$$w_{t+1} = \frac{1}{1 + \eta_t\lambda}\, w_t.$$
Verify $y_t\langle w_{t+1}, x_t\rangle > 1$, which implies that $\gamma = 0$ and the implicit update condition is satisfied.

SLIDE 28

Implicit Updates: Case 2

Case 2: suppose $y_t\langle w_t, x_t\rangle < 1 + \eta_t\lambda - \eta_t\langle x_t, x_t\rangle$. Set
$$w_{t+1} = \frac{1}{1 + \eta_t\lambda}\left[w_t + \eta_t y_t x_t\right].$$
Verify $y_t\langle w_{t+1}, x_t\rangle < 1$, which implies that $\gamma = 1$ and the implicit update condition is satisfied.

SLIDE 29

Implicit Updates: Case 3

Case 3: suppose $1 + \eta_t\lambda - \eta_t\langle x_t, x_t\rangle \le y_t\langle w_t, x_t\rangle \le 1 + \eta_t\lambda$. Set
$$\gamma = \frac{1 + \eta_t\lambda - y_t\langle w_t, x_t\rangle}{\eta_t\langle x_t, x_t\rangle}, \qquad w_{t+1} = \frac{1}{1 + \eta_t\lambda}\left[w_t + \gamma\eta_t y_t x_t\right].$$
Verify $\gamma \in [0, 1]$ and $y_t\langle w_{t+1}, x_t\rangle = 1$.

SLIDE 30

Implicit Updates: Summary

$$w_{t+1} = \frac{1}{1 + \eta_t\lambda}\left[w_t + \gamma\eta_t y_t x_t\right]$$
- If $1 + \eta_t\lambda < y_t\langle w_t, x_t\rangle$ then $\gamma = 0$
- If $1 + \eta_t\lambda - \eta_t\langle x_t, x_t\rangle \le y_t\langle w_t, x_t\rangle \le 1 + \eta_t\lambda$ then $\gamma = \dfrac{1 + \eta_t\lambda - y_t\langle w_t, x_t\rangle}{\eta_t\langle x_t, x_t\rangle}$
- If $y_t\langle w_t, x_t\rangle < 1 + \eta_t\lambda - \eta_t\langle x_t, x_t\rangle$ then $\gamma = 1$

SLIDE 31

Implicit Updates: Summary

$$w_{t+1} = \frac{1}{1 + \eta_t\lambda}\left[w_t + \gamma\eta_t y_t x_t\right], \qquad \gamma = \min\!\left(1,\; \max\!\left(0,\; \frac{1 + \eta_t\lambda - y_t\langle w_t, x_t\rangle}{\eta_t\langle x_t, x_t\rangle}\right)\right)$$
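A single implicit-update step in code (a minimal sketch following the summary formula above; the variable names and toy values are mine):

```python
import numpy as np

def implicit_update(w, x, y, eta, lam):
    """One implicit (proximal) step for J_t(w) = lam/2 * ||w||^2 + max(0, 1 - y <w, x>)."""
    gamma = (1.0 + eta * lam - y * np.dot(w, x)) / (eta * np.dot(x, x))
    gamma = min(1.0, max(0.0, gamma))              # clip to [0, 1]
    return (w + gamma * eta * y * x) / (1.0 + eta * lam)

# Toy usage: starting from w = 0, this example lands exactly on the margin (Case 3).
w = np.zeros(3)
x = np.array([1.0, 2.0, -1.0])
w_new = implicit_update(w, x, y=1.0, eta=0.5, lam=0.1)
print(np.dot(w_new, x))   # prints 1.0
```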

SLIDE 32

Dual Problem

Outline
1. Linear Support Vector Machines
2. Stochastic Optimization
3. Implicit Updates
4. Dual Problem

SLIDE 33

Dual Problem

Deriving the Dual: Lagrangian

Recall the primal problem without bias:
$$\min_{w, \xi} \;\; \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m}\xi_i \quad \text{s.t.} \quad y_i\langle w, x_i\rangle \ge 1 - \xi_i \;\;\text{for all } i, \qquad \xi_i \ge 0.$$

Introduce non-negative dual variables $\alpha$ and $\beta$:
$$L(w, \xi, \alpha, \beta) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m}\xi_i - \sum_i \alpha_i\big(y_i\langle w, x_i\rangle - 1 + \xi_i\big) - \sum_i \beta_i\xi_i.$$

SLIDE 35

Dual Problem

Deriving the Dual: Take Gradients and Set to Zero

Write the gradients:
$$\nabla_w L(w, \xi, \alpha, \beta) = \lambda w - \sum_i \alpha_i y_i x_i = 0, \qquad \nabla_{\xi_i} L(w, \xi, \alpha, \beta) = \frac{1}{m} - \beta_i - \alpha_i = 0.$$
Conclude that
$$w = \frac{1}{\lambda}\sum_i \alpha_i y_i x_i, \qquad 0 \le \alpha_i \le \frac{1}{m}.$$

SLIDE 36

Dual Problem

Deriving the Dual: Plug back into Lagrangian

Plug $w = \frac{1}{\lambda}\sum_i \alpha_i y_i x_i$ and $\beta_i + \alpha_i = \frac{1}{m}$ into the Lagrangian:
$$\max_{\alpha} \;\; -D(\alpha) := -\frac{1}{2\lambda}\sum_{i,j}\alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle + \sum_i \alpha_i \quad \text{s.t.} \quad 0 \le \alpha_i \le \frac{1}{m}.$$

SLIDE 37

Dual Problem

Equivalently, as a minimization:
$$\min_{\alpha} \;\; D(\alpha) := \frac{1}{2\lambda}\sum_{i,j}\alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle - \sum_i \alpha_i \quad \text{s.t.} \quad 0 \le \alpha_i \le \frac{1}{m}.$$

SLIDE 38

Dual Problem

Same problem as above: quadratic objective, linear (box) constraints.
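Because the dual is just a box-constrained quadratic program, a general-purpose solver can handle small instances directly (a sketch using CVXPY, which is not mentioned on the slides and must be installed separately; the toy data and $\lambda$ are placeholders):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, d, lam = 50, 5, 0.1
X = rng.normal(size=(m, d))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=m))

Z = X * y[:, None]                      # row i is y_i * x_i
alpha = cp.Variable(m)
# D(alpha) = 1/(2 lam) * || sum_i alpha_i y_i x_i ||^2 - sum_i alpha_i
objective = cp.Minimize((1.0 / (2 * lam)) * cp.sum_squares(Z.T @ alpha) - cp.sum(alpha))
constraints = [alpha >= 0, alpha <= 1.0 / m]     # box constraints
cp.Problem(objective, constraints).solve()

w = (1.0 / lam) * Z.T @ alpha.value     # recover the primal weights
print(np.mean(np.sign(X @ w) == y))     # training accuracy
```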

SLIDE 39

Dual Problem

Coordinate Descent in the Dual

One-dimensional function (all coordinates except $\alpha_t$ held fixed):
$$\hat{D}(\alpha_t) = \frac{\alpha_t^2}{2\lambda}\langle x_t, x_t\rangle + \frac{1}{\lambda}\sum_{i} \alpha_t\alpha_i y_i y_t\langle x_i, x_t\rangle - \alpha_t + \text{const.} \quad \text{s.t.} \quad 0 \le \alpha_t \le \frac{1}{m}.$$

Take gradients and set to zero:
$$\nabla\hat{D}(\alpha_t) = \frac{\alpha_t}{\lambda}\langle x_t, x_t\rangle + \frac{1}{\lambda}\sum_{i}\alpha_i y_i y_t\langle x_i, x_t\rangle - 1 = 0.$$

SLIDE 40

Dual Problem

The same gradient, written in terms of $w_t := \sum_i y_i\alpha_i x_i$:
$$\nabla\hat{D}(\alpha_t) = \frac{\alpha_t}{\lambda}\langle x_t, x_t\rangle + \frac{1}{\lambda}\, y_t\langle w_t, x_t\rangle - 1 = 0.$$

SLIDE 41

Dual Problem

Solving for $\alpha_t$ and clipping to the box constraint:
$$\alpha_t = \min\!\left(\max\!\left(0,\; \frac{\lambda - y_t\langle w_t, x_t\rangle}{\langle x_t, x_t\rangle}\right),\; \frac{1}{m}\right).$$

SLIDE 42

Dual Problem

Contrast with Implicit Updates

Coordinate Descent in the Dual:
$$w_t = \frac{1}{\lambda}\sum_i \alpha_i y_i x_i, \qquad \alpha_t = \min\!\left(\frac{1}{m},\; \max\!\left(0,\; \frac{\lambda - y_t\langle w_t, x_t\rangle}{\langle x_t, x_t\rangle}\right)\right)$$

Implicit Updates:
$$w_{t+1} = \frac{1}{1 + \eta_t\lambda}\left[w_t + \gamma\eta_t y_t x_t\right], \qquad \gamma = \min\!\left(1,\; \max\!\left(0,\; \frac{1 + \eta_t\lambda - y_t\langle w_t, x_t\rangle}{\eta_t\langle x_t, x_t\rangle}\right)\right)$$
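A coordinate-descent pass over the dual, with $w = \frac{1}{\lambda}\sum_i \alpha_i y_i x_i$ maintained incrementally (a minimal sketch; the slides gloss over whether $w_t$ includes the coordinate being updated, and the step below is one algebraically equivalent way to implement the clipped update):

```python
import numpy as np

def dual_coordinate_descent(X, y, lam, epochs=20, seed=0):
    """Coordinate descent on D(alpha) with box constraint 0 <= alpha_i <= 1/m."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    alpha = np.zeros(m)
    w = np.zeros(d)                           # w = (1/lam) * sum_i alpha_i y_i x_i
    for _ in range(epochs):
        for t in rng.permutation(m):
            q_tt = X[t] @ X[t]
            if q_tt == 0.0:
                continue
            # Exact minimizer of D along coordinate t, then clip to [0, 1/m].
            new_alpha = alpha[t] + lam * (1.0 - y[t] * (w @ X[t])) / q_tt
            new_alpha = min(max(new_alpha, 0.0), 1.0 / m)
            w += (new_alpha - alpha[t]) * y[t] * X[t] / lam
            alpha[t] = new_alpha
    return w, alpha

# Toy usage with made-up data.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = np.sign(X @ np.array([1.0, -1.0, 0.0, 0.0, 2.0]))
w, alpha = dual_coordinate_descent(X, y, lam=0.1)
print(np.mean(np.sign(X @ w) == y))
```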

SLIDE 43

Scaling Things Up

Outline
1. Linear Support Vector Machines
2. Stochastic Optimization
3. Implicit Updates
4. Dual Problem

SLIDE 44

Scaling Things Up

What if the Data Does Not Fit in Memory?

Idea 1: Block Minimization [Yu et al., KDD 2010]
- Split data into blocks B1, B2, ... such that each Bj fits in memory
- Compress and store each block separately
- Load one block of data at a time and optimize only those αi's

Idea 2: Selective Block Minimization [Chang and Roth, KDD 2011]
- Split data into blocks B1, B2, ... such that each Bj fits in memory
- Compress and store each block separately
- Load one block of data at a time and optimize only those αi's
- Retain informative samples from each block in main memory

SLIDE 45

Scaling Things Up

What are Informative Samples?

SLIDE 46

Scaling Things Up

Some Observations

SBM and BM are wasteful:
- Both split data into blocks and compress the blocks; this requires reading the entire data at least once (expensive)
- Both pause optimization while a block is loaded into memory

Hardware 101:
- Disk I/O is slower than the CPU (sometimes by a factor of 100)
- Random access on HDD is terrible; sequential access is reasonably fast (factor of 10)
- Multi-core processors are becoming commonplace

How can we exploit this?

SLIDE 47

Scaling Things Up

Dual Cached Loops [Matsushima, Vishwanathan, Smola]

[Diagram: a Reader and a Trainer; the HDD holds the Data, RAM holds the Working Set and the Weight Vector]

SLIDE 48

Scaling Things Up

Underlying Philosophy

Iterate over the data in main memory while streaming data from disk. Evict primarily those examples from main memory that are "uninformative".

SLIDE 49

Scaling Things Up

Reader

for k = 1, ..., max_iter do
  for i = 1, ..., n do
    if |A| = Ω then
      randomly select i′ ∈ A
      A = A \ {i′}
      delete y_{i′}, Q_{i′i′}, x_{i′} from RAM
    end if
    read y_i, x_i from disk
    calculate Q_{ii} = ⟨x_i, x_i⟩
    store y_i, Q_{ii}, x_i in RAM
    A = A ∪ {i}
  end for
  if stopping criterion is met then exit end if
end for

SLIDE 50

Scaling Things Up

Trainer

α^1 = 0, w^1 = 0, ε = 9, ε_new = 0, β = 0.9
while stopping criterion is not met do
  for t = 1, ..., n do
    if |A| > 0.9 × Ω then ε = βε
    randomly select i ∈ A and read y_i, Q_{ii}, x_i from RAM
    compute ∇_i D := y_i ⟨w^t, x_i⟩ − 1
    if (α^t_i = 0 and ∇_i D > ε) or (α^t_i = C and ∇_i D < −ε) then
      A = A \ {i} and delete y_i, Q_{ii}, x_i from RAM
      continue
    end if
    α^{t+1}_i = median(0, C, α^t_i − ∇_i D / Q_{ii})
    w^{t+1} = w^t + (α^{t+1}_i − α^t_i) y_i x_i
    ε_new = max(ε_new, |∇_i D|)
  end for
  update stopping criterion ε = ε_new
end while
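The core of the trainer loop, restated in Python (a simplified single-thread sketch of the pseudocode above; it omits the reader thread, the ε schedule, and the 0.9 × Ω trigger, and the dictionary/set used for the working set A is my own choice):

```python
import numpy as np

def trainer_pass(w, alpha, A, data, C, eps, rng):
    """One pass of dual coordinate descent over the in-memory working set A,
    evicting examples whose alpha is pinned at a bound with the gradient pointing outward."""
    for i in rng.permutation(sorted(A)):
        y_i, q_ii, x_i = data[i]
        grad = y_i * (w @ x_i) - 1.0                            # nabla_i D
        if (alpha[i] == 0.0 and grad > eps) or (alpha[i] == C and grad < -eps):
            A.discard(i)                                        # evict "uninformative" example
            continue
        new_alpha = np.clip(alpha[i] - grad / q_ii, 0.0, C)     # median(0, C, alpha_i - grad/Q_ii)
        w += (new_alpha - alpha[i]) * y_i * x_i
        alpha[i] = new_alpha
    return w

# Toy usage with made-up data kept entirely in memory.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = np.sign(X[:, 0] - X[:, 1])
data = {i: (y[i], X[i] @ X[i], X[i]) for i in range(len(y))}
w, alpha, A = np.zeros(4), np.zeros(len(y)), set(range(len(y)))
for _ in range(20):
    w = trainer_pass(w, alpha, A, data, C=1.0, eps=0.1, rng=rng)
print(np.mean(np.sign(X @ w) == y), len(A))
```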

SLIDE 51

Scaling Things Up

Experiments

dataset    | n       | d       | s (%) | n+ : n− | data size
-----------|---------|---------|-------|---------|----------
ocr        | 3.5 M   | 1156    | 100   | 0.96    | 45.28 GB
dna        | 50 M    | 800     | 25    | 3e−3    | 63.04 GB
webspam-t  | 0.35 M  | 16.61 M | 0.022 | 1.54    | 20.03 GB
kddb       | 20.01 M | 29.89 M | 1e−4  | 6.18    | 4.75 GB

SLIDE 52

Scaling Things Up

Does Active Eviction Work?

[Plot: relative function value difference vs. wall clock time (sec); linear SVM on webspam-t, C = 1.0; Random vs. Active eviction]

SLIDE 53

Scaling Things Up

Comparison with Block Minimization

[Plot: relative function value difference vs. wall clock time (sec); ocr, C = 1.0; StreamSVM vs. SBM vs. BM]

SLIDE 54

Scaling Things Up

Comparison with Block Minimization

[Plot: relative function value difference vs. wall clock time (sec); webspam-t, C = 1.0; StreamSVM vs. SBM vs. BM]

SLIDE 55

Scaling Things Up

Comparison with Block Minimization

[Plot: relative function value difference vs. wall clock time (sec); kddb, C = 1.0; StreamSVM vs. SBM vs. BM]

SLIDE 56

Scaling Things Up

Comparison with Block Minimization

[Plot: relative objective function value vs. wall clock time (sec); dna, C = 1.0; StreamSVM vs. SBM vs. BM]

SLIDE 57

Scaling Things Up

Expanding Features

[Plot: relative function value difference vs. wall clock time (sec); dna with expanded features, C = 1.0; 16 GB vs. 32 GB]

SLIDE 58

Bringing in the Bias

Outline
1. Linear Support Vector Machines
2. Stochastic Optimization
3. Implicit Updates
4. Dual Problem

SLIDE 59

Bringing in the Bias

Let us Bring Back the Bias: Lagrangian

Recall the primal problem:
$$\min_{w, b, \xi} \;\; \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m}\xi_i \quad \text{s.t.} \quad y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i \;\;\text{for all } i, \qquad \xi_i \ge 0.$$

Introduce non-negative dual variables $\alpha$ and $\beta$:
$$L(w, b, \xi, \alpha, \beta) = \frac{\lambda}{2}\|w\|^2 - \sum_i \beta_i\xi_i + \frac{1}{m}\sum_{i=1}^{m}\xi_i - \sum_i \alpha_i\big(y_i(\langle w, x_i\rangle + b) - 1 + \xi_i\big).$$

SLIDE 61

Bringing in the Bias

Let us Bring Back the Bias: Take Gradients and Set to Zero

Write the gradients:
$$\nabla_w L(w, b, \xi, \alpha, \beta) = \lambda w - \sum_i \alpha_i y_i x_i = 0, \qquad \nabla_b L(w, b, \xi, \alpha, \beta) = \sum_i \alpha_i y_i = 0,$$
$$\nabla_{\xi_i} L(w, b, \xi, \alpha, \beta) = \frac{1}{m} - \beta_i - \alpha_i = 0.$$
Conclude that
$$w = \frac{1}{\lambda}\sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le \frac{1}{m}.$$

SLIDE 62

Bringing in the Bias

Let us Bring Back the Bias: Plug back into Lagrangian

Plug $w = \frac{1}{\lambda}\sum_i \alpha_i y_i x_i$ and $\beta_i + \alpha_i = \frac{1}{m}$ into the Lagrangian:
$$\max_{\alpha} \;\; -D(\alpha) := -\frac{1}{2\lambda}\sum_{i,j}\alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle + \sum_i \alpha_i \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le \frac{1}{m}.$$

SLIDE 63

Bringing in the Bias

Equivalently, as a minimization:
$$\min_{\alpha} \;\; D(\alpha) := \frac{1}{2\lambda}\sum_{i,j}\alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle - \sum_i \alpha_i \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le \frac{1}{m}.$$

SLIDE 64

Bringing in the Bias

Coordinate Descent in the Dual

Cannot pick one coordinate (the equality constraint couples them), so pick two! Call the two coordinates $t_1$ and $t_2$:
$$\hat{D}(\eta_{t_1}, \eta_{t_2}) = \frac{\eta_{t_1}^2}{2\lambda}\langle x_{t_1}, x_{t_1}\rangle + \frac{\eta_{t_2}^2}{2\lambda}\langle x_{t_2}, x_{t_2}\rangle + \frac{\eta_{t_1}}{\lambda}\sum_i \alpha_i\langle x_i, x_{t_1}\rangle + \frac{\eta_{t_2}}{\lambda}\sum_i \alpha_i\langle x_i, x_{t_2}\rangle + \frac{\eta_{t_1}\eta_{t_2}}{\lambda}\langle x_{t_1}, x_{t_2}\rangle - \eta_{t_1} - \eta_{t_2} + \text{const.}$$
$$\text{s.t.} \quad y_{t_1}\eta_{t_1} + y_{t_2}\eta_{t_2} = 0, \qquad 0 \le \alpha_{t_1} + \eta_{t_1} \le \frac{1}{m}, \qquad 0 \le \alpha_{t_2} + \eta_{t_2} \le \frac{1}{m}.$$

SLIDE 65

Bringing in the Bias

The equality constraint lets us eliminate one variable:
$$\eta_{t_1} = -\frac{y_{t_2}}{y_{t_1}}\,\eta_{t_2} = \eta.$$

SLIDE 66

Bringing in the Bias

Substituting gives a one-dimensional problem in $\eta$:
$$\hat{D}(\eta) = \frac{\eta^2}{2\lambda}\langle x_{t_1}, x_{t_1}\rangle + \frac{\eta^2}{2\lambda}\langle x_{t_2}, x_{t_2}\rangle + \frac{\eta}{\lambda}\sum_i \alpha_i\langle x_i, x_{t_1}\rangle - \frac{\eta\, y_{t_1}}{\lambda\, y_{t_2}}\sum_i \alpha_i\langle x_i, x_{t_2}\rangle - \frac{\eta^2 y_{t_1}}{\lambda\, y_{t_2}}\langle x_{t_1}, x_{t_2}\rangle - \eta + \frac{y_{t_1}}{y_{t_2}}\eta + \text{const.}$$
$$\text{s.t.} \quad 0 \le \alpha_{t_1} + \eta \le \frac{1}{m}, \qquad 0 \le \alpha_{t_2} - \frac{y_{t_1}}{y_{t_2}}\eta \le \frac{1}{m}.$$

SLIDE 67

Bringing in the Bias

Software

LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
LibLinear: http://www.csie.ntu.edu.tw/~cjlin/liblinear/

SLIDE 68

Bringing in the Bias

References (Incomplete)

Implicit Updates
- Kivinen and Warmuth. Exponentiated Gradient Versus Gradient Descent for Linear Predictors. Information and Computation, 1997.
- Kivinen, Warmuth, and Hassibi. The p-norm generalization of the LMS algorithm for adaptive filtering. IEEE Transactions on Signal Processing, 2006.
- Cheng, Vishwanathan, Schuurmans, Wang, and Caelli. Implicit Online Learning With Kernels. NIPS 2006.
- Hsieh, Chang, Lin, Keerthi, and Sundararajan. A Dual Coordinate Descent Method for Large-scale Linear SVM. ICML 2008.

SMO
- Platt. Fast Training of Support Vector Machines using Sequential Minimal Optimization. Advances in Kernel Methods — Support Vector Learning, 1999.

Dual Cached Loops
- Matsushima, Vishwanathan, and Smola. Linear Support Vector Machines via Dual Cached Loops. KDD 2012.

SLIDE 69

Bringing in the Bias

References (Incomplete)

Slides are loosely based on lecture notes from:
- http://learning.stat.purdue.edu/wiki/courses/sp2011/598a/lectures
- http://www.ee.ucla.edu/~vandenbe/shortcourses.html