

SLIDE 1

Optimization for Machine Learning

Lecture 3: Bundle Methods

S.V. N. (vishy) Vishwanathan
Purdue University
vishy@purdue.edu

July 11, 2012

SLIDE 2

Motivation

Outline

1. Motivation
2. Cutting Plane Methods
3. Non-Smooth Functions
4. Bundle Methods
5. BMRM
6. Convergence Analysis
7. Experiments
8. Lower Bounds
9. References

SLIDE 3

Motivation

Regularized Risk Minimization

Objective Function. Training data: $\{x_1, \ldots, x_m\}$; labels: $\{y_1, \ldots, y_m\}$; learn a weight vector $w$:

$$\min_w \; J(w) := \underbrace{\lambda \Omega(w)}_{\text{Regularizer}} \;+\; \underbrace{\frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, w)}_{\text{Risk } R_{\mathrm{emp}}}$$

SLIDES 4-6

Motivation

Binary Classification

[Figure, built up over three slides: training points with labels $y_i = -1$ and $y_i = +1$]

SLIDE 7

Motivation

Binary Classification

The decision hyperplane and the two margin hyperplanes are
$$\{x \mid \langle w, x \rangle + b = 0\}, \quad \{x \mid \langle w, x \rangle + b = -1\}, \quad \{x \mid \langle w, x \rangle + b = +1\}.$$

For points $x_1$ and $x_2$ on the two margins,
$$\langle w, x_1 \rangle + b = +1, \qquad \langle w, x_2 \rangle + b = -1,$$
so $\langle w, x_1 - x_2 \rangle = 2$, and the margin width is
$$\left\langle \frac{w}{\|w\|}, \, x_1 - x_2 \right\rangle = \frac{2}{\|w\|}.$$

SLIDE 8

Motivation

Linear Support Vector Machines

Optimization Problem:
$$\max_{w,b} \; \frac{2}{\|w\|} \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \ge 1 \;\text{ for all } i$$

SLIDE 9

Motivation

Linear Support Vector Machines

Equivalently, minimize the squared norm:
$$\min_{w,b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \ge 1 \;\text{ for all } i$$

SLIDE 10

Motivation

Linear Support Vector Machines

Introduce slack variables $\xi_i$ to handle non-separable data:
$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i \;\text{ for all } i, \quad \xi_i \ge 0$$

SLIDE 11

Motivation

Linear Support Vector Machines

Penalize the slacks, trading off against the margin via $\lambda$:
$$\min_{w,b,\xi} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i \;\text{ for all } i, \quad \xi_i \ge 0$$

SLIDE 12

Motivation

Linear Support Vector Machines

Rewrite the constraints with $\xi_i$ isolated:
$$\min_{w,b,\xi} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad \xi_i \ge 1 - y_i(\langle w, x_i \rangle + b) \;\text{ for all } i, \quad \xi_i \ge 0$$

SLIDE 13

Motivation

Linear Support Vector Machines

At the optimum $\xi_i = \max(0, 1 - y_i(\langle w, x_i \rangle + b))$, which eliminates the constraints:
$$\min_{w,b} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m} \max(0,\, 1 - y_i(\langle w, x_i \rangle + b))$$

SLIDE 14

Motivation

Linear Support Vector Machines

This is exactly the regularized risk minimization template:
$$\min_{w,b} \; \underbrace{\frac{\lambda}{2}\|w\|^2}_{\lambda\Omega(w)} + \underbrace{\frac{1}{m}\sum_{i=1}^{m} \max(0,\, 1 - y_i(\langle w, x_i \rangle + b))}_{R_{\mathrm{emp}}(w)}$$
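To make this objective concrete, here is a minimal NumPy sketch (our own illustration, not from the slides) that evaluates the regularized risk of the linear SVM; the names `X`, `y`, `w`, `b`, `lam` and the toy data are illustrative assumptions.

```python
import numpy as np

def regularized_risk(w, b, X, y, lam):
    """J(w) = (lam/2)||w||^2 + (1/m) sum_i max(0, 1 - y_i(<w, x_i> + b))."""
    margins = y * (X @ w + b)               # y_i(<w, x_i> + b) for all i
    hinge = np.maximum(0.0, 1.0 - margins)  # binary hinge loss per example
    return 0.5 * lam * (w @ w) + hinge.mean()

# Toy usage: 4 points in 2-D, labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(regularized_risk(np.array([0.5, 0.5]), 0.0, X, y, lam=0.1))
```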

SLIDE 15

Motivation

Binary Hinge Loss

[Figure: the hinge loss $\max(0,\, 1 - y(\langle w, x \rangle + b))$ plotted against the margin $y(\langle w, x \rangle + b)$]


SLIDE 17

Cutting Plane Methods

First-Order Taylor Expansion

For a convex function, the first-order Taylor approximation globally lower bounds the function: for any $x$ and $x'$,
$$f(x) \ge f(x') + \langle x - x', \nabla f(x') \rangle.$$
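A quick numeric sanity check of this bound on a convex function of our own choosing, $f(x) = \|x\|^2$ with $\nabla f(x') = 2x'$; everything in the snippet is illustrative, not from the slides.

```python
import numpy as np

f = lambda x: x @ x       # convex: f(x) = ||x||^2
grad = lambda x: 2.0 * x  # its gradient

rng = np.random.default_rng(0)
for _ in range(1000):
    x, xp = rng.normal(size=3), rng.normal(size=3)
    # First-order Taylor expansion at xp, evaluated at x
    taylor = f(xp) + (x - xp) @ grad(xp)
    assert f(x) >= taylor - 1e-12  # the global lower bound holds
print("lower bound verified on 1000 random pairs")
```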

SLIDES 18-26

Cutting Plane Methods

[Figure, built up over several slides: tangent (cutting) planes are added one at a time, and their pointwise maximum forms an increasingly tight piecewise linear lower bound on the objective]

SLIDES 27-29

Cutting Plane Methods

In a Nutshell

Cutting plane methods work by forming the piecewise linear lower bound
$$J(w) \ge J_t^{CP}(w) := \max_{1 \le i \le t} \{ J(w_{i-1}) + \langle w - w_{i-1}, s_i \rangle \},$$
where $s_i$ denotes the gradient $\nabla J(w_{i-1})$. At iteration $t$ the set $\{w_i\}_{i=0}^{t-1}$ is augmented by
$$w_t := \operatorname*{argmin}_w J_t^{CP}(w).$$
Stop when the duality gap
$$\epsilon_t := \min_{0 \le i \le t} J(w_i) - J_t^{CP}(w_t)$$
falls below a pre-specified threshold $\epsilon$.
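A hedged sketch of this loop on a toy smooth convex objective. The box constraint, the objective $f(w) = \|w\|^2$, and the SciPy LP solver are our assumptions (the slides fix none of these); the box is added only because minimizing a piecewise linear lower bound is otherwise unbounded. Each subproblem is the standard epigraph LP: minimize $v$ subject to every cut lying below $v$.

```python
import numpy as np
from scipy.optimize import linprog

f = lambda w: w @ w      # illustrative convex objective
grad = lambda w: 2.0 * w

d, B = 2, 10.0           # dimension and box radius (assumptions)
w, cuts, best = np.full(2, 5.0), [], np.inf
for t in range(1, 200):
    best = min(best, f(w))                         # min_i J(w_i) so far
    cuts.append((grad(w), grad(w) @ w - f(w)))     # cut: <s, w'> - off <= v
    # Epigraph LP over z = (w', v): minimize v subject to all cuts, |w'_j| <= B
    A = np.array([np.append(s, -1.0) for s, _ in cuts])
    b = np.array([off for _, off in cuts])
    res = linprog(c=np.append(np.zeros(d), 1.0), A_ub=A, b_ub=b,
                  bounds=[(-B, B)] * d + [(None, None)])
    w, lower = res.x[:d], res.x[-1]
    gap = best - lower                             # duality gap epsilon_t
    if gap <= 1e-6:
        break
print(f"stopped at t={t}, gap={gap:.2e}")
```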


SLIDE 31

Non-Smooth Functions

What if the Function is Non-Smooth?

The piecewise linear function
$$J(w) := \max_i \, \langle u_i, w \rangle$$
is convex but not differentiable at the kinks!

SLIDES 32-34

Non-Smooth Functions

Subgradients to the Rescue

A subgradient at $w'$ is any vector $s$ which satisfies
$$J(w) \ge J(w') + \langle w - w', s \rangle \quad \text{for all } w.$$
The set of all subgradients at $w'$ is denoted $\partial J(w')$.
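For the piecewise linear $J(w) = \max_i \langle u_i, w \rangle$ from Slide 31, any row $u_i$ attaining the max is a valid subgradient. A small sketch; the vectors in `U` and the test points are made up for illustration.

```python
import numpy as np

def J(w, U):
    """Piecewise linear J(w) = max_i <u_i, w>."""
    return np.max(U @ w)

def subgradient(w, U):
    """Any maximizing u_i is a subgradient; ties mean J is kinked at w."""
    return U[np.argmax(U @ w)]

U = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # illustrative rows u_i
w, wp = np.array([0.3, -0.2]), np.array([1.0, 1.0])
s = subgradient(wp, U)
# Check the subgradient inequality J(w) >= J(w') + <w - w', s>
assert J(w, U) >= J(wp, U) + (w - wp) @ s - 1e-12
print("subgradient at w' =", s)
```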

SLIDES 35-36

Non-Smooth Functions

Good News!

Cutting plane methods work with subgradients: just choose an arbitrary one.

Then what is the bad news?

SLIDE 37

Non-Smooth Functions

Bad News

[Figure: 3-D surface plot over $[-1, 1] \times [-1, 1]$, vertical axis from 0.6 to 1]


SLIDES 39-42

Bundle Methods

Bundle Methods: Stabilized Cutting Plane Method

proximal:
$$w_t := \operatorname*{argmin}_w \left\{ \frac{\zeta_t}{2} \|w - \hat{w}_{t-1}\|^2 + J_t^{CP}(w) \right\}$$

trust region:
$$w_t := \operatorname*{argmin}_w \left\{ J_t^{CP}(w) \;\text{ s.t. }\; \frac{1}{2} \|w - \hat{w}_{t-1}\|^2 \le \kappa_t \right\}$$

level set:
$$w_t := \operatorname*{argmin}_w \left\{ \frac{1}{2} \|w - \hat{w}_{t-1}\|^2 \;\text{ s.t. }\; J_t^{CP}(w) \le \tau_t \right\}$$

Two Kinds of Steps/Iterations
- Null Step: enrich the local model of the objective function
- Serious Step: decrease the objective function value

Both involve expensive function and gradient evaluations.
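To make the proximal variant concrete, here is a hedged sketch that solves one proximal step as a smooth QP via the standard epigraph reformulation, $\min_{w,v} \frac{\zeta}{2}\|w - \hat{w}\|^2 + v$ subject to $\langle a_i, w \rangle + b_i \le v$; the bundle of cuts, $\zeta$, and the SciPy solver choice are illustrative assumptions, not the lecture's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def proximal_bundle_step(w_hat, cuts, zeta):
    """One proximal step: argmin_w (zeta/2)||w - w_hat||^2 + max_i(<a_i, w> + b_i),
    solved as a smooth QP over z = (w, v) with epigraph variable v."""
    d = len(w_hat)
    obj = lambda z: 0.5 * zeta * np.sum((z[:d] - w_hat) ** 2) + z[d]
    cons = [{"type": "ineq", "fun": lambda z, a=a, b=b: z[d] - (a @ z[:d] + b)}
            for a, b in cuts]  # each cut must lie below v
    z0 = np.append(w_hat, max(a @ w_hat + b for a, b in cuts))  # feasible start
    res = minimize(obj, z0, method="SLSQP", constraints=cons)
    return res.x[:d]

# Illustrative bundle of three cuts (a_i, b_i) in 2-D
cuts = [(np.array([1.0, 0.0]), 0.0),
        (np.array([0.0, 1.0]), 0.0),
        (np.array([-1.0, -1.0]), 0.5)]
print(proximal_bundle_step(np.array([1.0, 1.0]), cuts, zeta=1.0))
```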


SLIDES 44-46

BMRM

Key Observation

The regularized risk already comes with stabilization built in:
$$\min_w \; J(w) := \underbrace{\lambda \Omega(w)}_{\text{Regularizer}} + \underbrace{\frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, w)}_{\text{Risk } R_{\mathrm{emp}}}$$

Bundle Method for Regularized Risk Minimization (BMRM)

1: input & initialization: $\epsilon \ge 0$, $w_0$, $t \leftarrow 0$
2: repeat
3:   $t \leftarrow t + 1$
4:   Compute $a_t \in \partial_w R_{\mathrm{emp}}(w_{t-1})$ and $b_t \leftarrow R_{\mathrm{emp}}(w_{t-1}) - \langle w_{t-1}, a_t \rangle$
5:   Update model: $R_t^{CP}(w) := \max_{1 \le i \le t} \{ \langle w, a_i \rangle + b_i \}$
6:   $w_t \leftarrow \operatorname*{argmin}_w J_t(w) := \lambda \Omega(w) + R_t^{CP}(w)$
7:   $\epsilon_t \leftarrow \min_{0 \le i \le t} J(w_i) - J_t(w_t)$
8: until $\epsilon_t \le \epsilon$

Look Ma, no parameters!
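A compact, hedged sketch of this algorithm for the linear SVM objective (L2 regularizer, binary hinge loss, no bias term). The inner problem on line 6 is solved through its dual over the simplex (see Slide 49) with a generic SciPy solver; the data, $\lambda$, and solver choice are toy assumptions rather than the lecture's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def bmrm(X, y, lam, eps=1e-4, max_iter=100):
    """BMRM for J(w) = (lam/2)||w||^2 + mean hinge loss (illustrative)."""
    m, d = X.shape
    remp = lambda w: np.maximum(0.0, 1.0 - y * (X @ w)).mean()
    def subgrad(w):
        # One subgradient of the mean hinge loss: -(1/m) sum over violators
        active = y * (X @ w) < 1.0
        return -(y[active, None] * X[active]).sum(axis=0) / m
    w, A, b, best = np.zeros(d), [], [], np.inf
    for t in range(1, max_iter + 1):
        best = min(best, 0.5 * lam * (w @ w) + remp(w))  # min_i J(w_i)
        a = subgrad(w)
        A.append(a)
        b.append(remp(w) - w @ a)                        # cut offset b_t
        Am, bv = np.array(A), np.array(b)
        # Dual of line 6: max_alpha -(1/2 lam) a'A'Aa + a'b over the simplex
        neg_dual = lambda al: al @ (Am @ Am.T) @ al / (2 * lam) - al @ bv
        res = minimize(neg_dual, np.full(t, 1.0 / t), method="SLSQP",
                       bounds=[(0.0, 1.0)] * t,
                       constraints=[{"type": "eq",
                                     "fun": lambda al: al.sum() - 1.0}])
        w = -(Am.T @ res.x) / lam                        # primal from dual
        gap = best - (0.5 * lam * (w @ w) + np.max(Am @ w + bv))  # epsilon_t
        if gap <= eps:
            break
    return w, gap, t

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, gap, t = bmrm(X, y, lam=0.1)
print(f"w={w}, gap={gap:.2e}, iters={t}")
```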


SLIDE 48

Convergence Analysis

Convergence Rates

Theorem. Assume $\|\partial R_{\mathrm{emp}}(w)\| \le G$ for all $w$, and $\|\partial^2 \Omega(w)\| \ge H$ for all $w$. For any $\epsilon < 4G^2/(H\lambda)$, BMRM converges to the desired precision after
$$n \le \log_2 \frac{H \lambda J(0)}{G^2} + \frac{8 G^2}{H \lambda \epsilon} - 1$$
steps. Furthermore, if the norm of the Hessian of $J(w)$ is bounded by $\bar{H}$, convergence to any $\epsilon \le \bar{H}/2$ takes at most the following number of steps:
$$n \le \log_2 \frac{H \lambda J(0)}{4 G^2} + \frac{4}{H \lambda} \max\!\left(0,\; \bar{H} - \frac{8 G^2}{H \lambda}\right) + \frac{4 \bar{H}}{H \lambda} \log \frac{\bar{H}}{2\epsilon}$$

SLIDE 49

Convergence Analysis

Proof Intuition

Let $A = [a_1, \ldots, a_t]$ and $b = [b_1, \ldots, b_t]^\top$, where $a_t \in \partial R_{\mathrm{emp}}(w_{t-1})$ and $b_t := R_{\mathrm{emp}}(w_{t-1}) - \langle w_{t-1}, a_t \rangle$. The dual problem of
$$w_t = \operatorname*{argmin}_{w \in \mathbb{R}^d} \Big\{ J_t(w) := \frac{\lambda}{2} \|w\|_2^2 + \underbrace{\max_{1 \le i \le t}\, \langle w, a_i \rangle + b_i}_{R_t^{CP}(w)} \Big\}$$
is
$$\alpha_t = \operatorname*{argmax}_{\alpha \in \mathbb{R}^t} \Big\{ -\frac{1}{2\lambda} \alpha^\top A^\top A \alpha + \alpha^\top b \;\text{ s.t. }\; \alpha \ge 0, \; \|\alpha\|_1 = 1 \Big\}.$$
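The duality here follows from a short Lagrangian calculation; the following worked steps are our reconstruction of the standard argument, not taken verbatim from the slides.

```latex
% Write the max as a maximization over the simplex, then swap min and max:
\begin{align*}
\min_w \; \frac{\lambda}{2}\|w\|_2^2 + \max_{1 \le i \le t}\, \langle w, a_i \rangle + b_i
  &= \min_w \max_{\alpha \ge 0,\, \|\alpha\|_1 = 1} \; \frac{\lambda}{2}\|w\|_2^2
     + \alpha^\top (A^\top w + b) \\
  &= \max_{\alpha \ge 0,\, \|\alpha\|_1 = 1} \min_w \; \frac{\lambda}{2}\|w\|_2^2
     + \alpha^\top A^\top w + \alpha^\top b .
\end{align*}
% The inner minimization is an unconstrained strongly convex quadratic;
% setting the gradient to zero gives  w = -(1/\lambda) A \alpha,
% and substituting back yields the dual objective
% -(1/2\lambda)\, \alpha^\top A^\top A \alpha + \alpha^\top b .
```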

SLIDE 50

Convergence Analysis

Proof Intuition

Lower bound the improvement in the gap due to the full maximization
$$\alpha_t = \operatorname*{argmax}_{\alpha \in \mathbb{R}^t} \Big\{ -\frac{1}{2\lambda} \alpha^\top A^\top A \alpha + \alpha^\top b \;\text{ s.t. }\; \alpha \ge 0, \; \|\alpha\|_1 = 1 \Big\}$$
by the improvement due to the one-dimensional maximization along the segment toward the newest cut:
$$\operatorname*{argmax}_{\eta \in [0,1]} \; -\frac{1}{2\lambda} \begin{pmatrix} (1-\eta)\,\alpha_{t-1} \\ \eta \end{pmatrix}^{\!\top} A^\top A \begin{pmatrix} (1-\eta)\,\alpha_{t-1} \\ \eta \end{pmatrix} + b^\top \begin{pmatrix} (1-\eta)\,\alpha_{t-1} \\ \eta \end{pmatrix}.$$

SLIDE 51

Convergence Analysis

Proof Intuition

Since the function is strongly convex, we can show
$$\epsilon_t - \epsilon_{t+1} \ge \frac{\epsilon_t}{2} \min\!\left(1, \; \frac{H \lambda \epsilon_t}{4 G^2}\right).$$
The claim follows by induction.
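To see how this recurrence yields the stated rate, here is a back-of-the-envelope unrolling; it is our reconstruction, with constants handled loosely.

```latex
% Phase 1 (large gap, H \lambda \epsilon_t / 4G^2 \ge 1):
%   \epsilon_{t+1} \le \epsilon_t / 2, so the gap halves each step;
%   this phase lasts a logarithmic number of steps.
% Phase 2 (small gap): \epsilon_{t+1} \le \epsilon_t (1 - H\lambda\epsilon_t / 8G^2),
% hence, using 1/(1-x) \ge 1 + x,
\[
\frac{1}{\epsilon_{t+1}}
\;\ge\; \frac{1}{\epsilon_t}\Big(1 - \frac{H\lambda\epsilon_t}{8G^2}\Big)^{-1}
\;\ge\; \frac{1}{\epsilon_t} + \frac{H\lambda}{8G^2},
\]
% so 1/\epsilon_t grows by at least H\lambda/8G^2 per step, and the gap
% drops below \epsilon after O(8G^2 / (H\lambda\epsilon)) further steps,
% matching the two terms in the bound on Slide 48.
```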

SLIDE 52

Convergence Analysis

Comparison with Other Proofs

The best known rate for general bundle methods is $O(1/\epsilon^3)$. Our solver is specialized, and hence enjoys better rates of convergence. These results improve upon those of Tsochantaridis et al., who show $O(1/\epsilon^2)$ rates for a cutting plane based solver.


SLIDE 54

Experiments

Convergence Behavior: Binary Classification

RCV1: 677,399 examples, 47,236 dimensions.

[Figure: iteration $t$ ($10^0$ to $10^4$, log scale) vs. approximation gap $\epsilon_t$ ($10^{-4}$ to $10^0$, log scale) for $\lambda$ = 1e-3, 1e-4, 1e-5, 1e-6, with $O(1/\epsilon)$ and $O(\log(1/\epsilon))$ reference curves]

SLIDE 55

Experiments

Convergence Behavior: Binary Classification

News20: 19,954 examples, 1,355,191 dimensions.

[Figure: iteration $t$ ($10^0$ to $10^4$, log scale) vs. approximation gap $\epsilon_t$ ($10^{-4}$ to $10^0$, log scale) for $\lambda$ = 1e-3, 1e-4, 1e-5, 1e-6, with $O(1/\epsilon)$ and $O(\log(1/\epsilon))$ reference curves]

SLIDE 56

Experiments

Convergence Behavior: Binary Classification

Worm: 1,026,036 examples, 804 dimensions.

[Figure: iteration $t$ ($10^0$ to $10^4$, log scale) vs. approximation gap $\epsilon_t$ ($10^{-4}$ to $10^0$, log scale) for $\lambda$ = 1e-3, 1e-4, 1e-5, 1e-6, with $O(1/\epsilon)$ and $O(\log(1/\epsilon))$ reference curves]


SLIDE 58

Lower Bounds

Are the Rates Optimal?

Counter Example. Given $\epsilon > 0$, define $m = 2/\epsilon$, $y_i = (-1)^i$, and $x_i \in \mathbb{R}^{m+1}$ such that
$$x_i = (-1)^i \, \big( \sqrt{m},\; 0, \ldots, 0,\; \underbrace{m}_{\text{entry } i+1},\; 0, \ldots, 0 \big)^\top,$$
i.e., entry 1 is $\sqrt{m}$, entry $i+1$ is $m$, and all other entries are 0.

SLIDE 59

Lower Bounds

Are the Rates Optimal?

Objective Function. Set $\lambda = 1$. Then the regularized risk is
$$J(w) = \frac{1}{2}\|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \max(0,\, 1 - y_i \langle x_i, w \rangle) = \frac{1}{2}\|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \max(0,\, 1 - \sqrt{m}\, w_1 - m\, w_{i+1}).$$
The minimizer is
$$w^* = \Big( \frac{1}{2\sqrt{m}}, \frac{1}{2m}, \frac{1}{2m}, \ldots, \frac{1}{2m} \Big)^{\!\top} \quad \text{with} \quad J(w^*) = \frac{1}{4m}.$$
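As a quick check of the claimed optimal value (a worked computation we add for completeness): at $w^*$ every hinge term vanishes, since $\sqrt{m}\, w_1^* + m\, w_{i+1}^* = \tfrac{1}{2} + \tfrac{1}{2} = 1$, so

```latex
\[
J(w^*) \;=\; \frac{1}{2}\|w^*\|^2
\;=\; \frac{1}{2}\left( \frac{1}{4m} + m \cdot \frac{1}{4m^2} \right)
\;=\; \frac{1}{2} \cdot \frac{1}{2m}
\;=\; \frac{1}{4m}.
\]
```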

SLIDE 60

Lower Bounds

Are the Rates Optimal?

Theorem. Let $w_0 = \big( \frac{1}{\sqrt{m}}, 0, 0, \ldots \big)^\top$. Then
$$\min_{1 \le i \le t} J(w_i) - J(w^*) > \epsilon \quad \text{for all } t < \frac{2}{3\epsilon}.$$
The crux of the proof is to show that
$$w_t = \Big( \frac{1}{\sqrt{m}}, \underbrace{\frac{1}{t}, \ldots, \frac{1}{t}}_{t \text{ copies}}, 0, \ldots \Big)^{\!\top}.$$

SLIDE 61

Lower Bounds

Understanding the Lower Bounds

What do the upper bounds guarantee?
$$\exists\, c, \; \forall\, \epsilon > 0, \; \forall\, J \in \mathcal{F}: \quad T(\epsilon; J) \le \frac{c}{\epsilon}$$

What do the lower bounds guarantee?
$$\forall\, \epsilon > 0, \; \exists\, c, \; \exists\, J_\epsilon \in \mathcal{F} \;\text{ s.t. }\; T(\epsilon; J_\epsilon) \ge \frac{c}{\epsilon}$$

[Figure: number of iterations (1 to 7) plotted for inputs 1 to 4]


SLIDE 63

References

- X. Zhang, A. Saha, and S. V. N. Vishwanathan. Lower Bounds for BMRM and Faster Rates for Training SVMs. NIPS 2010.
- C-H. Teo, S. V. N. Vishwanathan, A. Smola, and Q. V. Le. Bundle Methods for Regularized Risk Minimization. JMLR 11:311-365, January 2010.
- A. Smola, S. V. N. Vishwanathan, and Q. V. Le. Bundle Methods for Machine Learning. NIPS 2007.
- C-H. Teo, Q. Le, A. Smola, and S. V. N. Vishwanathan. A scalable modular convex solver for regularized risk minimization. KDD 2007.