

SLIDE 1

Optimization for Machine Learning
Lecture 4: Quasi-Newton Methods

S.V.N. (vishy) Vishwanathan
Purdue University
vishy@purdue.edu

July 11, 2012

SLIDE 2

The Story So Far

Two Different Philosophies
Online algorithms: use a small subset of the data at a time and cycle through it repeatedly
Batch optimization: use the entire dataset to compute gradients and function values

Gradient-Based Approaches
Bundle methods: lower-bound the objective function using gradients
Quasi-Newton algorithms: use the gradients to estimate the Hessian (build a quadratic approximation of the objective)

SLIDE 3

Outline

1. Classical Quasi-Newton Algorithms
2. Non-smooth Problems
3. BFGS with Subgradients
4. Experiments

SLIDE 4

Classical Quasi-Newton Algorithms

Broyden, Fletcher, Goldfarb, Shanno

SLIDE 5

Classical Quasi-Newton Algorithms

Standard BFGS - I

Locally Quadratic Approximation
$\nabla J(w_t)$ is the gradient of $J$ at $w_t$, and $H_t$ is an $n \times n$ estimate of the Hessian of $J$:
$$m_t(w) = J(w_t) + \langle \nabla J(w_t), w - w_t \rangle + \tfrac{1}{2}(w - w_t)^\top H_t (w - w_t)$$

Parameter Update
Minimizing the model exactly gives
$$w_{t+1} = \operatorname{argmin}_w \; m_t(w) = w_t - H_t^{-1} \nabla J(w_t)$$
In practice the step is damped:
$$w_{t+1} = w_t - \eta_t B_t \nabla J(w_t)$$
$\eta_t$ is a step size, usually found via a line search; $B_t = H_t^{-1}$ is a symmetric PSD matrix.
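To make the update concrete, here is a minimal numpy sketch of one quasi-Newton step. The deck does not prescribe an interface, so `grad_J` and `line_search` are assumed callables and the names are illustrative; a Wolfe line search itself is sketched after Slide 7.

```python
import numpy as np

def quasi_newton_step(w, grad_J, B, line_search):
    """One BFGS-style update: w <- w + eta * d with d = -B grad J(w)."""
    g = grad_J(w)                # gradient at the current iterate
    d = -B @ g                   # search direction from the inverse-Hessian estimate B
    eta = line_search(w, d)      # step size, e.g. satisfying the Wolfe conditions
    return w + eta * d
```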

SLIDE 6

Classical Quasi-Newton Algorithms

Standard BFGS - II

B Matrix Update
Update $B$ by
$$B_{t+1} = \operatorname{argmin}_B \; \|B - B_t\|_W \quad \text{s.t.} \quad s_t = B y_t$$
$y_t = \nabla J(w_{t+1}) - \nabla J(w_t)$ is the difference of gradients; $s_t = w_{t+1} - w_t$ is the difference in parameters.

This yields the update formula
$$B_{t+1} = \left( I - \frac{s_t y_t^\top}{\langle s_t, y_t \rangle} \right) B_t \left( I - \frac{y_t s_t^\top}{\langle s_t, y_t \rangle} \right) + \frac{s_t s_t^\top}{\langle s_t, y_t \rangle}$$

Limited-memory variant: use a low-rank approximation to $B$.
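The closed-form update translates directly into code. A sketch, with the usual safeguard of skipping the update when the curvature $\langle s_t, y_t \rangle$ is not safely positive (the safeguard is my assumption, not something stated on the slide):

```python
import numpy as np

def bfgs_update(B, s, y, eps=1e-10):
    """BFGS update of the inverse-Hessian estimate B.

    s = w_{t+1} - w_t,  y = grad J(w_{t+1}) - grad J(w_t).
    Implements B <- (I - s y^T/<s,y>) B (I - y s^T/<s,y>) + s s^T/<s,y>.
    """
    sy = s @ y
    if sy <= eps:                      # curvature too small: skip to keep B PSD
        return B
    V = np.eye(len(s)) - np.outer(s, y) / sy
    return V @ B @ V.T + np.outer(s, s) / sy
```

The limited-memory variant (L-BFGS) never forms $B$ explicitly; it keeps the last few $(s, y)$ pairs and applies the update implicitly.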

SLIDE 7

Classical Quasi-Newton Algorithms

Line Search: Wolfe Conditions
Sufficient decrease: $J(w_t + \eta_t d_t) \le J(w_t) + c_1 \eta_t \langle \nabla J(w_t), d_t \rangle$
Curvature condition: $\langle \nabla J(w_t + \eta_t d_t), d_t \rangle \ge c_2 \langle \nabla J(w_t), d_t \rangle$
where $0 < c_1 < c_2 < 1$.
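A direct transcription of the two conditions as a predicate. The defaults $c_1 = 10^{-4}$ and $c_2 = 0.9$ are conventional choices for quasi-Newton methods, not values fixed by the slide:

```python
def wolfe_conditions_hold(J, grad_J, w, d, eta, c1=1e-4, c2=0.9):
    """Check sufficient decrease and the curvature condition for step eta."""
    g0_d = grad_J(w) @ d             # directional derivative at w (should be < 0)
    decrease = J(w + eta * d) <= J(w) + c1 * eta * g0_d
    curvature = grad_J(w + eta * d) @ d >= c2 * g0_d
    return decrease and curvature
```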

SLIDE 8

Outline

1. Classical Quasi-Newton Algorithms
2. Non-smooth Problems
3. BFGS with Subgradients
4. Experiments

SLIDE 9

Non-smooth Problems

Non-smooth Convex Optimization
BFGS assumes that the objective function is smooth.
But some of our losses look like this:
[Figure: a non-smooth convex loss with kinks]
Houston, we have a problem!

SLIDE 10

Non-smooth Problems

Subgradients
A subgradient of $f$ at $x'$ is any vector $s$ which satisfies
$$f(x) \ge f(x') + \langle x - x', s \rangle \quad \text{for all } x.$$
The set of all subgradients at $x'$ is the subdifferential, denoted $\partial f(x')$.
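As a concrete instance of the definition, here is one valid subgradient of the regularized hinge-loss objective that reappears on Slide 18. At examples whose margin is exactly 1 the hinge is kinked, and picking the zero branch there still yields a valid subgradient; the code is a sketch with illustrative names:

```python
import numpy as np

def hinge_subgradient(w, X, y, lam):
    """One element of the subdifferential of
    J(w) = lam/2 ||w||^2 + (1/m) sum_i max(0, 1 - y_i <x_i, w>)."""
    margins = 1.0 - y * (X @ w)
    active = margins > 0                       # examples with positive loss
    return lam * w - (X[active].T @ y[active]) / len(y)
```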

SLIDE 11

Non-smooth Problems

Why is Non-Smooth Optimization Hard?

The Key Difficulties
A negative subgradient direction is not necessarily a descent direction
Abrupt changes in function value can occur
It is difficult to detect convergence

[Figure: $f(x) = |x|$, with $\partial f(0) = [-1, 1]$]

SLIDE 12

Non-smooth Problems

Subgradients: The Good, the Bad, and the Ugly
The subdifferential is a convex set
Not every subgradient is a descent direction!
$d$ is a descent direction if, and only if, $\langle d, s \rangle < 0$ for all $s \in \partial f(x)$
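Since $\langle \cdot, d \rangle$ is linear and the subdifferential is convex, the descent condition only needs to be checked at the extreme points of the subdifferential. A small sketch using the running example $f(x) = |x|$ at $0$, where $\partial f(0) = [-1, 1]$:

```python
import numpy as np

def is_descent_direction(d, subgrad_extremes):
    """d is a descent direction iff <s, d> < 0 for every s in the
    subdifferential; by linearity, checking its extreme points suffices."""
    return max(s @ d for s in subgrad_extremes) < 0

# f(x) = |x| at x = 0: extreme subgradients are -1 and +1,
# so no direction descends; consistent with 0 being the minimizer.
extremes = [np.array([-1.0]), np.array([1.0])]
print(is_descent_direction(np.array([1.0]), extremes))    # False
print(is_descent_direction(np.array([-1.0]), extremes))   # False
```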

SLIDE 13

Outline

1. Classical Quasi-Newton Algorithms
2. Non-smooth Problems
3. BFGS with Subgradients
4. Experiments

SLIDE 14

BFGS with Subgradients

When Working with Subgradients, Three Things Break Down
The locally quadratic approximation is no longer well defined
The descent direction $-B_t \nabla J(w_t)$ is not well defined
The line search to find $\eta_t$ needs to be modified

SLIDE 15

BFGS with Subgradients

Changing the Approximation
The smooth model used the gradient $\nabla J(w_t)$. Substituting a single subgradient $s \in \partial J(w_t)$ leaves the model ill defined, since the choice of $s$ is arbitrary; instead, take the supremum over the whole subdifferential.

Locally (pseudo) Quadratic Approximation
$$m_t(w) = \sup_{s \in \partial J(w_t)} \Big\{ J(w_t) + \langle s, w - w_t \rangle + \tfrac{1}{2}(w - w_t)^\top H_t (w - w_t) \Big\}$$
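A one-dimensional worked example (my addition, not from the slides) shows what the supremum buys. Take $J(w) = |w|$ at $w_t = 0$ with Hessian estimate $H_t = h > 0$; since $\partial J(0) = [-1, 1]$,

$$m_t(w) = \sup_{s \in [-1,1]} \Big\{ \langle s, w \rangle + \tfrac{h}{2} w^2 \Big\} = |w| + \tfrac{h}{2} w^2,$$

which is minimized at $w = 0$, the true minimizer. A model built from any single subgradient $s$ with $|s| < 1$ would instead have its minimum at $w = -s/h$, a spurious step away from the optimum.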

SLIDE 16

BFGS with Subgradients

Descent Direction Finding

Locally (pseudo) Quadratic Approximation
$$w_{t+1} = \operatorname{argmin}_w \sup_{s \in \partial J(w_t)} \Big\{ J(w_t) + \langle s, w - w_t \rangle + \tfrac{1}{2}(w - w_t)^\top H_t (w - w_t) \Big\}$$

The supremum is intractable in general, so approximate it with a finite, growing set of subgradients $s_1, \dots, s_k \in \partial J(w_t)$:
$$w^k_{t+1} = \operatorname{argmin}_w \; \tfrac{1}{2}(w - w_t)^\top H_t (w - w_t) + \xi \quad \text{s.t.} \quad J(w_t) + \langle s_i, w - w_t \rangle \le \xi \quad \text{for } s_1, \dots, s_k \in \partial J(w_t)$$

In terms of the direction $d = w - w_t$ this is the quadratic program
$$P_k: \quad \min_{d, \xi} \; \tfrac{1}{2} d^\top H_t d + \xi \quad \text{s.t.} \quad J(w_t) + \langle s_i, d \rangle \le \xi \quad \text{for } s_1, \dots, s_k \in \partial J(w_t)$$

SLIDE 17

BFGS with Subgradients

Descent Direction Finding

Recall the quadratic program from the previous slide:
$$P_k: \quad \min_{d, \xi} \; \tfrac{1}{2} d^\top H_t d + \xi \quad \text{s.t.} \quad J(w_t) + \langle s_i, d \rangle \le \xi \quad \text{for } s_1, \dots, s_k \in \partial J(w_t)$$

Parameter Update
Require: maxitr
1: k ← 1, d_1 ← −B_t s_1 for some arbitrary s_1 ∈ ∂J(w_t)
2: repeat
3:   s_k ← argsup_{s ∈ ∂J(w_t)} ⟨s, d_k⟩
4:   if ⟨s_k, d_k⟩ < 0 then
5:     return d_k
6:   else
7:     d_{k+1} ← argmin_d P_k(d); k ← k + 1
8:   end if
9: until k ≥ maxitr
10: return failure
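A sketch of this loop in Python. The subdifferential oracle `argsup` is left to the caller, and scipy's general-purpose SLSQP solver stands in for whatever specialized QP solver one would use in practice; all names are illustrative rather than taken from the paper:

```python
import numpy as np
from scipy.optimize import minimize

def find_descent_direction(H, J_wt, s1, argsup, maxitr=20):
    """Descent-direction finding with subgradients (sketch of the slide's loop).

    H      : Hessian estimate H_t (PSD); B_t = H^{-1}
    J_wt   : J(w_t), which enters the constraints of P_k
    s1     : an arbitrary initial subgradient in dJ(w_t)
    argsup : oracle d -> argsup_{s in dJ(w_t)} <s, d>
    """
    subgrads = [np.asarray(s1)]
    d = -np.linalg.solve(H, s1)                # d_1 = -B_t s_1
    for _ in range(maxitr):
        s = argsup(d)                          # worst-case subgradient for d
        if s @ d < 0:                          # descent direction found
            return d
        subgrads.append(np.asarray(s))
        d = solve_Pk(H, J_wt, subgrads, d)     # d_{k+1} = argmin_d P_k
    return None                                # failure

def solve_Pk(H, J_wt, subgrads, d0):
    """P_k: min_{d, xi} 1/2 d^T H d + xi  s.t.  J_wt + <s_i, d> <= xi."""
    n = len(d0)
    fun = lambda z: 0.5 * z[:n] @ H @ z[:n] + z[n]
    cons = [{"type": "ineq", "fun": lambda z, s=s: z[n] - J_wt - s @ z[:n]}
            for s in subgrads]
    xi0 = J_wt + max(s @ d0 for s in subgrads)   # feasible starting slack
    res = minimize(fun, np.append(d0, xi0), method="SLSQP", constraints=cons)
    return res.x[:n]
```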

SLIDE 18

BFGS with Subgradients

The Hinge Loss Revisited

The objective function
$$J(w) := \frac{\lambda}{2} \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \max\big(0, 1 - y_i \langle x_i, w \rangle\big)$$

Plotted along any direction, $J$ looks like a smooth curve at full scale; zoomed in, the hinges become visible.
[Figures: $J$ along a search direction, at full scale and zoomed in near a hinge]

Piecewise quadratic $\Longrightarrow$ exact line search in linear time.
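The exact line search exploits exactly this structure. Along a direction $d$, $\phi(\eta) = J(w + \eta d)$ is piecewise quadratic with one breakpoint per hinge, and $\phi'$ is piecewise linear and nondecreasing, so it suffices to sweep the breakpoints until $\phi'$ crosses zero. A sketch assuming $\lambda > 0$ and $d \ne 0$; it sorts the breakpoints, so it is $O(m \log m)$ rather than the linear time quoted on the slide, which would need a median-finding variant:

```python
import numpy as np

def hinge_exact_line_search(w, d, X, y, lam):
    """Exactly minimize phi(eta) = J(w + eta d) over eta >= 0 for
    J(w) = lam/2 ||w||^2 + (1/m) sum_i max(0, 1 - y_i <x_i, w>).

    With a_i = y_i<x_i,w> and b_i = y_i<x_i,d>,
        phi'(eta) = lam (<w,d> + eta ||d||^2) - (1/m) sum_{i active} b_i,
    where hinge i is active while 1 - a_i - eta b_i > 0.
    """
    m = len(y)
    a = y * (X @ w)
    b = y * (X @ d)
    wd, dd = w @ d, d @ d

    active = (1.0 - a > 0) | ((1.0 - a == 0) & (b < 0))   # state at eta = 0+
    S = b[active].sum()

    def piece_zero(S):                     # zero of the current linear piece
        return (S / m - lam * wd) / (lam * dd)

    with np.errstate(divide="ignore", invalid="ignore"):
        breaks = np.where(b != 0, (1.0 - a) / b, np.inf)
    idx = np.where(np.isfinite(breaks) & (breaks > 0))[0]
    idx = idx[np.argsort(breaks[idx])]     # sweep breakpoints left to right

    prev = 0.0
    for i in idx:
        eta = piece_zero(S)
        if eta < prev:                     # phi' crossed zero at the breakpoint
            return prev
        if eta <= breaks[i]:               # crossing inside the current piece
            return eta
        S -= abs(b[i])                     # passing a breakpoint toggles hinge i
        prev = breaks[i]
    return max(piece_zero(S), prev)        # crossing in the final piece
```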

SLIDE 19

Outline

1. Classical Quasi-Newton Algorithms
2. Non-smooth Problems
3. BFGS with Subgradients
4. Experiments

SLIDE 20

Experiments

Why Not Just Use BFGS?
Leukemia: 38 train, 34 test, 7129 dimensions
real-sim: 57763 train, 14438 test, 20958 dimensions
[Figure: CPU time vs. objective function value]

SLIDE 21

Experiments

subBFGS: Results on a Simple Problem

The Problem
$$J(w_1, w_2) = 10\,|w_1| + |w_2|$$
A particularly evil problem for BFGS!
[Figure: surface plot of $J$]

SLIDE 22

Experiments

subBFGS: Results on a Simple Problem

BFGS
[Figures: BFGS iterates on $J(w_1, w_2) = 10|w_1| + |w_2|$, full view and zoomed in]
Hops from orthant to orthant
Slow convergence :(

subBFGS
[Figure: subBFGS iterates on the same problem]
Exact line search
Converges in 2 iterations :)

SLIDE 23

Experiments

Are Our Modifications Helpful?
INEX: 6053 train, 6054 test, 167295 dimensions, 18 classes
TMC2007: 21519 train, 7077 test, 30438 dimensions, 22 classes
[Figures: CPU time vs. objective function value for INEX ($\lambda = 10^{-6}$) and TMC2007 ($\lambda = 10^{-5}$), comparing GD, subGD, and subLBFGS]

SLIDE 24

Experiments

On a Simple Toy Problem
[Figures: BFGS approximation to the objective function and gradient; the BFGS quadratic model against the piecewise linear function, and the gradient of the BFGS model against the piecewise constant gradient]

SLIDE 25

Experiments

Results on Some Standard Datasets
Covertype: 522911 train, 58101 test, 54 dimensions
[Figure: CPU time vs. objective function value for Covertype ($\lambda = 10^{-6}$), comparing BMRM, OCAS, and subLBFGS]

SLIDE 26

Experiments

Results on Some Standard Datasets
CCAT: 781265 train, 23149 test, 47236 dimensions
[Figure: CPU time vs. objective function value for CCAT ($\lambda = 10^{-6}$), comparing BMRM, OCAS, and subLBFGS]

SLIDE 27

Experiments

The Pros and Cons of subBFGS

Quasi-Newton Philosophy
Use the gradients to build a quadratic approximation
Initially this approximation is a good fit, so initial progress is rapid
Closer to the optimum the hinges matter, so progress slows down near the optimum

Line Search
subBFGS requires a line search which fulfills the Wolfe conditions
For the binary and multiclass hinge loss, an exact line search is cheap
Can we do a cheap line search for structured losses?

SLIDE 28

References

J. Yu, S. V. N. Vishwanathan, S. Günter, and N. N. Schraudolph. A quasi-Newton approach to nonsmooth convex optimization. Submitted to JMLR (short version in ICML 2008).

A. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. Submitted to JMLR (short version in NIPS 2007).

C-H. Teo, Q. Le, A. Smola, and S. V. N. Vishwanathan. A scalable modular convex solver for regularized risk minimization. KDD 2007.