SLIDE 1

Local Function Optimization

COMPSCI 371D — Machine Learning


SLIDE 2

Outline

1. Gradient, Hessian, and Convexity
2. A Local, Unconstrained Optimization Template
3. Steepest Descent
4. Termination
5. Convergence Speed of Steepest Descent
6. Convergence Speed of Newton’s Method
7. Newton’s Method
8. Counting Steps versus Clocking


SLIDE 3

Motivation and Scope

  • Parametric predictor: h(x ; v) : R^d × R^m → Y
  • As a predictor: h(x ; v) = h_v(x) : R^d → Y
  • Risk: LT(v) = (1/N) Σ_{n=1..N} ℓ(yn, h(xn ; v)) : R^m → R
  • For risk minimization, h(xn ; v) = h_{xn}(v) : R^m → Y
  • Training a parametric predictor with m real parameters is function optimization: v̂ ∈ arg min_{v ∈ R^m} LT(v)
  • Some v may be subject to constraints. We ignore those ML problems for now.
  • Other v may be integer-valued. We ignore those, too (combinatorial optimization).
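To make this concrete, here is a minimal Python sketch of LT(v) as a plain function of v; the predictor h, the loss, and the tiny dataset below are made up for illustration. This function of v is exactly the kind of black box that the rest of the deck calls f:

```python
import numpy as np

def empirical_risk(loss, h, v, X, y):
    """L_T(v) = (1/N) * sum over n of loss(y_n, h(x_n ; v))."""
    return np.mean([loss(yn, h(xn, v)) for xn, yn in zip(X, y)])

# Made-up ingredients: affine predictor h(x ; v) with v = (b, w), squared loss.
h = lambda x, v: v[0] + v[1:] @ x
loss = lambda y, yhat: (y - yhat) ** 2

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 2.0])
print(empirical_risk(loss, h, np.array([0.0, 1.0]), X, y))  # 0.0: a perfect fit
```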


SLIDE 4

Example

  • A binary linear classifier has decision boundary b + wᵀx = 0
  • So v = (b, w) ∈ R^(d+1), so that m = d + 1
  • Counterexample: Can you think of an ML method that does not involve v ∈ R^m?


SLIDE 5

Warning: Change of Notation

  • Optimization is used for much more than ML
  • Even in ML, there are more than risks to optimize
  • So we use “generic notation” for optimization
  • Function to be minimized: f(u) : R^m → R
  • More in keeping with the literature... except that we use u instead of x (too loaded for us!)
  • Minimizing f(u) is the same as maximizing −f(u)


SLIDE 6

Only Local Minimization

  • All we know about f is a “black box” (think Python function)
  • For many problems, f has many local minima
  • Start somewhere (u0), and take steps “down”: f(uk+1) < f(uk)

  • When we get stuck at a local minimum, we declare success
  • We would like global minima, but all we get is local ones
  • For some problems, f has a unique minimum...
  • ... or at least a single connected set of minima


SLIDE 7

Gradient, Hessian, and Convexity

Gradient

∇f(u) = ∂f/∂u = [∂f/∂u1 ... ∂f/∂um]ᵀ

  • ∇f(u) is the direction of fastest growth of f at u
  • If ∇f(u) exists everywhere, the condition ∇f(u) = 0 is necessary and sufficient for a stationary point (max, min, or saddle)
  • Warning: it is only necessary for a minimum!
  • Reduces to the first derivative for f : R → R


SLIDE 8

Gradient, Hessian, and Convexity

First Order Taylor Expansion

f(u) ≈ g1(u) = f(u0) + [∇f(u0)]ᵀ(u − u0)

approximates f(u) near u0 with a (hyper)plane through u0.

[Figure: surface plot of f(u) over (u1, u2); ∇f(u0) points in the direction of steepest increase of f at u0]

  • If we want to find u1 where f(u1) < f(u0), going along −∇f(u0) seems promising
  • This is the general idea of steepest descent


SLIDE 9

Gradient, Hessian, and Convexity

Hessian

H(u) = [ ∂²f/∂u1²    ···  ∂²f/∂u1∂um ]
       [    ⋮         ⋱       ⋮      ]
       [ ∂²f/∂um∂u1  ···  ∂²f/∂um²   ]

  • Symmetric matrix, because of Schwarz’s theorem: ∂²f/∂ui∂uj = ∂²f/∂uj∂ui
  • Eigenvalues are real because of symmetry
  • Reduces to d²f/du² for f : R → R
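To make the definitions concrete, here is a small numpy sketch (an addition, not from the slides) that estimates ∇f(u) and H(u) by central differences and checks them against a hand-computed example:

```python
import numpy as np

def numerical_gradient(f, u, h=1e-6):
    """Central-difference estimate of ∇f(u), one coordinate at a time."""
    g = np.zeros_like(u)
    for i in range(u.size):
        e = np.zeros_like(u); e[i] = h
        g[i] = (f(u + e) - f(u - e)) / (2 * h)
    return g

def numerical_hessian(f, u, h=1e-4):
    """Central-difference estimate of the m x m Hessian H(u)."""
    m = u.size
    H = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            ei = np.zeros(m); ei[i] = h
            ej = np.zeros(m); ej[j] = h
            H[i, j] = (f(u + ei + ej) - f(u + ei - ej)
                       - f(u - ei + ej) + f(u - ei - ej)) / (4 * h * h)
    return H

# Example: f(u) = u1^2 + 3 u1 u2 has gradient (2 u1 + 3 u2, 3 u1)
# and constant, symmetric Hessian [[2, 3], [3, 0]].
f = lambda u: u[0] ** 2 + 3 * u[0] * u[1]
print(numerical_gradient(f, np.array([1.0, 2.0])))               # ≈ [8., 3.]
H = numerical_hessian(f, np.array([1.0, 2.0]))
print(np.allclose(H, [[2.0, 3.0], [3.0, 0.0]], atol=1e-4))       # True
```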


SLIDE 10

Gradient, Hessian, and Convexity

Convexity

[Figure: graph of a convex function; for points u, u′ and z ∈ [0, 1], the chord value z f(u) + (1 − z) f(u′) lies above the function value f(z u + (1 − z) u′)]

  • Convex everywhere: for all u, u′ in the (open) domain of f and for all z ∈ [0, 1],
    f(z u + (1 − z) u′) ≤ z f(u) + (1 − z) f(u′)
  • Convex at u0: the function f is convex everywhere in some open neighborhood of u0


SLIDE 11

Gradient, Hessian, and Convexity

Convexity and Hessian

  • If H(u) is defined at a stationary point u of f, then u is a minimum only if H(u) ⪰ 0 (and is guaranteed to be one if H(u) ≻ 0)
  • “H(u) ⪰ 0” means positive semidefinite: uᵀHu ≥ 0 for all u ∈ R^m (this is the definition of H(u) ⪰ 0)
  • To check computationally: all eigenvalues of H(u) are nonnegative
  • H(u) ⪰ 0 reduces to d²f/du² ≥ 0 for f : R → R
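The eigenvalue check, as a minimal numpy sketch (the tolerance is an assumption, there to absorb round-off):

```python
import numpy as np

def is_positive_semidefinite(H, tol=1e-10):
    """H ⪰ 0 iff all eigenvalues of the symmetric matrix H are nonnegative."""
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))  # eigvalsh: symmetric H

print(is_positive_semidefinite(np.array([[2.0, 1.0], [1.0, 2.0]])))   # True
print(is_positive_semidefinite(np.array([[1.0, 0.0], [0.0, -1.0]])))  # False
```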


SLIDE 12

Gradient, Hessian, and Convexity

Second Order Taylor Expansion

f(u) ≈ g2(u) = f(u0) + [∇f(u0)]ᵀ(u − u0) + ½ (u − u0)ᵀH(u0)(u − u0)

approximates f(u) near u0 with a quadratic function through u0.

  • For minimization, this is useful only when H(u0) ⪰ 0
  • Function looks locally like a bowl

[Figure: bowl-shaped surface g2(u) over (u1, u2)]

  • If we want to find u1 where f(u1) < f(u0), going to the bottom of the bowl seems promising
  • This is the general idea of Newton’s method


SLIDE 13

A Local, Unconstrained Optimization Template

A Template

  • Regardless of method, most local unconstrained optimization methods fit the following template, given a starting point u0:

    k = 0
    while uk is not a minimum
        compute step direction pk
        compute step-size multiplier αk > 0
        uk+1 = uk + αk pk
        k = k + 1
    end
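As a Python sketch, the template looks as follows; direction_fn and step_size_fn are hypothetical plug-ins (each specific method supplies its own), and the stopping test is the step-size criterion of SLIDE 20:

```python
import numpy as np

def local_minimize(f, u0, direction_fn, step_size_fn, delta=1e-6, max_iters=10000):
    """Generic template: u_{k+1} = u_k + alpha_k * p_k until steps become tiny.

    direction_fn(u) returns the step direction p_k;
    step_size_fn(f, u, p) returns the multiplier alpha_k > 0.
    """
    u = np.asarray(u0, dtype=float)
    for k in range(max_iters):
        p = direction_fn(u)                     # compute step direction p_k
        alpha = step_size_fn(f, u, p)           # compute alpha_k > 0
        u_next = u + alpha * p                  # u_{k+1} = u_k + alpha_k p_k
        if np.linalg.norm(u_next - u) < delta:  # "uk is not a minimum" test
            return u_next                       # (termination, SLIDE 20)
        u = u_next
    return u
```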


SLIDE 14

A Local, Unconstrained Optimization Template

Design Decisions

  • Whether to stop (“while uk is not a minimum”)
  • In what direction to proceed (pk)
  • How long a step to take in that direction (αk)
  • Different decisions for the last two lead to different methods with very different behaviors and computational costs


SLIDE 15

Steepest Descent

Steepest Descent: Follow the Gradient

  • In what direction to proceed: pk = −∇f(uk)
  • “Steepest descent” or “gradient descent”
  • Problem reduces to one dimension: h(α) = f(uk + αpk)
  • α = 0 ⇒ u = uk
  • Find α = αk > 0 s.t. f(uk + αkpk) is a local minimum along the line

  • Line search (search along a line)
  • Q1: How to find αk?
  • Q2: Is this a good strategy?


SLIDE 16

Steepest Descent

Line Search

  • Bracketing triple: a < b < c with h(a) ≥ h(b) and h(b) ≤ h(c)
  • It contains a (local) minimum!
  • Split the bigger of [a, b] and [b, c] in half with a point z
  • Find a new, narrower bracketing triple involving z and two out of a, b, c
  • Stop when the bracket is narrow enough (say, 10⁻⁶)
  • This pins down a minimum to within 10⁻⁶


SLIDE 17

Steepest Descent

Phase 1: Find a Bracketing Triple

[Figure: h(α) versus α, probing to find a bracketing triple]


SLIDE 18

Steepest Descent

Phase 2: Shrink the Bracketing Triple

[Figure: h(α) versus α, shrinking the bracketing triple]


SLIDE 19

Steepest Descent

if b − a > c − b
    z = (a + b)/2
    if h(z) > h(b)
        (a, b, c) = (z, b, c)
    otherwise
        (a, b, c) = (a, z, b)
    end
otherwise
    z = (b + c)/2
    if h(z) > h(b)
        (a, b, c) = (a, b, z)
    otherwise
        (a, b, c) = (b, z, c)
    end
end
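The same loop as a runnable Python sketch (Phase 1, which produces the initial bracketing triple, is assumed to have been done already):

```python
def line_search(h, a, b, c, tol=1e-6):
    """Shrink a bracketing triple a < b < c with h(a) >= h(b) <= h(c)
    until c - a < tol; return the middle point b as the minimizer."""
    while c - a > tol:
        if b - a > c - b:          # split the bigger of [a, b] and [b, c]
            z = (a + b) / 2
            if h(z) > h(b):
                a = z              # (a, b, c) = (z, b, c)
            else:
                a, b, c = a, z, b
        else:
            z = (b + c) / 2
            if h(z) > h(b):
                c = z              # (a, b, c) = (a, b, z)
            else:
                a, b, c = b, z, c
    return b

# For example, with h(alpha) = (alpha - 1)^2 and the triple (0, 0.5, 2):
print(line_search(lambda alpha: (alpha - 1) ** 2, 0.0, 0.5, 2.0))  # ≈ 1.0
```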


SLIDE 20

Termination

Termination

  • Are we still making “significant progress”?
  • Check f(uk−1) − f(uk)? (We want this to be strictly positive)
  • Check ‖uk−1 − uk‖? (We want this to be large enough)
  • The second check is more stringent close to the minimum, where ∇f(u) ≈ 0 and f barely changes even across sizable steps
  • Stop when ‖uk−1 − uk‖ < δ


SLIDE 21

Termination

Is Steepest Descent a Good Strategy?

  • “We are going in the direction of fastest descent”
  • “We choose an optimal step by line search”
  • “Must be good, no?”

Not so fast! (Pun intended)

  • An example for which we know the answer:
    f(u) = c + aᵀu + ½ uᵀQu with Q ≻ 0 (convex paraboloid)
  • All smooth functions look like this close enough to u∗

[Figure: elliptical isocontours of f around the minimum u∗]
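A small numpy sketch of steepest descent on such a paraboloid (c, a, Q, and u0 below are made up; for a quadratic, the exact line-search step has the closed form αk = gᵀg / (gᵀQg), so no numerical line search is needed):

```python
import numpy as np

# Convex paraboloid f(u) = c + a^T u + 1/2 u^T Q u with ill-conditioned Q ≻ 0.
Q = np.array([[1.0, 0.0], [0.0, 20.0]])
a = np.zeros(2)                       # minimum at u* = -Q^{-1} a = (0, 0)

u = np.array([10.0, 1.0])             # starting point u0
for k in range(100):
    g = a + Q @ u                     # gradient at u_k
    if np.linalg.norm(g) < 1e-12:     # already at the minimum
        break
    alpha = (g @ g) / (g @ Q @ g)     # exact line search along p_k = -g
    u = u - alpha * g                 # zigzags across the elongated isocontours
print(u)                              # ≈ u* = (0, 0)
```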


SLIDE 22

Termination

Skating to a Minimum

[Figure: steepest-descent iterates zigzag from u0, starting along p0, toward the minimum u∗]


SLIDE 23

Termination

How to Measure Convergence Speed

  • Asymptotics (k → ∞) are what matters
  • If u∗ is the true solution, how does ‖uk+1 − u∗‖ compare with ‖uk − u∗‖ for large k?
  • Which converges faster:
    ‖uk+1 − u∗‖ ≈ β ‖uk − u∗‖¹  or  ‖uk+1 − u∗‖ ≈ β ‖uk − u∗‖² ?
  • Close to convergence these distances are small numbers, and
    ‖uk − u∗‖² ≪ ‖uk − u∗‖¹  [Example: (0.001)² ≪ (0.001)¹]


SLIDE 24

Termination

How to Measure Convergence Speed

  • A fast algorithm has a large exponent q in
    ‖uk+1 − u∗‖ ≈ β ‖uk − u∗‖^q when k is large
  • The order of convergence is the largest number q such that
    0 < lim_{k→∞} ‖uk+1 − u∗‖ / ‖uk − u∗‖^q < ∞
  • (The value of the limit, when finite, is β)
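A tiny numeric illustration with made-up error sequences ek = ‖uk − u∗‖, previewing the next two slides: with q = 1 and β = 0.1 the error gains about one digit per step, while with q = 2 and β = 1 the number of correct digits doubles:

```python
e_lin, e_quad = 0.1, 0.1     # start both error sequences at 0.1
for k in range(5):
    print(f"k={k}:  linear {e_lin:.0e}   quadratic {e_quad:.0e}")
    e_lin = 0.1 * e_lin      # q = 1, beta = 0.1: one more correct digit
    e_quad = e_quad ** 2     # q = 2, beta = 1: correct digits double
# By k = 4: linear error 1e-05, quadratic error 1e-16.
```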


SLIDE 25

Convergence Speed of Steepest Descent

Convergence Speed of Steepest Descent

  • Steepest descent has order of convergence q = 1 (“linear convergence rate”):
    0 < lim_{k→∞} ‖uk+1 − u∗‖ / ‖uk − u∗‖ < ∞
  • Hopefully, when q = 1, we have at least β < 1
  • Example: β = 0.1 gives ‖uk+1 − u∗‖ ≈ 0.1 ‖uk − u∗‖
  • Gain one correct decimal digit at every iteration


SLIDE 26

Convergence Speed of Newton’s Method

Newton’s Method

  • Newton’s method has a quadratic convergence rate:
    0 < lim_{k→∞} ‖uk+1 − u∗‖ / ‖uk − u∗‖² < ∞
  • Now things are OK even if β ≥ 1
  • Example: β = 1 gives ‖uk+1 − u∗‖ ≈ ‖uk − u∗‖²
  • Double the number of correct digits at every iteration
  • Very fast!


SLIDE 27

Newton’s Method

Newton’s Method

f(u) ≈ g2(u) = f(uk) + [∇f(uk)]ᵀ(u − uk) + ½ (u − uk)ᵀH(uk)(u − uk)

  • Check that H(uk) ≻ 0 (otherwise, fall back on steepest descent for that step)
  • Let ∆u = u − uk:
    f(u) ≈ g2(u) = f(uk) + [∇f(uk)]ᵀ∆u + ½ (∆u)ᵀH(uk)∆u

[Figure: bowl-shaped quadratic approximation g2(u) to f over (u1, u2)]

  • Solve H(uk)∆u = −∇f(uk) (jump to the bottom of the bowl)
  • Repeat
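One Newton step as a numpy sketch (the Cholesky test for H ≻ 0 and the plain-gradient fallback are assumptions of this sketch; the slides’ fallback would also run a line search on the steepest-descent direction):

```python
import numpy as np

def newton_step(grad, hess, u):
    """One step of Newton's method: solve H(u_k) du = -grad f(u_k)."""
    g, H = grad(u), hess(u)
    try:
        np.linalg.cholesky(H)        # succeeds iff H is positive definite
        du = np.linalg.solve(H, -g)  # jump to the bottom of the local bowl
    except np.linalg.LinAlgError:
        du = -g                      # H not ≻ 0: fall back on steepest descent
    return u + du

# On the quadratic of SLIDE 21, a single Newton step lands exactly on u*.
Q, a = np.array([[1.0, 0.0], [0.0, 20.0]]), np.zeros(2)
u1 = newton_step(lambda u: a + Q @ u, lambda u: Q, np.array([10.0, 1.0]))
print(u1)  # [0. 0.]
```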


SLIDE 28

Counting Steps versus Clocking

Counting Steps versus Clocking

  • Newton’s method finds direction and step at once (∆u)
  • No line search required
  • But we need to evaluate the gradient and the Hessian at every step: m + m(m+1)/2 derivatives!
  • ...and solve an m × m linear system
  • For example, with m = 10⁶ parameters that is already about 5 × 10¹¹ second derivatives per step
  • Asymptotic complexity depends on assumptions on arithmetic (exact, fixed-precision)
  • Practical times are significant, and naive algorithms are O(m³)
  • So, both storage space and computation time become prohibitive as m grows


SLIDE 29

Counting Steps versus Clocking

Bottom Line

  • Newton’s method takes few steps to converge... but each step is expensive
  • Advantageous when m is small
  • For bigger problems, use steepest descent
  • Compromises exist (e.g., conjugate gradients; see the Appendix in the notes)
  • For very big m, even steepest descent is too expensive
  • Stay tuned
