SLIDE 1
Derivative-Free Optimization of Noisy Functions via Quasi-Newton Methods: Experiments and Theory

Richard Byrd, University of Colorado, Boulder
Albert Berahas, Northwestern University
Jorge Nocedal, Northwestern University

Huatulco, Jan 2018

SLIDE 2

My thanks to the organizers. Greetings and thanks to Don Goldfarb.

SLIDE 3

Numerical Results

We compare our Finite Difference L-BFGS Method (FD-LM) to the model interpolation trust region method (MB) of Conn, Scheinberg, and Vicente. Their method, DFOtr, is:
  • a simple implementation
  • not designed for fast execution
  • does not include a geometry phase

Our goal is not to determine which method “wins”. Rather:
  1. Show that the FD-LM method is robust
  2. Show that FD-LM is not wasteful in function evaluations

SLIDE 4

Adaptive Finite Difference L-BFGS Method

Estimate noise ε_f
Compute h by forward or central differences   [4–8 function evaluations]
Compute g_k
While convergence test not satisfied:
    d_k = −H_k g_k                                [L-BFGS procedure]
    (x+, f+, flag) = LineSearch(x_k, f_k, g_k, d_k, f_s)
    if flag = 1                                   [line search failed]
        (x+, f+, h) = Recovery(x_k, f_k, g_k, d_k, maxiter)
    end if
    x_{k+1} = x+,  f_{k+1} = f+
    Compute g_{k+1}                               [finite differences using h]
    s_k = x_{k+1} − x_k,  y_k = g_{k+1} − g_k
    Discard (s_k, y_k) if s_k^T y_k ≤ 0
    k = k + 1
end while
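To make the finite-difference step concrete, here is a minimal Python sketch of a noise-aware gradient estimate in the spirit of the "Compute g_k [finite differences using h]" step above. The step-size rules h ≈ 2√ε_f (forward) and h ≈ ε_f^(1/3) (central) are standard simplifications that assume curvature of order one; the paper's exact formulas and its noise-estimation procedure are not reproduced here.

    import numpy as np

    def fd_gradient(f, x, eps_f, mode="forward"):
        """Finite-difference gradient with a noise-aware difference interval h.

        eps_f is an estimate of the noise level in f. Forward differences cost
        n extra evaluations per gradient; central differences cost 2n but are
        more accurate, which is the trade-off discussed on later slides.
        """
        n = len(x)
        g = np.zeros(n)
        fx = f(x)
        if mode == "forward":
            h = 2.0 * np.sqrt(eps_f)                     # assumes |f''| = O(1)
            for i in range(n):
                e = np.zeros(n); e[i] = h
                g[i] = (f(x + e) - fx) / h
        else:                                            # central differences
            h = eps_f ** (1.0 / 3.0)
            for i in range(n):
                e = np.zeros(n); e[i] = h
                g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
        return g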

SLIDE 5

Test problems

Plotting f(x_k) − φ* vs. number of function evaluations

We show results for 4 representative problems

SLIDE 6

Numerical Results – Stochastic Additive Noise

[Figure: four log-scale plots of F(x) − F* vs. number of function evaluations for problems s271 and s334, with stochastic additive noise levels 1e-02 and 1e-08; methods shown: DFOtr, FDLM (FD), FDLM (CD).]

f(x) = φ(x) + ε(x),   ε(x) ~ U(−ξ, ξ),   ξ ∈ [10⁻⁸, …, 10⁻¹]
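As an illustration of this noise model, the following small Python sketch builds a noisy version of a smooth objective; the quadratic φ below is only a placeholder, whereas the experiments use the test problems shown above (s271, s334, etc.).

    import numpy as np

    def make_noisy(phi, xi, rng=np.random.default_rng(0)):
        """Return f(x) = phi(x) + eps(x) with eps(x) ~ U(-xi, xi)."""
        def f(x):
            return phi(x) + rng.uniform(-xi, xi)
        return f

    # Example at noise level xi = 1e-2 (placeholder objective).
    phi = lambda x: 0.5 * float(np.dot(x, x))
    f = make_noisy(phi, xi=1e-2)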

SLIDE 7

Numerical Results – Stochastic Additive Noise (continued)

[Figure: four log-scale plots of F(x) − F* vs. number of function evaluations for problems s289 and s293, with stochastic additive noise levels 1e-02 and 1e-08; methods shown: DFOtr, FDLM (FD), FDLM (CD).]

f(x) = φ(x) + ε(x),   ε(x) ~ U(−ξ, ξ),   ξ ∈ [10⁻⁸, …, 10⁻¹]

SLIDE 8

Numerical Results – Stochastic Additive Noise – Performance Profiles

[Figure: two performance profiles (fraction of problems solved vs. performance ratio, log scale) at accuracy level τ = 10⁻⁵ for DFOtr, FDLM (FD), and FDLM (CD) under stochastic additive noise.]
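For reference, a performance profile of the Dolan–Moré type used on this slide can be computed as in the sketch below; the input matrix of evaluation counts and the convergence test behind it (accuracy τ) are assumptions, not the authors' benchmarking scripts.

    import numpy as np

    def performance_profile(evals):
        """evals[p, s] = function evaluations solver s needed on problem p
        to meet the convergence test (np.inf if it failed). Returns the grid
        of performance ratios and, for each solver, the fraction of problems
        solved within each ratio."""
        best = evals.min(axis=1, keepdims=True)          # best solver per problem
        ratios = evals / best                            # r_{p,s}
        grid = np.unique(ratios[np.isfinite(ratios)])
        profile = np.array([(ratios <= t).mean(axis=0) for t in grid])
        return grid, profile                             # plot profile vs. grid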

SLIDE 9

Numerical Results – Stochastic Multiplicative Noise – Performance Profiles

[Figure: two performance profiles (fraction of problems solved vs. performance ratio) at accuracy level τ = 10⁻⁵ for DFOtr, FDLM (FD), and FDLM (CD) under stochastic multiplicative noise.]

SLIDE 10

Numerical Results – Hybrid Method – Recovery Mechanism

  • As Jorge mentioned in Part I, our algorithm has a recovery mechanism
  • This procedure is very important for the stable performance of the method
  • The principal recovery mechanism is to re-estimate h
  • HYBRID METHOD: if h is acceptable, then we switch from forward to central differences (see the sketch below)
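The following is a hypothetical Python sketch of that recovery logic; the noise re-estimation, the interval formulas, and the acceptability test are placeholders standing in for the paper's actual Recovery procedure.

    import numpy as np

    def estimate_noise(f, x, m=8, delta=1e-6):
        # Crude stand-in for a noise estimate: sample f along a short segment
        # and take the spread of successive differences (the paper uses a
        # dedicated noise-estimation procedure instead).
        vals = np.array([f(x + i * delta * np.ones_like(x)) for i in range(m)])
        return float(np.std(np.diff(vals)))

    def recovery(f, x, h, mode):
        """Re-estimate the noise and the difference interval h; if the current
        h already looks acceptable, switch forward -> central (hybrid method)."""
        eps_f = estimate_noise(f, x)
        h_new = 2.0 * np.sqrt(eps_f) if mode == "forward" else eps_f ** (1.0 / 3.0)
        if 0.5 * h <= h_new <= 2.0 * h:       # placeholder acceptability test
            mode = "central"
        return h_new, mode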

SLIDE 11

Numerical Results – Hybrid FC Method – Stochastic Additive Noise

[Figure: four log-scale plots of F(x) − F* vs. number of function evaluations; panels: s267 and s241 with stochastic multiplicative noise 1e-02, s246 with stochastic additive noise 1e-06, s208 with stochastic additive noise 1e-08; methods shown: DFOtr, FDLM (FD), FDLM (CD), FDLM (HYBRID).]

SLIDE 12

Numerical Results – Hybrid FC Method – Stochastic Multiplicative Noise

[Figure: two log-scale plots of F(x) − F* vs. number of function evaluations for problems s267 and s241 with stochastic multiplicative noise 1e-02; methods shown: DFOtr, FDLM (FD), FDLM (CD), FDLM (HYBRID).]

SLIDE 13

Numerical Results – Conclusions

  • Both methods are fairly reliable
  • The FD-LM method is not wasteful in terms of function evaluations
  • No method dominates
  • Central differences appear to be more reliable, but are twice as expensive per iteration
  • The hybrid approach shows promise

SLIDE 14

Convergence analysis

1. What can we prove about the algorithm proposed here?
2. We first note that there is a theory for the Implicit Filtering method of Kelley, which is a finite difference BFGS method
  • He establishes deterministic convergence guarantees to the solution
  • Possible because it is assumed that the noise can be diminished as needed at every iteration
  • Similar to results on sampling methods for stochastic objectives
3. In our analysis we assume that the noise does not go to zero
  • We prove convergence to a neighborhood of the solution whose radius depends on the noise level in the function
  • Results of this type were pioneered by Nedic and Bertsekas for incremental gradient methods with constant steplengths
4. We prove two sets of results for strongly convex functions
  • Fixed steplength
  • Armijo line search
5. Up to now, there has been little analysis of line search with noise

SLIDE 15

Discussion

1. The algorithm proposed here is complex, particularly if the recovery mechanism is included
2. The effect that noisy function evaluations and finite difference gradient approximations have on the line search is difficult to analyze
3. In fact, the study of stochastic line searches is one of our current research projects
4. How should results be stated?
  • in expectation?
  • in probability?
  • what assumptions on the noise are realistic?
  • some results in the literature assume the true function value φ(x) is available
  • this field is emerging

SLIDE 16

Context of our analysis

1. We will bypass these thorny issues by assuming that
  • the noise in the function and in the gradient is bounded:  ‖ε(x)‖ ≤ C_f,  ‖e(x)‖ ≤ C_g
  • and consider a general gradient method with errors

        x_{k+1} = x_k − α_k H_k g_k

  • g_k is any approximation to the gradient
  • it could stand for a finite difference approximation or some other estimate; the treatment is general
  • to highlight the novel aspects of this analysis we assume H_k = I

SLIDE 17

Fixed Steplength Analysis

Iteration:   x_{k+1} = x_k − α g_k

Recall  f(x) = φ(x) + ε(x);   define  g_k = ∇φ(x_k) + e(x_k)

Assume  μI ≺ ∇²φ(x) ≺ LI   and   ‖e(x)‖ ≤ C_g

  • Theorem. If α < 1/L, then for all k

        φ(x_{k+1}) − φ_N ≤ (1 − αμ)[φ(x_k) − φ_N],

    where  φ_N ≡ φ* + C_g²/(2μ)  is the best possible objective value.

Therefore,

        φ(x_k) − φ* ≤ (1 − αμ)^k [φ(x_0) − φ_N] + C_g²/(2μ).
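A quick numerical illustration of this theorem (not from the slides; the quadratic, the noise level C_g, and the steplength below are arbitrary choices) shows the iterates settling at the noise-determined level φ_N rather than at φ*:

    import numpy as np

    # Strongly convex quadratic phi(x) = 0.5 x^T A x with mu <= eig(A) <= L,
    # minimized with gradients corrupted by an error of norm C_g.
    rng = np.random.default_rng(0)
    n, mu, L, Cg, alpha = 10, 1.0, 10.0, 1e-2, 0.09   # alpha < 1/L
    A = np.diag(np.linspace(mu, L, n))
    phi = lambda x: 0.5 * x @ A @ x                   # phi* = 0 at x = 0

    x = np.ones(n)
    for k in range(200):
        e = rng.normal(size=n)
        e *= Cg / np.linalg.norm(e)                   # ||e_k|| = C_g (worst case)
        g = A @ x + e                                 # g_k = grad phi(x_k) + e_k
        x = x - alpha * g                             # x_{k+1} = x_k - alpha g_k

    print(phi(x), "vs noise floor phi_N - phi* =", Cg**2 / (2 * mu))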

SLIDE 18

Idea behind the proof
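
The figure from this slide is not reproduced here. One standard argument that yields the contraction on the previous slide, under the assumptions stated there, is the following sketch (not necessarily the exact argument illustrated on the original slide):

    \begin{aligned}
    \phi(x_{k+1})
      &\le \phi(x_k) - \alpha \nabla\phi(x_k)^T g_k + \tfrac{L}{2}\alpha^2 \|g_k\|^2
        && (\text{Lipschitz gradient, } x_{k+1} = x_k - \alpha g_k) \\
      &\le \phi(x_k) - \tfrac{\alpha}{2}\|\nabla\phi(x_k)\|^2 + \tfrac{\alpha}{2}\|e_k\|^2
        && (g_k = \nabla\phi(x_k) + e_k,\ \alpha \le 1/L) \\
      &\le \phi(x_k) - \alpha\mu\,[\phi(x_k)-\phi^*] + \tfrac{\alpha}{2}C_g^2
        && (\|\nabla\phi(x_k)\|^2 \ge 2\mu[\phi(x_k)-\phi^*],\ \|e_k\|\le C_g).
    \end{aligned}

Subtracting φ_N = φ* + C_g²/(2μ) from both sides and collecting terms gives φ(x_{k+1}) − φ_N ≤ (1 − αμ)[φ(x_k) − φ_N].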

SLIDE 19

Line Search

Our algorithm uses a line search:
  • Move away from fixed steplengths and exploit the power of line searches
  • Very little work on noisy line searches
  • How should sufficient decrease be defined?

Introduce a new (relaxed) Armijo condition:

        f(x_k + α d_k) ≤ f(x_k) + c_1 α g_k^T d_k + ε_A,

where  α = max{1, τ, τ², …}  and  ε_A > 2C_f.
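A minimal Python sketch of a backtracking search enforcing this relaxed Armijo condition is shown below; it covers only the acceptance test on this slide, not the full LineSearch/Recovery logic of the algorithm, and the default constants are illustrative.

    def relaxed_armijo(f, x, fx, g, d, eps_A, c1=1e-4, tau=0.5, max_backtracks=30):
        """Backtrack over steplengths {1, tau, tau^2, ...} until
        f(x + a d) <= f(x) + c1 * a * g^T d + eps_A, where eps_A > 2*C_f
        absorbs the function noise. Returns (steplength, success_flag)."""
        slope = float(g @ d)                     # g_k^T d_k, expected negative
        a = 1.0
        for _ in range(max_backtracks):
            if f(x + a * d) <= fx + c1 * a * slope + eps_A:
                return a, True                   # sufficient decrease achieved
            a *= tau
        return a, False                          # failure -> trigger Recovery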

SLIDE 20

Line Search Analysis

New Armijo condition:

        f(x_k + α d_k) ≤ f(x_k) + c_1 α g_k^T d_k + ε_A,

where  α = max{1, τ, τ², …}  and  ε_A > 2C_f.

Because of the relaxation term, the Armijo condition is always satisfied for α ≪ 1. But how long will the step be? Consider 2 sets of iterates:

Case 1: The gradient error is small relative to the gradient. A step of 1/L is accepted, and good progress is made.
Case 2: The gradient error is large relative to the gradient. The step could be poor, but the size of the step is only of order C_g.

SLIDE 21

Line Search Analysis

Iteration:   x_{k+1} = x_k − α_k g_k

Assume  μI ≺ ∇²φ(x) ≺ LI,   ‖e(x)‖ ≤ C_g   and   ‖ε(x)‖ ≤ C_f

Theorem: The above algorithm with the relaxed Armijo condition and c_1 < 1/2 gives

        φ(x_{k+1}) − φ_N ≤ ρ[φ(x_k) − φ_N],

where

        ρ = 1 − 2μ c_1 τ (1−β)² / L

and

        φ_N = φ* + (1/(1−ρ)) [ c_1 τ (1−β)² C_g² / (L β²) + ε_A + 2C_f ].

Here β is a free parameter in (0, (1−2c_1)/(1+2c_1)].

SLIDE 22

THANK YOU.