
SLIDE 1

Motivation Contemporary Analyses Partitioning Regularization Methods Summary

How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization

Frank E. Curtis, Lehigh University; joint work with Daniel P. Robinson, Johns Hopkins University

U.S.-Mexico Workshop on Optimization and its Applications

8 January 2018

How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization 1 of 32

SLIDE 2

Thanks, Don!

SLIDE 3

Outline

Motivation
Contemporary Analyses
Partitioning the Search Space
Behavior of Regularization Methods
Summary & Perspectives

SLIDE 4

Outline

Motivation
Contemporary Analyses
Partitioning the Search Space
Behavior of Regularization Methods
Summary & Perspectives

SLIDE 5

History

Nonlinear optimization has had parallel developments:

◮ convexity (Rockafellar, Fenchel, Nemirovski, Nesterov): the subgradient inequality; convergence and complexity guarantees
◮ smoothness (Powell, Fletcher, Goldfarb, Nocedal): sufficient decrease; convergence and fast local convergence

Worlds are (finally) colliding!

SLIDE 6

Worst-case complexity for nonconvex optimization

Here is how we do it now, assuming Lipschitz continuity of derivatives: what is an upper bound on the number of iterations until ‖∇f(x_k)‖_2 ≤ ε?

Gradient descent: O(ε^{-2})
Newton / trust region: O(ε^{-2})
Cubic regularization: O(ε^{-3/2})

SLIDE 7

Self-examination

But...

◮ Is this the best way to characterize our algorithms?
◮ Is this the best way to represent our algorithms?

SLIDE 8

Self-examination

But...

◮ Is this the best way to characterize our algorithms?
◮ Is this the best way to represent our algorithms?

People listen! Cubic regularization. . .

◮ Griewank (1981)
◮ Nesterov & Polyak (2006)
◮ Weiser, Deuflhard, Erdmann (2007)
◮ Cartis, Gould, Toint (2011), the ARC method

. . . is a framework to which researchers have been attracted. . .

◮ Agarwal, Allen-Zhu, Bullins, Hazan, Ma (2017)
◮ Carmon, Duchi (2017)
◮ Kohler, Lucchi (2017)
◮ Peng, Roosta-Khorasani, Mahoney (2017)

However, there remains a large gap between theory and practice!

SLIDE 9

Purpose of this talk

Our goal: A complementary approach to characterize algorithms.

◮ global convergence
◮ worst-case complexity, contemporary type + our approach
◮ local convergence rate

SLIDE 10

Purpose of this talk

Our goal: A complementary approach to characterize algorithms.

◮ global convergence
◮ worst-case complexity, contemporary type + our approach
◮ local convergence rate

We’re admitting: Our approach does not give the complete picture. But we believe it is useful!

SLIDE 11

Purpose of this talk

Our goal: A complementary approach to characterize algorithms.

◮ global convergence
◮ worst-case complexity, contemporary type + our approach
◮ local convergence rate

We’re admitting: Our approach does not give the complete picture. But we believe it is useful! Nonconvexity is difficult in every sense!

◮ Can we accept a characterization strategy with some (literal) holes?
◮ Or should we be purists, even if we throw out the baby with the bathwater?

SLIDE 12

Outline

Motivation
Contemporary Analyses
Partitioning the Search Space
Behavior of Regularization Methods
Summary & Perspectives

SLIDE 13

Simple setting

Consider the iteration

x_{k+1} ← x_k − (1/L) g_k for all k ∈ ℕ,

where g_k := ∇f(x_k) and g := ∇f.

A contemporary complexity analysis considers the set

G(ε_g) := {x ∈ ℝ^n : ‖g(x)‖_2 ≤ ε_g}

and aims to find an upper bound on the cardinality of

K_g(ε_g) := {k ∈ ℕ : x_k ∉ G(ε_g)}.

SLIDE 14

Upper bound on |K_g(ε_g)|

Using s_k = −(1/L) g_k and the upper bound

f_{k+1} ≤ f_k + g_k^T s_k + (L/2)‖s_k‖_2^2,

one finds, with f_inf := inf_{x ∈ ℝ^n} f(x), that

f_k − f_{k+1} ≥ (1/(2L)) ‖g_k‖_2^2
⟹ f_0 − f_inf ≥ (1/(2L)) |K_g(ε_g)| ε_g^2
⟹ |K_g(ε_g)| ≤ 2L(f_0 − f_inf) ε_g^{-2}.
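The chain of inequalities above is easy to check numerically. Below is a minimal sketch (not from the slides; the test function f(x) = cos(x), the starting point, and the tolerance are our own choices) that runs the iteration x_{k+1} ← x_k − (1/L)g_k and verifies that the number of iterations with ‖g_k‖_2 > ε_g stays below 2L(f_0 − f_inf)ε_g^{-2}.

```python
import math

def f(x): return math.cos(x)     # smooth, nonconvex, f_inf = -1
def g(x): return -math.sin(x)    # gradient; Lipschitz constant L = 1

L, eps_g, x = 1.0, 1e-2, 1.0
f0, f_inf = f(x), -1.0

large_grad_iters = 0
for k in range(100000):
    if abs(g(x)) <= eps_g:       # stop once the gradient is small
        break
    large_grad_iters += 1
    x = x - (1.0 / L) * g(x)     # x_{k+1} <- x_k - (1/L) g_k

# |K_g(eps_g)| <= 2 L (f0 - f_inf) eps_g^{-2}
bound = 2.0 * L * (f0 - f_inf) * eps_g ** -2
print(large_grad_iters, "<=", bound)
```

On this example the iteration needs only a handful of steps, far below the worst-case bound of roughly 3×10^4, already hinting at how pessimistic the bound can be.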

SLIDE 15

“Nice” f

But what if f is “nice”? E.g., satisfying the Polyak–Łojasiewicz (PL) condition for some c ∈ (0, ∞), i.e.,

f(x) − f_inf ≤ (1/(2c)) ‖g(x)‖_2^2 for all x ∈ ℝ^n.

Now consider the set

F(ε_f) := {x ∈ ℝ^n : f(x) − f_inf ≤ ε_f}

and consider an upper bound on the cardinality of

K_f(ε_f) := {k ∈ ℕ : x_k ∉ F(ε_f)}.
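As a concrete illustration of the PL condition (our own example, not from the slides): f(x) = x² + 3 sin²(x) is a standard nonconvex function that nonetheless satisfies PL. The sketch below estimates an admissible constant c empirically by sampling the ratio ‖g(x)‖_2^2 / (2(f(x) − f_inf)) on a grid.

```python
import math

def f(x): return x * x + 3.0 * math.sin(x) ** 2     # nonconvex but PL; f_inf = 0 at x = 0
def g(x): return 2.0 * x + 3.0 * math.sin(2.0 * x)  # f'(x)

f_inf = 0.0
ratios = []
for i in range(4001):
    x = -10.0 + 0.005 * i                           # grid over [-10, 10]
    gap = f(x) - f_inf
    if gap > 1e-12:                                 # skip the minimizer itself
        # PL requires this ratio to be bounded below by some c > 0
        ratios.append(g(x) ** 2 / (2.0 * gap))

c_empirical = min(ratios)
print("empirical PL constant over the grid:", c_empirical)
```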

SLIDE 16

Upper bound on |K_f(ε_f)|

Using s_k = −(1/L) g_k and the upper bound

f_{k+1} ≤ f_k + g_k^T s_k + (L/2)‖s_k‖_2^2,

one finds that

f_k − f_{k+1} ≥ (1/(2L)) ‖g_k‖_2^2 ≥ (c/L)(f_k − f_inf)
⟹ (1 − c/L)(f_k − f_inf) ≥ f_{k+1} − f_inf
⟹ (1 − c/L)^k (f_0 − f_inf) ≥ f_k − f_inf
⟹ |K_f(ε_f)| ≤ log((f_0 − f_inf)/ε_f) · [log(L/(L − c))]^{-1}.
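The geometric decrease derived above can be observed directly. A small sketch (our own parameter choices, anticipating the example two slides ahead: f(x) = x²/2, x_0 = 10, c = 1, L = 2) counts the iterations with f(x_k) − f_inf > ε_f and compares against the log bound:

```python
import math

# f(x) = x^2/2 satisfies PL with c = 1 (f(x) - f_inf = g(x)^2/2); take L = 2 >= f''.
def f(x): return 0.5 * x * x
def g(x): return x

L, c, x, f_inf = 2.0, 1.0, 10.0, 0.0
eps_f = 1e-4
f0 = f(x)

count = 0                                  # iterations with f(x_k) - f_inf > eps_f
while f(x) - f_inf > eps_f:
    count += 1
    x = x - (1.0 / L) * g(x)

# |K_f(eps_f)| <= log((f0 - f_inf)/eps_f) / log(L/(L - c))
bound = math.log((f0 - f_inf) / eps_f) / math.log(L / (L - c))
print(count, "<=", bound)                  # → 10 <= ~18.9
```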

SLIDE 17

For the first step. . .

In the “general nonconvex” analysis, the expected decrease for the first step is much more pessimistic:

general nonconvex: f_0 − f_1 ≥ (1/(2L)) ε_g^2
PL condition: (1 − c/L)(f_0 − f_inf) ≥ f_1 − f_inf

...and it remains more pessimistic throughout!

SLIDE 18

Upper bounds on |K_f(ε_f)| versus |K_g(ε_g)|

Let f(x) = (1/2)x^2, meaning that g(x) = x.

◮ Let ε_f = (1/2)ε_g^2, meaning that F(ε_f) = G(ε_g).
◮ Let x_0 = 10, c = 1, and L = 2. (Similar pictures for any L > 1.)
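The gap between the two bounds for this example can be tabulated directly (a sketch using only the constants stated above):

```python
import math

# Slide example: f(x) = x^2/2, g(x) = x, x0 = 10, c = 1, L = 2,
# with eps_f = eps_g^2 / 2 so that F(eps_f) = G(eps_g).
x0, c, L = 10.0, 1.0, 2.0
f0, f_inf = 0.5 * x0 * x0, 0.0

for eps_g in (1e-1, 1e-2, 1e-3):
    eps_f = 0.5 * eps_g ** 2
    general = 2.0 * L * (f0 - f_inf) * eps_g ** -2                  # O(eps^-2) bound
    pl = math.log((f0 - f_inf) / eps_f) / math.log(L / (L - c))     # log bound under PL
    # e.g. for eps_g = 0.01: general bound ~2e6 vs PL bound ~19.9
    print(f"eps_g={eps_g:g}  general bound={general:.0f}  PL bound={pl:.1f}")
```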

SLIDE 19

Upper bounds on |K_f(ε_f)| versus |{k ∈ ℕ : (1/2)‖g_k‖_2^2 > ε_g}|

Let f(x) = (1/2)x^2, meaning that (1/2)g(x)^2 = (1/2)x^2.

◮ Let ε_f = ε_g, meaning that F(ε_f) = G(ε_g).
◮ Let x_0 = 10, c = 1, and L = 2. (Similar pictures for any L > 1.)

SLIDE 20

Bad worst-case!

Worst-case complexity bounds in the general nonconvex case are very pessimistic.

◮ The analysis immediately admits a large gap when the function is nice.
◮ The “essentially tight” examples for the worst-case bounds are... weird.¹

¹ Cartis, Gould, Toint (2010)

SLIDE 21

Plea

Let’s not have these be the problems that dictate how we

◮ characterize our algorithms and
◮ represent our algorithms to the world!

SLIDE 22

Outline

Motivation
Contemporary Analyses
Partitioning the Search Space
Behavior of Regularization Methods
Summary & Perspectives

SLIDE 23

Motivation

We want a characterization strategy that

◮ attempts to capture behavior in actual practice
  ◮ i.e., is not “bogged down” by pedagogical examples
◮ can be applied consistently across different classes of functions
◮ shows more than just the worst of the worst case

SLIDE 24

Motivation

We want a characterization strategy that

◮ attempts to capture behavior in actual practice
  ◮ i.e., is not “bogged down” by pedagogical examples
◮ can be applied consistently across different classes of functions
◮ shows more than just the worst of the worst case

Our idea is to

◮ partition the search space (dependent on f and x_0)
◮ analyze how an algorithm behaves over different regions
◮ characterize an algorithm’s behavior by region

For some functions, there will be holes, but for some of interest there are none!

SLIDE 25

Intuition

Think about an arbitrary point in the search space, i.e., in L := {x ∈ ℝ^n : f(x) ≤ f(x_0)}.

◮ If ‖g(x)‖_2 ≫ 0, then “a lot” of progress can be made.
◮ If min(eig(∇²f(x))) ≪ 0, then “a lot” of progress can also be made.

SLIDE 26

Assumption

Assumption 1

◮ f is p̄-times continuously differentiable
◮ f is bounded below by f_inf := inf_{x ∈ ℝ^n} f(x)
◮ for all p ∈ {1, ..., p̄}, there exists L_p ∈ (0, ∞) such that

f(x + s) ≤ t_p(x, s) + (L_p/(p + 1)) ‖s‖_2^{p+1},

where t_p(x, s) := f(x) + Σ_{j=1}^{p} (1/j!) ∇^j f(x)[s]^j is the pth-order Taylor expansion of f at x.

SLIDE 27

pth-order term reduction

Definition 2

For each p ∈ {1, ..., p̄}, define the function

m_p(x, s) := (1/p!) ∇^p f(x)[s]^p + (r_p/(p + 1)) ‖s‖_2^{p+1}.

Letting s_{m_p}(x) := arg min_{s ∈ ℝ^n} m_p(x, s), the reduction in the pth-order term from x is

Δm_p(x) := m_p(x, 0) − m_p(x, s_{m_p}(x)) ≥ 0.

*The exact definition of r_p is not complicated, but we skip it here.
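For p = 1 the reduction has a simple closed form: m_1(x, s) = ∇f(x)^T s + (r_1/2)‖s‖_2^2 is minimized by s = −∇f(x)/r_1, giving Δm_1(x) = ‖∇f(x)‖_2^2 / (2 r_1). The sketch below checks this (the gradient vector and the value of r_1 are arbitrary stand-ins):

```python
import random

random.seed(0)
r1 = 2.0
gx = [3.0, -4.0]                       # stand-in gradient ∇f(x); ||g||_2 = 5

def m1(s):                             # m_1(x, s) = g^T s + (r1/2) ||s||^2
    return sum(gi * si for gi, si in zip(gx, s)) + 0.5 * r1 * sum(si * si for si in s)

# Closed form: minimizer s* = -g/r1, reduction Δm_1 = ||g||^2 / (2 r1)
s_star = [-gi / r1 for gi in gx]
delta_m1 = sum(gi * gi for gi in gx) / (2.0 * r1)

assert abs((m1([0.0, 0.0]) - m1(s_star)) - delta_m1) < 1e-12
for _ in range(1000):                  # no sampled s does better than s*
    s = [random.uniform(-5, 5), random.uniform(-5, 5)]
    assert m1(s) >= m1(s_star) - 1e-12
print("Δm1 =", delta_m1)               # → 25 / 4 = 6.25
```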

SLIDE 28

Regions

We propose to partition the search space, given (κ, f_ref) ∈ (0, 1) × [f_inf, f(x_0)), into

R_1 := {x ∈ L : Δm_1(x) ≥ κ(f(x) − f_ref)},

R_p := {x ∈ L : Δm_p(x) ≥ κ(f(x) − f_ref)} \ ⋃_{j=1}^{p−1} R_j for all p ∈ {2, ..., p̄},

and R := L \ ⋃_{j=1}^{p̄} R_j.

*We don’t need f_ref = f_inf, but, for simplicity, think of it that way here.
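In one dimension the partition can be computed explicitly, since Δm_1(x) = g(x)²/(2 r_1) and, for p = 2, Δm_2(x) = max(0, −f″(x))³/(6 r_2²). The sketch below classifies a grid of points in the sublevel set; the function f(x) = x⁴ − x², the constants κ, r_1, r_2, and the choice f_ref = f_inf are all our own, and for this strict-saddle-like function every grid point lands in R_1 or R_2.

```python
# Classify points of L = {x : f(x) <= f(x0)} for f(x) = x^4 - x^2 using
#   Δm_1(x) = g(x)^2 / (2 r1),  Δm_2(x) = max(0, -h(x))^3 / (6 r2^2),
# where g = f', h = f''. (Δm_2 = 0 whenever h(x) >= 0.)
def f(x): return x**4 - x**2
def gr(x): return 4.0 * x**3 - 2.0 * x
def h(x): return 12.0 * x**2 - 2.0

f_inf, x0 = -0.25, 1.2                     # minimizers at ±1/sqrt(2); saddle at 0
kappa, r1, r2 = 0.1, 1.0, 1.0

n_R1 = n_R2 = n_rest = 0
for i in range(2401):
    x = -1.2 + 0.001 * i
    if f(x) > f(x0):                       # outside the sublevel set L
        continue
    gap = kappa * (f(x) - f_inf)
    dm1 = gr(x)**2 / (2.0 * r1)
    dm2 = max(0.0, -h(x))**3 / (6.0 * r2**2)
    if dm1 >= gap:
        n_R1 += 1                          # large-gradient region
    elif dm2 >= gap:
        n_R2 += 1                          # negative-curvature region
    else:
        n_rest += 1                        # a "hole"

print("R1:", n_R1, " R2:", n_R2, " unclassified:", n_rest)
```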

SLIDE 29

Functions satisfying Polyak–Łojasiewicz

Theorem 3

A continuously differentiable f with a Lipschitz continuous gradient satisfies the Polyak–Łojasiewicz condition if and only if R_1 = L for any x_0 ∈ ℝ^n.

Hence, if we prove something about the behavior of an algorithm over R_1, then

◮ we know how it behaves if f satisfies PL and
◮ we know how it behaves at any point satisfying the PL inequality.

SLIDE 30

Functions satisfying a strict-saddle-type property

Theorem 4

If f is twice-continuously differentiable with Lipschitz continuous gradient and Hessian functions such that, at all x ∈ L and for some ζ ∈ (0, ∞), one has

max{‖∇f(x)‖_2^2, (−λ_min(∇²f(x)))^3} ≥ ζ(f(x) − f_inf),

then R_1 ∪ R_2 = L.

SLIDE 31

Outline

Motivation
Contemporary Analyses
Partitioning the Search Space
Behavior of Regularization Methods
Summary & Perspectives

SLIDE 32

Linearly convergent behavior over R_p

Let s_{w_p}(x) be a minimum-norm global minimizer of the regularized Taylor model

w_p(x, s) := t_p(x, s) + (l_p/(p + 1)) ‖s‖_2^{p+1}.

Theorem 5

If {x_k} is generated by the iteration x_{k+1} ← x_k + s_{w_p}(x_k), then, with ε_f ∈ (0, f(x_0) − f_ref), the number of iterations in R_p ∩ {x ∈ ℝ^n : f(x) − f_ref ≥ ε_f} is bounded above by

log((f(x_0) − f_ref)/ε_f) · [log(1/(1 − κ))]^{-1} = O(log((f(x_0) − f_ref)/ε_f)).

SLIDE 33

Characterization: Contemporary

Let RG and RN represent regularized gradient and Newton, respectively.

Theorem 6

With p̄ ≥ 2, let K_1(ε_g) := {k ∈ ℕ : ‖∇f(x_k)‖_2 > ε_g} and K_2(ε_H) := {k ∈ ℕ : λ_min(∇²f(x_k)) < −ε_H}. Then, the cardinalities of K_1(ε_g) and K_2(ε_H) are of the order...

Algorithm   |K_1(ε_g)|                                  |K_2(ε_H)|
RG          O(l_1 (f(x_0) − f_inf) ε_g^{-2})            –
RN          O(l_2^{1/2} (f(x_0) − f_inf) ε_g^{-3/2})    O(l_2^2 (f(x_0) − f_inf) ε_H^{-3})

SLIDE 34

Characterization: Our approach

Theorem 7

The numbers of iterations in R_1 and R_2 with f_ref = f_inf are of the order...

Algorithm   R_1                                                                R_2
RG          O(log((f(x_0) − f_inf)/ε_f))                                       –
RN          O(l_2^2 (f(x_0) − f_inf)/r_1^3) + O(log((f(x_0) − f_inf)/ε_f))     O(log((f(x_0) − f_inf)/ε_f))

There is an initial phase, as seen in Nesterov & Polyak (2006).

SLIDE 35

Characterization: Our approach

Theorem 7

The numbers of iterations in R_1 and R_2 with f_ref = f_inf are of the order...

Algorithm   R_1                                                                R_2
RG          O(log((f(x_0) − f_inf)/ε_f))                                       –
RN          O(l_2^2 (f(x_0) − f_inf)/r_1^3) + O(log((f(x_0) − f_inf)/ε_f))     O(log((f(x_0) − f_inf)/ε_f))

There is an initial phase, as seen in Nesterov & Polyak (2006).

An ∞ can appear, but one could consider probabilistic bounds, too.

SLIDE 36

Outline

Motivation
Contemporary Analyses
Partitioning the Search Space
Behavior of Regularization Methods
Summary & Perspectives

SLIDE 37

Summary & Perspectives

Our goal: A complementary approach to characterize algorithms.

◮ global convergence
◮ worst-case complexity, contemporary type + our approach
◮ local convergence rate

Our idea is to

◮ partition the search space (dependent on f and x_0)
◮ analyze how an algorithm behaves over different regions
◮ characterize an algorithm’s behavior by region

For some functions, there are holes, but for others the characterization is complete.
