SLIDE 1

Robust control for analysis and design of large-scale optimization algorithms

Laurent Lessard
University of Wisconsin–Madison

Joint work with Ben Recht and Andy Packard

LCCC Workshop on Large-Scale and Distributed Optimization, Lund University, June 15, 2017

SLIDE 2

1. Many algorithms can be viewed as dynamical systems with feedback (control systems!): algorithm convergence ⇔ system stability.
2. By solving a small convex program, we can recover state-of-the-art convergence results for these algorithms, automatically and efficiently.
3. The ultimate goal: to move from analysis to design.

SLIDE 3

Unconstrained optimization: minimize f(x) subject to x ∈ ℝᴺ

• need algorithms that are fast and simple
• currently favored family: first-order methods

SLIDE 4

Gradient method

xk+1 = xk − α∇ f(xk)

Heavy ball method

xk+1 = xk − α∇ f(xk) + β(xk − xk−1)

Nesterov’s accelerated method

yk = xk + β(xk − xk−1)
xk+1 = yk − α∇f(yk)

[Figure: iterates x0, x1, … on the contours of a quadratic f(x), and error vs. iteration count for the three methods.]
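For concreteness, here is a minimal sketch of the three iterations in Python; the quadratic f, stepsize α, and momentum β below are illustrative choices, not values from the slides.

```python
# Minimal sketch of the three first-order methods on a toy quadratic
# f(x) = 0.5 * x' Q x. Q, alpha, beta below are illustrative choices.
import numpy as np

Q = np.diag([1.0, 10.0])          # eigenvalues give m = 1, L = 10
grad = lambda x: Q @ x            # gradient of the quadratic

def gradient_method(x, alpha, iters=100):
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

def heavy_ball(x, alpha, beta, iters=100):
    x_prev = x
    for _ in range(iters):
        x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
    return x

def nesterov(x, alpha, beta, iters=100):
    x_prev = x
    for _ in range(iters):
        y = x + beta * (x - x_prev)
        x, x_prev = y - alpha * grad(y), x
    return x

x0 = np.array([1.0, 1.0])
print(gradient_method(x0, alpha=2 / 11))   # alpha = 2/(L+m)
```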

SLIDE 5

Robust algorithm selection

G ∈ G: algorithm we’re going to use
f ∈ S: function we’d like to minimize

G_opt = argmin_{G ∈ G} max_{f ∈ S} cost(f, G)

Similar problem for a finite number of iterations:
• Drori, Teboulle (2012)
• Taylor, Hendrickx, Glineur (2016)

SLIDE 6

G ∈ G:

Gradient method: xk+1 = xk − α∇f(xk)
Heavy ball method: xk+1 = xk − α∇f(xk) + β(xk − xk−1)
Nesterov’s accelerated method: xk+1 = xk − α∇f(xk + β(xk − xk−1)) + β(xk − xk−1)

f ∈ S: analytically solvable case — quadratic functions

f(x) = ½ xᵀQx − pᵀx  with the constraint mI ⪯ Q ⪯ LI
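On quadratics each method is a linear recursion, so the worst-case rate can be computed as a spectral radius over the eigenvalues q ∈ [m, L] of Q. A sketch for heavy ball (the helper name, the grid over q, and the tuning are mine):

```python
# Sketch: worst-case linear rate of heavy ball on quadratics with
# spectrum in [m, L], computed as a spectral radius per eigenvalue q.
import numpy as np

def heavy_ball_rate(alpha, beta, m, L, grid=1000):
    rate = 0.0
    for q in np.linspace(m, L, grid):        # eigenvalues of Q
        T = np.array([[1 + beta - alpha * q, -beta],
                      [1.0,                   0.0]])
        rate = max(rate, max(abs(np.linalg.eigvals(T))))
    return rate

m, L = 1.0, 10.0
alpha = 4 / (np.sqrt(L) + np.sqrt(m)) ** 2   # standard heavy ball tuning
beta = ((np.sqrt(L) - np.sqrt(m)) / (np.sqrt(L) + np.sqrt(m))) ** 2
print(heavy_ball_rate(alpha, beta, m, L))    # ~ (sqrt(L/m)-1)/(sqrt(L/m)+1)
```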

SLIDE 7

[Figure: convergence rate ρ and iterations to convergence vs. condition ratio L/m, for Gradient, Nesterov, and Heavy ball on quadratic functions.]

Convergence rate: ‖xk − x⋆‖ ≤ C ρᵏ ‖x0 − x⋆‖
Iterations to convergence ∝ −1/log ρ
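To make the −1/log ρ scaling concrete: reaching ‖xk − x⋆‖ ≤ ε‖x0 − x⋆‖ takes roughly k = log ε / log ρ iterations. A tiny sketch with an illustrative tolerance ε = 10⁻⁶:

```python
# Iterations needed for a target accuracy, given a linear rate rho.
import numpy as np
for rho in (0.5, 0.9, 0.99):
    k = np.log(1e-6) / np.log(rho)     # solve rho**k = 1e-6 for k
    print(f"rho = {rho}: about {k:.0f} iterations")
```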

SLIDE 8

Robust algorithm selection

G ∈ G: algorithm we’re going to use
f ∈ S: function we’d like to minimize

G_opt = argmin_{G ∈ G} max_{f ∈ S} cost(f, G)

1. mathematical representation for G
2. mathematical representation for S
3. main robustness result

SLIDE 9

Dynamical system interpretation

Heavy ball: xk+1 = xk − α∇f(xk) + β(xk − xk−1). Define uk := ∇f(xk) and pk := xk−1. Then

[xk+1; pk+1] = [(1+β)I, −βI; I, 0] [xk; pk] + [−αI; 0] uk
yk = [I, 0] [xk; pk]
uk = ∇f(yk)

Algorithm: linear, known, decoupled. Function: nonlinear, uncertain, coupled.

SLIDE 10

Dynamical system interpretation (coordinate-wise)

Heavy ball: xk+1 = xk − α∇f(xk) + β(xk − xk−1), with uk := ∇f(xk) and pk := xk−1.

The same system decouples across coordinates: for each i = 1, …, N,

[(xk+1)i; (pk+1)i] = [1+β, −β; 1, 0] [(xk)i; (pk)i] + [−α; 0] (uk)i
(yk)i = [1, 0] [(xk)i; (pk)i]

with uk = ∇f(yk) coupling the coordinates.

Algorithm: linear, known, decoupled. Function: nonlinear, uncertain, coupled.

SLIDE 11

G in feedback with ∇f:

ξk+1 = Aξk + Buk
yk = Cξk
uk = ∇f(yk)

State-space data [A, B; C, 0] for each method:

Gradient:    A = 1,                B = −α,       C = 1
Heavy ball:  A = [1+β, −β; 1, 0],  B = [−α; 0],  C = [1, 0]
Nesterov:    A = [1+β, −β; 1, 0],  B = [−α; 0],  C = [1+β, −β]
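A minimal sketch of these state-space triples in Python (the helper name method_matrices is mine, not from the slides):

```python
# State-space (A, B, C) for the three methods, matching the table above.
import numpy as np

def method_matrices(name, alpha, beta=0.0):
    if name == "gradient":
        return (np.array([[1.0]]), np.array([[-alpha]]), np.array([[1.0]]))
    A = np.array([[1 + beta, -beta],
                  [1.0,       0.0]])
    B = np.array([[-alpha],
                  [0.0]])
    C = {"heavy_ball": np.array([[1.0, 0.0]]),
         "nesterov":   np.array([[1 + beta, -beta]])}[name]
    return A, B, C

A, B, C = method_matrices("nesterov", alpha=0.1, beta=0.5)
```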

SLIDE 12

Nested function classes (pictured as graphs of ∇f(x) vs. x and of f(x) vs. x):

∇f(x): linear ⊂ sector bounded + slope restricted ⊂ sector bounded

f(x): quadratic ⊂ strongly convex + Lipschitz gradients ⊂ radially quasiconvex
SLIDE 13

Representing function classes: express them as quadratic constraints on (y, u).

Sector bounded (passive case): ∇f is a passive function, i.e. uk yk ≥ 0.

SLIDE 14

Representing function classes: express them as quadratic constraints on (y, u).

Sector bounded on [m, L]:

[yk; uk]ᵀ [−2mL, m+L; m+L, −2] [yk; uk] ≥ 0
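A quick numeric sanity check of this inequality (a sketch; the one-dimensional f(x) = ½qx² with m ≤ q ≤ L is an illustrative member of the class):

```python
# The sector inequality holds pointwise for gradients of 0.5*q*x^2
# whenever m <= q <= L; it factors as -2*(u - m*y)*(u - L*y) >= 0.
import numpy as np

m, L, q = 1.0, 10.0, 4.0
M = np.array([[-2 * m * L, m + L],
              [m + L,      -2.0]])
for y in np.linspace(-5.0, 5.0, 11):
    u = q * y                       # gradient of 0.5*q*x^2 at y
    v = np.array([y, u])
    assert v @ M @ v >= -1e-9
```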

SLIDE 15

Representing function classes: express them as quadratic constraints on (y, u).

Sector bounded + slope restricted: the constraint on (yk, uk) depends on the history (y0, …, yk−1, u0, …, uk−1).

SLIDE 16

Introduce extra dynamics: a filter Ψ with internal state ζ processes (y, u) and outputs an auxiliary signal z.

• Design the dynamics Ψ and a multiplier matrix M.
• Instead of the pointwise form q(uk, yk), use zkᵀ M zk.
• Systematic way of doing this for strong convexity via Zames–Falb multipliers (1968).
• General theory: Integral Quadratic Constraints (Megretski & Rantzer, 1997).

SLIDE 17

The assembled analysis setup: the algorithm G (Gradient, Heavy ball, or Nesterov, with the [A, B; C, 0] data of slide 11) in feedback with ∇f, where the function class — f quadratic ⊂ f strongly convex ⊂ f quasiconvex — is captured by the choice of (Ψ, M).

SLIDE 18

Main result

Problem data:

• G (the algorithm)
• Ψ (what we know about f)

Auxiliary quantities:

• Compute the matrices (Â, B̂, Ĉ, D̂) from (G, Ψ)
• Choose a candidate rate 0 < ρ < 1

If there exists P ≻ 0 such that

[ÂᵀPÂ − ρ²P, ÂᵀPB̂; B̂ᵀPÂ, B̂ᵀPB̂] + [Ĉ, D̂]ᵀ M [Ĉ, D̂] ⪯ 0

then ‖xk − x⋆‖ ≤ √cond(P) · ρᵏ ‖x0 − x⋆‖ for all k.

Size of the LMI does not grow with problem dimension! e.g. P is 3×3 and the LMI is 4×4.
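A minimal sketch of how one might check this condition numerically, specialized to the static sector IQC of slide 14 (so Â = A, B̂ = B, and [Ĉ, D̂] simply stacks C with the input map). cvxpy, the SCS solver, the IQC scaling λ, and the bisection tolerance are my choices, not from the slides.

```python
# Sketch: certify a rate rho for the gradient method via the LMI above,
# specialized to the static sector IQC (no filter dynamics in Psi).
import numpy as np
import cvxpy as cp

m, L = 1.0, 10.0
alpha = 2 / (L + m)                       # gradient method stepsize

A = np.array([[1.0]])                     # state-space data (slide 11)
B = np.array([[-alpha]])
C = np.array([[1.0]])

M = np.array([[-2 * m * L, m + L],        # sector multiplier (slide 14)
              [m + L,      -2.0]])
E = np.block([[C, np.zeros((1, 1))],      # maps (state, input) to (y, u)
              [np.zeros((1, 1)), np.eye(1)]])

def feasible(rho):
    P = cp.Variable((1, 1), symmetric=True)
    lam = cp.Variable(nonneg=True)        # scaling of the IQC
    lmi = cp.bmat([[A.T @ P @ A - rho**2 * P, A.T @ P @ B],
                   [B.T @ P @ A,              B.T @ P @ B]]) \
          + lam * (E.T @ M @ E)
    prob = cp.Problem(cp.Minimize(0), [P >> np.eye(1), lmi << 0])
    prob.solve(solver=cp.SCS)
    return prob.status in ("optimal", "optimal_inaccurate")

lo, hi = 0.0, 1.0                         # bisect for the best rate
while hi - lo > 1e-3:
    mid = 0.5 * (lo + hi)
    lo, hi = (lo, mid) if feasible(mid) else (mid, hi)
print(f"certified rate ≈ {hi:.3f}; theory: (L-m)/(L+m) = {(L-m)/(L+m):.3f}")
```

With m = 1 and L = 10 the bisection should land near the known gradient-method rate (L−m)/(L+m) ≈ 0.818.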

SLIDE 19

main results: analytic and numerical


SLIDE 20

Gradient method

xk+1 = xk − α∇f(xk)

[Figure: convergence rate and iterations to convergence vs. condition ratio L/m; Gradient (all functions) compared with Nesterov and Heavy ball on quadratics.]

Analytic solution! Same rate for quadratics, strongly convex, or quasiconvex functions.

SLIDE 21

Nesterov’s method

xk+1 = xk − α∇f(xk + β(xk − xk−1)) + β(xk − xk−1)

[Figure: Nesterov rate bounds and iterations to convergence vs. condition ratio L/m — IQC (quasiconvex), IQC (strongly convex), Nesterov (strongly convex), Nesterov (quadratic), Heavy ball (quadratic).]

• Cannot certify stability for quasiconvex functions
• IQC bound improves upon the best known bound!

SLIDE 22

Heavy ball method

xk+1 = xk − α∇f(xk) + β(xk − xk−1)

[Figure: Heavy ball rate bounds and iterations to convergence vs. condition ratio L/m — IQC (quasiconvex), IQC (strongly convex), Nesterov (quadratic), Heavy ball (quadratic).]

• Cannot certify stability for quasiconvex functions
• Cannot certify stability for strongly convex functions

SLIDE 23

The heavy ball method is not stable!

Counterexample:

f(x) = (25/2)x²               if x < 1
       (1/2)x² + 24x − 12     if 1 ≤ x < 2
       (25/2)x² − 24x + 36    if x ≥ 2

and start the heavy ball iteration at x0 = x1 ∈ [3.07, 3.46].

[Figure: plot of f(x).]

• L/m = 25
• heavy ball iterations converge to a limit cycle
• simple counterexample to the Aizerman (1949) and Kalman (1957) conjectures
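A sketch that reproduces this behavior; the standard heavy-ball tuning α = 4/(√L+√m)², β = ((√L−√m)/(√L+√m))² is an assumption on my part (the slide fixes only f and the starting interval):

```python
# Run heavy ball on the piecewise quadratic above; the iterates do not
# converge to the minimizer x* = 0, they settle into a limit cycle.
import numpy as np

def fprime(x):                      # gradient of the counterexample f
    if x < 1:
        return 25.0 * x
    if x < 2:
        return x + 24.0
    return 25.0 * x - 24.0

m, L = 1.0, 25.0                    # f is 1-strongly convex, L = 25
alpha = 4 / (np.sqrt(L) + np.sqrt(m)) ** 2      # assumed standard tuning
beta = ((np.sqrt(L) - np.sqrt(m)) / (np.sqrt(L) + np.sqrt(m))) ** 2

x_prev = x = 3.3                    # x0 = x1 in [3.07, 3.46]
for k in range(500):
    x, x_prev = x - alpha * fprime(x) + beta * (x - x_prev), x
print(x)                            # stays bounded away from 0
```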

SLIDE 24

uncharted territory: noise robustness and algorithm design

SLIDE 25

Noise robustness

Insert an uncertainty block Δδ between ∇f and the algorithm G: the true gradient wk = ∇f(yk) is replaced by a perturbed signal uk. The Δδ block is uncertain multiplicative noise:

‖uk − wk‖ ≤ δ‖wk‖

How does an algorithm perform in the presence of noise?
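One way to instantiate such a block in simulation (a sketch; the random perturbation direction and the helper name noisy_gradient are mine — the model only constrains ‖uk − wk‖ ≤ δ‖wk‖):

```python
# Multiplicative-noise oracle: returns u with ||u - w|| <= delta * ||w||.
import numpy as np

rng = np.random.default_rng(0)

def noisy_gradient(grad, x, delta):
    w = grad(x)                     # true gradient (a numpy array)
    e = rng.standard_normal(w.shape)
    e *= rng.uniform() * delta * np.linalg.norm(w) / np.linalg.norm(e)
    return w + e                    # ||u - w|| <= delta * ||w||
```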

SLIDE 26

Gradient method, α = 2/(L+m) (optimal stepsize with no noise):

[Figure: convergence rates and iterations to convergence for δ ∈ {0.01, 0.02, 0.05, 0.1, 0.2, 0.5}.]

Gradient method, α = 1/L (more conservative stepsize):

[Figure: convergence rates and iterations to convergence for δ ∈ {0.01, 0.02, 0.05, 0.1, 0.2, 0.5}.]

SLIDE 27

Nesterov’s method (strongly convex f, with noise)

[Figure: rates and iterations to convergence vs. condition ratio L/m for δ ∈ {0.05, 0.1, 0.2, 0.3, 0.4, 0.5}, compared with the noiseless Nesterov (quadratic) curve.]

• Nesterov’s method is not robust to noise.

Can we have it all? (robustness AND performance)

SLIDE 28

Brute force approach

• test all strictly proper G of degree 2
• parameterization in terms of (α, β, η), as sketched below:

xk+1 = xk − α∇f(yk) + β(xk − xk−1)
yk = xk + η(xk − xk−1)

Special cases: (α, β, η) = (α, 0, 0) Gradient; (α, β, 0) Heavy ball; (α, β, β) Nesterov.
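A minimal sketch of one step of this family (the helper name three_param_step is mine); the parameter settings above recover each named method:

```python
# One step of the three-parameter family; special cases:
# (alpha, 0, 0) gradient, (alpha, beta, 0) heavy ball,
# (alpha, beta, beta) Nesterov.
def three_param_step(x, x_prev, grad, alpha, beta, eta):
    y = x + eta * (x - x_prev)
    x_next = x - alpha * grad(y) + beta * (x - x_prev)
    return x_next, x                # new iterate and new "previous" iterate
```

For instance, iterating x, x_prev = three_param_step(x, x_prev, grad, α, β, β) reproduces Nesterov’s method, since then x_next = y − α∇f(y).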

SLIDE 29

Optimal designs over (α, β, η)

[Figure: rates and iterations to convergence vs. condition ratio L/m for the optimized designs, δ ∈ {0.01, 0.1, 0.2, 0.3, 0.4, 0.5}, compared with Gradient and Nesterov (quadratic).]

• Faster than the gradient method and more robust to noise than Nesterov’s method
• Automatic algorithm design is possible!

SLIDE 30

What we have (so far!)

L, Recht, Packard (SIOPT ’16)

• unified framework for algorithm analysis
• read this one first!

Nishihara, L, Recht, Packard, Jordan (ICML ’15)

• operator splitting methods
• application to ADMM tuning

[Block diagram: operator-splitting interconnection of G with ∇f and ∂g; resolvent step x ↦ prox_λg(x).]

SLIDE 31

Recent works

Boczar, L, Packard, Recht (arXiv:1706.01337)

• control-theoretic treatment
• certifying exponential convergence with IQCs

Hu, L (arXiv:1706.04381)

• (energy) dissipation inequalities
• proves linear, 1/k, and 1/k² rates
• Lyapunov function for (time-varying) Nesterov’s method

SLIDE 32

Thank you!

• Manuscripts + code available: www.laurentlessard.com
• If you’re interested, come talk to me!