

SLIDE 1

A Dynamical Systems Perspective on Nesterov Acceleration

Michael Muehlebach and Michael I. Jordan UC Berkeley

Michael Muehlebach and Michael I. Jordan Dynamical Systems Perspective 1 / 7

SLIDE 2

Introduction

Find $x^* \in \mathbb{R}^n$ such that $f(x^*) \le f(x)$ for all $x \in \mathbb{R}^n$, where $f$ is smooth and convex.

Focus on the case where $f$ is strongly convex, i.e. $f$ is convex and satisfies, for any $\bar{x} \in \mathbb{R}^n$,

$$f(x) \ge f(\bar{x}) + \nabla f(\bar{x})(x - \bar{x}) + \frac{L}{2\kappa}|x - \bar{x}|^2, \quad \forall x \in \mathbb{R}^n.$$

$L > 0$ is the Lipschitz constant of the gradient. $\kappa \ge 1$ is the condition number.
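
As a small numerical illustration of the strong convexity inequality, the sketch below checks it on a diagonal quadratic whose curvatures lie in $[L/\kappa, L]$. The values of `L` and `kappa`, the test function, and all helper names are assumptions chosen for the example; none of them come from the slides.

```python
import random

# Illustrative constants (assumptions, not from the slides).
L = 4.0          # Lipschitz constant of the gradient
kappa = 8.0      # condition number
mu = L / kappa   # strong convexity modulus L / kappa

# Diagonal quadratic f(x) = 0.5 * sum(curv_i * x_i^2), curvatures in [L/kappa, L].
curv = [mu, 1.5, L]


def f(x):
    return 0.5 * sum(c * xi * xi for c, xi in zip(curv, x))


def grad_f(x):
    return [c * xi for c, xi in zip(curv, x)]


def strong_convexity_gap(x, x_bar):
    """f(x) minus the quadratic lower bound built at x_bar (expected >= 0)."""
    diff = [xi - bi for xi, bi in zip(x, x_bar)]
    linear = sum(gi * di for gi, di in zip(grad_f(x_bar), diff))
    quad = 0.5 * mu * sum(di * di for di in diff)
    return f(x) - (f(x_bar) + linear + quad)


# Evaluate the gap on random pairs (x, x_bar).
rng = random.Random(0)
gaps = [
    strong_convexity_gap(
        [rng.uniform(-5.0, 5.0) for _ in curv],
        [rng.uniform(-5.0, 5.0) for _ in curv],
    )
    for _ in range(1000)
]
min_gap = min(gaps)
print(f"smallest gap over 1000 random pairs: {min_gap:.3e}")
```

For this quadratic the gap equals $\frac{1}{2}\sum_i (a_i - L/\kappa)(x_i - \bar{x}_i)^2$, where $a_i$ are the curvatures, so it is nonnegative up to rounding.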

SLIDE 3

Dynamical Systems Perspective

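Consider the ordinary differential equation (ODE)

$$\ddot{x}(t) + 2d\,\dot{x}(t) + \frac{1}{L}\nabla f\big(x(t) + \beta \dot{x}(t)\big) = 0,$$

with

$$d := \frac{1}{\sqrt{\kappa} + 1}, \qquad \beta := \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}.$$

The ODE can be brought to the form

$$\dot{q}(t) = p(t), \qquad \dot{p}(t) = -\frac{1}{L}\nabla f(q(t)) + f_{NP}(q(t), p(t)),$$

where

$$H(q, p) := \frac{1}{2}|p|^2 + \frac{1}{L} f(q), \qquad f_{NP}(q, p) := -2dp - \frac{1}{L}\big(\nabla f(q + \beta p) - \nabla f(q)\big).$$

With this $H$, the first equation reads $\dot{q} = \partial H / \partial p$ and the second reads $\dot{p} = -\partial H / \partial q + f_{NP}$.

[Figure: illustration with labels $\nabla f(x)$ and $f_{NP}$.]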
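
To make the first-order form concrete, here is a minimal sketch that integrates $\dot{q} = p$, $\dot{p} = -\frac{1}{L}\nabla f(q) + f_{NP}(q, p)$ with a classical fourth-order Runge-Kutta step and records $H(q, p)$ along the trajectory. The quadratic test function, the values of `L` and `kappa`, the step size, and the helper names (`axpy`, `rk4_step`) are assumptions for illustration; the slides do not prescribe an integrator.

```python
import math

# Illustrative problem data (assumptions, not from the slides).
L, kappa = 1.0, 25.0
sqrt_kappa = math.sqrt(kappa)
d = 1.0 / (sqrt_kappa + 1.0)
beta = (sqrt_kappa - 1.0) / (sqrt_kappa + 1.0)
curv = [L / kappa, L]  # f(q) = 0.5 * sum(curv_i * q_i^2)


def f(q):
    return 0.5 * sum(c * qi * qi for c, qi in zip(curv, q))


def grad_f(q):
    return [c * qi for c, qi in zip(curv, q)]


def f_np(q, p):
    """Non-potential force: -2 d p - (1/L) (grad f(q + beta p) - grad f(q))."""
    g_shift = grad_f([qi + beta * pi for qi, pi in zip(q, p)])
    g_base = grad_f(q)
    return [-2.0 * d * pi - (gs - gb) / L
            for pi, gs, gb in zip(p, g_shift, g_base)]


def rhs(q, p):
    """First-order form: dq/dt = p, dp/dt = -(1/L) grad f(q) + f_NP(q, p)."""
    dp = [-g / L + fn for g, fn in zip(grad_f(q), f_np(q, p))]
    return list(p), dp


def hamiltonian(q, p):
    return 0.5 * sum(pi * pi for pi in p) + f(q) / L


def axpy(x, y, c):
    """Return x + c * y elementwise."""
    return [xi + c * yi for xi, yi in zip(x, y)]


def rk4_step(q, p, h):
    """One classical Runge-Kutta step of size h."""
    k1q, k1p = rhs(q, p)
    k2q, k2p = rhs(axpy(q, k1q, h / 2), axpy(p, k1p, h / 2))
    k3q, k3p = rhs(axpy(q, k2q, h / 2), axpy(p, k2p, h / 2))
    k4q, k4p = rhs(axpy(q, k3q, h), axpy(p, k3p, h))
    q_new = [qi + h / 6 * (a + 2 * b + 2 * c + e)
             for qi, a, b, c, e in zip(q, k1q, k2q, k3q, k4q)]
    p_new = [pi + h / 6 * (a + 2 * b + 2 * c + e)
             for pi, a, b, c, e in zip(p, k1p, k2p, k3p, k4p)]
    return q_new, p_new


q, p = [1.0, 1.0], [0.0, 0.0]
h, steps = 0.01, 5000  # integrate up to t = 50
energies = [hamiltonian(q, p)]
for _ in range(steps):
    q, p = rk4_step(q, p, h)
    energies.append(hamiltonian(q, p))

print(f"H(0) = {energies[0]:.4f}, H(50) = {energies[-1]:.3e}")
```

On a quadratic, $\frac{d}{dt}H = p^\top f_{NP}(q, p) \le 0$, so the recorded values of $H$ should not increase along the trajectory.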

SLIDE 4

Damping

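The non-potential forces can be rewritten as

$$f_{NP}(q, p) = -2dp - \frac{1}{L}\big(\nabla f(q + \beta p) - \nabla f(q)\big) = \underbrace{-2dp}_{\text{isotropic damping}} \; \underbrace{-\,\frac{1}{L}\int_0^{\beta} \Delta f(q + \tau p)\,\mathrm{d}\tau \; p}_{\text{curvature-dependent damping}}.$$

Here $\Delta f$ denotes the Hessian of $f$, so the integral form follows from the fundamental theorem of calculus applied to $\tau \mapsto \nabla f(q + \tau p)$.

[Figure: two plots against $\kappa$ (horizontal axis up to 100, vertical axis from 0 to 1), one of $2d$ and one of $\beta$. A further illustration carries the labels $\nabla f(x)$ and $f_{NP}$.]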
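
The two damping coefficients can be tabulated directly from the definitions $d = 1/(\sqrt{\kappa} + 1)$ and $\beta = (\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1)$. The sketch below does that for a few values of $\kappa$; the function name `damping_parameters` and the chosen values of $\kappa$ are illustrative assumptions.

```python
import math


def damping_parameters(kappa):
    """Return (2d, beta) for a given condition number kappa >= 1."""
    s = math.sqrt(kappa)
    two_d = 2.0 / (s + 1.0)      # isotropic damping coefficient 2d
    beta = (s - 1.0) / (s + 1.0)  # weight of the curvature-dependent term
    return two_d, beta


# Illustrative condition numbers (assumption: any kappa >= 1 would do).
table = {kappa: damping_parameters(kappa) for kappa in (1, 4, 25, 100)}
for kappa, (two_d, beta) in table.items():
    print(f"kappa={kappa:4d}  2d={two_d:.4f}  beta={beta:.4f}")
```

From the definitions, $2d + \beta = 1$ for every $\kappa$: as $\kappa$ grows, $2d$ falls from 1 toward 0 while $\beta$ rises from 0 toward 1, so the damping shifts from the isotropic term to the curvature-dependent term.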

SLIDE 5

Convergence

Asymptotic stability (through dissipation).

Convergence rate (upper bound, stated for $p(0) = 0$):

$$f(q(t)) - f^* \le 2\big(f(q(0)) - f^*\big) \exp\!\left(-\frac{t}{2\sqrt{\kappa}}\right), \quad \forall t \in [0, \infty).$$

Convergence rate of $O(1/t^2)$ in the non-strongly convex case.

Derivation is based on the following Lyapunov-like function (stated for $x^* = 0$ and $f(x^*) = 0$):

$$V(t) = \frac{1}{2}|a q(t) + p(t)|^2 + \frac{1}{L} f(q(t)).$$
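
The exponential bound can be checked numerically on an example. The sketch below assumes a diagonal quadratic $f(q) = \frac{1}{2}\sum_i a_i q_i^2$ with $f^* = 0$, for which the ODE decouples into scalar equations $\ddot{q}_i + (2d + \beta a_i / L)\dot{q}_i + (a_i / L) q_i = 0$, and integrates each one with a Runge-Kutta step from $p(0) = 0$. The constants, curvatures, initial point, and helper names are assumptions for illustration, and a single example does not prove the bound.

```python
import math

# Illustrative problem data (assumptions, not from the slides).
L, kappa = 1.0, 100.0
s = math.sqrt(kappa)
d, beta = 1.0 / (s + 1.0), (s - 1.0) / (s + 1.0)
curvatures = [L / kappa, 0.3 * L, L]  # a_i in [L/kappa, L]
q0 = [1.0, -2.0, 0.5]
h, steps = 0.01, 10000                # integrate up to t = 100


def mode_trajectory(a_i, q_init):
    """RK4 for q'' + (2d + beta*a_i/L) q' + (a_i/L) q = 0, q(0)=q_init, q'(0)=0."""
    lam = a_i / L
    c = 2.0 * d + beta * lam

    def rhs(q, p):
        return p, -c * p - lam * q

    q, p, out = q_init, 0.0, [q_init]
    for _ in range(steps):
        k1q, k1p = rhs(q, p)
        k2q, k2p = rhs(q + 0.5 * h * k1q, p + 0.5 * h * k1p)
        k3q, k3p = rhs(q + 0.5 * h * k2q, p + 0.5 * h * k2p)
        k4q, k4p = rhs(q + h * k3q, p + h * k3p)
        q += h / 6.0 * (k1q + 2.0 * k2q + 2.0 * k3q + k4q)
        p += h / 6.0 * (k1p + 2.0 * k2p + 2.0 * k3p + k4p)
        out.append(q)
    return out


trajectories = [mode_trajectory(a, q) for a, q in zip(curvatures, q0)]

# f(q(t_k)) with f* = 0, and the bound 2 f(q(0)) exp(-t / (2 sqrt(kappa))).
f_values = [
    0.5 * sum(a * traj[k] ** 2 for a, traj in zip(curvatures, trajectories))
    for k in range(steps + 1)
]
bounds = [2.0 * f_values[0] * math.exp(-k * h / (2.0 * s))
          for k in range(steps + 1)]
worst_ratio = max(fv / b for fv, b in zip(f_values, bounds))
print(f"max over t of f(q(t)) / bound(t): {worst_ratio:.3f}")
```

A value of `worst_ratio` at or below 1 means the trajectory stays under the stated bound on the whole time grid.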

SLIDE 6

Discretization

Semi-implicit Euler discretization (with time step $T_s = 1$) leads to the accelerated gradient method

$$q_{k+1} = q_k + T_s\, p_{k+1}, \qquad p_{k+1} = p_k + T_s\left(-\frac{1}{L}\nabla f(q_k) + f_{NP}(q_k, p_k)\right).$$

Which properties are preserved through the discretization?

◮ phase-space area contraction rate (contraction for $T_s \in (0, 2)$)
◮ time-reversibility (for $T_s \in (0, 1)$)

⇒ yields a worst-case bound on the convergence rate

◮ convergence rate (for $T_s \in (0, 1]$)

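[Figure: the map $\psi$ takes a phase-space region $\Gamma_k$ with boundary $\partial\Gamma_k$ in the $(q_k, p_k)$ plane to $\Gamma_{k+1}$ with boundary $\partial\Gamma_{k+1}$ in the $(q_{k+1}, p_{k+1})$ plane.]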
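
The link between the semi-implicit Euler step and the accelerated gradient method can be checked numerically. The sketch below assumes the step reads $p_{k+1} = p_k + T_s\big(-\frac{1}{L}\nabla f(q_k) + f_{NP}(q_k, p_k)\big)$, $q_{k+1} = q_k + T_s p_{k+1}$, with signs and the $1/L$ factor taken from the ODE, and compares it for $T_s = 1$ against the familiar constant-momentum form $y_k = x_k + \beta(x_k - x_{k-1})$, $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$. The quadratic test function, constants, and function names are illustrative assumptions.

```python
import math

# Illustrative problem data (assumptions, not from the slides).
L, kappa = 2.0, 16.0
sqrt_kappa = math.sqrt(kappa)
d = 1.0 / (sqrt_kappa + 1.0)
beta = (sqrt_kappa - 1.0) / (sqrt_kappa + 1.0)
curv = [L / kappa, 0.5 * L, L]  # f(q) = 0.5 * sum(curv_i * q_i^2)


def f(q):
    return 0.5 * sum(c * qi * qi for c, qi in zip(curv, q))


def grad_f(q):
    return [c * qi for c, qi in zip(curv, q)]


def f_np(q, p):
    """Non-potential force: -2 d p - (1/L) (grad f(q + beta p) - grad f(q))."""
    g_shift = grad_f([qi + beta * pi for qi, pi in zip(q, p)])
    g_base = grad_f(q)
    return [-2.0 * d * pi - (gs - gb) / L
            for pi, gs, gb in zip(p, g_shift, g_base)]


def semi_implicit_euler_step(q, p, Ts=1.0):
    """Update p from (q_k, p_k) first, then move q with the new p_{k+1}."""
    p_next = [pi + Ts * (-gi / L + fi)
              for pi, gi, fi in zip(p, grad_f(q), f_np(q, p))]
    q_next = [qi + Ts * pn for qi, pn in zip(q, p_next)]
    return q_next, p_next


def nesterov_step(x_prev, x):
    """Constant-momentum accelerated gradient step with step size 1/L."""
    y = [xi + beta * (xi - xp) for xi, xp in zip(x, x_prev)]
    return [yi - gi / L for yi, gi in zip(y, grad_f(y))]


q, p = [1.0, -2.0, 3.0], [0.0, 0.0, 0.0]
x_prev, x = list(q), list(q)  # p_0 = 0 corresponds to x_{-1} = x_0
f_start = f(q)
max_diff = 0.0
for _ in range(100):
    q, p = semi_implicit_euler_step(q, p)
    x_prev, x = x, nesterov_step(x_prev, x)
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(q, x)))

f_final = f(q)
print(f"largest iterate mismatch: {max_diff:.2e}, f after 100 steps: {f_final:.2e}")
```

The two iterations agree because $1 - 2d = \beta$: with $T_s = 1$ the momentum update collapses to $p_{k+1} = \beta p_k - \frac{1}{L}\nabla f(q_k + \beta p_k)$, and $p_k = q_k - q_{k-1}$.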

SLIDE 7

Conclusion and Outlook

We derived a dynamical system model for the accelerated gradient method.

◮ The dynamics have an interpretation as a mass-spring-damper system.
◮ Discretization yields the accelerated gradient method.
◮ Certain key properties are preserved through the discretization.

Is a symplectic discretization the “right” discretization?

◮ The behavior for large κ seems particularly important.

Come to visit me at Poster 205.
