A Dynamical Systems Perspective on Nesterov Acceleration
Michael Muehlebach and Michael I. Jordan
UC Berkeley
Introduction

Find $x^* \in \mathbb{R}^n$ such that $f(x^*) \le f(x)$ for all $x \in \mathbb{R}^n$, where $f$ is smooth and convex.

Focus on the case where $f$ is strongly convex, i.e. $f$ is convex and satisfies, for any $\bar{x} \in \mathbb{R}^n$,

$$f(x) \ge f(\bar{x}) + \nabla f(\bar{x})(x - \bar{x}) + \frac{L}{2\kappa} |x - \bar{x}|^2, \quad \forall x \in \mathbb{R}^n.$$

$L > 0$ is the Lipschitz constant of the gradient. $\kappa \ge 1$ is the condition number.
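As a quick sanity check of the strong convexity inequality, the sketch below evaluates the gap between $f(x)$ and the quadratic lower bound at random point pairs. The test function (a separable quadratic with curvatures $L/\kappa$ and $L$), the constants, and all function names are assumptions chosen for illustration; they do not come from the slides.

```python
import random

# Hypothetical test problem (not from the slides): f(x) = 0.5 * sum(h_i * x_i^2),
# whose Hessian eigenvalues h_i lie in [L / kappa, L].
L = 10.0
kappa = 25.0
h = [L / kappa, L]  # smallest and largest curvature


def f(x):
    return 0.5 * sum(hi * xi * xi for hi, xi in zip(h, x))


def grad_f(x):
    return [hi * xi for hi, xi in zip(h, x)]


def strong_convexity_gap(x, x_bar):
    """Return f(x) minus the quadratic lower bound built at x_bar."""
    diff = [xi - bi for xi, bi in zip(x, x_bar)]
    linear = sum(gi * di for gi, di in zip(grad_f(x_bar), diff))
    quadratic = L / (2.0 * kappa) * sum(di * di for di in diff)
    return f(x) - (f(x_bar) + linear + quadratic)


random.seed(0)
gaps = []
for _ in range(1000):
    x = [random.uniform(-5, 5) for _ in h]
    x_bar = [random.uniform(-5, 5) for _ in h]
    gaps.append(strong_convexity_gap(x, x_bar))

# The inequality holds when the gap is nonnegative (up to rounding).
min_gap = min(gaps)
print(min_gap >= -1e-9)
```

For this quadratic the gap equals $\tfrac{1}{2}\sum_i (h_i - L/\kappa)(x_i - \bar{x}_i)^2$, so it vanishes along the direction of smallest curvature and is positive elsewhere.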
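Dynamical Systems Perspective

Consider the ordinary differential equation (ODE)

$$\ddot{x}(t) + 2d\,\dot{x}(t) + \frac{1}{L} \nabla f\big(x(t) + \beta \dot{x}(t)\big) = 0,$$

with

$$d := \frac{1}{\sqrt{\kappa} + 1}, \qquad \beta := \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}.$$

The ODE can be brought to the form

$$\dot{q}(t) = p(t), \qquad \dot{p}(t) = -\frac{1}{L} \nabla f(q(t)) + f_{\mathrm{NP}}(q(t), p(t)),$$

where

$$H(q, p) := \frac{1}{2} |p|^2 + \frac{1}{L} f(q), \qquad f_{\mathrm{NP}}(q, p) := -2dp - \frac{1}{L} \big( \nabla f(q + \beta p) - \nabla f(q) \big).$$

Here $p = \partial H / \partial p$ and $-\frac{1}{L} \nabla f(q) = -\partial H / \partial q$, so the system is Hamiltonian up to the non-potential force $f_{\mathrm{NP}}$.

[Figure: sketch with forces labeled $\nabla f(x)$ and $f_{\mathrm{NP}}$.]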
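The equivalence of the second-order ODE and the first-order $(q, p)$ form can be checked numerically: the $-\frac{1}{L}\nabla f(q)$ term cancels against the matching term inside $f_{\mathrm{NP}}$. The following is a minimal sketch on an assumed separable quadratic; the test function, constants, and function names are illustrative and not taken from the slides.

```python
import math

# Hypothetical test problem (not from the slides): f(x) = 0.5 * sum(h_i * x_i^2).
L = 10.0
kappa = 25.0
h = [L / kappa, L]

sqrt_kappa = math.sqrt(kappa)
d = 1.0 / (sqrt_kappa + 1.0)
beta = (sqrt_kappa - 1.0) / (sqrt_kappa + 1.0)


def grad_f(x):
    return [hi * xi for hi, xi in zip(h, x)]


def f_np(q, p):
    """Non-potential force: -2 d p - (1/L) (grad f(q + beta p) - grad f(q))."""
    shifted = grad_f([qi + beta * pi for qi, pi in zip(q, p)])
    base = grad_f(q)
    return [-2.0 * d * pi - (si - bi) / L for pi, si, bi in zip(p, shifted, base)]


def p_dot_first_order(q, p):
    """Right-hand side for p in the (q, p) form."""
    return [-gi / L + fi for gi, fi in zip(grad_f(q), f_np(q, p))]


def x_ddot_second_order(x, x_dot):
    """Acceleration obtained by solving the second-order ODE for x_ddot."""
    shifted = grad_f([xi + beta * vi for xi, vi in zip(x, x_dot)])
    return [-2.0 * d * vi - si / L for vi, si in zip(x_dot, shifted)]


q, p = [1.0, -2.0], [0.5, 0.25]
lhs = p_dot_first_order(q, p)
rhs = x_ddot_second_order(q, p)
max_mismatch = max(abs(a - b) for a, b in zip(lhs, rhs))
print(max_mismatch)  # expected to be at rounding-error level
```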
Damping

The non-potential forces can be rewritten as

$$f_{\mathrm{NP}}(q, p) = -2dp - \frac{1}{L} \big( \nabla f(q + \beta p) - \nabla f(q) \big) = \underbrace{-2dp}_{\text{isotropic damping}} \; \underbrace{- \frac{1}{L} \int_0^{\beta} \Delta f(q + \tau p) \, \mathrm{d}\tau \; p}_{\text{curv. dependent damping}},$$

where $\Delta f$ denotes the Hessian of $f$.

[Figure: sketch with forces labeled $\nabla f(x)$ and $f_{\mathrm{NP}}$; two plots showing $2d$ and $\beta$ as functions of $\kappa$ for $\kappa$ from 0 to 100, with vertical axes from 0 to 1.]
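The dependence of the two damping coefficients on the condition number can be tabulated directly from $d = 1/(\sqrt{\kappa}+1)$ and $\beta = (\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$. A small sketch follows; the function name and the sample values of $\kappa$ are chosen here for illustration.

```python
import math


def damping_parameters(kappa):
    """Return (2d, beta) for a condition number kappa >= 1."""
    sqrt_kappa = math.sqrt(kappa)
    d = 1.0 / (sqrt_kappa + 1.0)
    beta = (sqrt_kappa - 1.0) / (sqrt_kappa + 1.0)
    return 2.0 * d, beta


kappas = [1, 4, 25, 100]
table = [(k,) + damping_parameters(k) for k in kappas]
for k, two_d, beta in table:
    print(f"kappa={k:>4}  2d={two_d:.4f}  beta={beta:.4f}  sum={two_d + beta:.4f}")
```

Since $2d + \beta = 1$ for every $\kappa$, increasing the condition number shifts weight from the isotropic term to the curvature-dependent term: $2d$ falls from 1 toward 0 while $\beta$ rises from 0 toward 1.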
Convergence

Asymptotic stability (through dissipation).

Convergence rate (upper bound, stated for $p(0) = 0$):

$$f(q(t)) - f^* \le 2 \big( f(q(0)) - f^* \big) \exp\!\left( -\frac{t}{2\sqrt{\kappa}} \right), \quad \forall t \in [0, \infty).$$

Convergence rate of $O(1/t^2)$ in the non-strongly convex case.

The derivation is based on the following Lyapunov-like function (stated for $x^* = f(x^*) = 0$):

$$V(t) = \frac{1}{2} |a q(t) + p(t)|^2 + \frac{1}{L} f(q(t)).$$
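The exponential bound can be probed numerically by integrating the ODE and comparing $f(q(t)) - f^*$ with $2(f(q(0)) - f^*)\exp(-t/(2\sqrt{\kappa}))$ along the trajectory. The sketch below uses a classical fourth-order Runge-Kutta integrator on an assumed quadratic with $f^* = 0$; the integrator choice, test function, step size, and horizon are all assumptions made here, not details from the slides.

```python
import math

# Hypothetical test problem (not from the slides): f(q) = 0.5 * sum(h_i * q_i^2),
# so f* = 0 is attained at q = 0.
L = 1.0
kappa = 100.0
h = [L / kappa, L]

sqrt_kappa = math.sqrt(kappa)
d = 1.0 / (sqrt_kappa + 1.0)
beta = (sqrt_kappa - 1.0) / (sqrt_kappa + 1.0)


def f(q):
    return 0.5 * sum(hi * qi * qi for hi, qi in zip(h, q))


def vector_field(state):
    """state holds q followed by p; returns q_dot followed by p_dot."""
    n = len(state) // 2
    q, p = state[:n], state[n:]
    # p_dot = -2 d p - (1/L) grad f(q + beta p)
    p_dot = [-2.0 * d * pi - hi * (qi + beta * pi) / L
             for hi, qi, pi in zip(h, q, p)]
    return list(p) + p_dot


def rk4_step(state, dt):
    """One classical Runge-Kutta step of size dt."""
    k1 = vector_field(state)
    k2 = vector_field([s + 0.5 * dt * k for s, k in zip(state, k1)])
    k3 = vector_field([s + 0.5 * dt * k for s, k in zip(state, k2)])
    k4 = vector_field([s + dt * k for s, k in zip(state, k3)])
    return [s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + e)
            for s, a, b, c, e in zip(state, k1, k2, k3, k4)]


q0 = [1.0, 1.0]
state = q0 + [0.0, 0.0]  # p(0) = 0, as the bound assumes
dt, steps = 0.01, 10000  # integrate up to t = 100
f0 = f(q0)
worst_ratio = 0.0
for k in range(1, steps + 1):
    state = rk4_step(state, dt)
    t = k * dt
    bound = 2.0 * f0 * math.exp(-t / (2.0 * sqrt_kappa))
    worst_ratio = max(worst_ratio, f(state[:2]) / bound)

# A ratio of at most 1 means the bound held at every sampled time.
print(worst_ratio <= 1.0)
```

A single trajectory on one quadratic only illustrates the bound; it does not substitute for the Lyapunov argument.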
Discretization

Semi-implicit Euler discretization (with time step $T_s = 1$) leads to the accelerated gradient method:

$$q_{k+1} = q_k + T_s \, p_{k+1}, \qquad p_{k+1} = p_k + T_s \left( -\frac{1}{L} \nabla f(q_k) + f_{\mathrm{NP}}(q_k, p_k) \right).$$

What are the properties that are preserved through the discretization?
◮ phase-space area contraction rate (contraction for $T_s \in (0, 2)$)
◮ time-reversibility (for $T_s \in (0, 1)$) ⇒ yields a worst-case bound on the convergence rate
◮ convergence rate (for $T_s \in (0, 1]$)

[Figure: a map $\psi$ carries a phase-space region $\Gamma_k$ with boundary $\partial\Gamma_k$, drawn over axes $q_k$ and $p_k$, to $\Gamma_{k+1}$ with boundary $\partial\Gamma_{k+1}$, drawn over axes $q_{k+1}$ and $p_{k+1}$.]
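To see why the semi-implicit Euler step with $T_s = 1$ coincides with an accelerated gradient method, note that with the force $-\frac{1}{L}\nabla f(q) + f_{\mathrm{NP}}(q, p)$ from the first-order form of the ODE and $1 - 2d = \beta$, the update collapses to $q_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$ with $y_k = q_k + \beta(q_k - q_{k-1})$. The sketch below compares both iterations on an assumed quadratic; the test function, constants, and function names are illustrative, and the "textbook" form is the standard constant-momentum variant, written here as an assumption about which variant is meant.

```python
import math

# Hypothetical test problem (not from the slides): f(q) = 0.5 * sum(h_i * q_i^2).
L = 1.0
kappa = 100.0
h = [L / kappa, L]

sqrt_kappa = math.sqrt(kappa)
d = 1.0 / (sqrt_kappa + 1.0)
beta = (sqrt_kappa - 1.0) / (sqrt_kappa + 1.0)


def grad_f(x):
    return [hi * xi for hi, xi in zip(h, x)]


def f_np(q, p):
    """Non-potential force: -2 d p - (1/L) (grad f(q + beta p) - grad f(q))."""
    shifted = grad_f([qi + beta * pi for qi, pi in zip(q, p)])
    base = grad_f(q)
    return [-2.0 * d * pi - (si - bi) / L for pi, si, bi in zip(p, shifted, base)]


def semi_implicit_euler_step(q, p, ts=1.0):
    """Update p from the current state, then move q using the new p."""
    force = [-gi / L + fi for gi, fi in zip(grad_f(q), f_np(q, p))]
    p_next = [pi + ts * fi for pi, fi in zip(p, force)]
    q_next = [qi + ts * pi for qi, pi in zip(q, p_next)]
    return q_next, p_next


def nesterov_step(x, x_prev):
    """Constant-momentum form: gradient step from an extrapolated point."""
    y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
    return [yi - gi / L for yi, gi in zip(y, grad_f(y))]


q, p = [1.0, 1.0], [0.0, 0.0]
x, x_prev = list(q), list(q)  # p_0 = 0 corresponds to x_{-1} = x_0
max_gap = 0.0
for _ in range(200):
    q, p = semi_implicit_euler_step(q, p)
    x, x_prev = nesterov_step(x, x_prev), x
    max_gap = max(max_gap, max(abs(a - b) for a, b in zip(q, x)))

print(max_gap)  # expected to be at rounding-error level
```

The momentum variable plays the role of the displacement $p_k = q_k - q_{k-1}$, which is why updating $p$ first and then moving $q$ with the new $p$ reproduces the two-step recursion.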
Conclusion and Outlook

We derived a dynamical system model for the accelerated gradient method.
◮ The dynamics have an interpretation as a mass-spring-damper system.
◮ Discretization yields the accelerated gradient method.
◮ Certain key properties are preserved through the discretization.

Is a symplectic discretization the “right” discretization?
◮ The behavior for large $\kappa$ seems particularly important.

Come visit me at Poster 205.