SLIDE 1 Escaping from saddle points on Riemannian manifolds
Yue Sun†, Nicolas Flammarion‡, Maryam Fazel†
† Department of Electrical and Computer Engineering, University of Washington,
Seattle
‡ School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland
November 8, 2019
SLIDE 2
Manifold constrained optimization
We consider the problem

$$\underset{x}{\text{minimize}} \; f(x), \quad \text{subject to } x \in \mathcal{M}.$$

As in Euclidean space, we generally cannot find a global optimum in polynomial time, so we aim for an approximate local minimum.

[Figures: a saddle point in Euclidean space; contour of function value on a sphere.]
SLIDE 3 Examples of manifolds
- 1. Sphere. $\{x \in \mathbb{R}^n : \sum_{i=1}^n x_i^2 = r^2\}$.
- 2. Stiefel manifold. $\{X \in \mathbb{R}^{m \times n} : X^T X = I\}$.
- 3. Grassmannian manifold. $\mathrm{Grass}(p, n)$ is the set of $p$-dimensional subspaces of $\mathbb{R}^n$.
- 4. Burer-Monteiro relaxation. $\{X \in \mathbb{R}^{m \times n} : \mathrm{diag}(X^T X) = \mathbf{1}\}$.
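As a quick illustration (our own sketch, not from the slides), the following numpy snippet constructs a point on the sphere, the Stiefel manifold, and the Burer-Monteiro constraint set and verifies each defining equation; a point of the Grassmannian is typically represented by such an orthonormal basis, identifying bases that span the same subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sphere: rescale a random vector to radius r.
r, n = 2.0, 5
x = rng.standard_normal(n)
x *= r / np.linalg.norm(x)
assert np.isclose(np.sum(x**2), r**2)

# Stiefel manifold: orthonormal columns via a QR factorization.
m, p = 6, 3
Q, _ = np.linalg.qr(rng.standard_normal((m, p)))
assert np.allclose(Q.T @ Q, np.eye(p))

# Burer-Monteiro relaxation: normalize each column to unit norm.
X = rng.standard_normal((m, p))
X /= np.linalg.norm(X, axis=0, keepdims=True)
assert np.allclose(np.diag(X.T @ X), np.ones(p))
```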
SLIDE 4
Curve
A curve is a continuous map $\gamma : t \mapsto \gamma(t) \in \mathcal{M}$. Usually $t \in [0, 1]$, where $\gamma(0)$ and $\gamma(1)$ are the start and end points of the curve.
SLIDE 5 Tangent vector and tangent space
We use

$$\dot\gamma(t) = \lim_{\tau \to 0} \frac{\gamma(t + \tau) - \gamma(t)}{\tau}$$

as the velocity of the curve; $\dot\gamma(t)$ is a tangent vector at $\gamma(t) \in \mathcal{M}$. A point $x \in \mathcal{M}$ can be the start point of many curves, and the tangent space $T_x\mathcal{M}$ is the set of all tangent vectors at $x$. The tangent space is a vector space, endowed with an inner product by the Riemannian metric.
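For instance (a standard computation, not on the slides): on the unit sphere $S^{n-1}$, every curve satisfies $\gamma(t)^T \gamma(t) = 1$; differentiating at $t = 0$ with $\gamma(0) = x$ gives

$$2\, x^T \dot\gamma(0) = 0, \qquad \text{so} \qquad T_x S^{n-1} = \{v \in \mathbb{R}^n : x^T v = 0\}.$$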
SLIDE 6 Gradient of a function
Let $f : \mathcal{M} \to \mathbb{R}$ be a function defined on $\mathcal{M}$, and let $\gamma$ be a curve on $\mathcal{M}$. The directional derivative of $f$ in the direction $\dot\gamma(0)$ is1

$$\dot\gamma(0)f = \frac{d(f(\gamma(t)))}{dt}\bigg|_{t=0} = \lim_{\tau \to 0} \frac{f(\gamma(0 + \tau)) - f(\gamma(0))}{\tau}.$$

Then we can define $\mathrm{grad} f(x) \in T_x\mathcal{M}$, which satisfies $\langle \mathrm{grad} f, y \rangle = yf$ for all $y \in T_x\mathcal{M}$.

1 Usually $\dot\gamma$ denotes the differential operator and $\gamma'$ denotes the tangent vector; they are closely related. To avoid confusion, we always use $\dot\gamma$.
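As a concrete example (standard for the embedded unit sphere, not from the slides): the Riemannian gradient is the tangent-space projection of the Euclidean gradient,

$$\mathrm{grad} f(x) = (I - xx^T)\, \nabla f(x),$$

since $\langle (I - xx^T)\nabla f(x), y\rangle = \langle \nabla f(x), y\rangle = yf$ for every $y \in T_xS^{n-1}$.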
SLIDE 7
Vector field
The gradient of a function is a special case of a vector field on a manifold. A vector field is a map assigning to each point of $\mathcal{M}$ a tangent vector at that point.
SLIDE 8 Connection
Denote the set of smooth vector fields on $\mathcal{M}$ by $\mathfrak{X}(\mathcal{M})$. A connection defines a directional derivative $\nabla : \mathfrak{X}(\mathcal{M}) \times \mathfrak{X}(\mathcal{M}) \to \mathfrak{X}(\mathcal{M})$ satisfying

$$\nabla_{fx+gy}u = f\nabla_x u + g\nabla_y u, \quad \nabla_x(au + bv) = a\nabla_x u + b\nabla_x v, \quad \nabla_x(fu) = (xf)u + f\nabla_x u.$$

Note that $\nabla_{e_i}u = \sum_j \big(\partial_i u_j\, e_j + u_j \nabla_{e_i} e_j\big)$. A special connection is the Riemannian (Levi-Civita) connection.
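A sanity check (our own illustration, not on the slides): on $\mathbb{R}^n$ with the Euclidean metric, the Levi-Civita connection satisfies $\nabla_{e_i} e_j = 0$, so the coordinate formula above reduces to the ordinary directional derivative,

$$\nabla_x u = \sum_{i,j} x_i\, (\partial_i u_j)\, e_j,$$

i.e., each component function $u_j$ is differentiated along $x$.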
SLIDE 9 Riemannian Hessian
The directional Hessian is defined as $H(x)[u] = \nabla_u \mathrm{grad} f(x)$ for $u \in T_x\mathcal{M}$.2 Similar to the gradient, we can define the Hessian from the directional version:

$$\langle H(x)u, v \rangle = \langle \nabla_u \mathrm{grad} f(x), v \rangle, \qquad \forall u, v \in T_x\mathcal{M}.$$

It is a symmetric operator.

2 In Riemannian geometry, one writes $u_x$ to indicate $u \in T_x\mathcal{M}$, and the directional Hessian is $\nabla_{u_x} \mathrm{grad} f$.
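As a concrete instance (the standard formula for the unit sphere embedded in $\mathbb{R}^n$, included here as an illustration): with the projector $P_x = I - xx^T$ and a smooth extension of $f$ with Euclidean gradient $\nabla f$ and Hessian $\nabla^2 f$,

$$H(x)[u] = P_x\, \nabla^2 f(x)\, u - (x^T \nabla f(x))\, u, \qquad u \in T_x S^{n-1}.$$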
SLIDE 10
Geodesic
A geodesic is a special class of curve on the manifold, satisfying the zero-acceleration condition

$$\nabla_{\dot\gamma(t)}\, \dot\gamma = 0.$$
SLIDE 11
Exponential map
For any $x \in \mathcal{M}$, $y \in T_x\mathcal{M}$, and the geodesic $\gamma$ defined by $\gamma(0) = x$, $\dot\gamma(0) = y$, we call the mapping $\mathrm{Exp}_x : T_x\mathcal{M} \to \mathcal{M}$ with $\mathrm{Exp}_x(y) = \gamma(1)$ the exponential map. There is a neighborhood of radius $\mathcal{I}$ (the injectivity radius) in $T_x\mathcal{M}$ such that, for all $y \in T_x\mathcal{M}$ with $\|y\| \leq \mathcal{I}$, the exponential map is a bijection/diffeomorphism.
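On the unit sphere, for instance, the exponential map has a closed form, since geodesics are great circles (a standard fact, shown here for concreteness):

$$\mathrm{Exp}_x(v) = \cos(\|v\|)\, x + \sin(\|v\|)\, \frac{v}{\|v\|}, \qquad 0 \neq v \in T_x S^{n-1}.$$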
SLIDE 12
Parallel transport
The parallel transport $\Gamma$ transports a tangent vector $w$ along a curve $\gamma$, satisfying the zero-acceleration condition

$$\nabla_{\dot\gamma(t)}\, w_t = 0, \qquad w_t = \Gamma_{\gamma(0)}^{\gamma(t)} w.$$
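As an illustration (a standard computation on the unit sphere, not from the slides): along the geodesic $\gamma(t) = \cos(t)\,x + \sin(t)\,u$ with unit-speed direction $u \in T_xS^{n-1}$, a tangent vector $w \in T_xS^{n-1}$ is transported by rotating its component along $u$ in the $x$-$u$ plane and leaving the orthogonal component unchanged:

$$w_t = w + (u^T w)\big((\cos t - 1)\, u - \sin t \; x\big).$$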
SLIDE 13 Curvature tensor
The curvature tensor describes how curved the manifold is; it relates to the second-order structure of the manifold. A definition via the connection is

$$R(x, y)w = \nabla_x\nabla_y w - \nabla_y\nabla_x w - \nabla_{[x,y]}w,$$

where $x, y, w$ are in the tangent space of the same point.3

3 $[x, y]$ is the Lie bracket, defined by $[x, y]f = xyf - yxf$.
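With this sign convention, a concrete instance (standard for constant-curvature spaces, added for illustration): on the unit sphere,

$$R(x, y)w = \langle y, w\rangle\, x - \langle x, w\rangle\, y,$$

so the sectional curvature $\langle R(x,y)y, x\rangle / (\|x\|^2\|y\|^2 - \langle x, y\rangle^2)$ is identically $1$.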
SLIDE 14
Curvature tensor
Equivalently, via parallel transport around a small parallelogram:

$$R(x, y)w = \lim_{t,\tau \to 0} \frac{\Gamma^{0,0}_{0,\tau y}\, \Gamma^{0,\tau y}_{tx,\tau y}\, \Gamma^{tx,\tau y}_{tx,0}\, \Gamma^{tx,0}_{0,0}\, w - w}{t\tau},$$

i.e., $w$ is transported around the loop $(0,0) \to (tx, 0) \to (tx, \tau y) \to (0, \tau y) \to (0,0)$, and the change is normalized by the enclosed area.
SLIDE 15 Smooth function on Riemannian manifold
We consider the manifold constrained optimization problem

$$\underset{x}{\text{minimize}} \; f(x), \quad \text{subject to } x \in \mathcal{M},$$

assuming the function and manifold satisfy:

- 1. There is a finite constant $\beta$ such that
  $$\|\mathrm{grad} f(y) - \Gamma_x^y \mathrm{grad} f(x)\| \leq \beta\, d(x, y)$$
  for all $x, y \in \mathcal{M}$.
- 2. There is a finite constant $\rho$ such that
  $$\|H(y) - \Gamma_x^y H(x) \Gamma_y^x\|_2 \leq \rho\, d(x, y)$$
  for all $x, y \in \mathcal{M}$.
- 3. There is a finite constant $K$ such that
  $$|R(x)[u, v]| \leq K$$
  for all $x \in \mathcal{M}$ and $u, v \in T_x\mathcal{M}$ (e.g., on the unit sphere one can take $K = 1$).

$f$ may not be convex.
SLIDE 16
Taylor expansion of a smooth function
For $x, y \in E$, a Euclidean space,

$$f(y) - f(x) = \langle y - x, \nabla f(x)\rangle + \frac{1}{2}\langle y - x, \nabla^2 f(x)(y - x)\rangle + \int_0^1 (1 - \tau)\,\big\langle y - x, \big(\nabla^2 f(x + \tau(y - x)) - \nabla^2 f(x)\big)(y - x)\big\rangle\, d\tau.$$

For $x, y \in \mathcal{M}$, let $\gamma$ denote the geodesic with $\gamma(0) = x$, $\gamma(1) = y$; then

$$f(y) - f(x) = \langle \dot\gamma(0), \mathrm{grad} f(x)\rangle + \frac{1}{2}\big\langle \dot\gamma(0), \nabla_{\dot\gamma(0)} \mathrm{grad} f\big\rangle + \bar\Delta,$$

where $\bar\Delta = \int_0^1 \Delta(\gamma(\tau))\, d\tau$ and

$$\Delta(\gamma(\tau)) = (1 - \tau)\,\big\langle \dot\gamma(0),\; \Gamma_{\gamma(\tau)}^x \nabla_{\dot\gamma(\tau)} \mathrm{grad} f - \nabla_{\dot\gamma(0)} \mathrm{grad} f \big\rangle.$$
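To make the expansion concrete, here is a small numerical check (our own sketch, not from the slides) on the unit sphere with $f(x) = x^T A x$, using the sphere gradient and Hessian formulas from the earlier examples; the gap between $f(\mathrm{Exp}_x(tv)) - f(x)$ and the second-order model should shrink like $t^3$.

```python
import numpy as np

# Compare f(Exp_x(t v)) - f(x) against <tv, grad f(x)> + 0.5 <tv, H(x)[tv]>
# for f(x) = x^T A x on the unit sphere; the gap decays like t^3.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
x = rng.standard_normal(n); x /= np.linalg.norm(x)
v = rng.standard_normal(n); v -= (x @ v) * x; v /= np.linalg.norm(v)

f = lambda y: y @ A @ y
grad = lambda y: (np.eye(n) - np.outer(y, y)) @ (2 * A @ y)
hess = lambda y, u: (np.eye(n) - np.outer(y, y)) @ (2 * A @ u) - (y @ (2 * A @ y)) * u

for t in [1e-1, 1e-2, 1e-3]:
    y = np.cos(t) * x + np.sin(t) * v                 # Exp_x(t v) on the sphere
    model = t * (v @ grad(x)) + 0.5 * t**2 * (v @ hess(x, v))
    print(t, abs(f(y) - f(x) - model))                # remainder ~ O(t^3)
```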
SLIDE 17
Riemannian gradient descent
On a smooth manifold, there exists a step size $\eta$ such that, if $x_{t+1} = \mathrm{Exp}_{x_t}(-\eta\, \mathrm{grad} f(x_t))$, then

$$f(x_{t+1}) \leq f(x_t) - \frac{\eta}{2}\|\mathrm{grad} f(x_t)\|^2.$$

Hence the iterates converge to a first-order stationary point.
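Below is a minimal numpy sketch of this iteration on the unit sphere (our own illustration; the objective $f(x) = x^T A x$ and the step size are illustrative choices, not from the slides).

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map on the unit sphere: follow the great circle from x along v."""
    nv = np.linalg.norm(v)
    return x if nv < 1e-12 else np.cos(nv) * x + np.sin(nv) * (v / nv)

def tangent_proj(x, g):
    """Project a Euclidean gradient onto the tangent space T_x S^{n-1}."""
    return g - (x @ g) * x

rng = np.random.default_rng(0)
n = 10
A = rng.standard_normal((n, n)); A = (A + A.T) / 2    # symmetric test matrix
x = rng.standard_normal(n); x /= np.linalg.norm(n * [1] and x)  # see note below

x = rng.standard_normal(n); x /= np.linalg.norm(x)    # random point on the sphere
eta = 0.05                                            # illustrative step size
for t in range(500):
    g = tangent_proj(x, 2 * A @ x)                    # Riemannian grad of f(x) = x^T A x
    x = sphere_exp(x, -eta * g)
# x now typically approximates an eigenvector for the smallest eigenvalue of A
```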
SLIDE 18 Proposed algorithm for escaping saddle points
We hope to escape from saddle points and converge to an approximate local minimum (a code sketch follows the list below).
- 1. At iterate $x$, check the norm of the gradient.
- 2. If large: do $x^+ = \mathrm{Exp}_x(-\eta\, \mathrm{grad} f(x))$ to decrease the function value.
- 3. If small: we are near either a saddle point or a local minimum. Perturb the iterate by adding appropriate noise and run a few iterations.
  - 3.1 If $f$ decreases, the iterates escape the saddle point (and the algorithm continues).
  - 3.2 If $f$ does not decrease, we are at an approximate local minimum (the algorithm terminates).
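Here is a hedged sketch of the loop above, instantiated on the unit sphere; all thresholds (eps, noise radius, escape iteration count) are illustrative placeholders rather than the constants quantified in the paper.

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map on the unit sphere (as in the previous sketch)."""
    nv = np.linalg.norm(v)
    return x if nv < 1e-12 else np.cos(nv) * x + np.sin(nv) * (v / nv)

def tangent_proj(x, g):
    """Project g onto the tangent space T_x S^{n-1}."""
    return g - (x @ g) * x

def perturbed_rgd(f, egrad, x0, eta=0.05, eps=1e-4, radius=1e-2,
                  escape_iters=100, max_iters=10_000, seed=0):
    """Steps 1-3 above on the unit sphere; egrad is the Euclidean gradient."""
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(max_iters):
        g = tangent_proj(x, egrad(x))
        if np.linalg.norm(g) > eps:
            x = sphere_exp(x, -eta * g)            # step 2: plain RGD step
        else:
            # step 3: perturb within the tangent space, then run a few steps
            noise = tangent_proj(x, rng.standard_normal(x.shape))
            noise *= radius / np.linalg.norm(noise)
            y, f_before = sphere_exp(x, noise), f(x)
            for _ in range(escape_iters):
                y = sphere_exp(y, -eta * tangent_proj(y, egrad(y)))
            if f(y) < f_before - eps:              # 3.1: escaped a saddle
                x = y
            else:                                  # 3.2: approximate local minimum
                return x
    return x
```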
SLIDE 19 Difficulty of second order analysis
- 1. First-order analysis relies on linearization; for second-order analysis, the second-order structure of the manifold comes into play as well.
- 2. Consider the power method in Euclidean space: we need to prove that the component of $x$ along the top eigenvector direction grows exponentially.
- 3. When the iteration is over the variable itself, we have to compare gradients living in different tangent spaces.
- 4. Some recent work requires strong assumptions, such as a flat manifold or a product manifold.
- 5. Other recent work assumes smoothness parameters of the composition of the function and a manifold operator, which are hard to check.
SLIDE 20
Useful lemmas
Let $x \in \mathcal{M}$ and $y, a \in T_x\mathcal{M}$. Denote $z = \mathrm{Exp}_x(a)$; then

$$d\big(\mathrm{Exp}_x(y + a),\; \mathrm{Exp}_z(\Gamma_x^z y)\big) \leq c(K)\, \min\{\|a\|, \|y\|\}\, (\|a\| + \|y\|)^2.$$
SLIDE 21
Useful lemmas
Holonomy:

$$\|\Gamma_z^x \Gamma_y^z \Gamma_x^y w - w\| \leq c(K)\, d(x, y)\, d(y, z)\, \|w\|.$$

Similar to the definition of the curvature tensor: if a vector is parallel transported around a closed curve, then the change is bounded by the area enclosed by the curve.
SLIDE 22 Useful lemmas
Euclidean: $f(x) = x^T H x \Rightarrow x^+ = (I - \eta H)x$; exponential growth happens in a vector space. Suppose the function $f$ is $\beta$-gradient Lipschitz and $\rho$-Hessian Lipschitz, the curvature constant is bounded by $K$, and $x$ is an $(\epsilon, -\sqrt{\hat\rho\epsilon})$ saddle point; define $u^+ = \mathrm{Exp}_u(-\eta\, \mathrm{grad} f(u))$ and $w^+ = \mathrm{Exp}_w(-\eta\, \mathrm{grad} f(w))$. In a small enough neighborhood,4

$$\big\|\mathrm{Exp}_x^{-1}(w^+) - \mathrm{Exp}_x^{-1}(u^+) - (I - \eta H(x))\big(\mathrm{Exp}_x^{-1}(w) - \mathrm{Exp}_x^{-1}(u)\big)\big\| \leq C(K, \rho, \beta)\, d(u, w)\, \big(d(u, w) + d(u, x) + d(w, x)\big)$$

for some explicit constant $C(K, \rho, \beta)$.

4 Quantified in the paper.
SLIDE 23 Theorem
Theorem (Jin et al., Euclidean space). Perturbed GD converges to an $(\epsilon, -\sqrt{\rho\epsilon})$-stationary point of $f$ in

$$O\!\left(\frac{\beta (f(x_0) - f(x^*))}{\epsilon^2} \log^4\!\left(\frac{\beta d (f(x_0) - f(x^*))}{\epsilon^2 \delta}\right)\right)$$

iterations. We replace the Hessian Lipschitz constant $\rho$ by $\hat\rho$, a function of $\rho$ and $K$, which we quantify in the paper.

Theorem (manifold). Perturbed RGD converges to an $(\epsilon, -\sqrt{\hat\rho(\rho, K)\epsilon})$-stationary point of $f$ in

$$O\!\left(\frac{\beta (f(x_0) - f(x^*))}{\epsilon^2} \log^4\!\left(\frac{\beta d (f(x_0) - f(x^*))}{\epsilon^2 \delta}\right)\right)$$

iterations.
SLIDE 24 Experiment
Burer-Monteiro factorization.
Let $A \in \mathbb{S}^{d \times d}$. The problem

$$\max_{X \in \mathbb{S}^{d \times d}} \mathrm{trace}(AX), \quad \text{s.t. } \mathrm{diag}(X) = \mathbf{1},\; X \succeq 0,\; \mathrm{rank}(X) \leq r,$$

can be factorized as

$$\max_{Y \in \mathbb{R}^{d \times p}} \mathrm{trace}(AYY^T), \quad \text{s.t. } \mathrm{diag}(YY^T) = \mathbf{1},$$

when $r(r + 1)/2 \leq d$ and $p(p + 1)/2 \geq d$.
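A minimal sketch of Riemannian gradient ascent for the factorized problem (our own illustration; for simplicity it uses the cheap row-normalization retraction rather than the exact exponential map, and the matrix $A$, step size, and iteration count are synthetic choices, not those of the experiment).

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 50, 10
A = rng.standard_normal((d, d)); A = (A + A.T) / 2   # synthetic symmetric A
Y = rng.standard_normal((d, p))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)        # rows on unit spheres

eta = 1e-3                                           # illustrative step size
for t in range(1000):
    egrad = 2 * A @ Y                                # Euclidean grad of trace(A Y Y^T)
    # Riemannian gradient: project each row onto its sphere's tangent space.
    rgrad = egrad - np.sum(egrad * Y, axis=1, keepdims=True) * Y
    Y += eta * rgrad                                 # ascent step...
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)    # ...then retract to the manifold
# np.trace(A @ Y @ Y.T) increases monotonically toward a stationary value
```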
[Figure: iteration versus function value.]