 
From Nesterov’s Estimate Sequence To Riemannian Acceleration
Kwangjun Ahn, Suvrit Sra
COLT 2020
arXiv: https://arxiv.org/abs/2001.08876
Riemannian Optimization?
• (Euclidean) Optimization: $\min_{x \in \mathbb{R}^n} f(x)$, where $f: \mathbb{R}^n \to \mathbb{R}$
• Riemannian Optimization: $\min_{x \in M} f(x)$, where $f: M \to \mathbb{R}$ and $M$ = a Riemannian manifold
Accel. Gradient Method!
• Yurii Nesterov, 80's
Accel. Gradient Descent: for $t = 0, 1, 2, \dots$
$$x_{t+1} = y_t + \alpha_{t+1}(z_t - y_t)$$
$$y_{t+1} = x_{t+1} - \gamma_{t+1}\nabla f(x_{t+1})$$
$$z_{t+1} = x_{t+1} + \beta_{t+1}(z_t - x_{t+1}) - \eta_{t+1}\nabla f(x_{t+1})$$
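As a minimal sketch of the three-sequence update on this slide, the code below runs it on a diagonal quadratic. The slide leaves the parameters $\alpha, \gamma, \beta, \eta$ unspecified, so the constant values used here (Nesterov's constant-step scheme with $\xi = \sqrt{\mu/L}$) are an assumption, as are the function name and the test problem.

```python
import math

def agd_diagonal_quadratic(diag, x0, num_iters):
    """Three-sequence AGD on f(x) = 0.5 * sum(a_i * x_i**2).

    Returns the objective values f(y_0), f(y_1), ..., f(y_T).
    """
    mu, L = min(diag), max(diag)      # strong convexity and smoothness constants
    xi = math.sqrt(mu / L)
    # Assumed constant parameters (Nesterov's constant scheme); not given on the slide.
    alpha, beta = xi / (1.0 + xi), 1.0 - xi
    gamma, eta = 1.0 / L, xi / mu

    def f(v):
        return 0.5 * sum(a * c * c for a, c in zip(diag, v))

    def grad(v):
        return [a * c for a, c in zip(diag, v)]

    y, z = list(x0), list(x0)
    values = [f(y)]
    for _ in range(num_iters):
        # x_{t+1} = y_t + alpha * (z_t - y_t)
        x = [yc + alpha * (zc - yc) for yc, zc in zip(y, z)]
        g = grad(x)
        # y_{t+1} = x_{t+1} - gamma * grad f(x_{t+1})
        y = [xc - gamma * gc for xc, gc in zip(x, g)]
        # z_{t+1} = x_{t+1} + beta * (z_t - x_{t+1}) - eta * grad f(x_{t+1})
        z = [xc + beta * (zc - xc) - eta * gc for xc, zc, gc in zip(x, z, g)]
        values.append(f(y))
    return values
```

With this parameter choice the values $f(y_t)$ should decay at least as fast as $(1 - \sqrt{\mu/L})^t$ up to a constant, e.g. `agd_diagonal_quadratic([1.0, 10.0, 100.0], [1.0, 1.0, 1.0], 60)`.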
Accel. Gradient Method: Theory
• Yurii Nesterov, 80's
Nesterov showed: for $\mu \preceq \nabla^2 f(x) \preceq L$,
$$f(y_t) - f(x^*) = O\big((1 - \sqrt{\mu/L})^t\big).$$
For an $\epsilon$-approx. solution, we only need $t = O\big(\sqrt{L/\mu}\,\log(1/\epsilon)\big)$.
Acceleration! (and indeed optimal for this class!)
C.f. Gradient Descent: for $\mu \preceq \nabla^2 f(x) \preceq L$,
$$f(x_t) - f(x^*) = O\big((1 - \mu/L)^t\big).$$
For an $\epsilon$-approx. solution, we need $O\big((L/\mu)\,\log(1/\epsilon)\big)$ many iterations.
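To get a feel for the gap between the two rates, here is a small sketch that counts how many iterations each contraction factor needs to reach a target $\epsilon$. It ignores the constants hidden in the $O(\cdot)$ bounds, and the function names are hypothetical.

```python
import math

def iterations_needed(contraction, eps):
    """Smallest integer t with (1 - contraction)**t <= eps."""
    return math.ceil(math.log(eps) / math.log(1.0 - contraction))

def compare_counts(mu, L, eps):
    """Iteration counts for the GD rate (mu/L) and the AGD rate (sqrt(mu/L))."""
    gd = iterations_needed(mu / L, eps)
    agd = iterations_needed(math.sqrt(mu / L), eps)
    return gd, agd
```

For a condition number $L/\mu = 10^4$, the ratio of the two counts comes out on the order of $\sqrt{L/\mu} = 100$, which is the speed-up the slide calls acceleration.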
Natural Question..
• Could we develop such a landmark result for curved spaces (Riem. manifolds)?
• Turns out to be a challenging question:
  • Liu et al. '17 (NIPS) reduces the task to solving nonlinear equations.
    • Not clear whether these equations are even feasible or tractably solvable.
  • Alimisis et al. '20 (AISTATS): continuous dynamics approach
    • Not clear whether the discretization yields accel.
  • Most concrete result: Zhang-Sra '18 (COLT)
    • proposed an alg. guaranteed to accel. locally.
• Global accel? Open!
Challenge!
• Nesterov's analysis is called the Estimate Sequence technique.
• Nesterov's analysis relies on linear structure!
  • Not clear if it generalizes to non-linear spaces like Riem. manifolds.
• Nesterov's analysis entails non-trivial algebraic tricks!
  • Hard to understand; its scope has puzzled researchers for years.
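Riemannian Accel. GD
(Euclidean) Accel. Gradient Descent:
$$x_{t+1} = y_t + \alpha_{t+1}(z_t - y_t)$$
$$y_{t+1} = x_{t+1} - \gamma_{t+1}\nabla f(x_{t+1})$$
$$z_{t+1} = x_{t+1} + \beta_{t+1}(z_t - x_{t+1}) - \eta_{t+1}\nabla f(x_{t+1})$$
Riemannian Accel. Gradient Descent:
$$x_{t+1} = \mathrm{Exp}_{y_t}\big(\alpha_{t+1} \cdot \mathrm{Exp}_{y_t}^{-1}(z_t)\big)$$
$$y_{t+1} = \mathrm{Exp}_{x_{t+1}}\big(-\gamma_{t+1} \cdot \nabla f(x_{t+1})\big)$$
$$z_{t+1} = \mathrm{Exp}_{x_{t+1}}\big(\beta_{t+1} \cdot \mathrm{Exp}_{x_{t+1}}^{-1}(z_t) - \eta_{t+1}\nabla f(x_{t+1})\big)$$
Space is curved, which causes “distortion”.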
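The Riemannian update can be tried out on the unit sphere, where Exp and its inverse have closed forms. The sketch below is an illustration under several assumptions, none of which come from the slide: the objective is a Rayleigh quotient $f(x) = \tfrac12 x^\top \mathrm{diag}(a)\, x$ (only locally geodesically strongly convex near its minimizer), the parameters are the constant Euclidean choice with $\xi = \sqrt{\mu/L}$ (so the distortion is ignored rather than estimated), and all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def exp_map(x, v):
    """Exp_x(v) on the unit sphere: follow the geodesic from x along tangent v."""
    n = math.sqrt(dot(v, v))
    if n == 0.0:
        return list(x)
    p = [math.cos(n) * a + math.sin(n) * b / n for a, b in zip(x, v)]
    s = math.sqrt(dot(p, p))                  # renormalize to limit round-off drift
    return [a / s for a in p]

def log_map(x, y):
    """Exp_x^{-1}(y): tangent vector at x pointing to y (zero if y == x)."""
    c = dot(x, y)
    w = [b - c * a for a, b in zip(x, y)]     # part of y orthogonal to x
    s = math.sqrt(dot(w, w))
    if s == 0.0:                              # y == x (antipodal points not handled)
        return [0.0] * len(x)
    theta = math.atan2(s, c)                  # geodesic distance d(x, y)
    return [theta * a / s for a in w]

def riem_grad(diag, x):
    """Riemannian gradient of f(x) = 0.5 * x^T diag(a) x on the sphere."""
    ax = [a * c for a, c in zip(diag, x)]
    r = dot(x, ax)
    return [g - r * c for g, c in zip(ax, x)]  # project onto the tangent space at x

def riemannian_agd(diag, x0, mu, L, num_iters):
    xi = math.sqrt(mu / L)
    # Assumed constant parameters; the distortion-dependent choice is not modeled.
    alpha, beta, gamma, eta = xi / (1.0 + xi), 1.0 - xi, 1.0 / L, xi / mu
    y, z = list(x0), list(x0)
    for _ in range(num_iters):
        x = exp_map(y, [alpha * c for c in log_map(y, z)])
        g = riem_grad(diag, x)
        y = exp_map(x, [-gamma * c for c in g])
        lz = log_map(x, z)
        z = exp_map(x, [beta * a - eta * b for a, b in zip(lz, g)])
    return y
```

For example, with `diag = [1.0, 2.0, 4.0]` the minimizer is $\pm e_1$, and the local constants at the minimizer are $\mu = 1$, $L = 3$; starting near $e_1$, the iterate `y` should approach it.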
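1. How does this affect the convergence rate?
• Non-linear recursive relation:
$$\frac{\xi_{t+1}(\xi_{t+1} - \mu/L)}{1 - \xi_{t+1}} = \frac{1}{\delta}\,\xi_t^2$$
• The more severe the distortion gets, the slower the convergence rate becomes!
• No matter how severe the distortion, Riem. AGD is always faster than RGD!
• To achieve full accel., i.e. $\sqrt{\mu/L}$, we need to bring $\delta$ down to 1!
[Figure: the curve $\tau(u) = \frac{u(u - \mu/L)}{1 - u}$ plotted against $\theta(u) = u^2$, $\theta(u) = \frac{1}{2}u^2$, and $\theta(u) = \frac{1}{5}u^2$ (label $\delta = 5$); axis marks at $\mu/L$, $\sqrt{\mu/L}$, and $1$.]
How do we control/estimate the distortion?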
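The three claims on this slide can be checked by solving for the fixed point of the recursion. The derivation below assumes the distortion is held at a constant $\delta \ge 1$ and $\mu < L$; it is a worked illustration, not a statement taken from the slides.

```latex
\begin{align*}
\frac{\xi(\xi - \mu/L)}{1 - \xi} = \frac{\xi^2}{\delta}
  &\iff \delta\,(\xi - \mu/L) = \xi\,(1 - \xi) \qquad (\xi \neq 0) \\
  &\iff \xi^2 + (\delta - 1)\,\xi - \delta\,\mu/L = 0 \\
  &\implies \xi(\delta) = \frac{\sqrt{(\delta - 1)^2 + 4\delta\,\mu/L} - (\delta - 1)}{2}.
\end{align*}
```

Setting $\delta = 1$ gives $\xi = \sqrt{\mu/L}$, the fully accelerated rate. For any $\delta$, the quadratic evaluated at $\xi = \mu/L$ equals $(\mu/L)^2 - \mu/L < 0$, so the positive root satisfies $\xi(\delta) > \mu/L$, the RGD rate. Finally, $\xi(\delta)$ decreases toward $\mu/L$ as $\delta \to \infty$, which is the sense in which more distortion means slower convergence.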
Global Accel for Riem. Case!
Thm. Given $\xi_0 > 0$, at iteration $t$ find $\xi_{t+1} \in (2\mu\Delta, 1)$ such that
$$\frac{\xi_{t+1}(\xi_{t+1} - 2\mu\Delta)}{1 - \xi_{t+1}} = \frac{1}{\delta_{t+1}}\,\xi_t^2,$$
where $\delta_{t+1} = T(d(x_t, z_t))$ is the magnitude of metric distortion at iteration $t$, for some computable function $T$. Then
$$f(y_{t+1}) - f(x^*) = O\big((1 - \xi_1)(1 - \xi_2)\cdots(1 - \xi_{t+1})\big)$$
s.t.
(1) $\xi_t > \mu/L$ for all $t$: strictly faster than nonaccel GD!
(2) $\xi_t$ quickly converges to $\sqrt{\mu/L}$: quickly achieves full acceleration!
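Finding $\xi_{t+1}$ from the recursion in the theorem amounts to solving a quadratic. Writing $s = \xi_t^2/\delta_{t+1}$ and $c = 2\mu\Delta$, the equation becomes $\xi^2 + (s - c)\,\xi - s = 0$, whose positive root lies in $(c, 1)$ whenever $s > 0$ and $c < 1$. The sketch below implements this step; the function names are hypothetical, and since the slide does not specify $T$, the distortion values are simply passed in by the caller.

```python
import math

def next_xi(xi_t, delta_next, c):
    """Solve xi*(xi - c)/(1 - xi) = xi_t**2 / delta_next for xi in (c, 1).

    c stands for 2*mu*Delta (it appears as mu/L in the earlier recursion).
    """
    s = xi_t ** 2 / delta_next
    b = s - c
    # Positive root of xi^2 + (s - c)*xi - s = 0.
    return 0.5 * (-b + math.sqrt(b * b + 4.0 * s))

def xi_sequence(xi0, deltas, c):
    """Iterate the recursion for a caller-supplied sequence of distortions."""
    xis = [xi0]
    for d in deltas:
        xis.append(next_xi(xis[-1], d, c))
    return xis
```

With all distortions equal to 1, the sequence settles at $\sqrt{c}$; with larger distortions it settles at a smaller value that still stays above $c$.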
Open problem
Obtaining acceleration in the non-strongly convex case?
Remarks
★ Using a strongly convex perturbation can be done
★ But it incurs an extra factor $O(\log 1/\epsilon)$
★ More crucially, our current proof needs to ensure all iterates remain within a set of specific size to be able to ensure acceleration. Removing this limitation is valuable.