From Nesterov's Estimate Sequence To Riemannian Acceleration


Kwangjun Ahn, Suvrit Sra. COLT 2020. arXiv: https://arxiv.org/abs/2001.08876


SLIDE 1

From Nesterov's Estimate Sequence To Riemannian Acceleration

Kwangjun Ahn, Suvrit Sra COLT 2020 arXiv: https://arxiv.org/abs/2001.08876

SLIDE 2

Riemannian Optimization?

  • (Euclidean) Optimization: $\min_{x \in \mathbb{R}^n} f(x)$, where $f: \mathbb{R}^n \to \mathbb{R}$
  • Riemannian Optimization: $\min_{x \in M} f(x)$, where $f: M \to \mathbb{R}$ and $M$ = a Riemannian manifold

SLIDE 3
  • Accel. Gradient Method!
  • Yurii Nesterov 80s
  • Accel. Gradient Descent:

For $t = 0, 1, 2, \dots$:

$$x_{t+1} = y_t + \alpha_{t+1}(z_t - y_t)$$
$$y_{t+1} = x_{t+1} - \gamma_{t+1}\nabla f(x_{t+1})$$
$$z_{t+1} = x_{t+1} + \beta_{t+1}(z_t - x_{t+1}) - \eta_{t+1}\nabla f(x_{t+1})$$
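A minimal sketch of the three-sequence update above, run on a toy quadratic. The slide does not give the parameter values; the constant choices below ($\alpha = \rho/(1+\rho)$, $\gamma = 1/L$, $\beta = 1-\rho$, $\eta = \rho/\mu$ with $\rho = \sqrt{\mu/L}$) are the standard constant-step estimate-sequence choices and are an assumption here, as are the function names and the test problem.

```python
import numpy as np

def agd(grad, y0, mu, L, iters):
    """Three-sequence accelerated gradient descent with constant parameters.

    The parameter values are an assumed standard choice, not taken from the slides.
    """
    rho = np.sqrt(mu / L)
    alpha = rho / (1.0 + rho)  # weight pulling x from y toward z
    gamma = 1.0 / L            # gradient step size for the y sequence
    beta = 1.0 - rho           # weight kept on the previous z
    eta = rho / mu             # gradient step size for the z sequence
    y = z = np.asarray(y0, dtype=float)
    for _ in range(iters):
        x = y + alpha * (z - y)
        g = grad(x)
        y = x - gamma * g
        z = x + beta * (z - x) - eta * g
    return y

def gd(grad, x0, L, iters):
    """Plain gradient descent with step size 1/L, for comparison."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - grad(x) / L
    return x

# Toy strongly convex quadratic f(x) = 0.5 * sum(lam * x**2), minimized at 0.
lam = np.array([1.0, 10.0, 100.0])  # so mu = 1 and L = 100
f = lambda x: 0.5 * np.sum(lam * x**2)
grad = lambda x: lam * x
x0 = np.ones(3)

err_agd = f(agd(grad, x0, mu=1.0, L=100.0, iters=100))
err_gd = f(gd(grad, x0, L=100.0, iters=100))
print("AGD error:", err_agd, "GD error:", err_gd)
```

On this problem the accelerated iterate should end up much closer to the optimum than plain gradient descent after the same number of gradient evaluations.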

SLIDE 4
  • Accel. Gradient Method: Theory
  • Yurii Nesterov 80s

Nesterov showed: for $\mu I \preceq \nabla^2 f(x) \preceq L I$,

$$f(y_t) - f(x^*) = O\big((1 - \sqrt{\mu/L})^t\big)$$

  • For an $\epsilon$-approx. solution, we only need $t = O\big(\sqrt{L/\mu}\,\log\tfrac{1}{\epsilon}\big)$.

Acceleration! (and indeed optimal for this class!)

C.f. Gradient Descent: for $\mu I \preceq \nabla^2 f(x) \preceq L I$,

$$f(x_t) - f(x^*) = O\big((1 - \mu/L)^t\big)$$

  • For an $\epsilon$-approx. solution, we need $O\big(\tfrac{L}{\mu}\log\tfrac{1}{\epsilon}\big)$ many iterations.
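To get a feel for the gap between the two rates on this slide, the sketch below counts how many iterations a contraction factor of $1 - \mu/L$ (gradient descent) or $1 - \sqrt{\mu/L}$ (accelerated) needs to reach a target accuracy. It ignores the constants hidden in the $O(\cdot)$ bounds, and the values of `mu`, `L` and `eps` are illustrative assumptions.

```python
import math

def iterations_needed(rate, eps):
    """Smallest integer t with (1 - rate)**t <= eps."""
    return math.ceil(math.log(eps) / math.log(1.0 - rate))

mu, L, eps = 1.0, 1e4, 1e-6  # illustrative values, not from the slides
t_gd = iterations_needed(mu / L, eps)              # rate 1 - mu/L
t_agd = iterations_needed(math.sqrt(mu / L), eps)  # rate 1 - sqrt(mu/L)
print("GD iterations:", t_gd, "AGD iterations:", t_agd)
```

With condition number $L/\mu = 10^4$, the ratio of the two counts is roughly $\sqrt{L/\mu} = 100$.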

SLIDE 5

Natural Question

Could we develop such a landmark result for curved spaces (Riem. manifolds)? This turns out to be a challenging question:

  • Li et al. 17 (NIPS) reduce the task to solving nonlinear equations. It is not clear whether these equations are even feasible or tractably solvable.
  • Alimisis et al. 20 (AISTATS): continuous dynamics approach. It is not clear whether the discretization yields accel.
  • Most concrete result: Zhang-Sra 18 (COLT) proposed an alg. guaranteed to accel. locally. Global accel.? Open!

SLIDE 6

Challenge!

  • Nesterov's analysis is called the Estimate Sequence technique.
  • Nesterov's analysis relies on linear structure! It is not clear if it generalizes to non-linear spaces like Riem. manifolds.
  • Nesterov's analysis entails non-trivial algebraic tricks! Hard to understand; its scope has puzzled researchers for years.

SLIDE 7

Riemannian Accel. GD

(Euclidean) Accel. Gradient Descent:

$$x_{t+1} = y_t + \alpha_{t+1}(z_t - y_t)$$
$$y_{t+1} = x_{t+1} - \gamma_{t+1}\nabla f(x_{t+1})$$
$$z_{t+1} = x_{t+1} + \beta_{t+1}(z_t - x_{t+1}) - \eta_{t+1}\nabla f(x_{t+1})$$

Riemannian Accel. Gradient Descent:

$$x_{t+1} = \mathrm{Exp}_{y_t}\big(\alpha_{t+1} \cdot \mathrm{Exp}_{y_t}^{-1}(z_t)\big)$$
$$y_{t+1} = \mathrm{Exp}_{x_{t+1}}\big(-\gamma_{t+1} \cdot \nabla f(x_{t+1})\big)$$
$$z_{t+1} = \mathrm{Exp}_{x_{t+1}}\big(\beta_{t+1} \cdot \mathrm{Exp}_{x_{t+1}}^{-1}(z_t) - \eta_{t+1}\nabla f(x_{t+1})\big)$$

Space is curved, which causes "distortion".
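As a rough illustration of replacing straight-line steps with exponential and inverse-exponential (log) maps, here is a sketch on the unit sphere. Everything specific is an assumption made for illustration: the sphere as the manifold, the objective (mean squared geodesic distance to a few nearby points, i.e. a Karcher-mean problem), the function names, and above all the fixed parameters. The method in the talk chooses its parameters adaptively from the distortion, which this sketch does not attempt.

```python
import numpy as np

def exp_map(x, v):
    """Exponential map on the unit sphere: move from x along tangent vector v."""
    nv = np.linalg.norm(v)
    if nv < 1e-16:
        return x
    y = np.cos(nv) * x + np.sin(nv) * (v / nv)
    return y / np.linalg.norm(y)  # guard against round-off drift

def log_map(x, y):
    """Inverse exponential map: tangent vector at x pointing to y."""
    u = y - np.dot(x, y) * x      # component of y tangent to the sphere at x
    nu = np.linalg.norm(u)
    if nu < 1e-16:
        return np.zeros_like(x)
    theta = np.arctan2(nu, np.dot(x, y))  # geodesic distance, stable for small angles
    return (theta / nu) * u

def riem_grad(x, pts):
    """Riemannian gradient of f(x) = (1/2n) * sum_i d(x, p_i)^2."""
    return -np.mean([log_map(x, p) for p in pts], axis=0)

def riemannian_agd(pts, x0, alpha, beta, gamma, eta, iters):
    """Fixed-parameter version of the three-sequence Riemannian update (a simplification)."""
    y = z = x0
    for _ in range(iters):
        x = exp_map(y, alpha * log_map(y, z))
        g = riem_grad(x, pts)
        y_next = exp_map(x, -gamma * g)
        z = exp_map(x, beta * log_map(x, z) - eta * g)
        y = y_next
    return y

def normalize(v):
    return v / np.linalg.norm(v)

# A small cluster of points around the north pole (illustrative data).
center = np.array([0.0, 0.0, 1.0])
offsets = np.array([[1.0, 0.0, 0.0], [-0.5, 0.8, 0.0],
                    [0.2, -0.9, 0.0], [-0.7, -0.3, 0.0]])
pts = [normalize(center + 0.1 * o) for o in offsets]

mu, L = 0.5, 1.0          # assumed (conservative) curvature-dependent constants
rho = np.sqrt(mu / L)
y_final = riemannian_agd(pts, pts[0], alpha=rho / (1 + rho), beta=1 - rho,
                         gamma=1 / L, eta=rho / mu, iters=50)
grad_norm = np.linalg.norm(riem_grad(y_final, pts))
print("final gradient norm:", grad_norm)
```

Every iterate stays on the sphere because each update is an exponential map of a tangent vector at the current base point, which is the structural change relative to the Euclidean scheme.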

SLIDE 8

1. How does this affect the convergence rate?

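  • Non-linear recursive relation:

$$\frac{\xi_{t+1}(\xi_{t+1} - 2\mu\Delta)}{1 - \xi_{t+1}} = \frac{1}{\delta}\,\xi_t^2$$

[Figure: graphical solution of the recursion against the curves $u^2$, $\tfrac{1}{2}u^2$ and $\tfrac{1}{5}u^2$ (the last labeled $\delta = 5$), with $\mu/L$ and 1 marked on the axes.]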

  • The more severe the distortion gets, the slower the convergence rate becomes!
  • No matter how severe the distortion, Riem. AGD is always faster than RGD!

To achieve full accel., i.e. $\sqrt{\mu/L}$, we need to bring $\delta$ down to 1! How do we control/estimate the distortion?
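The effect of the distortion on the rate can be checked numerically. The sketch below iterates the recursion $\xi_{t+1}(\xi_{t+1} - c)/(1 - \xi_{t+1}) = \xi_t^2/\delta$ for a constant distortion $\delta$, where $c$ stands for $2\mu\Delta$. The value $c = 0.01$, the starting point and the function names are illustrative assumptions.

```python
import math

def next_xi(xi, delta, c):
    """Root in (c, 1) of xi_next * (xi_next - c) / (1 - xi_next) = xi**2 / delta."""
    s = xi * xi / delta
    # Clearing the denominator gives xi_next**2 + (s - c) * xi_next - s = 0;
    # the positive root of this quadratic is the one lying in (c, 1).
    return 0.5 * ((c - s) + math.sqrt((s - c) ** 2 + 4.0 * s))

def limiting_xi(delta, c, xi0=0.5, iters=2000):
    """Iterate the recursion with a constant distortion delta until it settles."""
    xi = xi0
    for _ in range(iters):
        xi = next_xi(xi, delta, c)
    return xi

c = 0.01  # stands for 2 * mu * Delta; illustrative value
limits = {delta: limiting_xi(delta, c) for delta in (1.0, 2.0, 5.0)}
for delta, xi in limits.items():
    print("delta =", delta, "limiting xi =", xi)
```

With $\delta = 1$ the limit is $\sqrt{c}$, and it shrinks as $\delta$ grows while staying above $c$, which matches the two bullet points on this slide.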

SLIDE 9

Global Accel. for Riem. Case!

Thm 2. Given $\xi_0 > 0$, find $\xi_{t+1} \in (2\mu\Delta, 1)$ such that

$$\frac{\xi_{t+1}(\xi_{t+1} - 2\mu\Delta)}{1 - \xi_{t+1}} = \frac{1}{\delta_{t+1}}\,\xi_t^2,$$

where $\delta_{t+1} = T(d(x_t, z_t))$, the magnitude of metric distortion at iteration $t$, for some computable function $T$. Then

$$f(y_{t+1}) - f(x^*) = O\big((1 - \xi_1)(1 - \xi_2)\cdots(1 - \xi_{t+1})\big)$$

s.t.
(1) $\xi_t > \mu/L$ for all $t$: strictly faster than nonaccel. GD!
(2) $\xi_t$ quickly converges to $\sqrt{\mu/L}$: quickly achieves full acceleration!
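A small numerical illustration of how the theorem's recursion behaves when the distortion shrinks over time. The distortion sequence here is made up: $\delta_t = 1 + 4 \cdot 0.5^t$ is a hypothetical stand-in for $T(d(x_t, z_t))$, whose actual form is not given on the slide. The constant $c$ stands for $2\mu\Delta$, and its value and the starting $\xi_0$ are also assumptions.

```python
import math

def next_xi(xi, delta, c):
    """Root in (c, 1) of xi_next * (xi_next - c) / (1 - xi_next) = xi**2 / delta."""
    s = xi * xi / delta
    return 0.5 * ((c - s) + math.sqrt((s - c) ** 2 + 4.0 * s))

c = 0.01   # stands for 2 * mu * Delta; illustrative value
xi = 0.5   # xi_0, illustrative
xis = []
for t in range(300):
    delta = 1.0 + 4.0 * 0.5 ** t  # hypothetical distortion decaying to 1
    xi = next_xi(xi, delta, c)
    xis.append(xi)

accel_bound = math.prod(1.0 - v for v in xis)  # product appearing in the theorem
plain_bound = (1.0 - c) ** len(xis)            # what a constant rate of 1 - c would give
print("last xi:", xis[-1], "sqrt(c):", math.sqrt(c))
print("product bound:", accel_bound, "constant-rate bound:", plain_bound)
```

Under these assumptions every $\xi_t$ stays above $c$, so the product is strictly smaller than $(1 - c)^t$, and $\xi_t$ approaches $\sqrt{c}$ once the distortion has decayed.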

SLIDE 10

Open problem

Obtaining acceleration in the non-strongly convex case?

Remarks

★ Using strongly convex perturbation can be done
★ But, extra $O(\log 1/\epsilon)$ factor
★ More crucially, our current proof needs to ensure all iterates remain within a set of specific size to be able to ensure acceleration. Removing this limitation is valuable.