SLIDE 1

Escaping from saddle points on Riemannian manifolds

Yue Sun†, Nicolas Flammarion‡, Maryam Fazel†

† Department of Electrical and Computer Engineering, University of Washington, Seattle

‡ School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland

November 8, 2019

SLIDE 2

Manifold constrained optimization

We consider the problem

    minimize_x f(x)  subject to x ∈ M.

As in the Euclidean case, we generally cannot find a global optimum in polynomial time, so we aim for an approximate local minimum.

[Figures: plot of a saddle point in Euclidean space; contour of the function value on a sphere.]

SLIDE 3

Example of manifolds

  • 1. Sphere. {x ∈ R^d : Σ_{i=1}^d x_i² = r²}.
  • 2. Stiefel manifold. {X ∈ R^{m×n} : XᵀX = I}.
  • 3. Grassmannian manifold. Grass(p, n) is the set of p-dimensional subspaces of R^n.
  • 4. Burer-Monteiro relaxation. {X ∈ R^{m×n} : diag(XᵀX) = 1}.
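These constraint sets are easy to construct numerically. A small illustrative sketch (not from the paper) that builds points on the first, second, and fourth examples with NumPy and checks the defining equations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sphere {x in R^d : sum_i x_i^2 = r^2}: scale a random vector.
r = 2.0
x = rng.standard_normal(5)
x = r * x / np.linalg.norm(x)
assert np.isclose(np.sum(x**2), r**2)

# Stiefel manifold {X in R^{m x n} : X^T X = I}: the Q factor of a
# QR decomposition of a random m x n matrix is a Stiefel point.
X, _ = np.linalg.qr(rng.standard_normal((6, 3)))
assert np.allclose(X.T @ X, np.eye(3), atol=1e-12)

# Burer-Monteiro set {X : diag(X^T X) = 1}: normalize each column.
Y = rng.standard_normal((6, 3))
Y = Y / np.linalg.norm(Y, axis=0, keepdims=True)
assert np.allclose(np.diag(Y.T @ Y), 1.0)
```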

SLIDE 4

Curve

A curve is a continuous map γ : t → M. Usually t ∈ [0, 1], where γ(0) and γ(1) are the start and end points of the curve.

SLIDE 5

Tangent vector and tangent space

We use

    γ̇(t) = lim_{τ→0} (γ(t + τ) − γ(t)) / τ

as the velocity of the curve; γ̇(t) is a tangent vector at γ(t) ∈ M. A point x ∈ M can be the start point of many curves, and the tangent space T_x M is the set of all tangent vectors at x. The tangent space is a vector space equipped with a metric (an inner product).
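On the unit sphere, for instance, the velocity of a great circle is orthogonal to the foot point, which is exactly the tangent-space condition T_x M = {v : ⟨x, v⟩ = 0}. A small finite-difference sketch (illustrative):

```python
import numpy as np

# Curve on the unit sphere: a great circle gamma(t) = (cos t, sin t, 0).
def gamma(t):
    return np.array([np.cos(t), np.sin(t), 0.0])

# Finite-difference approximation of the velocity (a tangent vector).
t, tau = 0.7, 1e-6
v = (gamma(t + tau) - gamma(t)) / tau

# On the sphere, T_x M = {v : <x, v> = 0}: the velocity is
# (numerically) orthogonal to the foot point gamma(t).
assert abs(np.dot(gamma(t), v)) < 1e-5
```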

SLIDE 6

Gradient of a function

Let f : M → R be a function defined on M, and γ a curve on M. The directional derivative of f in direction γ̇(0) is¹

    γ̇(0)f = d(f(γ(t)))/dt |_{t=0} = lim_{τ→0} (f(γ(0 + τ)) − f(γ(0))) / τ.

Then we can define gradf(x) ∈ T_x M as the vector satisfying ⟨gradf, y⟩ = yf for all y ∈ T_x M.

¹ Usually γ̇ denotes the differential operator and γ′ denotes the tangent vector. They are closely related; to avoid confusion, we always use γ̇.
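For a manifold embedded in Euclidean space, gradf(x) is the projection of the Euclidean gradient onto T_x M. A sketch on the unit sphere for f(x) = xᵀAx (an illustrative choice of f, not from the slides):

```python
import numpy as np

# f(x) = x^T A x restricted to the unit sphere.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)); A = (A + A.T) / 2

def egrad(x):                     # Euclidean gradient of x^T A x
    return 2 * A @ x

def rgrad(x):                     # Riemannian gradient on the sphere:
    g = egrad(x)                  # project onto T_x M by subtracting
    return g - np.dot(x, g) * x   # the component normal to the sphere

x = rng.standard_normal(4); x /= np.linalg.norm(x)
g = rgrad(x)
# gradf(x) lies in the tangent space: <x, gradf(x)> = 0.
assert abs(np.dot(x, g)) < 1e-12
```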

SLIDE 7

Vector field

The gradient of a function is a special case of a vector field on a manifold. A vector field maps each point of M to a tangent vector at that point.

SLIDE 8

Connection

Denote the set of smooth vector fields on M by X(M). A connection defines a directional derivative ∇ : X(M) × X(M) → X(M) satisfying

    ∇_{fx+gy} u = f ∇_x u + g ∇_y u,
    ∇_x(au + bv) = a ∇_x u + b ∇_x v,
    ∇_x(fu) = (xf) u + f ∇_x u.

Note that ∇_{e_i} u = Σ_j (∂_i u^j e_j + u^j ∇_{e_i} e_j). A special connection is the Riemannian (Levi-Civita) connection.

SLIDE 9

Riemannian Hessian

The directional Hessian is defined as H(x)[u] = ∇_u gradf(x) for u ∈ T_x M². As with the gradient, we can define the Hessian from its directional version:

    ⟨H(x)u, v⟩ = ⟨∇_u gradf(x), v⟩,  ∀ u, v ∈ T_x M.

It is a symmetric operator.

² In Riemannian geometry, one writes u_x to indicate u ∈ T_x M, and the directional Hessian is ∇_{u_x} gradf.

SLIDE 10

Geodesic

A geodesic is a special class of curves on the manifold, satisfying the zero-acceleration condition ∇_{γ̇(t)} γ̇ = 0.

SLIDE 11

Exponential map

For any x ∈ M and y ∈ T_x M, let γ be the geodesic defined by y, i.e. γ(0) = x, γ̇(0) = y. We call the mapping Exp_x : T_x M → M with Exp_x(y) = γ(1) the exponential map. There is a neighborhood of radius I (the injectivity radius) in T_x M such that for all y ∈ T_x M with ‖y‖ ≤ I, the exponential map is a bijection (in fact a diffeomorphism).
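On the unit sphere the exponential map has the closed form Exp_x(y) = cos(‖y‖) x + sin(‖y‖) y/‖y‖, following the great circle through x in direction y. A quick numerical sketch (illustrative):

```python
import numpy as np

# Exponential map on the unit sphere: follow the great circle from x
# in tangent direction y for arc length ||y||.
def exp_map(x, y):
    n = np.linalg.norm(y)
    if n < 1e-12:
        return x
    return np.cos(n) * x + np.sin(n) * (y / n)

rng = np.random.default_rng(2)
x = rng.standard_normal(3); x /= np.linalg.norm(x)
v = rng.standard_normal(3); v -= np.dot(x, v) * x   # tangent vector
v = 0.5 * v / np.linalg.norm(v)                     # arc length 0.5

z = exp_map(x, v)
assert np.isclose(np.linalg.norm(z), 1.0)           # stays on the sphere
# Geodesic distance from x equals the tangent vector's length.
assert np.isclose(np.arccos(np.clip(np.dot(x, z), -1, 1)), 0.5)
assert np.allclose(exp_map(x, 0 * v), x)            # Exp_x(0) = x
```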

SLIDE 12

Parallel transport

The parallel transport Γ transports a tangent vector w along a curve γ, satisfying the zero-acceleration condition ∇_{γ̇(t)} w_t = 0, where w_t = Γ_{γ(0)}^{γ(t)} w.
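On the unit sphere, transport along the great circle γ(t) = cos(t) x + sin(t) u has a closed form: the component of w along u rotates with the curve, and the orthogonal component is unchanged. A sketch (illustrative) verifying that the transported vector stays tangent and keeps its norm:

```python
import numpy as np

# Parallel transport on the unit sphere along the geodesic
# gamma(t) = cos(t) x + sin(t) u, with u a unit tangent vector at x.
def transport(x, u, w, t):
    a = np.dot(u, w)  # component of w along the geodesic direction
    return w + a * ((np.cos(t) - 1) * u - np.sin(t) * x)

rng = np.random.default_rng(3)
x = rng.standard_normal(3); x /= np.linalg.norm(x)
u = rng.standard_normal(3); u -= np.dot(x, u) * x; u /= np.linalg.norm(u)
w = rng.standard_normal(3); w -= np.dot(x, w) * x   # tangent at x

t = 0.9
gt = np.cos(t) * x + np.sin(t) * u                  # gamma(t)
wt = transport(x, u, w, t)
assert abs(np.dot(gt, wt)) < 1e-10                  # still tangent
assert np.isclose(np.linalg.norm(wt), np.linalg.norm(w))  # isometry
```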

SLIDE 13

Curvature tensor

The curvature tensor describes how curved the manifold is; it relates to the second-order structure of the manifold. A definition in terms of the connection is

    R(x, y)w = ∇_x ∇_y w − ∇_y ∇_x w − ∇_{[x,y]} w,

where x, y, w are tangent vectors at the same point³.

³ [x, y] is the Lie bracket, defined by [x, y]f = xyf − yxf.

SLIDE 14

Curvature tensor

Equivalently, the curvature tensor measures the leading-order change of a vector parallel transported around a small loop:

    R(x, y)w = lim_{t,τ→0} (Γ_{0,τy}^{0,0} Γ_{tx,τy}^{0,τy} Γ_{tx,0}^{tx,τy} Γ_{0,0}^{tx,0} w − w) / (tτ),

where w is transported around the small parallelogram with corners 0, tx, tx + τy, τy in T_x M (identified with points on M via the exponential map).

SLIDE 15

Smooth function on Riemannian manifold

We consider the manifold-constrained optimization problem

    minimize_x f(x)  subject to x ∈ M,

assuming the function and manifold satisfy:

  • 1. There is a finite constant β such that

        ‖gradf(y) − Γ_x^y gradf(x)‖ ≤ β d(x, y)  for all x, y ∈ M.

  • 2. There is a finite constant ρ such that

        ‖H(y) − Γ_x^y H(x) Γ_y^x‖₂ ≤ ρ d(x, y)  for all x, y ∈ M.

  • 3. There is a finite constant K such that

        |R(x)[u, v]| ≤ K  for all x ∈ M and u, v ∈ T_x M.

f may not be convex.

SLIDE 16

Taylor expansion of a smooth function

For x, y in a Euclidean space E,

    f(y) − f(x) = ⟨y − x, ∇f(x)⟩ + (1/2)⟨y − x, ∇²f(x)(y − x)⟩
                  + ∫₀¹ (1 − τ)⟨y − x, (∇²f(x + τ(y − x)) − ∇²f(x))(y − x)⟩ dτ.

For x, y ∈ M, let γ denote the geodesic with γ(0) = x, γ(1) = y. Then

    f(y) − f(x) = ⟨γ̇(0), gradf(x)⟩ + (1/2)⟨γ̇(0), ∇_{γ̇(0)} gradf⟩ + ∆,

where ∆ = ∫₀¹ ∆(γ(τ)) dτ with

    ∆(γ(τ)) = (1 − τ)⟨γ̇(0), Γ_{γ(τ)}^x ∇_{γ̇(τ)} gradf − ∇_{γ̇(0)} gradf⟩.
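For the unit sphere with f(x) = xᵀAx (an illustrative choice), the second-order model above can be checked numerically: it should match f(Exp_x(tv)) up to O(t³). A sketch, using the sphere's closed-form Hessian for this f as an assumption verified only numerically here:

```python
import numpy as np

# Check: f(Exp_x(t v)) = f(x) + t <v, gradf(x)>
#                        + (t^2/2) <v, H(x)[v]> + O(t^3), on the sphere.
rng = np.random.default_rng(6)
A = rng.standard_normal((4, 4)); A = (A + A.T) / 2

def f(x): return x @ A @ x
def rgrad(x):
    g = 2 * A @ x
    return g - np.dot(x, g) * x
def hess_vec(x, v):
    # Riemannian Hessian of x^T A x on the sphere (closed form):
    # H(x)[v] = proj_x( 2(A v - (x^T A x) v) ).
    h = 2 * (A @ v - (x @ A @ x) * v)
    return h - np.dot(x, h) * x
def exp_map(x, y):
    n = np.linalg.norm(y)
    return x if n < 1e-12 else np.cos(n) * x + np.sin(n) * y / n

x = rng.standard_normal(4); x /= np.linalg.norm(x)
v = rng.standard_normal(4); v -= np.dot(x, v) * x; v /= np.linalg.norm(v)

t = 1e-3
lhs = f(exp_map(x, t * v))
rhs = f(x) + t * np.dot(v, rgrad(x)) + 0.5 * t**2 * np.dot(v, hess_vec(x, v))
assert abs(lhs - rhs) < 1e-7    # discrepancy is O(t^3)
```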

SLIDE 17

Riemannian gradient descent

On a smooth manifold there exists a step size η such that, if x_{t+1} = Exp_{x_t}(−η gradf(x_t)), then f(x_{t+1}) ≤ f(x_t) − (η/2)‖gradf(x_t)‖². This gives convergence to a first-order stationary point.
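A minimal sketch of this iteration on the unit sphere, minimizing f(x) = xᵀAx (an illustrative objective; η = 0.1 with A normalized is an assumption, not a tuned constant):

```python
import numpy as np

# Riemannian gradient descent for f(x) = x^T A x on the unit sphere.
# Minimizers are eigenvectors for the smallest eigenvalue of A.
rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5)); A = (A + A.T) / 2
A /= np.linalg.norm(A, 2)              # normalize so eta = 0.1 is safe

def rgrad(x):
    g = 2 * A @ x
    return g - np.dot(x, g) * x        # project onto T_x M

def exp_map(x, y):
    n = np.linalg.norm(y)
    return x if n < 1e-12 else np.cos(n) * x + np.sin(n) * y / n

eta = 0.1
x = rng.standard_normal(5); x /= np.linalg.norm(x)
vals = [x @ A @ x]
for _ in range(20000):
    x = exp_map(x, -eta * rgrad(x))    # x_{t+1} = Exp_{x_t}(-eta grad f)
    vals.append(x @ A @ x)

assert vals[-1] <= vals[0] + 1e-12     # function value decreased
assert np.linalg.norm(rgrad(x)) < 1e-4 # near first-order stationarity
```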

SLIDE 18

Proposed algorithm for escaping saddle

We hope to escape from saddle points and converge to an approximate local minimum.

  • 1. At iterate x, check the norm of the gradient.
  • 2. If large: take the step x+ = Exp_x(−η gradf(x)) to decrease the function value.
  • 3. If small: we are near either a saddle point or a local minimum. Perturb the iterate by adding appropriate noise and run a few iterations.
       3.1 If f decreases, the iterates escape the saddle point (and the algorithm continues).
       3.2 If f does not decrease, we are at an approximate local minimum (the algorithm terminates).

SLIDE 19

Difficulty of second order analysis

  • 1. First-order analysis relies on linearization; for second-order analysis, the manifold's own second-order structure comes into play.
  • 2. Consider the power method in Euclidean space: one proves that the component of x along the leading eigenvector grows exponentially.
  • 3. For an iteration on the manifold, we have to compare gradients living in different tangent spaces.
  • 4. Some recent work requires strong assumptions, such as a flat manifold or a product manifold.
  • 5. Other recent work assumes smoothness parameters of the composition of the function and manifold operations, which are hard to check.

SLIDE 20

Useful lemmas

Let x ∈ M and y, a ∈ T_x M, and denote z = Exp_x(a). Then

    d(Exp_x(y + a), Exp_z(Γ_x^z y)) ≤ c(K) min{‖a‖, ‖y‖} (‖a‖ + ‖y‖)².

SLIDE 21

Useful lemmas

Holonomy.

    ‖Γ_z^x Γ_y^z Γ_x^y w − w‖ ≤ c(K) d(x, y) d(y, z) ‖w‖.

Similar to the definition of the curvature tensor: if a vector is parallel transported around a closed curve, the change is bounded by the area enclosed by the curve.

SLIDE 22

Useful lemmas

Euclidean: f(x) = xᵀHx ⇒ x+ = (I − ηH)x, giving exponential growth in a vector space. Suppose f is β-gradient Lipschitz and ρ-Hessian Lipschitz, the curvature is bounded by K, and x is an (ε, −√(ρ̂ε)) saddle point. Define u+ = Exp_u(−η gradf(u)) and w+ = Exp_w(−η gradf(w)). In a small enough neighborhood⁴,

    ‖Exp_x⁻¹(w+) − Exp_x⁻¹(u+) − (I − ηH(x))(Exp_x⁻¹(w) − Exp_x⁻¹(u))‖
        ≤ C(K, ρ, β) d(u, w) (d(u, w) + d(u, x) + d(w, x))

for some explicit constant C(K, ρ, β).

⁴ Quantified in the paper.

SLIDE 23

Theorem

Theorem (Jin et al., Euclidean space). Perturbed GD converges to an (ε, −√(ρε))-stationary point of f in

    O( (β(f(x₀) − f(x*)) / ε²) · log⁴( βd(f(x₀) − f(x*)) / (ε²δ) ) )

iterations.

We replace the Hessian Lipschitz constant ρ by ρ̂, a function of ρ and K, which we quantify in the paper.

Theorem (manifold). Perturbed RGD converges to an (ε, −√(ρ̂(ρ, K)ε))-stationary point of f in

    O( (β(f(x₀) − f(x*)) / ε²) · log⁴( βd(f(x₀) − f(x*)) / (ε²δ) ) )

iterations.

SLIDE 24

Experiment

Burer-Monteiro factorization.

Let A ∈ S^{d×d}. The problem

    max_{X ∈ S^{d×d}} trace(AX)  s.t. diag(X) = 1, X ⪰ 0, rank(X) ≤ r

can be factorized as

    max_{Y ∈ R^{d×p}} trace(AYYᵀ)  s.t. diag(YYᵀ) = 1,

when r(r + 1)/2 ≤ d and p(p + 1)/2 ≥ d.
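A sketch of Riemannian gradient ascent for the factorized problem (not the paper's experiment; problem size and step size are illustrative, and row normalization is used as a simple retraction in place of the exponential map):

```python
import numpy as np

# Maximize trace(A Y Y^T) subject to unit-norm rows of Y,
# i.e. over a product of spheres.
rng = np.random.default_rng(5)
d, p = 20, 5
A = rng.standard_normal((d, d)); A = (A + A.T) / 2

Y = rng.standard_normal((d, p))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)       # feasible start

eta = 0.01 / np.linalg.norm(A, 2)
obj = [np.trace(A @ Y @ Y.T)]
for _ in range(500):
    G = 2 * A @ Y                                   # Euclidean gradient
    G -= np.sum(G * Y, axis=1, keepdims=True) * Y   # project each row
    Y = Y + eta * G                                 # ascent step
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)   # retract to diag(YY^T)=1
    obj.append(np.trace(A @ Y @ Y.T))

assert obj[-1] > obj[0]                             # objective increased
assert np.allclose(np.diag(Y @ Y.T), 1.0)           # constraint holds
```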

[Figure: iteration versus function value.]