Stochastic gradient descent on Riemannian manifolds


  1. Stochastic gradient descent on Riemannian manifolds
  Silvère Bonnabel
  Centre de Robotique - Mathématiques et Systèmes, "École des Mines de Paris" (Mines ParisTech, PSL Research University)
  Journées du GDR ISIS, Paris, 20 November 2014
  silvere.bonnabel@mines-paristech

  2. Introduction
  • We proposed a stochastic gradient algorithm on a specific manifold for matrix regression in: Regression on fixed-rank positive semidefinite matrices: a Riemannian approach, Meyer, Bonnabel and Sepulchre, Journal of Machine Learning Research, 2011.
  • It competed with the then state of the art for low-rank Mahalanobis distance and kernel learning.
  • Convergence was then left as an open question.
  • The material of today's presentation is the paper Stochastic gradient descent on Riemannian manifolds, IEEE Trans. on Automatic Control, 2013.
  • Bottou and Bousquet have recently popularized SGD in machine learning, as randomly picking the data is a way to handle ever-increasing datasets.

  3. Outline
  1. Stochastic gradient descent
     • Introduction and examples
     • Standard convergence analysis (due to L. Bottou)
  2. Stochastic gradient descent on Riemannian manifolds
     • Introduction
     • Results
  3. Examples

  4. Classical example
  Linear regression: consider the linear model $y = x^T w + \nu$ where $x, w \in \mathbb{R}^d$, $y \in \mathbb{R}$ and $\nu \in \mathbb{R}$ is a noise.
  • examples: $z = (x, y)$
  • loss (prediction error): $Q(z, w) = (y - \hat y)^2 = (y - x^T w)^2$
  • cannot minimize the expected risk $C(w) = \int Q(z, w)\, dP(z)$
  • minimize the empirical risk instead: $\hat C_n(w) = \frac{1}{n} \sum_{i=1}^n Q(z_i, w)$

  5. Gradient descent
  Batch gradient descent: process all examples together
  $$w_{t+1} = w_t - \gamma_t \, \nabla_w \Big( \frac{1}{n} \sum_{i=1}^n Q(z_i, w_t) \Big)$$
  Stochastic gradient descent: process examples one by one
  $$w_{t+1} = w_t - \gamma_t \, \nabla_w Q(z_t, w_t)$$
  for some random example $z_t = (x_t, y_t)$.
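As a point of comparison between the two updates above, here is a minimal NumPy sketch, not taken from the talk, that fits the linear-regression loss of the previous slide with both batch gradient descent and SGD; the data, step sizes and iteration counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = x^T w_true + noise (illustrative choice).
n, d = 1000, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

def batch_gradient_descent(X, y, steps=200, gamma=0.05):
    """Process all examples together at every iteration."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = -2.0 * X.T @ (y - X @ w) / len(y)   # gradient of (1/n) sum_i (y_i - x_i^T w)^2
        w = w - gamma * grad
    return w

def stochastic_gradient_descent(X, y, steps=5000, gamma0=0.1):
    """Process one randomly drawn example per iteration, with decreasing steps."""
    w = np.zeros(X.shape[1])
    for t in range(1, steps + 1):
        i = rng.integers(len(y))                   # random example z_t = (x_t, y_t)
        grad = -2.0 * (y[i] - X[i] @ w) * X[i]     # gradient of the single-example loss
        w = w - (gamma0 / t ** 0.6) * grad         # gamma_t = gamma0 * t^(-alpha), 1/2 < alpha <= 1
    return w

# Both errors should be small; the SGD iterate is noisier around w_true.
print(np.linalg.norm(batch_gradient_descent(X, y) - w_true))
print(np.linalg.norm(stochastic_gradient_descent(X, y) - w_true))
```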

  6. Gradient descent
  Batch gradient descent: process all examples together
  $$w_{t+1} = w_t - \gamma_t \, \nabla_w \Big( \frac{1}{n} \sum_{i=1}^n Q(z_i, w_t) \Big)$$
  Stochastic gradient descent: process examples one by one
  $$w_{t+1} = w_t - \gamma_t \, \nabla_w Q(z_t, w_t)$$
  for some random example $z_t = (x_t, y_t)$.
  ⇒ the well-known identification algorithm for Wiener-ARMAX systems:
  $$y_t = \sum_{i=1}^{n} a_i\, y_{t-i} + \sum_{i=1}^{m} b_i\, u_{t-i} + v_t = \psi_t^T w + v_t, \qquad Q(y_t, w_t) = (y_t - \psi_t^T w_t)^2$$
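Spelled out as code, this identification scheme might look as follows; this is only a sketch for a plain ARX model (no moving-average noise term), with made-up orders, coefficients and step schedule.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative ARX system: y_t = 0.5*y_{t-1} - 0.2*y_{t-2} + 1.0*u_{t-1} + v_t.
T = 5000
u = rng.normal(size=T)
y = np.zeros(T)
for t in range(2, T):
    y[t] = 0.5 * y[t - 1] - 0.2 * y[t - 2] + 1.0 * u[t - 1] + 0.05 * rng.normal()

# Stochastic-gradient identification: w estimates (a1, a2, b1) from one example at a time.
w = np.zeros(3)
for t in range(2, T):
    psi = np.array([y[t - 1], y[t - 2], u[t - 1]])   # regressor psi_t
    err = y[t] - psi @ w                             # prediction error y_t - psi_t^T w_t
    w = w + (0.5 / t ** 0.7) * err * psi             # gradient step on the single-sample squared error
print(w)  # should roughly approach [0.5, -0.2, 1.0]
```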

  7. Stochastic versus online
  Stochastic: examples drawn randomly from a finite set
  • SGD minimizes the empirical risk
  Online: examples drawn with unknown $dP(z)$
  • SGD minimizes the expected risk (+ tracking property)
  Stochastic approximation: approximate a sum by a stream of single elements

  8. Stochastic versus batch
  SGD can converge very slowly: over a long sequence, $\nabla_w Q(z_t, w_t)$ may be a very bad approximation of
  $$\nabla_w \hat C_n(w_t) = \frac{1}{n} \sum_{i=1}^n \nabla_w Q(z_i, w_t)$$
  SGD can converge very fast when there is redundancy
  • extreme case: $z_1 = z_2 = \cdots$

  9. Some examples
  Least mean squares:
  • Loss: $Q(z, w) = (y - \hat y)^2 = (y - x^T w)^2$
  • Update: $w_{t+1} = w_t - \gamma_t \nabla_w Q(z_t, w_t) = w_t + 2\gamma_t\, (y_t - \hat y_t)\, x_t$
  Robbins-Monro algorithm (1951): $C$ smooth with a unique minimum ⇒ the algorithm converges in $L^2$.
  k-means, MacQueen (1967):
  • Procedure: pick $z_t$, attribute it to the nearest centre $w^k$
  • Update: $w^k_{t+1} = w^k_t + \gamma_t\, (z_t - w^k_t)$
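Below is a minimal sketch of MacQueen's online k-means update on made-up data, using the running-mean step $\gamma = 1/n_k$, where $n_k$ counts the points assigned so far to centre $k$ (one common choice; any admissible decreasing step would do).

```python
import numpy as np

rng = np.random.default_rng(2)

# Two illustrative Gaussian blobs in the plane.
data = np.vstack([rng.normal(loc=(-2.0, 0.0), scale=0.5, size=(500, 2)),
                  rng.normal(loc=(+2.0, 0.0), scale=0.5, size=(500, 2))])

K = 2
centres = data[rng.choice(len(data), size=K, replace=False)].copy()  # initial centres w^k
counts = np.zeros(K)

for _ in range(5000):
    z = data[rng.integers(len(data))]                       # pick a random example z_t
    k = np.argmin(np.linalg.norm(centres - z, axis=1))      # attribute it to the nearest centre w^k
    counts[k] += 1
    gamma = 1.0 / counts[k]                                  # decreasing step (running mean)
    centres[k] = centres[k] + gamma * (z - centres[k])       # w^k <- w^k + gamma (z_t - w^k)

print(centres)  # the two centres should land near (-2, 0) and (+2, 0)
```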

  10. Some examples
  Ballistics example (old), early adaptive control:
  • optimize the trajectory of a projectile in fluctuating wind
  • apply successive gradient corrections to the launching angle
  • with $\gamma_t \to 0$ it stabilizes to an optimal value

  11. Another example: mean
  Computing a mean:
  • Total loss: $\frac{1}{n} \sum_i \| z_i - w \|^2$
  • Minimum: $w - \frac{1}{n} \sum_i z_i = 0$, i.e. $w$ is the mean of the points $z_i$
  • Stochastic gradient: $w_{t+1} = w_t - \gamma_t\, (w_t - z_i)$ where $z_i$ is randomly picked [3]
  [3] What if $\|\cdot\|$ is replaced with some more exotic distance?
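A quick numerical check of this update on synthetic points, with the illustrative choice $\gamma_t = 1/(t+1)$, which makes $w_t$ exactly the running average of the samples drawn so far:

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(size=(10_000, 3))           # points z_i whose mean we want

w = np.zeros(3)
for t in range(len(z)):
    i = rng.integers(len(z))               # z_i randomly picked
    w = w - (1.0 / (t + 1)) * (w - z[i])   # w_{t+1} = w_t - gamma_t (w_t - z_i)

print(w, z.mean(axis=0))                   # both should be close to the true mean (0, 0, 0)
```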

  12. Outline
  1. Stochastic gradient descent
     • Introduction and examples
     • Standard convergence analysis (due to L. Bottou)
  2. Stochastic gradient descent on Riemannian manifolds
     • Introduction
     • Results
  3. Examples

  13. Notation
  Expected cost: $C(w) := E_z(Q(z, w)) = \int Q(z, w)\, dP(z)$
  Approximated gradient under the event $z$, denoted by $H(z, w)$:
  $$E_z H(z, w) = \nabla \int Q(z, w)\, dP(z) = \nabla C(w)$$
  Stochastic gradient update: $w_{t+1} \leftarrow w_t - \gamma_t H(z_t, w_t)$

  14. Convergence results
  Convex case: known as the Robbins-Monro algorithm. Convergence to the global minimum of $C(w)$ in mean and almost surely.
  Nonconvex case: $C(w)$ is generally not convex. We are interested in proving
  • almost sure convergence
  • a.s. convergence of $C(w_t)$
  • ... to a local minimum
  • $\nabla C(w_t) \to 0$ a.s.
  Provable under a set of reasonable assumptions.

  15. Assumptions
  Step sizes: the steps must decrease. Classically
  $$\sum \gamma_t^2 < \infty \quad \text{and} \quad \sum \gamma_t = +\infty$$
  The sequence $\gamma_t = t^{-\alpha}$ provides examples for $\frac{1}{2} < \alpha \le 1$.
  Cost regularity: the averaged loss $C(w)$ is 3 times differentiable (relaxable).
  Sketch of the proof:
  1. confinement: $w_t$ remains a.s. in a compact set.
  2. convergence: $\nabla C(w_t) \to 0$ a.s.
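A tiny numerical illustration of the two step-size conditions for $\gamma_t = t^{-\alpha}$; these are partial sums only, so they merely suggest the limiting behaviour rather than prove it.

```python
import numpy as np

t = np.arange(1, 1_000_001, dtype=float)
for alpha in (0.6, 1.0):
    gamma = t ** (-alpha)
    # The partial sum of gamma_t keeps growing (the series diverges), while the
    # partial sum of gamma_t^2 stays bounded for alpha > 1/2.
    print(f"alpha={alpha}: sum gamma = {gamma.sum():.1f}, sum gamma^2 = {(gamma ** 2).sum():.3f}")
```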

  16. Confinement
  Main difficulties:
  1. Only an approximation of the cost is available
  2. We are in discrete time
  Approximation: the noise can generate unbounded trajectories with small but nonzero probability.
  Discrete time: even without noise there are difficulties, as there is no line search.
  So? Confinement to a compact set holds under a set of assumptions: well, see the paper [4]...
  [4] L. Bottou: Online Algorithms and Stochastic Approximations, 1998.

  17. Convergence (simplified)
  Confinement:
  • All trajectories can be assumed to remain in a compact set
  • All continuous functions of $w_t$ are bounded
  Convergence: letting $h_t = C(w_t) > 0$, a second order Taylor expansion gives
  $$h_{t+1} - h_t \le -2\gamma_t\, \langle H(z_t, w_t), \nabla C(w_t) \rangle + \gamma_t^2\, \| H(z_t, w_t) \|^2 K_1$$
  with $K_1$ an upper bound on $\nabla^2 C$ and $\| H(z_t, w_t) \|^2 < A$.

  18. Convergence (in a nutshell)
  We have just proved
  $$h_{t+1} - h_t \le -2\gamma_t\, \langle H(z_t, w_t), \nabla C(w_t) \rangle + \gamma_t^2 A K_1$$

  19. Convergence (in a nutshell)
  We have just proved
  $$h_{t+1} - h_t \le -2\gamma_t\, \langle H(z_t, w_t), \nabla C(w_t) \rangle + \gamma_t^2 A K_1$$
  Conditioning w.r.t. $\mathcal{F}_t = \{ z_0, \cdots, z_{t-1}, w_0, \cdots, w_t \}$ and letting
  $$g_t := h_t + \sum_{k=t}^{\infty} \gamma_k^2 A K_1 \ge 0$$
  we have
  $$E[\, g_{t+1} - g_t \mid \mathcal{F}_t \,] \le -2\gamma_t\, \| \nabla C(w_t) \|^2 \le 0.$$

  20. Convergence (in a nutshell)
  We have just proved
  $$h_{t+1} - h_t \le -2\gamma_t\, \langle H(z_t, w_t), \nabla C(w_t) \rangle + \gamma_t^2 A K_1$$
  Conditioning w.r.t. $\mathcal{F}_t = \{ z_0, \cdots, z_{t-1}, w_0, \cdots, w_t \}$ and letting
  $$g_t := h_t + \sum_{k=t}^{\infty} \gamma_k^2 A K_1 \ge 0$$
  we have
  $$E[\, g_{t+1} - g_t \mid \mathcal{F}_t \,] \le -2\gamma_t\, \| \nabla C(w_t) \|^2 \le 0.$$
  Thus $g_t$ is a supermartingale, so it converges a.s. and
  $$0 \le \sum_t 2 \gamma_t\, \| \nabla C(w_t) \|^2 < \infty.$$
  As $\sum \gamma_t = \infty$, $\nabla C(w_t)$ converges a.s. to 0.

  21. Outline
  1. Stochastic gradient descent
     • Introduction and examples
     • Standard convergence analysis
  2. Stochastic gradient descent on Riemannian manifolds
     • Introduction
     • Results
  3. Examples

  22. Connected Riemannian manifold
  Riemannian manifold: local coordinates around any point
  Tangent space: the linear space $T_x M$ of velocities of curves passing through $x$
  Riemannian metric: a scalar product $\langle u, v \rangle_g$ on the tangent space

  23. Riemannian manifolds
  A Riemannian manifold carries the structure of a metric space whose distance function is the arclength of a minimizing path between two points.
  Length of a curve $c(t) \in M$:
  $$L = \int_a^b \sqrt{ \langle \dot c(t), \dot c(t) \rangle_g }\, dt = \int_a^b \| \dot c(t) \|\, dt$$
  Geodesic: curve of minimal length joining sufficiently close $x$ and $y$.
  Exponential map: $\exp_x(v)$ is the point $z \in M$ situated on the geodesic with initial position-velocity $(x, v)$, at distance $\| v \|$ from $x$.
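To make the exponential map concrete, here is a small self-contained sketch, my own illustration rather than material from the talk, on the unit sphere $S^2 \subset \mathbb{R}^3$, where geodesics are great circles and $\exp_x$ has a closed form.

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map on the unit sphere: follow the great circle leaving x with velocity v.

    x is a unit vector and v a tangent vector at x (i.e. <x, v> = 0); the result is the
    point at geodesic distance ||v|| from x along that geodesic.
    """
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

# Quick check: exp_x(v) stays on the sphere and lies at geodesic distance ||v|| from x.
x = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 0.3, 0.4])            # tangent at x since <x, v> = 0, with ||v|| = 0.5
y = sphere_exp(x, v)
print(np.linalg.norm(y),                 # should be 1 (still on the sphere)
      np.arccos(np.clip(x @ y, -1, 1)),  # geodesic distance from x ...
      np.linalg.norm(v))                 # ... should equal ||v||
```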

  24. Consider $f: M \to \mathbb{R}$ twice differentiable.
  Riemannian gradient: the tangent vector at $x$ satisfying
  $$\frac{d}{dt}\Big|_{t=0} f(\exp_x(tv)) = \langle v, \nabla f(x) \rangle_g$$

  25. Consider $f: M \to \mathbb{R}$ twice differentiable.
  Riemannian gradient: the tangent vector at $x$ satisfying
  $$\frac{d}{dt}\Big|_{t=0} f(\exp_x(tv)) = \langle v, \nabla f(x) \rangle_g$$
  Riemannian Hessian: based on the Taylor expansion
  $$f(\exp_x(tv)) = f(x) + t\, \langle v, \nabla f(x) \rangle_g + \frac{1}{2} t^2\, v^T [\operatorname{Hess} f(x)]\, v + O(t^3)$$

  26. Consider $f: M \to \mathbb{R}$ twice differentiable.
  Riemannian gradient: the tangent vector at $x$ satisfying
  $$\frac{d}{dt}\Big|_{t=0} f(\exp_x(tv)) = \langle v, \nabla f(x) \rangle_g$$
  Riemannian Hessian: based on the Taylor expansion
  $$f(\exp_x(tv)) = f(x) + t\, \langle v, \nabla f(x) \rangle_g + \frac{1}{2} t^2\, v^T [\operatorname{Hess} f(x)]\, v + O(t^3)$$
  Second order Taylor expansion:
  $$f(\exp_x(tv)) - f(x) \le t\, \langle v, \nabla f(x) \rangle_g + \frac{t^2}{2} \| v \|_g^2\, k$$
  where $k$ is a bound on the Hessian along the geodesic.
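Continuing the sphere example (again an illustrative assumption, not material from the talk), the Riemannian gradient of a cost defined on $\mathbb{R}^3$ and restricted to the unit sphere is the projection of its Euclidean gradient onto the tangent space; the sketch below checks this against the defining identity above by a finite difference.

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map on the unit sphere (great-circle geodesics)."""
    nv = np.linalg.norm(v)
    return x if nv < 1e-12 else np.cos(nv) * x + np.sin(nv) * v / nv

def riemannian_grad_sphere(euclid_grad, x):
    """Riemannian gradient at x: project the Euclidean gradient onto the tangent space at x."""
    g = euclid_grad(x)
    return g - (x @ g) * x

A = np.diag([1.0, 2.0, 3.0])
f = lambda x: 0.5 * x @ A @ x                       # toy quadratic cost
euclid_grad = lambda x: A @ x

x = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)        # point on the sphere
v = np.array([1.0, -1.0, 0.0]) / np.sqrt(2.0)       # unit tangent vector at x (<x, v> = 0)

# d/dt f(exp_x(t v)) at t = 0, by central finite difference ...
eps = 1e-6
directional = (f(sphere_exp(x, eps * v)) - f(sphere_exp(x, -eps * v))) / (2 * eps)
# ... should agree with <v, grad f(x)>_g.
print(directional, v @ riemannian_grad_sphere(euclid_grad, x))
```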

  27. Riemannian SGD on M
  Riemannian approximated gradient: $E_z( H(z_t, w_t) ) = \nabla C(w_t)$, a tangent vector!
  Stochastic gradient descent on M: update
  $$w_{t+1} \leftarrow \exp_{w_t}( -\gamma_t H(z_t, w_t) )$$
  $w_{t+1}$ must remain on M!
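To illustrate the update $w_{t+1} \leftarrow \exp_{w_t}(-\gamma_t H(z_t, w_t))$, here is a toy sketch, my own example rather than one from the paper, that runs Riemannian SGD on the unit sphere to compute a Karcher mean: the per-sample loss is taken to be $Q(z, w) = \frac{1}{2} d(z, w)^2$ with $d$ the geodesic distance, whose Riemannian gradient in $w$ is $-\log_w(z)$, the inverse of the exponential map.

```python
import numpy as np

rng = np.random.default_rng(4)

def sphere_exp(x, v):
    """Exponential map on the unit sphere (great-circle geodesics)."""
    nv = np.linalg.norm(v)
    return x if nv < 1e-12 else np.cos(nv) * x + np.sin(nv) * v / nv

def sphere_log(x, y):
    """Inverse of the exponential map: tangent vector at x pointing toward y,
    whose norm is the geodesic distance between x and y."""
    theta = np.arccos(np.clip(x @ y, -1.0, 1.0))
    if theta < 1e-12:
        return np.zeros_like(x)
    u = y - (x @ y) * x                   # component of y orthogonal to x
    return theta * u / np.linalg.norm(u)

# Illustrative data: noisy unit vectors clustered around a common direction.
mu = np.array([0.0, 0.0, 1.0])
z = mu + 0.3 * rng.normal(size=(2000, 3))
z /= np.linalg.norm(z, axis=1, keepdims=True)

# Riemannian SGD for the Karcher mean: Q(z, w) = d(z, w)^2 / 2, so H(z, w) = -log_w(z).
w = z[0].copy()
for t in range(1, 20_001):
    zt = z[rng.integers(len(z))]          # random example z_t
    H = -sphere_log(w, zt)                # Riemannian stochastic gradient H(z_t, w_t)
    w = sphere_exp(w, -(1.0 / t) * H)     # w_{t+1} = exp_{w_t}(-gamma_t H(z_t, w_t))
print(w)                                  # should be close to the Karcher mean of the z_i
```

Because each step is taken with the exponential map, every iterate stays exactly on the sphere, which is precisely the requirement stated on the slide.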

  28. Outline
  1. Stochastic gradient descent
     • Introduction and examples
     • Standard convergence analysis
  2. Stochastic gradient descent on Riemannian manifolds
     • Introduction
     • Results
  3. Examples
