Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization



  1. Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization. Julien Mairal, INRIA LEAR, Grenoble. Gargantua workshop, LJK, November 2013.

  2. A simple optimization principle. [Figure: a surrogate g(θ) majorizing the objective f(θ), touching it at the current iterate κ.] Objective: min_{θ∈Θ} f(θ). The principle is called Majorization-Minimization [Lange et al., 2000]; it is quite popular in statistics and signal processing.

  3. In this work. [Figure: the same majorizing-surrogate picture.] Contributions: scalable Majorization-Minimization algorithms, for convex or non-convex and smooth or non-smooth problems. References: J. Mairal, Optimization with First-Order Surrogate Functions, ICML'13; J. Mairal, Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization, NIPS'13.

  4. Setting: First-Order Surrogate Functions. [Figure: f, a surrogate g, and the error h = g − f near κ.] Requirements: g(θ′) ≥ f(θ′) for all θ′ in argmin_{θ∈Θ} g(θ); the approximation error h ≜ g − f is differentiable, and ∇h is L-Lipschitz; moreover, h(κ) = 0 and ∇h(κ) = 0.
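
For concreteness, here is a small numerical sanity check of this definition (ours, not from the talk), using the Lipschitz gradient surrogate introduced two slides below on a toy L-smooth function (softplus, for which L = 1/4 works):

```python
import numpy as np

# Toy check: h = g - f vanishes at kappa and g majorizes f everywhere.
f = lambda t: np.log1p(np.exp(t))              # softplus, L-smooth with L = 1/4
grad_f = lambda t: 1.0 / (1.0 + np.exp(-t))
L, kappa = 0.25, 0.7

g = lambda t: f(kappa) + grad_f(kappa) * (t - kappa) + 0.5 * L * (t - kappa) ** 2
h = lambda t: g(t) - f(t)                      # approximation error

ts = np.linspace(-5.0, 5.0, 1001)
assert abs(h(kappa)) < 1e-12 and np.all(h(ts) >= -1e-12)  # h(kappa) = 0, g >= f
```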

  5. The Basic MM Algorithm.
     Algorithm 1: Basic Majorization-Minimization Scheme
       1: Input: θ_0 ∈ Θ (initial estimate); N (number of iterations).
       2: for n = 1, ..., N do
       3:   Compute a surrogate g_n of f near θ_{n−1};
       4:   Minimize g_n and update the solution: θ_n ∈ argmin_{θ∈Θ} g_n(θ);
       5: end for
       6: Output: θ_N (final estimate).
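
The scheme is a few lines of code once a surrogate minimizer is available. A minimal sketch, assuming the caller supplies a routine `minimize_surrogate(kappa)` returning argmin_{θ∈Θ} g(θ) for a surrogate g of f near κ (the function names are placeholders, not from the talk):

```python
def mm(minimize_surrogate, theta0, n_iters):
    """Basic MM loop (Algorithm 1). minimize_surrogate(kappa) must return
    an exact minimizer of a first-order surrogate of f near kappa."""
    theta = theta0
    for _ in range(n_iters):
        # Since g_n majorizes f and g_n(theta_{n-1}) = f(theta_{n-1}),
        # f(theta_n) <= g_n(theta_n) <= g_n(theta_{n-1}) = f(theta_{n-1}):
        # the objective decreases monotonically.
        theta = minimize_surrogate(theta)
    return theta
```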

  6. Examples of First-Order Surrogate Functions. Lipschitz Gradient Surrogates: f is L-smooth (differentiable with an L-Lipschitz gradient). g: θ ↦ f(κ) + ∇f(κ)ᵀ(θ − κ) + (L/2)‖θ − κ‖_2^2. Minimizing g yields a gradient descent step θ ← κ − (1/L)∇f(κ).
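
Minimizing this surrogate in closed form gives the gradient step, since ∇g(θ) = ∇f(κ) + L(θ − κ) vanishes at θ = κ − (1/L)∇f(κ). A sketch on a toy quadratic (the data and names are ours):

```python
import numpy as np

# f(theta) = 0.5 ||A theta - b||_2^2 is L-smooth with L = ||A^T A||_2.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
grad_f = lambda th: A.T @ (A @ th - b)
L = np.linalg.eigvalsh(A.T @ A).max()

theta = np.zeros(2)
for _ in range(200):
    # Exact minimizer of the Lipschitz gradient surrogate at kappa = theta:
    theta = theta - grad_f(theta) / L
print(theta)  # -> approximately [0.5, 1.0], the solution of A theta = b
```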

  7. Examples of First-Order Surrogate Functions (continued). Proximal Gradient Surrogates: f = f_1 + f_2 with f_1 smooth. g: θ ↦ f_1(κ) + ∇f_1(κ)ᵀ(θ − κ) + (L/2)‖θ − κ‖_2^2 + f_2(θ). Minimizing g amounts to one step of the forward-backward, ISTA, or proximal gradient descent algorithms [Beck and Teboulle, 2009, Combettes and Pesquet, 2010, Wright et al., 2008, Nesterov, 2007].
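
With f_2 = λ‖·‖_1, minimizing this surrogate separates coordinate-wise into soft thresholding. A sketch of the resulting ISTA step (the helper name and parameters are illustrative):

```python
import numpy as np

def ista_step(kappa, grad_f1, L, lam):
    """One proximal gradient step: minimize the proximal gradient
    surrogate with f2 = lam * ||.||_1 via soft thresholding."""
    z = kappa - grad_f1(kappa) / L                            # gradient step on f1
    return np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of (lam/L)||.||_1
```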

  8. Examples of First-Order Surrogate Functions. Linearizing Concave Functions and DC-Programming: f = f_1 + f_2 with f_2 smooth and concave. g: θ ↦ f_1(θ) + f_2(κ) + ∇f_2(κ)ᵀ(θ − κ). When f_1 is convex, the algorithm is called DC-programming.
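
A toy 1-D instance of this linearization (the objective is ours, chosen so the surrogate minimizer has a closed form): f_1(θ) = (θ − 2)^2 is convex and f_2(θ) = −log(1 + e^θ) is smooth and concave.

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

theta = 0.0
for _ in range(50):
    slope = -sigmoid(theta)        # f2'(kappa) for f2(t) = -log(1 + e^t)
    # Surrogate g(t) = (t - 2)^2 + f2(kappa) + slope * (t - kappa);
    # its minimizer solves 2(t - 2) + slope = 0.
    theta = 2.0 - slope / 2.0
print(theta)  # converges to a stationary point of f1 + f2, about 2.46
```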

  9. Examples of First-Order Surrogate Functions (continued). Quadratic Surrogates: f is twice differentiable, and H is a uniform upper bound on ∇²f: g: θ ↦ f(κ) + ∇f(κ)ᵀ(θ − κ) + (1/2)(θ − κ)ᵀH(θ − κ). Actually a big deal in statistics and machine learning [Böhning and Lindsay, 1988, Khan et al., 2010, Jebara and Choromanska, 2012].
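
For logistic regression, the bound of Böhning and Lindsay [1988] cited above takes H = (1/4)XᵀX, since each per-example Hessian weight σ(1−σ) is at most 1/4. A sketch with synthetic data and a small ridge term (both ours, added so the iterates stay bounded):

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(100) > 0).astype(float)

mu = 0.1                                  # ridge weight (our addition)
H = 0.25 * X.T @ X + mu * np.eye(3)       # uniform upper bound on the Hessian
theta = np.zeros(3)
for _ in range(200):
    grad = X.T @ (sigmoid(X @ theta) - y) + mu * theta
    # Minimizing the quadratic surrogate gives a Newton-like step with the
    # fixed matrix H (no Hessian recomputation needed):
    theta -= np.linalg.solve(H, grad)
print(theta)
```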

  10. Examples of First-Order Surrogate Functions. More Exotic Surrogates: consider a smooth approximation of the trace (nuclear) norm, f_μ: θ ↦ Tr((θᵀθ + μI)^{1/2}) = Σ_{i=1}^p √(λ_i(θᵀθ) + μ). The function f′: H ↦ Tr(H^{1/2}) is concave on the set of positive definite matrices, with ∇f′(H) = (1/2)H^{−1/2}; linearizing it at κᵀκ gives g_μ: θ ↦ f_μ(κ) + (1/2)Tr((κᵀκ + μI)^{−1/2}(θᵀθ − κᵀκ)), which yields the algorithm of Mohan and Fazel [2012].
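
A quick numerical check (ours, not from the talk) that this linearization indeed majorizes f_μ, using eigendecompositions to form the matrix square roots:

```python
import numpy as np

def f_mu(theta, mu):
    # Tr((theta^T theta + mu I)^{1/2}) = sum_i sqrt(lambda_i(theta^T theta) + mu)
    return np.sum(np.sqrt(np.linalg.eigvalsh(theta.T @ theta) + mu))

def g_mu(theta, kappa, mu):
    w, V = np.linalg.eigh(kappa.T @ kappa + mu * np.eye(kappa.shape[1]))
    K_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T   # (kappa^T kappa + mu I)^{-1/2}
    return f_mu(kappa, mu) + 0.5 * np.trace(
        K_inv_sqrt @ (theta.T @ theta - kappa.T @ kappa))

rng = np.random.default_rng(0)
theta, kappa = rng.standard_normal((2, 5, 3))    # two random 5x3 matrices
assert g_mu(theta, kappa, mu=0.1) >= f_mu(theta, mu=0.1)   # majorization holds
```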

  11. Examples of First-Order Surrogate Functions.
     - Variational Surrogates: f(θ_1) ≜ min_{θ_2∈Θ_2} f̃(θ_1, θ_2), where f̃ is "smooth" w.r.t. θ_1 and strongly convex w.r.t. θ_2: g: θ_1 ↦ f̃(θ_1, κ_2^⋆) with κ_2^⋆ ≜ argmin_{θ_2∈Θ_2} f̃(κ_1, θ_2).
     - Saddle-Point Surrogates: f(θ_1) ≜ max_{θ_2∈Θ_2} f̃(θ_1, θ_2), where f̃ is "smooth" w.r.t. θ_1 and strongly concave w.r.t. θ_2: g: θ_1 ↦ f̃(θ_1, κ_2^⋆) + (L″/2)‖θ_1 − κ_1‖_2^2.
     - Jensen Surrogates: f(θ) ≜ f̃(xᵀθ), where f̃ is L-smooth. Choose a weight vector w in ℝ_+^p such that ‖w‖_1 = 1 and w_i ≠ 0 whenever x_i ≠ 0: g: θ ↦ Σ_{i=1}^p w_i f̃((x_i/w_i)(θ_i − κ_i) + xᵀκ).
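
A numerical check (ours) of the last family: with the common weight choice w_i = |x_i|/‖x‖_1 and a convex toy f̃(u) = u², the weighted points average to xᵀθ, so Jensen's inequality gives g(θ) ≥ f(θ).

```python
import numpy as np

f_tilde = lambda u: u ** 2                     # smooth convex toy (ours)

rng = np.random.default_rng(1)
x = rng.standard_normal(4)
theta, kappa = rng.standard_normal((2, 4))

w = np.abs(x) / np.abs(x).sum()                # w >= 0, ||w||_1 = 1, w_i != 0
# The points (x_i/w_i)(theta_i - kappa_i) + x^T kappa average (under w) to
# x^T theta, so convexity of f_tilde yields the majorization:
g = np.sum(w * f_tilde(x * (theta - kappa) / w + x @ kappa))
assert g >= f_tilde(x @ theta)
```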

  12. Theoretical Guarantees.
     - For non-convex problems: f(θ_n) monotonically decreases, and lim inf_{n→+∞} inf_{θ∈Θ} ∇f(θ_n, θ − θ_n)/‖θ − θ_n‖_2 ≥ 0, which is an asymptotic stationary point condition (here ∇f(θ_n, θ − θ_n) denotes the directional derivative of f at θ_n in the direction θ − θ_n).
     - For convex problems: f(θ_n) − f^⋆ = O(1/n).
     - For μ-strongly convex problems: the convergence rate is linear, O((1 − μ/L)^n).
     The convergence rates and the proof techniques are the same as for proximal gradient methods [Nesterov, 2007, Beck and Teboulle, 2009].

  13. New Majorization-Minimization Algorithms. Given f: ℝ^p → ℝ and Θ ⊆ ℝ^p, our goal is to solve min_{θ∈Θ} f(θ). We introduce algorithms for non-convex and convex optimization:
     - a block coordinate scheme for separable surrogates;
     - an incremental algorithm dubbed MISO for separable functions f;
     - a stochastic algorithm for minimizing expectations.
     Also several variants for convex optimization: an accelerated one (Nesterov-like) and a "Frank-Wolfe" majorization-minimization algorithm.

  14. Incremental Optimization: MISO. Suppose that f splits into many components: f(θ) = (1/T) Σ_{t=1}^T f_t(θ). Recipe: incrementally update an approximate surrogate (1/T) Σ_{t=1}^T g_t, and add some heuristics for practical implementations. Related (inspiring) work for convex problems: MISO is related to SAG [Schmidt et al., 2013] and SDCA [Shalev-Shwartz and Zhang, 2012], but offers different update rules.

  15. Incremental Optimization: MISO.
     Algorithm 2: Incremental Scheme MISO
       1: Input: θ_0 ∈ Θ; N (number of iterations).
       2: Choose surrogates g_0^t of f_t near θ_0 for all t;
       3: for n = 1, ..., N do
       4:   Randomly pick one index t̂_n and choose a surrogate g_n^{t̂_n} of f_{t̂_n} near θ_{n−1}; set g_n^t ≜ g_{n−1}^t for t ≠ t̂_n;
       5:   Update the solution: θ_n ∈ argmin_{θ∈Θ} (1/T) Σ_{t=1}^T g_n^t(θ);
       6: end for
       7: Output: θ_N (final estimate).
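
A minimal sketch of this scheme with Lipschitz gradient surrogates, so each g_n^t is parameterized by an anchor κ_t and a cached gradient; the names are placeholders, and a practical implementation would maintain running sums instead of recomputing the means:

```python
import numpy as np

def miso(grad_ft, L, T, theta0, n_iters, seed=0):
    """grad_ft(t, theta): gradient of the component f_t at theta."""
    rng = np.random.default_rng(seed)
    kappa = [theta0.copy() for _ in range(T)]          # anchors of the g^t
    grads = [grad_ft(t, theta0) for t in range(T)]     # cached gradients
    theta = theta0.copy()
    for _ in range(n_iters):
        t = rng.integers(T)                            # pick one index t_n
        kappa[t], grads[t] = theta.copy(), grad_ft(t, theta)  # refresh g^t
        # Exact minimizer of (1/T) sum_t [linearization of f_t at kappa_t
        # + (L/2)||theta - kappa_t||^2]:
        theta = np.mean(kappa, axis=0) - np.mean(grads, axis=0) / L
    return theta
```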

  16. Incremental Optimization: MISO. Update Rule for Proximal Gradient Surrogates. We want to minimize (1/T) Σ_{t=1}^T f_1^t(θ) + f_2(θ). The update is
       θ_n = argmin_{θ∈Θ} (1/T) Σ_{t=1}^T [ f_1^t(κ_t) + ∇f_1^t(κ_t)ᵀ(θ − κ_t) + (L/2)‖θ − κ_t‖_2^2 ] + f_2(θ)
           = argmin_{θ∈Θ} (1/2)‖θ − ((1/T) Σ_{t=1}^T κ_t − (1/(LT)) Σ_{t=1}^T ∇f_1^t(κ_t))‖_2^2 + (1/L) f_2(θ).
     Then, randomly draw one index t_n and update κ_{t_n} ← θ_n. Remark: removing f_2 and replacing (1/T) Σ_{t=1}^T κ_t by θ_{n−1} yields SAG [Schmidt et al., 2013]; replacing L by μ is "close" to SDCA [Shalev-Shwartz and Zhang, 2012].
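
With f_2 = λ‖·‖_1, the second argmin above is again soft thresholding, now applied to the averaged anchor/gradient point. A sketch of that single step (names are placeholders):

```python
import numpy as np

def miso_prox_step(kappas, grads, L, lam):
    """kappas, grads: lists holding the anchors kappa_t and the gradients
    grad f_1^t(kappa_t). Returns theta_n for f_2 = lam * ||.||_1."""
    z = np.mean(kappas, axis=0) - np.mean(grads, axis=0) / L
    # argmin_theta 0.5 ||theta - z||^2 + (lam/L) ||theta||_1:
    return np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
```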

  17. Incremental Optimization: MISO. Theoretical Guarantees.
     - For non-convex problems, the guarantees are the same as for the generic MM algorithm, with probability one.
     - For convex problems and proximal gradient surrogates, the expected convergence rate becomes O(T/n).
     - For μ-strongly convex problems and proximal gradient surrogates, the expected convergence rate is linear, O((1 − μ/(TL))^n).

  18. Incremental Optimization: MISO (continued). Remarks:
     - for μ-strongly convex problems, the rates of SDCA and SAG are better: μ/(LT) is replaced by O(min(μ/L, 1/T));
     - MISO with minorizing surrogates is close to SDCA, with "similar" convergence rates (details yet to be written).
