SLIDE 1

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization

Julien Mairal

INRIA LEAR, Grenoble

Gargantua workshop, LJK, November 2013

SLIDE 2

A simple optimization principle

[Figure: the objective f(θ) and a majorizing surrogate g(θ) that touches f at the current estimate κ.]

Objective: min_{θ∈Θ} f(θ).

Principle called Majorization-Minimization [Lange et al., 2000]; quite popular in statistics and signal processing.

SLIDE 3

In this work

[Figure: again the objective f(θ) and a surrogate g(θ) touching it at κ.]

Scalable Majorization-Minimization algorithms, for convex or non-convex and smooth or non-smooth problems.

References

  • J. Mairal. Optimization with First-Order Surrogate Functions. ICML'13;
  • J. Mairal. Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization. NIPS'13.

SLIDE 4

Setting: First-Order Surrogate Functions

[Figure: f(θ), a surrogate g(θ) touching f at κ, and the approximation error h(θ).]

  • g(θ′) ≥ f(θ′) for all θ′ in arg min_{θ∈Θ} g(θ);
  • the approximation error h ≜ g − f is differentiable, and ∇h is L-Lipschitz. Moreover, h(κ) = 0 and ∇h(κ) = 0.
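
For illustration, a small numerical check (not from the slides) that the Lipschitz gradient surrogate of the following slides satisfies these conditions on a smooth toy function, with h = g − f and h(κ) = 0; all names are illustrative.

import numpy as np

# Toy f(theta) = 0.5 * ||A theta - b||^2 and its Lipschitz gradient surrogate at kappa.
A = np.random.randn(30, 10)
b = np.random.randn(30)
f = lambda th: 0.5 * np.sum((A @ th - b) ** 2)
grad_f = lambda th: A.T @ (A @ th - b)
L = np.linalg.norm(A.T @ A, 2)              # Lipschitz constant of grad f

kappa = np.random.randn(10)
g = lambda th: f(kappa) + grad_f(kappa) @ (th - kappa) + 0.5 * L * np.sum((th - kappa) ** 2)
h = lambda th: g(th) - f(th)                # approximation error h = g - f

theta = np.random.randn(10)
assert h(theta) >= -1e-9                    # g majorizes f (h >= 0) for this surrogate
assert abs(h(kappa)) < 1e-9                 # h(kappa) = 0; by construction grad h(kappa) = 0 as well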

SLIDE 5

The Basic MM Algorithm

Algorithm 1 Basic Majorization-Minimization Scheme
1: Input: θ0 ∈ Θ (initial estimate); N (number of iterations).
2: for n = 1, . . . , N do
3:   Compute a surrogate gn of f near θn−1;
4:   Minimize gn and update the solution: θn ∈ arg min_{θ∈Θ} gn(θ).
5: end for
6: Output: θN (final estimate).
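
For illustration, a minimal Python sketch of Algorithm 1, instantiated on a toy least-squares problem with the Lipschitz gradient surrogate of the following slides; mm_optimize and the surrogate callables are illustrative names, not from the talk.

import numpy as np

def mm_optimize(build_surrogate, minimize_surrogate, theta0, n_iters=100):
    theta = theta0
    for _ in range(n_iters):
        g = build_surrogate(theta)        # step 3: surrogate g_n of f near theta_{n-1}
        theta = minimize_surrogate(g)     # step 4: theta_n in arg min of g_n
    return theta

# Toy instantiation: f(theta) = 0.5 * ||A theta - b||^2 with the Lipschitz gradient
# surrogate, whose exact minimizer is a gradient step of size 1/L.
A = np.random.randn(20, 5)
b = np.random.randn(20)
L = np.linalg.norm(A.T @ A, 2)                 # Lipschitz constant of the gradient
grad_f = lambda theta: A.T @ (A @ theta - b)

build = lambda kappa: (kappa, grad_f(kappa))   # the surrogate is summarized by (kappa, grad f(kappa))
minimize = lambda g: g[0] - g[1] / L           # arg min of the quadratic surrogate

theta_hat = mm_optimize(build, minimize, np.zeros(5))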

SLIDE 6

Examples of First-Order Surrogate Functions

Lipschitz Gradient Surrogates: f is L-smooth (differentiable + L-Lipschitz gradient).

  g : θ → f(κ) + ∇f(κ)⊤(θ − κ) + (L/2)‖θ − κ‖₂².

Minimizing g yields a gradient descent step θ ← κ − (1/L)∇f(κ).

SLIDE 7

Examples of First-Order Surrogate Functions

Proximal Gradient Surrogates: f = f1 + f2 with f1 smooth.

  g : θ → f1(κ) + ∇f1(κ)⊤(θ − κ) + (L/2)‖θ − κ‖₂² + f2(θ).

Minimizing g amounts to one step of the forward-backward, ISTA, or proximal gradient descent algorithm [Beck and Teboulle, 2009, Combettes and Pesquet, 2010, Wright et al., 2008, Nesterov, 2007].
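
For illustration, a minimal sketch of the resulting ISTA/proximal gradient step when f2 = λ‖·‖1; soft_threshold and ista_step are illustrative names, not part of the slides.

import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista_step(kappa, grad_f1, L, lam):
    """Minimizer of f1(kappa) + <grad f1(kappa), theta - kappa>
       + (L/2)||theta - kappa||^2 + lam * ||theta||_1."""
    return soft_threshold(kappa - grad_f1(kappa) / L, lam / L)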

SLIDE 8

Examples of First-Order Surrogate Functions

Linearizing Concave Functions and DC-Programming: f = f1 + f2 with f2 smooth and concave.

  g : θ → f1(θ) + f2(κ) + ∇f2(κ)⊤(θ − κ).

When f1 is convex, the algorithm is called DC-programming.

SLIDE 9

Examples of First-Order Surrogate Functions

Quadratic Surrogates: f is twice differentiable, and H is a uniform upper bound of ∇²f:

  g : θ → f(κ) + ∇f(κ)⊤(θ − κ) + (1/2)(θ − κ)⊤H(θ − κ).

Actually a big deal in statistics and machine learning [Böhning and Lindsay, 1988, Khan et al., 2010, Jebara and Choromanska, 2012].
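
One classical instance (not from the slides) is logistic regression with the Böhning-Lindsay bound H = (1/(4n)) X⊤X, which uniformly dominates the Hessian of the averaged logistic loss; a minimal sketch, assuming H is invertible and with illustrative names.

import numpy as np

def logistic_grad(theta, X, y):
    """Gradient of (1/n) * sum_i log(1 + exp(-y_i x_i^T theta)), y_i in {-1, +1}."""
    s = 1.0 / (1.0 + np.exp(y * (X @ theta)))      # sigmoid(-y x^T theta)
    return -(X.T @ (y * s)) / X.shape[0]

def quadratic_mm_step(theta, X, y):
    """One MM step with the uniform quadratic surrogate g built at theta."""
    H = (X.T @ X) / (4.0 * X.shape[0])             # uniform upper bound on the Hessian
    return theta - np.linalg.solve(H, logistic_grad(theta, X, y))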

SLIDE 10

Examples of First-Order Surrogate Functions

More Exotic Surrogates: Consider a smooth approximation of the trace (nuclear) norm

  fµ : θ → Tr((θ⊤θ + µI)^(1/2)) = Σ_{i=1}^p √(λi(θ⊤θ) + µ).

The function f′ : H → Tr(H^(1/2)) is concave on the set of p.d. matrices and ∇f′(H) = (1/2)H^(−1/2), which gives the surrogate

  gµ : θ → fµ(κ) + (1/2) Tr((κ⊤κ + µI)^(−1/2)(θ⊤θ − κ⊤κ)),

and yields the algorithm of Mohan and Fazel [2012].
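
A small sketch of these two functions, computed with dense eigendecompositions; this is illustrative only (Mohan and Fazel's full algorithm is not implemented here), and the names are placeholders.

import numpy as np

def f_mu(theta, mu):
    """f_mu(theta) = Tr((theta^T theta + mu I)^(1/2)) = sum_i sqrt(lambda_i + mu)."""
    eigvals = np.linalg.eigvalsh(theta.T @ theta)
    return np.sum(np.sqrt(eigvals + mu))

def g_mu(theta, kappa, mu):
    """Surrogate built at kappa by linearizing the concave map H -> Tr(H^(1/2))."""
    M = kappa.T @ kappa + mu * np.eye(kappa.shape[1])
    w, V = np.linalg.eigh(M)
    M_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return f_mu(kappa, mu) + 0.5 * np.trace(M_inv_sqrt @ (theta.T @ theta - kappa.T @ kappa))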

SLIDE 11

Examples of First-Order Surrogate Functions

Variational Surrogates: f(θ1) ≜ min_{θ2∈Θ2} f̃(θ1, θ2), where f̃ is "smooth" w.r.t. θ1 and strongly convex w.r.t. θ2:

  g : θ1 → f̃(θ1, κ⋆2), with κ⋆2 ≜ arg min_{θ2∈Θ2} f̃(κ1, θ2).

Saddle-Point Surrogates: f(θ1) ≜ max_{θ2∈Θ2} f̃(θ1, θ2), where f̃ is "smooth" w.r.t. θ1 and strongly concave w.r.t. θ2:

  g : θ1 → f̃(θ1, κ⋆2) + (L″/2)‖θ1 − κ1‖₂².

Jensen Surrogates: f(θ) ≜ f̃(x⊤θ), where f̃ is L-smooth. Choose a weight vector w in R^p with nonnegative entries such that ‖w‖1 = 1 and wi ≠ 0 whenever xi ≠ 0:

  g : θ → Σ_{i=1}^p wi f̃((xi/wi)(θi − κi) + x⊤κ).
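
For illustration, a numerical check of the Jensen surrogate with a convex, smooth f̃ (the logistic function) and the common weight choice wi = |xi|/‖x‖1; this is an assumption-laden sketch, not from the slides, and all names are illustrative.

import numpy as np

f_tilde = lambda u: np.logaddexp(0.0, -u)   # log(1 + exp(-u)), convex and smooth

p = 8
x = np.random.randn(p)                      # almost surely has no exact zeros
w = np.abs(x) / np.sum(np.abs(x))           # w_i != 0 wherever x_i != 0, sum_i w_i = 1

kappa = np.random.randn(p)
theta = np.random.randn(p)

f_val = f_tilde(x @ theta)                                          # f(theta) = f_tilde(x^T theta)
g_val = np.sum(w * f_tilde(x * (theta - kappa) / w + x @ kappa))    # Jensen surrogate at kappa
assert g_val >= f_val - 1e-9                # majorization follows from convexity of f_tilde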

SLIDE 12

Theoretical Guarantees

  • for non-convex problems: f(θn) monotonically decreases and

      lim inf_{n→+∞} inf_{θ∈Θ} ∇f(θn, θ − θn) / ‖θ − θn‖2 ≥ 0,

    where ∇f(θn, θ − θn) denotes a directional derivative; this is an asymptotic stationary point condition;
  • for convex problems: f(θn) − f⋆ = O(1/n);
  • for µ-strongly convex problems: the convergence rate is linear, O((1 − µ/L)^n);
  • the convergence rates and the proof techniques are the same as for proximal gradient methods [Nesterov, 2007, Beck and Teboulle, 2009].

SLIDE 13

New Majorization-Minimization Algorithms

Given f : Rp → R and Θ ⊆ Rp, our goal is to solve min_{θ∈Θ} f(θ).

We introduce algorithms for non-convex and convex optimization:
  • a block coordinate scheme for separable surrogates;
  • an incremental algorithm dubbed MISO for separable functions f;
  • a stochastic algorithm for minimizing expectations.

Also several variants for convex optimization:
  • an accelerated one (Nesterov-like);
  • a "Frank-Wolfe" majorization-minimization algorithm.

SLIDE 14

Incremental Optimization: MISO

Suppose that f splits into many components: f(θ) = (1/T) Σ_{t=1}^T f^t(θ).

Recipe
  • incrementally update an approximate surrogate (1/T) Σ_{t=1}^T g^t;
  • add some heuristics for practical implementations.

Related (Inspiring) Work for Convex Problems
  • related to SAG [Schmidt et al., 2013] and SDCA [Shalev-Shwartz and Zhang, 2012], but offers different update rules.

SLIDE 15

Incremental Optimization: MISO

Algorithm 2 Incremental Scheme MISO
1: Input: θ0 ∈ Θ; N (number of iterations).
2: Choose surrogates g0^t of f^t near θ0 for all t;
3: for n = 1, . . . , N do
4:   Randomly pick one index t̂n and choose a surrogate gn^t̂n of f^t̂n near θn−1; set gn^t ≜ gn−1^t for all t ≠ t̂n;
5:   Update the solution: θn ∈ arg min_{θ∈Θ} (1/T) Σ_{t=1}^T gn^t(θ).
6: end for
7: Output: θN (final estimate).

SLIDE 16

Incremental Optimization: MISO

Update Rule for Proximal Gradient Surrogates

We want to minimize (1/T) Σ_{t=1}^T f1^t(θ) + f2(θ).

  θn = arg min_{θ∈Θ} (1/T) Σ_{t=1}^T [ f1^t(κ^t) + ∇f1^t(κ^t)⊤(θ − κ^t) + (L/2)‖θ − κ^t‖₂² ] + f2(θ)
     = arg min_{θ∈Θ} (1/2)‖θ − ( (1/T) Σ_{t=1}^T κ^t − (1/(LT)) Σ_{t=1}^T ∇f1^t(κ^t) )‖₂² + (1/L) f2(θ).

Then, randomly draw one index t̂n, and update κ^t̂n ← θn.

Remark
  • removing f2 and replacing (1/T) Σ_{t=1}^T κ^t by θn−1 yields SAG [Schmidt et al., 2013];
  • replacing L by µ is "close" to SDCA [Shalev-Shwartz and Zhang, 2012].
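
A minimal sketch of this update for f2 = λ‖·‖1, where the averaged surrogate is minimized by soft-thresholding the averaged anchor/gradient point; miso_l1, grad_f1 and the other names are illustrative, and the per-iteration averages are recomputed naively rather than maintained incrementally as one would in practice.

import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def miso_l1(grad_f1, theta0, L, lam, n_iters, rng=np.random):
    """grad_f1 is a list of T per-component gradient functions of f1^t."""
    T = len(grad_f1)
    kappa = np.tile(theta0, (T, 1))                     # anchor points kappa^t
    grads = np.array([g(theta0) for g in grad_f1])      # gradients stored at the anchors
    theta = theta0.copy()
    for _ in range(n_iters):
        # minimizer of the averaged surrogate: soft-threshold the averaged point
        center = kappa.mean(axis=0) - grads.mean(axis=0) / L
        theta = soft_threshold(center, lam / L)
        t = rng.randint(T)                              # draw one index t_n
        kappa[t] = theta                                # refresh surrogate t
        grads[t] = grad_f1[t](theta)
    return theta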

SLIDE 17

Incremental Optimization: MISO

Theoretical Guarantees
  • for non-convex problems, the guarantees are the same as for the generic MM algorithm, with probability one;
  • for convex problems and proximal gradient surrogates, the expected convergence rate becomes O(T/n);
  • for µ-strongly convex problems and proximal gradient surrogates, the expected convergence rate is linear, O((1 − µ/(TL))^n).

SLIDE 18

Incremental Optimization: MISO

Remarks
  • for µ-strongly convex problems, the rates of SDCA and SAG are better: µ/(TL) is replaced by O(min(µ/L, 1/T));
  • MISO with minorizing surrogates is close to SDCA, with "similar" convergence rates (details still to be written).

SLIDE 19

Stochastic Majorization-Minimization: SMM

Suppose that f is an expectation: f(θ) = E_x[ℓ(θ, x)].

Recipe
  • draw a function fn : θ → ℓ(θ, xn) at iteration n;
  • iteratively update an approximate surrogate ḡn = (1 − wn)ḡn−1 + wn gn;
  • possibly use an averaging scheme of the iterates.

Related Work
  • online-EM [Neal and Hinton, 1998, Cappé and Moulines, 2009];
  • online dictionary learning [Mairal et al., 2010a].

SLIDE 20

Stochastic Majorization-Minimization: SMM

Algorithm 3 Stochastic Majorization-Minimization Scheme
1: Input: θ0 ∈ Θ (initial estimate); N (number of iterations); (wn)n≥1, weights in (0, 1];
2: initialize the approximate surrogate: ḡ0 : θ → (ρ/2)‖θ − θ0‖₂²;
3: for n = 1, . . . , N do
4:   draw a training point xn;
5:   choose a surrogate function gn of fn : θ → ℓ(θ, xn) near θn−1;
6:   update the approximate surrogate: ḡn = (1 − wn)ḡn−1 + wn gn;
7:   update the current estimate: θn ∈ arg min_{θ∈Θ} ḡn(θ);
8: end for
9: Output: θN (current estimate).

SLIDE 21

Stochastic Majorization-Minimization: SMM

Update Rule for Proximal Gradient Surrogates

  θn ← arg min_{θ∈Θ} Σ_{i=1}^n wi^n [ ∇fi(θi−1)⊤θ + (L/2)‖θ − θi−1‖₂² + ψ(θ) ].   (SMM)

Other schemes in the literature [Duchi and Singer, 2009]:

  θn ← arg min_{θ∈Θ} ∇fn(θn−1)⊤θ + (1/(2ηn))‖θ − θn−1‖₂² + ψ(θ),   (FOBOS)

or regularized dual averaging (RDA) of Xiao [2010]:

  θn ← arg min_{θ∈Θ} (1/n) Σ_{i=1}^n ∇fi(θi−1)⊤θ + (1/(2ηn))‖θ‖₂² + ψ(θ).   (RDA)
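
A minimal sketch of the (SMM) update for ψ = λ‖·‖1: because the averaged surrogate is a quadratic with curvature L plus ψ, it suffices to track weighted averages of the past iterates and gradients. The weight schedule, sample_x and grad_loss are illustrative placeholders, and the ρ-quadratic initialization of Algorithm 3 is replaced here by a simple anchor at θ0.

import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def smm_l1(sample_x, grad_loss, theta0, L, lam, n_iters):
    theta = theta0.copy()
    z_bar = theta0.copy()            # weighted average of past iterates theta_{i-1}
    g_bar = np.zeros_like(theta0)    # weighted average of past gradients
    for n in range(1, n_iters + 1):
        w = 1.0 / n                  # a simple weight schedule in (0, 1]
        x = sample_x()               # draw a training point x_n
        g = grad_loss(theta, x)      # gradient of f_n at theta_{n-1}
        z_bar = (1 - w) * z_bar + w * theta
        g_bar = (1 - w) * g_bar + w * g
        theta = soft_threshold(z_bar - g_bar / L, lam / L)   # minimize the averaged surrogate
    return theta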

SLIDE 22

Stochastic Majorization-Minimization: SMM

Theoretical Guarantees - Non-Convex Problems
  • under a set of reasonable assumptions, f(θn) almost surely converges;
  • the function ḡn asymptotically behaves as a first-order surrogate;
  • we almost surely have asymptotic stationary point conditions.

Theoretical Guarantees - Convex Problems
  • for proximal gradient surrogates, we obtain expected rates similar to SGD with averaging [see Nemirovski et al., 2009, Polyak and Juditsky, 1992]: O(1/n) for strongly convex problems, and O(1/√n) for convex ones.

SLIDE 23

Experimental Conclusions for ℓ2-logistic Regression

Datasets

  name      m          p       storage   size (GB)
  alpha     250 000    500     dense     1
  rcv1      781 265    47 152  sparse    0.95
  covtype   581 012    54      dense     0.11
  ocr       2 500 000  1 155   dense     23.1

  • Incremental and stochastic schemes were significantly faster than batch ones;
  • MISO with heuristics was competitive with the state of the art (SAG, SGD, Liblinear);
  • after one pass over the data, SMM quickly achieves a low-precision solution; for higher precision, MISO is preferred;
  • the problems tested were large but relatively well conditioned.

SLIDE 24

Stochastic DC programming

Consider a binary classification problem with enormous training data (yn, xn), with yn in {−1, +1} and xn in Rp. Assume that there exists a sparse linear model y ≈ sign(θ⊤x), learned by minimizing

  min_{θ∈Rp} E_(y,x)[log(1 + e^(−y θ⊤x))] + λψ(θ).

Traditional choices for ψ: ψ(θ) = ‖θ‖₂² or ‖θ‖1.

Non-convex sparsity-inducing penalty: ψ(θ) = Σ_{j=1}^p log(|θ[j]| + ε).

[Figure: the non-convex penalty ψ(θ) plotted against θ.]

SLIDE 25

Stochastic DC programming

  • upper-bound fn : θ → log(1 + e^(−yn θ⊤xn)) by

      θ → fn(θn−1) + ∇fn(θn−1)⊤(θ − θn−1) + (L/2)‖θ − θn−1‖₂²;

  • upper-bound λ Σ_{j=1}^p log(|θ[j]| + ε) by

      θ → λ Σ_{j=1}^p |θ[j]| / (|θn−1[j]| + ε).

This is a stochastic reweighted-ℓ1 algorithm [Candès et al., 2008].

SLIDE 26

Stochastic DC programming

Datasets

  name      Ntr (train)  Nte (test)  p           density (%)
  rcv1      781 265      23 149      47 152      0.161
  webspam   250 000      100 000     16 091 143  0.023
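
For illustration, a minimal sketch of a single such update (one data point, without the surrogate averaging of SMM), combining the two upper bounds of the previous slide; it reduces to a gradient step followed by coordinate-wise soft-thresholding with reweighted thresholds λ/(L(|θn−1[j]| + ε)). All names are illustrative.

import numpy as np

def stochastic_dc_step(theta, x, y, L, lam, eps):
    """One reweighted-l1 update for the logistic loss + log penalty objective."""
    s = 1.0 / (1.0 + np.exp(y * (x @ theta)))          # sigmoid(-y x^T theta)
    grad = -y * s * x                                  # gradient of log(1 + exp(-y x^T theta))
    v = theta - grad / L                               # gradient step on the smooth upper bound
    thresh = lam / (L * (np.abs(theta) + eps))         # per-coordinate reweighted-l1 threshold
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)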

SLIDE 27

Stochastic DC programming

[Figure: objective value vs. iterations (epochs over the data) for Online DC vs. Batch DC, on the training and test sets of rcv1 (top) and webspam (bottom).]

SLIDE 28

Online Structured Matrix Factorization

Consider some signals x in Rm. We want to find a dictionary D in Rm×K. The quality of D is measured through the loss

  ℓ(x, D) ≜ min_{α∈RK} (1/2)‖x − Dα‖₂² + λ1‖α‖1 + (λ2/2)‖α‖₂².

Then, learning the dictionary amounts to solving

  min_{D∈C} E_x[ℓ(x, D)] + ϕ(D),

and we can use the proximal gradient surrogate.

Why is it a matrix factorization problem? The empirical counterpart of the problem is

  min_{D∈C, A∈RK×n} (1/(2n))‖X − DA‖²_F + (1/n) Σ_{i=1}^n [ λ1‖αi‖1 + (λ2/2)‖αi‖₂² ] + ϕ(D).

SLIDE 29

Online Structured Matrix Factorization

  • when C = {D ∈ Rm×K s.t. ‖dj‖2 ≤ 1} and ϕ = 0, the problem is called sparse coding or dictionary learning [Olshausen and Field, 1997, Elad and Aharon, 2006]. We can use the upper bound

      ℓ(xn, D) ≤ (1/2)‖xn − Dαn‖₂² + λ1‖αn‖1 + (λ2/2)‖αn‖₂²,

    where αn = arg min_{α∈RK} (1/2)‖xn − Dn−1α‖₂² + λ1‖α‖1 + (λ2/2)‖α‖₂²,

    and we obtain the online dictionary learning of Mairal et al. [2010a] (see the sketch after this slide);
  • non-negativity constraints can be easily added, yielding an online nonnegative matrix factorization algorithm;
  • ϕ can be a function encouraging a particular structure in D [Jenatton et al., 2009].
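
A condensed sketch in the spirit of the online dictionary learning algorithm of Mairal et al. [2010a], with C the set of dictionaries with unit-norm columns and ϕ = 0; the sparse-coding step uses a few ISTA iterations instead of an exact solver, and stream, sparse_code and the other names are illustrative.

import numpy as np

def sparse_code(x, D, lam1, lam2, n_ista=50):
    """Approximate argmin_a 0.5||x - D a||^2 + lam1||a||_1 + (lam2/2)||a||^2 by ISTA."""
    alpha = np.zeros(D.shape[1])
    L = np.linalg.norm(D.T @ D, 2) + lam2             # Lipschitz constant of the smooth part
    for _ in range(n_ista):
        grad = D.T @ (D @ alpha - x) + lam2 * alpha
        v = alpha - grad / L
        alpha = np.sign(v) * np.maximum(np.abs(v) - lam1 / L, 0.0)
    return alpha

def online_dictionary_learning(stream, D0, lam1, lam2, n_iters):
    D = D0.copy()
    A = np.zeros((D.shape[1], D.shape[1]))            # accumulates alpha alpha^T
    B = np.zeros_like(D)                              # accumulates x alpha^T
    for _ in range(n_iters):
        x = next(stream)                              # draw a training signal
        alpha = sparse_code(x, D, lam1, lam2)
        A += np.outer(alpha, alpha)
        B += np.outer(x, alpha)
        for j in range(D.shape[1]):                   # block coordinate update of D
            if A[j, j] > 1e-12:
                dj = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
                D[:, j] = dj / max(1.0, np.linalg.norm(dj))   # project onto the unit ball
    return D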

SLIDES 30-35

Online Structured Matrix Factorization

Dictionary Learning on Natural Image Patches

Consider n = 250 000 whitened natural image patches of size m = 12 × 12. We learn a dictionary with K = 256 elements.

[Figures: the learned dictionary shown at successive stages of training, on an old laptop with a 1.2 GHz dual-core CPU.]
  • 0 s      (initialization)
  • 1.15 s   (0.1 pass over the data)
  • 5.97 s   (0.5 pass)
  • 12.44 s  (1 pass)
  • 23.22 s  (2 passes)
  • 60.60 s  (5 passes)

SLIDE 36

Online Structured Matrix Factorization

With a structured regularization function ϕ [Jenatton et al., 2009]:

  ϕ(D) ≜ γ1 Σ_{j=1}^K Σ_{g∈G} max_{k∈g} |dj[k]| + γ2‖D‖²_F.

The proximal operator of ϕ can be computed by using network flow optimization [Mairal et al., 2010b].

Figure: Left: subset of a larger dictionary obtained with ℓ1; Right: subset obtained with ϕ after initialization with the dictionary on the left.

About 20 minutes per pass over the data on the 1.2 GHz laptop CPU.

SLIDE 37

Conclusion

What we have done
  • we have given a unified view of a large number of algorithms;
  • ... and introduced new ones for large-scale optimization.

A take-home message
  • our algorithms are likely to be useful for large-scale non-convex and possibly non-smooth problems.

Source Code
  • code will be included in the toolbox SPAMS (C++, interfaced with Matlab, Python, R): http://spams-devel.gforge.inria.fr/;
  • the online dictionary learning algorithm is already in SPAMS.

SLIDE 38

References I

  • A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
  • D. Böhning and B. G. Lindsay. Monotonicity of quadratic-approximation algorithms. Annals of the Institute of Statistical Mathematics, 40(4):641–663, 1988.
  • E. J. Candès, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14:877–905, 2008.
  • O. Cappé and E. Moulines. On-line expectation–maximization algorithm for latent data models. 71(3):593–613, 2009.
  • P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer, 2010.

SLIDE 39

References II

  • J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.
  • M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 54(12):3736–3745, December 2006.
  • T. Jebara and A. Choromanska. Majorization for CRFs and latent likelihoods. In Advances in Neural Information Processing Systems, 2012.
  • R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, 2009. Preprint arXiv:0904.3523v1.
  • E. Khan, B. Marlin, G. Bouchard, and K. Murphy. Variational bounds for mixed-data factor analysis. In Advances in Neural Information Processing Systems, 2010.

SLIDE 40

References III

  • K. Lange, D. R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. 9(1):1–20, 2000.
  • J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 2010a.
  • J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In Advances in Neural Information Processing Systems, 2010b.
  • K. Mohan and M. Fazel. Iterative reweighted algorithms for matrix rank minimization. Journal of Machine Learning Research, 13:3441–3473, 2012.
  • R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, 89, 1998.

SLIDE 41

References IV

  • A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. 19(4):1574–1609, 2009.
  • Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, CORE, 2007.
  • B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.
  • B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
  • M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Preprint arXiv:1309.2388, 2013.
  • S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. Preprint arXiv:1211.2717v1, 2012.

SLIDE 42

References V

  • S. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 2008.
  • L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.

SLIDE 43

Performance of MISO for logistic-ℓ2 regression

With a preliminary version of SAG.

[Figure: distance to optimum (log scale) vs. effective passes over the data and vs. training time (sec), on the datasets alpha and rcv1, comparing FISTA-LS, LIBLINEAR, SAG-LS, ASGD (Bottou), SGD (Bottou), and several MISO variants (MISO1, MISO2, and mini-batch versions b1000/b10000).]

SLIDE 44

Online Dictionary Learning

Experimental results, batch vs. online: m = 8 × 8, k = 256.

SLIDE 45

Online Dictionary Learning

Experimental results, batch vs. online: m = 12 × 12 × 3, k = 512.

SLIDE 46

Online Dictionary Learning

Experimental results, batch vs. online: m = 16 × 16, k = 1024.