

1. Stochastic approximation-based algorithms, when the Monte Carlo bias does not vanish. Gersende Fort, Institut de Mathématiques de Toulouse, CNRS, Toulouse, France.

2. Based on joint works with Yves Atchadé (Univ. Michigan, USA), Éric Moulines (École Polytechnique, France), Édouard Ollier (ENS Lyon, France), Laurent Risser (IMT, France) and Adeline Samson (Univ. Grenoble Alpes, France), published in the following papers (or works in progress):
- Convergence of the Monte-Carlo EM for curved exponential families (Ann. Stat., 2003)
- On Perturbed Proximal-Gradient algorithms (JMLR, 2017)
- Stochastic Proximal Gradient Algorithms for Penalized Mixed Models (Statistics and Computing, 2018)
- Stochastic FISTA algorithms: so fast? (IEEE workshop SSP, 2018)

3. The topic. This talk answers a computational issue:

Find θ* ∈ argmin_{θ ∈ Θ} ( f(θ) + g(θ) )   (1)

where
- Θ ⊆ R^d (an extension to any Hilbert space is possible, but not done here);
- g is not smooth, but is convex, proper and lower semi-continuous (it admits a "prox" operator);
- f is not explicit / is intractable; ∇f exists but is not explicit / is intractable.

When proving results: f is convex and ∇f is Lipschitz. In this talk: numerical tools to solve (1) based on first-order methods, and their convergence analysis.

4. Applications in Statistical Learning. Outline:
- The topic
- Applications in Statistical Learning
- A numerical solution: proximal-gradient based methods
- Case of Monte Carlo approximation
- Perturbed Proximal-Gradient algorithms and EM-based algorithms

5. Example 1: large-scale learning, minimization of a composite function.
- g = 0, or g is a penalty / regularization / constraint condition on the parameter θ.
- f is an (empirical) loss function associated to N examples,

f(θ) = (1/N) Σ_{i=1}^N f_i(θ),  with N large.

For any i, f_i and ∇f_i can be evaluated at any point θ, but the computation of the sum over N terms is too expensive. Remark that ∇f(θ) = E[∇f_I(θ)], where I is a r.v. uniform on {1, ..., N}.
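The identity ∇f(θ) = E[∇f_I(θ)] is what makes minibatch gradients usable. A minimal sketch (with made-up quadratic losses, not from the talk) checking that averaging gradients of uniformly sampled examples approximates the full gradient:

```python
import numpy as np

# Made-up quadratic losses f_i(theta) = 0.5 * (a_i . theta - b_i)^2, used only
# to illustrate that the gradient of a uniformly drawn example I is an
# unbiased estimate of grad f(theta) = (1/N) sum_i grad f_i(theta).
rng = np.random.default_rng(0)
N, d = 1000, 5
A = rng.normal(size=(N, d))
b = rng.normal(size=N)
theta = rng.normal(size=d)

def full_grad(theta):
    # grad f(theta); costs O(N d), too expensive when N is very large
    return A.T @ (A @ theta - b) / N

# Monte Carlo estimate: average grad f_I(theta) over uniform draws of I
idx = rng.integers(0, N, size=200_000)
resid = A[idx] @ theta - b[idx]
mc_grad = np.mean(resid[:, None] * A[idx], axis=0)
```

With enough draws the Monte Carlo average concentrates around the exact gradient at rate O(1/√(number of draws)).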

6. Example 2: binary graphical model, minimization of a composite function. The observation is y ∈ {−1, 1}^p (a binary vector of length p, collecting the binary values of p nodes), with statistical model

π_θ(y) ∝ exp( Σ_{i=1}^p θ_i y_i + Σ_{i=1}^p Σ_{j=i+1}^p θ_ij y_i y_j )

with an intractable normalizing constant exp(Z_θ); θ collects the "weights". f is the negative log-likelihood of N independent observations,

f(θ) = Z_θ − Σ_{i=1}^p θ_i ( N^{-1} Σ_{n=1}^N Y_i^(n) ) − Σ_{i=1}^p Σ_{j=i+1}^p θ_ij ( N^{-1} Σ_{n=1}^N 1I{ Y_i^(n) = Y_j^(n) } ).

In this model ∇f(θ) = E_θ[H(X, θ)] where X ∼ π_θ. g = 0, or g is a penalty / regularization / constraint condition on the parameter θ (the number of observations N ≪ p²/2).
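The intractable part of ∇f is the gradient of the log-normalizer Z_θ, which equals an expectation under π_θ. A toy sketch (hypothetical, not from the talk): for tiny p, Z_θ is computable by enumeration, so one can check numerically that ∂Z_θ/∂θ_i = E_θ[Y_i], the quantity MCMC draws X ∼ π_θ must approximate when p is large.

```python
import numpy as np

# Brute-force check, for tiny p, that d Z_theta / d theta_i = E_theta[Y_i];
# this expectation is what MCMC has to approximate for realistic p.
p = 3
rng = np.random.default_rng(1)
theta_lin = rng.normal(scale=0.3, size=p)                      # theta_i
theta_quad = np.triu(rng.normal(scale=0.3, size=(p, p)), k=1)  # theta_ij, j > i

# all 2^p configurations y in {-1, 1}^p
states = np.array([[1 if (s >> k) & 1 else -1 for k in range(p)]
                   for s in range(2 ** p)])

def logZ(tl):
    # log of the normalizing constant, tractable only by enumeration
    lw = np.array([tl @ y + y @ theta_quad @ y for y in states])
    return np.log(np.exp(lw).sum())

probs = np.exp([theta_lin @ y + y @ theta_quad @ y - logZ(theta_lin)
                for y in states])
mean_Y = probs @ states          # E_theta[Y_i] under pi_theta
```

The same identity, differentiated in θ_ij, gives the pairwise expectations; only the sufficient statistic changes.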

7. Example 3: parametric inference in latent variable models, minimization of a composite function.
- g is a penalty function (e.g. for a sparsity condition on θ);
- f is the negative log-likelihood of the N observations,

f(θ) = − log ∫_X h(x, Y_{1:N}; θ) ν(dx),

and the gradient is of the form

∇f(θ) = − ∫_X ∂_θ log h(x, Y_{1:N}; θ) [ h(x, Y_{1:N}; θ) / ∫_X h(u, Y_{1:N}; θ) ν(du) ] ν(dx),

i.e. an expectation w.r.t. the a posteriori distribution (known up to a normalizing constant in these models).
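A minimal sketch of this gradient-as-posterior-expectation identity, on a made-up conjugate Gaussian model (not from the talk) where everything is tractable: with prior x ∼ N(0,1) and Y | x ∼ N(θx, 1), a self-normalized Monte Carlo estimate of the posterior expectation can be checked against the exact gradient of the negative log-likelihood.

```python
import numpy as np

# Toy latent-variable model (assumed for illustration): h(x, Y; theta) is the
# joint density of (x, Y), and grad f(theta) = -E_post[ d/dtheta log h ],
# an expectation under the posterior p(x | Y; theta) known up to a constant.
rng = np.random.default_rng(5)
theta = 0.7
Y = 1.3                          # a single observation, for simplicity

def dtheta_log_h(x, th):
    # d/dtheta log h(x, Y; th) for log p(Y|x; th) = -0.5*(Y - th*x)^2 + const
    return (Y - th * x) * x

# self-normalized estimate: sample from the prior nu = N(0, 1) and weight by
# the likelihood, so the weighted average targets the posterior expectation
xs = rng.normal(size=400_000)
w = np.exp(-0.5 * (Y - theta * xs) ** 2)
grad_mc = -np.sum(w * dtheta_log_h(xs, theta)) / np.sum(w)
```

Here the marginal is Y ∼ N(0, θ² + 1), so the exact ∇f(θ) = θ/(θ²+1) − Y²θ/(θ²+1)² is available for comparison; in realistic models one replaces the weighted prior samples by MCMC draws from the posterior.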

8. A numerical solution: proximal-gradient based methods. Outline:
- The topic
- Applications in Statistical Learning
- A numerical solution: proximal-gradient based methods
- Case of Monte Carlo approximation
- Perturbed Proximal-Gradient algorithms and EM-based algorithms

9. Numerical solution: the ingredient.

argmin_{θ ∈ Θ} F(θ)  with  F(θ) = f(θ) + g(θ),  f smooth, g non-smooth.

The Proximal Gradient algorithm. Given a stepsize sequence {γ_n, n ≥ 0}, iterate

θ_{n+1} = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} ∇f(θ_n) ),

where

Prox_{γ, g}(τ) := argmin_{θ ∈ Θ} ( g(θ) + (1/(2γ)) ‖θ − τ‖² ).

Proximal map: Moreau (1962). Proximal Gradient algorithm: Beck-Teboulle (2010); Combettes-Pesquet (2011); Parikh-Boyd (2013).
- A generalization of the gradient algorithm to a composite objective function.
- A Majorize-Minimize algorithm derived from a quadratic majorization of f (available since the gradient is Lipschitz), which produces a sequence {θ_n, n ≥ 0} such that F(θ_{n+1}) ≤ F(θ_n).
In our frameworks, ∇f(θ) is not available.
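A minimal sketch of this iteration on a made-up lasso problem (f(θ) = 0.5‖Aθ − b‖², g(θ) = λ‖θ‖₁), where the prox map has the closed-form soft-thresholding solution; the data and λ are assumptions for illustration only.

```python
import numpy as np

def soft_threshold(tau, t):
    # Prox_{t, lambda*||.||_1}(tau), applied componentwise with t = gamma*lambda
    return np.sign(tau) * np.maximum(np.abs(tau) - t, 0.0)

# made-up problem data
rng = np.random.default_rng(2)
A = rng.normal(size=(40, 10))
b = rng.normal(size=40)
lam = 0.5

L = np.linalg.norm(A, 2) ** 2    # Lipschitz constant of grad f = A^T(A . - b)
gamma = 1.0 / L                  # constant stepsize in (0, 1/L]

theta = np.zeros(10)
for _ in range(500):
    grad = A.T @ (A @ theta - b)
    theta = soft_threshold(theta - gamma * grad, gamma * lam)
```

At convergence the iterate is a fixed point of the prox-gradient map, which is the optimality condition for the composite problem.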

10. Numerical solution: a perturbed proximal-gradient algorithm.

The Perturbed Proximal Gradient algorithm. Given a stepsize sequence {γ_n, n ≥ 0}, iterate

θ_{n+1} = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} H_{n+1} ),

where H_{n+1} is an approximation of ∇f(θ_n). Useful for the proof: observe

θ_{n+1} = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} ∇f(θ_n) − γ_{n+1} (H_{n+1} − ∇f(θ_n)) ),

where the last term is the perturbation.
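A sketch of the perturbed scheme on the same kind of made-up lasso problem, with the exact gradient replaced by a synthetic noisy estimate H_{n+1} whose accuracy improves along iterations (as it would with a growing Monte Carlo batch); the noise model and schedule are assumptions, not the speaker's setup.

```python
import numpy as np

# Perturbed proximal gradient: H_{n+1} = grad f(theta_n) + eta_{n+1}, where
# the perturbation shrinks like 1/sqrt(n), mimicking a Monte Carlo average
# over a batch of size n.
rng = np.random.default_rng(3)
A = rng.normal(size=(40, 10))
b = rng.normal(size=40)
lam = 0.5
L = np.linalg.norm(A, 2) ** 2

def prox(tau, t):
    # prox of t*lambda*||.||_1: soft-thresholding
    return np.sign(tau) * np.maximum(np.abs(tau) - t, 0.0)

def grad(theta):
    return A.T @ (A @ theta - b)

theta = np.zeros(10)
for n in range(1, 400):
    m = n                                   # growing Monte Carlo batch size
    H = grad(theta) + rng.normal(size=10) / np.sqrt(m)
    theta = prox(theta - H / L, lam / L)    # constant stepsize gamma = 1/L
```

Because the perturbation decays, the iterates track the unperturbed proximal-gradient sequence, in line with the series conditions of the convergence theorem below.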

11. Convergence result: the assumptions (1/2).

argmin_{θ ∈ Θ} F(θ)  with  F(θ) = f(θ) + g(θ),

where
- the function g : R^d → [0, ∞] is convex, non-smooth, not identically equal to +∞, and lower semi-continuous;
- the function f : R^d → R is a smooth convex function, i.e. f is continuously differentiable and there exists L > 0 such that

‖∇f(θ) − ∇f(θ′)‖ ≤ L ‖θ − θ′‖  for all θ, θ′ ∈ R^d;

- Θ ⊆ R^d is the domain of g: Θ = {θ ∈ R^d : g(θ) < ∞};
- the set argmin_Θ F is a non-empty subset of Θ.

12. Convergence results (2/2).

θ_{n+1} = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} H_{n+1} )  with  H_{n+1} ≈ ∇f(θ_n).

Set L = argmin_Θ (f + g) and η_{n+1} = H_{n+1} − ∇f(θ_n).

Theorem (Atchadé, F., Moulines (2017)). Assume
- g convex, lower semi-continuous; f convex, C¹ and its gradient is Lipschitz with constant L; L is non-empty;
- Σ_n γ_n = +∞ and γ_n ∈ (0, 1/L];
- convergence of the series

Σ_n γ²_{n+1} ‖η_{n+1}‖²,  Σ_n γ_{n+1} η_{n+1},  Σ_n γ_{n+1} ⟨T_n, η_{n+1}⟩,

where T_n = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} ∇f(θ_n) ).

Then there exists θ⋆ ∈ L such that lim_n θ_n = θ⋆.

13. Sketch of proof.

1. The proof relies on a deterministic Lyapunov inequality:

‖θ_{n+1} − θ⋆‖² ≤ ‖θ_n − θ⋆‖² − 2 γ_{n+1} ( F(θ_{n+1}) − min F ) − 2 γ_{n+1} ⟨T_n − θ⋆, η_{n+1}⟩ + 2 γ²_{n+1} ‖η_{n+1}‖²,

where F(θ_{n+1}) − min F is non-negative and ⟨T_n − θ⋆, η_{n+1}⟩ is a signed noise.

2. (An extension of) the Robbins-Siegmund lemma: let {v_n, n ≥ 0} and {χ_n, n ≥ 0} be non-negative sequences and {ξ_n, n ≥ 0} be such that Σ_n ξ_n exists. If for any n ≥ 0, v_{n+1} ≤ v_n − χ_{n+1} + ξ_{n+1}, then Σ_n χ_n < ∞ and lim_n v_n exists.

Remark: a deterministic lemma, with signed noise.
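A toy numerical illustration of the deterministic lemma, on made-up sequences satisfying its hypotheses: v_n is non-negative, χ_n ≥ 0, and the signed noise ξ_n has a convergent series; the lemma's conclusions (Σχ_n finite, v_n convergent) can be observed directly.

```python
import math

# Made-up sequences: v_{n+1} = v_n - chi_{n+1} + xi_{n+1}, with
#   chi_n = 0.5 / n^2  (non-negative "descent" term, sum = pi^2/12),
#   xi_n = (-1)^n / n^2 (signed noise, alternating series, sum = -pi^2/12),
# so v_n stays non-negative and converges to 2 - pi^2/6.
N = 100_000
v = 2.0
sum_chi = 0.0
for n in range(1, N + 1):
    chi = 0.5 / n ** 2
    xi = (-1.0) ** n / n ** 2
    v = v - chi + xi
    sum_chi += chi
```

In the proof this is applied with v_n = ‖θ_n − θ⋆‖², χ_{n+1} = 2γ_{n+1}(F(θ_{n+1}) − min F) and ξ_{n+1} collecting the noise terms of the Lyapunov inequality.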

14. What about Nesterov-based acceleration? (FISTA)

Let {t_n, n ≥ 0} be a positive sequence such that γ_{n+1} t_n (t_n − 1) ≤ γ_n t²_{n−1}.

Nesterov acceleration of the Proximal Gradient algorithm:

θ_{n+1} = Prox_{γ_{n+1}, g}( τ_n − γ_{n+1} ∇f(τ_n) )
τ_{n+1} = θ_{n+1} + ((t_n − 1)/t_{n+1}) (θ_{n+1} − θ_n)

Nesterov (2004), Tseng (2008), Beck-Teboulle (2009), Zhu-Orecchia (2015), Attouch-Peypouquet (2015), Bubeck-Lee-Singh (2015), Su-Boyd-Candes (2015).

- (deterministic) Proximal-gradient: F(θ_n) − min F = O(1/n)
- (deterministic) Accelerated Proximal-gradient: F(θ_n) − min F = O(1/n²)
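A sketch comparing the plain and accelerated iterations on a made-up lasso problem. With a constant stepsize γ = 1/L, the classical Beck-Teboulle update t_{n+1} = (1 + √(1 + 4t_n²))/2 satisfies the condition on {t_n} above (it gives t_{n+1}(t_{n+1} − 1) = t_n²); the problem data are assumptions for illustration.

```python
import numpy as np

# Plain proximal gradient vs. its Nesterov acceleration (FISTA) on a
# made-up lasso objective F(theta) = 0.5*||A theta - b||^2 + lam*||theta||_1.
rng = np.random.default_rng(4)
A = rng.normal(size=(60, 20))
b = rng.normal(size=60)
lam = 1.0
L = np.linalg.norm(A, 2) ** 2
gamma = 1.0 / L

def prox(tau, t):
    return np.sign(tau) * np.maximum(np.abs(tau) - t, 0.0)

def F(th):
    return 0.5 * np.sum((A @ th - b) ** 2) + lam * np.sum(np.abs(th))

theta_pg = np.zeros(20)                       # plain proximal gradient
theta, tau, t = np.zeros(20), np.zeros(20), 1.0   # FISTA state
for _ in range(100):
    theta_pg = prox(theta_pg - gamma * A.T @ (A @ theta_pg - b), gamma * lam)
    theta_new = prox(tau - gamma * A.T @ (A @ tau - b), gamma * lam)
    t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
    tau = theta_new + ((t - 1.0) / t_new) * (theta_new - theta)
    theta, t = theta_new, t_new
```

On such problems the accelerated iterates typically reach a much lower objective value in the same number of iterations, matching the O(1/n) vs. O(1/n²) deterministic rates.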

15. Convergence results for perturbed FISTA. When ∇f(τ_n) is replaced with H_{n+1} ≈ ∇f(τ_n):

Perturbed FISTA:
θ_{n+1} = Prox_{γ_{n+1}, g}( τ_n − γ_{n+1} H_{n+1} )
τ_{n+1} = θ_{n+1} + ((t_n − 1)/t_{n+1}) (θ_{n+1} − θ_n)

Under conditions on γ_n, t_n and on the perturbation η̃_{n+1} := H_{n+1} − ∇f(τ_n), namely convergence of the series

Σ_n γ_{n+1} t_n ⟨z_n − θ*, η̃_{n+1}⟩ < ∞,

we have (F., Risser, Atchadé, Moulines; 2018): lim_n γ_{n+1} t²_n F(θ_n) exists, with an explicit control of this quantity.
