Stochastic approximation-based algorithms, when the Monte Carlo bias does not vanish

Gersende Fort
Institut de Mathématiques de Toulouse, CNRS
Toulouse, France

Based on joint works with Yves Atchadé (Univ. Michigan, USA), Eric Moulines (Ecole Polytechnique, France), Edouard Ollier (ENS Lyon, France), Laurent Risser (IMT, France) and Adeline Samson (Univ. Grenoble Alpes, France),

and published in the papers (or works in progress):
- Convergence of the Monte-Carlo EM for curved exponential families (Ann. Stat., 2003)
- On Perturbed Proximal-Gradient algorithms (JMLR, 2017)
- Stochastic Proximal Gradient Algorithms for Penalized Mixed Models (Statistics and Computing, 2018)
- Stochastic FISTA algorithms: so fast? (IEEE workshop SSP, 2018)

The topic

This talk: an answer to a computational issue.

◮ Find
$$\theta^* \in \operatorname{argmin}_{\theta \in \Theta} \big( f(\theta) + g(\theta) \big) \qquad (1)$$
where
- $\Theta \subseteq \mathbb{R}^d$ (an extension to any Hilbert space is possible; not done here);
- $g$ is not smooth, but is convex, proper and lower semi-continuous ("prox" operator);
- $f$ is not explicit / is intractable; $\nabla f$ exists but is not explicit / is intractable.
When proving results: $f$ is convex and $\nabla f$ is Lipschitz.

◮ In this talk: numerical tools to solve (1) based on first-order methods; convergence analysis.

Outline
- The topic
- Applications in Statistical Learning
- A numerical solution: proximal-gradient based methods
- Case of Monte Carlo approximation
- Perturbed Proximal-Gradient algorithms and EM-based algorithms

Applications in Statistical Learning

Example 1: large-scale learning

Minimization of a composite function:
- $g = 0$, or $g$ is a penalty / regularization / constraint condition on the parameter $\theta$;
- $f$ is an (empirical) loss function associated with $N$ examples,
$$f(\theta) = \frac{1}{N} \sum_{i=1}^{N} f_i(\theta),$$
when $N$ is large.

For any $i$, $f_i$ and $\nabla f_i$ can be evaluated at any point $\theta$, but the computation of the sum over $N$ terms is too expensive.

Remark: $\nabla f(\theta) = \mathbb{E}\left[ \nabla f_I(\theta) \right]$, where $I$ is a random variable uniform on $\{1, \dots, N\}$.
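The remark on $\nabla f(\theta) = \mathbb{E}[\nabla f_I(\theta)]$ can be checked numerically. The sketch below is only an illustration: the least-squares losses $f_i$, the toy data and the names `grad_fi`, `full_gradient` and `minibatch_gradient` are assumptions, not part of the talk. It averages many minibatch estimates and compares them with the full gradient.

```python
import random

def grad_fi(theta, a_i, b_i):
    # Gradient of the illustrative loss f_i(theta) = 0.5 * (<a_i, theta> - b_i)^2.
    residual = sum(a * t for a, t in zip(a_i, theta)) - b_i
    return [residual * a for a in a_i]

def full_gradient(theta, A, b):
    # Exact gradient of f = (1/N) * sum_i f_i: one pass over all N examples.
    n_examples = len(A)
    grad = [0.0] * len(theta)
    for a_i, b_i in zip(A, b):
        for k, g_k in enumerate(grad_fi(theta, a_i, b_i)):
            grad[k] += g_k / n_examples
    return grad

def minibatch_gradient(theta, A, b, batch_size, rng):
    # Monte Carlo estimate: average of grad f_I over indices I drawn uniformly
    # (with replacement) from {0, ..., N-1}. It is unbiased for the full gradient.
    grad = [0.0] * len(theta)
    for _ in range(batch_size):
        i = rng.randrange(len(A))
        for k, g_k in enumerate(grad_fi(theta, A[i], b[i])):
            grad[k] += g_k / batch_size
    return grad

# Hypothetical toy data set (N = 50 examples, d = 2).
A = [[1.0, (i % 5) / 5.0] for i in range(50)]
b = [(i % 7) / 7.0 for i in range(50)]
theta = [1.0, -1.0]

rng = random.Random(0)
exact = full_gradient(theta, A, b)

# Averaging many independent minibatch estimates should approach the exact gradient.
n_rep = 2000
mean_estimate = [0.0, 0.0]
for _ in range(n_rep):
    est = minibatch_gradient(theta, A, b, batch_size=10, rng=rng)
    mean_estimate = [m + e / n_rep for m, e in zip(mean_estimate, est)]

print(exact, mean_estimate)
```

Each call to `minibatch_gradient` touches only `batch_size` examples instead of all $N$, which is the point of the remark.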

Example 2: binary graphical model

Minimization of a composite function.

Observation $y \in \{-1, 1\}^p$ (a binary vector of length $p$, collecting the binary values of $p$ nodes), with statistical model
$$\pi_\theta(y) \propto \exp\left( \sum_{i=1}^{p} \theta_i y_i + \sum_{i=1}^{p} \sum_{j=i+1}^{p} \theta_{ij} y_i y_j \right)$$
with an intractable normalizing constant $Z_\theta$. $\theta$ collects the "weights".

- $f$ is the (normalized) negative log-likelihood of $N$ independent observations $Y^{(1)}, \dots, Y^{(N)}$:
$$f(\theta) = \log Z_\theta - \sum_{i=1}^{p} \theta_i \, N^{-1} \sum_{n=1}^{N} Y_i^{(n)} - \sum_{i=1}^{p} \sum_{j=i+1}^{p} \theta_{ij} \, N^{-1} \sum_{n=1}^{N} Y_i^{(n)} Y_j^{(n)}.$$
In this model, $\nabla f(\theta) = \mathbb{E}_\theta\left[ H(X, \theta) \right]$ where $X \sim \pi_\theta$.
- $g = 0$, or $g$ is a penalty / regularization / constraint condition on the parameter $\theta$ (the number of observations satisfies $N \ll p^2/2$).
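For a very small $p$ the expectation form of the gradient can be verified by brute force. The sketch below assumes the normalization $\pi_\theta(y) = \exp(\langle \theta, S(y) \rangle) / Z_\theta$ with $S(y) = (y_i,\; y_i y_j \text{ for } i < j)$ and takes $f$ as the negative log-likelihood divided by $N$, so that $H(X, \theta) = S(X) - N^{-1} \sum_n S(Y^{(n)})$. The function names and the toy data are hypothetical.

```python
import itertools
import math

def suff_stat(y):
    # S(y) = (y_1, ..., y_p, then y_i * y_j for i < j).
    p = len(y)
    return list(y) + [y[i] * y[j] for i in range(p) for j in range(i + 1, p)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_partition(theta, p):
    # log Z_theta by enumeration of all 2^p configurations (only feasible for tiny p).
    return math.log(sum(math.exp(dot(theta, suff_stat(y)))
                        for y in itertools.product((-1, 1), repeat=p)))

def model_mean(theta, p):
    # E_theta[S(X)] with X ~ pi_theta, again by enumeration.
    configs = list(itertools.product((-1, 1), repeat=p))
    weights = [math.exp(dot(theta, suff_stat(y))) for y in configs]
    total = sum(weights)
    return [sum(w * suff_stat(y)[k] for w, y in zip(weights, configs)) / total
            for k in range(len(theta))]

def data_mean(data):
    # Empirical mean of the sufficient statistics over the N observations.
    stats = [suff_stat(y) for y in data]
    return [sum(s[k] for s in stats) / len(stats) for k in range(len(stats[0]))]

def neg_log_lik(theta, data):
    # f(theta) = log Z_theta - <theta, mean_n S(Y^(n))>.
    return log_partition(theta, len(data[0])) - dot(theta, data_mean(data))

def grad_neg_log_lik(theta, data):
    # grad f(theta) = E_theta[S(X)] - mean_n S(Y^(n)).
    p = len(data[0])
    return [m - s for m, s in zip(model_mean(theta, p), data_mean(data))]

# Hypothetical data: N = 4 observations of p = 3 nodes.
data = [(1, 1, -1), (1, -1, -1), (-1, -1, 1), (1, 1, 1)]
theta = [0.2, -0.1, 0.3, 0.5, -0.4, 0.1]  # 3 node weights, then 3 pair weights
grad = grad_neg_log_lik(theta, data)
print(grad)
```

The enumeration costs $2^p$ terms, which is why the expectation has to be approximated by Monte Carlo for realistic $p$.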

Example 3: parametric inference in latent variable models

Minimization of a composite function:
- $g$ is a penalty function (e.g. for a sparsity condition on $\theta$);
- $f$ is the negative log-likelihood of the $N$ observations,
$$f(\theta) = - \log \int_{\mathsf{X}} h(x, Y_{1:N}; \theta) \, \nu(\mathrm{d}x),$$
and the gradient is of the form
$$\nabla f(\theta) = - \int_{\mathsf{X}} \partial_\theta \log h(x, Y_{1:N}; \theta) \, \frac{h(x, Y_{1:N}; \theta)}{\int_{\mathsf{X}} h(u, Y_{1:N}; \theta) \, \nu(\mathrm{d}u)} \, \nu(\mathrm{d}x),$$
i.e. an expectation w.r.t. the a posteriori distribution (known up to a normalizing constant in these models).
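As an illustration of a gradient written as a posterior expectation, the sketch below uses a hypothetical toy model that is not from the talk: a scalar latent $x \sim N(0, 1)$ and a single observation $y \mid x \sim N(\theta x, 1)$, so that $h(x, y; \theta)$ is the product of the two Gaussian densities and $\nabla f(\theta)$ has a closed form. The posterior expectation is approximated by self-normalized importance sampling with the prior as proposal. Since the estimator is a ratio of two Monte Carlo averages, it is in general biased for a finite number of draws.

```python
import math
import random

def exact_grad(theta, y):
    # Closed form for this toy model: marginally y ~ N(0, 1 + theta^2), hence
    # f(theta) = 0.5 * log(1 + theta^2) + y^2 / (2 * (1 + theta^2)) + const.
    s = 1.0 + theta * theta
    return theta / s - theta * y * y / (s * s)

def snis_grad(theta, y, n_draws, rng):
    # Self-normalized importance sampling estimate of
    # grad f(theta) = - E[ d/dtheta log h(X, y; theta) | y ].
    num = 0.0
    den = 0.0
    for _ in range(n_draws):
        x = rng.gauss(0.0, 1.0)                     # proposal = prior N(0, 1)
        w = math.exp(-0.5 * (y - theta * x) ** 2)   # unnormalized weight h / prior
        num += w * (y - theta * x) * x              # d/dtheta log h(x, y; theta)
        den += w
    return -num / den

theta, y = 0.5, 1.0
estimate = snis_grad(theta, y, n_draws=20000, rng=random.Random(0))
print(estimate, exact_grad(theta, y))
```

Only the unnormalized weights are needed, which matches the remark that the a posteriori distribution is known up to a normalizing constant.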

A numerical solution: proximal-gradient based methods

Numerical solution: the ingredient

$$\operatorname{argmin}_{\theta \in \Theta} F(\theta) \quad \text{with} \quad F(\theta) = \underbrace{f(\theta)}_{\text{smooth}} + \underbrace{g(\theta)}_{\text{non smooth}}$$

The Proximal Gradient algorithm. Given a stepsize sequence $\{\gamma_n, n \geq 0\}$, iterative algorithm:
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\big( \theta_n - \gamma_{n+1} \nabla f(\theta_n) \big)$$
where
$$\operatorname{Prox}_{\gamma, g}(\tau) \overset{\text{def}}{=} \operatorname{argmin}_{\theta \in \Theta} \left( g(\theta) + \frac{1}{2\gamma} \| \theta - \tau \|^2 \right).$$

Proximal map: Moreau (1962). Proximal Gradient algorithm: Beck-Teboulle (2010); Combettes-Pesquet (2011); Parikh-Boyd (2013).

- A generalization of the gradient algorithm to a composite objective function.
- A Majorize-Minimize algorithm from a quadratic majorization of $f$ (since its gradient is Lipschitz), which produces a sequence $\{\theta_n, n \geq 0\}$ such that $F(\theta_{n+1}) \leq F(\theta_n)$.

In our frameworks, $\nabla f(\theta)$ is not available.
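A minimal sketch of the Proximal Gradient iteration, under assumptions that are not in the talk: $\Theta = \mathbb{R}^2$, a quadratic $f(\theta) = \frac{1}{2} \theta^\top Q \theta - b^\top \theta$ whose gradient is available, and $g = \lambda \|\theta\|_1$, for which the proximal map is componentwise soft-thresholding. With a constant stepsize $\gamma = 1/L$ the objective values should be non-increasing.

```python
# Toy composite problem (all numbers are illustrative assumptions):
# f(theta) = 0.5 * theta' Q theta - b' theta,  g(theta) = LAM * ||theta||_1.
Q = [[2.0, 0.5], [0.5, 1.0]]
B = [1.0, -2.0]
LAM = 0.3
LIP = 2.5  # upper bound on the largest eigenvalue of Q (largest absolute row sum)

def grad_f(theta):
    return [sum(Q[i][j] * theta[j] for j in range(2)) - B[i] for i in range(2)]

def objective(theta):
    quad = 0.5 * sum(theta[i] * Q[i][j] * theta[j] for i in range(2) for j in range(2))
    lin = sum(b_i * t for b_i, t in zip(B, theta))
    return quad - lin + LAM * sum(abs(t) for t in theta)

def prox_l1(tau, gamma):
    # Prox_{gamma, g} for g = LAM * ||.||_1: componentwise soft-thresholding.
    level = gamma * LAM
    return [t - level if t > level else t + level if t < -level else 0.0 for t in tau]

def proximal_gradient(theta0, gamma, n_iter):
    # theta_{n+1} = Prox_{gamma, g}(theta_n - gamma * grad f(theta_n)), constant stepsize.
    iterates = [list(theta0)]
    for _ in range(n_iter):
        theta = iterates[-1]
        forward = [t - gamma * g for t, g in zip(theta, grad_f(theta))]
        iterates.append(prox_l1(forward, gamma))
    return iterates

iterates = proximal_gradient([0.0, 0.0], gamma=1.0 / LIP, n_iter=200)
values = [objective(theta) for theta in iterates]
print(iterates[-1], values[-1])
```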

Numerical solution: a perturbed proximal-gradient algorithm

The Perturbed Proximal Gradient algorithm. Given a stepsize sequence $\{\gamma_n, n \geq 0\}$, iterative algorithm:
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\big( \theta_n - \gamma_{n+1} H_{n+1} \big)$$
where $H_{n+1}$ is an approximation of $\nabla f(\theta_n)$.

Useful for the proof: observe that
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\Big( \theta_n - \gamma_{n+1} \nabla f(\theta_n) - \gamma_{n+1} \underbrace{\big( H_{n+1} - \nabla f(\theta_n) \big)}_{\text{perturbation}} \Big).$$
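The perturbed iteration can be sketched on a toy problem with a known solution. Everything specific here is an assumption made for illustration: $f(\theta) = \frac{1}{2} \|\theta - c\|^2$ (so $L = 1$), $g = \lambda \|\theta\|_1$, a gradient approximation $H_{n+1}$ equal to the exact gradient plus centered Gaussian noise, and the stepsize $\gamma_n = n^{-3/4}$, which is not summable but is square-summable.

```python
import random

C = [2.0, 0.3]
LAM = 0.5

def soft_threshold(tau, level):
    # Proximal map of level * ||.||_1 (componentwise).
    return [t - level if t > level else t + level if t < -level else 0.0 for t in tau]

# For f(theta) = 0.5 * ||theta - C||^2 and g = LAM * ||.||_1 the minimizer is explicit.
THETA_STAR = soft_threshold(C, LAM)

def noisy_gradient(theta, rng, sigma=1.0):
    # H_{n+1}: exact gradient (theta - C) plus centered Gaussian noise (an assumption).
    return [t - c + rng.gauss(0.0, sigma) for t, c in zip(theta, C)]

def perturbed_proximal_gradient(theta0, n_iter, rng):
    theta = list(theta0)
    for n in range(1, n_iter + 1):
        gamma = n ** -0.75  # in (0, 1/L] with L = 1
        h = noisy_gradient(theta, rng)
        forward = [t - gamma * h_k for t, h_k in zip(theta, h)]
        theta = soft_threshold(forward, gamma * LAM)
    return theta

theta_end = perturbed_proximal_gradient([0.0, 0.0], n_iter=5000, rng=random.Random(0))
print(theta_end, THETA_STAR)
```

In this sketch the noise is unbiased and has constant variance, so the decreasing stepsize does the averaging; a biased Monte Carlo approximation would need a separate control of the bias.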

Convergence result: the assumptions (1/2)

$$\operatorname{argmin}_{\theta \in \Theta} F(\theta) \quad \text{with} \quad F(\theta) = f(\theta) + g(\theta)$$
where
- the function $g: \mathbb{R}^d \to [0, \infty]$ is convex, non smooth, not identically equal to $+\infty$, and lower semi-continuous;
- the function $f: \mathbb{R}^d \to \mathbb{R}$ is a smooth convex function, i.e. $f$ is continuously differentiable and there exists $L > 0$ such that
$$\forall \theta, \theta' \in \mathbb{R}^d, \qquad \| \nabla f(\theta) - \nabla f(\theta') \| \leq L \, \| \theta - \theta' \| ;$$
- $\Theta \subseteq \mathbb{R}^d$ is the domain of $g$: $\Theta = \{ \theta \in \mathbb{R}^d : g(\theta) < \infty \}$;
- the set $\operatorname{argmin}_\Theta F$ is a non-empty subset of $\Theta$.

Convergence results (2/2)

$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\big( \theta_n - \gamma_{n+1} H_{n+1} \big) \quad \text{with} \quad H_{n+1} \approx \nabla f(\theta_n)$$

Set: $\mathcal{L} = \operatorname{argmin}_\Theta (f + g)$ and $\eta_{n+1} = H_{n+1} - \nabla f(\theta_n)$.

Theorem (Atchadé, F., Moulines (2017)). Assume
- $g$ convex, lower semi-continuous; $f$ convex, $C^1$ and its gradient is Lipschitz with constant $L$; $\mathcal{L}$ is non empty;
- $\sum_n \gamma_n = +\infty$ and $\gamma_n \in (0, 1/L]$;
- convergence of the series
$$\sum_n \gamma_{n+1}^2 \| \eta_{n+1} \|^2, \qquad \sum_n \gamma_{n+1} \eta_{n+1}, \qquad \sum_n \gamma_{n+1} \langle T_n, \eta_{n+1} \rangle,$$
where $T_n = \operatorname{Prox}_{\gamma_{n+1}, g}\big( \theta_n - \gamma_{n+1} \nabla f(\theta_n) \big)$.

Then there exists $\theta_\star \in \mathcal{L}$ such that $\lim_n \theta_n = \theta_\star$.

Sketch of proof

Its proof relies on

1. a deterministic Lyapunov inequality
$$\| \theta_{n+1} - \theta_\star \|^2 \leq \| \theta_n - \theta_\star \|^2 - 2 \gamma_{n+1} \underbrace{\big( F(\theta_{n+1}) - \min F \big)}_{\text{non-negative}} - 2 \gamma_{n+1} \underbrace{\langle T_n - \theta_\star, \eta_{n+1} \rangle}_{\text{signed noise}} + 2 \gamma_{n+1}^2 \| \eta_{n+1} \|^2 ;$$

2. (an extension of) the Robbins-Siegmund lemma: let $\{v_n, n \geq 0\}$ and $\{\chi_n, n \geq 0\}$ be non-negative sequences and $\{\xi_n, n \geq 0\}$ be such that $\sum_n \xi_n$ exists. If for any $n \geq 0$,
$$v_{n+1} \leq v_n - \chi_{n+1} + \xi_{n+1},$$
then $\sum_n \chi_n < \infty$ and $\lim_n v_n$ exists.

Remark: deterministic lemma, signed noise.

What about Nesterov-based acceleration? (FISTA)

Let $\{t_n, n \geq 0\}$ be a positive sequence such that
$$\gamma_{n+1} t_n (t_n - 1) \leq \gamma_n t_{n-1}^2 .$$

Nesterov acceleration of the Proximal Gradient algorithm:
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\big( \tau_n - \gamma_{n+1} \nabla f(\tau_n) \big)$$
$$\tau_{n+1} = \theta_{n+1} + \frac{t_n - 1}{t_{n+1}} \big( \theta_{n+1} - \theta_n \big)$$

Nesterov (2004), Tseng (2008), Beck-Teboulle (2009), Zhu-Orecchia (2015); Attouch-Peypouquet (2015); Bubeck-Lee-Singh (2015); Su-Boyd-Candes (2015).

- (deterministic) Proximal-gradient: $F(\theta_n) - \min F = O\left( \frac{1}{n} \right)$
- (deterministic) Accelerated Proximal-gradient: $F(\theta_n) - \min F = O\left( \frac{1}{n^2} \right)$
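The two deterministic rates can be compared on a small example. The sketch below makes several assumptions that are not in the talk: a separable, ill-conditioned quadratic $f$ with $L = 1$, $g = \lambda \|\theta\|_1$, a constant stepsize $\gamma = 1/L$, and the classical choice $t_0 = 1$, $t_{n+1} = \frac{1}{2}\big(1 + \sqrt{1 + 4 t_n^2}\big)$, which satisfies the condition on $\{t_n\}$ with equality when the stepsize is constant.

```python
import math

# Illustrative separable problem: f(theta) = 0.5 * sum_k q_k * (theta_k - c_k)^2,
# g(theta) = LAM * ||theta||_1. The Lipschitz constant of grad f is max(q_k) = 1.
Q_DIAG = [1.0, 0.01]
C = [1.0, 5.0]
LAM = 0.01
LIP = 1.0

def soft(t, level):
    return t - level if t > level else t + level if t < -level else 0.0

def objective(theta):
    smooth = 0.5 * sum(q * (t - c) ** 2 for q, t, c in zip(Q_DIAG, theta, C))
    return smooth + LAM * sum(abs(t) for t in theta)

def prox_grad_step(point, gamma):
    # Prox_{gamma, g}(point - gamma * grad f(point)), computed componentwise.
    return [soft(p - gamma * q * (p - c), gamma * LAM)
            for q, p, c in zip(Q_DIAG, point, C)]

# Explicit minimizer of the separable problem, used only to measure the gap.
THETA_STAR = [soft(c, LAM / q) for q, c in zip(Q_DIAG, C)]
F_MIN = objective(THETA_STAR)

def proximal_gradient(theta0, n_iter):
    theta = list(theta0)
    for _ in range(n_iter):
        theta = prox_grad_step(theta, 1.0 / LIP)
    return theta

def fista(theta0, n_iter):
    theta, tau, t = list(theta0), list(theta0), 1.0
    for _ in range(n_iter):
        theta_next = prox_grad_step(tau, 1.0 / LIP)
        t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        # Extrapolation with weight (t_n - 1) / t_{n+1}.
        tau = [new + (t - 1.0) / t_next * (new - old)
               for new, old in zip(theta_next, theta)]
        theta, t = theta_next, t_next
    return theta

n_iter = 100
gap_pg = objective(proximal_gradient([0.0, 0.0], n_iter)) - F_MIN
gap_fista = objective(fista([0.0, 0.0], n_iter)) - F_MIN
print(gap_pg, gap_fista)
```

On this example the accelerated iterate reaches a smaller objective gap than the plain proximal-gradient iterate after the same number of iterations.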

Convergence results for perturbed FISTA

When $\nabla f(\tau_n)$ is replaced with $H_{n+1}$:

Perturbed FISTA
$$H_{n+1} \approx \nabla f(\tau_n)$$
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\big( \tau_n - \gamma_{n+1} H_{n+1} \big)$$
$$\tau_{n+1} = \theta_{n+1} + \frac{t_n - 1}{t_{n+1}} \big( \theta_{n+1} - \theta_n \big)$$

Under conditions on $\gamma_n$, $t_n$ and on the perturbation $\tilde{\eta}_{n+1} \overset{\text{def}}{=} H_{n+1} - \nabla f(\tau_n)$,
$$\sum_n \gamma_{n+1} t_n \langle z_n - \theta^*, \tilde{\eta}_{n+1} \rangle < \infty,$$
we have (F., Risser, Atchadé, Moulines; 2018):
$$\lim_n \gamma_{n+1} t_n^2 F(\theta_n) \ \text{exists}.$$
Explicit control of this quantity.
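A sketch of perturbed FISTA on a toy problem, with every specific choice being an assumption rather than the setting of the cited paper: a separable quadratic $f$ with $L = 1$, $g = \lambda \|\theta\|_1$, constant stepsize, the classical sequence $t_{n+1} = \frac{1}{2}\big(1 + \sqrt{1 + 4 t_n^2}\big)$, and a gradient approximation whose error is the mean of $m_n = (n+1)^3$ independent centered Gaussian draws. Because $t_n$ grows linearly, the perturbation has to shrink along the iterations, which is why the batch size increases with $n$.

```python
import math
import random

# Illustrative separable problem: f(theta) = 0.5 * sum_k q_k * (theta_k - c_k)^2,
# g(theta) = LAM * ||theta||_1, Lipschitz constant of grad f equal to 1.
Q_DIAG = [1.0, 0.1]
C = [1.0, 3.0]
LAM = 0.1
LIP = 1.0
SIGMA = 0.2  # standard deviation of a single Monte Carlo draw (assumption)

def soft(t, level):
    return t - level if t > level else t + level if t < -level else 0.0

def objective(theta):
    smooth = 0.5 * sum(q * (t - c) ** 2 for q, t, c in zip(Q_DIAG, theta, C))
    return smooth + LAM * sum(abs(t) for t in theta)

THETA_STAR = [soft(c, LAM / q) for q, c in zip(Q_DIAG, C)]
F_MIN = objective(THETA_STAR)

def monte_carlo_gradient(point, batch_size, rng):
    # H_{n+1} = grad f(point) + mean of batch_size i.i.d. N(0, SIGMA^2) errors.
    # That mean is exactly N(0, SIGMA^2 / batch_size), so it is drawn directly
    # to keep the sketch cheap.
    std = SIGMA / math.sqrt(batch_size)
    return [q * (p - c) + rng.gauss(0.0, std) for q, p, c in zip(Q_DIAG, point, C)]

def perturbed_fista(theta0, n_iter, rng):
    gamma = 1.0 / LIP
    theta, tau, t = list(theta0), list(theta0), 1.0
    for n in range(n_iter):
        h = monte_carlo_gradient(tau, batch_size=(n + 1) ** 3, rng=rng)
        theta_next = [soft(p - gamma * h_k, gamma * LAM) for p, h_k in zip(tau, h)]
        t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        tau = [new + (t - 1.0) / t_next * (new - old)
               for new, old in zip(theta_next, theta)]
        theta, t = theta_next, t_next
    return theta

theta_end = perturbed_fista([0.0, 0.0], n_iter=100, rng=random.Random(0))
gap = objective(theta_end) - F_MIN
print(theta_end, gap)
```

The noise in this sketch is unbiased; it does not reproduce the biased Monte Carlo approximations discussed in the talk.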
