

SLIDE 1

Stochastic Perturbations of Proximal-Gradient methods for nonsmooth convex optimization: the price of Markovian perturbations

Gersende Fort

LTCI, CNRS and Télécom ParisTech, Paris, France

Based on joint works with Eric Moulines (École Polytechnique, France), Yves Atchadé (Univ. Michigan, USA), Jean-François Aujol (Univ. Bordeaux, France) and Charles Dossal (Univ. Bordeaux, France) → On Perturbed Proximal-Gradient algorithms (2016-v3, arXiv)

SLIDE 2

Outline

- Application: Penalized Maximum Likelihood inference in latent variable models
- Stochastic Gradient methods (case g = 0)
- Stochastic Proximal Gradient methods
- Rates of convergence
- High-dimensional logistic regression with random effects

SLIDE 3

Penalized Maximum Likelihood inference, latent variable model

N observations: Y = (Y1, …, YN).
A negative normalized log-likelihood of the observations Y, in a latent variable model:
θ ↦ −(1/N) log L(Y, θ),  with  L(Y, θ) = ∫ pθ(x, Y) μ(dx),  θ ∈ Θ ⊂ Rd.
A penalty term on the parameter θ: θ ↦ g(θ), for sparsity constraints on θ; usually non-smooth and convex.

Goal: computation of
argminθ∈Θ { −(1/N) log L(Y, θ) + g(θ) }
when the likelihood L has no closed-form expression and cannot be evaluated.
SLIDE 4

Latent variable model: example (Generalized Linear Mixed Models)

GLMM: Y1, …, YN are independent observations from a Generalized Linear Model, with linear predictor
ηi = ∑_{k=1}^{p} Xi,k βk  (fixed effect)  +  ∑_{ℓ=1}^{q} Zi,ℓ Uℓ  (random effect)
where X, Z are covariate matrices, β ∈ Rp is the fixed effect parameter and U ∈ Rq is the random effect parameter.

SLIDE 5

Latent variable model: example (Generalized Linear Mixed Models, continued)

Example: logistic regression. Y1, …, YN are binary independent observations: Bernoulli r.v. with mean pi = exp(ηi)/(1 + exp(ηi)), conditionally on U:
P(Y1, …, YN | U) = ∏_{i=1}^{N} exp(Yi ηi) / (1 + exp(ηi)),
with a Gaussian random effect U ∼ Nq(0, I).
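The generative model above can be simulated in a few lines. The sketch below is illustrative only: the dimensions, the covariates and the sparse β are made-up placeholders, not the data set of the numerical section.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: N observations, p fixed-effect covariates, q random effects.
N, p, q = 500, 10, 5
X = rng.normal(size=(N, p))                 # fixed-effect design matrix
Z = rng.normal(size=(N, q))                 # random-effect design matrix
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                 # sparse fixed-effect parameter

U = rng.normal(size=q)                      # latent random effect, U ~ N_q(0, I)
eta = X @ beta + Z @ U                      # linear predictor eta_i
prob = 1.0 / (1.0 + np.exp(-eta))           # p_i = exp(eta_i) / (1 + exp(eta_i))
Y = rng.binomial(1, prob)                   # binary observations Y_1, ..., Y_N
```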

SLIDE 6

Gradient of the log-likelihood

log L(Y, θ) = log ∫ pθ(x, Y) μ(dx).

Under regularity conditions, θ ↦ log L(Y, θ) is C1 and
∇θ log L(Y, θ) = ∫ ∂θ pθ(x, Y) μ(dx) / ∫ pθ(z, Y) μ(dz)
               = ∫ ∂θ log pθ(x, Y) · pθ(x, Y) μ(dx) / ∫ pθ(z, Y) μ(dz),
where pθ(x, Y) μ(dx) / ∫ pθ(z, Y) μ(dz) is the a posteriori distribution of the latent variable.
SLIDE 7

Gradient of the log-likelihood (continued)

The gradient of the log-likelihood,
∇θ { −(1/N) log L(Y, θ) } = ∫ Hθ(x) πθ(dx),
is an intractable expectation w.r.t. πθ, the conditional distribution of the latent variable given the observations Y. For all (x, θ), Hθ(x) can be evaluated.

SLIDE 8

Approximation of the gradient

∇θ { −(1/N) log L(Y, θ) } = ∫_X Hθ(x) πθ(dx)

1. Quadrature techniques: poor behavior w.r.t. the dimension of X.
2. Monte Carlo approximation with i.i.d. samples: not possible, in general.
3. Markov chain Monte Carlo approximations: sample a Markov chain {Xm,θ, m ≥ 0} with stationary distribution πθ(dx) and set
   ∫_X Hθ(x) πθ(dx) ≈ (1/M) ∑_{m=1}^{M} Hθ(Xm,θ).

SLIDE 9

Approximation of the gradient (continued)

Stochastic approximation of the gradient: a biased approximation,
E[ (1/M) ∑_{m=1}^{M} Hθ(Xm,θ) ] ≠ ∫ Hθ(x) πθ(dx);
if the chain is ergodic "enough", the bias vanishes when M → ∞.
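The MCMC estimator of item 3 can be sketched with a random-walk Metropolis chain; the one-dimensional target, the function H and all tuning constants below are illustrative choices, not those of the talk.

```python
import numpy as np

def mh_average(h, log_target, x0, n_samples, step=1.0, seed=0):
    """Estimate the integral of h w.r.t. pi by averaging h along a random-walk
    Metropolis chain whose stationary distribution is pi ~ exp(log_target)."""
    rng = np.random.default_rng(seed)
    x, lp = x0, log_target(x0)
    total = 0.0
    for _ in range(n_samples):
        y = x + step * rng.normal()            # Gaussian random-walk proposal
        lq = log_target(y)
        if np.log(rng.uniform()) < lq - lp:    # accept with prob min(1, pi(y)/pi(x))
            x, lp = y, lq
        total += h(x)
    return total / n_samples

# Target pi = N(0, 1) and H(x) = x, so the exact integral is 0.  Started away
# from stationarity (x0 = 3), the estimator is biased for a fixed chain length;
# the bias vanishes as the number of samples grows.
estimate = mh_average(lambda x: x, lambda x: -0.5 * x ** 2, x0=3.0, n_samples=20000)
```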

SLIDE 10

To summarize,

Problem: argminθ∈Θ F(θ) with F(θ) = f(θ) + g(θ), θ ∈ Θ ⊆ Rd, where
- g is a convex non-smooth function (explicit);
- f is C1 and its gradient is of the form
  ∇f(θ) = ∫ Hθ(x) πθ(dx) ≈ (1/M) ∑_{m=1}^{M} Hθ(Xm,θ),
  where {Xm,θ, m ≥ 0} is the output of an MCMC sampler with target πθ.

SLIDE 11

To summarize (continued),

Difficulties:
- a biased stochastic perturbation of the gradient;
- gradient-based methods in the Stochastic Approximation framework (a fixed number of Monte Carlo samples per iteration);
- weaker conditions on the stochastic perturbation.

SLIDE 12

Outline

- Application: Penalized Maximum Likelihood inference in latent variable models
- Stochastic Gradient methods (case g = 0)
- Stochastic Proximal Gradient methods
- Rates of convergence
- High-dimensional logistic regression with random effects

SLIDE 13

Perturbed gradient algorithm

Algorithm: given a stepsize/learning rate sequence {γn, n ≥ 0}:
- Initialisation: θ0 ∈ Θ.
- Repeat: compute Hn+1, an approximation of ∇f(θn); set θn+1 = θn − γn+1 Hn+1.

References:
- M. Benaïm. Dynamics of stochastic approximation algorithms. Séminaire de Probabilités de Strasbourg (1999).
- A. Benveniste, M. Métivier and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, New York, 1990.
- V. Borkar. Stochastic Approximation: a Dynamical Systems Viewpoint. Cambridge Univ. Press (2008).
- M. Duflo. Random Iterative Systems. Appl. Math. 34, Springer-Verlag, Berlin, 1997.
- H. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer (2003).
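The iteration above, in code. A minimal sketch: the quadratic objective, the noise level and the stepsize sequence are illustrative choices.

```python
import numpy as np

def perturbed_gradient(grad_approx, theta0, gammas):
    """theta_{n+1} = theta_n - gamma_{n+1} H_{n+1}, where
    H_{n+1} = grad_approx(theta_n, n) approximates grad f(theta_n)."""
    theta = np.asarray(theta0, dtype=float)
    for n, gamma in enumerate(gammas):
        H = grad_approx(theta, n)
        theta = theta - gamma * H
    return theta

# Toy example: f(theta) = 0.5 ||theta||^2, so grad f(theta) = theta, perturbed
# by additive noise.  The classical stepsizes gamma_n = 1/n satisfy
# sum_n gamma_n = +inf and sum_n gamma_n^2 < inf.
rng = np.random.default_rng(1)
grad_noisy = lambda theta, n: theta + 0.1 * rng.normal(size=theta.shape)
theta_final = perturbed_gradient(grad_noisy, [5.0, -3.0],
                                 [1.0 / (n + 1) for n in range(5000)])
```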
SLIDE 14

Sufficient conditions for the convergence

Set L = {θ ∈ Θ : ∇f(θ) = 0} and ηn+1 = Hn+1 − ∇f(θn).

Theorem (Andrieu-Moulines-Priouret (2005); F.-Moulines-Schreck-Vihola (2016))
Assume:
- the level sets of f are compact subsets of Θ and L is in a level set of f;
- ∑n γn = +∞ and ∑n γn² < ∞;
- ∑n γn ηn+1 1{θn ∈ K} < ∞ for any compact subset K of Θ.
Then (i) there exists a compact subset K⋆ of Θ s.t. θn ∈ K⋆ for all n; (ii) {f(θn), n ≥ 0} converges to a connected component of f(L). If in addition ∇f is locally Lipschitz and ∑n γn² ‖ηn+1‖² 1{θn ∈ K} < ∞, then {θn, n ≥ 0} converges to a connected component of {θ : ∇f(θ) = 0}.

SLIDE 15

When Hn+1 is a Monte Carlo approximation (1)

∇f(θn) = ∫ Hθn(x) πθn(dx)

Two strategies:
(1) Stochastic Approximation (fixed batch size): Hn+1 = Hθn(X1,n);
(2) Monte Carlo assisted optimization (increasing batch size): Hn+1 = (1/Mn+1) ∑_{m=1}^{Mn+1} Hθn(Xm,n);
where {Xm,n}m "approximate" the target πθn(dx).
slide-16
SLIDE 16

Stochastic Perturbations of Proximal-Gradient methods for nonsmooth convex optimization: the price of Markovian perturbations Stochastic Gradient methods (case g = 0)

When Hn+1 is a Monte Carlo approximation (2)

∇f(θn) =

  • Hθn(x) πθn(dx)

With i.i.d. Monte Carlo: E [Hn+1|Fn] = ∇f(θn) unbiased approximation With Markov chain Monte Carlo approximation E [Hn+1|Fn] = ∇f(θn) Biased approximation !

SLIDE 17

When Hn+1 is a Monte Carlo approximation (2, continued)

With a Markov chain Monte Carlo approximation, E[Hn+1 | Fn] ≠ ∇f(θn), and the bias
|E[Hn+1 | Fn] − ∇f(θn)| = O_{Lp}(1/Mn+1)
does not vanish when the size of the batch is fixed.
SLIDE 18

When Hn+1 is a Monte Carlo approximation (3)

θn+1 = θn − γn+1 Hn+1,  Hn+1 = (1/Mn+1) ∑_{j=1}^{Mn+1} Hθn(Xj,n) ≈ ∇f(θn)

MCMC approx. and fixed batch size:
∑n γn = +∞,  ∑n γn² < ∞,  ∑n |γn+1 − γn| < ∞.

i.i.d. MC approx. / MCMC approx. with increasing batch size:
∑n γn = +∞,  ∑n γn²/Mn < ∞,  ∑n γn/Mn < ∞ (case MCMC).
SLIDE 19

A remark on the proof

∑_{n=1}^{N} γn+1 (Hn+1 − ∇f(θn)) = ∑_{n=1}^{N} γn+1 (∆n+1 + Rn+1) = Martingale + Remainder,
where ∆n+1 is a martingale increment and Rn+1 a remainder term.

How to define ∆n+1?
- unbiased MC approx.: ∆n+1 = Hn+1 − ∇f(θn);
- biased MC approx. with increasing batch size: ∆n+1 = Hn+1 − E[Hn+1 | Fn];
- biased MC approx. with fixed batch size: technical!

Stochastic Approximation with MCMC inputs: see e.g. Benveniste-Métivier-Priouret (1990), Springer-Verlag; Duflo (1997), Springer-Verlag; Andrieu-Moulines-Priouret (2005), SIAM Journal on Control and Optimization; F.-Moulines-Priouret (2012), Annals of Statistics; F.-Jourdain-Lelièvre-Stoltz (2015, 2016), Mathematics of Computation, Statistics and Computing; F.-Moulines-Schreck-Vihola (2016), SIAM Journal on Control and Optimization.

SLIDE 20

Outline

- Application: Penalized Maximum Likelihood inference in latent variable models
- Stochastic Gradient methods (case g = 0)
- Stochastic Proximal Gradient methods
- Rates of convergence
- High-dimensional logistic regression with random effects

SLIDE 21

Problem:

A gradient-based method for solving argminθ∈Θ F(θ) with F(θ) = f(θ) + g(θ), where
- g is non-smooth and convex;
- f is C1 and ∇f(θ) = ∫_X Hθ(x) πθ(dx).
Available: a Monte Carlo approximation of ∇f(θ) through Markov chain samples.

SLIDE 22

The setting, hereafter

argminθ∈Θ F(θ) with F(θ) = f(θ) + g(θ), where
- the function g: Rd → [0, ∞] is convex, non-smooth, not identically equal to +∞, and lower semi-continuous;
- the function f: Rd → R is a smooth convex function, i.e. f is continuously differentiable and there exists L > 0 such that
  ‖∇f(θ) − ∇f(θ′)‖ ≤ L ‖θ − θ′‖  ∀ θ, θ′ ∈ Rd;
- Θ ⊆ Rd is the domain of g: Θ = {θ : g(θ) < ∞}.

SLIDE 23

The proximal-gradient algorithm

The Proximal Gradient algorithm:
θn+1 = Prox_{γn+1,g}(θn − γn+1 ∇f(θn)),  where  Prox_{γ,g}(τ) = argminθ∈Θ { g(θ) + (1/2γ) ‖θ − τ‖² }.

Proximal map: Moreau (1962); Parikh-Boyd (2013). Proximal Gradient algorithm: Nesterov (2004); Beck-Teboulle (2009).

About the Prox step:
- when g = 0: Prox(τ) = τ;
- when g is the indicator function of a convex set, Prox is the projection onto that set: the algorithm is the projected gradient;
- in some cases, Prox is explicit (e.g. elastic net penalty);
- otherwise, numerical approximation: θn+1 = Prox_{γn+1,g}(θn − γn+1 ∇f(θn)) + εn+1.
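The explicit cases mentioned above can be written down directly; a minimal sketch (the function names are mine):

```python
import numpy as np

def prox_l1(tau, gamma, lam):
    """Prox of g = lam * ||.||_1: componentwise soft-thresholding."""
    return np.sign(tau) * np.maximum(np.abs(tau) - gamma * lam, 0.0)

def prox_elastic_net(tau, gamma, lam, mu):
    """Prox of the elastic net penalty g = lam*||.||_1 + (mu/2)*||.||^2,
    also explicit: soft-threshold, then shrink."""
    return prox_l1(tau, gamma, lam) / (1.0 + gamma * mu)

def prox_box(tau, lo, hi):
    """g = indicator of the box [lo, hi]^d: Prox is the Euclidean projection,
    and the proximal-gradient algorithm reduces to projected gradient."""
    return np.clip(tau, lo, hi)
```

For instance, at level γλ = 1 soft-thresholding maps 3.0 to 2.0 and sets −0.5 to 0.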

SLIDE 24

The perturbed proximal-gradient algorithm

The Perturbed Proximal Gradient algorithm:
θn+1 = Prox_{γn+1,g}(θn − γn+1 Hn+1),  where Hn+1 is an approximation of ∇f(θn).

There exist results under (some of) the assumptions
inf_n γn > 0,  ∑n ‖Hn+1 − ∇f(θn)‖ < ∞,  i.i.d. Monte Carlo approx.,
i.e. fixed stepsize, increasing batch size, and unverifiable conditions for MCMC sampling:
Combettes (2001), Elsevier Science; Combettes-Wajs (2005), Multiscale Modeling and Simulation; Combettes-Pesquet (2015, 2016), SIAM J. Optim., arXiv; Lin-Rosasco-Villa-Zhou (2015), arXiv; Rosasco-Villa-Vũ (2014, 2015), arXiv; Schmidt-Le Roux-Bach (2011), NIPS.

SLIDE 25

Convergence of the perturbed proximal gradient algorithm

θn+1 = Prox_{γn+1,g}(θn − γn+1 Hn+1) with Hn+1 ≈ ∇f(θn).
Set L = argminΘ (f + g) and ηn+1 = Hn+1 − ∇f(θn).

Theorem (Atchadé, F., Moulines (2015))
Assume:
- g convex, lower semi-continuous; f convex, C1, and its gradient is Lipschitz with constant L; L is non-empty;
- ∑n γn = +∞ and γn ∈ (0, 1/L];
- convergence of the series ∑n γ²n+1 ‖ηn+1‖², ∑n γn+1 ηn+1, ∑n γn+1 ⟨Sn, ηn+1⟩, where Sn = Prox_{γn+1,g}(θn − γn+1 ∇f(θn)).
Then there exists θ⋆ ∈ L such that limn θn = θ⋆.
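A runnable sketch of the perturbed scheme on a toy ℓ1-penalized least-squares problem; the design matrix, the gradient noise, the stepsizes and the penalty weight are illustrative choices, with the noisy gradient playing the role of Hn+1.

```python
import numpy as np

def prox_l1(tau, gamma, lam):
    # explicit prox of g = lam * ||.||_1 (soft-thresholding)
    return np.sign(tau) * np.maximum(np.abs(tau) - gamma * lam, 0.0)

def perturbed_prox_gradient(grad_approx, theta0, gammas, lam):
    """theta_{n+1} = Prox_{gamma_{n+1}, g}(theta_n - gamma_{n+1} H_{n+1})."""
    theta = np.asarray(theta0, dtype=float)
    for n, gamma in enumerate(gammas):
        H = grad_approx(theta, n)                 # H_{n+1} ~ grad f(theta_n)
        theta = prox_l1(theta - gamma * H, gamma, lam)
    return theta

# f(theta) = ||A theta - b||^2 / (2 N) with noisy gradient evaluations,
# g = lam * ||theta||_1, stepsizes gamma_n = gamma_* / sqrt(n).
rng = np.random.default_rng(2)
A = rng.normal(size=(50, 10))
theta_true = np.zeros(10)
theta_true[:2] = [1.0, -2.0]
b = A @ theta_true
grad_noisy = lambda th, n: A.T @ (A @ th - b) / 50 + 0.05 * rng.normal(size=10)
theta_hat = perturbed_prox_gradient(grad_noisy, np.zeros(10),
                                    [0.3 / np.sqrt(n + 1) for n in range(3000)],
                                    lam=0.01)
```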

SLIDE 26

When Hn+1 is a Monte Carlo approximation

θn+1 = Prox_{γn+1,g}(θn − γn+1 Hn+1),  Hn+1 = (1/Mn+1) ∑_{j=1}^{Mn+1} Hθn(Xj,n) ≈ ∇f(θn)

MCMC approx. and fixed batch size:
∑n γn = +∞,  ∑n γn² < ∞,  ∑n |γn+1 − γn| < ∞.

i.i.d. MC approx. / MCMC approx. with increasing batch size:
∑n γn = +∞,  ∑n γn²/Mn < ∞,  ∑n γn/Mn < ∞ (case MCMC).

→ Same conditions as in the Stochastic Gradient algorithm.

SLIDE 27

Outline

- Application: Penalized Maximum Likelihood inference in latent variable models
- Stochastic Gradient methods (case g = 0)
- Stochastic Proximal Gradient methods
- Rates of convergence
- High-dimensional logistic regression with random effects

SLIDE 28

Problem:

For non-negative weights ak, find an upper bound of
∑_{k=1}^{n} (ak / ∑_{ℓ=1}^{n} aℓ) F(θk) − min F.

It provides
- an upper bound for the cumulative regret (ak = 1);
- an upper bound for an averaging strategy when F is convex, since
  F( ∑_{k=1}^{n} (ak / ∑_{ℓ=1}^{n} aℓ) θk ) − min F ≤ ∑_{k=1}^{n} (ak / ∑_{ℓ=1}^{n} aℓ) F(θk) − min F.
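A quick numerical check of the convexity (Jensen) step above, with an illustrative one-dimensional convex F and arbitrary non-negative weights:

```python
import numpy as np

F = lambda theta: theta ** 2              # convex, with min F = 0 at theta = 0

thetas = np.array([3.0, 1.5, 0.5, 0.2])   # iterates theta_1, ..., theta_n
a = np.array([1.0, 1.0, 2.0, 4.0])        # non-negative weights a_1, ..., a_n
w = a / a.sum()                           # normalized weights a_k / sum_l a_l

lhs = F(np.dot(w, thetas))     # F(weighted average of the iterates) - min F
rhs = np.dot(w, F(thetas))     # weighted average of F(theta_k) - min F
```

Convexity guarantees lhs ≤ rhs, so any bound on the weighted sum of F(θk) − min F also bounds the averaged iterate.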

SLIDE 29

A deterministic control

Theorem (Atchadé, F., Moulines (2016))
For any θ⋆ ∈ argminΘ F,
∑_{k=1}^{n} (ak/An) (F(θk) − min F) ≤ (a0 / (2 γ0 An)) ‖θ0 − θ⋆‖²
  + (1 / (2 An)) ∑_{k=1}^{n} (ak/γk − ak−1/γk−1) ‖θk−1 − θ⋆‖²
  + (1/An) ∑_{k=1}^{n} ak γk ‖ηk‖²  −  (1/An) ∑_{k=1}^{n} ak ⟨Sk−1 − θ⋆, ηk⟩,
where An = ∑_{ℓ=1}^{n} aℓ, ηk = Hk − ∇f(θk−1), Sk = Prox_{γk,g}(θk−1 − γk ∇f(θk−1)).

SLIDE 30

When Hn+1 is a Monte Carlo approximation, bound in Lq

‖ F( (1/n) ∑_{k=1}^{n} θk ) − min F ‖_{Lq} ≤ ‖ (1/n) ∑_{k=1}^{n} F(θk) − min F ‖_{Lq} ≤ un

un = O(1/√n) with fixed batch size and (slowly) decaying stepsize: γn = γ⋆/n^a, a ∈ [1/2, 1], Mn = m⋆. With averaging: optimal rate, even with a slowly decaying stepsize γn ∼ 1/√n.

un = O(ln n / n) with increasing batch size and constant stepsize: γn = γ⋆, Mn = m⋆ n. Rate with O(n²) Monte Carlo samples!

SLIDE 31

Acceleration (1)

Let {tn, n ≥ 0} be a positive sequence s.t. γn+1 tn (tn − 1) ≤ γn t²n−1.

Nesterov acceleration of the Proximal Gradient algorithm:
θn+1 = Prox_{γn+1,g}(τn − γn+1 ∇f(τn)),  τn+1 = θn+1 + ((tn − 1)/tn+1) (θn+1 − θn).

Nesterov (1983); Beck-Teboulle (2009); Allen-Zhu-Orecchia (2015); Attouch-Peypouquet (2015); Bubeck-Lee-Singh (2015); Su-Boyd-Candès (2015).

Proximal gradient: F(θn) − min F = O(1/n).  Accelerated proximal gradient: F(θn) − min F = O(1/n²).
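The accelerated scheme with the standard FISTA sequence tn+1 = (1 + √(1 + 4 tn²))/2, which satisfies the condition above for a constant stepsize, can be sketched as follows; the ℓ1 penalty and the toy quadratic are illustrative choices.

```python
import numpy as np

def prox_l1(tau, gamma, lam):
    return np.sign(tau) * np.maximum(np.abs(tau) - gamma * lam, 0.0)

def accelerated_prox_gradient(grad_f, theta0, gamma, lam, n_iter):
    """Nesterov/FISTA acceleration:
    theta_{n+1} = Prox_{gamma,g}(tau_n - gamma * grad f(tau_n)),
    tau_{n+1}   = theta_{n+1} + ((t_n - 1)/t_{n+1}) (theta_{n+1} - theta_n)."""
    theta = np.asarray(theta0, dtype=float)
    tau, t = theta.copy(), 1.0
    for _ in range(n_iter):
        theta_new = prox_l1(tau - gamma * grad_f(tau), gamma, lam)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        tau = theta_new + ((t - 1.0) / t_new) * (theta_new - theta)
        theta, t = theta_new, t_new
    return theta

# Toy problem: f(theta) = 0.5 ||theta - c||^2 (so L = 1, take gamma = 1/L = 1),
# g = 0.1 * ||theta||_1; the minimizer is the soft-thresholding of c.
c = np.array([2.0, -0.05])
theta_star = accelerated_prox_gradient(lambda th: th - c, np.zeros(2),
                                       gamma=1.0, lam=0.1, n_iter=50)
```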

SLIDE 32

Acceleration (2): Aujol-Dossal-F.-Moulines, work in progress

Perturbed Nesterov acceleration: some convergence results.
Choose γn, Mn, tn s.t.
γn ∈ (0, 1/L],  limn γn t²n = +∞,  ∑n γn tn (1 + γn tn) / Mn < ∞.
Then there exists θ⋆ ∈ argminΘ F s.t. limn θn = θ⋆. In addition,
F(θn+1) − min F = O( 1 / (γn+1 t²n) ).

Schmidt-Le Roux-Bach (2011); Dossal-Chambolle (2014); Aujol-Dossal (2015).

Table: Control of F(θn) − min F
γn    | Mn  | tn | rate     | NbrMC
γ     | n^3 | n  | n^(−2)   | n^4
γ/√n  | n^2 | n  | n^(−3/2) | n^3

SLIDE 33

Outline

- Application: Penalized Maximum Likelihood inference in latent variable models
- Stochastic Gradient methods (case g = 0)
- Stochastic Proximal Gradient methods
- Rates of convergence
- High-dimensional logistic regression with random effects

SLIDE 34

Logistic regression with random effects

The model: given U ∈ Rq,
Yi ∼ B( exp(x′i β + σ z′i U) / (1 + exp(x′i β + σ z′i U)) ),  i = 1, …, N,  with U ∼ Nq(0, I).
Unknown parameters: β ∈ Rp and σ² > 0.

Stochastic approximation of the gradient of f:
∇f(θ) = ∫ Hθ(u) πθ(du)  with  πθ(u) ∝ Nq(0, I)[u] ∏_{i=1}^{N} exp(Yi (x′i β + σ z′i u)) / (1 + exp(x′i β + σ z′i u)),
→ sampled by MCMC, Polson-Scott-Windle (2013).
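The target πθ of the MCMC sampler can be coded directly from its unnormalized log-density. A minimal sketch (function and variable names are mine; Polson-Scott-Windle use a dedicated Pólya-Gamma data-augmentation sampler rather than this generic form):

```python
import numpy as np

def log_pi_theta(u, beta, sigma, X, Z, Y):
    """Unnormalized log-density of pi_theta(u), proportional to
    N_q(0, I)[u] * prod_i exp(Y_i eta_i) / (1 + exp(eta_i)),
    with eta_i = x_i' beta + sigma * z_i' u."""
    eta = X @ beta + sigma * (Z @ u)
    # log of prod_i exp(Y_i eta_i)/(1 + exp(eta_i)), computed stably
    loglik = np.sum(Y * eta - np.logaddexp(0.0, eta))
    return -0.5 * np.dot(u, u) + loglik       # Gaussian prior + likelihood
```

This log-density is all a generic random-walk Metropolis step needs to draw the chain {Xm,θ} targeting πθ.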

SLIDE 35

Numerical illustration

The data set (simulated): N = 500 observations, a sparse covariate vector βtrue ∈ R1000, q = 5 random effects. Penalty term: elastic net on β, and σ > 0.

Comparison of 5 algorithms:
- Algo1, fixed batch size: γn = 0.01/√n, Mn = 275
- Algo2, fixed batch size: γn = 0.5/n, Mn = 275
- Algo3, increasing batch size: γn = 0.005, Mn = 200 + n
- Algo4, increasing batch size: γn = 0.001, Mn = 200 + n
- Algo5, increasing batch size: γn = 0.05/√n, Mn = 270 + √n

After 150 iterations, the algorithms use the same number of MC draws.

SLIDE 36

A sparse limiting value

Displayed: for each algorithm, the non-zero entries of the limiting value β∞ ∈ R1000 of a path (βn)n.

[Figure: non-zero entries of β∞ over indices 1 to 1000, for βtrue and Algos 1-5.]

Algo1: γn = 0.01/√n, Mn = 275; Algo2: γn = 0.5/n, Mn = 275; Algo3: γn = 0.005, Mn = 200 + n; Algo4: γn = 0.001, Mn = 200 + n; Algo5: γn = 0.05/√n, Mn = 270 + √n.

SLIDE 37

Relative error

Displayed: for each algorithm, the relative error ‖βn − β150‖ / ‖β150‖ as a function of the total number of MC draws up to time n.

[Figure: relative error on a log scale (10⁻⁴ to 10⁴) vs total number of MC draws (up to 4·10⁴), for Algos 1-5.]

(⋆) Algo1: γn = 0.01/√n, Mn = 275; Algo2: γn = 0.5/n, Mn = 275; (⋆) Algo3: γn = 0.005, Mn = 200 + n; Algo4: γn = 0.001, Mn = 200 + n; Algo5: γn = 0.05/√n, Mn = 270 + √n.

SLIDE 38

Recovery of the sparsity structure of β∞ (= β150) (1)

Displayed: for each algorithm, the sensitivity
∑_{i=1}^{1000} 1{|βn,i| > 0} 1{|β∞,i| > 0} / ∑_{i=1}^{1000} 1{|β∞,i| > 0}
as a function of the total number of MC draws up to time n.

[Figure: sensitivity (0.2 to 1) vs total number of MC draws (up to 4·10⁴); left panel Algos 1-2, right panel Algos 3-5.]

(⋆) Algo1: γn = 0.01/√n, Mn = 275; Algo2: γn = 0.5/n, Mn = 275; (⋆) Algo3: γn = 0.005, Mn = 200 + n; Algo4: γn = 0.001, Mn = 200 + n; Algo5: γn = 0.05/√n, Mn = 270 + √n.

SLIDE 39

Recovery of the sparsity structure of β∞ (= β150) (2)

Displayed: for each algorithm, the precision
∑_{i=1}^{1000} 1{|βn,i| > 0} 1{|β∞,i| > 0} / ∑_{i=1}^{1000} 1{|βn,i| > 0}
as a function of the total number of MC draws up to time n.

[Figure: precision (0.2 to 1) vs total number of MC draws (up to 4·10⁴); left panel Algos 1-2, right panel Algos 3-5.]

(⋆) Algo1: γn = 0.01/√n, Mn = 275; Algo2: γn = 0.5/n, Mn = 275; (⋆) Algo3: γn = 0.005, Mn = 200 + n; Algo4: γn = 0.001, Mn = 200 + n; Algo5: γn = 0.05/√n, Mn = 270 + √n.

SLIDE 40

Convergence of E[F(θn)]

In this example, the mixed effects are chosen so that F(θ) can be approximated. Displayed: for some of the algorithms, a Monte Carlo approximation of E[F(θn)] over 50 independent runs, as a function of the total number of MC draws up to time n.

[Figure: E[F(θn)] on a log scale (10³ to 10⁴) vs total number of MC draws (up to 4·10⁴), for Algos 1, 3 and 4.]

(⋆) Algo1: γn = 0.01/√n, Mn = 275; (⋆) Algo3: γn = 0.005, Mn = 200 + n; Algo4: γn = 0.001, Mn = 200 + n.