Stochastic Perturbations of Proximal-Gradient methods for nonsmooth convex optimization: the price of Markovian perturbations (PowerPoint PPT Presentation)


1. Stochastic Perturbations of Proximal-Gradient methods for nonsmooth convex optimization: the price of Markovian perturbations
Gersende Fort (LTCI, CNRS and Telecom ParisTech, Paris, France)
Based on joint works with Eric Moulines (Ecole Polytechnique, France), Yves Atchadé (Univ. Michigan, USA), Jean-François Aujol (Univ. Bordeaux, France) and Charles Dossal (Univ. Bordeaux, France).
→ On Perturbed Proximal-Gradient algorithms (2016-v3, arXiv)

2. Outline
- Application: Penalized Maximum Likelihood inference in latent variable models
- Stochastic Gradient methods (case g = 0)
- Stochastic Proximal Gradient methods
- Rates of convergence
- High-dimensional logistic regression with random effects

3. Penalized Maximum Likelihood inference in a latent variable model
N observations: Y = (Y_1, ..., Y_N).
A negative normalized log-likelihood of the observations Y, in a latent variable model:
θ ↦ −(1/N) log L(Y, θ),   with L(Y, θ) = ∫ p_θ(x, Y) μ(dx),   θ ∈ Θ ⊂ R^d.
A penalty term on the parameter θ: θ ↦ g(θ), used for sparsity constraints on θ; usually non-smooth and convex.
Goal: computation of
argmin_{θ∈Θ} { −(1/N) log L(Y, θ) + g(θ) }
when the likelihood L has no closed-form expression and cannot be evaluated.

4. Latent variable model: example (Generalized Linear Mixed Models)
GLMM: Y_1, ..., Y_N are independent observations from a Generalized Linear Model, with linear predictor
η_i = Σ_{k=1}^p X_{i,k} β_k  +  Σ_{ℓ=1}^q Z_{i,ℓ} U_ℓ
(fixed effect + random effect), where
- X, Z: covariate matrices;
- β ∈ R^p: fixed effect parameter;
- U ∈ R^q: random effect parameter.

5. Latent variable model: example (Generalized Linear Mixed Models)
GLMM: Y_1, ..., Y_N are independent observations from a Generalized Linear Model, with linear predictor
η_i = Σ_{k=1}^p X_{i,k} β_k  +  Σ_{ℓ=1}^q Z_{i,ℓ} U_ℓ
(fixed effect + random effect), where X, Z are covariate matrices, β ∈ R^p is the fixed effect parameter and U ∈ R^q is the random effect parameter.
Example: logistic regression. Y_1, ..., Y_N are binary independent observations: Bernoulli r.v. with mean p_i = exp(η_i) / (1 + exp(η_i)), so that
P(Y_1, ..., Y_N | U) = Π_{i=1}^N exp(Y_i η_i) / (1 + exp(η_i)).
Gaussian random effect: U ∼ N_q.
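The logistic GLMM above can be simulated in a few lines. This is a hedged sketch, not code from the talk: the dimensions N, p, q, the covariates and the parameter values are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N, p, q = 100, 5, 3            # observations, fixed effects, random effects (arbitrary)
X = rng.normal(size=(N, p))    # fixed-effect covariate matrix
Z = rng.normal(size=(N, q))    # random-effect covariate matrix
beta = rng.normal(size=p)      # fixed-effect parameter, beta in R^p
U = rng.normal(size=q)         # Gaussian random effect, U ~ N_q(0, I)

eta = X @ beta + Z @ U                # linear predictor eta_i
p_i = 1.0 / (1.0 + np.exp(-eta))      # Bernoulli means p_i = exp(eta_i)/(1+exp(eta_i))
Y = rng.binomial(1, p_i)              # binary observations Y_1, ..., Y_N
```

The latent U is drawn once and shared by all observations; conditionally on U, the Y_i are independent Bernoulli variables, matching the conditional likelihood on the slide.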

6. Gradient of the log-likelihood
log L(Y, θ) = log ∫ p_θ(x, Y) μ(dx).
Under regularity conditions, θ ↦ log L(Y, θ) is C^1 and
∇_θ log L(Y, θ) = ∫ ∂_θ p_θ(x, Y) μ(dx) / ∫ p_θ(z, Y) μ(dz)
               = ∫ ∂_θ log p_θ(x, Y) · [ p_θ(x, Y) μ(dx) / ∫ p_θ(z, Y) μ(dz) ],
where the bracketed ratio is the a posteriori distribution of the latent variable.

7. Gradient of the log-likelihood
log L(Y, θ) = log ∫ p_θ(x, Y) μ(dx).
Under regularity conditions, θ ↦ log L(Y, θ) is C^1 and
∇_θ log L(Y, θ) = ∫ ∂_θ p_θ(x, Y) μ(dx) / ∫ p_θ(z, Y) μ(dz)
               = ∫ ∂_θ log p_θ(x, Y) · [ p_θ(x, Y) μ(dx) / ∫ p_θ(z, Y) μ(dz) ],
where the bracketed ratio is the a posteriori distribution of the latent variable.
The gradient of the negative normalized log-likelihood
∇_θ ( −(1/N) log L(Y, θ) ) = ∫ H_θ(x) π_θ(dx)
is an intractable expectation w.r.t. π_θ, the conditional distribution of the latent variable given the observations Y. For all (x, θ), H_θ(x) can be evaluated.
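The identity "gradient of the log-likelihood = posterior expectation of ∂_θ log p_θ" can be checked numerically on a toy model that is not in the slides: a Gaussian latent variable x ∼ N(θ, 1) with observation Y | x ∼ N(x, 1), for which log L(Y, θ) = log N(Y; θ, 2) is available in closed form. A sketch under those assumptions, using plain grid quadrature:

```python
import numpy as np

Y, theta = 0.7, -0.3
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]

# Joint density p_theta(x, Y), up to constants that cancel after normalization:
# x ~ N(theta, 1) and Y | x ~ N(x, 1).
joint = np.exp(-0.5 * (x - theta) ** 2 - 0.5 * (Y - x) ** 2)
post = joint / (joint.sum() * dx)        # posterior pi_theta(dx), normalized on the grid

H = x - theta                            # partial_theta log p_theta(x, Y) for this model
grad_quadrature = (H * post).sum() * dx  # posterior expectation E_pi[ partial_theta log p_theta ]
grad_closed_form = (Y - theta) / 2.0     # from log L(Y, theta) = log N(Y; theta, 2)
```

The two values agree up to quadrature error, which is exactly the identity displayed on the slide (stated here for the log-likelihood itself; the slide's H_θ differs by the −1/N normalization).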

8. Approximation of the gradient
∇_θ ( −(1/N) log L(Y, θ) ) = ∫_X H_θ(x) π_θ(dx)
1. Quadrature techniques: poor behavior w.r.t. the dimension of X.
2. Monte Carlo approximation with i.i.d. samples: not possible, in general.
3. Markov chain Monte Carlo approximation: sample a Markov chain {X_{m,θ}, m ≥ 0} with stationary distribution π_θ(dx) and set
∫_X H_θ(x) π_θ(dx) ≈ (1/M) Σ_{m=1}^M H_θ(X_{m,θ}).

9. Approximation of the gradient
∇_θ ( −(1/N) log L(Y, θ) ) = ∫_X H_θ(x) π_θ(dx)
1. Quadrature techniques: poor behavior w.r.t. the dimension of X.
2. Monte Carlo approximation with i.i.d. samples: not possible, in general.
3. Markov chain Monte Carlo approximation: sample a Markov chain {X_{m,θ}, m ≥ 0} with stationary distribution π_θ(dx) and set
∫_X H_θ(x) π_θ(dx) ≈ (1/M) Σ_{m=1}^M H_θ(X_{m,θ}).
Stochastic approximation of the gradient:
- a biased approximation: E[ (1/M) Σ_{m=1}^M H_θ(X_{m,θ}) ] ≠ ∫ H_θ(x) π_θ(dx);
- if the chain is ergodic "enough", the bias vanishes when M → ∞.
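On the same kind of toy model (a Gaussian latent variable x ∼ N(θ, 1) with Y | x ∼ N(x, 1), not taken from the slides), the MCMC approximation (1/M) Σ_m H_θ(X_{m,θ}) can be sketched with a random-walk Metropolis sampler; the closed-form value (Y − θ)/2 of the posterior expectation of x − θ makes the Monte Carlo error visible. All tuning constants below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
Y, theta = 0.7, -0.3

def log_joint(x):
    # log p_theta(x, Y) up to an additive constant: x ~ N(theta, 1), Y | x ~ N(x, 1)
    return -0.5 * (x - theta) ** 2 - 0.5 * (Y - x) ** 2

# Random-walk Metropolis chain {X_m} with stationary distribution pi_theta
M = 50_000
x, samples = 0.0, np.empty(M)
for m in range(M):
    prop = x + 0.8 * rng.normal()        # Gaussian proposal (illustrative scale)
    if np.log(rng.random()) < log_joint(prop) - log_joint(x):
        x = prop                         # accept; otherwise keep the current state
    samples[m] = x

grad_mcmc = np.mean(samples - theta)     # (1/M) sum_m H_theta(X_m) with H_theta(x) = x - theta
grad_exact = (Y - theta) / 2.0           # closed form for this toy model
```

For a fixed M the estimator is biased (the chain is not started at π_θ and its samples are correlated), matching the remark on the slide; the bias and the Monte Carlo error vanish as M → ∞.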

10. To summarize
Problem: argmin_{θ∈Θ} F(θ) with F(θ) = f(θ) + g(θ), where θ ∈ Θ ⊆ R^d:
- g is a convex non-smooth function (explicit);
- f is C^1 and its gradient is of the form
∇f(θ) = ∫ H_θ(x) π_θ(dx) ≈ (1/M) Σ_{m=1}^M H_θ(X_{m,θ}),
where {X_{m,θ}, m ≥ 0} is the output of an MCMC sampler with target π_θ.

11. To summarize
Problem: argmin_{θ∈Θ} F(θ) with F(θ) = f(θ) + g(θ), where θ ∈ Θ ⊆ R^d:
- g is a convex non-smooth function (explicit);
- f is C^1 and its gradient is of the form
∇f(θ) = ∫ H_θ(x) π_θ(dx) ≈ (1/M) Σ_{m=1}^M H_θ(X_{m,θ}),
where {X_{m,θ}, m ≥ 0} is the output of an MCMC sampler with target π_θ.
Difficulties:
- biased stochastic perturbation of the gradient;
- gradient-based methods in the Stochastic Approximation framework (a fixed number of Monte Carlo samples);
- weaker conditions on the stochastic perturbation.
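The explicit non-smooth part g enters proximal-gradient methods only through its proximal operator. For the sparsity-inducing choice g(θ) = λ‖θ‖₁ (an example consistent with the sparsity constraints mentioned earlier, not a choice made explicit on this slide), the prox is the componentwise soft-thresholding map; a minimal sketch, with arbitrary values of the step γ and the weight λ:

```python
import numpy as np

def prox_l1(theta, gamma, lam):
    # prox_{gamma * lam * ||.||_1}(theta) = argmin_u { gamma*lam*||u||_1 + 0.5*||u - theta||^2 }
    # which is componentwise soft-thresholding at level gamma * lam.
    return np.sign(theta) * np.maximum(np.abs(theta) - gamma * lam, 0.0)

out = prox_l1(np.array([3.0, -0.5, 1.5]), gamma=1.0, lam=1.0)
```

Coordinates with magnitude below γλ are set exactly to zero, which is what makes the L1 penalty produce sparse iterates.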

12. Outline
- Application: Penalized Maximum Likelihood inference in latent variable models
- Stochastic Gradient methods (case g = 0)
- Stochastic Proximal Gradient methods
- Rates of convergence
- High-dimensional logistic regression with random effects

13. Perturbed gradient algorithm (case g = 0)
Algorithm: given a stepsize/learning rate sequence {γ_n, n ≥ 0}:
- Initialisation: θ_0 ∈ Θ.
- Repeat: compute H_{n+1}, an approximation of ∇f(θ_n), and set θ_{n+1} = θ_n − γ_{n+1} H_{n+1}.
M. Benaïm. Dynamics of stochastic approximation algorithms. Séminaire de Probabilités de Strasbourg (1999).
A. Benveniste, M. Métivier and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, New York, 1990.
V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge Univ. Press (2008).
M. Duflo. Random Iterative Models. Appl. Math. 34, Springer-Verlag, Berlin, 1997.
H. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer (2003).
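The recursion θ_{n+1} = θ_n − γ_{n+1} H_{n+1} can be demonstrated on a toy smooth problem that is not from the talk: f(θ) = ½‖θ − θ*‖² with an additive-noise gradient approximation and the decreasing step size γ_n = 1/n. A hedged sketch (θ*, the noise level and the iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = np.array([1.0, -2.0])   # minimizer of f (illustrative)

def grad_f(theta):
    # exact gradient of f(theta) = 0.5 * ||theta - theta_star||^2
    return theta - theta_star

theta = np.zeros(2)                  # theta_0
for n in range(1, 5001):
    gamma = 1.0 / n                                # step size gamma_n
    H = grad_f(theta) + 0.1 * rng.normal(size=2)   # noisy approximation H_{n+1} of grad f(theta_n)
    theta = theta - gamma * H                      # theta_{n+1} = theta_n - gamma_{n+1} H_{n+1}
```

With i.i.d. zero-mean noise and Σ γ_n = ∞, Σ γ_n² < ∞, classical Stochastic Approximation results (e.g. the references above) give convergence of θ_n to θ*; the point of the talk is what survives when H_{n+1} is instead a biased, Markovian perturbation.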
