Convergence of perturbed Proximal Gradient algorithms
Gersende Fort, Institut de Mathématiques de Toulouse, CNRS and Univ. Paul Sabatier, Toulouse, France
Based on joint works with:
Yves Atchadé (Univ. Michigan, USA) and Eric Moulines (École Polytechnique, France) → On Perturbed Proximal-Gradient algorithms (JMLR, 2016);
Jean-François Aujol (IMB, Bordeaux, France) and Charles Dossal (IMB, Bordeaux, France) → Acceleration for perturbed Proximal Gradient algorithms (work in progress);
Edouard Ollier (ENS Lyon, France) and Adeline Samson (Univ. Grenoble Alpes, France) → Penalized inference in Mixed Models by Proximal Gradient methods (work in progress).
Motivation : Pharmacokinetic (1/2)
N patients; at time 0, a dose D of a drug is administered. For patient i, observations {Y_ij, 1 ≤ j ≤ J_i}: evolution of the concentration at times t_ij, 1 ≤ j ≤ J_i. Model:
Y_ij = F(t_ij, X_i) + ε_ij,  ε_ij i.i.d. ∼ N(0, σ²),
X_i = Z_i β + d_i ∈ R^L,  d_i i.i.d. ∼ N_L(0, Ω) and independent of ε,
where Z_i is a known matrix such that each row of X_i has an intercept (fixed effect) and covariates.
Example of model F: one-compartment model, oral administration:
F(t, [ln Cl, ln V, ln A]) = C(Cl, V, A, D) [exp(−(Cl/V) t) − exp(−A t)].
For each patient i:
ln Cl_i = β_{0,Cl} + β_{1,Cl} Z^i_{1,Cl} + · · · + β_{K,Cl} Z^i_{K,Cl} + d_{Cl,i},
and idem for ln V_i (covariates Z^i_{k,V}, coefficients β_{k,V}, random effect d_{V,i}) and for ln A_i (covariates Z^i_{k,A}, coefficients β_{k,A}, random effect d_{A,i}).
Statistical analysis: estimation of θ = (β, σ², Ω) under sparsity constraints on β, and selection of the covariates based on β̂.
→ Penalized Maximum Likelihood.
Motivation: Pharmacokinetic (2/2)
Same model: Y_ij = F(t_ij, X_i) + ε_ij with ε_ij i.i.d. ∼ N(0, σ²), X_i = Z_i β + d_i ∈ R^L, d_i i.i.d. ∼ N_L(0, Ω) and independent of ε.
Likelihoods: the likelihood of the observations is not explicit, while the complete likelihood (the distribution of {Y_ij, X_i; 1 ≤ i ≤ N, 1 ≤ j ≤ J_i}) has an explicit expression. Moreover, the likelihood is not concave here.
Outline
Penalized Maximum Likelihood inference in models with intractable likelihood
  Example 1: Latent variable models
  Example 2: Discrete graphical model (Markov random field)
Numerical methods for Penalized ML in such models: Perturbed Proximal Gradient algorithms
Convergence analysis
Conclusion
Penalized Maximum Likelihood inference with intractable Likelihood
N observations: Y = (Y_1, · · · , Y_N). A parametric statistical model indexed by θ ∈ Θ ⊆ R^d (the dependence upon Y is omitted):
θ ↦ L(θ), the likelihood of the observations;
θ ↦ g(θ) ≥ 0, a penalty term on the parameter θ, for sparsity constraints on θ; usually g is non-smooth and convex.
Goal: computation of
argmax_{θ∈Θ} { (1/N) log L(θ) − g(θ) }
when the likelihood L has no closed-form expression and cannot be evaluated.
Example 1: Latent variable model
The log-likelihood of the observations Y is of the form θ ↦ log L(θ) with
L(θ) = ∫_X p_θ(x) μ(dx),
where μ is a positive σ-finite measure on a set X, and x collects the missing/latent data. In these models, the complete likelihood p_θ(x) can be evaluated explicitly, but the likelihood has no closed-form expression. The exact integral could be replaced by a Monte Carlo approximation, which is known to be inefficient; numerical methods based on the a posteriori distribution of the missing data are preferred (see e.g. Expectation-Maximization approaches).
→ What about the gradient of the (log-)likelihood?
Gradient of the likelihood in a latent variable model
log L(θ) = log ∫_X p_θ(x) μ(dx). Under regularity conditions, θ ↦ log L(θ) is C¹ and
∇ log L(θ) = ∫_X ∂_θ p_θ(x) μ(dx) / ∫_X p_θ(z) μ(dz) = ∫_X ∂_θ log p_θ(x) · [p_θ(x) / ∫_X p_θ(z) μ(dz)] μ(dx),
where x ↦ p_θ(x) / ∫_X p_θ(z) μ(dz) is the density of the a posteriori distribution.
The gradient of the log-likelihood,
∇_θ log L(θ) = ∫_X ∂_θ log p_θ(x) π_θ(dx),
is an intractable expectation w.r.t. π_θ, the conditional distribution of the latent variable given the observations Y. For all (x, θ), ∂_θ log p_θ(x) can be evaluated.
Approximation of the gradient
∇_θ log L(θ) = ∫_X ∂_θ log p_θ(x) π_θ(dx)
1. Quadrature techniques: poor behavior w.r.t. the dimension of X.
2. Use i.i.d. samples from π_θ to define a Monte Carlo approximation: not possible, in general.
3. Use m samples from a non-stationary Markov chain {X_{j,θ}, j ≥ 0} with unique stationary distribution π_θ, and define a Monte Carlo approximation; MCMC samplers provide such a chain.
Stochastic approximation of the gradient: a biased approximation, since for MCMC samples X_{j,θ},
E[h(X_{j,θ})] ≠ ∫ h(x) π_θ(dx).
If the Markov chain is ergodic, the bias vanishes when j → ∞.
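This bias can be made concrete on a toy example (an illustrative sketch, not from the talk; the Gaussian target and all names are invented): a random-walk Metropolis-Hastings chain started far from its target gives a biased estimate of ∫ h dπ_θ over a short run, while the bias vanishes for a long run.

```python
import math
import random

def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings chain targeting exp(log_target)."""
    rng = random.Random(seed)
    x, lp = x0, log_target(x0)
    chain = []
    for _ in range(n_samples):
        y = x + step * rng.gauss(0.0, 1.0)        # propose
        ly = log_target(y)
        if math.log(rng.random()) < ly - lp:      # accept / reject
            x, lp = y, ly
        chain.append(x)
    return chain

# Stand-in for pi_theta: a N(mu, 1) "posterior"; with h(x) = x the truth is mu.
mu = 2.0
log_target = lambda x: -0.5 * (x - mu) ** 2

# Short chain started far from the target: dominated by the transient (biased).
chain_short = metropolis_hastings(log_target, x0=-10.0, n_samples=50)
# Long chain: forgets its initial condition, so the bias becomes negligible.
chain_long = metropolis_hastings(log_target, x0=-10.0, n_samples=20000)

est_short = sum(chain_short) / len(chain_short)
est_long = sum(chain_long) / len(chain_long)
```

The short-chain average is still contaminated by the transient from x0 = −10: exactly the bias E[h(X_{j,θ})] ≠ ∫ h dπ_θ discussed above.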
Example 2: Discrete graphical model (Markov random field)
N i.i.d. observations Y_i of an undirected graph with p nodes; each node takes values in a finite alphabet X. The Y_i take values in X^p with distribution
π_θ(y) := (1/Z_θ) exp( Σ_{k=1}^p θ_{kk} B(y_k, y_k) + Σ_{1≤j<k≤p} θ_{kj} B(y_k, y_j) ) = (1/Z_θ) exp(⟨θ, B̄(y)⟩),  y = (y_1, · · · , y_p),
where B is a symmetric function and θ is a symmetric p × p matrix. The normalizing constant (partition function) Z_θ cannot be computed: it is a sum over |X|^p terms.
Likelihood and its gradient in Markov random field
◮ Likelihood of the form (the scalar product between matrices is the Frobenius inner product)
(1/N) log L(θ) = ⟨θ, (1/N) Σ_{i=1}^N B̄(Y_i)⟩ − log Z_θ.
The likelihood is intractable.
◮ Gradient of the form
∇_θ (1/N) log L(θ) = (1/N) Σ_{i=1}^N B̄(Y_i) − ∫_{X^p} B̄(y) π_θ(y) μ(dy), with π_θ(y) := (1/Z_θ) exp(⟨θ, B̄(y)⟩).
The gradient of the (log-)likelihood is intractable.
Approximation of the gradient
∇_θ (1/N) log L(θ) = (1/N) Σ_{i=1}^N B̄(Y_i) − ∫_{X^p} B̄(y) π_θ(y) μ(dy).
The Gibbs measure π_θ(y) := (1/Z_θ) exp(⟨θ, B̄(y)⟩) is known up to the constant Z_θ. Exact sampling from π_θ is not feasible, but it can be approximated by MCMC samplers (Gibbs-type samplers such as Swendsen-Wang, ...). A biased approximation of the gradient is therefore available.
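To make this concrete on a graphical model (a toy sketch, not from the talk: the 4-node model, couplings and names are invented), take X = {−1, +1} and B(y_k, y_j) = y_k y_j with p = 4, small enough that the intractable expectation ∫ B̄(y) π_θ(y) μ(dy) can still be enumerated exactly and compared with a single-site Gibbs sampler's estimate.

```python
import itertools
import math
import random

def exact_moment(theta, p):
    """Brute-force E_pi[y_j * y_k] over all 2^p spin configurations
    (tractable only for tiny p; this is the sum over |X|^p terms)."""
    states = list(itertools.product([-1, 1], repeat=p))
    w = [math.exp(sum(theta[j][k] * s[j] * s[k]
                      for j in range(p) for k in range(j + 1, p)))
         for s in states]
    Z = sum(w)
    return [[sum(wi * s[j] * s[k] for wi, s in zip(w, states)) / Z
             for k in range(p)] for j in range(p)]

def gibbs_moment(theta, p, n_sweeps, rng):
    """Single-site Gibbs sampler: resample each spin from its conditional,
    then average y_j * y_k over the sweeps (a biased MCMC estimate)."""
    y = [1] * p
    acc = [[0.0] * p for _ in range(p)]
    for _ in range(n_sweeps):
        for i in range(p):
            field = sum(theta[i][k] * y[k] for k in range(p) if k != i)
            p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))
            y[i] = 1 if rng.random() < p_plus else -1
        for j in range(p):
            for k in range(p):
                acc[j][k] += y[j] * y[k]
    return [[a / n_sweeps for a in row] for row in acc]

p = 4
theta = [[0.0] * p for _ in range(p)]
theta[0][1] = theta[1][0] = 0.5    # one positive interaction
theta[2][3] = theta[3][2] = -0.3   # one negative interaction

exact = exact_moment(theta, p)
approx = gibbs_moment(theta, p, n_sweeps=50000, rng=random.Random(1))
```

For realistic p the enumeration is unavailable and only the Gibbs-type estimate survives: this is the biased Monte Carlo approximation of the gradient used above.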
To summarize,
Problem: argmin_{θ∈Θ} F(θ) with F(θ) = f(θ) + g(θ), θ ∈ Θ ⊆ R^d, where
the function g is a convex, non-smooth, nonnegative function (explicit);
the function f is not necessarily convex, is C¹, and ∇f is L-Lipschitz:
∃L > 0, ∀θ, θ′: ‖∇f(θ) − ∇f(θ′)‖ ≤ L ‖θ − θ′‖;
and f has an intractable gradient of the form ∇f(θ) = ∫ H_θ(x) π_θ(dx), which can be approximated by biased Monte Carlo techniques.
Outline
Penalized Maximum Likelihood inference in models with intractable likelihood
Numerical methods for Penalized ML in such models: Perturbed Proximal Gradient algorithms
  Algorithms
  Numerical illustration
Convergence analysis
Conclusion
The Proximal-Gradient algorithm (1/2)
argmin_{θ∈Θ} F(θ) with F(θ) = f(θ) + g(θ), f smooth, g non-smooth.
The Proximal Gradient algorithm: given a stepsize sequence {γ_n, n ≥ 0}, iterate
θ_{n+1} = Prox_{γ_{n+1},g}(θ_n − γ_{n+1} ∇f(θ_n)), where Prox_{γ,g}(τ) := argmin_{θ∈Θ} { g(θ) + (1/2γ) ‖θ − τ‖² }.
Proximal map: Moreau (1962). Proximal Gradient algorithm: Beck-Teboulle (2010); Combettes-Pesquet (2011); Parikh-Boyd (2013).
A generalization of the gradient algorithm to a composite objective function: an MM (Majorize-Minimize) algorithm built from a quadratic majorization of f (valid since ∇f is Lipschitz), which produces a sequence {θ_n, n ≥ 0} such that F(θ_{n+1}) ≤ F(θ_n).
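The monotone-decrease property can be checked numerically (a minimal sketch with synthetic data, not the speaker's code): f(θ) = ½‖Aθ − b‖², g = λ‖·‖₁, and the conservative stepsize γ = 1/‖A‖_F² ≤ 1/L, for which Prox_{γ,g} is the soft-thresholding operator.

```python
import random

def soft_threshold(v, t):
    """Prox of t * ||.||_1 : componentwise shrinkage toward 0."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]

def lasso_objective(A, b, lam, th):
    r = [sum(Ai[j] * th[j] for j in range(len(th))) - bi for Ai, bi in zip(A, b)]
    return 0.5 * sum(x * x for x in r) + lam * sum(abs(x) for x in th)

def grad_f(A, b, th):
    r = [sum(Ai[j] * th[j] for j in range(len(th))) - bi for Ai, bi in zip(A, b)]
    return [sum(A[i][j] * r[i] for i in range(len(b))) for j in range(len(th))]

rng = random.Random(0)
n, d = 20, 5
A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
b = [rng.gauss(0, 1) for _ in range(n)]
lam = 0.5

# The squared Frobenius norm of A upper-bounds L = ||A||_2^2, so
# gamma = 1/||A||_F^2 <= 1/L and the MM / monotone-decrease property applies.
gamma = 1.0 / sum(x * x for row in A for x in row)

theta = [0.0] * d
objs = [lasso_objective(A, b, lam, theta)]
for _ in range(200):
    gr = grad_f(A, b, theta)
    theta = soft_threshold([theta[j] - gamma * gr[j] for j in range(d)],
                           gamma * lam)
    objs.append(lasso_objective(A, b, lam, theta))

monotone = all(objs[k + 1] <= objs[k] + 1e-12 for k in range(len(objs) - 1))
```

Every iterate decreases F, as the MM interpretation predicts.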
The proximal-gradient algorithm (2/2)
argmin_{θ∈Θ} F(θ) with F(θ) = f(θ) + g(θ), f smooth, g non-smooth, and
θ_{n+1} = Prox_{γ_{n+1},g}(θ_n − γ_{n+1} ∇f(θ_n)).
About the Prox step:
when g = 0: Prox_{γ,g}(τ) = τ;
when g is the {0, +∞}-valued indicator function of a closed convex set: the algorithm is the projected gradient;
in some cases, Prox is explicit (e.g. elastic net penalty). Otherwise, a numerical approximation is required: θ_{n+1} = Prox_{γ_{n+1},g}(θ_n − γ_{n+1} ∇f(θ_n)) + ε_{n+1}; in this talk, ε_{n+1} = 0.
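A few explicit Prox maps (a sketch, not from the talk; the elastic-net closed form is the standard one: soft-threshold, then rescale), together with a numerical check against the defining minimization problem.

```python
def prox_l1(tau, gamma, lam):
    """Prox of gamma * lam * ||.||_1 : soft-thresholding."""
    return [max(abs(t) - gamma * lam, 0.0) * (1.0 if t >= 0 else -1.0)
            for t in tau]

def prox_box(tau, lo, hi):
    """Prox of the {0,+inf} indicator of [lo, hi]^d : Euclidean projection."""
    return [min(max(t, lo), hi) for t in tau]

def prox_elastic_net(tau, gamma, lam1, lam2):
    """Prox of gamma * (lam1 * ||.||_1 + (lam2/2) * ||.||^2)."""
    return [s / (1.0 + gamma * lam2) for s in prox_l1(tau, gamma, lam1)]

# Sanity check against the definition: the prox point should achieve the
# smallest value of g(x) + ||x - tau||^2 / (2 gamma) among nearby candidates.
def objective(x, tau, gamma, lam1, lam2):
    g = lam1 * sum(abs(v) for v in x) + 0.5 * lam2 * sum(v * v for v in x)
    return g + sum((a - b) ** 2 for a, b in zip(x, tau)) / (2.0 * gamma)

tau, gamma, lam1, lam2 = [2.0, -0.3, 0.0], 0.5, 1.0, 0.1
p = prox_elastic_net(tau, gamma, lam1, lam2)
best = objective(p, tau, gamma, lam1, lam2)
```

The first coordinate is shrunk to (2.0 − 0.5)/(1 + 0.05), the other two are thresholded to zero.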
The perturbed proximal-gradient algorithm
The Perturbed Proximal Gradient algorithm
Given a stepsize sequence {γ_n, n ≥ 0}, iterate θ_{n+1} = Prox_{γ_{n+1},g}(θ_n − γ_{n+1} H_{n+1}), where H_{n+1} is an approximation of ∇f(θ_n).
Monte Carlo-Proximal Gradient algorithm
In the case ∇f(θ) = ∫ H_θ(x) π_θ(x) μ(dx):
The MC-Proximal Gradient algorithm. Choose a stepsize sequence {γ_n, n ≥ 0} and a batch-size sequence {m_n, n ≥ 0}. Given the current value θ_n:
1. Sample a Markov chain {X_{j,n}, j ≥ 0} from an MCMC sampler with kernel P_{θ_n}(x, dx′) and unique invariant distribution π_{θ_n} dμ.
2. Set H_{n+1} = (1/m_{n+1}) Σ_{j=1}^{m_{n+1}} H_{θ_n}(X_{j,n}).
3. Update the value of the parameter: θ_{n+1} = Prox_{γ_{n+1},g}(θ_n − γ_{n+1} H_{n+1}).
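A minimal runnable sketch of these three steps (an invented toy problem, not the speaker's code): π_θ = N(c, 1) happens not to depend on θ, H_θ(x) = θ − x so that ∇f(θ) = θ − c, g(θ) = λ|θ|, and an AR(1) kernel plays the role of the MCMC sampler; the minimizer soft(c, λ) is then known in closed form.

```python
import math
import random

def soft(x, t):
    return max(abs(x) - t, 0.0) * (1.0 if x >= 0 else -1.0)

def mcmc_batch(m, x0, c, rho, rng):
    """AR(1) kernel with invariant law N(c, 1): a stand-in MCMC sampler."""
    xs, x = [], x0
    s = math.sqrt(1.0 - rho * rho)
    for _ in range(m):
        x = c + rho * (x - c) + s * rng.gauss(0.0, 1.0)
        xs.append(x)
    return xs

# Toy problem: f(theta) = 0.5 * (theta - c)^2, so grad f(theta) = theta - c
# = int (theta - x) pi(dx) with pi = N(c, 1);  g(theta) = lam * |theta|.
c, lam, rho = 2.0, 0.5, 0.5
gamma = 0.5                       # constant stepsize, <= 1/L with L = 1
rng = random.Random(3)
theta, x = 0.0, 0.0
for n in range(1, 301):
    batch = mcmc_batch(m=n, x0=x, c=c, rho=rho, rng=rng)   # increasing batch
    x = batch[-1]                                          # warm start chain
    H = sum(theta - xj for xj in batch) / len(batch)       # H_{n+1}
    theta = soft(theta - gamma * H, gamma * lam)           # prox update

theta_star = soft(c, lam)   # closed-form minimizer of f + g
```

With the increasing batch size the Monte Carlo noise in H_{n+1} is averaged out and the iterates settle near θ⋆.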
Stochastic Approximation-Proximal Gradient algorithm
In the case (e.g. latent variable models with exponential-family complete likelihood; log-linear Markov random field)
∇f(θ) = ∫ H_θ(x) π_θ(x) μ(dx) with H_θ(x) = Φ(θ) + Ψ(θ) S(x),
which implies ∇f(θ) = Φ(θ) + Ψ(θ) ∫ S(x) π_θ(x) μ(dx):
The SA-Proximal Gradient algorithm. Choose two stepsize sequences {γ_n, n ≥ 0}, {δ_n, n ≥ 0} and a batch-size sequence {m_n, n ≥ 0}. Given the current value θ_n:
1. Sample a Markov chain {X_{j,n}, j ≥ 0} from an MCMC sampler with kernel P_{θ_n}(x, dx′) and unique invariant distribution π_{θ_n} dμ.
2. Set H_{n+1} = Φ(θ_n) + Ψ(θ_n) S_{n+1} with S_{n+1} = (1 − δ_{n+1}) S_n + δ_{n+1} (1/m_{n+1}) Σ_{j=1}^{m_{n+1}} S(X_{j,n}).
3. Update the value of the parameter: θ_{n+1} = Prox_{γ_{n+1},g}(θ_n − γ_{n+1} H_{n+1}).
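A runnable sketch of the SA variant on an invented toy problem (not the speaker's code): π_θ = N(c, 1) independent of θ, Φ(θ) = θ, Ψ(θ) = −1 and sufficient statistic S(x) = x, so that ∇f(θ) = θ − c and the minimizer of f + g with g(θ) = λ|θ| is soft(c, λ). Here the batch size is m_n = 1 and the averaging in S_n does all the variance reduction.

```python
import math
import random

def soft(x, t):
    return max(abs(x) - t, 0.0) * (1.0 if x >= 0 else -1.0)

# Toy model: pi_theta = N(c, 1) (independent of theta here),
# H_theta(x) = Phi(theta) + Psi(theta) * S(x) with Phi(theta) = theta,
# Psi(theta) = -1 and S(x) = x, so grad f(theta) = theta - c.
c, lam, rho, gamma = 2.0, 0.5, 0.5, 0.5
rng = random.Random(7)
theta, S, x = 0.0, 0.0, 0.0
sigma = math.sqrt(1.0 - rho * rho)
for n in range(1, 2001):
    x = c + rho * (x - c) + sigma * rng.gauss(0.0, 1.0)  # one MCMC move, m_n = 1
    delta = 1.0 / n                                      # SA stepsize
    S = (1.0 - delta) * S + delta * x                    # averaged statistic
    H = theta - S                                        # H_{n+1} = Phi + Psi * S
    theta = soft(theta - gamma * H, gamma * lam)         # prox update

theta_star = soft(c, lam)   # closed-form minimizer of f + g
```

Despite single-sample batches, the Robbins-Monro averaging of S makes the effective gradient noise vanish, and θ_n stabilizes near θ⋆.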
(*) Penalized Expectation-Maximization (EM) vs Proximal-Gradient
EM (Dempster et al., 1977) is a Majorize-Minimize algorithm for the computation of the ML estimate in latent variable models. Penalized (Stochastic) EM algorithms:
τ_{n+1} = argmax_θ { ∫ log p_θ(x) π_{τ_n}(x) dμ(x) − g(θ) } = argmax_θ { A(θ) + ⟨B(θ), S_{n+1}⟩ − g(θ) }
with
S_{n+1} = ∫ S(x) π_{τ_n}(x) dμ(x): EM;
S_{n+1} = (1/m_{n+1}) Σ_{j=1}^{m_{n+1}} S(X_{j,n}): Monte Carlo EM, Wei and Tanner (1990);
S_{n+1} = (1 − δ_{n+1}) S_n + (δ_{n+1}/m_{n+1}) Σ_{j=1}^{m_{n+1}} S(X_{j,n}): Stochastic Approximation EM, Delyon et al. (1999).
Penalized (Stochastic) Generalized EM algorithms: instead of the exact argmax, choose τ_{n+1} s.t.
A(τ_{n+1}) + ⟨B(τ_{n+1}), S_{n+1}⟩ − g(τ_{n+1}) ≥ A(τ_n) + ⟨B(τ_n), S_{n+1}⟩ − g(τ_n),
with S_{n+1} as in the EM / Monte Carlo EM / Stochastic Approximation EM variants above.
MC-Prox Gdt and SA-Prox Gdt are Penalized Stochastic Generalized EM algorithms.
Numerical illustration (1/3): pharmacokinetic
For the implementation of the algorithm:
Penalty term: g(θ) = λ‖β‖₁. How to choose λ? → λ̂ = argmin_{λ∈{λ_1,··· ,λ_L}} E-BIC(β̂_λ).
Stepsize sequences: constant or vanishing stepsize sequence {γ_n, n ≥ 0}? (and δ_n for the SA-Prox Gdt algorithm)
Monte Carlo approximation: fixed or increasing batch size?
Numerical illustration (2/3): pharmacokinetic
[Figure: parameter estimates vs. iteration (500 iterations), for Proximal MCEM with decreasing and adaptive stepsizes, and Proximal SAEM with decreasing and adaptive stepsizes.]
Numerical illustration (3/3): pharmacokinetic
Figure: Regularization path of the covariate parameters for the clearance (left), absorption constant (middle) and volume of distribution (right) parameters. The black dashed line corresponds to the λ value selected by E-BIC; each colored curve corresponds to a covariate.
Outline
Penalized Maximum Likelihood inference in models with intractable likelihood
Numerical methods for Penalized ML in such models: Perturbed Proximal Gradient algorithms
Convergence analysis
Conclusion
The assumptions
argmin_{θ∈Θ} F(θ) with F(θ) = f(θ) + g(θ), where
the function g: R^d → [0, ∞] is convex, non-smooth, not identically equal to +∞, and lower semi-continuous;
the function f: R^d → R is a smooth convex function, i.e. f is continuously differentiable and there exists L > 0 such that ‖∇f(θ) − ∇f(θ′)‖ ≤ L ‖θ − θ′‖ for all θ, θ′ ∈ R^d;
Θ ⊆ R^d is the domain of g: Θ = {θ ∈ R^d : g(θ) < ∞};
the set argmin_Θ F is a non-empty subset of Θ.
Existing results in the literature
There exist results under (some of) the assumptions
E[H_{n+1} | F_n] = ∇f(θ_n), inf_n γ_n > 0, Σ_n ‖H_{n+1} − ∇f(θ_n)‖ < ∞,
i.e. results for unbiased sampling:
almost no results cover the biased sampling case, such as the MCMC one;
non-vanishing stepsize sequence {γ_n, n ≥ 0};
increasing batch size: when H_{n+1} is a Monte Carlo sum, i.e. H_{n+1} = (1/m_{n+1}) Σ_{j=1}^{m_{n+1}} H_{θ_n}(X_{j,n}), the assumptions imply that lim_n m_n = +∞ at some rate.
Combettes (2001) Elsevier Science; Combettes-Wajs (2005) Multiscale Modeling and Simulation; Combettes-Pesquet (2015, 2016) SIAM J. Optim., arXiv; Lin-Rosasco-Villa-Zhou (2015) arXiv; Rosasco-Villa-Vũ (2014, 2015) arXiv; Schmidt-Le Roux-Bach (2011) NIPS.
Convergence of the perturbed proximal gradient algorithm (1/3)
θ_{n+1} = Prox_{γ_{n+1},g}(θ_n − γ_{n+1} H_{n+1}) with H_{n+1} ≈ ∇f(θ_n). Set L := argmin_Θ(f + g) and η_{n+1} := H_{n+1} − ∇f(θ_n).
Theorem (Atchadé, F., Moulines (2015)). Assume: g convex, lower semi-continuous; f convex, C¹ and its gradient Lipschitz with constant L; L non-empty; Σ_n γ_n = +∞ and γ_n ∈ (0, 1/L]; and the convergence of the series
Σ_n γ²_{n+1} ‖η_{n+1}‖², Σ_n γ_{n+1} η_{n+1}, Σ_n γ_{n+1} ⟨T_n, η_{n+1}⟩, where T_n = Prox_{γ_{n+1},g}(θ_n − γ_{n+1} ∇f(θ_n)).
Then there exists θ⋆ ∈ L such that lim_n θ_n = θ⋆.
Convergence of the perturbed proximal gradient algorithm (2/3)
This convergence result:
holds for the convex case: f and g are convex;
is a deterministic result: it covers deterministic and random approximations H_{n+1} of ∇f(θ_n).
Among random approximations:
1. Applications in computational statistics: H_{n+1} = Ξ(X_{1,n}, · · · , X_{m_{n+1},n}; θ_n).
2. Applications in learning, the "finite sum" context:
(objective) argmin_θ { (1/N) Σ_{i=1}^N f_i(θ) + g(θ) },
(approximated gradient) H_{n+1} = (1/|I_{n+1}|) Σ_{i∈I_{n+1}} ∇f_i(θ_n), the indices i ∈ I_{n+1} being drawn at random (they play the role of the X_i's).
Proof / Convergence of the perturbed proximal gradient algorithm (3/3)
The proof relies on:
1. A deterministic Lyapunov inequality:
‖θ_{n+1} − θ⋆‖² ≤ ‖θ_n − θ⋆‖² − 2γ_{n+1} (F(θ_{n+1}) − min F)  [non-negative]
  − 2γ_{n+1} ⟨T_n − θ⋆, η_{n+1}⟩ + 2γ²_{n+1} ‖η_{n+1}‖²  [signed noise].
2. (An extension of) the Robbins-Siegmund lemma: let {v_n, n ≥ 0} and {χ_n, n ≥ 0} be non-negative sequences and {ξ_n, n ≥ 0} be such that Σ_n ξ_n exists. If for any n ≥ 0, v_{n+1} ≤ v_n − χ_{n+1} + ξ_{n+1}, then Σ_n χ_n < ∞ and lim_n v_n exists.
Note: deterministic lemma, signed noise.
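Since the Lyapunov inequality is deterministic, it can be checked at every step along a run (a toy 1-D sketch, not from the paper's experiments): f(θ) = ½(θ − c)² so L = 1, g(θ) = λ|θ|, and an arbitrary bounded noise η_{n+1} injected into the gradient.

```python
import random

def soft(x, t):
    return max(abs(x) - t, 0.0) * (1.0 if x >= 0 else -1.0)

# 1-D toy composite problem: f(theta) = 0.5 * (theta - c)^2 (so L = 1),
# g(theta) = lam * |theta|; minimizer theta_star = soft(c, lam).
c, lam, gamma = 3.0, 1.0, 0.8
F = lambda th: 0.5 * (th - c) ** 2 + lam * abs(th)
theta_star = soft(c, lam)
Fmin = F(theta_star)

rng = random.Random(11)
theta = -5.0
violations = 0
for _ in range(500):
    grad = theta - c
    eta = rng.uniform(-1.0, 1.0)                   # arbitrary gradient noise
    H = grad + eta
    T = soft(theta - gamma * grad, gamma * lam)    # unperturbed prox step T_n
    new = soft(theta - gamma * H, gamma * lam)     # perturbed prox step
    lhs = (new - theta_star) ** 2
    rhs = ((theta - theta_star) ** 2
           - 2.0 * gamma * (F(new) - Fmin)
           - 2.0 * gamma * (T - theta_star) * eta
           + 2.0 * gamma ** 2 * eta ** 2)
    if lhs > rhs + 1e-9:                           # tolerance for rounding
        violations += 1
    theta = new
```

The inequality holds at every iteration even though the injected noise here is not vanishing: the convergence statement fails only because the series conditions on η_n do.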
Convergence: when H_{n+1} is a Monte Carlo approximation (1/3)
In the case
∇f(θ_n) ≈ H_{n+1} = (1/m_{n+1}) Σ_{j=1}^{m_{n+1}} H_{θ_n}(X_{j,n}), X_{j+1,n} | past ∼ P_{θ_n}(X_{j,n}, ·), π_θ P_θ = π_θ,
let us check the condition "Σ_n γ_n η_n converges w.p. 1" under the condition Σ_n γ_n = +∞:
Σ_n γ_{n+1} η_{n+1} = Σ_n γ_{n+1} (H_{n+1} − ∇f(θ_n)) = Σ_n γ_{n+1} {H_{n+1} − E[H_{n+1} | F_n]} + Σ_n γ_{n+1} {E[H_{n+1} | F_n] − ∇f(θ_n)},
where the second sum is null for unbiased MC, and O(1/m_n) for biased MC.
The most technical case: the biased case with constant batch size m_n = m. Tools:
the solution Ĥ_θ to the Poisson equation Ĥ_θ − P_θ Ĥ_θ = H_θ − π_θ H_θ;
the decomposition H_{n+1} − ∇f(θ_n) = martingale increment + remainder;
the regularity in θ of θ ↦ Ĥ_θ and θ ↦ P_θ Ĥ_θ.
Convergence: when Hn+1 is a Monte-Carlo approximation (2/3)
Increasing batch size: lim_n m_n = +∞.
Conditions on the stepsizes and batch sizes: Σ_n γ_n = +∞, Σ_n γ_n²/m_n < ∞; and Σ_n γ_n/m_n < ∞ (biased case).
Conditions on the Markov kernels: there exist λ ∈ (0, 1), b < ∞, p ≥ 2 and a measurable function W: X → [1, +∞) such that
sup_{θ∈Θ} |H_θ|_W < ∞, sup_{θ∈Θ} P_θ W^p ≤ λ W^p + b.
In addition, for any ℓ ∈ (0, p], there exist C < ∞ and ρ ∈ (0, 1) such that for any x ∈ X,
sup_{θ∈Θ} ‖P_θ^n(x, ·) − π_θ‖_{W^ℓ} ≤ C ρ^n W^ℓ(x).  (1)
Condition on Θ: Θ is bounded.
Convergence: when Hn+1 is a Monte-Carlo approximation (3/3)
Fixed batch size: m_n = m.
Conditions on the stepsizes: Σ_n γ_n = +∞, Σ_n γ_n² < ∞, Σ_n |γ_{n+1} − γ_n| < ∞.
Conditions on the Markov chain: the same as in the increasing-batch-size case and, in addition, there exists a constant C such that for any θ, θ′ ∈ Θ,
|H_θ − H_{θ′}|_W + sup_x ‖P_θ(x, ·) − P_{θ′}(x, ·)‖_W / W(x) + ‖π_θ − π_{θ′}‖_W ≤ C ‖θ − θ′‖.
Condition on the Prox: sup_{γ∈(0,1/L]} sup_{θ∈Θ} γ⁻¹ ‖Prox_{γ,g}(θ) − θ‖ < ∞.
Condition on Θ: Θ is bounded.
Rates of convergence (1/3) : the problem
For non-negative weights a_k, find an upper bound of
Σ_{k=1}^n (a_k / Σ_{ℓ=1}^n a_ℓ) (F(θ_k) − min F).
It provides an upper bound for the cumulative regret (a_k = 1), and an upper bound for an averaging strategy when F is convex, since
F( Σ_{k=1}^n (a_k / Σ_{ℓ=1}^n a_ℓ) θ_k ) − min F ≤ Σ_{k=1}^n (a_k / Σ_{ℓ=1}^n a_ℓ) (F(θ_k) − min F).
Rates of convergence (2/3): a deterministic control
Theorem (Atchadé, F., Moulines (2016))
For any θ⋆ ∈ argmin_Θ F,
Σ_{k=1}^n (a_k / A_n) (F(θ_k) − min F) ≤ (a_0 / (2 γ_0 A_n)) ‖θ_0 − θ⋆‖² + (1 / (2 A_n)) Σ_{k=1}^n (a_k/γ_k − a_{k−1}/γ_{k−1}) ‖θ_{k−1} − θ⋆‖² + (1/A_n) Σ_{k=1}^n a_k γ_k ‖η_k‖² − (1/A_n) Σ_{k=1}^n a_k ⟨T_{k−1} − θ⋆, η_k⟩,
where A_n = Σ_{ℓ=1}^n a_ℓ, η_k = H_k − ∇f(θ_{k−1}), and T_{k−1} = Prox_{γ_k,g}(θ_{k−1} − γ_k ∇f(θ_{k−1})).
Rates (3/3): when Hn+1 is a Monte Carlo approximation, bound in Lq
‖F((1/n) Σ_{k=1}^n θ_k) − min F‖_{L_q} ≤ ‖(1/n) Σ_{k=1}^n F(θ_k) − min F‖_{L_q} ≤ u_n, with:
u_n = O(1/√n), with fixed batch size and (slowly) decaying stepsize: γ_n = γ⋆/n^a, a ∈ [1/2, 1], m_n = m⋆. With averaging: optimal rate, even with a slowly decaying stepsize γ_n ∼ 1/√n.
u_n = O(ln n / n), with increasing batch size and constant stepsize: γ_n = γ⋆, m_n ∝ n. This rate is obtained with O(n²) Monte Carlo samples!
Acceleration (1)
Let {t_n, n ≥ 0} be a positive sequence s.t. γ_{n+1} t_n (t_n − 1) ≤ γ_n t²_{n−1}.
Nesterov acceleration of the Proximal Gradient algorithm:
θ_{n+1} = Prox_{γ_{n+1},g}(τ_n − γ_{n+1} ∇f(τ_n)), τ_{n+1} = θ_{n+1} + ((t_n − 1)/t_{n+1}) (θ_{n+1} − θ_n).
Nesterov (2004); Tseng (2008); Beck-Teboulle (2009); Zhu-Orecchia (2015); Attouch-Peypouquet (2015); Bubeck-Lee-Singh (2015); Su-Boyd-Candès (2015).
(deterministic) Proximal gradient: F(θ_n) − min F = O(1/n). (deterministic) Accelerated proximal gradient: F(θ_n) − min F = O(1/n²).
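A side-by-side sketch of the two deterministic algorithms on a synthetic lasso instance (illustrative code, not from the talk); the accelerated iteration uses the classical FISTA sequence t_{n+1} = (1 + √(1 + 4t_n²))/2, which satisfies the condition above for constant γ.

```python
import random

def soft_vec(v, t):
    return [max(abs(x) - t, 0.0) * (1.0 if x >= 0 else -1.0) for x in v]

def F(A, b, lam, th):
    r = [sum(Ai[j] * th[j] for j in range(len(th))) - bi for Ai, bi in zip(A, b)]
    return 0.5 * sum(x * x for x in r) + lam * sum(abs(x) for x in th)

def grad(A, b, th):
    r = [sum(Ai[j] * th[j] for j in range(len(th))) - bi for Ai, bi in zip(A, b)]
    return [sum(A[i][j] * r[i] for i in range(len(b))) for j in range(len(th))]

def ista(A, b, lam, gamma, iters):
    th = [0.0] * len(A[0])
    for _ in range(iters):
        g = grad(A, b, th)
        th = soft_vec([th[j] - gamma * g[j] for j in range(len(th))], gamma * lam)
    return th

def fista(A, b, lam, gamma, iters):
    d = len(A[0])
    th, tau, t = [0.0] * d, [0.0] * d, 1.0
    for _ in range(iters):
        g = grad(A, b, tau)
        new = soft_vec([tau[j] - gamma * g[j] for j in range(d)], gamma * lam)
        t_new = 0.5 * (1.0 + (1.0 + 4.0 * t * t) ** 0.5)
        tau = [new[j] + ((t - 1.0) / t_new) * (new[j] - th[j]) for j in range(d)]
        th, t = new, t_new
    return th

rng = random.Random(0)
n, d, lam = 30, 8, 0.1
A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
b = [rng.gauss(0, 1) for _ in range(n)]
gamma = 1.0 / sum(x * x for row in A for x in row)   # 1/||A||_F^2 <= 1/L

Fstar = F(A, b, lam, fista(A, b, lam, gamma, 3000))  # long-run reference value
gap_ista = F(A, b, lam, ista(A, b, lam, gamma, 100)) - Fstar
gap_fista = F(A, b, lam, fista(A, b, lam, gamma, 100)) - Fstar
```

After the same number of iterations, the accelerated variant is markedly closer to the optimum, consistent with the O(1/n) vs O(1/n²) rates.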
Acceleration (2) Aujol-Dossal-F.-Moulines, work in progress
Perturbed Nesterov acceleration: some convergence results. Choose γ_n, m_n, t_n s.t.
γ_n ∈ (0, 1/L], lim_n γ_n t_n² = +∞, Σ_n γ_n t_n (1 + γ_n t_n) / m_n < ∞.
Then there exists θ⋆ ∈ argmin_Θ F s.t. lim_n θ_n = θ⋆. In addition,
F(θ_{n+1}) − min F = O(1 / (γ_{n+1} t_n²)).
Schmidt-Le Roux-Bach (2011); Dossal-Chambolle (2014); Aujol-Dossal (2015).
Table: control of F(θ_n) − min F.
γ_n | m_n | t_n | rate | Nbr of MC samples
γ | n³ | n | n^{−2} | n⁴
γ/√n | n² | n | n^{−3/2} | n³
Outline
Penalized Maximum Likelihood inference in models with intractable likelihood
Numerical methods for Penalized ML in such models: Perturbed Proximal Gradient algorithms
Convergence analysis
Conclusion
Conclusion (1/2): acceleration ?
With or without the acceleration: complexity O(1/√n). With acceleration: longer Markov chains, fewer iterations.