Algorithmes Gradient-Proximaux pour l’inférence statistique (PowerPoint PPT presentation)



SLIDE 1

Algorithmes Gradient-Proximaux pour l’inférence statistique

Gersende Fort

Institut de Mathématiques de Toulouse, CNRS, Toulouse, France

SLIDE 2

Based on joint works with Yves Atchadé (Univ. Michigan, USA), Jean-François Aujol (IMB, Bordeaux, France), Eric Moulines (École Polytechnique, France), Adeline Samson and Edouard Ollier (Univ. Grenoble Alpes, France), Charles Dossal (IMT), and Laurent Risser (IMT).
→ On Perturbed Proximal-Gradient Algorithms (JMLR, 2017)
→ Stochastic Proximal Gradient Algorithms for Penalized Mixed Models (Stat & Computing, 2018)
→ Acceleration for Perturbed Proximal Gradient Algorithms (work in progress)
→ Algorithmes Gradient Proximaux Stochastiques (GRETSI, 2017)

SLIDE 3

Outline

Motivations
  Pharmacokinetic
  General case: Latent variable models
  Votes in the US Congress
  General case: Discrete graphical models
  Conclusion, part I
Penalized ML through Perturbed Stochastic-Gradient algorithms
Asymptotic behavior of the algorithm
Numerical illustration

SLIDE 4

Motivation 1: Pharmacokinetic (1/2)

N patients. At time 0: dose D of a drug. For patient i, evolution of the concentration at times t_ij, 1 ≤ j ≤ J_i:

  • Observations {Y_ij, 1 ≤ j ≤ J_i}.

Model:
  Y_ij = F(t_ij, X_i) + ε_ij,   ε_ij i.i.d. ∼ N(0, σ²)
  X_i = Z_i β + d_i ∈ R^L,   d_i i.i.d. ∼ N_L(0, Ω) and independent of ε.
  Z_i known matrix s.t. each row of X_i has an intercept (fixed effect) and covariates.

SLIDE 5

Motivation 1: Pharmacokinetic (1/2), continued (same model as above).

Example of model F: monocompartmental, with digestive absorption:
  F(t, [ln Cl, ln V, ln A]) = C(Cl, V, A, D) ( exp(−(Cl/V) t) − exp(−A t) )

For each patient i,
  [ln Cl, ln V, ln A]_i = [β_{0,Cl}, β_{0,V}, β_{0,A}]
      + [ β_{1,Cl} Z^i_{1,Cl} + · · · + β_{K,Cl} Z^i_{K,Cl} ;
          idem, with covariates Z^i_{k,V} and coefficients β_{k,V} ;
          idem, with covariates Z^i_{k,A} and coefficients β_{k,A} ]
      + [d_{Cl,i}, d_{V,i}, d_{A,i}]

SLIDE 6

Motivation 1: Pharmacokinetic (1/2), continued (same model as above).

Statistical analysis:
  estimation of θ = (β, σ², Ω), under sparsity constraints on β;
  selection of the covariates based on β̂.
  → Penalized Maximum Likelihood

SLIDE 7

Motivation 1: Pharmacokinetic (2/2)

Model: as above, Y_ij = F(t_ij, X_i) + ε_ij with X_i = Z_i β + d_i.

Likelihoods:
  Complete likelihood: the distribution of {Y_ij, X_i; 1 ≤ i ≤ N, 1 ≤ j ≤ J_i} has an explicit expression:
    ∏_{i=1}^N ∏_{j=1}^{J_i} N(F(t_ij, X_i), σ²)[Y_ij] × ∏_{i=1}^N N_L(Z_i β, Ω)[X_i]
  Likelihood: the distribution of {Y_ij; 1 ≤ i ≤ N, 1 ≤ j ≤ J_i} is not explicit.

ML: here, the likelihood is not concave.

SLIDE 8

General case: Latent variable models

The log-likelihood of the observations Y is of the form (dependence upon Y is omitted):
  θ → log L(θ),   L(θ) = ∫_X p_θ(x) μ(dx),
where μ is a σ-finite positive measure on a set X, and x collects the missing/latent data.

Previous example: x ← (X_1, · · · , X_N), μ ← Lebesgue measure on R^{LN}.

In these models, the complete likelihood p_θ(x) can be evaluated explicitly, but the likelihood has no closed-form expression. The exact integral could be replaced by a Monte Carlo approximation, but this is known to be inefficient. Numerical methods based on the a posteriori distribution of the missing data are preferred (see e.g. Expectation-Maximization approaches).
→ What about the gradient of the (log-)likelihood?

SLIDE 9

Latent variable model: Gradient of the likelihood

log L(θ) = log ∫ p_θ(x) μ(dx)

Under regularity conditions, θ → log L(θ) is C¹ and
  ∇ log L(θ) = ∫ ∂_θ p_θ(x) μ(dx) / ∫ p_θ(z) μ(dz)
             = ∫ ∂_θ log p_θ(x) [ p_θ(x) / ∫ p_θ(z) μ(dz) ] μ(dx),
where the bracketed ratio is the a posteriori distribution of the latent variable.
SLIDE 10

Latent variable model: Gradient of the likelihood, continued.

The gradient of the log-likelihood
  ∇_θ {log L(θ)} = ∫ H_θ(x) π_θ(dx)
is an intractable expectation w.r.t. the conditional distribution π_θ of the latent variable given the observations Y (known up to a constant). For all (x, θ), H_θ(x) can be evaluated.

SLIDE 11

Motivation 2: relationships in a graph (1/2)

p nodes in a graph (e.g. p senators from the US Congress); each node takes values in {−1, 1} (e.g. each node codes for no/yes in a vote); N snapshots of the graph (e.g. N votes).

Model: i.i.d. observations Y^{(i)} ∈ {−1, 1}^p with distribution
  π_θ(y) ∝ exp( Σ_{i=1}^p θ_i y_i + Σ_{i=1}^{p−1} Σ_{j=i+1}^p θ_{ij} y_i y_j )

Statistical analysis:
  estimation of θ, under penalty (sparse graph, regularization; N ≪ p²/2);
  classification of the nodes.
  → Penalized Maximum Likelihood

SLIDE 12

Motivation 2: relationships in a graph (2/2)

Model: i.i.d. observations Y^{(n)} ∈ {−1, 1}^p with distribution
  π_θ(y) = (1/Z_θ) exp( Σ_{i=1}^p θ_i y_i + Σ_{i=1}^{p−1} Σ_{j=i+1}^p θ_{ij} y_i y_j )

Log-likelihood of Y def= (Y^{(1)}, · · · , Y^{(N)}):
  ℓ(θ) = Σ_{i=1}^p θ_i Σ_{n=1}^N Y^{(n)}_i + Σ_{i=1}^{p−1} Σ_{j=i+1}^p θ_{ij} Σ_{n=1}^N Y^{(n)}_i Y^{(n)}_j − N log Z_θ
       = ⟨θ, S(Y)⟩ − N log Z_θ = ⟨Ψ(θ), S(Y)⟩ + Φ(θ)

Likelihood: not explicit, since
  Z_θ def= Σ_{y∈{−1,1}^p} exp( Σ_{i=1}^p θ_i y_i + Σ_{i=1}^{p−1} Σ_{j=i+1}^p θ_{ij} y_i y_j )

ML: here, the likelihood is concave.
SLIDE 13

General Case: Discrete graphical models

N independent observations of an undirected graph with p nodes; each node takes values in a finite alphabet X. The N i.i.d. observations Y^{(i)} ∈ X^p have distribution
  y = (y_1, · · · , y_p) → π_θ(y) def= (1/Z_θ) exp( Σ_{k=1}^p θ_{kk} B(y_k, y_k) + Σ_{1≤j<k≤p} θ_{kj} B(y_k, y_j) ) = (1/Z_θ) exp ⟨θ, B̄(y)⟩,
where B̄ is a symmetric function and θ is a symmetric p × p matrix. The normalizing constant (partition function) Z_θ cannot be computed: a sum over |X|^p terms.

SLIDE 14

Markov random field: Likelihood

Likelihood of the form (scalar product between matrices = Frobenius inner product):
  (1/N) log L(θ) = ⟨θ, (1/N) Σ_{i=1}^N B̄(Y_i)⟩ − log Z_θ

The likelihood is intractable.

SLIDE 15

Markov random field: Gradient of the likelihood

Gradient of the form
  ∇_θ { (1/N) log L(θ) } = (1/N) Σ_{i=1}^N B̄(Y_i) − ∫_{X^p} B̄(y) π_θ(y) μ(dy),
with π_θ(y) def= (1/Z_θ) exp ⟨θ, B̄(y)⟩.

The gradient of the (log-)likelihood is intractable.

SLIDE 16

Markov random field: Gradient of the likelihood, continued.

The gradient of the log-likelihood
  ∇_θ {log L(θ)} = ∫ H_θ(x) π_θ(dx)
is an intractable expectation w.r.t. the Gibbs distribution (known up to a constant). For all (x, θ), H_θ(x) can be evaluated.

SLIDE 17

Conclusion, part I

Minimization problem: argmin_{θ∈Θ} F(θ) with F(θ) = f(θ) + g(θ), θ ∈ Θ ⊆ R^d, where the function g is a non-smooth, nonnegative, explicit function, possibly convex.

SLIDE 18

Conclusion, part I, continued: the function f is
  · not necessarily convex,
  · C¹ with ∇f L-Lipschitz: ∃L > 0, ∀θ, θ′: ‖∇f(θ) − ∇f(θ′)‖ ≤ L‖θ − θ′‖,
  · with an intractable gradient of the form ∇f(θ) = ∫ H_θ(x) π_θ(dx).
SLIDE 19

Approximation of the gradient

∇_θ f(θ) = −∇_θ {log L(θ)} = ∫_X H_θ(x) π_θ(dx)

1. Quadrature techniques: poor behavior w.r.t. the dimension of X.
2. Use i.i.d. samples from π_θ (or an auxiliary distribution) to define a Monte Carlo approximation: not possible or not efficient in general.
3. Use m samples from a non-stationary Markov chain {X_{j,θ}, j ≥ 0} with unique stationary distribution π_θ, and define a Monte Carlo approximation. MCMC samplers provide such a chain.

SLIDE 20

Stochastic approximation of the gradient

A biased approximation, since for MCMC samples X_{j,θ},
  E[h(X_{j,θ})] ≠ ∫ h(x) π_θ(dx).
If the Markov chain is ergodic, the bias vanishes when j → ∞.

SLIDE 21

Conclusion, part I, continued: the intractable gradient ∇f(θ) = ∫ H_θ(x) π_θ(dx) can be approximated by biased Monte Carlo techniques.

SLIDE 22

Outline

Motivations
Penalized ML through Perturbed Stochastic-Gradient algorithms
  Algorithms
Asymptotic behavior of the algorithm
Numerical illustration

SLIDE 23

The Proximal-Gradient algorithm (1/3)

argmin_{θ∈Θ} F(θ) with F(θ) = f(θ) [smooth] + g(θ) [non-smooth, convex]

The Proximal-Gradient algorithm: given a stepsize sequence {γ_n, n ≥ 0}, iterate
  θ_{n+1} = Prox_{γ_{n+1},g}(θ_n − γ_{n+1} ∇f(θ_n)),
where
  Prox_{γ,g}(τ) def= argmin_{θ∈Θ} { g(θ) + (1/2γ) ‖θ − τ‖² }.

Proximal map: Moreau (1962).
Proximal-Gradient algorithm: Beck-Teboulle (2010); Combettes-Pesquet (2011); Parikh-Boyd (2013).
Forward-Backward algorithm: Chen-Rockafellar (1997); Tseng (1998).
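As a concrete instance (an illustrative sketch, not from the slides), the code below runs the exact proximal-gradient iteration on a quadratic f plus an ℓ1 penalty g(θ) = λ‖θ‖₁, whose prox is the soft-thresholding operator; all problem data (A, b, λ) are made up for the example.

```python
import numpy as np

def prox_l1(tau, step, lam):
    """Prox of g = lam * ||.||_1 with stepsize `step` (soft-thresholding)."""
    return np.sign(tau) * np.maximum(np.abs(tau) - step * lam, 0.0)

def proximal_gradient(grad_f, theta0, step, lam, n_iter=500):
    """theta_{n+1} = Prox_{step, g}(theta_n - step * grad_f(theta_n))."""
    theta = theta0.copy()
    for _ in range(n_iter):
        theta = prox_l1(theta - step * grad_f(theta), step, lam)
    return theta

# Toy problem: f(theta) = 0.5 * ||A theta - b||^2, g = lam * ||theta||_1.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
L = np.linalg.norm(A.T @ A, 2)          # Lipschitz constant of grad f
theta_hat = proximal_gradient(lambda t: A.T @ (A @ t - b),
                              np.zeros(5), step=1.0 / L, lam=2.0)
```

The limit point can be checked through the fixed-point characterization θ̂ = Prox_{γ,g}(θ̂ − γ∇f(θ̂)).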

SLIDE 24

The Proximal-Gradient algorithm (2/3)

argmin_{θ∈Θ} F(θ) with F(θ) = f(θ) [smooth] + g(θ) [non-smooth, convex]

The algorithm
  θ_{n+1} = argmin_{θ∈Θ} { g(θ) + (1/2γ_{n+1}) ‖θ − θ_n + γ_{n+1} ∇f(θ_n)‖² }
can be seen as:

1. A Majorize-Minimize algorithm, from a quadratic majorization of f (since ∇f is Lipschitz), which produces a sequence {θ_n, n ≥ 0} such that F(θ_{n+1}) ≤ F(θ_n): for all γ < 1/L,
   F(θ) ≤ f(θ_n) + ⟨∇f(θ_n), θ − θ_n⟩ + (1/2γ) ‖θ − θ_n‖² + g(θ).
   [Figure: F(θ) and its quadratic majorant.]

SLIDE 25

The Proximal-Gradient algorithm (2/3), continued:

2. A generalization of the gradient algorithm to a composite objective function.

3. An explicit-implicit gradient algorithm:
   θ_{n+1/2} = θ_n − γ_{n+1} ∇f(θ_n),
   θ_{n+1} s.t. θ_{n+1} ∈ θ_{n+1/2} − γ_{n+1} ∂g(θ_{n+1}).

SLIDE 26

The Proximal-Gradient algorithm (3/3)

argmin_{θ∈Θ} F(θ) with F(θ) = f(θ) [smooth] + g(θ) [non-smooth]

About the Prox step in the algorithm θ_{n+1} = argmin_{θ∈Θ} { g(θ) + (1/2γ_{n+1}) ‖θ − θ_n + γ_{n+1} ∇f(θ_n)‖² }:

when g = 0: Prox(τ) = τ;
when g is the {0, +∞}-valued indicator function of a closed convex set: the algorithm is the projected gradient;
in some cases, Prox is explicit (e.g. elastic-net penalty); otherwise, numerical approximation:
  θ_{n+1} = Prox_{γ_{n+1},g}(θ_n − γ_{n+1} ∇f(θ_n)) + ε_{n+1}.
In this talk, ε_{n+1} = 0.
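For the elastic-net penalty mentioned above, g(θ) = λ(α‖θ‖₁ + ((1−α)/2)‖θ‖²), the prox is indeed explicit: componentwise, a soft-thresholding followed by a shrinkage. This closed form is standard but not spelled out in the slides; the sketch below states the assumed formula and checks it numerically against the defining argmin.

```python
import numpy as np

def prox_enet(tau, gamma, lam, alpha):
    """Closed-form prox of g(t) = lam*(alpha*||t||_1 + 0.5*(1-alpha)*||t||^2):
    soft-threshold at gamma*lam*alpha, then shrink by 1/(1 + gamma*lam*(1-alpha))."""
    soft = np.sign(tau) * np.maximum(np.abs(tau) - gamma * lam * alpha, 0.0)
    return soft / (1.0 + gamma * lam * (1.0 - alpha))
```

The formula follows from the first-order optimality condition of the scalar problem min_θ g(θ) + (θ − τ)²/(2γ).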

SLIDE 27

The perturbed proximal-gradient algorithm

The Perturbed Proximal Gradient algorithm: given a stepsize sequence {γ_n, n ≥ 0}, iterate
  θ_{n+1} = Prox_{γ_{n+1},g}(θ_n − γ_{n+1} H_{n+1}),
where H_{n+1} is an approximation of ∇f(θ_n).

SLIDE 28

Monte Carlo-Proximal Gradient algorithm

In the case ∇f(θ) = ∫ H_θ(x) π_θ(dx):

The MC-Proximal Gradient algorithm. Choose a stepsize sequence {γ_n, n ≥ 0} and a batch-size sequence {m_n, n ≥ 0}. Given the current value θ_n:
1. Sample a Markov chain {X_{j,n}, j ≥ 0} from an MCMC sampler with kernel P_{θ_n}(x, dx′) and unique invariant distribution π_{θ_n}.
2. Set H_{n+1} = (1/m_{n+1}) Σ_{j=1}^{m_{n+1}} H_{θ_n}(X_{j,n}).
3. Update the value of the parameter: θ_{n+1} = Prox_{γ_{n+1},g}(θ_n − γ_{n+1} H_{n+1}).
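A minimal runnable sketch of these three steps, under toy assumptions that are not in the slides: π_θ(x) ∝ exp(θx − x²/2) (so π_θ = N(θ, 1)), H_θ(x) = x − s for an observed statistic s, g(θ) = λ|θ|, and a random-walk Metropolis kernel targeting π_θ. For this toy model the penalized minimizer is the soft-thresholding of s, which gives a check.

```python
import numpy as np

rng = np.random.default_rng(1)

def metropolis_chain(theta, x0, m):
    """Random-walk Metropolis targeting pi_theta = N(theta, 1); returns m states."""
    x, out = x0, np.empty(m)
    for j in range(m):
        prop = x + rng.normal(scale=1.0)
        # log pi_theta(x) = theta*x - x**2/2, up to a constant
        if np.log(rng.uniform()) < (theta * prop - prop**2 / 2) - (theta * x - x**2 / 2):
            x = prop
        out[j] = x
    return out

def mcpg(s, lam, gamma=0.1, n_iter=400, m0=50):
    """MCPG for argmin f(theta) + lam*|theta|, with grad f(theta) = E_{pi_theta}[X] - s."""
    theta, x = 0.0, 0.0
    for n in range(n_iter):
        m = m0 + 5 * n                      # increasing batch size
        xs = metropolis_chain(theta, x, m)  # step 1: MCMC draws at theta_n
        x = xs[-1]                          # warm-start the next chain
        H = xs.mean() - s                   # step 2: Monte Carlo gradient estimate
        z = theta - gamma * H               # step 3: prox of lam*|.|
        theta = np.sign(z) * max(abs(z) - gamma * lam, 0.0)
    return theta
```

With s = 2 and λ = 0.5, the exact minimizer is s − λ = 1.5, and the iterates settle near it.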

SLIDE 29

MCPG or SAPG

If, in addition, H_θ(x) = Φ(θ) + ⟨Ψ(θ), S(x)⟩, which implies
  ∇f(θ) = Φ(θ) + ⟨Ψ(θ), ∫ S(x) π_θ(x) μ(dx)⟩,
then two strategies for H_{n+1} (analogy with MCEM / SAEM):

1. Monte Carlo-Proximal Gradient algorithms:
   ∇f(θ_n) ≈ H_{n+1} def= Φ(θ_n) + ⟨Ψ(θ_n), (1/m_{n+1}) Σ_{j=1}^{m_{n+1}} S(X_{j,n})⟩

2. Stochastic Approximation-Proximal Gradient algorithms:
   ∇f(θ_n) ≈ H_{n+1} def= Φ(θ_n) + ⟨Ψ(θ_n), S_{n+1}⟩,
   where S_{n+1} = (1 − δ_{n+1}) S_n + δ_{n+1} (1/m_{n+1}) Σ_{j=1}^{m_{n+1}} S(X_{j,n}).
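The SAPG averaging can be sketched on the same kind of toy model (again an assumption, not from the slides): S(x) = x, Ψ(θ) = 1, Φ(θ) = −s, so ∇f(θ) = E_{π_θ}[X] − s with π_θ = N(θ, 1). For brevity, exact sampling stands in for the MCMC kernel; a small fixed batch m_n = m is compensated by the decaying averaging weights δ_n.

```python
import numpy as np

rng = np.random.default_rng(2)

def sapg(s, lam, gamma0=0.5, n_iter=2000, m=5):
    """SAPG for argmin f(theta) + lam*|theta|, grad f(theta) = E_{N(theta,1)}[X] - s.

    The sufficient statistic is averaged across iterations:
      S_{n+1} = (1 - delta_{n+1}) S_n + delta_{n+1} * batch_mean.
    """
    theta, S = 0.0, 0.0
    for n in range(1, n_iter + 1):
        delta = 1.0 / n**0.6                  # averaging weights, non-summable
        gamma = gamma0 / n**0.6               # slowly decaying stepsize
        batch = rng.normal(loc=theta, size=m) # stands in for m MCMC draws
        S = (1 - delta) * S + delta * batch.mean()
        H = -s + S                            # Phi(theta) + <Psi(theta), S_{n+1}>
        z = theta - gamma * H
        theta = np.sign(z) * max(abs(z) - gamma * lam, 0.0)  # prox of lam*|.|
    return theta
```

The fixed point is the same s − λ as for MCPG, but the cost per iteration stays at m = 5 draws.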

SLIDE 30

(*) Penalized Expectation-Maximization (EM) vs Proximal-Gradient

EM (Dempster et al. (1977)) is a Majorize-Minimize algorithm for the computation of the ML estimate in latent variable models. (Stochastic) EM algorithms:
  τ_{n+1} = argmax_θ ∫ log p_θ(x) π_{τ_n}(dx) = argmax_θ {Φ(θ) + ⟨Ψ(θ), S_{n+1}⟩}
with
  S_{n+1} = ∫ S(x) π_{τ_n}(dx)   (EM)
  S_{n+1} = (1/m_{n+1}) Σ_{j=1}^{m_{n+1}} S(X_{j,n})   (Monte Carlo EM, Wei and Tanner (1990))
  S_{n+1} = (1 − δ_{n+1}) S_n + (δ_{n+1}/m_{n+1}) Σ_{j=1}^{m_{n+1}} S(X_{j,n})   (Stoch. Approx. EM, Delyon et al. (1999))

SLIDE 31

(*) Penalized EM vs Proximal-Gradient, continued. Generalized (Stochastic) EM algorithms replace the argmax by any τ_{n+1} s.t.
  Φ(τ_{n+1}) + ⟨Ψ(τ_{n+1}), S_{n+1}⟩ ≥ Φ(τ_n) + ⟨Ψ(τ_n), S_{n+1}⟩,
with the same definitions of S_{n+1}.

SLIDE 32

(*) Penalized EM vs Proximal-Gradient, continued. Generalized Penalized (Stochastic) EM algorithms:
  τ_{n+1} = argmax_θ { ∫ log p_θ(x) π_{τ_n}(dx) − g(θ) } = argmax_θ {Φ(θ) + ⟨Ψ(θ), S_{n+1}⟩ − g(θ)},
or any τ_{n+1} s.t.
  Φ(τ_{n+1}) + ⟨Ψ(τ_{n+1}), S_{n+1}⟩ − g(τ_{n+1}) ≥ Φ(τ_n) + ⟨Ψ(τ_n), S_{n+1}⟩ − g(τ_n),
with the same definitions of S_{n+1}.

SLIDE 33

(*) Penalized Expectation-Maximization (EM) vs Proximal-Gradient

For the computation of argmax_θ (ℓ(θ) − g(θ)), with
  ℓ(θ) def= ∫ exp(Φ(θ) + ⟨Ψ(θ), S(x)⟩) μ(dx):

Generalized Penalized Stochastic EM defines a sequence (τ_n)_n s.t.
  Φ(τ_{n+1}) + ⟨Ψ(τ_{n+1}), S_{n+1}⟩ − g(τ_{n+1}) ≥ Φ(τ_n) + ⟨Ψ(τ_n), S_{n+1}⟩ − g(τ_n)
for different definitions of S_{n+1}.

Monte Carlo-Proximal Gradient and Stochastic Approximation-Proximal Gradient define a sequence (θ_n)_n s.t.
  Φ(θ_{n+1}) + ⟨Ψ(θ_{n+1}), S_{n+1}⟩ − g(θ_{n+1}) ≥ Φ(θ_n) + ⟨Ψ(θ_n), S_{n+1}⟩ − g(θ_n)
for different definitions of S_{n+1}.

In all cases, S_{n+1} is a Monte Carlo-based approximation of
  ∫ S(x) exp(Φ(θ_n) + ⟨S(x), Ψ(θ_n)⟩) / Z_{θ_n} μ(dx).

SLIDE 34

Outline

Motivations
Penalized ML through Perturbed Stochastic-Gradient algorithms
Asymptotic behavior of the algorithm
  Convergence analysis
  Convergence rates
  Nesterov Acceleration
Numerical illustration

SLIDE 35

The assumptions

argmin_{θ∈Θ} F(θ) with F(θ) = f(θ) + g(θ), where:
  the function g: R^d → [0, ∞] is convex, non-smooth, not identically equal to +∞, and lower semi-continuous;
  the function f: R^d → R is a smooth convex function, i.e. f is continuously differentiable and there exists L > 0 such that ‖∇f(θ) − ∇f(θ′)‖ ≤ L‖θ − θ′‖ for all θ, θ′ ∈ R^d;
  Θ ⊆ R^d is the domain of g: Θ = {θ ∈ R^d : g(θ) < ∞};
  the set argmin_Θ F is a non-empty subset of Θ.

SLIDE 36

Convergence: Existing results in the literature

There exist results under (some of) the assumptions:
  i.i.d. Monte Carlo approximation, inf_n γ_n > 0, Σ_n ‖H_{n+1} − ∇f(θ_n)‖ < ∞, i.e. results for unbiased sampling; almost no conditions for biased sampling, such as the MCMC one;
  non-vanishing stepsize sequence {γ_n, n ≥ 0};
  increasing batch size: when H_{n+1} is a Monte Carlo sum, i.e. H_{n+1} = (1/m_{n+1}) Σ_{j=1}^{m_{n+1}} H_{θ_n}(X_{j,n}), the assumptions imply that lim_n m_n = +∞ at some rate.

Combettes (2001), Elsevier Science; Combettes-Wajs (2005), Multiscale Modeling and Simulation; Combettes-Pesquet (2015, 2016), SIAM J. Optim., arXiv; Lin-Rosasco-Villa-Zhou (2015), arXiv; Rosasco-Villa-Vu (2014, 2015), arXiv; Schmidt-Le Roux-Bach (2011), NIPS.

SLIDE 37

Convergence of the perturbed proximal gradient algorithm (1/3)

θ_{n+1} = Prox_{γ_{n+1},g}(θ_n − γ_{n+1} H_{n+1}) with H_{n+1} ≈ ∇f(θ_n).
Set L = argmin_Θ (f + g), η_{n+1} = H_{n+1} − ∇f(θ_n).

Theorem (Atchadé, F., Moulines (2015))
Assume: g convex, lower semi-continuous; f convex, C¹, with L-Lipschitz gradient; L non-empty; Σ_n γ_n = +∞ and γ_n ∈ (0, 1/L]; and convergence of the series
  Σ_n γ²_{n+1} ‖η_{n+1}‖²,   Σ_n γ_{n+1} η_{n+1},   Σ_n γ_{n+1} ⟨T_n, η_{n+1}⟩,
where T_n = Prox_{γ_{n+1},g}(θ_n − γ_{n+1} ∇f(θ_n)). Then there exists θ⋆ ∈ L such that lim_n θ_n = θ⋆.

SLIDE 38

Convergence of the perturbed proximal gradient algorithm (2/3)

This convergence result is for the convex case: f and g are convex.
slide-39
SLIDE 39

Algorithmes Gradient-Proximaux pour l’inf´ erence statistique Asymptotic behavior of the algorithm Convergence analysis

Convergence of the perturbed proximal gradient algorithm (2/3)

This convergence result for the convex case: f and g are convex. is a deterministic result. Covered: deterministic and random approximations Hn+1 of ∇f(θn).

slide-40
SLIDE 40

Algorithmes Gradient-Proximaux pour l’inf´ erence statistique Asymptotic behavior of the algorithm Convergence analysis

Convergence of the perturbed proximal gradient algorithm (2/3)

This convergence result for the convex case: f and g are convex. is a deterministic result. Covered: deterministic and random approximations Hn+1 of ∇f(θn). Among random approximations:

1

Applications in Computational Statistics Hn+1 = Ξ

  • X1,n, · · · , Xmn+1,n; θn
SLIDE 41

Among random approximations, continued:

2. Applications in learning, the "finite sum" context:
   (objective)   argmin_θ { (1/N) Σ_{i=1}^N f_i(θ) + g(θ) }
   (approximate gradient)   H_{n+1} = (1/|I_{n+1}|) Σ_{i∈I_{n+1}} ∇f_i(θ_n), with a (stochastic) set I_{n+1}.

SLIDE 42

Proof / Convergence of the perturbed proximal gradient algorithm (3/3)

The proof relies on:

1. A deterministic Lyapunov inequality:
   ‖θ_{n+1} − θ⋆‖² ≤ ‖θ_n − θ⋆‖² − 2γ_{n+1} (F(θ_{n+1}) − min F) [non-negative]
                     − 2γ_{n+1} ⟨T_n − θ⋆, η_{n+1}⟩ + 2γ²_{n+1} ‖η_{n+1}‖² [signed noise]

2. (An extension of) the Robbins-Siegmund lemma: let {v_n, n ≥ 0} and {χ_n, n ≥ 0} be non-negative sequences and {ξ_n, n ≥ 0} be such that Σ_n ξ_n exists. If, for any n ≥ 0,
   v_{n+1} ≤ v_n − χ_{n+1} + ξ_{n+1},
   then Σ_n χ_n < ∞ and lim_n v_n exists.

SLIDE 43

Proof / Convergence of the perturbed proximal gradient algorithm (3/3), continued. Note: the Robbins-Siegmund lemma is deterministic, and the noise term is signed.

SLIDE 44

Convergence: when H_{n+1} is a Monte Carlo approximation (1/3)

In the case
  ∇f(θ_n) ≈ H_{n+1} = (1/m_{n+1}) Σ_{j=1}^{m_{n+1}} H_{θ_n}(X_{j,n}),   X_{j+1,n} | past ∼ P_{θ_n}(X_{j,n}, ·),   π_θ P_θ = π_θ.

SLIDE 45

Convergence: when H_{n+1} is a Monte Carlo approximation (1/3), continued. Let us check the condition "Σ_n γ_n η_n < ∞ w.p. 1":
  Σ_n γ_{n+1} η_{n+1} = Σ_n γ_{n+1} (H_{n+1} − ∇f(θ_n))
    = Σ_n γ_{n+1} {H_{n+1} − E[H_{n+1} | F_n]} + Σ_n γ_{n+1} {E[H_{n+1} | F_n] − ∇f(θ_n)},
where the second sum is null for unbiased MC and O_{L^p}(1/m_n) for biased MC.
slide-46
SLIDE 46

Algorithmes Gradient-Proximaux pour l’inf´ erence statistique Asymptotic behavior of the algorithm Convergence analysis

Convergence: when Hn+1 is a Monte-Carlo approximation (1/3)

In the case ∇f(θn) ≈ Hn+1 = 1 mn+1

mn+1

  • j=1

Hθn(Xj,n), Xj+1,n|past ∼ Pθn(Xj,n, ·) πθPθ = πθ; let us check the condition “

n γnηn < ∞ w.p.1”:

  • n

γn+1ηn+1 =

  • n

γn+1 (Hn+1 − ∇f(θn)) =

  • n

γn+1 {Hn+1 − E [Hn+1|Fn]} +

  • n

γn+1 {E [Hn+1|Fn] − ∇f(θn)}

  • unbiased MC: null

biased MC: OLp (1/mn)

The most technical case: the biased case with constant batch size mn = m

Solution Hθ to the Poisson equation: Hθ − πθHθ = Hθ − Pθ Hθ Hn+1 − ∇f(θn) = martingale increment + remainder Regularity in θ of t → Ht.

SLIDE 47

Convergence: when H_{n+1} is a Monte Carlo approximation (2/3)

Increasing batch size: lim_n m_n = +∞.

Conditions on the step sizes and batch sizes:
  Σ_n γ_n = +∞,   Σ_n γ²_n / m_n < ∞;   Σ_n γ_n / m_n < ∞ (biased case).

Conditions on the Markov kernels: there exist λ ∈ (0, 1), b < ∞, p ≥ 2 and a measurable function W: X → [1, +∞) such that
  sup_{θ∈Θ} |H_θ|_W < ∞,   sup_{θ∈Θ} P_θ W^p ≤ λ W^p + b.
In addition, for any ℓ ∈ (0, p], there exist C < ∞ and ρ ∈ (0, 1) such that, for any x ∈ X,
  sup_{θ∈Θ} ‖P^n_θ(x, ·) − π_θ‖_{W^ℓ} ≤ C ρ^n W^ℓ(x).   (1)

Condition on Θ: Θ is bounded.
slide-48
SLIDE 48

Algorithmes Gradient-Proximaux pour l’inf´ erence statistique Asymptotic behavior of the algorithm Convergence analysis

Convergence: when Hn+1 is a Monte-Carlo approximation (3/3)

Fixed batch size: mn = m

Condition on the step size:

  • n

γn = +∞

  • n

γ2

n < ∞

  • n

|γn+1 − γn| < ∞ Condition on the Markov chain: same as in the case ”increasing batch size” and there exists a

constant C such that for any θ, θ′ ∈ Θ |Hθ − Hθ′ |W + sup

x

Pθ(x, ·) − Pθ′ (x, ·)W W (x) ≤ C θ − θ′.

Condition on the Prox: sup

γ∈(0,1/L]

sup

θ∈Θ

γ−1 Proxγ,g(θ) − θ < ∞. Condition on Θ: Θ is bounded.

SLIDE 49

Rates of convergence (1/3): the problem

For non-negative weights a_k, find an upper bound of
  Σ_{k=1}^n (a_k / Σ_{ℓ=1}^n a_ℓ) F(θ_k) − min F.

It provides:
  an upper bound for the cumulative regret (a_k = 1);
  an upper bound for an averaging strategy when F is convex, since
  F( Σ_{k=1}^n (a_k / Σ_{ℓ=1}^n a_ℓ) θ_k ) − min F ≤ Σ_{k=1}^n (a_k / Σ_{ℓ=1}^n a_ℓ) F(θ_k) − min F.

SLIDE 50

Rates of convergence (2/3): a deterministic control

Theorem (Atchadé, F., Moulines (2016))
For any θ⋆ ∈ argmin_Θ F,
  Σ_{k=1}^n (a_k / A_n) F(θ_k) − min F ≤ (a_0 / (2 γ_0 A_n)) ‖θ_0 − θ⋆‖²
    + (1 / (2 A_n)) Σ_{k=1}^n (a_k/γ_k − a_{k−1}/γ_{k−1}) ‖θ_{k−1} − θ⋆‖²
    + (1 / A_n) Σ_{k=1}^n a_k γ_k ‖η_k‖² − (1 / A_n) Σ_{k=1}^n a_k ⟨T_{k−1} − θ⋆, η_k⟩,
where A_n = Σ_{ℓ=1}^n a_ℓ, η_k = H_k − ∇f(θ_{k−1}), T_k = Prox_{γ_k,g}(θ_{k−1} − γ_k ∇f(θ_{k−1})).

SLIDE 51

Rates (3/3): when H_{n+1} is a Monte Carlo approximation, bound in L^q

  ‖F((1/n) Σ_{k=1}^n θ_k) − min F‖_{L^q} ≤ ‖(1/n) Σ_{k=1}^n F(θ_k) − min F‖_{L^q} ≤ u_n

  u_n = O(1/√n) with fixed batch size and (slowly) decaying stepsize: γ_n = γ⋆/n^a, a ∈ [1/2, 1], m_n = m⋆. With averaging: optimal rate, even with slowly decaying stepsize γ_n ∼ 1/√n.

  u_n = O(ln n / n) with increasing batch size and constant stepsize: γ_n = γ⋆, m_n ∝ n. Rate attained after O(n²) Monte Carlo samples!

SLIDE 52

Nesterov Acceleration (1/3): solving a convex programming problem with convergence rate O(1/n²) (1983)

First order: Ẋ_t + ∇φ(X_t) = 0. Time discretization: (x_{n+1} − x_n)/γ_{n+1} + ∇φ(x_n) = 0.

Second order, inertial gradient-like systems: Ẍ_t + (α/t) Ẋ_t + ∇φ(X_t) = 0, with φ(X_t) − min φ = O(1/t²) (Attouch et al. (2015)); seen as the continuous-time version of FISTA, which satisfies φ(x_n) − min φ = O(1/n²) (Beck and Teboulle (2009)).
slide-53
SLIDE 53

Algorithmes Gradient-Proximaux pour l’inf´ erence statistique Asymptotic behavior of the algorithm Nesterov Acceleration

Acceleration (2/3)

Let {tn, n ≥ 0} be a positive sequence s.t. γn+1tn(tn − 1) ≤ γnt2

n−1

Nesterov acceleration of the Proximal Gradient algorithm

θn+1 = Proxγn+1,g (τn − γn+1∇f(τn)) τn+1 = θn+1 + tn − 1 tn+1 (θn+1 − θn)

Nesterov(2004), Tseng(2008), Beck-Teboulle(2009) Zhu-Orecchia (2015); Attouch-Peypouquet(2015); Bubeck-Lee-Singh(2015); Su-Boyd-Candes(2015)

(deterministic) Proximal-gradient F(θn) − min F = O 1 n

  • (deterministic) Accelerated Proximal-gradient

F(θn) − min F = O 1 n2
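The accelerated iteration above can be sketched on the same lasso-type toy problem as earlier (a quadratic f plus an ℓ1 penalty, an illustrative choice not from the slides), with the classical FISTA sequence t_{n+1} = (1 + √(1 + 4t_n²))/2 and a constant stepsize γ = 1/L:

```python
import numpy as np

def soft(tau, thresh):
    """Prox of thresh * ||.||_1 (soft-thresholding)."""
    return np.sign(tau) * np.maximum(np.abs(tau) - thresh, 0.0)

def fista(A, b, lam, n_iter=500):
    """Accelerated proximal gradient for 0.5*||A x - b||^2 + lam*||x||_1."""
    gamma = 1.0 / np.linalg.norm(A.T @ A, 2)   # gamma = 1/L
    theta = tau = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(n_iter):
        theta_new = soft(tau - gamma * A.T @ (A @ tau - b), gamma * lam)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        tau = theta_new + ((t - 1.0) / t_new) * (theta_new - theta)
        theta, t = theta_new, t_new
    return theta

rng = np.random.default_rng(3)
A = rng.normal(size=(30, 8))
b = rng.normal(size=30)
x_acc = fista(A, b, lam=1.0)
```

This t_n sequence satisfies the inequality above for constant γ; the limit point again satisfies the prox fixed-point equation.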

SLIDE 54

Acceleration (3/3) (Aujol-Dossal-F.-Moulines, work in progress)

Perturbed Nesterov acceleration: some convergence results.
Choose γ_n, m_n, t_n s.t.
  γ_n ∈ (0, 1/L],   lim_n γ_n t²_n = +∞,   Σ_n γ_n t_n (1 + γ_n t_n) / m_n < ∞.
Then there exists θ⋆ ∈ argmin_Θ F s.t. lim_n θ_n = θ⋆. In addition,
  F(θ_{n+1}) − min F = O(1 / (γ_{n+1} t²_n)).

Schmidt-Le Roux-Bach (2011); Dossal-Chambolle (2014); Aujol-Dossal (2015).

Table: Control of F(θ_n) − min F
  γ_n   | m_n | t_n | rate     | NbrMC
  γ     | n³  | n   | n^{−2}   | n⁴
  γ/√n  | n²  | n   | n^{−3/2} | n³

SLIDE 55

Outline

Motivations
Penalized ML through Perturbed Stochastic-Gradient algorithms
Asymptotic behavior of the algorithm
Numerical illustration

slide-56
SLIDE 56

Algorithmes Gradient-Proximaux pour l’inf´ erence statistique Numerical illustration

Inference in graphs (with L. Risser)

Hereafter, we compare MCPG and SAPG; more precisely, for MCPG the roles of γn ∼ γ⋆ n^{-a} and mn, and for SAPG the roles of γn ∼ γ⋆ n^{-a}, δn ∼ δ⋆ n^{-b} and mn. Boxplots are obtained from 50 independent runs. Each run of each algorithm produces a sequence {θn, n ≥ 0}. We analyze

  • the convergence of the L1-norm, to illustrate the convergence of the sequence;
  • the convergence of the size of the support: #{k : |θn,k| > 0};
  • for each component of the vector θn, the frequency of being in the support (frequency over the 50 runs).
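The three diagnostics above are straightforward to compute from the final iterates of the 50 runs. A minimal sketch, on a hypothetical (runs × dimension) array of iterates:

```python
import numpy as np

def diagnostics(thetas, tol=0.0):
    """Compute the three diagnostics for a (runs x dim) array of iterates."""
    support = np.abs(thetas) > tol
    l1_norms = np.abs(thetas).sum(axis=1)       # ||theta||_1, one value per run
    support_sizes = support.sum(axis=1)         # #{k : |theta_k| > 0} per run
    support_freq = support.mean(axis=0)         # per-component frequency over runs
    return l1_norms, support_sizes, support_freq

# Hypothetical data: 2 runs, 3 components.
thetas = np.array([[0.0, 1.5, -0.5],
                   [0.0, 2.0,  0.0]])
l1, sizes, freq = diagnostics(thetas)
# l1 = [2.0, 2.0]; sizes = [2, 1]; freq = [0.0, 1.0, 0.5]
```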

SLIDE 57


SAPG: here mn = 500. MCPG: different cases for the batch size:
  • constant: mn = 3 000;
  • square-root growth: mn = 150√n;
  • linear growth: mn = max(200, 10n).
The constants are chosen so that after 600 iterations, all the algorithms have used the same total number of Monte Carlo draws (here 1.8 × 10⁶).
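As a quick sanity check on the stated budget, one can sum each batch-size schedule over the 600 iterations (a small sketch; the schedules and constants are those quoted above):

```python
# Total Monte Carlo draws after 600 iterations for each MCPG batch-size schedule.
N = 600
constant = sum(3000 for n in range(1, N + 1))              # m_n = 3000
sqrt_growth = sum(150 * n ** 0.5 for n in range(1, N + 1))  # m_n = 150 sqrt(n)
linear = sum(max(200, 10 * n) for n in range(1, N + 1))     # m_n = max(200, 10n)
# The constant and linear schedules both land at about 1.8e6 draws;
# the square-root schedule is of the same order of magnitude.
```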

SLIDE 58


MCPG: boxplot of n ↦ ‖θn‖₁ along 600 iterations

[Three boxplot panels of ‖θn‖₁ over iterations 50–600; left: a = 0.01, 0.25, 0.5, 0.75, 1; center and right: a = 0.51, 0.75, 1.]

Different batch sizes: mn = O(n) (left), mn = O(√n) (center), mn = m (right). Different rates for the stepsize: γn = O(n^{-a}). → It is better to have a slowly decaying stepsize, 1/√n, to be efficient both in the transient phase and in the convergence phase.

SLIDE 59


MCPG: boxplot of the size of the support of θn, along 600 iterations

[Three boxplot panels of the support size over iterations 50–600; left: a = 0.01, 0.25, 0.5, 0.75, 1; center and right: a = 0.51, 0.75, 1.]

Different batch sizes: mn = O(n) (left), mn = O(√n) (center), mn = m (right). Different stepsizes: γn = O(n^{-a}). → It is better to choose a slowly decaying rate.

SLIDE 60


At convergence, support / MCPG: θn contains about 4800 components.

[Support-frequency plot over the components, for (a, c) ∈ {1, 0.75, 0.51} × {0, 0.5, 1} and (a, c) = (0.25, 1), (0.01, 1); color scale from 0.1 to 1.]

For different decaying rates of the stepsize sequence γk = O(k^{-a}) and increasing rates of the batch size mk = O(k^c).

SLIDE 61


SAPG: boxplot of n ↦ ‖θn‖₁ along 600 iterations

[Three boxplot panels of ‖θn‖₁ over iterations 50–600; each panel shows b = 0, 1/3, 2/3, 1.]

Different decaying rates for γn: O(1/n) (left), O(n^{-3/4}) (center), O(n^{-0.5}) (right). Different decaying rates for δn = O(n^{-b}). → It is better to have a slowly decaying γn; and in that case, a constant δn.

SLIDE 62


SAPG: boxplot of the size of the support of θn, along 600 iterations

[Three boxplot panels of the support size over iterations 100–600; each panel shows b = 0, 1/3, 2/3, 1.]

Different decaying rates of γn: O(1/n) (left), O(n^{-3/4}) (center), O(n^{-0.5}) (right). Different decaying rates of δn = O(n^{-b}). → A slowly decaying γn is better; and in that case, a constant δn for the transient phase and a rapid decay for the convergence phase.

SLIDE 63


Support at convergence, SAPG

[Support-frequency plot over the components, for b ∈ {1, 2/3, 1/3, 0} and a ∈ {1, 0.75, 0.51}; color scale from 0.1 to 1.]

For different decaying rates γk = O(k^{-a}) and δk = O(k^{-b}): probability of being in the support at iteration n = 600, computed over 50 independent runs.

SLIDE 64


MCPG or SAPG?

  • Little difference is seen in the evolution of θn. However, SAPG yields much sparser vectors: see the size of the support at convergence, and the probability of the components being active at the final iteration. This is precisely because it gives a better approximation of the gradient, by using all the draws produced since the beginning.
  • MCPG clearly suffers as long as the number of MC draws is small (see the size of the support, in the case mn = O(n)).