SLIDE 1

Online k-MLE for mixture modelling with exponential families

Christophe Saint-Jean, Frank Nielsen

Geometric Science of Information (GSI) 2015, Oct 28-30, 2015, École Polytechnique, Paris-Saclay

SLIDE 2

Application Context

We are interested in building a system (a model) which evolves as new data becomes available: x1, x2, ..., xN, ...
The time needed to process a new observation must be constant w.r.t. the number of observations.
The memory required by the system is bounded.
Denote by π the unknown distribution of X.

SLIDE 3

Outline of this talk

1. Online learning of exponential families

2. Online learning of mixtures of exponential families: introduction (EM, k-MLE); Recursive EM, Online EM; stochastic approximations of k-MLE; experiments

3. Conclusions

SLIDE 4

Reminder: (Regular) Exponential Family

Firstly, π will be approximated by a member of a (regular) exponential family (EF):

EF = { f(x; θ) = exp{⟨s(x), θ⟩ + k(x) − F(θ)} | θ ∈ Θ }

Terminology:
- λ: source parameters
- θ: natural parameters
- η: expectation parameters
- s(x): sufficient statistic
- k(x): auxiliary carrier measure
- F(θ): the log-normalizer, differentiable and strictly convex, with Θ = {θ ∈ R^D | F(θ) < ∞} an open convex set

Almost all common distributions are EF members, with exceptions such as the uniform and Cauchy distributions.

SLIDE 5

Reminder: Maximum Likelihood Estimate (MLE)

Maximum Likelihood Estimate for a general p.d.f., assuming a sample χ = {x1, x2, ..., xN} of i.i.d. observations:

θ̂(N) = argmax_θ ∏_{i=1}^{N} f(xi; θ) = argmin_θ −(1/N) ∑_{i=1}^{N} log f(xi; θ)

Maximum Likelihood Estimate for an EF:

θ̂(N) = argmin_θ −⟨(1/N) ∑_i s(xi), θ⟩ − cst(χ) + F(θ)

which is exactly solved in H, the space of expectation parameters:

η̂(N) = ∇F(θ̂(N)) = (1/N) ∑_i s(xi)  ⟺  θ̂(N) = (∇F)^{−1}( (1/N) ∑_i s(xi) )
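As a concrete illustration (ours, not from the slides), take the exponential distribution Exp(λ) written as an EF with s(x) = x, natural parameter θ = −λ and F(θ) = −log(−θ), so ∇F(θ) = −1/θ = E[X]. The MLE is then just (∇F)⁻¹ applied to the average sufficient statistic:

```python
import numpy as np

# Toy sketch (ours): MLE for Exp(lambda) seen as an EF with
# s(x) = x, theta = -lambda, F(theta) = -log(-theta).
def mle_exponential(xs):
    """MLE via the average sufficient statistic eta_hat = (1/N) sum_i s(x_i)."""
    eta_hat = np.mean(xs)            # eta_hat lives in H
    theta_hat = -1.0 / eta_hat       # theta_hat = (grad F)^{-1}(eta_hat)
    return theta_hat, -theta_hat     # natural parameter, rate lambda_hat

rng = np.random.default_rng(0)
xs = rng.exponential(scale=0.5, size=100_000)   # true lambda = 2
theta_hat, lam_hat = mle_exponential(xs)
```

The same two lines (average the sufficient statistic, apply (∇F)⁻¹) work verbatim for any EF with a known inverse-gradient map.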

SLIDE 6

Exact Online MLE for an exponential family

A recursive formulation is easily obtained.

Algorithm 1: Exact Online MLE for an EF
Input: a sequence S of observations
Input: functions s and (∇F)^{−1} for some EF
Output: a sequence of MLEs for all observations seen so far
η̂(0) = 0; N = 1;
for xN ∈ S do
    η̂(N) = η̂(N−1) + N^{−1}(s(xN) − η̂(N−1));
    yield η̂(N) or yield (∇F)^{−1}(η̂(N));
    N = N + 1;

Analytical expressions of (∇F)^{−1} exist for most EFs (but not all).
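Algorithm 1 can be sketched as follows for a univariate Gaussian (our own instantiation, with s(x) = (x, x²) and η = (μ, μ² + σ²)):

```python
import numpy as np

# Minimal sketch (ours) of Algorithm 1 for a univariate Gaussian:
# s(x) = (x, x^2), so eta = (E[x], E[x^2]) = (mu, mu^2 + sigma^2).
def online_mle_gaussian(stream):
    eta = np.zeros(2)
    for n, x in enumerate(stream, start=1):
        s = np.array([x, x * x])
        eta += (s - eta) / n           # eta(N) = eta(N-1) + (s(x_N) - eta(N-1))/N
        mu = eta[0]
        var = eta[1] - eta[0] ** 2     # invert eta -> (mu, sigma^2) in moment form
        yield mu, var

rng = np.random.default_rng(1)
stream = rng.normal(loc=3.0, scale=2.0, size=50_000)
for mu, var in online_mle_gaussian(stream):
    pass  # keep only the final estimate
```

Each update costs O(1) time and O(1) memory, matching the constraints stated on slide 2.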

SLIDE 7

Case of the multivariate normal distribution (MVN)

Probability density function of the MVN:

N(x; μ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp{ −(1/2)(x − μ)^T Σ^{−1} (x − μ) }

One possible decomposition:

N(x; θ1, θ2) = exp{ ⟨θ1, x⟩ + ⟨θ2, −xx^T⟩_F − (1/4) θ1^T θ2^{−1} θ1 − (d/2) log π + (1/2) log |θ2| }

⟹ s(x) = (x, −xx^T)

(∇F)^{−1}(η1, η2) = ( (−η1η1^T − η2)^{−1} η1, (1/2)(−η1η1^T − η2)^{−1} )
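A sketch of this inverse map (ours): since s(x) = (x, −xxᵀ), we have η1 = E[x] = μ and η2 = E[−xxᵀ] = −(Σ + μμᵀ), so −η1η1ᵀ − η2 recovers Σ and (∇F)⁻¹ yields θ1 = Σ⁻¹μ, θ2 = ½Σ⁻¹:

```python
import numpy as np

# Sketch (ours) of (grad F)^{-1} for the MVN decomposition above.
def grad_F_inv(eta1, eta2):
    """Map expectation parameters (eta1, eta2) to natural parameters (theta1, theta2)."""
    Sigma = -np.outer(eta1, eta1) - eta2      # Sigma = -eta1 eta1^T - eta2
    P = np.linalg.inv(Sigma)                  # precision matrix Sigma^{-1}
    return P @ eta1, 0.5 * P                  # theta1 = Sigma^{-1} mu, theta2 = Sigma^{-1}/2

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
xs = rng.multivariate_normal(mu, Sigma, size=200_000)
eta1 = xs.mean(axis=0)                                   # average of s_1(x) = x
eta2 = -(xs[:, :, None] * xs[:, None, :]).mean(axis=0)   # average of s_2(x) = -x x^T
theta1, theta2 = grad_F_inv(eta1, eta2)
theta1_true = np.linalg.solve(Sigma, mu)
theta2_true = 0.5 * np.linalg.inv(Sigma)
```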

SLIDE 8

Case of the Wishart distribution

See details in the paper.

SLIDE 9

Finite (parametric) mixture models

Now, π will be approximated by a finite (parametric) mixture f(·; θ) indexed by θ:

π(x) ≈ f(x; θ) = ∑_{j=1}^{K} wj fj(x; θj),  0 ≤ wj ≤ 1,  ∑_{j=1}^{K} wj = 1

where the wj are the mixing proportions and the fj are the component distributions. When all fj's are EFs, it is called a Mixture of EFs (MEF).

[Figure: plot of the mixture density 0.1 * dnorm(x) + 0.6 * dnorm(x, 4, 2) + 0.3 * dnorm(x, −2, 0.5), showing the unknown true distribution f*, the mixture distribution f, and the component density functions f_j]
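The plotted density can be reproduced in a few lines (a sketch of ours; in R's notation, dnorm(x, m, s) is the normal density with mean m and standard deviation s):

```python
import numpy as np

# Sketch (ours) of the mixture density in the figure:
# f(x) = 0.1 N(x; 0, 1) + 0.6 N(x; 4, 2^2) + 0.3 N(x; -2, 0.5^2).
w = np.array([0.1, 0.6, 0.3])    # mixing proportions, sum to 1
m = np.array([0.0, 4.0, -2.0])   # component means
s = np.array([1.0, 2.0, 0.5])    # component standard deviations

def mixture_pdf(x):
    """f(x; theta) = sum_j w_j f_j(x; theta_j)."""
    comp = np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    return float(np.sum(w * comp))

# Sanity check: a Riemann sum over a wide grid recovers total mass ~ 1.
grid = np.linspace(-12.0, 18.0, 30001)
total_mass = sum(mixture_pdf(x) for x in grid) * (grid[1] - grid[0])
```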

SLIDE 10

Incompleteness in mixture models

incomplete (observable): χ = {x1, ..., xN}  ←(deterministic)←  complete (unobservable): χc = {y1 = (x1, z1), ..., yN = (xN, zN)}

Zi ∼ Cat_K(w),  Xi | Zi = j ∼ fj(·; θj)

For a MEF, the joint density p(x, z; θ) is an EF:

log p(x, z; θ) = ∑_{j=1}^{K} [z = j] { log wj + ⟨θj, sj(x)⟩ + kj(x) − Fj(θj) }
              = ∑_{j=1}^{K} ⟨ ([z = j], [z = j] sj(x)), (log wj − Fj(θj), θj) ⟩ + k(x, z)
SLIDE 11

Expectation-Maximization (EM) [1]

The EM algorithm iteratively maximizes Q(θ; θ̂(t), χ).

Algorithm 2: EM algorithm
Input: θ̂(0), initial parameters of the model
Input: χ(N) = {x1, ..., xN}
Output: a (local) maximizer θ̂(t*) of log f(χ; θ)
t ← 0;
repeat
    Compute Q(θ; θ̂(t), χ) := E_{θ̂(t)}[log p(χc; θ) | χ];    // E-Step
    Choose θ̂(t+1) = argmax_θ Q(θ; θ̂(t), χ);                 // M-Step
    t ← t + 1;
until convergence of the complete log-likelihood;

SLIDE 12

EM for MEF

For a mixture, the E-Step is always explicit:

ẑ(t)_{i,j} = ŵ(t)_j f(xi; θ̂(t)_j) / ∑_{j′} ŵ(t)_{j′} f(xi; θ̂(t)_{j′})

For a MEF, the M-Step then reduces to:

θ̂(t+1) = argmax_{wj, θj} ∑_{j=1}^{K} ⟨ (∑_i ẑ(t)_{i,j}, ∑_i ẑ(t)_{i,j} sj(xi)), (log wj − Fj(θj), θj) ⟩

ŵ(t+1)_j = ∑_{i=1}^{N} ẑ(t)_{i,j} / N

η̂(t+1)_j = ∇F(θ̂(t+1)_j) = ∑_i ẑ(t)_{i,j} sj(xi) / ∑_i ẑ(t)_{i,j}   (weighted average of SS)
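The two steps above can be sketched for a 1-D Gaussian mixture (our own toy instantiation, with sj(x) = (x, x²); all names are ours):

```python
import numpy as np

# Sketch (ours) of one EM iteration for a 1-D Gaussian mixture.
def normal_pdf(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

def em_step(xs, w, mu, var):
    # E-Step: posterior responsibilities z_hat[i, j], explicit for a mixture.
    dens = w * normal_pdf(xs[:, None], mu, var)        # shape (N, K)
    z = dens / dens.sum(axis=1, keepdims=True)
    # M-Step: weighted averages of the sufficient statistics.
    nj = z.sum(axis=0)
    w_new = nj / len(xs)
    mu_new = (z * xs[:, None]).sum(axis=0) / nj
    var_new = (z * (xs[:, None] - mu_new) ** 2).sum(axis=0) / nj
    return w_new, mu_new, var_new

rng = np.random.default_rng(3)
xs = np.concatenate([rng.normal(0, 1, 5000), rng.normal(5, 1, 5000)])
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 6.0]), np.array([1.0, 1.0])
for _ in range(50):
    w, mu, var = em_step(xs, w, mu, var)
```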

SLIDE 13

k-Maximum Likelihood Estimator (k-MLE) [2]

The k-MLE introduces a geometric split χ = ⋃_{j=1}^{K} χ̂(t)_j to accelerate EM:

z̃(t)_{i,j} = [argmax_{j′} wj′ f(xi; θ̂(t)_{j′}) = j]

Equivalently, it amounts to maximizing Q over the partition Z [3]. For a MEF, the M-Step of the k-MLE then reduces to:

θ̂(t+1) = argmax_{wj, θj} ∑_{j=1}^{K} ⟨ (|χ̂(t)_j|, ∑_{xi ∈ χ̂(t)_j} sj(xi)), (log wj − Fj(θj), θj) ⟩

ŵ(t+1)_j = |χ̂(t)_j| / N

η̂(t+1)_j = ∇F(θ̂(t+1)_j) = ∑_{xi ∈ χ̂(t)_j} sj(xi) / |χ̂(t)_j|   (cluster-wise unweighted average of SS)
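The hard-assignment variant can be sketched like so (ours, again for 1-D Gaussian components; this replaces the soft E-Step by an argmax and the weighted averages by cluster-wise unweighted ones):

```python
import numpy as np

# Sketch (ours) of one k-MLE iteration for a 1-D Gaussian mixture.
def normal_pdf(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

def kmle_step(xs, w, mu, var):
    # Geometric split of chi: hard argmax assignment.
    dens = w * normal_pdf(xs[:, None], mu, var)        # shape (N, K)
    labels = dens.argmax(axis=1)
    # Cluster-wise unweighted averages of the sufficient statistics.
    for j in range(len(w)):
        chi_j = xs[labels == j]
        if len(chi_j):
            w[j] = len(chi_j) / len(xs)
            mu[j] = chi_j.mean()
            var[j] = chi_j.var()
    return w, mu, var

rng = np.random.default_rng(4)
xs = np.concatenate([rng.normal(0, 1, 5000), rng.normal(10, 1, 5000)])
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 11.0]), np.array([1.0, 1.0])
for _ in range(10):
    w, mu, var = kmle_step(xs, w, mu, var)
```

With well-separated components the hard split is essentially exact; the inconsistency discussed later appears when components overlap.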

SLIDE 14

Online learning of mixtures

Consider now the online setting: x1, x2, ..., xN, ...
Denote by θ̂(N) or η̂(N) the parameter estimate after processing N observations, and by θ̂(0) or η̂(0) their initial values.
Remark: for a fixed-size dataset χ, one may apply multiple passes (with shuffling) over χ. However, an increase in the likelihood function is no longer guaranteed after an iteration.

SLIDE 15

Stochastic approximations of EM (1)

Two main approaches to online EM-like estimation.

Stochastic M-Step: Recursive EM (1984) [5]

θ̂(N) = θ̂(N−1) + {N Ic(θ̂(N−1))}^{−1} ∇θ log f(xN; θ̂(N−1))

where Ic is the Fisher information matrix for the complete data:

Ic(θ̂(N−1)) = −E_{θ̂(N−1)} [ ∂² log p(x, z; θ) / ∂θ∂θ^T ]

A justification for this formula comes from Fisher's identity:

∇ log f(x; θ) = Eθ[∇ log p(x, z; θ) | x]

One can recognize a second-order Stochastic Gradient Ascent, which requires updating and inverting Ic after each iteration.

SLIDE 16

Stochastic approximations of EM (2)

Stochastic E-Step: Online EM (2009) [7]

Q̂(N)(θ) = Q̂(N−1)(θ) + α(N) ( E_{θ̂(N−1)}[log p(xN, zN; θ) | xN] − Q̂(N−1)(θ) )

In the case of a MEF, the algorithm works only with the conditional expectation of the sufficient statistics for the complete data:

ẑN,j = E_{θ̂(N−1)}[zN,j | xN]

(Ŝ(N)_{wj}, Ŝ(N)_{θj}) = (Ŝ(N−1)_{wj}, Ŝ(N−1)_{θj}) + α(N) ( (ẑN,j, ẑN,j sj(xN)) − (Ŝ(N−1)_{wj}, Ŝ(N−1)_{θj}) )

The M-Step is unchanged:

ŵ(N)_j = η̂(N)_{wj} = Ŝ(N)_{wj}
θ̂(N)_j = (∇Fj)^{−1}( η̂(N)_{θj} = Ŝ(N)_{θj} / Ŝ(N)_{wj} )
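A runnable sketch of these updates (our own toy instantiation for a 1-D Gaussian mixture, with sj(x) = (x, x²), step size α(N) = N^{−0.6}, and a short burn-in before the first M-Step; all of these choices are ours):

```python
import numpy as np

# Online EM sketch (ours) for a 1-D Gaussian mixture.
def normal_pdf(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

def online_em(stream, w, mu, var, burn_in=100):
    S_w = w.copy()                                          # S^(0)_{w_j} = w_j
    S_s = np.stack([w * mu, w * (var + mu ** 2)], axis=1)   # S^(0)_{theta_j} = w_j eta^(0)_j
    for n, x in enumerate(stream, start=1):
        alpha = n ** -0.6
        # Stochastic E-Step: responsibilities of the new observation.
        dens = w * normal_pdf(x, mu, var)
        z = dens / dens.sum()
        S_w = S_w + alpha * (z - S_w)
        S_s = S_s + alpha * (z[:, None] * np.array([x, x * x]) - S_s)
        if n > burn_in:                                     # M-Step (unchanged)
            w = S_w
            eta = S_s / S_w[:, None]
            mu = eta[:, 0]
            var = eta[:, 1] - mu ** 2
    return w, mu, var

rng = np.random.default_rng(5)
xs = rng.permutation(np.concatenate([rng.normal(0, 1, 10000),
                                     rng.normal(8, 1, 10000)]))
w0, mu0, var0 = np.array([0.5, 0.5]), np.array([-1.0, 9.0]), np.array([1.0, 1.0])
w, mu, var = online_em(xs, w0, mu0, var0)
```

Note that no matrix inversion is needed and the parameter constraints (weights summing to one, positive variances) are respected automatically, as stated on the next slide.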

SLIDE 17

Stochastic approximations of EM (3)

Some properties:
- Initial values Ŝ(0) may be used to introduce a "prior": Ŝ(0)_{wj} = wj, Ŝ(0)_{θj} = wj η(0)_j
- Parameter constraints are automatically respected
- No matrix to invert!
- A policy for α(N) has to be chosen (see [7])
- Consistent, asymptotically equivalent to the recursive EM!

SLIDE 18

Stochastic approximations of k-MLE (1)

In order to keep the previous advantages of online EM in an online k-MLE, our only remaining choice concerns the way xN is assigned to a cluster.

Strategy 1: maximize the likelihood of the complete data (xN, zN):

z̃N,j = [argmax_{j′} ŵ(N−1)_{j′} f(xN; θ̂(N−1)_{j′}) = j]

Equivalent to Online CEM and similar to MacQueen's iterative k-means.

SLIDE 19

Stochastic approximations of k-MLE (2)

Strategy 2: maximize the likelihood of the complete data (xN, zN) after the M-Step:

z̃N,j = [argmax_{j′} ŵ(N)_{j′} f(xN; θ̂(N)_{j′}) = j]

Similar to Hartigan's method for k-means. Additional cost: pre-compute all possible M-Steps for the stochastic E-Step.

SLIDE 20

Stochastic approximations of k-MLE (3)

Strategy 3: draw z̃N from a categorical distribution:

z̃N ∼ Cat_K({pj ∝ ŵ(N−1)_j fj(xN; θ̂(N−1)_j)}_j)

Similar to the sampling step in Stochastic EM [3]. The motivation is to try to break the inconsistency of k-MLE.

For strategies 1 and 3, the M-Step reduces to updating the parameters of a single component.
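The three assignment rules can be sketched in one function (ours, for 1-D Gaussian components; strategies 1 and 2 differ only in whether the parameters passed in are the pre- or post-M-Step ones):

```python
import numpy as np

# Sketch (ours) of the cluster-assignment strategies for x_N.
def assign(x, w, mu, var, strategy, rng=None):
    # Unnormalized posterior weights w_j f_j(x; theta_j).
    dens = w * np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    if strategy in (1, 2):
        return int(dens.argmax())            # hard argmax assignment
    p = dens / dens.sum()                    # strategy 3: stochastic assignment
    return int(rng.choice(len(w), p=p))

w, mu, var = np.array([0.5, 0.5]), np.array([0.0, 10.0]), np.array([1.0, 1.0])
rng = np.random.default_rng(6)
hard = assign(0.2, w, mu, var, strategy=1)
draws = [assign(0.2, w, mu, var, strategy=3, rng=rng) for _ in range(200)]
```

For an observation deep inside one component, the sampled assignment agrees with the argmax almost surely; the strategies only differ meaningfully where components overlap.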

SLIDE 21

Experiments

True distribution: π = 0.5 N(0, 1) + 0.5 N(μ2, σ2²)

Different values of μ2 and σ2 give more or less overlap between the components. A small subset of observations has been taken for initialization (k-MLE++ / k-MLE). A video illustrates the inconsistency of online k-MLE.

SLIDE 22

Experiments on Wishart

SLIDE 23

Conclusions - Future works

On consistency:
- EM and Online EM are consistent.
- k-MLE and online k-MLE (strategies 1, 2) are inconsistent (due to the Bayes error in maximizing the classification likelihood).
- Online stochastic k-MLE (strategy 3): consistency?

So, when components overlap, online EM > k-MLE > online k-MLE for parameter learning. We need to study how the dimension influences the inconsistency/convergence rate for online k-MLE. The convergence rate is lower for online methods (sub-linear convergence of SGD).

Time for an update vs. sample size: online k-MLE (1, 3) < online EM < online k-MLE (2) << k-MLE
SLIDE 24

Online EM appears to be the best compromise!
SLIDE 25

References I

[1] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.
[2] Nielsen, F.: On learning statistical mixtures maximizing the complete likelihood. Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014), AIP Conference Proceedings, 1641, pp. 238–245, 2014.
[3] Celeux, G., Govaert, G.: A classification EM algorithm for clustering and two stochastic versions. Computational Statistics and Data Analysis, 14(3), pp. 315–332, 1992.

SLIDE 26

References II

[4] Samé, A., Ambroise, C., Govaert, G.: An online classification EM algorithm based on the mixture model. Statistics and Computing, 17(3), pp. 209–218, 2007.
[5] Titterington, D.M.: Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society, Series B (Methodological), 46(2), pp. 257–267, 1984.
[6] Amari, S.-I.: Natural gradient works efficiently in learning. Neural Computation, 10(2), pp. 251–276, 1998.
[7] Cappé, O., Moulines, E.: On-line expectation-maximization algorithm for latent data models. Journal of the Royal Statistical Society, Series B (Methodological), 71(3), pp. 593–613, 2009.

SLIDE 27

References III

[8] Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, M.I., editor, Learning in Graphical Models, pp. 355–368. MIT Press, Cambridge, 1999.
[9] Bottou, Léon: Online algorithms and stochastic approximations. In Online Learning and Neural Networks, Saad, David, Ed., Cambridge University Press, 1998.