Online k-MLE for mixture modelling with exponential families

  1. Online k-MLE for mixture modelling with exponential families. Christophe Saint-Jean, Frank Nielsen. Geometric Science of Information (GSI) 2015, Oct 28-30, 2015, Ecole Polytechnique, Paris-Saclay.

  2. Application Context. We are interested in building a system (a model) which evolves as new data become available: x_1, x_2, ..., x_N, ... The time needed to process a new observation must be constant w.r.t. the number of observations already seen, and the memory required by the system must be bounded. Denote by π the unknown distribution of X.

  3. Outline of this talk: (1) Online learning of exponential families; (2) Online learning of mixtures of exponential families: introduction, EM, k-MLE; recursive EM, online EM; stochastic approximations of k-MLE; (3) Experiments and conclusions.

  4. Reminder: (Regular) Exponential Family. Firstly, π will be approximated by a member of a (regular) exponential family (EF):
     EF = { f(x; θ) = exp{ ⟨s(x), θ⟩ + k(x) − F(θ) } | θ ∈ Θ }
     Terminology: λ the source parameters; θ the natural parameters; η the expectation parameters; s(x) the sufficient statistic; k(x) the auxiliary carrier measure; F(θ) the log-normalizer, differentiable and strictly convex on Θ = { θ ∈ R^D | F(θ) < ∞ }, an open convex set.
     Almost all common distributions are EF members, with exceptions such as the uniform and Cauchy distributions.
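Not part of the original slides: a minimal Python sketch of this canonical form for the univariate Gaussian with unit variance, taking s(x) = x, θ = μ, k(x) = −x²/2 − ½·log(2π) and F(θ) = θ²/2, and checking it against the usual density.

```python
import numpy as np
from scipy.stats import norm

# Univariate Gaussian with unit variance written as an EF (illustrative choice of k and F;
# the constant log(2*pi)/2 could equally be placed in F instead of k).
def s(x):      return x
def k(x):      return -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)
def F(theta):  return 0.5 * theta ** 2

def ef_pdf(x, theta):
    return np.exp(s(x) * theta + k(x) - F(theta))

x, mu = 1.3, 0.7
assert np.isclose(ef_pdf(x, mu), norm.pdf(x, loc=mu, scale=1.0))
```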

  5. Reminder: Maximum Likelihood Estimate (MLE). For a general p.d.f., assuming a sample χ = {x_1, x_2, ..., x_N} of i.i.d. observations:
     θ̂^(N) = argmax_θ ∏_{i=1}^N f(x_i; θ) = argmin_θ −(1/N) ∑_{i=1}^N log f(x_i; θ)
     For an EF:
     θ̂^(N) = argmin_θ { ⟨ −(1/N) ∑_i s(x_i), θ ⟩ − cst(χ) + F(θ) }
     which is solved exactly in H, the space of expectation parameters:
     η̂^(N) = ∇F(θ̂^(N)) = (1/N) ∑_i s(x_i)   ≡   θ̂^(N) = (∇F)^{−1}( (1/N) ∑_i s(x_i) )

  6. Exact online MLE for an exponential family. A recursive formulation is easily obtained.
     Algorithm 1: Exact online MLE for an EF
       Input: a sequence S of observations; the functions s and (∇F)^{−1} for some EF
       Output: a sequence of MLEs over all observations seen so far
       η̂^(0) = 0; N = 1;
       for x_N ∈ S do
         η̂^(N) = η̂^(N−1) + N^{−1} ( s(x_N) − η̂^(N−1) );
         yield η̂^(N), or yield (∇F)^{−1}(η̂^(N));
         N = N + 1;
     Analytical expressions of (∇F)^{−1} exist for most EFs (but not all).
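A small Python sketch of Algorithm 1, not from the slides: the recursive running average of sufficient statistics, instantiated for a hypothetical univariate Gaussian where, for readability, the inverse map returns the source parameters (μ, σ²) rather than the natural ones.

```python
import numpy as np

def online_mle(stream, s, grad_F_inv):
    """Algorithm 1 sketch: recursively average the sufficient statistics eta^(N)
    and map back through (grad F)^{-1} whenever that inverse is known."""
    eta, N = None, 0
    for x in stream:
        N += 1
        sx = np.asarray(s(x), dtype=float)
        eta = sx if eta is None else eta + (sx - eta) / N   # eta^(N) update
        yield grad_F_inv(eta)

# Illustrative instantiation: univariate Gaussian with s(x) = (x, x^2), so that
# eta = (E[x], E[x^2]) and the MLE is mu = eta1, sigma^2 = eta2 - eta1^2.
s = lambda x: (x, x * x)
grad_F_inv = lambda eta: (eta[0], eta[1] - eta[0] ** 2)   # returns (mu, sigma^2)

rng = np.random.default_rng(0)
for est in online_mle(rng.normal(2.0, 3.0, size=1000), s, grad_F_inv):
    pass
print(est)   # close to (2.0, 9.0)
```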

  7. Case of the multivariate normal distribution (MVN). Probability density function of the MVN:
     N(x; μ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp{ −½ (x − μ)^T Σ^{−1} (x − μ) }
     One possible decomposition:
     N(x; θ_1, θ_2) = exp{ ⟨θ_1, x⟩ + ⟨θ_2, −xx^T⟩ − F(θ_1, θ_2) },   F(θ_1, θ_2) = ¼ θ_1^T θ_2^{−1} θ_1 + (d/2) log(π) − ½ log|θ_2|
     s(x) = (x, −xx^T)  ⟹  (∇F)^{−1}(η_1, η_2) = ( (−η_1 η_1^T − η_2)^{−1} η_1 , ½ (−η_1 η_1^T − η_2)^{−1} )
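For concreteness, a Python sketch (not from the slides) of this inverse map, both in natural parameters as on the slide and in source parameters, checked on averaged sufficient statistics of a random sample.

```python
import numpy as np

def mvn_grad_F_inv(eta1, eta2):
    """(grad F)^{-1} for the MVN with s(x) = (x, -x x^T):
    theta1 = (-eta1 eta1^T - eta2)^{-1} eta1, theta2 = 0.5 * (-eta1 eta1^T - eta2)^{-1}."""
    P = np.linalg.inv(-eta2 - np.outer(eta1, eta1))
    return P @ eta1, 0.5 * P

def mvn_eta_to_source(eta1, eta2):
    """Same map in source parameters: mu = eta1, Sigma = -eta2 - eta1 eta1^T."""
    return eta1, -eta2 - np.outer(eta1, eta1)

# Check on a random sample: eta is the sample average of s(x) = (x, -x x^T).
rng = np.random.default_rng(1)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 1.0]], size=5000)
eta1 = X.mean(axis=0)
eta2 = -(X[:, :, None] * X[:, None, :]).mean(axis=0)
mu, Sigma = mvn_eta_to_source(eta1, eta2)
print(mu, Sigma)   # close to the true parameters
```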

  8. Case of the Wishart distribution: see details in the paper.

  9. Finite (parametric) mixture models. Now, π will be approximated by a finite (parametric) mixture f(·; θ) indexed by θ:
     π(x) ≈ f(x; θ) = ∑_{j=1}^K w_j f_j(x; θ_j),   0 ≤ w_j ≤ 1,   ∑_{j=1}^K w_j = 1
     where the w_j are the mixing proportions and the f_j are the component distributions. When all the f_j's are EFs, it is called a mixture of EFs (MEF).
     [Figure: an example mixture density 0.1·N(x; 0, 1) + 0.6·N(x; 4, 2²) + 0.3·N(x; −2, 0.5²), its component densities f_j, and the unknown true distribution π.]
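A short Python sketch, not from the slides, evaluating such a mixture density; the weights, means and standard deviations are taken from the figure legend (which appears to use R's dnorm(mean, sd) convention).

```python
import numpy as np
from scipy.stats import norm

# Mixture density f(x; theta) = sum_j w_j f_j(x; theta_j) for the example in the figure.
weights = np.array([0.1, 0.6, 0.3])
means   = np.array([0.0, 4.0, -2.0])
sds     = np.array([1.0, 2.0, 0.5])

def mixture_pdf(x):
    x = np.atleast_1d(x)[:, None]
    return (weights * norm.pdf(x, loc=means, scale=sds)).sum(axis=1)

print(mixture_pdf([-2.0, 0.0, 4.0]))
```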

  10. Incompleteness in mixture models. The incomplete (observable) data χ = {x_1, ..., x_N} are obtained deterministically from the complete (partly unobservable) data χ_c = {y_1 = (x_1, z_1), ..., y_N = (x_N, z_N)}, where Z_i ∼ Cat_K(w) and X_i | Z_i = j ∼ f_j(·; θ_j).
     For a MEF, the joint density p(x, z; θ) is an EF:
     log p(x, z; θ) = ∑_{j=1}^K [z = j] { log(w_j) + ⟨θ_j, s_j(x)⟩ + k_j(x) − F_j(θ_j) }
                    = ∑_{j=1}^K ⟨ ( [z = j], [z = j] s_j(x) ), ( log w_j − F_j(θ_j), θ_j ) ⟩ + k(x, z)

  11. Expectation-Maximization (EM) [1]. The EM algorithm iteratively maximizes Q(θ; θ̂^(t), χ).
     Algorithm 2: EM algorithm
       Input: θ̂^(0), the initial parameters of the model; χ^(N) = {x_1, ..., x_N}
       Output: a (local) maximizer θ̂^(t*) of log f(χ; θ)
       t ← 0;
       repeat
         Compute Q(θ; θ̂^(t), χ) := E_{θ̂^(t)}[ log p(χ_c; θ) | χ ] ;   // E-Step
         Choose θ̂^(t+1) = argmax_θ Q(θ; θ̂^(t), χ) ;   // M-Step
         t ← t + 1;
       until convergence of the complete log-likelihood;

  12. EM for a MEF. For a mixture, the E-Step is always explicit:
     ẑ_{i,j}^(t) = ŵ_j^(t) f_j(x_i; θ̂_j^(t)) / ∑_{j'} ŵ_{j'}^(t) f_{j'}(x_i; θ̂_{j'}^(t))
     For a MEF, the M-Step then reduces to:
     θ̂^(t+1) = argmax_{ {w_j, θ_j} } ∑_{j=1}^K ⟨ ( ∑_i ẑ_{i,j}^(t), ∑_i ẑ_{i,j}^(t) s_j(x_i) ), ( log w_j − F_j(θ_j), θ_j ) ⟩
     ŵ_j^(t+1) = ∑_{i=1}^N ẑ_{i,j}^(t) / N
     η̂_j^(t+1) = ∇F_j(θ̂_j^(t+1)) = ∑_i ẑ_{i,j}^(t) s_j(x_i) / ∑_i ẑ_{i,j}^(t)   (weighted average of the sufficient statistics)
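A minimal batch sketch in Python, not from the slides, of one EM iteration for a univariate Gaussian MEF with s(x) = (x, x²), so the weighted average of sufficient statistics directly gives the expectation parameters; all names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def em_step(x, w, mu, var):
    """One EM iteration for a univariate Gaussian mixture, written as a weighted
    average of the sufficient statistics s(x) = (x, x^2)."""
    # E-Step: responsibilities z_hat[i, j]
    dens = w * norm.pdf(x[:, None], loc=mu, scale=np.sqrt(var))
    z = dens / dens.sum(axis=1, keepdims=True)
    # M-Step: mixing proportions and per-component expectation parameters
    Nj = z.sum(axis=0)
    w_new = Nj / len(x)
    eta1 = (z * x[:, None]).sum(axis=0) / Nj          # weighted average of x
    eta2 = (z * x[:, None] ** 2).sum(axis=0) / Nj     # weighted average of x^2
    return w_new, eta1, eta2 - eta1 ** 2              # (w, mu, sigma^2)

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 2, 700)])
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 3.0]), np.array([1.0, 1.0])
for _ in range(50):
    w, mu, var = em_step(x, w, mu, var)
print(w, mu, var)
```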

  13. k-Maximum Likelihood Estimator (k-MLE) [2]. The k-MLE introduces a geometric split χ = ⋃_{j=1}^K χ̂_j^(t) to accelerate EM:
     z̃_{i,j}^(t) = [ argmax_{j'} ŵ_{j'}^(t) f_{j'}(x_i; θ̂_{j'}^(t)) = j ]
     Equivalently, it amounts to maximizing Q over the partition Z [3]. For a MEF, the M-Step of the k-MLE then reduces to:
     θ̂^(t+1) = argmax_{ {w_j, θ_j} } ∑_{j=1}^K ⟨ ( |χ̂_j^(t)|, ∑_{x_i ∈ χ̂_j^(t)} s_j(x_i) ), ( log w_j − F_j(θ_j), θ_j ) ⟩
     ŵ_j^(t+1) = |χ̂_j^(t)| / N
     η̂_j^(t+1) = ∇F_j(θ̂_j^(t+1)) = ∑_{x_i ∈ χ̂_j^(t)} s_j(x_i) / |χ̂_j^(t)|   (cluster-wise unweighted average of the sufficient statistics)
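Not from the slides: the hard-assignment variant only changes the E-Step of the previous sketch, as in the illustrative Python snippet below (same names as above; empty clusters are not handled).

```python
import numpy as np
from scipy.stats import norm

def kmle_step(x, w, mu, var):
    """One k-MLE iteration: hard assignment, then cluster-wise unweighted averages
    of the sufficient statistics s(x) = (x, x^2)."""
    dens = w * norm.pdf(x[:, None], loc=mu, scale=np.sqrt(var))
    labels = dens.argmax(axis=1)            # geometric split of the data
    z = np.eye(len(w))[labels]              # one-hot "responsibilities"
    Nj = z.sum(axis=0)                      # cluster sizes |chi_j|
    w_new = Nj / len(x)
    eta1 = (z * x[:, None]).sum(axis=0) / Nj
    eta2 = (z * x[:, None] ** 2).sum(axis=0) / Nj
    return w_new, eta1, eta2 - eta1 ** 2
```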

  14. Online learning of mixtures. Consider now the online setting x_1, x_2, ..., x_N, ... Denote by θ̂^(N) or η̂^(N) the parameter estimate after processing N observations, and by θ̂^(0) or η̂^(0) their initial values.
     Remark: for a fixed-size dataset χ, one may apply multiple passes (with shuffling) over χ. The increase of the likelihood function is no longer guaranteed after an iteration.

  15. Stochastic approximations of EM (1). Two main approaches to online EM-like estimation.
     Stochastic M-Step: Recursive EM (1984) [5]
     θ̂^(N) = θ̂^(N−1) + { N I_c(θ̂^(N−1)) }^{−1} ∇_θ log f(x_N; θ̂^(N−1))
     where I_c is the Fisher information matrix for the complete data:
     I_c(θ̂^(N−1)) = −E_{θ̂^(N−1)}[ ∂² log p(x, z; θ) / ∂θ ∂θ^T ]
     A justification for this formula comes from Fisher's identity:
     ∇_θ log f(x; θ) = E_θ[ ∇_θ log p(x, z; θ) | x ]
     One can recognize a second-order stochastic gradient ascent, which requires updating and inverting I_c after each iteration.

  16. Stochastic approximations of EM (2). Stochastic E-Step: Online EM (2009) [7]
     Q̂^(N)(θ) = Q̂^(N−1)(θ) + α(N) ( E_{θ̂^(N−1)}[ log p(x_N, z_N; θ) | x_N ] − Q̂^(N−1)(θ) )
     In the case of a MEF, the algorithm works only with the conditional expectation of the sufficient statistics for the complete data:
     ẑ_{N,j} = E_{θ̂^(N−1)}[ z_{N,j} | x_N ]
     ( Ŝ_{w_j}^(N), Ŝ_{θ_j}^(N) ) = ( Ŝ_{w_j}^(N−1), Ŝ_{θ_j}^(N−1) ) + α(N) ( ( ẑ_{N,j}, ẑ_{N,j} s_j(x_N) ) − ( Ŝ_{w_j}^(N−1), Ŝ_{θ_j}^(N−1) ) )
     The M-Step is unchanged:
     ŵ_j^(N) = Ŝ_{w_j}^(N),   η̂_j^(N) = Ŝ_{θ_j}^(N) / Ŝ_{w_j}^(N),   θ̂_j^(N) = (∇F_j)^{−1}( η̂_j^(N) )
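An illustrative Python sketch of this scheme, not from the slides, for a univariate Gaussian MEF: the running statistics S_w[j] and S_s[j] track E[z_j] and E[z_j·s_j(x)] with s(x) = (x, x²); the step-size policy and initial values are assumptions.

```python
import numpy as np
from scipy.stats import norm

def online_em(stream, w0, mu0, var0, alpha=lambda n: (n + 1) ** -0.6):
    """Online EM sketch for a univariate Gaussian MEF; the initial statistics act
    as the 'prior' mentioned on the next slide: S_w = w_j, S_s = w_j * eta_j^(0)."""
    w0, mu0, var0 = (np.asarray(a, float) for a in (w0, mu0, var0))
    S_w = w0.copy()
    S_s = np.stack([w0 * mu0, w0 * (var0 + mu0 ** 2)], axis=1)
    for n, x in enumerate(stream, start=1):
        # M-Step (unchanged): current parameters from the running statistics
        mu = S_s[:, 0] / S_w
        var = S_s[:, 1] / S_w - mu ** 2
        # Stochastic E-Step: responsibilities of x_N, then a convex update of (S_w, S_s)
        dens = S_w * norm.pdf(x, loc=mu, scale=np.sqrt(var))
        z = dens / dens.sum()
        a = alpha(n)
        S_w = (1 - a) * S_w + a * z
        S_s = (1 - a) * S_s + a * z[:, None] * np.array([x, x * x])
    mu = S_s[:, 0] / S_w
    return S_w, mu, S_s[:, 1] / S_w - mu ** 2

rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(0, 1, 2000), rng.normal(4, 2, 3000)])
rng.shuffle(data)
print(online_em(data, [0.4, 0.6], [-1.0, 3.0], [1.0, 1.0]))  # roughly (0.4, 0.6), (0, 4), (1, 4)
```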

  17. Stochastic approximations of EM (3). Some properties:
     The initial values Ŝ^(0) may be used to introduce a "prior": Ŝ_{w_j}^(0) = w_j, Ŝ_{θ_j}^(0) = w_j η_j^(0).
     Parameter constraints are automatically respected.
     No matrix to invert!
     A policy for α(N) has to be chosen (see [7]).
     Consistent, and asymptotically equivalent to the recursive EM!

  18. Stochastic approximations of k-MLE (1). In order to keep the previous advantages of online EM in an online k-MLE, our only choice concerns how x_N is assigned to a cluster.
     Strategy 1: maximize the likelihood of the complete data (x_N, z_N):
     z̃_{N,j} = [ argmax_{j'} ŵ_{j'}^(N−1) f_{j'}(x_N; θ̂_{j'}^(N−1)) = j ]
     Equivalent to online CEM, and similar to MacQueen's iterative k-means.

  19. Stochastic approximations of k-MLE (2).
     Strategy 2: maximize the likelihood of the complete data (x_N, z_N) after the M-Step:
     z̃_{N,j} = [ argmax_{j'} ŵ_{j'}^(N) f_{j'}(x_N; θ̂_{j'}^(N)) = j ]
     Similar to Hartigan's method for k-means. Additional cost: pre-compute all possible M-Steps for the stochastic E-Step.

  20. Stochastic approximations of k-MLE (3).
     Strategy 3: draw z̃_N from the categorical distribution
     z̃_N ∼ Cat_K( { p_j ∝ ŵ_j^(N−1) f_j(x_N; θ̂_j^(N−1)) }_j )
     Similar to the sampling step in stochastic EM [3]. The motivation is to try to break the inconsistency of k-MLE.
     For strategies 1 and 3, the M-Step reduces to updating the parameters of a single component.
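Not from the slides: an illustrative Python sketch of the assignment rules for strategies 1 and 3, using hypothetical per-component density functions; strategy 2 would apply the same argmax but with the parameters obtained after a trial M-Step for each candidate component, which is omitted here.

```python
import numpy as np
from scipy.stats import norm

def assign_strategy1(x, w, comp_pdf, params):
    """Strategy 1: hard argmax with the current parameters (online CEM style)."""
    scores = np.array([w_j * comp_pdf(x, p_j) for w_j, p_j in zip(w, params)])
    return int(scores.argmax())

def assign_strategy3(x, w, comp_pdf, params, rng):
    """Strategy 3: sample the label from the normalized posterior (stochastic EM style)."""
    scores = np.array([w_j * comp_pdf(x, p_j) for w_j, p_j in zip(w, params)])
    return int(rng.choice(len(w), p=scores / scores.sum()))

# Hypothetical usage with Gaussian components parametrized by (mu, sigma):
comp_pdf = lambda x, p: norm.pdf(x, loc=p[0], scale=p[1])
w, params = [0.3, 0.7], [(0.0, 1.0), (4.0, 2.0)]
rng = np.random.default_rng(4)
print(assign_strategy1(2.5, w, comp_pdf, params), assign_strategy3(2.5, w, comp_pdf, params, rng))
```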
