On learning statistical mixtures maximizing the complete likelihood
The k-MLE methodology using geometric hard clustering
Frank Nielsen (École Polytechnique / Sony Computer Science Laboratories)
MaxEnt 2014, September 21-26, 2014, Amboise, France


1. On learning statistical mixtures maximizing the complete likelihood. The k-MLE methodology using geometric hard clustering. Frank Nielsen, École Polytechnique and Sony Computer Science Laboratories. MaxEnt 2014, September 21-26, 2014, Amboise, France.

2. Finite mixtures: Semi-parametric statistical models
- Mixture $M \sim \mathrm{MM}(W, \Lambda)$ with density $m(x) = \sum_{i=1}^{k} w_i p(x \mid \lambda_i)$ (not a sum of RVs!), where $\Lambda = \{\lambda_i\}_i$ and $W = \{w_i\}_i$
- Multimodal, universally modeling smooth densities
- Gaussian MMs with support $\mathcal{X} = \mathbb{R}$, Gamma MMs with support $\mathcal{X} = \mathbb{R}^+$ (modeling distances [34])
- Pioneered by Karl Pearson [29] (1894); precursors: Francis Galton [13] (1869), Adolphe Quetelet [31] (1846), etc.
- Capture sub-populations within an overall population (k = 2, crab data [29] in Pearson)
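To make the mixture density concrete, here is a minimal sketch evaluating $m(x)$ pointwise; it assumes 1D Gaussian components and uses SciPy for the component densities (an illustrative choice, not something prescribed by the slides).

```python
# Minimal sketch of m(x) = sum_j w_j p(x | lambda_j), assuming 1D Gaussian components.
import numpy as np
from scipy.stats import norm

def mixture_density(x, weights, means, stds):
    """Evaluate the mixture density m(x) pointwise (x may be a scalar or an array)."""
    x = np.asarray(x, dtype=float)
    # comp[i, j] = p(x_i | lambda_j) for every point/component pair
    comp = norm.pdf(x[..., None], loc=means, scale=stds)
    return comp @ np.asarray(weights)

m = mixture_density([0.0, 1.5], weights=[0.3, 0.7], means=[-2.0, 3.0], stds=[1.0, 0.5])
```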

3. Example of a k = 2-component mixture [17]
Sub-populations (k = 2) within an overall population... sub-species within a species, etc.
Truncated distributions (what is the support? black swans?!)

4. Sampling from mixtures: Doubly stochastic process
To sample a variate x from a MM:
- Choose a component l according to the weight distribution $w_1, \ldots, w_k$ (multinomial),
- Draw a variate x according to $p(x \mid \lambda_l)$.
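A minimal sketch of this two-step sampler, assuming 1D Gaussian components (any per-component sampler would work the same way):

```python
# Doubly stochastic sampling from a mixture, assuming 1D Gaussian components.
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(weights, means, stds, n):
    """Step 1: pick component labels l ~ Multinomial(W); step 2: draw x ~ p(x | lambda_l)."""
    labels = rng.choice(len(weights), size=n, p=weights)
    x = rng.normal(loc=np.take(means, labels), scale=np.take(stds, labels))
    return x, labels

x, labels = sample_mixture(weights=[0.3, 0.7], means=[-2.0, 3.0], stds=[1.0, 0.5], n=1000)
```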

5. Statistical mixtures: Generative data models
Image = 5D xyRGB point set. GMM = feature descriptor for information retrieval (IR).
Increase the dimension d using $s \times s$ color image patches: $d = 2 + 3s^2$.
[Figure: source image, its GMM, and a sample drawn from the GMM (statistical image)]
Low-frequency information is encoded into a compact statistical model.

6. Mixtures: ε-statistically learnable and ε-estimates
Problem statement: Given n IID d-dimensional observations $x_1, \ldots, x_n \sim \mathrm{MM}(\Lambda, W)$, estimate $\mathrm{MM}(\hat\Lambda, \hat W)$:
- Theoretical Computer Science (TCS) approach: ε-close parameter recovery (π: a permutation)
  - $|w_i - \hat w_{\pi(i)}| \leq \epsilon$
  - $\mathrm{KL}(p(x \mid \lambda_i) : p(x \mid \hat\lambda_{\pi(i)})) \leq \epsilon$ (or other divergences like TV, etc.)
  Consider ε-learnable MMs:
  - $\min_i w_i \geq \epsilon$
  - $\mathrm{KL}(p(x \mid \lambda_i) : p(x \mid \lambda_j)) \geq \epsilon, \ \forall i \neq j$ (or another divergence)
- Statistical approach: Define the best model/MM as the one maximizing the likelihood function $l(\Lambda, W) = \prod_i m(x_i \mid \Lambda, W)$.

7. Mixture inference: Incomplete versus complete likelihood
- Sub-populations within an overall population: the observed data $x_i$ does not include the sub-population label $l_i$
- k = 2: classification and Bayes error (upper bounded by the Chernoff information [24])
- Inference: assume IID data and maximize the (log-)likelihood:
- Complete likelihood, using indicator variables $z_{i,j}$ (with $z_{i,l_i} = 1$ for label $l_i$):
  $l_c = \log \prod_{i=1}^{n} \prod_{j=1}^{k} (w_j p(x_i \mid \theta_j))^{z_{i,j}} = \sum_i \sum_j z_{i,j} \log(w_j p(x_i \mid \theta_j))$
- Incomplete likelihood (hidden/latent variables) and its log-sum intractability:
  $l_i = \log \prod_i m(x_i \mid W, \Lambda) = \sum_i \log\Big(\sum_j w_j p(x_i \mid \theta_j)\Big)$
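The two criteria are easy to compare numerically. A sketch assuming 1D Gaussian components, with SciPy standing in for $\log p(x \mid \theta_j)$ and hard labels playing the role of the indicator variables:

```python
# Complete vs. incomplete log-likelihood of a mixture, assuming 1D Gaussian components.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def incomplete_loglik(x, w, means, stds):
    """l_i = sum_i log sum_j w_j p(x_i | theta_j)  (log-sum over the latent labels)."""
    x = np.asarray(x, dtype=float)
    log_terms = np.log(w) + norm.logpdf(x[:, None], loc=means, scale=stds)  # (n, k)
    return logsumexp(log_terms, axis=1).sum()

def complete_loglik(x, z, w, means, stds):
    """l_c = sum_i sum_j z_ij log(w_j p(x_i | theta_j)) for hard labels z in {0, ..., k-1}."""
    x = np.asarray(x, dtype=float)
    log_terms = np.log(w) + norm.logpdf(x[:, None], loc=means, scale=stds)  # (n, k)
    return log_terms[np.arange(len(x)), z].sum()
```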

8. Mixture learnability and inference algorithms
- Which criterion to maximize: the incomplete or the complete likelihood? What kind of evaluation criteria?
- From Expectation-Maximization [8] (1977) to TCS methods: polynomial learnability of mixtures [22, 15] (2014), mixtures and core-sets [10] for massive data sets, etc.
Some technicalities:
- Many local maxima of the likelihood functions $l_i$ and $l_c$ (EM converges locally and needs a stopping criterion)
- Multimodal density (# modes > k [9], ghost modes even for isotropic GMMs)
- Identifiability (permutation of labels, parameter distinctness)
- Irregularity: the Fisher information may be zero [6], convergence speed of EM
- etc.

9. Learning MMs: A geometric hard clustering viewpoint
$\max_{W,\Lambda} l_c(W,\Lambda) = \max_{W,\Lambda} \sum_{i=1}^{n} \max_{j=1}^{k} \log(w_j p(x_i \mid \theta_j))$
$\equiv \min_{W,\Lambda} \sum_i \min_j (-\log p(x_i \mid \theta_j) - \log w_j)$
$= \min_{W,\Lambda} \sum_{i=1}^{n} \min_{j=1}^{k} D_j(x_i)$,
where $c_j = (w_j, \theta_j)$ (cluster prototype) and $D_j(x_i) = -\log p(x_i \mid \theta_j) - \log w_j$ are potential distance-like functions.
- Maximizing the complete likelihood amounts to a geometric hard clustering [37, 11] for fixed $w_j$'s (the distance $D_j(\cdot)$ depends on the cluster prototype $c_j$): $\min_\Lambda \sum_i \min_j D_j(x_i)$.
- Related to classification EM [5] (CEM) and hard/truncated EM
- The solution of $\arg\max l_c$ can be used to initialize $l_i$ (then optimized by EM)
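Spelling the potentials out in code makes the equivalence transparent; a sketch assuming 1D Gaussian components (any log-density would do):

```python
# D_j(x) = -log p(x | theta_j) - log w_j and the objective sum_i min_j D_j(x_i),
# assuming 1D Gaussian components; minimizing the objective maximizes l_c.
import numpy as np
from scipy.stats import norm

def potentials(x, w, means, stds):
    """Return the (n, k) matrix D[i, j] = -log p(x_i | theta_j) - log w_j."""
    x = np.asarray(x, dtype=float)
    return -norm.logpdf(x[:, None], loc=means, scale=stds) - np.log(w)

def hard_clustering_objective(x, w, means, stds):
    """sum_i min_j D_j(x_i): the geometric hard-clustering criterion."""
    return potentials(x, w, means, stds).min(axis=1).sum()
```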

10. The k-MLE method: k-means type clustering algorithms
k-MLE:
1. Initialize the weights W (in the open probability simplex $\Delta_k$)
2. Solve $\min_\Lambda \sum_i \min_j D_j(x_i)$ (center-based clustering, W fixed)
3. Solve $\min_W \sum_i \min_j D_j(x_i)$ ($\Lambda$ fixed)
4. Test for convergence, otherwise go to step 2.
⇒ group coordinate ascent (on the likelihood) / descent (on the distances) optimization.
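A compact sketch of this alternating loop for 1D Gaussian mixtures; the random initialization, the fixed iteration count, and the small variance floor are simplifying assumptions, not part of the slides:

```python
# Sketch of the k-MLE loop for 1D Gaussian mixtures (illustrative assumptions:
# random initialization, fixed iteration count instead of a convergence test,
# and a small variance floor to avoid degenerate clusters).
import numpy as np
from scipy.stats import norm

def k_mle(x, k, n_iter=50, seed=0):
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    w = np.full(k, 1.0 / k)                        # step 1: weights in the open simplex
    means = rng.choice(x, size=k, replace=False)   # crude initialization of Lambda
    stds = np.full(k, x.std() + 1e-6)
    for _ in range(n_iter):
        # step 2: center-based clustering with D_j(x) = -log p(x | theta_j) - log w_j
        D = -norm.logpdf(x[:, None], loc=means, scale=stds) - np.log(w)
        labels = D.argmin(axis=1)
        for j in range(k):                         # per-cluster MLE relocation
            cj = x[labels == j]
            if cj.size:
                means[j], stds[j] = cj.mean(), max(cj.std(), 1e-3)
        # step 3: update W for fixed Lambda (cluster proportions, see slide 16)
        w = np.bincount(labels, minlength=k) / len(x)
        w = np.clip(w, 1e-12, None)
        w /= w.sum()
    return w, means, stds
```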

11. k-MLE: Center-based clustering, W fixed
Solve $\min_\Lambda \sum_i \min_j D_j(x_i)$
k-means type convergence proof for the assignment/relocation steps:
- Data assignment: $\forall i$, $l_i = \arg\max_j w_j p(x_i \mid \lambda_j) = \arg\min_j D_j(x_i)$, with $\mathcal{C}_j = \{x_i \mid l_i = j\}$
- Center relocation: $\forall j$, $\lambda_j = \mathrm{MLE}(\mathcal{C}_j)$
Farthest Maximum Likelihood (FML) Voronoi diagram:
$\mathrm{Vor}_{\mathrm{FML}}(c_i) = \{x \in \mathcal{X} : w_i p(x \mid \lambda_i) \geq w_j p(x \mid \lambda_j), \ \forall j \neq i\}$
$\mathrm{Vor}(c_i) = \{x \in \mathcal{X} : D_i(x) \leq D_j(x), \ \forall j \neq i\}$
The FML Voronoi diagram is an additively weighted Voronoi diagram with $D_l(x) = -\log p(x \mid \lambda_l) - \log w_l$.

12. k-MLE: Example for mixtures of exponential families
Exponential family: the component density $p(x \mid \theta) = \exp(t(x)^\top \theta - F(\theta) + k(x))$ is log-concave, with:
- $t(x)$: sufficient statistic in $\mathbb{R}^D$, where D is the family order
- $k(x)$: auxiliary carrier term (w.r.t. the Lebesgue/counting measure)
- $F(\theta)$: log-normalizer, cumulant function, log-partition
$D_j(x)$ is convex: the clustering is a k-means w.r.t. convex "distances".
The farthest ML Voronoi diagram is an additively-weighted Bregman Voronoi diagram [4]:
$-\log p(x; \theta) - \log w = F(\theta) - t(x)^\top \theta - k(x) - \log w = B_{F^*}(t(x) : \eta) - F^*(t(x)) - k(x) - \log w$
with $F^*(\eta) = \max_\theta (\theta^\top \eta - F(\theta))$ the Legendre-Fenchel convex conjugate.
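As a numerical sanity check of this rewriting, the following sketch uses the exponential distribution ($t(x)=x$, $\theta=-\lambda$, $F(\theta)=-\log(-\theta)$, $\eta=1/\lambda$, $F^*(\eta)=-1-\log\eta$, $k(x)=0$); this family is an illustrative assumption, not one of the cases treated on the slides.

```python
# Check that -log p(x; theta) - log w = B_{F*}(t(x) : eta) - F*(t(x)) - k(x) - log w
# for the exponential distribution: t(x) = x, theta = -lam, F(theta) = -log(-theta),
# eta = 1/lam, F*(eta) = -1 - log(eta), k(x) = 0.
import numpy as np

def F_star(eta):
    return -1.0 - np.log(eta)

def bregman_F_star(p, q):
    """B_{F*}(p : q) = F*(p) - F*(q) - (p - q) F*'(q), with F*'(q) = -1/q."""
    return F_star(p) - F_star(q) - (p - q) * (-1.0 / q)

lam, w, x = 2.0, 0.4, 1.3
eta = 1.0 / lam                                     # expectation parameter
lhs = -(np.log(lam) - lam * x) - np.log(w)          # -log p(x; lam) - log w
rhs = bregman_F_star(x, eta) - F_star(x) - np.log(w)
assert np.isclose(lhs, rhs)                         # only B_{F*} and -log w depend on the component
```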

13. Exponential families: Rayleigh distributions [36, 25]
Application: IntraVascular UltraSound (IVUS) imaging.
Rayleigh distribution: $p(x; \lambda) = \frac{x}{\lambda^2} e^{-\frac{x^2}{2\lambda^2}}$, $x \in \mathbb{R}^+ = \mathcal{X}$
d = 1 (univariate), D = 1 (order 1)
$\theta = -\frac{1}{2\lambda^2}$, $\Theta = (-\infty, 0)$
$F(\theta) = -\log(-2\theta)$
$t(x) = x^2$
$k(x) = \log x$
(a Weibull distribution with shape k = 2)
Coronary plaques: fibrotic tissues, calcified tissues, lipidic tissues.
Rayleigh Mixture Models (RMMs): used for segmentation and classification tasks.
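Since $t(x) = x^2$ and $\eta = E[X^2] = 2\lambda^2$, the Rayleigh MLE is closed-form; a small sketch (the simulated data below is only a placeholder):

```python
# Rayleigh MLE via the sufficient statistic t(x) = x^2: eta_hat = mean(x^2) = 2 * lambda^2.
import numpy as np

def rayleigh_mle(x):
    """Closed-form MLE of the Rayleigh scale parameter lambda."""
    return np.sqrt(np.mean(np.asarray(x, dtype=float) ** 2) / 2.0)

rng = np.random.default_rng(0)
samples = rng.rayleigh(scale=1.5, size=10_000)   # placeholder data
lam_hat = rayleigh_mle(samples)                  # close to 1.5
```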

14. Exponential families: Multivariate Gaussians [14, 25]
Gaussian Mixture Models (GMMs). (A color image is interpreted as a 5D xyRGB point set.)
Gaussian density: $p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} \sqrt{|\Sigma|}} e^{-\frac{1}{2} D_{\Sigma^{-1}}(x - \mu, x - \mu)}$
Squared Mahalanobis distance: $D_Q(x, y) = (x - y)^\top Q (x - y)$
$x \in \mathbb{R}^d = \mathcal{X}$ (multivariate), order $D = \frac{d(d+3)}{2}$
$\theta = (\Sigma^{-1}\mu, \frac{1}{2}\Sigma^{-1}) = (\theta_v, \theta_M)$, $\Theta = \mathbb{R}^d \times \mathbb{S}_{++}^d$
$F(\theta) = \frac{1}{4}\theta_v^\top \theta_M^{-1} \theta_v - \frac{1}{2}\log|\theta_M| + \frac{d}{2}\log\pi$
$t(x) = (x, -x x^\top)$
$k(x) = 0$
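A small sketch of the $(\mu, \Sigma) \to (\theta_v, \theta_M)$ conversion and of the squared Mahalanobis distance appearing in the density (purely illustrative):

```python
# Natural parameters (theta_v, theta_M) = (Sigma^{-1} mu, (1/2) Sigma^{-1}) of a
# multivariate Gaussian, and the squared Mahalanobis distance D_Q(x, y).
import numpy as np

def gaussian_natural_params(mu, Sigma):
    prec = np.linalg.inv(Sigma)          # Sigma^{-1}
    return prec @ mu, 0.5 * prec         # (theta_v, theta_M)

def sq_mahalanobis(x, y, Q):
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(d @ Q @ d)              # (x - y)^T Q (x - y)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
theta_v, theta_M = gaussian_natural_params(mu, Sigma)
d2 = sq_mahalanobis([1.0, 2.0], mu, np.linalg.inv(Sigma))
```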

15. The k-MLE method for exponential families
k-MLE-EF:
1. Initialize the weights W (in the open probability simplex $\Delta_k$)
2. Solve $\min_\Lambda \sum_i \min_j (B_{F^*}(t(x_i) : \eta_j) - \log w_j)$
3. Solve $\min_W \sum_i \min_j D_j(x_i)$
4. Test for convergence, otherwise go to step 2.
The assignment condition in step 2 induces an additively-weighted Bregman Voronoi diagram.

16. k-MLE: Solving for weights given component parameters
Solve $\min_W \sum_i \min_j D_j(x_i)$
This amounts to $\arg\min_W -\sum_j n_j \log w_j = \arg\min_W -\sum_j \frac{n_j}{n} \log w_j$, where $n_j = \#\{x_i \in \mathrm{Vor}(c_j)\} = |\mathcal{C}_j|$, i.e.,
$\min_{W \in \Delta_k} H^\times(N : W)$
where $N = (\frac{n_1}{n}, \ldots, \frac{n_k}{n})$ is the cluster point-proportion vector in $\Delta_k$.
The cross-entropy $H^\times$ is minimized when $H^\times(N : W) = H(N)$, that is, when $W = N$.
Kullback-Leibler divergence: $\mathrm{KL}(N : W) = H^\times(N : W) - H(N) = 0$ when $W = N$.
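In code, the weight update is therefore just the vector of cluster proportions; a short sketch:

```python
# Weight update: W = N, the vector of per-cluster point proportions, minimizes H(N : W).
import numpy as np

def update_weights(labels, k):
    """Return (n_1/n, ..., n_k/n) for hard labels in {0, ..., k-1}."""
    counts = np.bincount(labels, minlength=k)
    return counts / counts.sum()

w = update_weights(np.array([0, 1, 1, 2, 1, 0]), k=3)   # -> [1/3, 1/2, 1/6]
```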

17. MLE for exponential families
Given an ML farthest Voronoi partition, compute the MLEs $\hat\theta_j$:
$\hat\theta_j = \arg\max_{\theta \in \Theta} \prod_{x_i \in \mathrm{Vor}(c_j)} p_F(x_i; \theta)$
The maximum is unique (***) since $\nabla^2 F(\theta) \succ 0$:
Moment equation: $\nabla F(\hat\theta_j) = \eta(\hat\theta_j) = \frac{1}{n_j} \sum_{x_i \in \mathrm{Vor}(c_j)} t(x_i) = \bar{t} = \hat\eta$
The MLE is consistent and efficient, with an asymptotically normal distribution:
$\hat\theta_j \sim N\big(\theta_j, \frac{1}{n_j} I^{-1}(\theta_j)\big)$
Fisher information matrix: $I(\theta_j) = \mathrm{var}[t(X)] = \nabla^2 F(\theta_j) = (\nabla^2 F^*)^{-1}(\eta_j)$
The MLE may be biased (e.g., for normal distributions).
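In code, the moment equation says: average the sufficient statistics over the cluster, then invert the moment map $\nabla F$. A sketch for the univariate Gaussian with $t(x) = (x, -x^2)$, so that $\eta = (\mu, -(\mu^2 + \sigma^2))$ (this choice of family is an assumption for illustration):

```python
# MLE via the moment equation: eta_hat = average of t(x_i), theta recovered by inverting grad F.
# Univariate Gaussian: t(x) = (x, -x^2), eta = (mu, -(mu^2 + sigma^2)).
import numpy as np

def gaussian_mle_via_moments(cluster):
    x = np.asarray(cluster, dtype=float)
    eta_hat = np.array([x.mean(), -(x ** 2).mean()])   # average sufficient statistic
    mu_hat = eta_hat[0]
    var_hat = -eta_hat[1] - mu_hat ** 2                # invert the moment map
    return mu_hat, var_hat                             # note: the variance MLE is biased

mu_hat, var_hat = gaussian_mle_via_moments([1.2, 0.7, 2.1, 1.5])
```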

18. Existence of MLEs for exponential families (***)
For minimal and full EFs, the MLE is guaranteed to exist [3, 21] provided that the $n \times (D+1)$ matrix
$T = \begin{pmatrix} 1 & t_1(x_1) & \cdots & t_D(x_1) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & t_1(x_n) & \cdots & t_D(x_n) \end{pmatrix}$   (1)
has rank D + 1 [3].
For example, problems arise for MLEs of MVNs with n < d observations (the MLE is undefined, with likelihood tending to infinity).
Condition: $\bar{t} = \frac{1}{n_j} \sum_{x_i \in \mathrm{Vor}(c_j)} t(x_i) \in \mathrm{int}(C)$, where C is the closed convex support.
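The rank condition is straightforward to check numerically; a sketch using the univariate-Gaussian sufficient statistic $t(x) = (x, -x^2)$, i.e. D = 2 (an illustrative choice):

```python
# Existence check for the EF MLE: the n x (D+1) matrix T = [1, t_1(x_i), ..., t_D(x_i)]
# must have rank D + 1.  Illustrated for the univariate Gaussian, t(x) = (x, -x^2), D = 2.
import numpy as np

def mle_may_exist(x):
    x = np.asarray(x, dtype=float)
    T = np.column_stack([np.ones_like(x), x, -x ** 2])   # design matrix of sufficient statistics
    return np.linalg.matrix_rank(T) == T.shape[1]

mle_may_exist([1.0, 2.0, 3.0])   # True: enough distinct observations
mle_may_exist([1.0, 1.0])        # False: rank deficient, the likelihood is unbounded
```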
