
Pitfalls in Mixtures from the Clustering Angle, C. Biernacki (with G. Castellan, S. Chrétien, B. Guedj, V. Vandewalle). Slide presentation transcript.



1. Pitfalls in Mixtures from the Clustering Angle
C. Biernacki (with G. Castellan, S. Chrétien, B. Guedj, V. Vandewalle)
Working Group on Model-Based Clustering Summer Session, Paris, July 17-23, 2016

2. Take home message
Computational estimates $\tilde\theta$ are the combined result of five factors:
1. an initial practitioner target $t$
2. a data set $x$
3. a theoretical model $m$
4. a theoretical estimate $\hat\theta$
5. an estimation algorithm $A$
$$\tilde\theta = f(t, x, m, \hat\theta, A)$$

This talk
The pitfalls in mixtures considered here are degeneracy and label switching
Consequences on $\tilde\theta$ can be disastrous
Often, solutions are sought in $m$ or $\hat\theta$
Here we also explore solutions through $t$ and $A$
Focus target $t$: clustering
Focus algorithms $A$: EM, SEM, Gibbs

3. Outline
1. Overview
2. The degeneracy problem
   - Individual data
   - Binned data
   - Missing data
3. Avoiding degeneracy
   - Adding a minimal clustering information
   - Strategy 1: a data-driven lower bound on variances
   - Strategy 2: an approximate EMgood algorithm
4. The label switching problem
   - The problem
   - Existing solutions
   - Proposed solution (in progress)
5. Conclusion

4. Unbounded likelihood
$d$-variate Gaussian mixture with $g$ components and $\theta = (\{\pi_k\}, \{\mu_k\}, \{\Sigma_k\})$:
$$p(x; \theta) = \sum_{k=1}^{g} \pi_k \underbrace{\frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu_k)' \Sigma_k^{-1} (x - \mu_k)\right)}_{p(x;\, \mu_k, \Sigma_k)}$$
Sampling: $\mathbf{x} = (x_1, \ldots, x_n) \overset{\text{i.i.d.}}{\sim} p(\cdot\,; \theta)$
Likelihood: $\ell(\theta; \mathbf{x}) = p(\mathbf{x}; \theta)$
Fixing a center on a particular data point, $\mu_2 = x_i$, gives $\lim_{|\Sigma_2| \to 0} \ell(\theta; \mathbf{x}) = +\infty$
[Kiefer and Wolfowitz, 1956] [Day, 1969]
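To make the blow-up concrete, here is a minimal numerical sketch (not from the slides; the sample, the two-component setup, and the variance grid are illustrative assumptions): one center is fixed on a data point and that component's variance is shrunk, so the log-likelihood grows without bound.

```python
import numpy as np
from scipy.stats import norm

# Toy 1-D, 2-component mixture: pin the second center on one data point
# and let its variance shrink; the log-likelihood diverges to +inf.
x = np.array([-1.2, 0.3, 0.8, 1.5, 4.0])    # arbitrary sample (assumption)
pi = np.array([0.5, 0.5])
mu = np.array([0.0, x[2]])                  # mu_2 = a particular data point

def log_lik(sigma2_2, sigma2_1=1.0):
    dens = (pi[0] * norm.pdf(x, mu[0], np.sqrt(sigma2_1))
            + pi[1] * norm.pdf(x, mu[1], np.sqrt(sigma2_2)))
    return np.sum(np.log(dens))

for s2 in [1.0, 1e-2, 1e-4, 1e-8]:
    print(f"sigma2_2 = {s2:g}  ->  log-likelihood = {log_lik(s2):.2f}")
# The printed log-likelihood grows without bound as sigma2_2 -> 0.
```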

5. EM behaviour: illustration
[Figure: fitted mixture densities at EM iterations 1, 2, 50, 77, 78, 79, 80, 81 and 82 (density vs. x), showing one component progressively collapsing onto a single point]
- degeneracy may occur even when starting from large variances
- convergence can be slow when far from the degenerate limit
- convergence is extremely fast near degeneracy
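The kind of run behind such a figure can be reproduced with a plain, unconstrained EM. The sketch below is an assumption-laden illustration (invented data with one isolated point, large starting variances): one component ends up owning a single observation and its variance collapses within a few iterations.

```python
import numpy as np
from scipy.stats import norm

# Tiny 1-D sample with one isolated point; both components start with a
# large variance, yet EM drives the second component onto that point.
x = np.array([-1.5, -0.8, 0.1, 0.6, 1.3, 8.0])   # invented data
pi = np.array([0.5, 0.5])
mu = np.array([0.0, 8.0])                         # second mean near the isolated point
s2 = np.array([4.0, 4.0])                         # large initial variances

for it in range(100):
    # E step: posterior membership probabilities
    dens = np.stack([pi[k] * norm.pdf(x, mu[k], np.sqrt(s2[k])) for k in range(2)])
    resp = dens / dens.sum(axis=0)
    # M step: weights, means, variances (no lower bound imposed on s2)
    nk = resp.sum(axis=1)
    pi = nk / len(x)
    mu = (resp @ x) / nk
    s2 = np.array([np.sum(resp[k] * (x - mu[k]) ** 2) / nk[k] for k in range(2)])
    print(f"iteration {it}: variances = {s2}")
    if s2.min() < 1e-12:   # numerical degeneracy reached
        print("the second component has collapsed onto the isolated point")
        break
```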

6. EM behaviour: results
Let $u_0 = \left(1 - p_{i_0 k_0},\, \{p_{i k_0}\}_{i \neq i_0}\right)$, built from the posterior membership probabilities of component $k_0$. Degeneracy of component $k_0$ at $x_{i_0}$ $\Leftrightarrow$ $\|u_0\| \to 0$
[Figure: component $k_0$ concentrating on the point $x_{i_0}$, with posterior probabilities $p_{i_0 k_0}$ and $p_{i k_0}$]
[Biernacki and Chrétien, 2003] [Ingrassia and Rocci, 2009]

Proposition 1: existence of a basin of attraction
$\exists\, \epsilon > 0$ s.t. if $\|u_0\| \leq \epsilon$ then $\|u_0^+\| = o(\|u_0\|)$ with probability 1.

Proposition 2: speed towards degeneracy is exponential
$\exists\, \epsilon > 0$, $\alpha > 0$ and $\beta > 0$ s.t. if $\|u_0\| \leq \epsilon$ then, with probability 1,
$$|\Sigma^+_{k_0}| \leq \alpha / |\Sigma_{k_0}| \cdot \exp\left(-\beta / |\Sigma_{k_0}|\right).$$

7. Consequences of the EM study
When EM is close to degeneracy, the EM mapping is contracting and reaches numerical tolerance extremely quickly
⇓
Simply restarting EM when numerical tolerance is reached (the pragmatic behaviour of EM practitioners) is therefore somewhat justified
⇓
However, the numerical tolerance then acts as an arbitrary lower bound on $|\Sigma_k|$...
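A sketch of that pragmatic restart strategy, assuming a simple 1-D two-component EM; the tolerance value, the data, and the reinitialization scheme are invented for illustration. Runs whose smallest variance falls below the tolerance are discarded and the best remaining solution is kept.

```python
import numpy as np
from scipy.stats import norm

TOL = 1e-8   # arbitrary lower bound on variances (the point of the slide)

def em_run(x, rng, n_iter=200):
    """One EM run for a 2-component 1-D Gaussian mixture; returns
    (log-likelihood, params) or None if a variance hits the tolerance."""
    pi = np.array([0.5, 0.5])
    mu = rng.choice(x, size=2, replace=False).astype(float)
    s2 = np.array([np.var(x), np.var(x)])
    for _ in range(n_iter):
        dens = np.stack([pi[k] * norm.pdf(x, mu[k], np.sqrt(s2[k])) for k in range(2)])
        resp = dens / dens.sum(axis=0)
        nk = resp.sum(axis=1)
        pi, mu = nk / len(x), (resp @ x) / nk
        s2 = np.array([np.sum(resp[k] * (x - mu[k]) ** 2) / nk[k] for k in range(2)])
        if s2.min() < TOL:          # degeneracy detected: abandon this run
            return None
    ll = np.sum(np.log(np.stack([pi[k] * norm.pdf(x, mu[k], np.sqrt(s2[k]))
                                 for k in range(2)]).sum(axis=0)))
    return ll, (pi, mu, s2)

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 40), rng.normal(5, 1, 40)])  # invented data
runs = [em_run(x, rng) for _ in range(10)]
best = max((r for r in runs if r is not None), key=lambda r: r[0])
print("best non-degenerate log-likelihood:", best[0])
```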

8. Outline (reminder)
2. The degeneracy problem
   - Individual data
   - Binned data
   - Missing data

9. Binned data
A binned partition of $\mathbb{R}$ into $H$ intervals $\Omega_1, \ldots, \Omega_H$: $\Omega_h = ]\alpha_h, \beta_h[$
The individuals $x_i$ are unknown; only the interval in which $x_i$ lies is known
The Gaussian mixture hypothesis on the $x_i$'s is unchanged
The log-likelihood is written
$$\ell(\theta) = \sum_{h=1}^{H} m_h \ln \underbrace{\sum_{k=1}^{K} \pi_k \overbrace{\int_{\Omega_h} f_k(x)\, dx}^{a_{kh}}}_{p(X \in \Omega_h)}, \qquad m_h = \#\Omega_h \text{ (number of individuals in bin } h\text{)}$$

Question: does degeneracy still exist, since $\ell(\theta) \leq 0$?
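For concreteness, this binned log-likelihood can be evaluated with Gaussian CDF differences. The bin edges, counts, and parameter values below are made up; only the formula mirrors the slide.

```python
import numpy as np
from scipy.stats import norm

# Bin edges: alpha_h = edges[h-1], beta_h = edges[h]; counts holds the m_h.
edges = np.array([-np.inf, -1.0, 0.0, 1.0, 2.0, np.inf])   # assumed bins
counts = np.array([3, 10, 12, 8, 2])                        # assumed m_h

def binned_loglik(pi, mu, sigma):
    """ell(theta) = sum_h m_h * ln( sum_k pi_k * integral over Omega_h of f_k )."""
    # a_kh = P(X in Omega_h | component k), via Gaussian CDF differences
    a = np.array([norm.cdf(edges[1:], m, s) - norm.cdf(edges[:-1], m, s)
                  for m, s in zip(mu, sigma)])               # shape (K, H)
    p_bin = pi @ a                                           # P(X in Omega_h)
    return np.sum(counts * np.log(p_bin))

print(binned_loglik(pi=np.array([0.6, 0.4]),
                    mu=np.array([-0.5, 1.2]),
                    sigma=np.array([0.8, 0.6])))
# The result is always <= 0, since each bin probability is <= 1.
```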

10. Degeneracy may still happen!
Proposition 3
Let, for all $b \in \mathbb{N}$:
- a sequence $\{\epsilon_b\}$ with $\epsilon_b > 0$ and $\epsilon_b \to 0$ as $b \to \infty$
- bins $\Omega^b_h$, $h = 1, \ldots, H_b$, such that if $\beta^b_h - \alpha^b_h \geq \epsilon_b$ then $m^b_h = 0$
- $\Omega^b_{h_0}$ a non-empty interval and $k_0 \in \{1, \ldots, K\}$ a component
- $\hat\theta^b$ the unique consistent root of the ML associated to $(\Omega^b_h, m^b_h)$
- $\ell^b(\theta) \to \ell^b_{\mathrm{deg}}(\theta)$ when $\mu_{k_0} \in \Omega_{h_0}$ and $\Sigma_{k_0} \to 0$.
Then there exists $B \in \mathbb{N}$ such that for all $b > B$ we have $\ell^b_{\mathrm{deg}}(\hat\theta^b) \geq \ell^b(\hat\theta^b)$.

Sketch of proof
First, we show that, for all $\theta$, there exists $B_\theta \in \mathbb{N}$ such that for all $b > B_\theta$ we have $\ell^b_{\mathrm{deg}}(\theta) \geq \ell^b(\theta)$. Then we conclude by noting that $B = \sup_\theta B_\theta$.

11. Meaning
If the width of the non-empty bins is "small enough", then the global maximum of the likelihood is reached in a degenerate situation.
[Figure: histograms and fitted mixture densities. Left, bar width 1: degenerate mixture (L = -12.69) vs. non-degenerate mixture (L = -11.44). Right, bar width 0.2: degenerate mixture (L = -20.9) vs. non-degenerate mixture (L = -21.11), i.e. with narrow bins the degenerate fit wins.]
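A small check of this mechanism under assumed data and narrow bins: unlike the individual-data case, the binned log-likelihood stays bounded ($\ell(\theta) \leq 0$) as one component's variance shrinks towards 0, so the degenerate limit is a finite value that competes with the non-degenerate optimum; whether it wins depends on the bin width, as in the figure.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(1.0, 0.7, 30), rng.normal(4.0, 0.7, 30)])  # invented data
edges = np.arange(-1.0, 7.01, 0.2)                 # narrow bins of width 0.2 (assumption)
counts, _ = np.histogram(x, bins=edges)

def binned_loglik(pi, mu, sigma):
    # a_kh = P(X in Omega_h | component k), via Gaussian CDF differences
    a = np.array([norm.cdf(edges[1:], m, s) - norm.cdf(edges[:-1], m, s)
                  for m, s in zip(mu, sigma)])
    return np.sum(counts * np.log(pi @ a))

# Shrink the second component's standard deviation with its mean inside a bin:
for s in [0.7, 0.1, 0.01, 0.001]:
    ll = binned_loglik(np.array([0.5, 0.5]), np.array([1.0, 4.05]), np.array([0.7, s]))
    print(f"sigma_2 = {s:<6g} binned log-likelihood = {ll:.2f}")
# The values converge to a finite limit: degeneracy no longer sends ell to +inf,
# yet (per Proposition 3) that limit can exceed the non-degenerate maximum
# when the non-empty bins are narrow enough.
```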

12. EM behaviour in a degeneracy neighbourhood?
Recall: component $k_0$ degenerates inside $\Omega_{h_0}$ $\Leftrightarrow$ $\mu_{k_0} \in \Omega_{h_0}$ and $\Sigma_{k_0} \to 0$

Notations
- $\Omega_{h'_0}$: the bin closest to the center $\mu_{k_0}$ (to the left or to the right of $\Omega_{h_0}$)
- $\gamma$: the borderline of $\Omega_{h_0}$ closest to $\mu_{k_0}$ (either $\alpha_{h_0}$ or $\beta_{h_0}$)
- $\eta = |\gamma - \mu_{k_0}|$: distance between the center and this closest borderline
- $\sigma = \mathrm{sign}(\gamma - \mu_{k_0})$ and $u = \Sigma_{k_0}\, f_{k_0}(\gamma)$
- $R_h = (\pi_{k_0} + A_{k_0 h_0}) / A_{k_0 h}$ with $A_{k_0 h} = \sum_{k \neq k_0} \pi_k a_{kh}$

13. Possibility to be attracted around degeneracy
Proposition 4
There exists $\epsilon > 0$ such that, if
- $0 < \Sigma_{k_0} < \epsilon$
- $\eta \in (\delta, \Delta - \sqrt{\Sigma_{k_0}})$ with $0 < \delta < \Delta < (\beta_{h_0} - \alpha_{h_0})/2$
- $1 - \dfrac{m_{h'_0}}{m_{h_0}} R_{h'_0} > 0$
then
$$0 < \Sigma^+_{k_0} < \Sigma_{k_0}\Bigg[1 - \underbrace{\frac{\delta}{2\sqrt{2\pi \Sigma_{k_0}}}\left(1 - \frac{m_{h'_0}}{m_{h_0}} R_{h'_0}\right) e^{-\Delta^2/(2\Sigma_{k_0})}}_{\rho}\Bigg]$$
and
$$\eta^+ \in \left(\delta,\, \Delta - \sqrt{\Sigma^+_{k_0}}\right).$$

14. Sketch of proof
It relies on Taylor expansions around $\Sigma_{k_0} = 0$ with $\mu_{k_0} \in \Omega_{h_0}$:
$$\mu^+_{k_0} = \mu_{k_0} - \sigma\rho u + o(u) \qquad \text{and} \qquad \Sigma^+_{k_0} = \Sigma_{k_0} - \eta\rho u + o(u).$$
The inequality on $\Sigma_{k_0}$ then follows easily. For the second expression, we obtain in the same manner (for $\Sigma_{k_0}$ "small enough")
$$\delta < |\gamma - \mu^+_{k_0}| < \Delta - \sqrt{\Sigma^+_{k_0}}.$$
Thus $|\gamma - \mu^+_{k_0}| < \Delta < (\beta_{h_0} - \alpha_{h_0})/2$, so $\gamma^+ = \gamma$ (the closest borderline is unchanged). Since $\eta^+ = |\gamma - \mu^+_{k_0}|$, the conclusion follows.

15. Attraction or repulsion?
Around a degenerate solution, EM moves closer to it or further away depending on the sign of $\rho$, which itself depends on the sample size of the "closest" bin.

Attraction: $\rho > 0$
From the theorem, if $\Sigma_{k_0}$ is "close enough" to 0 and $\mu_{k_0} \in \Omega_{h_0}$, then
$$0 < \Sigma^+_{k_0} < \Sigma_{k_0}\left[1 - \rho \times |\mathrm{fcte}(\theta)|\right] \quad (\Sigma_{k_0} \text{ decreases}) \qquad \text{and} \qquad \mu^+_{k_0} \in \Omega_{h_0}.$$

Repulsion: $\rho < 0$
Taylor: $\Sigma^+_{k_0} = \Sigma_{k_0} - \eta\rho u + o(u)$, so $\Sigma_{k_0}$ increases.
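Under the reconstruction of Proposition 4 above, the sign of $\rho$ reduces to the sign of $1 - (m_{h'_0}/m_{h_0})\, R_{h'_0}$, since the remaining factors are positive. The sketch below evaluates that sign for invented bin counts and mixture parameters; every number here is an assumption used only to exercise the formula, not a result from the talk.

```python
import numpy as np
from scipy.stats import norm

# Sign of rho (attraction vs repulsion) from the factor 1 - (m_{h0'}/m_{h0}) * R_{h0'}.
edges = np.array([0.0, 1.0, 2.0, 3.0])        # three bins; Omega_{h0} = [1, 2) (assumed)
counts = np.array([4, 10, 1])                  # m_h (assumed)
pi = np.array([0.3, 0.7])                      # component k0 = 0 is the degenerating one
mu_other, sd_other = 1.5, 1.0                  # the remaining, non-degenerate component

def a_kh(h, m, s):
    # a_kh = P(X in Omega_h | component k), via Gaussian CDF differences
    return norm.cdf(edges[h + 1], m, s) - norm.cdf(edges[h], m, s)

def A(h):
    # A_{k0 h} = sum over k != k0 of pi_k * a_kh (only one other component here)
    return pi[1] * a_kh(h, mu_other, sd_other)

h0, h0p = 1, 2                                 # degenerating bin and its closest neighbour
R = (pi[0] + A(h0)) / A(h0p)                   # R_{h0'}
sign_factor = 1.0 - counts[h0p] / counts[h0] * R
print("attraction towards degeneracy" if sign_factor > 0 else "repulsion away from it")
```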
