

  1. On Clustering Histograms with k-Means by Using Mixed α-Divergences
  Entropy 16(6): 3273-3301 (2014)
  Frank Nielsen (1,2), Richard Nock (3), Shun-ichi Amari (4)
  (1) Sony Computer Science Laboratories, Japan. E-mail: Frank.Nielsen@acm.org
  (2) École Polytechnique, France
  (3) NICTA/ANU, Australia
  (4) RIKEN Brain Science Institute, Japan

  2. Clustering histograms
  ◮ Information Retrieval (IR) systems based on the bag-of-words paradigm (bag-of-textons, bag-of-features, bag-of-X)
  ◮ The role of distances:
    ◮ Initially, create a dictionary of "words" by quantizing with k-means clustering (depends on the underlying distance)
    ◮ At query time, find the "closest" (histogram) document by querying with the query histogram
  ◮ Notation: positive arrays h (counting histograms) versus frequency histograms h̃ (normalized counts), with d bins
  For IR systems, prefer symmetric distances (not necessarily metrics) like the Jeffreys divergence or the Jensen-Shannon divergence (both unified by a one-parameter family of divergences in [11])

  3. Ali-Silvey-Csiszár f-divergences
  An important class of divergences: the f-divergences [10, 1, 7], defined for a convex generator f (with f(1) = f'(1) = 0 and f''(1) = 1):
    I_f(p : q) := Σ_{i=1}^d q_i f(p_i / q_i)
  These divergences preserve information monotonicity [3] under any arbitrary transition probability (Markov morphisms). f-divergences can be extended to positive arrays [3].
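As an illustration (not part of the original slides), here is a minimal NumPy sketch of the discrete f-divergence; the helper names are mine, and the generator shown is the extended-KL generator f(t) = t log t - t + 1, which satisfies f(1) = f'(1) = 0 and f''(1) = 1.

```python
import numpy as np

def f_divergence(p, q, f):
    """Csiszar f-divergence I_f(p:q) = sum_i q_i * f(p_i / q_i) on positive arrays."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

def kl_gen(t):
    # Extended-KL generator: f(1) = f'(1) = 0 and f''(1) = 1.
    return t * np.log(t) - t + 1.0

print(f_divergence([0.2, 0.8], [0.5, 0.5], kl_gen))  # extended KL(p : q)
```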

  4. Mixed divergences
  Defined on three parameters:
    M_λ(p : q : r) := λ D(p : q) + (1 − λ) D(q : r),   λ ∈ [0, 1]
  Mixed divergences include:
  ◮ the sided divergences for λ ∈ {0, 1},
  ◮ the symmetrized (arithmetic mean) divergence for λ = 1/2.
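A one-line Python sketch of this definition (the function name and the divergence argument D are mine, not from the slides):

```python
def mixed_divergence(D, p, q, r, lam):
    """M_lam(p : q : r) = lam * D(p : q) + (1 - lam) * D(q : r), with lam in [0, 1]."""
    return lam * D(p, q) + (1.0 - lam) * D(q, r)
```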

  5. Mixed divergence-based k-means clustering
  Pick k distinct seeds from the dataset with l_i = r_i.
  Input: Weighted histogram set H, divergence D(·, ·), integer k > 0, real λ ∈ [0, 1];
  Initialize left-sided/right-sided seeds C = {(l_i, r_i)}_{i=1}^k;
  repeat
    // Assignment
    for i = 1, 2, ..., k do
      C_i ← {h ∈ H : i = arg min_j M_λ(l_j : h : r_j)};
    // Dual-sided centroid relocation
    for i = 1, 2, ..., k do
      r_i ← arg min_x D(C_i : x) = arg min_x Σ_{h∈C_i} w_h D(h : x);
      l_i ← arg min_x D(x : C_i) = arg min_x Σ_{h∈C_i} w_h D(x : h);
  until convergence;
  Output: Partition of H into k clusters following C;
  → different from the k-means clustering with respect to the symmetrized divergence
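Below is a hedged Python sketch of this loop. The names mixed_kmeans, left_centroid and right_centroid are mine; the two centroid callables stand in for the arg-min relocation steps (closed forms for α-divergences appear on later slides), and uniform sampling stands in for picking k distinct seeds.

```python
import numpy as np

def mixed_kmeans(H, w, k, D, left_centroid, right_centroid, lam=0.5,
                 max_iter=100, seed=None):
    """Sketch of mixed divergence-based k-means with dual (left, right) centers.

    H: (n, d) array of positive histograms, w: (n,) weights, D(p, q): divergence,
    left_centroid(Hs, ws)  ~ arg min_x sum_j ws_j D(x : h_j),
    right_centroid(Hs, ws) ~ arg min_x sum_j ws_j D(h_j : x).
    """
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(H), size=k, replace=False)
    L, R = H[idx].copy(), H[idx].copy()            # start with l_i = r_i
    labels = np.zeros(len(H), dtype=int)
    for _ in range(max_iter):
        # Assignment: h joins the cluster minimizing M_lam(l_j : h : r_j).
        cost = np.array([[lam * D(L[j], h) + (1.0 - lam) * D(h, R[j])
                          for j in range(k)] for h in H])
        labels = cost.argmin(axis=1)
        L_new, R_new = L.copy(), R.copy()
        for j in range(k):
            members = labels == j
            if members.any():
                R_new[j] = right_centroid(H[members], w[members])
                L_new[j] = left_centroid(H[members], w[members])
        if np.allclose(L_new, L) and np.allclose(R_new, R):
            break
        L, R = L_new, R_new
    return labels, L, R
```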

  6. α-divergences
  For α ∈ ℝ, α ≠ ±1, define the α-divergences [6] on positive arrays [18]:
    D_α(p : q) := (4 / (1 − α²)) Σ_{i=1}^d [ ((1 − α)/2) p_i + ((1 + α)/2) q_i − p_i^{(1−α)/2} q_i^{(1+α)/2} ]
  with D_α(p : q) = D_{−α}(q : p), and in the limit cases D_{−1}(p : q) = KL(p : q) and D_1(p : q) = KL(q : p), where KL is the extended Kullback-Leibler divergence:
    KL(p : q) := Σ_{i=1}^d [ p_i log(p_i / q_i) + q_i − p_i ]
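A direct NumPy transcription of this definition (my sketch; the α = ±1 limits fall back to the extended KL):

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """D_alpha(p:q) on positive arrays; alpha = -1/+1 fall back to extended KL."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.isclose(alpha, -1.0):                    # D_{-1}(p:q) = KL(p:q)
        return float(np.sum(p * np.log(p / q) + q - p))
    if np.isclose(alpha, 1.0):                     # D_{+1}(p:q) = KL(q:p)
        return float(np.sum(q * np.log(q / p) + p - q))
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return float(4.0 / (1.0 - alpha ** 2) * np.sum(a * p + b * q - p ** a * q ** b))
```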

  7. α-divergences belong to f-divergences
  The α-divergences belong to the class of Csiszár f-divergences with the following generator:
    f(t) = (4 / (1 − α²)) (1 − t^{(1+α)/2})   if α ≠ ±1,
    f(t) = t ln t                             if α = 1,
    f(t) = − ln t                             if α = −1.
  The Pearson and Neyman χ² distances are obtained for α = −3 and α = 3:
    D_3(p̃ : q̃) = (1/2) Σ_i (q̃_i − p̃_i)² / p̃_i,
    D_{−3}(p̃ : q̃) = (1/2) Σ_i (q̃_i − p̃_i)² / q̃_i.
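A quick numerical check (my own sketch, not from the slides) that D_3 coincides with the χ²-type distance with denominator p̃:

```python
import numpy as np

def alpha_div(p, q, alpha):
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4.0 / (1 - alpha ** 2) * np.sum(a * p + b * q - p ** a * q ** b)

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
chi2_p = 0.5 * np.sum((q - p) ** 2 / p)           # (1/2) sum_i (q_i - p_i)^2 / p_i
print(np.isclose(alpha_div(p, q, 3.0), chi2_p))   # True
```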

  8. Squared Hellinger symmetric distance is the α = 0 divergence
  The divergence D_0 is the squared Hellinger symmetric distance (scaled by 4), extended to positive arrays:
    D_0(p : q) = 2 ∫ (√p(x) − √q(x))² dx = 4 H²(p, q),
  with the Hellinger distance:
    H(p, q) = √( (1/2) ∫ (√p(x) − √q(x))² dx )
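The α = 0 case can be checked the same way; this small sketch of mine verifies D_0 = 4H² on a discrete example:

```python
import numpy as np

p = np.array([0.1, 0.6, 0.3])
q = np.array([0.3, 0.3, 0.4])

d0 = 4.0 * np.sum(0.5 * p + 0.5 * q - np.sqrt(p * q))        # D_0(p : q)
hellinger_sq = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)  # H^2(p, q)
print(np.isclose(d0, 4.0 * hellinger_sq))                    # True
```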

  9. Mixed α-divergences
  ◮ Mixed α-divergence of a histogram x with respect to two histograms p and q:
      M_{λ,α}(p : x : q) = λ D_α(p : x) + (1 − λ) D_α(x : q)
                         = λ D_{−α}(x : p) + (1 − λ) D_{−α}(q : x)
                         = M_{1−λ,−α}(q : x : p)
  ◮ The α-Jeffreys symmetrized divergence is obtained for λ = 1/2:
      S_α(p, q) = M_{1/2,α}(q : p : q) = M_{1/2,α}(p : q : p)
  ◮ The skew symmetrized α-divergence is defined by:
      S_{λ,α}(p : q) = λ D_α(p : q) + (1 − λ) D_α(q : p)
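A small self-contained check (my sketch) of the reference-duality identity M_{λ,α}(p : x : q) = M_{1−λ,−α}(q : x : p):

```python
import numpy as np

def alpha_div(p, q, alpha):
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4.0 / (1 - alpha ** 2) * np.sum(a * p + b * q - p ** a * q ** b)

def mixed_alpha(p, x, q, lam, alpha):
    return lam * alpha_div(p, x, alpha) + (1 - lam) * alpha_div(x, q, alpha)

p = np.array([0.2, 0.3, 0.5])
x = np.array([0.3, 0.3, 0.4])
q = np.array([0.5, 0.1, 0.4])
lam, alpha = 0.3, 0.6
print(np.isclose(mixed_alpha(p, x, q, lam, alpha),
                 mixed_alpha(q, x, p, 1 - lam, -alpha)))    # True
```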

  10. Coupled k-Means++ α-Seeding
  Algorithm 1: Mixed α-seeding; MAS(H, k, λ, α)
  Input: Weighted histogram set H, integer k ≥ 1, real λ ∈ [0, 1], real α ∈ ℝ;
  Let C ← {(h_j, h_j)} for a histogram h_j picked with uniform probability;
  for i = 2, 3, ..., k do
    Pick at random a histogram h ∈ H with probability
      π_H(h) := w_h M_{λ,α}(c_h : h : c_h) / Σ_{y∈H} w_y M_{λ,α}(c_y : y : c_y)     (1)
    // where (c_h, c_h) := arg min_{(z,z)∈C} M_{λ,α}(z : h : z);
    C ← C ∪ {(h, h)};
  Output: Set of initial cluster centers C;
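A hedged Python sketch of this seeding step (function names are mine; this is not the authors' reference code):

```python
import numpy as np

def alpha_div(p, q, alpha):
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4.0 / (1 - alpha ** 2) * np.sum(a * p + b * q - p ** a * q ** b)

def mixed_alpha_seeding(H, w, k, lam, alpha, seed=None):
    """MAS-style seeding: first seed uniform, next ones proportional to w_h * M_{lam,alpha}."""
    rng = np.random.default_rng(seed)
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    centers = [rng.integers(len(H))]                 # first seed picked uniformly
    for _ in range(1, k):
        # Mixed alpha-divergence of each h to its closest current (z, z) pair.
        dist = np.array([min(lam * alpha_div(H[c], h, alpha)
                             + (1 - lam) * alpha_div(h, H[c], alpha)
                             for c in centers) for h in H])
        prob = w * dist
        prob /= prob.sum()
        centers.append(rng.choice(len(H), p=prob))
    return H[centers]                                # used as both l_i and r_i
```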

  11. A guaranteed probabilistic initialization
  Let C_{λ,α} denote for short the cost function related to the clustering type chosen (left-, right-, skew Jeffreys, or mixed) in MAS, and let C^{opt}_{λ,α} denote the optimal related clustering into k clusters, for λ ∈ [0, 1] and α ∈ (−1, 1). Then, on average with respect to distribution (1), the initial clustering of MAS satisfies:
    E_π[C_{λ,α}] ≤ 4 f(λ) g(k) h²(α) C^{opt}_{λ,α}   if λ ∈ (0, 1),
    E_π[C_{λ,α}] ≤ 4 g(k) z(α) h⁴(α) C^{opt}_{λ,α}   otherwise.
  Here, f(λ) = max{ (1 − λ)/λ, λ/(1 − λ) }, g(k) = 2(2 + log k), z(α) = 8|α|² / (1 − |α|)², and h(α) = max_i p_i^{|α|/(1−|α|)} / min_i p_i^{|α|/(1+|α|)}; the min is defined on strictly positive coordinates, and π denotes the picking distribution.

  12. Mixed α-hard clustering: MAhC(H, k, λ, α)
  Input: Weighted histogram set H, integer k > 0, real λ ∈ [0, 1], real α ∈ ℝ;
  Let C = {(l_i, r_i)}_{i=1}^k ← MAS(H, k, λ, α);
  repeat
    // Assignment
    for i = 1, 2, ..., k do
      A_i ← {h ∈ H : i = arg min_j M_{λ,α}(l_j : h : r_j)};
    // Centroid relocation
    for i = 1, 2, ..., k do
      r_i ← ( Σ_{h∈A_i} w_h h^{(1−α)/2} )^{2/(1−α)};
      l_i ← ( Σ_{h∈A_i} w_h h^{(1+α)/2} )^{2/(1+α)};
  until convergence;
  Output: Partition of H into k clusters following C;
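A sketch of the closed-form relocation step: each center is a coordinate-wise weighted power mean with exponent (1−α)/2 (right) or (1+α)/2 (left). Here α ≠ ±1 is assumed, the helper names are mine, and the cluster weights are renormalized inside each cluster (an assumption of this sketch).

```python
import numpy as np

def right_alpha_centroid(Hs, ws, alpha):
    """r = (sum_j w_j h_j^{(1-alpha)/2})^{2/(1-alpha)}, coordinate-wise (alpha != 1)."""
    ws = ws / ws.sum()                               # renormalize within the cluster
    e = (1.0 - alpha) / 2.0
    return ((ws[:, None] * Hs ** e).sum(axis=0)) ** (1.0 / e)

def left_alpha_centroid(Hs, ws, alpha):
    """l_alpha = r_{-alpha}: same power mean with exponent (1+alpha)/2 (alpha != -1)."""
    return right_alpha_centroid(Hs, ws, -alpha)

# Example relocation for one cluster of two weighted histograms:
Hs = np.array([[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]])
ws = np.array([0.25, 0.75])
print(right_alpha_centroid(Hs, ws, alpha=0.5))
print(left_alpha_centroid(Hs, ws, alpha=0.5))
```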

  13. Sided Positive α-Centroids [14]
  The left-sided l_α and right-sided r_α positive weighted α-centroid coordinates of a set of n positive histograms h_1, ..., h_n are weighted α-means:
    r_α^i = f_α^{-1}( Σ_{j=1}^n w_j f_α(h_j^i) ),   l_α^i = r_{−α}^i
  with f_α(x) = x^{(1−α)/2} for α ≠ 1, and f_α(x) = log x for α = 1.
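The same centroids expressed through the representation function f_α (my sketch; the α = 1 branch uses the logarithm and yields a weighted geometric mean, assuming the weights sum to one):

```python
import numpy as np

def alpha_mean(X, w, alpha):
    """Coordinate-wise weighted alpha-mean f_alpha^{-1}(sum_j w_j f_alpha(x_j))."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    if np.isclose(alpha, 1.0):                       # f_1(x) = log x -> geometric mean
        return np.exp((w[:, None] * np.log(X)).sum(axis=0))
    e = (1.0 - alpha) / 2.0                          # f_alpha(x) = x^{(1-alpha)/2}
    return ((w[:, None] * X ** e).sum(axis=0)) ** (1.0 / e)

# Right-sided centroid: r_alpha = alpha_mean(H, w, alpha);
# left-sided centroid:  l_alpha = alpha_mean(H, w, -alpha).
```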

  14. Sided Frequency α-Centroids [2]
  Theorem (Amari, 2007): The coordinates of the sided frequency α-centroids of a set of n weighted frequency histograms are the normalised weighted α-means.

  15. Positive and Frequency α-centroids
  Summary:
  ◮ r_α^i = ( Σ_{j=1}^n w_j (h_j^i)^{(1−α)/2} )^{2/(1−α)} for α ≠ 1;   r_1^i = Π_{j=1}^n (h_j^i)^{w_j} for α = 1
  ◮ l_α^i = r_{−α}^i = ( Σ_{j=1}^n w_j (h_j^i)^{(1+α)/2} )^{2/(1+α)} for α ≠ −1;   l_{−1}^i = Π_{j=1}^n (h_j^i)^{w_j} for α = −1
  ◮ r̃_α^i = r_α^i / w(r_α), where w(·) denotes the cumulative sum of the coordinates
  ◮ l̃_α^i = r̃_{−α}^i = r_{−α}^i / w(r_{−α})
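Following this summary, the frequency centroids can be obtained by normalizing the positive α-centroids by their total mass (my sketch; alpha_mean is repeated inline to stay self-contained and assumes α ≠ ±1):

```python
import numpy as np

def alpha_mean(X, w, alpha):
    e = (1.0 - alpha) / 2.0
    return ((np.asarray(w, float)[:, None] * np.asarray(X, float) ** e).sum(axis=0)) ** (1.0 / e)

def frequency_alpha_centroids(H, w, alpha):
    """Return (r_tilde, l_tilde): positive alpha-centroids normalized by their mass w(.)."""
    r = alpha_mean(H, w, alpha)
    l = alpha_mean(H, w, -alpha)     # l_alpha = r_{-alpha}
    return r / r.sum(), l / l.sum()
```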

  16. Mixed α-Centroids
  The two centroids are the minimizers of:
    Σ_j w_j M_{λ,α}(l : h_j : r)
  Generalizing mixed Bregman divergences [16]:
  Theorem: The two mixed α-centroids are the left-sided and right-sided α-centroids.

  17. Symmetrized Jeffreys-Type α-Centroids
    S_α(p, q) = (1/2) (D_α(p : q) + D_α(q : p)) = S_{−α}(p, q) = M_{1/2,α}(p : q : p)
  For α = ±1, we get half of the Jeffreys divergence:
    S_{±1}(p, q) = (1/2) Σ_{i=1}^d (p_i − q_i) log(p_i / q_i)

  18. Jeffreys α-divergence and Heinz means
  When p and q are frequency histograms, we have for α ≠ ±1:
    J_α(p̃ : q̃) = (8 / (1 − α²)) ( 1 − Σ_{i=1}^d H_{(1−α)/2}(p̃_i, q̃_i) )
  where H_β(a, b) is a symmetric Heinz mean [8, 5]:
    H_β(a, b) = (a^β b^{1−β} + a^{1−β} b^β) / 2
  Heinz means interpolate between the geometric and arithmetic means and satisfy the inequality:
    √(ab) = H_{1/2}(a, b) ≤ H_β(a, b) ≤ H_0(a, b) = (a + b)/2.
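A numerical check (my sketch) of this identity, taking J_α as the unhalved symmetrized divergence D_α(p : q) + D_α(q : p), which is consistent with the factor 8:

```python
import numpy as np

def alpha_div(p, q, alpha):
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4.0 / (1 - alpha ** 2) * np.sum(a * p + b * q - p ** a * q ** b)

def heinz(a, b, beta):
    return (a ** beta * b ** (1 - beta) + a ** (1 - beta) * b ** beta) / 2.0

p = np.array([0.2, 0.3, 0.5])      # frequency histograms (sum to 1)
q = np.array([0.4, 0.4, 0.2])
alpha = 0.3
lhs = alpha_div(p, q, alpha) + alpha_div(q, p, alpha)
rhs = 8.0 / (1 - alpha ** 2) * (1.0 - np.sum(heinz(p, q, (1 - alpha) / 2)))
print(np.isclose(lhs, rhs))        # True
```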

  19. Jeffreys divergence in the limit case
  For α = ±1, the symmetrized divergence S_α(p, q) tends to (half of) the Jeffreys divergence:
    J(p, q) = KL(p, q) + KL(q, p) = Σ_{i=1}^d (p_i − q_i)(log p_i − log q_i)
  The Jeffreys divergence takes mathematically the same form for frequency histograms:
    J(p̃, q̃) = KL(p̃, q̃) + KL(q̃, p̃) = Σ_{i=1}^d (p̃_i − q̃_i)(log p̃_i − log q̃_i)
