
Jeffreys centroids: A closed-form expression for positive histograms - PowerPoint PPT Presentation



1. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms
Frank Nielsen (Frank.Nielsen@acm.org)
Sony Computer Science Laboratories, Inc.
April 2013
© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.

2. Why histogram clustering?
Task: classify documents into categories with the Bag-of-Words (BoW) modeling paradigm [3, 6]:
◮ Define a word dictionary, and
◮ Represent each document by a word-count histogram.
Centroid-based k-means clustering [1]:
◮ Cluster document histograms to learn categories,
◮ Build visual vocabularies by quantizing image features: Compressed Histogram of Gradient descriptors [4].
→ requires histogram centroids.
Notation: $w_h = \sum_{i=1}^d h^i$ is the cumulative sum of the bin values; $\tilde{\cdot}$ denotes the normalization operator (division by the cumulative sum, e.g. $\tilde h = h / w_h$).

3. Why Jeffreys divergence?
Distance between two frequency histograms $\tilde p$ and $\tilde q$: the Kullback-Leibler divergence, or relative entropy:
$\mathrm{KL}(\tilde p : \tilde q) = H^{\times}(\tilde p : \tilde q) - H(\tilde p)$,
$H^{\times}(\tilde p : \tilde q) = \sum_{i=1}^d \tilde p^i \log \frac{1}{\tilde q^i}$ (cross-entropy),
$H(\tilde p) = H^{\times}(\tilde p : \tilde p) = \sum_{i=1}^d \tilde p^i \log \frac{1}{\tilde p^i}$ (Shannon entropy).
→ the expected extra number of bits per datum that must be transmitted when using the "wrong" distribution $\tilde q$ instead of the true distribution $\tilde p$.
$\tilde p$ is hidden by nature (and hypothesized), $\tilde q$ is estimated.
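These quantities translate directly into code. A minimal sketch (not from the presentation), assuming NumPy and strictly positive frequency histograms so all logarithms are finite:

```python
import numpy as np

def cross_entropy(p, q):
    """H^x(p:q) = sum_i p_i log(1/q_i), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(1.0 / q)))

def shannon_entropy(p):
    """H(p) = H^x(p:p)."""
    return cross_entropy(p, p)

def kl(p, q):
    """KL(p:q) = H^x(p:q) - H(p) for frequency histograms p and q."""
    return cross_entropy(p, q) - shannon_entropy(p)
```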

4. Why Jeffreys divergence?
When clustering histograms, all histograms play the same role → Jeffreys [8] divergence:
$J(p, q) = \mathrm{KL}(p : q) + \mathrm{KL}(q : p) = \sum_{i=1}^d (p^i - q^i) \log \frac{p^i}{q^i} = J(q, p)$
→ symmetrizes the KL divergence (also called the J-divergence or the symmetrical Kullback-Leibler divergence).
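The symmetrized divergence is just as short to evaluate; a sketch under the same assumptions (NumPy, strictly positive bins), using the single-sum form from this slide:

```python
import numpy as np

def jeffreys(p, q):
    """J(p,q) = KL(p:q) + KL(q:p) = sum_i (p_i - q_i) log(p_i / q_i)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum((p - q) * np.log(p / q)))
```

As a quick check, jeffreys(p, q) equals jeffreys(q, p) up to floating-point rounding, as the symmetry $J(p,q) = J(q,p)$ requires.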

5. Jeffreys centroids: frequency and positive centroids
A set $\mathcal{H} = \{h_1, \ldots, h_n\}$ of weighted histograms, with positive weights $\pi_j > 0$ such that $\sum_{j=1}^n \pi_j = 1$.
◮ Jeffreys positive centroid $c$:
$c = \arg\min_{x \in \mathbb{R}^d_+} \sum_{j=1}^n \pi_j J(h_j, x)$
◮ Jeffreys frequency centroid $\tilde c$:
$\tilde c = \arg\min_{x \in \Delta_d} \sum_{j=1}^n \pi_j J(\tilde h_j, x)$
$\Delta_d$: the probability $(d-1)$-dimensional simplex.

6. Prior work
◮ Histogram clustering w.r.t. the χ² distance [10]
◮ Histogram clustering w.r.t. the Bhattacharyya distance [11, 13]
◮ Histogram clustering w.r.t. the Kullback-Leibler distance as Bregman k-means clustering [1]
◮ Jeffreys frequency centroid [16] (Newton numerical optimization)
◮ Jeffreys frequency centroid as an equivalent symmetrized Bregman centroid [14]
◮ Mixed Bregman clustering [15]
◮ Smooth family of symmetrized KL centroids including the Jensen-Shannon centroids and the Jeffreys centroids as limit cases [12]

7. Jeffreys positive centroid
$c = \arg\min_{x \in \mathbb{R}^d_+} J(\mathcal{H}, x) = \arg\min_{x \in \mathbb{R}^d_+} \sum_{j=1}^n \pi_j J(h_j, x)$
Theorem 1. The Jeffreys positive centroid $c = (c^1, \ldots, c^d)$ of a set $\{h_1, \ldots, h_n\}$ of $n$ weighted positive histograms with $d$ bins can be calculated component-wise exactly using the Lambert $W$ analytic function:
$c^i = \frac{a^i}{W\!\left(\frac{a^i}{g^i} e\right)}$,
where $a^i = \sum_{j=1}^n \pi_j h_j^i$ denotes the coordinate-wise weighted arithmetic mean and $g^i = \prod_{j=1}^n (h_j^i)^{\pi_j}$ the coordinate-wise weighted geometric mean.
Lambert analytic function [2]: $W(x) e^{W(x)} = x$ for $x \ge 0$.
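Theorem 1 turns the centroid computation into a few array operations. A minimal sketch (not the authors' code) using SciPy's principal-branch Lambert W; `hists` is an (n, d) array of positive histograms and `weights` holds the $\pi_j$, with names chosen here for illustration:

```python
import numpy as np
from scipy.special import lambertw

def jeffreys_positive_centroid(hists, weights):
    """Closed-form Jeffreys positive centroid of weighted positive histograms (Theorem 1)."""
    hists = np.asarray(hists, dtype=float)      # shape (n, d), all entries > 0
    weights = np.asarray(weights, dtype=float)  # shape (n,), positive, summing to 1
    a = weights @ hists                         # coordinate-wise weighted arithmetic means a^i
    g = np.exp(weights @ np.log(hists))         # coordinate-wise weighted geometric means g^i
    return a / lambertw(a / g * np.e).real      # c^i = a^i / W(a^i e / g^i)
```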

8. Jeffreys positive centroid (proof)
$\min_x \sum_{j=1}^n \pi_j J(h_j, x) = \min_x \sum_{j=1}^n \pi_j \sum_{i=1}^d (h_j^i - x^i)(\log h_j^i - \log x^i)$
$\equiv \min_x \sum_{i=1}^d \sum_{j=1}^n \pi_j \left( x^i \log x^i - x^i \log h_j^i - h_j^i \log x^i \right)$
$= \min_x \sum_{i=1}^d \left( x^i \log \frac{x^i}{g^i} - a^i \log x^i \right)$,
with $a^i = \sum_{j=1}^n \pi_j h_j^i$ and $g^i = \prod_{j=1}^n (h_j^i)^{\pi_j}$ (terms not depending on $x$ are dropped).

9. Jeffreys positive centroid (proof)
Minimize coordinate-wise:
$\min_x \; x \log \frac{x}{g} - a \log x$
Setting the derivative to zero, we solve
$\log \frac{x}{g} + 1 - \frac{a}{x} = 0$
and get
$x = \frac{a}{W\!\left(\frac{a}{g} e\right)}$.

10. Jeffreys frequency centroid: a guaranteed approximation
$\tilde c = \arg\min_{x \in \Delta_d} \sum_{j=1}^n \pi_j J(\tilde h_j, x)$
Relaxing $x$ from the probability simplex $\Delta_d$ to $\mathbb{R}^d_+$ and normalizing, we get
$\tilde c' = \frac{c}{w_c}$, with $c^i = \frac{a^i}{W\!\left(\frac{a^i}{g^i} e\right)}$ and $w_c = \sum_{i=1}^d c^i$.
Lemma 1. The cumulative sum $w_c$ of the bin values of the Jeffreys positive centroid $c$ of a set of frequency histograms is less than or equal to one: $0 < w_c \le 1$.
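A sketch of this relaxation (illustrative names, reusing the jeffreys_positive_centroid sketch given after Theorem 1): compute the positive centroid of the frequency histograms, record its cumulative sum $w_c$, and renormalize.

```python
def jeffreys_frequency_centroid_approx(freq_hists, weights):
    """Normalized Jeffreys positive centroid c~' = c / w_c (guaranteed approximation)."""
    c = jeffreys_positive_centroid(freq_hists, weights)  # closed form (Theorem 1)
    w_c = c.sum()                                        # Lemma 1 guarantees 0 < w_c <= 1
    return c / w_c, w_c
```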

11. Proof of Lemma 1
From Theorem 1:
$w_c = \sum_{i=1}^d c^i = \sum_{i=1}^d \frac{a^i}{W\!\left(\frac{a^i}{g^i} e\right)}$.
By the arithmetic-geometric mean inequality, $a^i \ge g^i$. Therefore $W\!\left(\frac{a^i}{g^i} e\right) \ge 1$ and $c^i \le a^i$. Thus
$w_c = \sum_{i=1}^d c^i \le \sum_{i=1}^d a^i = 1$ (since each frequency histogram $\tilde h_j$ sums to one).

12. Lemma 2
Lemma 2. For any positive histogram $x$ and frequency histogram $\tilde h$, we have
$J(x, \tilde h) = J(\tilde x, \tilde h) + (w_x - 1)\left(\mathrm{KL}(\tilde x : \tilde h) + \log w_x\right)$,
where $w_x$ denotes the normalization factor ($w_x = \sum_{i=1}^d x^i$).
By linearity, the same relation holds for a weighted set $\tilde{\mathcal{H}}$ of frequency histograms:
$J(x, \tilde{\mathcal{H}}) = J(\tilde x, \tilde{\mathcal{H}}) + (w_x - 1)\left(\mathrm{KL}(\tilde x : \tilde{\mathcal{H}}) + \log w_x\right)$,
where $J(x, \tilde{\mathcal{H}}) = \sum_{j=1}^n \pi_j J(x, \tilde h_j)$ and $\mathrm{KL}(\tilde x : \tilde{\mathcal{H}}) = \sum_{j=1}^n \pi_j \mathrm{KL}(\tilde x : \tilde h_j)$ (with $\sum_{j=1}^n \pi_j = 1$).

13. Proof of Lemma 2
Write $x^i = w_x \tilde x^i$:
$J(x, \tilde h) = \sum_{i=1}^d (w_x \tilde x^i - \tilde h^i) \log \frac{w_x \tilde x^i}{\tilde h^i}$
$= \sum_{i=1}^d \left( w_x \tilde x^i \log w_x + w_x \tilde x^i \log \frac{\tilde x^i}{\tilde h^i} - \tilde h^i \log \frac{\tilde x^i}{\tilde h^i} - \tilde h^i \log w_x \right)$
$= (w_x - 1) \log w_x + J(\tilde x, \tilde h) + (w_x - 1) \sum_{i=1}^d \tilde x^i \log \frac{\tilde x^i}{\tilde h^i}$
$= J(\tilde x, \tilde h) + (w_x - 1)\left(\mathrm{KL}(\tilde x : \tilde h) + \log w_x\right)$,
since $\sum_{i=1}^d \tilde h^i = \sum_{i=1}^d \tilde x^i = 1$.

14. Guaranteed approximation of $\tilde c$
Theorem 2. Let $\tilde c$ denote the Jeffreys frequency centroid and $\tilde c' = \frac{c}{w_c}$ the normalized Jeffreys positive centroid. Then the approximation factor $\alpha_{\tilde c'} = \frac{J(\tilde c', \tilde{\mathcal{H}})}{J(\tilde c, \tilde{\mathcal{H}})}$ satisfies $1 \le \alpha_{\tilde c'} \le \frac{1}{w_c}$ (with $w_c \le 1$).

15. Proof of Theorem 2
$J(c, \tilde{\mathcal{H}}) \le J(\tilde c, \tilde{\mathcal{H}}) \le J(\tilde c', \tilde{\mathcal{H}})$.
From Lemma 2,
$J(\tilde c', \tilde{\mathcal{H}}) = J(c, \tilde{\mathcal{H}}) + (1 - w_c)\left(\mathrm{KL}(\tilde c' : \tilde{\mathcal{H}}) + \log w_c\right)$,
and since $J(c, \tilde{\mathcal{H}}) \le J(\tilde c, \tilde{\mathcal{H}})$:
$1 \le \alpha_{\tilde c'} \le 1 + \frac{(1 - w_c)\left(\mathrm{KL}(\tilde c' : \tilde{\mathcal{H}}) + \log w_c\right)}{J(\tilde c, \tilde{\mathcal{H}})}$.
Using $\mathrm{KL}(\tilde c' : \tilde{\mathcal{H}}) = \frac{1}{w_c} \mathrm{KL}(c : \tilde{\mathcal{H}}) - \log w_c$, this becomes
$\alpha_{\tilde c'} \le 1 + \frac{1 - w_c}{w_c} \, \frac{\mathrm{KL}(c : \tilde{\mathcal{H}})}{J(\tilde c, \tilde{\mathcal{H}})}$.
Since $J(\tilde c, \tilde{\mathcal{H}}) \ge J(c, \tilde{\mathcal{H}})$ and $\mathrm{KL}(c : \tilde{\mathcal{H}}) \le J(c, \tilde{\mathcal{H}})$, we get $\alpha_{\tilde c'} \le \frac{1}{w_c}$.
When $w_c = 1$ the bound is tight.

16. In practice...
$c$ is known in closed form → compute $w_c$, $\mathrm{KL}(c : \tilde{\mathcal{H}})$ and $J(c, \tilde{\mathcal{H}})$, and bound the approximation factor $\alpha_{\tilde c'}$ as:
$\alpha_{\tilde c'} \le 1 + \left( \frac{1}{w_c} - 1 \right) \frac{\mathrm{KL}(c : \tilde{\mathcal{H}})}{J(c, \tilde{\mathcal{H}})} \le \frac{1}{w_c}$.
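A sketch of this a-posteriori check (illustrative names; KL and J here are the plain sums $\sum_i p^i \log \frac{p^i}{q^i}$ and $\sum_i (p^i - q^i) \log \frac{p^i}{q^i}$ applied to the unnormalized centroid $c$, as on the slides):

```python
import numpy as np

def alpha_upper_bound(c, freq_hists, weights):
    """Data-dependent bound 1 + (1/w_c - 1) KL(c:H~)/J(c,H~) on the approximation factor."""
    c = np.asarray(c, dtype=float)
    w_c = c.sum()
    kl_cH = sum(w * np.sum(c * np.log(c / h)) for w, h in zip(weights, freq_hists))
    j_cH = sum(w * np.sum((c - h) * np.log(c / h)) for w, h in zip(weights, freq_hists))
    return 1.0 + (1.0 / w_c - 1.0) * kl_cH / j_cH
```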

17. Fine approximation
From [16, 14], minimizing the Jeffreys frequency centroid criterion is equivalent to:
$\tilde c = \arg\min_{\tilde x \in \Delta_d} \mathrm{KL}(\tilde a : \tilde x) + \mathrm{KL}(\tilde x : \tilde g)$.
Lagrangian function enforcing $\sum_i \tilde c^i = 1$:
$\log \frac{\tilde c^i}{\tilde g^i} + 1 - \frac{\tilde a^i}{\tilde c^i} + \lambda = 0$
$\tilde c^i = \frac{\tilde a^i}{W\!\left( \frac{\tilde a^i}{\tilde g^i} e^{\lambda + 1} \right)}$
$\lambda = -\mathrm{KL}(\tilde c : \tilde g) \le 0$

18. Fine approximation: bisection search
$\tilde c^i = \frac{\tilde a^i}{W\!\left( \frac{\tilde a^i}{\tilde g^i} e^{\lambda+1} \right)} \le 1 \;\Rightarrow\; \lambda \ge \log\!\left(e^{\tilde a^i} \tilde g^i\right) - 1 \quad \forall i$,
hence $\lambda \in \left[\max_i \log\!\left(e^{\tilde a^i} \tilde g^i\right) - 1, \; 0\right]$.
$s(\lambda) = \sum_{i=1}^d \tilde c^i(\lambda) = \sum_{i=1}^d \frac{\tilde a^i}{W\!\left( \frac{\tilde a^i}{\tilde g^i} e^{\lambda+1} \right)}$
The function $s$ is monotonically decreasing with $s(0) \le 1$.
→ Bisection search for $s(\lambda^*) \simeq 1$ to arbitrary precision.
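A compact sketch of this bisection (illustrative names, SciPy's lambertw assumed; double precision rather than the arbitrary-precision arithmetic used in the experiments):

```python
import numpy as np
from scipy.special import lambertw

def jeffreys_frequency_centroid(freq_hists, weights, tol=1e-12):
    """Jeffreys frequency centroid via bisection on the Lagrange multiplier lambda."""
    h = np.asarray(freq_hists, dtype=float)   # (n, d) frequency histograms, bins > 0
    w = np.asarray(weights, dtype=float)      # (n,) positive weights summing to 1
    a = w @ h                                 # normalized arithmetic mean a~ (sums to 1)
    g = np.exp(w @ np.log(h))
    g = g / g.sum()                           # normalized geometric mean g~

    def s(lam):                               # s(lambda) = sum_i a~_i / W((a~_i/g~_i) e^(lambda+1))
        return float(np.sum(a / lambertw(a / g * np.exp(lam + 1.0)).real))

    lo, hi = float(np.max(a + np.log(g))) - 1.0, 0.0   # s(lo) >= 1 >= s(hi), s decreasing
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if s(mid) > 1.0 else (lo, mid)
    c = a / lambertw(a / g * np.exp(0.5 * (lo + hi) + 1.0)).real
    return c / c.sum()                        # absorb the residual normalization error
```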

19. Experiments: Caltech-256
Caltech-256 [7]: 30607 images labeled into 256 categories (256 Jeffreys centroids).
Arbitrary floating-point precision: http://www.apfloat.org/
Veldhuis' approximation: $\tilde c'' = \frac{\tilde a + \tilde g}{2}$.

      w_c <= 1 (normalizing coeff.)   alpha_c~' (normalized approx.)   alpha_c (optimal positive)   alpha_c~'' (Veldhuis' approx.)
avg   0.9648680345638155              1.0002205080964255               0.9338228644308926           1.065590178484613
min   0.906414219584823               1.0000005079528809               0.8342819488534723           1.0027707382095195
max   0.9956399220678585              1.0000031489541772               0.9931975105809021           1.3582296675397754

20. Experiments: synthetic data sets
Random binary histograms.
$\alpha = \frac{J(\tilde c', \tilde{\mathcal{H}})}{J(\tilde c, \tilde{\mathcal{H}})} \ge 1$
Performance: $\bar\alpha \sim 1.0000009$, $\alpha_{\max} \sim 1.00181506$, $\alpha_{\min} = 1.000000$.
Open question: can a better worst-case upper bound be derived?
