SLIDE 1

On Clustering Histograms with k-Means by Using Mixed α-Divergences

Entropy 16(6): 3273-3301 (2014)

Frank Nielsen (1,2), Richard Nock (3), Shun-ichi Amari (4)

1 Sony Computer Science Laboratories, Japan

E-Mail: Frank.Nielsen@acm.org

2 École Polytechnique, France

3 NICTA/ANU, Australia

4 RIKEN Brain Science Institute, Japan

2014

SLIDE 2

Clustering histograms

◮ Information Retrieval systems (IRs) based on the bag-of-words paradigm (bag-of-textons, bag-of-features, bag-of-X)

◮ The rôle of distances:

  ◮ Initially, create a dictionary of "words" by quantizing using k-means clustering (this depends on the underlying distance)

  ◮ At query time, find the "closest" (histogram) document by querying with the histogram query

◮ Notation: positive arrays h (counting histograms) versus frequency histograms h̃ (normalized counting), with d bins

For IRs, prefer symmetric distances (not necessarily metrics) like the Jeffreys divergence or the Jensen-Shannon divergence (unified by a one-parameter family of divergences in [11]).

SLIDE 3

Ali-Silvey-Csiszár f-divergences

An important class of divergences: the f-divergences [10, 1, 7], defined for a convex generator f (with f(1) = f'(1) = 0 and f''(1) = 1):

$$I_f(p : q) \doteq \sum_{i=1}^{d} q_i\, f\!\left(\frac{p_i}{q_i}\right)$$

These divergences preserve information monotonicity [3] under any arbitrary transition probability (Markov morphisms). f-divergences can be extended to positive arrays [3].
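As a minimal illustration (a sketch, not from the slides; the function names are mine), an f-divergence is a single generator-weighted sum over bins. With the extended-KL generator f(t) = t ln t − t + 1 (which satisfies f(1) = f'(1) = 0 and f''(1) = 1), this recovers the extended Kullback-Leibler divergence:

```python
import numpy as np

def f_divergence(p, q, f):
    """I_f(p : q) = sum_i q_i * f(p_i / q_i); assumes strictly positive bins."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

# Extended-KL generator: f(1) = f'(1) = 0, f''(1) = 1.
kl_gen = lambda t: t * np.log(t) - t + 1

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
print(f_divergence(p, q, kl_gen))  # extended KL(p : q)
```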

SLIDE 4

Mixed divergences

Defined on three parameters:

$$M_\lambda(p : q : r) \doteq \lambda\, D(p : q) + (1 - \lambda)\, D(q : r), \qquad \lambda \in [0, 1].$$

Mixed divergences include:

◮ the sided divergences for λ ∈ {0, 1},

◮ the symmetrized (arithmetic mean) divergence for λ = 1/2.
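In code, the mixed divergence is just a λ-weighted combination of two sided calls to any base divergence D (a sketch; the name mixed_divergence is mine):

```python
def mixed_divergence(D, p, q, r, lam):
    """M_lambda(p : q : r) = lam * D(p : q) + (1 - lam) * D(q : r)."""
    assert 0.0 <= lam <= 1.0
    return lam * D(p, q) + (1.0 - lam) * D(q, r)

# lam = 1 or lam = 0 recovers the sided divergences D(p : q) and D(q : r);
# lam = 1/2 with r = p gives half the symmetrized divergence D(p:q) + D(q:p).
```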

SLIDE 5

Mixed divergence-based k-means clustering

Input: Weighted histogram set H, divergence D(·, ·), integer k > 0, real λ ∈ [0, 1];
Initialize left-sided/right-sided seeds C = {(l_i, r_i)}_{i=1}^{k}, taking k distinct seeds from the dataset with l_i = r_i;
repeat
  // Assignment
  for i = 1, 2, ..., k do
    C_i ← {h ∈ H : i = arg min_j M_λ(l_j : h : r_j)};
  // Dual-sided centroid relocation
  for i = 1, 2, ..., k do
    r_i ← arg min_x D(C_i : x) = Σ_{h ∈ C_i} w_h D(h : x);
    l_i ← arg min_x D(x : C_i) = Σ_{h ∈ C_i} w_h D(x : h);
until convergence;
Output: Partition of H into k clusters following C;

→ different from the k-means clustering with respect to the symmetrized divergences
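A skeletal transcription under stated assumptions (histograms as rows of a NumPy array, uniform weights; mixed_kmeans and the left_centroid/right_centroid callables are my names, not the authors' code). The sided relocation steps stay abstract here; the α-divergence instantiation later admits closed-form solvers:

```python
import numpy as np

def mixed_kmeans(H, C, D, lam, left_centroid, right_centroid, n_iter=50):
    """Skeleton of mixed divergence k-means (a sketch).
    C: list of (l_i, r_i) seed pairs; left_centroid/right_centroid: callables
    returning arg min_x sum_h D(x : h) and arg min_x sum_h D(h : x)."""
    labels = np.zeros(len(H), dtype=int)
    for _ in range(n_iter):
        # Assignment: closest (l_j, r_j) pair under the mixed divergence.
        labels = np.array([np.argmin([mixed_divergence(D, l, h, r, lam)
                                      for (l, r) in C]) for h in H])
        # Dual-sided centroid relocation, delegated to the solvers.
        for i in range(len(C)):
            A = H[labels == i]
            if len(A):
                C[i] = (left_centroid(A), right_centroid(A))
    return labels, C
```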

SLIDE 6

α-divergences

For α ∈ R, α ≠ ±1, define the α-divergences [6] on positive arrays [18]:

$$D_\alpha(p : q) \doteq \sum_{i=1}^{d} \frac{4}{1-\alpha^2}\left(\frac{1-\alpha}{2}\, p_i + \frac{1+\alpha}{2}\, q_i - p_i^{\frac{1-\alpha}{2}}\, q_i^{\frac{1+\alpha}{2}}\right)$$

with $D_\alpha(p : q) = D_{-\alpha}(q : p)$ and, in the limit cases, $D_{-1}(p : q) = \mathrm{KL}(p : q)$ and $D_{1}(p : q) = \mathrm{KL}(q : p)$, where KL is the extended Kullback-Leibler divergence:

$$\mathrm{KL}(p : q) \doteq \sum_{i=1}^{d} \left(p_i \log \frac{p_i}{q_i} + q_i - p_i\right)$$
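A direct transcription of this formula and its two KL limits (a sketch; assumes strictly positive bins, and the function name is mine):

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """D_alpha(p : q) on positive arrays, with KL limits at alpha = -/+1."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.isclose(alpha, -1.0):   # extended KL(p : q)
        return float(np.sum(p * np.log(p / q) + q - p))
    if np.isclose(alpha, 1.0):    # extended KL(q : p)
        return float(np.sum(q * np.log(q / p) + p - q))
    c = 4.0 / (1.0 - alpha**2)
    return float(c * np.sum((1 - alpha) / 2 * p + (1 + alpha) / 2 * q
                            - p**((1 - alpha) / 2) * q**((1 + alpha) / 2)))
```

One can check numerically that alpha_divergence(p, q, a) equals alpha_divergence(q, p, -a), and that values for α near ±1 approach the two KL limits.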

SLIDE 7

α-divergences belong to f-divergences

The α-divergences belong to the class of Csiszár f-divergences with the following generator:

$$f(t) = \begin{cases} \dfrac{4}{1-\alpha^2}\left(1 - t^{(1+\alpha)/2}\right), & \text{if } \alpha \neq \pm 1,\\ t \ln t, & \text{if } \alpha = 1,\\ -\ln t, & \text{if } \alpha = -1. \end{cases}$$

The Pearson and Neyman χ² distances are obtained for α = −3 and α = 3:

$$D_3(\tilde p : \tilde q) = \frac{1}{2} \sum_i \frac{(\tilde q_i - \tilde p_i)^2}{\tilde p_i}, \qquad D_{-3}(\tilde p : \tilde q) = \frac{1}{2} \sum_i \frac{(\tilde q_i - \tilde p_i)^2}{\tilde q_i}.$$
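The α = ±3 correspondence is easy to verify numerically (an illustrative check reusing the alpha_divergence sketch above; the identities hold per bin, so they do not require normalized histograms):

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

chi2_p = 0.5 * np.sum((q - p)**2 / p)            # matches D_3(p : q)
chi2_q = 0.5 * np.sum((q - p)**2 / q)            # matches D_{-3}(p : q)
print(np.isclose(alpha_divergence(p, q, 3.0), chi2_p))    # True
print(np.isclose(alpha_divergence(p, q, -3.0), chi2_q))   # True
```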

SLIDE 8

The squared Hellinger distance is the α = 0 divergence

The divergence D_0 is the squared Hellinger symmetric distance (scaled by 4), extended to positive arrays:

$$D_0(p : q) = 2 \int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 \mathrm{d}x = 4 H^2(p, q),$$

with the Hellinger distance:

$$H(p, q) = \sqrt{\frac{1}{2} \int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 \mathrm{d}x}$$
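A quick discrete sanity check of the D_0 = 4H² relation (a sketch; the integrals become sums over bins):

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

d0 = 2.0 * np.sum((np.sqrt(p) - np.sqrt(q))**2)               # D_0(p : q)
hellinger = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q))**2))
print(np.isclose(d0, 4.0 * hellinger**2))                     # True
```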

SLIDE 9

Mixed α-divergences

◮ Mixed α-divergence between a histogram x and two histograms p and q:

$$M_{\lambda,\alpha}(p : x : q) = \lambda\, D_\alpha(p : x) + (1 - \lambda)\, D_\alpha(x : q) = \lambda\, D_{-\alpha}(x : p) + (1 - \lambda)\, D_{-\alpha}(q : x) = M_{1-\lambda,-\alpha}(q : x : p)$$

◮ The α-Jeffreys symmetrized divergence is obtained for λ = 1/2:

$$S_\alpha(p, q) = M_{\frac{1}{2},\alpha}(q : p : q) = M_{\frac{1}{2},\alpha}(p : q : p)$$

◮ The skew symmetrized α-divergence is defined by:

$$S_{\lambda,\alpha}(p : q) = \lambda\, D_\alpha(p : q) + (1 - \lambda)\, D_\alpha(q : p)$$
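These identities can be verified numerically (a sketch reusing the alpha_divergence helper above; mixed_alpha_divergence is my name):

```python
import numpy as np

def mixed_alpha_divergence(p, x, q, lam, alpha):
    """M_{lam,alpha}(p : x : q) = lam D_a(p : x) + (1 - lam) D_a(x : q)."""
    return (lam * alpha_divergence(p, x, alpha)
            + (1.0 - lam) * alpha_divergence(x, q, alpha))

p, x, q = np.array([0.2, 0.8]), np.array([0.5, 0.5]), np.array([0.7, 0.3])
lam, a = 0.3, 0.6
lhs = mixed_alpha_divergence(p, x, q, lam, a)
rhs = mixed_alpha_divergence(q, x, p, 1.0 - lam, -a)  # M_{1-lam,-a}(q : x : p)
print(np.isclose(lhs, rhs))  # True, by D_a(u : v) = D_{-a}(v : u)
```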

SLIDE 10

Coupled k-Means++ α-Seeding

Algorithm 1: Mixed α-seeding; MAS(H, k, λ, α)

Input: Weighted histogram set H, integer k ≥ 1, real λ ∈ [0, 1], real α ∈ R;
Let C ← {h_j}, with h_j picked with uniform probability;
for i = 2, 3, ..., k do
  Pick at random a histogram h ∈ H with probability:

  $$\pi_H(h) \doteq \frac{w_h\, M_{\lambda,\alpha}(c_h : h : c_h)}{\sum_{y \in H} w_y\, M_{\lambda,\alpha}(c_y : y : c_y)}, \qquad (1)$$

  // where $(c_h, c_h) \doteq \arg\min_{(z,z) \in C} M_{\lambda,\alpha}(z : h : z)$;
  C ← C ∪ {(h, h)};
Output: Set of initial cluster centers C;
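A compact sketch of this seeding under stated assumptions (histograms as rows of a positive NumPy array, uniform weights by default; reuses the mixed_alpha_divergence helper above, and the function name is mine). It mirrors the k-means++ recipe with M_{λ,α} as the potential:

```python
import numpy as np

def mixed_alpha_seeding(H, k, lam, alpha, w=None, rng=None):
    """k-means++-style seeding with the mixed alpha-divergence (a sketch)."""
    rng = rng or np.random.default_rng()
    n = len(H)
    w = np.ones(n) / n if w is None else np.asarray(w, float)
    centers = [H[rng.integers(n)]]           # first seed: uniform pick
    for _ in range(1, k):
        # Cost of each histogram to its closest current seed pair (c, c).
        cost = np.array([min(mixed_alpha_divergence(c, h, c, lam, alpha)
                             for c in centers) for h in H])
        probs = w * cost
        probs /= probs.sum()
        centers.append(H[rng.choice(n, p=probs)])
    # Each seed is duplicated as a pair (l_i, r_i) = (c_i, c_i).
    return [(c.copy(), c.copy()) for c in centers]
```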

SLIDE 11

A guaranteed probabilistic initialization

Let C_{λ,α} denote for short the cost function related to the clustering type chosen (left-, right-, skew Jeffreys or mixed) in MAS, and let C^{opt}_{λ,α} denote the optimal related clustering in k clusters, for λ ∈ [0, 1] and α ∈ (−1, 1). Then, on average with respect to distribution (1), the initial clustering of MAS satisfies:

$$\mathbb{E}_\pi[C_{\lambda,\alpha}] \leq 4 \begin{cases} f(\lambda)\, g(k)\, h^2(\alpha)\, C^{\mathrm{opt}}_{\lambda,\alpha} & \text{if } \lambda \in (0, 1),\\ g(k)\, z(\alpha)\, h^4(\alpha)\, C^{\mathrm{opt}}_{\lambda,\alpha} & \text{otherwise.} \end{cases}$$

Here, $f(\lambda) = \max\left\{\frac{1-\lambda}{\lambda}, \frac{\lambda}{1-\lambda}\right\}$, $g(k) = 2(2 + \log k)$, $z(\alpha) = \left(\frac{1+|\alpha|}{1-|\alpha|}\right)^{\frac{8|\alpha|^2}{(1-|\alpha|)^2}}$ and $h(\alpha) = \max_i p_i^{|\alpha|} / \min_i p_i^{|\alpha|}$, where the min is defined on strictly positive coordinates, and π denotes the picking distribution.

SLIDE 12

Mixed α-hard clustering: MAhC(H, k, λ, α)

Input: Weighted histogram set H, integer k > 0, real λ ∈ [0, 1], real α ∈ R;
Let C = {(l_i, r_i)}_{i=1}^{k} ← MAS(H, k, λ, α);
repeat
  // Assignment
  for i = 1, 2, ..., k do
    A_i ← {h ∈ H : i = arg min_j M_{λ,α}(l_j : h : r_j)};
  // Centroid relocation
  for i = 1, 2, ..., k do

  $$r_i \leftarrow \left(\sum_{h \in A_i} w_h\, h^{\frac{1-\alpha}{2}}\right)^{\frac{2}{1-\alpha}}; \qquad l_i \leftarrow \left(\sum_{h \in A_i} w_h\, h^{\frac{1+\alpha}{2}}\right)^{\frac{2}{1+\alpha}};$$

until convergence;
Output: Partition of H into k clusters following C;
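A sketch of the MAhC loop under the same assumptions as before (rows of a positive NumPy array, uniform weights, α ≠ ±1; reuses the mixed_alpha_divergence and mixed_alpha_seeding helpers, and the function name is mine). The relocation uses the closed-form weighted α-means of the slide:

```python
import numpy as np

def mixed_alpha_hard_clustering(H, k, lam, alpha, n_iter=50):
    """Mixed alpha-hard clustering sketch; assumes alpha != +/-1."""
    C = mixed_alpha_seeding(H, k, lam, alpha)
    labels = np.zeros(len(H), dtype=int)
    for _ in range(n_iter):
        # Assignment: closest (l_j, r_j) pair under M_{lam,alpha}.
        labels = np.array([np.argmin([mixed_alpha_divergence(l, h, r, lam, alpha)
                                      for (l, r) in C]) for h in H])
        # Relocation: closed-form weighted alpha-means per cluster.
        for i in range(k):
            A = H[labels == i]
            if len(A) == 0:
                continue
            wgt = np.ones(len(A)) / len(A)
            r = (wgt @ A**((1 - alpha) / 2))**(2 / (1 - alpha))
            l = (wgt @ A**((1 + alpha) / 2))**(2 / (1 + alpha))
            C[i] = (l, r)
    return labels, C
```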

SLIDE 13

Sided Positive α-Centroids [14]

The left-sided l_α and right-sided r_α positive weighted α-centroid coordinates of a set of n positive histograms h_1, ..., h_n are weighted α-means:

$$r_\alpha^i = f_\alpha^{-1}\!\left(\sum_{j=1}^{n} w_j\, f_\alpha(h_j^i)\right), \qquad l_\alpha^i = r_{-\alpha}^i,$$

with

$$f_\alpha(x) = \begin{cases} x^{\frac{1-\alpha}{2}}, & \alpha \neq \pm 1,\\ \log x, & \alpha = 1. \end{cases}$$
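In code, the α-mean is a quasi-arithmetic mean for the representation function f_α (a sketch; the α = ±1 branch falls back to the weighted geometric mean, consistent with the summary slide that follows):

```python
import numpy as np

def alpha_mean(hs, w, alpha):
    """Coordinate-wise weighted alpha-mean: f_a^{-1}(sum_j w_j f_a(h_j)).
    hs: (n, d) positive array of histograms; w: (n,) weights summing to 1."""
    hs, w = np.asarray(hs, float), np.asarray(w, float)
    if np.isclose(abs(alpha), 1.0):        # f_1(x) = log x: geometric mean
        return np.exp(w @ np.log(hs))
    e = (1.0 - alpha) / 2.0                # f_a(x) = x^{(1-a)/2}
    return (w @ hs**e)**(1.0 / e)

# Right-sided centroid: r_alpha = alpha_mean(hs, w, alpha);
# left-sided centroid:  l_alpha = alpha_mean(hs, w, -alpha)  (l_a = r_{-a}).
```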

SLIDE 14

Sided Frequency α-Centroids [2]

Theorem (Amari, 2007)

The coordinates of the sided frequency α-centroids of a set of n weighted frequency histograms are the normalised weighted α-means.

SLIDE 15

Positive and Frequency α-centroids

Summary:

◮ Right-sided positive α-centroid:

$$r_\alpha^i = \left(\sum_{j=1}^{n} w_j\, (h_j^i)^{\frac{1-\alpha}{2}}\right)^{\frac{2}{1-\alpha}} \;(\alpha \neq 1), \qquad r_1^i = \prod_{j=1}^{n} (h_j^i)^{w_j} \;(\alpha = 1)$$

◮ Left-sided positive α-centroid:

$$l_\alpha^i = r_{-\alpha}^i = \left(\sum_{j=1}^{n} w_j\, (h_j^i)^{\frac{1+\alpha}{2}}\right)^{\frac{2}{1+\alpha}} \;(\alpha \neq -1), \qquad l_{-1}^i = \prod_{j=1}^{n} (h_j^i)^{w_j} \;(\alpha = -1)$$

◮ Frequency α-centroids, obtained by normalizing the positive α-centroids by their cumulative sum w(·):

$$\tilde r_\alpha^i = \frac{r_\alpha^i}{w(r_\alpha)}, \qquad \tilde l_\alpha^i = \tilde r_{-\alpha}^i = \frac{r_{-\alpha}^i}{w(r_{-\alpha})}$$

SLIDE 16

Mixed α-Centroids

The two mixed α-centroids are the minimizers of:

$$\sum_j w_j\, M_{\lambda,\alpha}(l : h_j : r)$$

Generalizing mixed Bregman divergences [16]:

Theorem

The two mixed α-centroids are the left-sided and right-sided α-centroids.

SLIDE 17

Symmetrized Jeffreys-Type α-Centroids

$$S_\alpha(p, q) = \frac{1}{2}\left(D_\alpha(p : q) + D_\alpha(q : p)\right) = S_{-\alpha}(p, q) = M_{\frac{1}{2},\alpha}(p : q : p)$$

For α = ±1, we get half of the Jeffreys divergence:

$$S_{\pm 1}(p, q) = \frac{1}{2} \sum_{i=1}^{d} (p_i - q_i) \log \frac{p_i}{q_i}$$

SLIDE 18

Jeffreys α-divergence and Heinz means

When p̃ and q̃ are frequency histograms, we have for α ≠ ±1:

$$J_\alpha(\tilde p : \tilde q) = \frac{8}{1-\alpha^2}\left(1 - \sum_{i=1}^{d} H_{\frac{1-\alpha}{2}}(\tilde p_i, \tilde q_i)\right)$$

where $H_\beta(a, b)$ is a symmetric Heinz mean [8, 5]:

$$H_\beta(a, b) = \frac{a^\beta b^{1-\beta} + a^{1-\beta} b^\beta}{2}$$

Heinz means interpolate between the geometric and arithmetic means and satisfy the inequality:

$$\sqrt{ab} = H_{\frac{1}{2}}(a, b) \leq H_\beta(a, b) \leq H_0(a, b) = \frac{a + b}{2}.$$
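A numeric sanity check of the Heinz-mean identity (a sketch reusing the alpha_divergence helper; note two reading choices made here, flagged as assumptions: J_α is taken as the symmetrized sum D_α(p̃ : q̃) + D_α(q̃ : p̃), and the identity uses 1 minus the sum of Heinz means, as reconstructed above):

```python
import numpy as np

def heinz_mean(a, b, beta):
    """H_beta(a, b) = (a^beta b^(1-beta) + a^(1-beta) b^beta) / 2."""
    return 0.5 * (a**beta * b**(1 - beta) + a**(1 - beta) * b**beta)

p = np.array([0.2, 0.5, 0.3])   # frequency histograms (sum to 1)
q = np.array([0.4, 0.4, 0.2])
a = 0.6
J = alpha_divergence(p, q, a) + alpha_divergence(q, p, a)   # 2 S_alpha
rhs = 8.0 / (1.0 - a**2) * (1.0 - np.sum(heinz_mean(p, q, (1 - a) / 2)))
print(np.isclose(J, rhs))  # True
```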

SLIDE 19

Jeffreys divergence in the limit case

In the limit α → ±1, S_α(p, q) tends to the Jeffreys divergence:

$$J(p, q) = \mathrm{KL}(p, q) + \mathrm{KL}(q, p) = \sum_{i=1}^{d} (p_i - q_i)(\log p_i - \log q_i)$$

The Jeffreys divergence takes mathematically the same form for frequency histograms:

$$J(\tilde p, \tilde q) = \mathrm{KL}(\tilde p, \tilde q) + \mathrm{KL}(\tilde q, \tilde p) = \sum_{i=1}^{d} (\tilde p_i - \tilde q_i)(\log \tilde p_i - \log \tilde q_i)$$

SLIDE 20

Analytic formula for the positive Jeffreys centroid [12]

Theorem (Jeffreys positive centroid [12])

The Jeffreys positive centroid c = (c^1, ..., c^d) of a set {h_1, ..., h_n} of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W analytic function:

$$c^i = \frac{a^i}{W\!\left(\frac{a^i}{g^i}\, e\right)},$$

where $a^i = \sum_{j=1}^{n} \pi_j h_j^i$ denotes the coordinate-wise arithmetic weighted mean and $g^i = \prod_{j=1}^{n} (h_j^i)^{\pi_j}$ the coordinate-wise geometric weighted mean. The Lambert analytic function W [4] (positive branch) is defined by $W(x) e^{W(x)} = x$ for x ≥ 0.
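The closed form is straightforward to evaluate with SciPy's Lambert W (principal branch); a sketch with uniform weights by default (the function name is mine):

```python
import numpy as np
from scipy.special import lambertw

def jeffreys_positive_centroid(H, weights=None):
    """c^i = a^i / W(e * a^i / g^i), coordinate-wise (a sketch).
    H: (n, d) array of positive histograms; weights: (n,) summing to 1."""
    H = np.asarray(H, float)
    n = len(H)
    w = np.ones(n) / n if weights is None else np.asarray(weights, float)
    a = w @ H                          # weighted arithmetic means, per bin
    g = np.exp(w @ np.log(H))          # weighted geometric means, per bin
    return a / lambertw(np.e * a / g).real   # principal branch W_0
```

Since a^i ≥ g^i by the AM-GM inequality, the argument e·a^i/g^i is at least e, so W stays on the positive branch with W ≥ 1, and each c^i lies below the arithmetic mean a^i.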

SLIDE 21

Jeffreys frequency centroid [12]

Theorem (Jeffreys frequency centroid [12])

Let c̃ denote the Jeffreys frequency centroid and $\tilde c' = \frac{c}{w_c}$ the normalised Jeffreys positive centroid. Then the approximation factor

$$\alpha_{\tilde c'} = \frac{S_1(\tilde c', \tilde{\mathcal{H}})}{S_1(\tilde c, \tilde{\mathcal{H}})}$$

is such that $1 \leq \alpha_{\tilde c'} \leq \frac{1}{w_c}$ (with $w_c \leq 1$).

Better upper bounds in [12].

SLIDE 22

Reducing an n-size problem to a 2-size problem

Generalizes [17] (symmetrized Kullback-Leibler divergence) and [15] (symmetrized Bregman divergence):

Lemma (Reduction property)

The symmetrized J_α-centroid of a set of n weighted histograms amounts to computing the symmetrized α-centroid of the weighted α-mean and −α-mean:

$$\min_x J_\alpha(x, \mathcal{H}) = \min_x \left(D_\alpha(x : r_\alpha) + D_\alpha(l_\alpha : x)\right).$$

SLIDE 23

Frequency symmetrized α-centroid

Minimizer of $\min_{\tilde x \in \Delta_d} \sum_j w_j\, S_\alpha(\tilde x, \tilde h_j)$.

Instead of seeking x̃ in the probability simplex, we can optimize on the unconstrained domain R^{d−1} by using the natural parameter reparameterization [13] of multinomials.

Lemma

The α-divergence for distributions belonging to the same exponential family amounts to computing a divergence on the corresponding natural parameters:

$$A_\alpha(p : q) = \frac{4}{1-\alpha^2}\left(1 - e^{-J_F^{\left(\frac{1-\alpha}{2}\right)}(\theta_p : \theta_q)}\right),$$

where

$$J_F^{\beta}(\theta_1 : \theta_2) = \beta F(\theta_1) + (1 - \beta) F(\theta_2) - F(\beta \theta_1 + (1 - \beta) \theta_2)$$

is a skewed Jensen divergence defined for the log-normaliser F of the family.

SLIDE 24

Implementation (in processing.org)

Snapshot of the α-clustering software. Here, n = 800 frequency histograms of three bins are clustered with k = 8, α = 0.7 and λ = 1/2.

SLIDE 25

Soft Mixed α-Clustering

Learn both α and λ (α-EM [9]):

Input: Histogram set H with |H| = m, integer k > 0, real λ ← λ_init ∈ [0, 1], real α ∈ R;
Let C = {(l_i, r_i)}_{i=1}^{k} ← MAS(H, k, λ, α);
repeat
  // Expectation
  for i = 1, 2, ..., m do
    for j = 1, 2, ..., k do

    $$p(j | h_i) = \frac{\pi_j \exp(-M_{\lambda,\alpha}(l_j : h_i : r_j))}{\sum_{j'} \pi_{j'} \exp(-M_{\lambda,\alpha}(l_{j'} : h_i : r_{j'}))};$$

  // Maximization
  for j = 1, 2, ..., k do

    $$\pi_j \leftarrow \frac{1}{m} \sum_i p(j | h_i); \qquad l_j \leftarrow \left(\frac{\sum_i p(j | h_i)\, h_i^{\frac{1+\alpha}{2}}}{\sum_i p(j | h_i)}\right)^{\frac{2}{1+\alpha}}; \qquad r_j \leftarrow \left(\frac{\sum_i p(j | h_i)\, h_i^{\frac{1-\alpha}{2}}}{\sum_i p(j | h_i)}\right)^{\frac{2}{1-\alpha}};$$

  // Alpha - Lambda

  $$\alpha \leftarrow \alpha - \eta_1 \sum_{j=1}^{k} \sum_{i=1}^{m} p(j | h_i)\, \frac{\partial}{\partial \alpha} M_{\lambda,\alpha}(l_j : h_i : r_j);$$

  if λ_init ≠ 0, 1 then

    $$\lambda \leftarrow \lambda - \eta_2 \left(\sum_{j=1}^{k} \sum_{i=1}^{m} p(j | h_i)\, D_\alpha(l_j : h_i) - \sum_{j=1}^{k} \sum_{i=1}^{m} p(j | h_i)\, D_\alpha(h_i : r_j)\right);$$

  // for some small η_1, η_2; ensure that λ ∈ [0, 1].
until convergence;
Output: Soft clustering of H according to the k densities p(j | ·) following C;
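A condensed sketch of the E- and M-steps (α and λ held fixed for brevity, so the two gradient updates above are omitted; assumes rows of a positive NumPy array and the helpers defined earlier, with the function name mine):

```python
import numpy as np

def soft_mixed_alpha_em(H, k, lam, alpha, n_iter=30):
    """Soft mixed alpha-clustering: E-step responsibilities, M-step
    mixture weights and weighted alpha-mean centroids (a sketch)."""
    C = mixed_alpha_seeding(H, k, lam, alpha)
    pi = np.ones(k) / k
    for _ in range(n_iter):
        # E-step: responsibilities p(j | h_i).
        M = np.array([[mixed_alpha_divergence(l, h, r, lam, alpha)
                       for (l, r) in C] for h in H])        # shape (m, k)
        R = pi * np.exp(-M)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: mixture weights and responsibility-weighted alpha-means.
        pi = R.mean(axis=0)
        for j in range(k):
            w = R[:, j] / R[:, j].sum()
            l = (w @ H**((1 + alpha) / 2))**(2 / (1 + alpha))
            r = (w @ H**((1 - alpha) / 2))**(2 / (1 - alpha))
            C[j] = (l, r)
    return R, pi, C
```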

SLIDE 26

Summary

1. Mixed divergences, mixed divergence k-means++ seeding, coupled k-means seeding
2. Sided left or right α-centroid k-means
3. Coupled k-means with respect to mixed α-divergences relying on dual α-centroids
4. Symmetrized Jeffreys-type α-centroid (variational) k-means

All technical proofs and details in: Entropy 16(6): 3273-3301 (2014)

SLIDE 27

Bibliographic references I

[1] Syed Mumtaz Ali and Samuel David Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28:131-142, 1966.

[2] Shun-ichi Amari. Integration of stochastic models by minimizing α-divergence. Neural Computation, 19(10):2780-2796, 2007.

[3] Shun-ichi Amari. α-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Transactions on Information Theory, 55(11):4925-4931, 2009.

[4] D. A. Barry, P. J. Culligan-Hensley, and S. J. Barry. Real values of the W-function. ACM Trans. Math. Softw., 21(2):161-171, June 1995.

[5] Ádám Besenyei. On the invariance equation for Heinz means. Mathematical Inequalities & Applications, 15(4):973-979, 2012.

[6] Andrzej Cichocki, Sergio Cruces, and Shun-ichi Amari. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy, 13(1):134-170, 2011.

[7] Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229-318, 1967.

SLIDE 28

Bibliographic references II

[8] Erhard Heinz. Beiträge zur Störungstheorie der Spektralzerlegung. Mathematische Annalen, 123:415-438, 1951.

[9] Yasuo Matsuyama. The alpha-EM algorithm: surrogate likelihood maximization using alpha-logarithmic information measures. IEEE Transactions on Information Theory, 49(3):692-706, 2003.

[10] Tetsuzo Morimoto. Markov processes and the H-theorem. Journal of the Physical Society of Japan, 18(3), March 1963.

[11] Frank Nielsen. A family of statistical symmetric divergences based on Jensen's inequality. CoRR, abs/1009.4004, 2010.

[12] Frank Nielsen. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Processing Letters (SPL), 20(7), July 2013.

[13] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. arXiv:0911.4863, 2009.

[14] Frank Nielsen and Richard Nock. The dual Voronoi diagrams with respect to representational Bregman divergences. In International Symposium on Voronoi Diagrams (ISVD), pages 71-78, DTU Lyngby, Denmark, June 2009. IEEE.

SLIDE 29

Bibliographic references III

[15] Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6):2882-2904, 2009.

[16] Richard Nock, Panu Luosto, and Jyrki Kivinen. Mixed Bregman clustering with approximation guarantees. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 154-169, Berlin, Heidelberg, 2008. Springer-Verlag.

[17] Raymond N. J. Veldhuis. The centroid of the symmetrical Kullback-Leibler distance. IEEE Signal Processing Letters, 9(3):96-99, March 2002.

[18] Huaiyu Zhu and Richard Rohwer. Measurements of generalisation based on information geometry. In Stephen W. Ellacott, John C. Mason, and Iain J. Anderson, editors, Mathematics of Neural Networks, volume 8 of Operations Research/Computer Science Interfaces Series, pages 394-398. Springer US, 1997.