On approximate geometric k-clustering, Jiri Matousek, DCG 2000
Problem
- Given an n-point set and k > 1, find a partition into k clusters of minimum cost
- “Geometric” cost function
- Approximately optimal
Results
- 2-clustering
– O(n log n) (1+ε)-approximation algorithm for fixed ε > 0
- k-clustering
- Can be improved with known lower bound on
cluster size
Really Easy Problem
- Given X and cluster centers c1, c2, …, ck, find
the optimal clustering of X
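With the centers fixed, each point simply goes to its nearest center, which gives the Voronoi partition used later in the slides as ΠVor. A minimal sketch follows; the function names `sq_dist` and `voronoi_clustering` are illustrative and not from the source, and ties are broken toward the lowest center index as an arbitrary choice.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def voronoi_clustering(points: List[Point], centers: List[Point]) -> List[List[Point]]:
    """Assign every point to its nearest center (hypothetical helper)."""
    clusters: List[List[Point]] = [[] for _ in centers]
    for p in points:
        # min() returns the first index on ties, so ties go to the lowest index.
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


X = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 1.0)]
clusters = voronoi_clustering(X, [(0.0, 0.0), (10.0, 0.0)])
```

This takes O(nk) distance evaluations for n points and k centers, which is why the slide calls the problem "really easy".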
General Approach
- Snap X to a grid
- Cover the space with points
– potential cluster centers
– polynomial in n
- Test subsets of size k to find the best centers
– only test sufficiently different subsets
- Aiming for a “near linear” algorithm
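The "test subsets of size k" step can be sketched as a brute-force search over a candidate set. This is only the naive version: it tests every k-subset, without the pruning to "sufficiently different" subsets that the slides rely on for a near-linear bound. The cost used here (sum of squared distances to the nearest chosen center) is an assumption, since the slides do not spell out the cost function, and `assignment_cost` and `best_centers` are hypothetical names.

```python
from itertools import combinations
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assignment_cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Assumed cost: each point pays its squared distance to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def best_centers(points: Sequence[Point], candidates: Sequence[Point], k: int):
    """Try every k-subset of the candidate centers and keep the cheapest one."""
    return min(combinations(candidates, k),
               key=lambda cs: assignment_cost(points, cs))


X = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
candidates = [(0.5, 0.0), (5.0, 0.0), (9.5, 0.0)]
best = best_centers(X, candidates, k=2)
```

With m candidates this tests on the order of m^k subsets, which is the blow-up the later slides try to avoid.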
Polynomial Grid
- Thm: Let d and k0 be fixed. Suppose there is an algorithm A that, for a given ε > 0, k ≤ k0, and an n-point multiset X’ ⊂ R^d with points lying on an integer grid of size O(n^3/ε), finds a (1+ε)-approximately optimal k-clustering of X’. Then a (1+ε)-approximately optimal k0-clustering for an arbitrary n-point set X ⊂ R^d can be computed with O(n log n) preprocessing and at most C calls to algorithm A, each on a multiset X’ of at most n points, with k ≤ k0, and with αε instead of ε, where α > 0 and C are constants.
- Pf: Use grid size δ = αε∆/(5n^2), where ∆ = diam(X)/n. Let X be the original set and X’ the snapped set. If Π is a clustering of X, let Π’ be the corresponding clustering of X’.
– The maximum change in a squared distance to a cluster center is |diam(X)^2 - (diam(X) + 2δ)^2| ≤ 5δ diam(X)
– If cost(Π’) > (1/20)∆^2, then since Π’ is (1+αε)-approximate for X’, the corresponding Π is (1+ε)-approximate for X
– Otherwise, apply the algorithm recursively to groups of nearby clusters
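The snapping step in the proof can be illustrated as follows. This is a sketch under stated assumptions: `alpha=1.0` is a placeholder (the source only says α > 0 is a constant), the diameter is computed by brute force rather than within the O(n log n) preprocessing budget, and the input must contain at least two distinct points. The names `diameter` and `snap_to_grid` are illustrative.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Sequence[float]


def diameter(points: List[Point]) -> float:
    """Brute-force diameter, O(n^2); fine for a small illustration."""
    return max(math.dist(p, q) for p, q in combinations(points, 2))


def snap_to_grid(points: List[Point], eps: float,
                 alpha: float = 1.0) -> Tuple[List[Tuple[float, ...]], float]:
    """Round each coordinate to the nearest multiple of the grid size.

    Grid size follows the slide: step = alpha * eps * D / (5 n^2), D = diam(X) / n.
    alpha is an assumed placeholder value.
    """
    n = len(points)
    big_delta = diameter(points) / n
    step = alpha * eps * big_delta / (5 * n * n)
    snapped = [tuple(round(c / step) * step for c in p) for p in points]
    return snapped, step


X = [(0.0, 0.0), (3.0, 4.0), (1.23, 0.77)]
snapped, step = snap_to_grid(X, eps=0.5)
```

Each coordinate moves by at most half a grid step, so every pairwise distance changes by a small amount relative to diam(X), which is what the bound in the proof uses.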
Approximate Centroid Sets
- Dfn: C is an ε-approximate centroid set for X if it intersects the ε-tolerance ball of every subset S of X of size at least s
– ε-tolerance ball for S is centered at c(S), with radius
- Thm: Let S ⊂ R^d be a finite point set, let k ≥ 2, and let C be an ε-approximate centroid set for S with cluster size s. Then there are c1, c2, …, ck ∈ C s.t. for all k-clusterings of S with all clusters of size at least s
- Pf: By algebra using definition of centroid sets
Construction
Subdivide as long as Q contains at least s/2^(d+1) points
Construction
- Take the closest side length σ larger than diam(B)
- B intersects at most 2^d cubes
- Some cube has s/2^(d+1) points
Centroid Sets
- Thm: and the
construction can be performed in time
- Pf: There are at most different side
lengths, and each cube contains at least points, so there are
Well-separated Pairs
- Dfn: (x, y) and (x’, y’) are ε-near if ‖x - x’‖ ≤ ε‖x - y‖ and ‖y - y’‖ ≤ ε‖x - y‖
- A set P of pairs is ε-separated if no two pairs in P are ε-near
- P is ε-complete for X if for every pair in X there is an ε-near pair in P
- k-tuples (c1, c2, …, ck) and (c1’, c2’, …, ck’) are ε-near if all pairs (ci, cj) and (ci’, cj’) are ε-near; ε-complete and ε-separated sets of k-tuples are defined in the same way
[Figure: points x, x’, y’, y with distances r and εr marked]
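The ε-near test can be sketched in code. The exact inequalities did not survive in the slide text, so the version below is an assumption reconstructed from the figure labels r and εr: with r the distance between x and y, each primed point must lie within εr of its unprimed counterpart. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Sequence[float]
Pair = Tuple[Point, Point]


def pairs_eps_near(pair: Pair, other: Pair, eps: float) -> bool:
    """Assumed definition: |x - x'| <= eps*r and |y - y'| <= eps*r, r = |x - y|."""
    (x, y), (x2, y2) = pair, other
    r = math.dist(x, y)
    return math.dist(x, x2) <= eps * r and math.dist(y, y2) <= eps * r


def tuples_eps_near(t: Sequence[Point], t2: Sequence[Point], eps: float) -> bool:
    """k-tuples are eps-near if every pair (ci, cj), (ci', cj') is eps-near."""
    k = len(t)
    return all(pairs_eps_near((t[i], t[j]), (t2[i], t2[j]), eps)
               for i in range(k) for j in range(i + 1, k))


a = ((0.0, 0.0), (10.0, 0.0))
near = pairs_eps_near(a, ((0.5, 0.0), (10.0, -0.5)), eps=0.1)  # both shifts 0.5 <= 1.0
far = pairs_eps_near(a, ((2.0, 0.0), (10.0, 0.0)), eps=0.1)    # shift 2.0 > 1.0
```

Note that under this reading the relation measures r on the first pair only, so it is not exactly symmetric in its two arguments.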
Approximate k-clustering
- Thm: Let (c1, c2, …, ck) and (c1’, c2’, …, ck’) be two k-tuples in R^d that are ε-near, ε ≤ 1/9. Let Π = ΠVor(c1, c2, …, ck) and Π’ = ΠVor(c1’, c2’, …, ck’). Then cost(Π’) ≤ (1+6ε) cost(Π)
- Pf:
[Figure: centers c1, c2 and c1’, c2’, distances δ and εδ, a point x, and clusters S1 and S1’]
Approximate k-clustering
- Instead of looking at all k-tuples in C, we only need to look at an ε-complete set.
– Still too many for near-linear algorithm
- Look at ε-well spread tuples instead, i.e. tuples in which no subset is 1/ε-isolated
– Y ⊂ X is 1/ε-isolated if there is x ∈ X \ Y that is at distance (1/ε) diam(Y) from Y
– If X is ε-well spread, then diam(X) ≤ (2/ε)^(k-1) δ
[Figure: a point x, with distances d, (1/ε)d, R and εR marked]
Building Cluster Centers
- Let C ⊂ R^d be an m-point set. Then we can compute a set C’ of k-tuples s.t.
– For any ε-well spread k-tuple in C, there is a tuple in C’ that is ε-near it
– |C’| = O(m ε^(-k^2 d))
– Each k-tuple in C’ is ε/2-well spread
– At least one point in each k-tuple in C’ belongs to C
– There are no more than O(1) k-tuples of C’ lying near any given k-tuple in R^d
– The minimum and maximum distance of points in each k-tuple in C’ are bounded by constant multiples of the minimum and maximum distance in C
- The running time is O(m log m + m ε^(-k^2 d))
Building Cluster Centers
- Generate an ε/2-complete set of pairs P ⊂ C × C
– There are O(m ε^(-d)) pairs; the running time is O(m log m + m ε^(-d))
– Each pair will be a basis for an ε/2-well spread k-tuple
[Figure: a pair x, y at distance δ]
Algorithm
- If k=1, return X
- For k* = 2, 3, …, k, generate the set C* of k*-tuples for C
- For each (c1, c2, …, ck*) ∈ C*, let (X1, X2, …, Xk*) = ΠVor(c1, c2, …, ck*)
- Let Ci be the points lying in the εδ-neighborhood of ci
– δ is the smallest pairwise distance among c1, …, ck*
- For i = 1, …, k*, call the algorithm recursively on Xi and Ci
– vary the number of clusters from 1 to k-k*+1
– find the combination k1 + k2 + … + kk* = k with the smallest cost
- Among the k*-tuples in C* with all Ci non-empty, output the one with the smallest cost
[Figure: centers c1, c2, c3, c4]
Algorithm
[Figure: example with k = 10 and k* = 4, so each ki ranges over 1, 2, …, 7; shows δ, the εδ-neighborhood, and the sets X3 and C3]
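The "find the combination k1 + k2 + … + kk* = k with the smallest cost" step is a small knapsack-style dynamic program once the recursive calls have returned a cost for each part Xi and each cluster count. A minimal sketch, assuming the recursive results are available as tables `costs[i][j-1]` (cost of the best j-clustering found for Xi); the table layout and the name `best_split` are illustrative and not from the source.

```python
from typing import Dict, List, Tuple


def best_split(costs: List[List[float]], k: int) -> Tuple[float, List[int]]:
    """Choose k_i >= 1 per part with sum(k_i) == k, minimizing the total cost.

    costs[i][j-1] is the (assumed precomputed) cost of splitting part i
    into j clusters. Returns (total cost, [k_1, ..., k_m]), or
    (inf, []) if no valid split exists.
    """
    # best[t] = cheapest (cost, split) over the parts seen so far using t clusters.
    best: Dict[int, Tuple[float, List[int]]] = {0: (0.0, [])}
    for part in costs:
        nxt: Dict[int, Tuple[float, List[int]]] = {}
        for used, (c, split) in best.items():
            for j, cj in enumerate(part, start=1):
                t = used + j
                if t > k:
                    break  # larger j only increases the cluster count
                cand = c + cj
                if t not in nxt or cand < nxt[t][0]:
                    nxt[t] = (cand, split + [j])
        best = nxt
    return best.get(k, (float("inf"), []))


# Two parts, three clusters in total: (1, 2) costs 10 + 2, (2, 1) costs 3 + 8.
total, split = best_split([[10, 3, 1], [8, 2, 0]], k=3)
```

In the slide's example (k = 10, k* = 4) each table would have 7 entries, matching the range 1 to k-k*+1.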
Correctness
- For any k-tuple (c1,…,ck), the algorithm generates a tuple that
is ε-near
Running Time
- All range queries are done by approximate range searching in
time O(log n)
- Approximate Voronoi partitioning can be done in time O(log n + ε^(-2(d-1)))
- Running time: