On approximate geometric k-clustering, Jiri Matousek, DCG 2000



SLIDE 1

On approximate geometric k-clustering

Jiri Matousek DCG 2000

SLIDE 2

Problem

  • Given an n-point set X and k > 1, find a partition into k clusters of minimum cost

  • “Geometric” cost function
  • Approximately optimal
SLIDE 3

Results

  • 2-clustering

– O(n log n) approximation algorithm for fixed ε > 0

  • k-clustering
  • Can be improved with a known lower bound on cluster size

SLIDE 4

Really Easy Problem

  • Given X and cluster centers c1, c2, …, ck, find the optimal clustering of X
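This subproblem can be sketched in a few lines: with the centers fixed, the optimal assignment sends every point of X to its nearest center (a Voronoi partition). The squared-Euclidean cost below is an assumption about the slides' unspecified "geometric" cost function.

```python
# Sketch of the "really easy" subproblem: fixed centers, optimal assignment.
# Cost = sum of squared Euclidean distances to the assigned center
# (assumed; the slides do not spell out the cost function here).

def voronoi_partition(X, centers):
    """Assign each point of X to its nearest center; return clusters and cost."""
    clusters = [[] for _ in centers]
    cost = 0.0
    for x in X:
        # squared Euclidean distance to every center
        d2 = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
        i = min(range(len(centers)), key=d2.__getitem__)
        clusters[i].append(x)
        cost += d2[i]
    return clusters, cost

X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 4.9)]
clusters, cost = voronoi_partition(X, [(0.0, 0.0), (5.0, 5.0)])
```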

SLIDE 5

General Approach

  • Snap X to a grid
  • Cover the space with points

– potential cluster centers – polynomial in n

  • Test subsets of size k to find the best centers

– only test sufficiently different subsets

  • Aiming for a “near linear” algorithm
SLIDE 6

Polynomial Grid

  • Thm: Let d and k0 be fixed. Suppose there is an algorithm A that, for a given ε > 0, k ≤ k0, and an n-point multiset X’ ⊂ Rd with points lying on an integer grid of size O(n3/ε), finds a (1+ε)-approximately optimal k-clustering of X’. Then a (1+ε)-approximately optimal k0-clustering for an arbitrary n-point set X ⊂ Rd can be computed with O(n log n) preprocessing and with at most C calls to algorithm A, with various at most n-point sets X’, with k ≤ k0, and with αε instead of ε, where α > 0 and C are constants.

  • Pf: Grid size δ = αε∆/(5n2), with ∆ = diam(X)/n. Let X be the original set and X’ the snapped set. If Π is a clustering of X and Π’ is the corresponding clustering of X’:

– max change in squared distance to a cluster center: |diam(X)2 – (diam(X) + 2δ)2| ≤ 5δ·diam(X)
– If cost(Π’) ≥ (1/20)∆2, then since Π’ is (1+αε)-approximate for X’, the corresponding Π is (1+ε)-approximate for X
– Otherwise, apply the algorithm recursively to groups of nearby clusters
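The snapping step in the proof above can be sketched as rounding every coordinate to the nearest grid point. The cell size follows the slide's δ = αε∆/(5n²); α is only known to be some positive constant, so taking a concrete δ below is purely illustrative.

```python
# Sketch of the grid-snapping step: each point of X moves to the nearest
# grid point of cell size delta. A point moves by at most delta*sqrt(d)/2,
# so each pairwise distance changes by O(delta), which is where the
# 5*delta*diam(X) bound in the proof comes from.

def snap_to_grid(X, delta):
    """Round every coordinate of every point to the nearest multiple of delta."""
    return [tuple(round(xi / delta) * delta for xi in x) for x in X]

X = [(0.13, 0.27), (1.92, 0.51)]
Xsnapped = snap_to_grid(X, 0.1)   # illustrative delta, not the formula's value
```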

SLIDE 7

Approximate Centroid Sets

  • Dfn: C is an ε-approximate centroid set for X if it intersects the ε-tolerance ball of every subset of X (of size at least s)

– ε-tolerance ball for S is centered at c(S), with radius

  • Thm: Let X be a finite point set, let k ≥ 2, and let C be an ε-approximate centroid set for X with cluster size s. Then there are c1, c2, …, ck ∈ C whose induced clustering is within a (1+ε) factor of the cost of any k-clustering of X with all clusters of size at least s

  • Pf: By algebra using definition of centroid sets
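For concreteness, a minimal sketch of the centroid c(S) and the clustering cost these definitions refer to, assuming the cost is the sum of squared distances of each point to its cluster's centroid (consistent with the squared-distance bounds on slide 6; the slide's own formula did not survive conversion):

```python
# Centroid c(S) and the (assumed) geometric cost function:
# cost(partition) = sum over clusters S of sum_{x in S} ||x - c(S)||^2.

def centroid(S):
    """Coordinate-wise mean of a non-empty list of points."""
    d = len(S[0])
    return tuple(sum(x[i] for x in S) / len(S) for i in range(d))

def clustering_cost(partition):
    """Sum of squared distances of every point to its cluster's centroid."""
    cost = 0.0
    for S in partition:
        c = centroid(S)
        cost += sum(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for x in S)
    return cost

cost = clustering_cost([[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]])
```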
SLIDE 8

Construction

SLIDE 9

Subdivide as long as Q contains at least s/2^(d+1) points

Construction

SLIDE 10

Construction

  • Smallest σ larger than diam(B)
  • B intersects at most 2^d cubes
  • Some cube has ≥ s/2^(d+1) points
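The construction on slides 8-10 can be sketched as a quadtree-style subdivision: split any cube that still holds at least s/2^(d+1) points into 2^d half-size cubes. Only the stopping rule is from the slides; everything else (which cubes are collected, how candidate centers are read off) is an assumption, and coincident points would need a depth cap in a real implementation.

```python
from itertools import product

# Quadtree-style subdivision sketch: recursively split a cube while it
# contains >= threshold points (threshold plays the role of s/2^(d+1));
# leaves are the cubes that can no longer be split.

def subdivide(points, corner, side, threshold, out):
    """Split a cube while it holds >= threshold points; collect leaf cubes."""
    inside = [p for p in points
              if all(corner[i] <= p[i] < corner[i] + side for i in range(len(corner)))]
    if len(inside) < threshold:
        out.append((corner, side))   # leaf: too few points to subdivide further
        return
    half = side / 2.0
    for offs in product((0.0, half), repeat=len(corner)):   # the 2^d sub-cubes
        subdivide(inside, tuple(c + o for c, o in zip(corner, offs)), half, threshold, out)

leaves = []
subdivide([(0.1, 0.1), (0.2, 0.15), (0.9, 0.9)], (0.0, 0.0), 1.0, 2, leaves)
```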

SLIDE 11

Centroid Sets

  • Thm: C has size polynomial in n, and the construction can be performed in near-linear time

  • Pf: There are at most a bounded number of distinct side lengths, and each cube contains at least s/2^(d+1) points, so the number of cubes (and hence of candidate centers) is bounded

SLIDE 12

Well-separated Pairs

  • Dfn: (x, y) and (x’, y’) are ε-near if |x − x’| ≤ ε·|x − y| and |y − y’| ≤ ε·|x − y|

  • Set P is ε-separated if no two of its pairs are ε-near
  • P is ε-complete for X if for every pair in X there is an ε-near pair in P
  • k-tuples (c1, c2, …, ck) and (c1’, c2’, …, ck’) are ε-near (complete, separated) if all pairs (ci, cj) and (ci’, cj’) are ε-near (complete, separated)

[Figure: a pair (x, y) at distance r; an ε-near pair (x’, y’) has x’ and y’ within radius εr of x and y]
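A sketch of one plausible reading of the ε-near test: with r = |x − y|, the pair (x’, y’) is ε-near (x, y) if x’ lies within εr of x and y’ within εr of y. The radii r and εr match the figure, but the exact inequalities are an assumption, since the slide's formulas were lost.

```python
from math import dist  # Euclidean distance, Python 3.8+

# Hedged reconstruction of the epsilon-near predicate for pairs of points.

def eps_near(pair, pair2, eps):
    """True if pair2 = (x2, y2) lies within eps*|x - y| of pair = (x, y), endpoint-wise."""
    (x, y), (x2, y2) = pair, pair2
    r = dist(x, y)
    return dist(x, x2) <= eps * r and dist(y, y2) <= eps * r

near = eps_near(((0.0, 0.0), (10.0, 0.0)), ((0.5, 0.0), (10.0, 0.5)), 0.1)
```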

SLIDE 13

Approximate k-clustering

  • Thm: Let (c1, c2, …, ck) and (c1’, c2’, …, ck’) be two k-tuples in Rd that are ε-near, ε ≤ 1/9. Let Π = ΠVor(c1, c2, …, ck) and Π’ = ΠVor(c1’, c2’, …, ck’); then cost(Π’) ≤ (1+6ε)·cost(Π)

  • Pf:

[Figure: centers c1, c2 and perturbed centers c1’, c2’ within εδ of them; a point x near the bisector may move from cluster S1 to S1’]

SLIDE 14

Approximate k-clustering

  • Instead of looking at all k-tuples in C, we only need to look at an ε-complete set

– Still too many for a near-linear algorithm

  • Look at ε-well-spread tuples instead, i.e. tuples in which no subset is 1/ε-isolated

– Y ⊆ X is 1/ε-isolated if there is an x ∈ X \ Y at distance at least (1/ε)·diam(Y) from Y
– If X is ε-well spread, then diam(X) ≤ (2/ε)^(k-1)·δ

[Figure: illustration of a 1/ε-isolated subset, with distances d, R, and (1/ε)d]

SLIDE 15

Building Cluster Centers

  • Let C ⊂ Rd be an m-point set. Then we can compute a set C’ of k-tuples s.t.

– For any ε-well-spread k-tuple in C, there is a tuple in C’ that is ε-near it
– |C’| = O(m·ε^(-k2d))
– Each k-tuple in C’ is ε/2-well spread
– At least one point in each k-tuple in C’ belongs to C
– There are no more than O(1) k-tuples of C’ lying near any given k-tuple in Rd
– The minimum and maximum distance of points in each k-tuple in C’ are bounded by constant multiples of the minimum and maximum distance in C

  • The running time is O(m log m + m·ε^(-k2d))
SLIDE 16

Building Cluster Centers

  • Generate an ε/2-complete set of pairs P ⊆ C × C

– There are O(m·ε^(-d)) pairs, running time O(m log m + m·ε^(-d))
– Each pair will be the basis for an ε/2-well-spread k-tuple

[Figure: a pair (x, y) at distance δ]

SLIDE 17

Algorithm

  • If k = 1, return X
  • For k* = 2, 3, …, k generate the set C* of k*-tuples for C
  • For each (c1, c2, …, ck*) ∈ C*, let (X1, X2, …, Xk*) = ΠVor(c1, c2, …, ck*)

  • Let Ci be the points lying in the εδ-neighborhood of ci

– δ is the smallest pairwise distance among c1, …, ck*

  • For i = 1, …, k*, call the algorithm recursively on Xi and Ci

– vary the number of clusters from 1 to k − k* + 1
– find the combination k1 + k2 + … + kk* = k with the smallest cost

  • For each k*-tuple in C* with all Ci non-empty, output the one with the smallest cost
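A brute-force stand-in for the enumeration in the steps above: try every k-tuple of candidate centers, Voronoi-partition X around it, and keep the cheapest clustering under the (assumed) centroid cost. The real algorithm prunes this search to ε-well-spread tuples and recurses on the εδ-neighborhoods Ci, which is what brings the running time down; none of that pruning is reproduced here, so this sketch runs in O(m^k) tuple trials.

```python
from itertools import combinations

# Brute-force approximate k-clustering via candidate-center enumeration:
# for each k-tuple of candidates, build the Voronoi partition of X and
# score it by sum of squared distances to cluster centroids.

def best_clustering(X, candidates, k):
    best = (float("inf"), None)
    for centers in combinations(candidates, k):
        clusters = [[] for _ in range(k)]
        for x in X:
            # assign x to its nearest center (squared Euclidean distance)
            i = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
            clusters[i].append(x)
        cost = 0.0
        for S in clusters:
            if not S:
                continue
            c = [sum(p[t] for p in S) / len(S) for t in range(len(S[0]))]
            cost += sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in S)
        if cost < best[0]:
            best = (cost, clusters)
    return best

X = [(0.0, 0.0), (0.2, 0.0), (4.0, 4.0), (4.0, 4.2)]
cost, clusters = best_clustering(X, X, 2)   # candidates = the points themselves
```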

SLIDE 18

Algorithm

[Figure: example with centers c1, …, c4; k = 10, k* = 4; the recursion on X3 uses candidate set C3, the εδ-neighborhood of c3, trying ki = 1, 2, …, 7]

SLIDE 19

Correctness

  • For any k-tuple (c1, …, ck), the algorithm generates a tuple that is ε-near it

SLIDE 20

Running Time

  • All range queries are done by approximate range searching in time O(log n)

  • Approximate Voronoi partitioning can be done in time O(log n + ε^(-2(d-1)))

  • Running time:
SLIDE 21