On approximate geometric k-clustering, Jiri Matousek, DCG 2000


Aug 01, 2023


SLIDE 1

On approximate geometric k-clustering

Jiri Matousek DCG 2000

SLIDE 2

Problem

Given an n-point set and k > 1, find a partition into k clusters of minimum cost
“Geometric” cost function
Approximately optimal

SLIDE 3

Results

2-clustering
– O(n log n) (1+ε)-approximation algorithm for fixed ε > 0
k-clustering
Can be improved with a known lower bound on cluster size

SLIDE 4

Really Easy Problem

Given X and cluster centers c1, c2, …, ck, find the optimal clustering of X
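A minimal sketch of why this problem is easy: with the centers fixed, each point simply goes to its nearest center, which gives the Voronoi partition of X. The function names are illustrative, and the cost used here (sum of squared distances to cluster centroids) is an assumption, since the slides only say "geometric" cost function.

```python
from math import dist

def voronoi_partition(points, centers):
    """Assign each point to its nearest center (ties go to the lowest index)."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(cluster[0])
    return tuple(sum(p[j] for p in cluster) / len(cluster) for j in range(dim))

def clustering_cost(clusters):
    """Assumed cost: sum of squared distances of points to their cluster centroid."""
    total = 0.0
    for cluster in clusters:
        if cluster:
            c = centroid(cluster)
            total += sum(dist(p, c) ** 2 for p in cluster)
    return total

X = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
clusters = voronoi_partition(X, [(0.0, 0.0), (10.0, 0.0)])
print(clusters, clustering_cost(clusters))
```

This brute-force assignment takes O(nk) distance computations; it is only meant to illustrate the partition, not the approximate Voronoi partitioning used for the running-time bound.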

SLIDE 5

General Approach

Snap X to a grid
Cover the space with points
– potential cluster centers
– polynomial in n
Test subsets of size k to find the best centers
– only test sufficiently different subsets
Aiming for a “near linear” algorithm

SLIDE 6

Polynomial Grid

Thm: Let d and k0 be fixed. Suppose there is an algorithm A that, for a given ε > 0, k ≤ k0, and an n-point multiset X’ ⊂ R^d with points lying on an integer grid of size O(n^3/ε), finds a (1+ε)-approximately optimal k-clustering of X’. Then a (1+ε)-approximately optimal k0-clustering for an arbitrary n-point subset X ⊂ R^d can be computed with O(n log n) preprocessing and with at most C calls to algorithm A, with various at most n-point sets X’, with k ≤ k0, and with αε instead of ε, where α > 0 and C are constants.

Pf: Grid size δ = αε∆/(5n^2), where ∆ = diam(X)/n. Let X be the original set and X’ the snapped set. Let Π be a clustering of X and Π’ the corresponding clustering of X’.
– Max change in squared distance to a cluster center: |diam(X)^2 − (diam(X) + 2δ)^2| ≤ 5δ diam(X)
– If cost(Π’) > (1/20)∆^2, then since Π’ is (1+αε)-approximate for X’, the corresponding Π is (1+ε)-approximate for X
– Otherwise, apply the algorithm recursively to groups of nearby clusters
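The snapping step in the proof can be sketched as follows. This is an illustration only: `alpha` is a placeholder for the unspecified constant α, the function names are hypothetical, and the diameter is computed by brute force in O(n^2) time rather than within the O(n log n) preprocessing bound of the theorem.

```python
from itertools import combinations
from math import dist, floor

def diameter(points):
    """Largest pairwise distance (brute force, for illustration only)."""
    return max(dist(p, q) for p, q in combinations(points, 2))

def snap_to_grid(points, eps, alpha=1.0):
    """Round each point to the nearest node of a grid with cell size delta.

    Returns integer grid coordinates and delta, so that the snapped position
    of a point is delta times its integer coordinates. Several points may
    land on the same node, which is why the snapped set is a multiset.
    """
    n = len(points)
    big_delta = diameter(points) / n                # Delta = diam(X) / n
    delta = alpha * eps * big_delta / (5 * n * n)   # delta = alpha*eps*Delta / (5 n^2)
    snapped = [tuple(floor(c / delta + 0.5) for c in p) for p in points]
    return snapped, delta

snapped, delta = snap_to_grid([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)], eps=0.5)
print(delta, snapped)
```

Each coordinate moves by at most δ/2, so any distance between two points changes by a small multiple of δ; the proof turns a bound of this kind into the 5δ·diam(X) estimate on squared distances.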

SLIDE 7

Approximate Centroid Sets

Dfn: C is an ε-approximate centroid set for X if it intersects the ε-tolerance ball of every subset of X (of size at least s)

– ε-tolerance ball for S is centered at c(S), with radius

Thm: Let S be a finite point set, let k ≥ 2, and let C be an ε-approximate centroid set for S with cluster size s. Then there are c1, c2, …, ck ∈ C s.t. for all k-clusterings of S with all clusters of size at least s

Pf: By algebra using definition of centroid sets

SLIDE 8

Construction

SLIDE 9

Construction

Subdivide as long as Q contains at least s/2^(d+1) points
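The subdivision rule on this slide can be sketched as a quadtree-style recursion. This is a hedged illustration of the stopping rule only: the slides do not show how candidate centers are placed in the resulting cubes, the function name and the `(corner, side)` representation are hypothetical, and the extra stop for cubes whose points all coincide is added here so that the sketch always terminates.

```python
def subdivide(points, corner, side, s):
    """Collect the cubes obtained by splitting a cube Q while it holds
    at least s / 2**(d+1) points. Cubes are (lower corner, side length)."""
    d = len(corner)
    threshold = s / 2 ** (d + 1)
    cubes = []
    stack = [(corner, side, list(points))]
    while stack:
        corner, side, pts = stack.pop()
        cubes.append((corner, side))
        # Stop when the cube is too sparse or cannot be separated further.
        if len(pts) < threshold or len(set(pts)) <= 1:
            continue
        half = side / 2
        children = {}
        for p in pts:
            # key[j] is 1 if p lies in the upper half of Q along axis j.
            key = tuple(int(p[j] >= corner[j] + half) for j in range(d))
            children.setdefault(key, []).append(p)
        # Only non-empty subcubes are kept.
        for key, child_pts in children.items():
            child_corner = tuple(corner[j] + key[j] * half for j in range(d))
            stack.append((child_corner, half, child_pts))
    return cubes

pts = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9), (0.6, 0.1)]
cubes = subdivide(pts, (0.0, 0.0), 1.0, s=16)
print(len(cubes), min(side for _, side in cubes))
```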

SLIDE 10

Construction

Closest σ larger than diam(B)
B intersects at most 2^d cubes
Some cube has s/2^(d+1) points

SLIDE 11

Centroid Sets

Thm: and the

construction can be performed in time

Pf: There are at most different side

lengths, and each cube contains at least points, so there are

SLIDE 12

Well-separated Pairs

Dfn: (x, y) and (x’, y’) are ε-near if

and

Set P is ε-separated if no two of its pairs are ε-near
P is ε-complete for X if for every pair in X, there is an ε-near pair in P
k-tuples (c1, c2, …, ck) and (c1’, c2’, …, ck’) are ε-near (complete, separated) if all pairs (ci, cj) and (ci’, cj’) are ε-near

[Figure: pairs (x, y) and (x’, y’); labels r and εr]

SLIDE 13

Approximate k-clustering

Thm: Let (c1, c2, …, ck) and (c1’, c2’, …, ck’) be two k-tuples in R^d that are ε-near, ε ≤ 1/9. Let Π = ΠVor(c1, c2, …, ck) and Π’ = ΠVor(c1’, c2’, …, ck’). Then cost(Π’) ≤ (1+6ε) cost(Π)

[Figure: centers c1, c2 and c1’, c2’; labels δ, εδ, x, S1, S1’]

SLIDE 14

Approximate k-clustering

Instead of looking at all k-tuples in C, we only need to look at an ε-complete set.
– Still too many for a near-linear algorithm

Look at ε-well spread tuples instead, i.e. tuples in which no subset is 1/ε-isolated
– Y ⊂ X is 1/ε-isolated if there is x ∈ X \ Y that is at distance (1/ε) diam(Y) from Y
– If X is ε-well spread, then diam(X) ≤ (2/ε)^(k−1) δ

[Figure: labels x, d, (1/ε)d, R, εR]

SLIDE 15

Building Cluster Centers

Let C ⊂ R^d be an m-point set. Then we can compute a set C’ of k-tuples s.t.
– For any ε-well spread k-tuple in C, there is a tuple in C’ that is ε-near it
– |C’| = O(m ε^(−k^2 d))
– Each k-tuple in C’ is ε/2-well spread
– At least one point in each k-tuple in C’ belongs to C
– There are no more than O(1) k-tuples of C’ lying near any given k-tuple in R^d
– The minimum and maximum distances between points in each k-tuple in C’ are bounded by constant multiples of the minimum and maximum distances in C

The running time is O(m log m + m ε^(−k^2 d))

SLIDE 16

Building Cluster Centers

Generate an ε/2-complete set of pairs P ⊂ C × C
– There are O(m ε^(−d)) pairs, running time O(m log m + m ε^(−d))
– Each pair will be a basis for an ε/2-well spread k-tuple

[Figure: pair (x, y) at distance δ]

SLIDE 17

Algorithm

If k=1, return X
For k* = 2, 3, …, k, generate the set C* of k*-tuples for C
For each (c1, c2, …, ck*) ∈ C*, let (X1, X2, …, Xk*) = ΠVor(c1, c2, …, ck*)
Let Ci be the points lying in the εδ-neighborhood of ci
– δ is the smallest pairwise distance among c1, …, ck*
For i = 1, …, k*, call the algorithm recursively on Xi and Ci
– vary the number of clusters from 1 to k−k*+1
– find the combination k1 + k2 + … + kk* = k with the smallest cost
Among the k*-tuples in C* with all Ci non-empty, output the one with the smallest cost
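The step "find the combination k1 + k2 + … + kk* = k with the smallest cost" can be sketched as a small dynamic program. This is one possible way to do it, not necessarily the one used in the paper; `best_split` and the cost numbers are hypothetical, and the cost tables are assumed to come from the recursive calls on each Xi.

```python
def best_split(cost_tables, k):
    """Choose how many clusters each part gets so that the counts sum to k.

    cost_tables[i][j - 1] is the (assumed known) cost of clustering part i
    into j clusters. Returns (total cost, [k1, k2, ...]) or None if no
    combination sums to k.
    """
    # best[t] = cheapest (cost, counts) over the parts seen so far using t clusters.
    best = {0: (0.0, [])}
    for table in cost_tables:
        nxt = {}
        for used, (cost, counts) in best.items():
            for j, c in enumerate(table, start=1):
                total = used + j
                if total > k:
                    break
                if total not in nxt or cost + c < nxt[total][0]:
                    nxt[total] = (cost + c, counts + [j])
        best = nxt
    return best.get(k)

# Hypothetical costs for k* = 3 parts, each split into 1, 2 or 3 clusters.
tables = [[10.0, 4.0, 1.0], [8.0, 2.0, 1.5], [3.0, 2.5, 2.4]]
result = best_split(tables, k=5)
print(result)
```

With k* parts and up to k−k*+1 clusters per part, the table has at most k entries per stage, so this step is negligible next to the recursive calls that fill the cost tables.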

SLIDE 18

Algorithm

[Figure: centers c1, c2, c3, c4; labels δ, εδ, X3, C3]
Example: k = 10, k* = 4, ki = 1, 2, …, 7

SLIDE 19

Correctness

For any k-tuple (c1, …, ck), the algorithm generates a tuple that is ε-near

SLIDE 20

Running Time

All range queries are done by approximate range searching in time O(log n)

Approximate Voronoi partitioning can be done in time O(log n + ε^(−2(d−1)))

Running time:

SLIDE 21