SLIDE 1 CURE: An Efficient Clustering Algorithm for Large Databases
Sudipto Guha, Stanford University, sudipto@cs.stanford.edu
Rajeev Rastogi, Bell Laboratories, rastogi@bell-labs.com
Kyuseok Shim, Bell Laboratories, shim@bell-labs.com
SLIDE 2 Motivation
Useful technique for:
1. Discovering data distribution.
2. Discovering interesting patterns.
SLIDE 3 Problem Definition
Given:
1. n data points
2. d-dimensional metric space
Find k partitions: data within partitions are more similar than across partitions.
SLIDE 4 Traditional Clustering Algorithms
Existing algorithms [JD88]:
1. Partitional
2. Hierarchical
SLIDE 5 Partitional Clustering
Find k partitions optimizing some criterion.
Example: square-error criterion
$\min \sum_{i=1}^{k} \sum_{p \in C_i} \| p - m_i \|^2$
where $m_i$ is the mean of cluster $C_i$.
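The square-error criterion above can be evaluated directly for a given partition. The following is a minimal sketch, assuming points are tuples of floats and Euclidean distance; the function names `mean` and `square_error` are illustrative and do not come from the slides.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(dim))


def square_error(clusters):
    """Sum, over all clusters C_i, of squared distances from each point to the mean m_i."""
    total = 0.0
    for cluster in clusters:
        m = mean(cluster)
        total += sum(sum((a - b) ** 2 for a, b in zip(p, m)) for p in cluster)
    return total


# Two clusters: the first has mean (1, 0), the second is a single point.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0)]]
print(square_error(clusters))  # 2.0
```

A partitional algorithm searches over assignments of points to k clusters for one that makes this value small.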
SLIDE 6 Drawbacks of Partitional Clustering
Similar results with other criteria.
The gain from splitting large clusters can outweigh the cost of merging small clusters.
SLIDE 7 Hierarchical Clustering
1. Nested partitions
2. Tree structure
Mostly used: agglomerative hierarchical clustering.
SLIDE 8 Agglomerative Hierarchical Clustering
1. Initially each point is a distinct cluster.
2. Repeatedly merge closest clusters.
Closest:
$d_{mean}(C_i, C_j) = \| m_i - m_j \|$
$d_{min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} \| p - p' \|$
Likewise $d_{ave}(C_i, C_j)$ and $d_{max}(C_i, C_j)$.
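The four inter-cluster distances can be written in a few lines. This is a sketch that assumes Euclidean distance and clusters given as lists of tuples; the definitions of `d_ave` (average over all cross-cluster pairs) and `d_max` (largest cross-cluster pair) follow the usual textbook meaning, since the slide only says "likewise".

```python
import math
from itertools import product


def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    dim = len(cluster[0])
    return tuple(sum(p[k] for p in cluster) / len(cluster) for k in range(dim))


def d_mean(ci, cj):
    """Distance between the two cluster means."""
    return math.dist(centroid(ci), centroid(cj))


def d_min(ci, cj):
    """Smallest distance over all cross-cluster pairs."""
    return min(math.dist(p, q) for p, q in product(ci, cj))


def d_max(ci, cj):
    """Largest distance over all cross-cluster pairs."""
    return max(math.dist(p, q) for p, q in product(ci, cj))


def d_ave(ci, cj):
    """Average distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))


ci = [(0.0, 0.0), (2.0, 0.0)]
cj = [(5.0, 0.0), (7.0, 0.0)]
print(d_mean(ci, cj), d_min(ci, cj), d_ave(ci, cj), d_max(ci, cj))  # 5.0 3.0 5.0 7.0
```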
SLIDE 9 Drawbacks of Agglomerative Hierarchical Clustering
[Figure: two panels labeled $d_{ave}$ and $d_{min}$] Clustering with $d_{mean}(C_i, C_j)$ and $d_{min}(C_i, C_j)$.
$d_{mean}(C_i, C_j)$ => Centroid approach
$d_{min}(C_i, C_j)$ => Spanning Tree approach
SLIDE 10 Summary of Problems with Traditional Methods
- Partitional algorithms split large clusters.
- Centroid approach splits large clusters and non-hyperspherical shapes. Centers of sub-clusters can be far apart.
- Spanning Tree approach is sensitive to outliers and slight changes in position. Exhibits a chaining effect on a string of outliers.
SLIDE 11 Labeling Problem
Centroid approach: even with correct centers, do we label correctly?
Not unless we use every point in the data set to build the hierarchy.
SLIDE 12 Related Work I
CLARANS: Partitional algorithm.
- Uses k medoids and randomized iterative improvement by exchanging medoids.
- Multiple I/O scans required; can converge to a local optimum.
- Splits large clusters.
DBSCAN: Density-based algorithm.
- Uses density in a small neighborhood (size given as a parameter).
- Finds and eliminates boundary points.
- Uses a spanning tree on the neighborhood graph.
- High I/O cost, problems with outliers, sensitive to the density parameter.
SLIDE 13 Related Work II
BIRCH: Hierarchical algorithm.
- Preclusters data using a CF tree.
- Inserts points into the tree, maintaining as many leaves as will fit in main memory.
- Uses standard hierarchical clustering to cluster the preclusters.
- Works well for convex, isotropic clusters of uniform size.
- Dependent on the order of insertions.
SLIDE 14 Our Contribution
- Hierarchical clustering algorithm which is a middle ground between Centroid based and Spanning Tree based algorithms.
- Solution to the Labeling Problem.
- Random Sampling.
- Partitioning.
SLIDE 15 Hierarchical Clustering Algorithm
Centroid based algorithms use 1 point to represent a cluster.
=> Too little information ... hyperspherical clusters.
Spanning Tree based algorithms use all points to cluster.
=> Too much information ... easily misled.
Use a small number of representatives for each cluster.
SLIDE 16 Representatives
A representative set of points:
- Small in number: c
- Spread over the cluster.
- Each point in the cluster is close to one representative.
- Distance between clusters = smallest distance between representative sets.
SLIDE 17 Finding Scattered Representatives
- Distributed around the center of the cluster (symmetry).
- Well spread over the cluster.
Use the Farthest Point Heuristic to scatter the points over the cluster.
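One common reading of the farthest point heuristic is sketched below. It assumes the first representative is the point farthest from the cluster mean and each later one is the point farthest from those already chosen; the slide does not spell out these details, and the name `scattered_points` is illustrative.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(dim))


def scattered_points(cluster, c):
    """Pick up to c well-separated points from the cluster (farthest-point heuristic)."""
    center = mean(cluster)
    chosen = []  # indices into cluster, so duplicate points are handled safely
    for _ in range(min(c, len(cluster))):
        best_i, best_d = -1, -1.0
        for i, p in enumerate(cluster):
            if i in chosen:
                continue
            if not chosen:
                d = math.dist(p, center)  # first pick: farthest from the mean
            else:
                d = min(math.dist(p, cluster[j]) for j in chosen)
            if d > best_d:
                best_i, best_d = i, d
        chosen.append(best_i)
    return [cluster[i] for i in chosen]


cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
print(scattered_points(cluster, 3))  # [(10.0, 0.0), (0.0, 0.0), (2.0, 0.0)]
```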
SLIDE 18 Example
SLIDE 19 Shrinking the Representatives
Why do we need to alter the representative set? Too close to the boundary of the cluster.
[Figure: shrink factor α]
Shrink uniformly around the mean (center) of the cluster.
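Shrinking can be sketched as moving each representative a fraction α of the way toward the cluster mean. The convention below (α = 0 leaves points in place, α = 1 collapses them onto the mean) is an assumption; the slide only names the factor α.

```python
def shrink(representatives, center, alpha):
    """Move each representative a fraction alpha of the way toward center."""
    return [
        tuple(x + alpha * (m - x) for x, m in zip(p, center))
        for p in representatives
    ]


reps = [(10.0, 0.0), (0.0, 0.0)]
center = (4.0, 0.0)
print(shrink(reps, center, 0.5))  # [(7.0, 0.0), (2.0, 0.0)]
```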
SLIDE 20 Clustering Algorithm
- Initially every point is a separate cluster.
- Merge closest clusters till the cardinality is at least c.
- If cardinality > c, compute scattered representatives. Use only the 2c representative points.
- To expedite finding closest clusters, maintain a k-d tree of the representative points.
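The merge loop can be sketched end to end in a simplified form. This version is an illustration under stated assumptions, not the authors' implementation: it finds the closest pair by brute force instead of a k-d tree, picks a merged cluster's scattered points from the (at most 2c) scattered points of the two clusters being merged, and uses hypothetical defaults `c=4` and `alpha=0.3`.

```python
import math


def mean(points):
    dim = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(dim))


def scatter(candidates, center, c):
    """Farthest-point heuristic over the candidate points."""
    chosen = []
    for _ in range(min(c, len(candidates))):
        def score(i):
            if not chosen:
                return math.dist(candidates[i], center)
            return min(math.dist(candidates[i], candidates[j]) for j in chosen)
        rest = [i for i in range(len(candidates)) if i not in chosen]
        chosen.append(max(rest, key=score))
    return [candidates[i] for i in chosen]


def shrink(points, center, alpha):
    return [tuple(x + alpha * (m - x) for x, m in zip(p, center)) for p in points]


class Cluster:
    def __init__(self, points, candidates, c, alpha):
        self.points = points
        self.center = mean(points)
        self.scattered = scatter(candidates, self.center, c)      # unshrunk
        self.reps = shrink(self.scattered, self.center, alpha)    # used for distances


def cluster_distance(u, v):
    """Smallest distance between the two representative sets."""
    return min(math.dist(p, q) for p in u.reps for q in v.reps)


def cure(points, k, c=4, alpha=0.3):
    clusters = [Cluster([p], [p], c, alpha) for p in points]
    while len(clusters) > k:
        pairs = ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]))
        u, v = clusters[i], clusters[j]
        merged = Cluster(u.points + v.points, u.scattered + v.scattered, c, alpha)
        clusters = [x for n, x in enumerate(clusters) if n not in (i, j)] + [merged]
    return [sorted(x.points) for x in clusters]


pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(sorted(cure(pts, 2)))
```

The brute-force pair search keeps the sketch short but is far slower than the k-d tree approach named on the slide.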
SLIDE 21 Analysis of Running Time
[Figure: clusters labeled Y, A, B, X]
Every cluster having A or B as its closest cluster may need to be updated.
Time O(log n) per update, O(n) updates.
Total time of the algorithm: $O(n^2 \log n)$.
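The total can be read as a product of three factors. The sketch below assumes that the O(n) updates are counted per merge and that at most n - 1 merges occur, which is one way to arrive at the stated bound:

```latex
\[
T(n) \;=\; \underbrace{(n-1)}_{\text{merges}}
\cdot \underbrace{O(n)}_{\text{updates per merge}}
\cdot \underbrace{O(\log n)}_{\text{time per update}}
\;=\; O(n^2 \log n)
\]
```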
SLIDE 22 Random Sampling
Too [...] versus too [...].
If each cluster has a certain number of points, then with high probability we will sample in proportion from the cluster.
=> A given share of the n points in a cluster translates to about the same share of the s points in a sample of size s.
The sample size needed to represent all sufficiently large clusters is independent of n.
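The proportionality claim is easy to check empirically. The sketch below is only a simulation with made-up cluster sizes and a fixed seed; `sample_cluster_counts` is a hypothetical helper and says nothing about the sample-size bound used by the authors.

```python
import random


def sample_cluster_counts(labels, s, seed=0):
    """Draw s labels uniformly without replacement and count them per cluster."""
    rng = random.Random(seed)
    counts = {}
    for label in rng.sample(labels, s):
        counts[label] = counts.get(label, 0) + 1
    return counts


# Cluster "A" holds 30% of the points, cluster "B" the remaining 70%.
labels = ["A"] * 3000 + ["B"] * 7000
counts = sample_cluster_counts(labels, 1000)
print({label: n / 1000 for label, n in counts.items()})
```

With a sample of 1000, the printed shares should land close to 0.3 and 0.7.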
SLIDE 23 Partitioning
Sample may be large due to desired accuracy. We may want a sample size larger than main memory.
1. Partition the sample into p partitions.
2. Partially cluster each partition, combine all partitions and complete the clustering.
Time reduces to $p \cdot O\left(\frac{s^2}{p^2} \log \frac{s}{p}\right) = O\left(\frac{s^2}{p} \log \frac{s}{p}\right)$.
Why not large p? Consider the second step above. Also loss in quality.
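To get a feel for the saving, the cost expressions can be compared numerically. This sketch drops constant factors, uses a base-2 logarithm for convenience, and models only the partial-clustering phase, not the final pass over the combined partitions; the function names are illustrative.

```python
import math


def clustering_cost(s):
    """Cost model s^2 log s for clustering s points (constants ignored)."""
    return s * s * math.log2(s)


def partitioned_cost(s, p):
    """Cost of clustering p partitions of s/p points each."""
    return p * clustering_cost(s / p)


s = 4096
for p in (1, 2, 4, 8):
    print(p, clustering_cost(s) / partitioned_cost(s, p))
```

The printed ratio grows slightly faster than p because the logarithmic factor also shrinks.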
SLIDE 24 Labeling Data on Disk
Choose some constant number of representatives from each cluster.
For every new point seen:
- Find the nearest representative point.
- Assign the cluster label of this representative point to the new point.
Ameliorates the Labeling Problem if sufficiently many representatives are chosen.
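The labeling step amounts to a nearest-neighbor lookup over all representatives. A minimal sketch, assuming Euclidean distance and a dictionary from cluster label to representative points (the data layout and the name `label_point` are illustrative):

```python
import math


def label_point(point, representatives):
    """Return the label of the cluster owning the representative nearest to point."""
    best_label, best_d = None, math.inf
    for label, reps in representatives.items():
        for r in reps:
            d = math.dist(point, r)
            if d < best_d:
                best_label, best_d = label, d
    return best_label


representatives = {
    "A": [(0.0, 0.0), (4.0, 0.0)],  # an elongated cluster with two representatives
    "B": [(10.0, 0.0)],
}
print(label_point((6.0, 0.0), representatives))  # A
print(label_point((9.0, 1.0), representatives))  # B
```

In this example the point (6, 0) is equally far from the mean of A, which is (2, 0), and from B, so a single centroid per cluster could not separate them; the second representative of A settles it.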
SLIDE 25 Outlier Handling
Outliers cannot have many points close to them. Random sampling preserves this property.
Thus clusters around an outlier grow more slowly.
After partial clustering has been done, throw away slowly growing (small cardinality) clusters.
We apply this process in two phases:
- After partial clustering.
- Towards the end.
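Discarding slowly growing clusters can be sketched as a size filter. The threshold `min_size` is a hypothetical parameter; the slides do not state which cardinality counts as small.

```python
def drop_small_clusters(clusters, min_size):
    """Keep only clusters that have grown to at least min_size points."""
    return [cluster for cluster in clusters if len(cluster) >= min_size]


clusters = [[(0, 0), (0, 1), (1, 0)], [(50, 50)], [(9, 9), (9, 10)]]
print(drop_small_clusters(clusters, min_size=2))
# [[(0, 0), (0, 1), (1, 0)], [(9, 9), (9, 10)]]
```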
SLIDE 26 The Complete Algorithm
[Figure: flow diagram]
1. Draw random sample
2. Partition sample
3. Partially cluster partitions
4. Eliminate outliers
5. Cluster partial clusters
6. Label data in disk
SLIDE 27 Sensitivity Analysis
We want to test effects of varying:
- Shrinkage factor
- Number of representatives
- Number of sample points
100000 points.
SLIDE 28 Sensitivity Analysis: Shrink Factor
SLIDE 29 Sensitivity Analysis: Number of Representatives
[Figure: three panels, c=2, c=5, c=10]
2500 sample points and shrinkage factor 0.3.
SLIDE 30 Sensitivity Analysis: Random Sample Size
[Figure: three panels, 2000 points, 2500 points, 3000 points]
10 representatives and shrinkage factor 0.3.
SLIDE 31 Comparison with BIRCH
[Figure: Execution Time (Sec.) versus Number of Points, 100000 to 500000; series: BIRCH, CURE (P = 1), CURE (P = 2), CURE (P = 5)]
Datafile has 100000 points.
Using Sun Ultra-2/200 machine with 512MB RAM (labeling excluded).
SLIDE 32 Scale-up Experiments
[Figure: Execution Time (Sec.) versus Number of Sample Points, 500 to 5000; series: P = 1, P = 2, P = 5]
Using Sun Ultra-2/200 machine with 512MB RAM (labeling excluded).
SLIDE 33 Conclusions
In this work we demonstrate:
- Random Sampling
- Partitioning
- Using Representatives
- Labeling Scheme