CURE: An Efficient Clustering Algorithm for Large Databases


CURE: An Efficient Clustering Algorithm for Large Databases. Sudipto Guha (Stanford University), Rajeev Rastogi (Bell Laboratories), Kyuseok Shim (Bell Laboratories). sudipto@cs.stanford.edu rastogi@b


slide-1
SLIDE 1 CURE: An Efficient Clustering Algorithm for Large Databases
Sudipto Guha, Stanford University, sudipto@cs.stanford.edu
Rajeev Rastogi, Bell Laboratories, rastogi@bell-labs.com
Kyuseok Shim, Bell Laboratories, shim@bell-labs.com
slide-2
SLIDE 2 Motivation
Clustering is a useful technique for:
1. Discovering the data distribution.
2. Discovering interesting patterns.
slide-3
SLIDE 3 Problem Definition
Given:
1. n data points
2. a d-dimensional metric space
Find k partitions such that data within a partition are more similar to each other than to data in other partitions.
slide-4
SLIDE 4 Traditional Clustering Algorithms
Existing algorithms [JD88]:
1. Partitional
2. Hierarchical
slide-5
SLIDE 5 Partitional Clustering
Find k partitions optimizing some criterion.
Example: square-error criterion
$\min \sum_{i=1}^{k} \sum_{p \in C_i} \|p - m_i\|^2$
where $m_i$ is the mean of cluster $C_i$.
slide-6
SLIDE 6 Drawbacks of Partitional Clustering
Similar results with other criteria.
The gain from splitting large clusters offsets the cost of merging small clusters.
slide-7
SLIDE 7 Hierarchical Clustering
1. Nested partitions
2. Tree structure
Mostly used: agglomerative hierarchical clustering.
slide-8
SLIDE 8 Agglomerative Hierarchical Clustering
1. Initially each point is a distinct cluster.
2. Repeatedly merge the closest clusters.
Closest:
$d_{mean}(C_i, C_j) = \|m_i - m_j\|$
$d_{min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} \|p - p'\|$
Likewise $d_{ave}(C_i, C_j)$ and $d_{max}(C_i, C_j)$.
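The inter-cluster distances named on this slide can be written out as follows. This is a sketch under assumptions: the slide only defines d_mean and d_min explicitly, so d_ave is taken here to be the average over all cross-cluster pairs and d_max the largest such pair distance, which are the usual textbook definitions. The function names are illustrative.

```python
from itertools import product
from math import dist


def centroid(cluster):
    """Coordinate-wise mean of a non-empty cluster."""
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(len(cluster[0])))


def d_mean(ci, cj):
    """Distance between the two cluster means."""
    return dist(centroid(ci), centroid(cj))


def d_min(ci, cj):
    """Smallest distance between a point of ci and a point of cj."""
    return min(dist(p, q) for p, q in product(ci, cj))


def d_max(ci, cj):
    """Largest cross-cluster pair distance (assumed definition)."""
    return max(dist(p, q) for p, q in product(ci, cj))


def d_ave(ci, cj):
    """Average cross-cluster pair distance (assumed definition)."""
    return sum(dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))
```

Swapping one of these functions for another in the merge step is what distinguishes the centroid approach (d_mean) from the minimum spanning tree approach (d_min).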
slide-9
SLIDE 9 Drawbacks of Agglomerative Hierarchical Clustering
[Figure: two panels labeled $d_{ave}$ and $d_{min}$. Caption: Clustering with $d_{mean}(C_i, C_j)$ and $d_{min}(C_i, C_j)$.]
$d_{mean}(C_i, C_j)$: centroid approach
$d_{min}(C_i, C_j)$: minimum spanning tree approach
slide-10
SLIDE 10 Summary of Problems with Traditional Methods
  • Partitional algorithms split large clusters.
  • Centroid approach splits large clusters and non-hyperspherical shapes. Centers of sub-clusters can be far apart.
  • Minimum spanning tree approach is sensitive to outliers and slight changes in position. Exhibits a chaining effect on a string of outliers.
slide-11
SLIDE 11 Labeling Problem
Centroid approach: even with correct centers, do we label correctly?
Not unless we use every point in the data set to build the hierarchy.
slide-12
SLIDE 12 Related Work - I
  • CLARANS: Partitional algorithm. Uses k medoids and randomized iterative improvement by exchanging medoids. Multiple I/O scans required, can converge to a local optimum. Splits large clusters...
  • DBSCAN: Density-based algorithm. Uses density in a small (given as parameter) neighborhood. Finds and eliminates boundary points. Uses a spanning tree on the neighborhood graph. High I/O cost, problem with outliers, sensitive to the density parameter.
slide-13
SLIDE 13 Related Work - II
  • BIRCH: Hierarchical algorithm. Preclusters data using a CF tree. Inserts points into the tree, maintaining as many leaves as would fit in main memory. Uses standard hierarchical clustering to cluster the preclusters. Works for convex, isotropic clusters of uniform size. Dependent on the order of insertions.
slide-14
SLIDE 14 Our Contribution
  • New hierarchical clustering algorithm which is a middle ground between centroid-based and spanning-tree-based algorithms.
  • Solution to the Labeling Problem
  • Use of Random Sampling
  • Use of Partitioning
slide-15
SLIDE 15 Hierarchical Clustering Algorithm
Centroid-based algorithms use 1 point to represent a cluster.
=> too little information ... hyperspherical clusters.
Spanning-tree-based algorithms use all points to cluster.
=> too much information ... easily misled.
Use a small number of representatives for each cluster.
slide-16
SLIDE 16 Representatives
A representative set of points:
  • Small in number: c
  • Distributed over the cluster
  • Each point in the cluster is close to one representative.
  • Distance between clusters = smallest distance between representative sets
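The last bullet, distance between clusters as the smallest distance between representative sets, can be sketched as a brute-force minimum over all representative pairs. The function name `cluster_distance` is illustrative; with at most c representatives per cluster the cost is at most c * c distance evaluations per pair of clusters.

```python
from math import dist


def cluster_distance(reps_a, reps_b):
    """Smallest distance between any representative of A and any of B."""
    return min(dist(p, q) for p in reps_a for q in reps_b)


reps_a = [(0.0, 0.0), (1.0, 0.0)]
reps_b = [(4.0, 0.0), (9.0, 0.0)]
print(cluster_distance(reps_a, reps_b))
```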
slide-17
SLIDE 17 Finding Scattered Representatives
  • Distributed around the center of the cluster (symmetry).
  • Well spread out over the cluster.
Use the Farthest Point Heuristic to scatter the points over the cluster.
slide-18
SLIDE 18 Example
slide-19
SLIDE 19 Shrinking the Representatives
Why do we need to alter the representative set? The scattered points are too close to the boundary of the cluster.
[Figure: shrinking of the representatives, labeled α]
Shrink uniformly around the mean (center) of the cluster.
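Shrinking toward the mean can be sketched as moving each representative a fraction α of the way to the cluster center. The convention assumed here is that α = 0 leaves a point unchanged and α = 1 collapses it onto the mean; the slide shows α only in a figure, so this reading is an assumption. The function names are illustrative.

```python
def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def shrink(reps, points, alpha):
    """Move each representative a fraction alpha toward the cluster mean."""
    m = mean_point(points)
    return [tuple(x + alpha * (mx - x) for x, mx in zip(p, m)) for p in reps]


cluster = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
print(shrink([(2.0, 0.0), (0.0, -2.0)], cluster, alpha=0.3))
```

Because every representative moves by the same fraction, points far from the center (likely outliers) move the farthest in absolute terms.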
slide-20
SLIDE 20 Clustering Algorithm
Initially every point is a separate cluster.
Merge closest clusters till the cardinality is at least c.
If cardinality > c, compute scattered representatives. Use only the 2c representative points (c from each of the two merged clusters).
To expedite finding closest clusters, maintain a k-d tree on the representative points.
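The merge loop can be sketched in a deliberately naive form. This is not the implementation described on the slide: it finds the closest pair by brute force instead of a k-d tree, and it recomputes representatives from all member points instead of from the 2c existing representatives. The parameter defaults (`c=4`, `alpha=0.3`) and all function names are assumptions for illustration.

```python
from math import dist


def mean_point(points):
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def representatives(points, c, alpha):
    """Up to c scattered points (farthest-point heuristic), shrunk by alpha."""
    m = mean_point(points)
    chosen = []
    for _ in range(min(c, len(points))):
        candidates = [p for p in points if p not in chosen]
        if not candidates:  # only duplicates of chosen points remain
            break
        if not chosen:
            best = max(candidates, key=lambda p: dist(p, m))
        else:
            best = max(candidates, key=lambda p: min(dist(p, q) for q in chosen))
        chosen.append(best)
    return [tuple(x + alpha * (mx - x) for x, mx in zip(p, m)) for p in chosen]


def cure_cluster(points, k, c=4, alpha=0.3):
    """Merge closest clusters (by representative distance) until k remain."""
    clusters = [[p] for p in points]
    reps = [representatives(cl, c, alpha) for cl in clusters]
    while len(clusters) > k:
        best = None
        # Brute-force closest pair; the slide uses a k-d tree here instead.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in reps[i] for q in reps[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        del clusters[j], reps[j]  # j > i, so index i stays valid
        clusters[i] = merged
        reps[i] = representatives(merged, c, alpha)
    return clusters
```

The brute-force search makes each merge quadratic in the number of clusters, so this sketch is only suitable for small inputs such as a sample or a single partition.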
slide-21
SLIDE 21 Analysis of Running Time
[Figure: clusters labeled Y, A, B, X]
When A and B are merged, every cluster having A or B as its closest cluster may need to be updated.
Time O(log n) per update, O(n) updates per merge.
Total time over the algorithm: O(n^2 log n).
slide-22
SLIDE 22 Random Sampling
Too much versus too little.
If each cluster has a certain number of points, with high probability we will sample in proportion from the cluster.
=> A given fraction of the n points in a cluster translates to the same fraction of the s points in a sample of size s.
The sample size needed to represent all sufficiently large clusters is independent of n.
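The proportionality claim can be checked empirically with a uniform sample drawn without replacement. The sketch below uses synthetic cluster labels rather than real points; the data set sizes, the seed, and the name `sample_label_counts` are assumptions made for illustration.

```python
import random
from collections import Counter


def sample_label_counts(labels, s, seed=0):
    """Draw s items uniformly without replacement and count each label."""
    rng = random.Random(seed)
    return Counter(rng.sample(labels, s))


# Synthetic data set: cluster "A" holds 30% of 10000 points.
labels = ["A"] * 3000 + ["B"] * 7000
counts = sample_label_counts(labels, 1000)
print(counts["A"] / 1000, counts["B"] / 1000)
```

With a sample of 1000 the observed share of "A" should land near 0.3, with the deviation shrinking as the sample grows; a very small cluster, by contrast, may be missed entirely.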
slide-23
SLIDE 23 Partitioning
The sample may be large due to the desired accuracy. We may want a sample size larger than main memory.
Partition the sample into p partitions:
  • Partially cluster each partition,
  • collect all partitions and complete the clustering.
Time reduces to $p \cdot O\left(\frac{s^2}{p^2} \log \frac{s}{p}\right) = O\left(\frac{s^2}{p} \log \frac{s}{p}\right)$.
Why not a large p? Consider the second step above. Also loss in quality ....
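The effect of p on the first step can be illustrated by plugging numbers into the cost expressions. This is only a rough sketch: constant factors are ignored, the cost of the second step (completing the clustering over the collected partial clusters) is left out, and the sample size 2500 is borrowed from the sensitivity slides purely as an example. The function names are illustrative.

```python
from math import log2


def full_cost(s):
    """s^2 log s: clustering the whole sample at once, constants ignored."""
    return s * s * log2(s)


def partitioned_cost(s, p):
    """p * (s/p)^2 * log(s/p): partially clustering p partitions."""
    return p * (s / p) ** 2 * log2(s / p)


s = 2500
for p in (1, 2, 5):
    ratio = full_cost(s) / partitioned_cost(s, p)
    print(p, round(ratio, 2))
```

The ratio grows slightly faster than p because the logarithmic factor also shrinks, which is why the second step and the loss in quality, not the first step, limit how large p can usefully be.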
slide-24
SLIDE 24 Labeling Data on Disk
Choose some constant number of representatives from each cluster.
For every new point seen:
  • Find the nearest representative point.
  • Assign the cluster label of this representative point to the new point.
Ameliorates the Labeling Problem if sufficiently many representatives are chosen.
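The labeling pass amounts to a nearest-representative lookup for each point read from disk. A minimal sketch follows, assuming representatives are kept in a dictionary from cluster label to a list of points; the name `label_point` and the dictionary layout are illustrative choices, and a linear scan stands in for any index structure.

```python
from math import dist


def label_point(point, reps_by_cluster):
    """Return the label of the cluster owning the nearest representative."""
    pairs = (
        (dist(point, r), label)
        for label, reps in reps_by_cluster.items()
        for r in reps
    )
    return min(pairs, key=lambda pair: pair[0])[1]


reps = {"c1": [(0.0, 0.0), (1.0, 1.0)], "c2": [(10.0, 10.0)]}
print(label_point((2.0, 2.0), reps))
```

Using several representatives per cluster, rather than a single centroid, lets points near the edge of a large or elongated cluster still receive that cluster's label.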
slide-25
SLIDE 25 Outlier Handling
Outliers cannot have many points close to them. Random sampling preserves this property.
Thus clusters around an outlier grow more slowly.
After partial clustering has been done, throw away slowly growing (small cardinality) clusters.
We apply this process in two phases:
  • After partial clustering.
  • Towards the end.
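Discarding slowly growing clusters reduces to a size filter applied at a chosen point in the merge process. The sketch below assumes a hypothetical threshold `min_size`; the slides do not state what cardinality counts as small or exactly when each phase is triggered.

```python
def drop_small_clusters(clusters, min_size):
    """Keep only clusters with at least min_size points (threshold assumed)."""
    return [cl for cl in clusters if len(cl) >= min_size]


partial = [
    [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)],
    [(50.0, 50.0)],  # isolated point, likely an outlier
    [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)],
]
print(len(drop_small_clusters(partial, min_size=2)))
```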
slide-26
SLIDE 26 The Complete Algorithm
[Flowchart: Draw random sample -> Partition sample -> Partially cluster partitions -> Eliminate outliers -> Cluster partial clusters -> Label data in disk]
slide-27
SLIDE 27 Sensitivity Analysis
We want to test the effects of varying:
  • shrinkage factor
  • number of representatives
  • number of sample points
[Figure: data set of 100000 points]
slide-28
SLIDE 28 Sensitivity Analysis: Shrink Factor
slide-29
SLIDE 29 Sensitivity Analysis: Number of Representatives
[Figure: three panels labeled c=2, c=5, c=10]
2500 sample points and shrinkage factor 0.3
slide-30
SLIDE 30 Sensitivity Analysis: Random Sample Size
[Figure: three panels labeled 2000 points, 2500 points, 3000 points]
10 representatives and shrinkage factor 0.3
slide-31
SLIDE 31 Comparison with BIRCH
[Figure: execution time (sec.) versus number of points (100000 to 500000) for BIRCH, CURE (P = 1), CURE (P = 2), and CURE (P = 5)]
Datafile has 100000 points. Using Sun Ultra-2/200 machine with 512MB RAM (labeling excluded).
slide-32
SLIDE 32 Scale-up Experiments
[Figure: execution time (sec.) versus number of sample points (500 to 5000) for P = 1, P = 2, and P = 5]
Using Sun Ultra-2/200 machine with 512MB RAM (labeling excluded).
slide-33
SLIDE 33 Conclusions
In this work we demonstrate:
  • Use of Random Sampling
  • Use of Partitioning
  • Clustering Using REpresentatives
  • Consistent Labeling Scheme