

  1. CURE: An Efficient Clustering Algorithm for Large Databases. Sudipto Guha (Stanford University, sudipto@cs.stanford.edu), Rajeev Rastogi (Bell Laboratories, rastogi@bell-labs.com), Kyuseok Shim (Bell Laboratories, shim@bell-labs.com).

  2. Motivation. Clustering is a useful technique for: 1. discovering the data distribution; 2. discovering interesting patterns.

  3. Problem Definition. Given: 1. n data points; 2. a d-dimensional metric space. Find k partitions such that data within a partition are more similar than data across partitions.

  4. Traditional Clustering Algorithms. Existing algorithms [JD88]: 1. partitional; 2. hierarchical.

  5. Partitional Clustering. Find k partitions optimizing some criterion. Example: the square-error criterion $\min \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2$, where $m_i$ is the mean of cluster $C_i$.
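As a concrete illustration of the criterion, here is a minimal sketch (not code from the paper; the NumPy usage and function name are my own):

```python
# Illustrative sketch: evaluate the square-error criterion
#   sum_i sum_{p in C_i} ||p - m_i||^2
# for a given assignment of points to k clusters.
import numpy as np

def square_error(points, labels, k):
    """Sum of squared distances from each point to its cluster mean m_i."""
    total = 0.0
    for i in range(k):
        cluster = points[labels == i]
        if len(cluster) == 0:
            continue
        mean = cluster.mean(axis=0)              # m_i, the cluster centroid
        total += ((cluster - mean) ** 2).sum()
    return total

# Two tight 2-d clusters give a small square error.
pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
print(square_error(pts, np.array([0, 0, 1, 1]), k=2))
```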

  6. Drawbacks of Partitional Clustering. Similar results with other criteria: the gain from splitting large clusters offsets the cost of merging small clusters, so the optimum tends to split large clusters.

  7. Hierarchical Clustering. 1. Nested partitions. 2. Tree structure. Mostly used: agglomerative hierarchical clustering.

  8. Agglomerative Hierarchical Clustering. 1. Initially each point is a distinct cluster. 2. Repeatedly merge the closest clusters. "Closest" can be defined as $d_{mean}(C_i, C_j) = \lVert m_i - m_j \rVert$ or $d_{min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} \lVert p - p' \rVert$, and likewise $d_{ave}(C_i, C_j)$ and $d_{max}(C_i, C_j)$.
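These distances are straightforward to write down; a minimal NumPy sketch (illustrative, not the paper's code):

```python
# Illustrative sketch of two of the cluster distances above.
import numpy as np

def d_mean(ci, cj):
    """Centroid distance ||m_i - m_j||."""
    return np.linalg.norm(ci.mean(axis=0) - cj.mean(axis=0))

def d_min(ci, cj):
    """Smallest distance over all cross-cluster point pairs."""
    diffs = ci[:, None, :] - cj[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min()

a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[4.0, 0.0], [6.0, 0.0]])
print(d_mean(a, b), d_min(a, b))   # 4.5 3.0
```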

  9. Drawbacks of Agglomerative Hierarchical Clustering. [Figures: clusterings obtained with $d_{ave}$ and $d_{min}$.] Clustering with $d_{mean}(C_i, C_j)$ and $d_{min}(C_i, C_j)$: $d_{mean}(C_i, C_j)$ ⇒ centroid approach; $d_{min}(C_i, C_j)$ ⇒ minimum-spanning-tree approach.

  10. Summary of Problems with Traditional Methods. • Partitional algorithms split large clusters. • The centroid approach splits large clusters and non-hyperspherical shapes; the centers of sub-clusters can be far apart. • The minimum-spanning-tree approach is sensitive to outliers and to slight changes in position, and exhibits a chaining effect on a string of outliers.

  11. Labeling Problem. Centroid approach: even with correct centers, do we label correctly? Not unless we use every point in the data set to build the hierarchy.

  12. Related Work - I. • CLARANS: a partitional algorithm. Uses k medoids and randomized iterative improvement by exchanging medoids. Requires multiple I/O scans and can converge to a local optimum. Splits large clusters. • DBSCAN: a density-based algorithm. Uses the density in a small neighborhood (given as a parameter), finds and eliminates boundary points, and uses a spanning tree on the neighborhood graph. High I/O cost, problems with outliers, and sensitivity to the density parameter.

  13. Related Work - II. • BIRCH: a hierarchical algorithm. Preclusters the data using a CF-tree, inserting points into the tree while maintaining as many leaves as fit in main memory, then uses standard hierarchical clustering to cluster the preclusters. Works for convex, isotropic clusters of uniform size; dependent on the order of insertions.

  14. Our Contribution. • A new hierarchical clustering algorithm that is a middle ground between centroid-based and spanning-tree-based algorithms. • A solution to the labeling problem. • Use of random sampling. • Use of partitioning.

  15. Hierarchical Clustering Algorithm. Centroid-based algorithms use one point to represent a cluster ⇒ too little information ... hyperspherical clusters. Spanning-tree-based algorithms use all points to cluster ⇒ too much information ... easily misled. Instead, use a small number of representatives for each cluster.

  16. Representatives. A representative set is a set of points that is: • small in number (c); • distributed over the cluster; • such that each point in the cluster is close to some representative. Distance between clusters = smallest distance between their representative sets.

  17. Finding Scattered Representatives. • Distributed around the center of the cluster (symmetry). • Well spread out over the cluster. Use the farthest-point heuristic to scatter the points over the cluster.
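A minimal sketch of the farthest-point heuristic (the function name, and starting from the point farthest from the cluster mean, are assumptions made here for illustration):

```python
# Illustrative sketch: greedily pick up to c points, each as far as possible
# from the representatives chosen so far.
import numpy as np

def scattered_representatives(cluster, c):
    """Return up to c well-scattered points from `cluster` (an (n, d) array)."""
    mean = cluster.mean(axis=0)
    # Start from the point farthest from the cluster mean (an assumption here).
    chosen = [int(np.argmax(np.linalg.norm(cluster - mean, axis=1)))]
    while len(chosen) < min(c, len(cluster)):
        # For every point, distance to its nearest already-chosen representative.
        dists = np.min(
            [np.linalg.norm(cluster - cluster[i], axis=1) for i in chosen], axis=0)
        chosen.append(int(np.argmax(dists)))
    return cluster[chosen]

pts = np.random.default_rng(0).normal(size=(50, 2))
print(scattered_representatives(pts, c=4))
```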

  18. Example. [Figure.]

  19. Shrinking the Representatives. Why do we need to alter the representative set? The scattered points are too close to the boundary of the cluster. Shrink them uniformly by a factor α around the mean (center) of the cluster.
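Shrinking is a one-line operation; a minimal sketch (the example coordinates are illustrative):

```python
# Illustrative sketch: move each representative a fraction alpha of the way
# toward the cluster mean.
import numpy as np

def shrink(representatives, mean, alpha):
    return representatives + alpha * (mean - representatives)

reps = np.array([[0.0, 4.0], [4.0, 0.0]])
print(shrink(reps, mean=np.array([2.0, 2.0]), alpha=0.3))
# [[0.6 3.4]
#  [3.4 0.6]]
```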

  20. Clustering Algorithm. Initially every point is a separate cluster. Repeatedly merge the closest clusters. Until a cluster's cardinality reaches c, all of its points act as representatives; if the cardinality exceeds c, compute the scattered representatives, using only the 2c representative points of the merged clusters. To expedite finding the closest clusters, maintain a k-d tree on the representative points.

  21. Analysis of Running Time. [Figure: clusters X, A, B, Y.] When two clusters A and B are merged, every cluster that has A or B as its closest cluster may need to be updated. With $O(\log n)$ time per update and $O(n)$ updates per merge, the total time over the algorithm is $O(n^2 \log n)$.

  22. Random Sampling. Too much versus too little. If each cluster has a certain number of points, then with high probability we sample from it in proportion to its size ⇒ a cluster containing a fixed fraction of the n data points contributes about the same fraction of a sample of size s. The sample size needed to represent all sufficiently large clusters is independent of n.
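For example, with the 100,000-point data sets used in the experiments and a sample of s = 2,500 points, a cluster holding 10% of the data would contribute roughly 250 sample points in expectation.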

  23. Partitioning. The sample may be large because of the desired accuracy, and we may want a sample larger than main memory. Partition the sample into p partitions: • partially cluster each partition; • then collect all partitions and complete the clustering. Time reduces to $p \cdot O\!\left(\frac{s^2}{p^2} \log \frac{s}{p}\right) = O\!\left(\frac{s^2}{p} \log \frac{s}{p}\right)$. Why not a large p? Consider the second step above; there is also a loss in quality.
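For example (ignoring constants), with s = 2,500 sample points and p = 5 partitions, each partition holds 500 points, and $p \cdot (s/p)^2 \log(s/p)$ is roughly a factor of 6 smaller than the $s^2 \log s$ cost of clustering the whole sample in one pass; the saving grows with p, but so does the work left for the second step.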

  24. Labeling Data on Disk. Choose a constant number of representatives from each cluster. For every new point seen, find the nearest representative point and assign its cluster label to the new point. This ameliorates the labeling problem if sufficiently many representatives are chosen.
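A minimal sketch of the labeling pass (the function name is illustrative):

```python
# Illustrative sketch: assign each point the label of its nearest representative.
import numpy as np

def label_points(points, representatives, rep_labels):
    """rep_labels[i] is the cluster id of representatives[i]."""
    dists = np.linalg.norm(points[:, None, :] - representatives[None, :, :], axis=-1)
    return rep_labels[np.argmin(dists, axis=1)]

reps = np.array([[0.0, 0.0], [5.0, 5.0]])
pts = np.array([[0.2, -0.1], [4.8, 5.3]])
print(label_points(pts, reps, np.array([0, 1])))   # [0 1]
```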

  25. Outlier Handling. Outliers cannot have many points close to them, and random sampling preserves this property, so clusters around an outlier grow more slowly. After partial clustering has been done, throw away the slowly growing (small-cardinality) clusters. This process is applied in two phases: • after partial clustering; • towards the end.
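The elimination step itself is simple; a minimal sketch (the size threshold is an assumption for illustration):

```python
# Illustrative sketch: after partial clustering, drop clusters that are still
# very small, treating their points as outliers.
def drop_outlier_clusters(clusters, min_size):
    """Keep only clusters with at least min_size points."""
    return [cl for cl in clusters if len(cl) >= min_size]

print(drop_outlier_clusters([[1, 2, 3], [4], [5, 6]], min_size=2))  # [[1, 2, 3], [5, 6]]
```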

  26. The Complete Algorithm. 1. Draw a random sample. 2. Partition the sample. 3. Partially cluster the partitions. 4. Eliminate outliers. 5. Cluster the partial clusters. 6. Label the data on disk.

  27. Sensitivity Analysis. We want to test the effects of varying: • the shrinkage factor; • the number of representatives; • the number of sample points. 100,000 points.

  28. Sensitivity Analysis: Shrink Factor. [Figures.]

  29. Sensitivity Analysis: Number of Representatives. [Figures for c = 2, 5, 10.] 2,500 sample points and shrinkage factor 0.3.

  30. Sensitivity Analysis: Random Sample Size. [Figures for 2,000, 2,500, and 3,000 sample points.] 10 representatives and shrinkage factor 0.3.

  31. Comparison with BIRCH. [Figure: execution time (sec.) vs. number of points, 100,000 to 500,000, for BIRCH and CURE with p = 1, 2, 5.] Data file has 100,000 points. Using a Sun Ultra-2/200 machine with 512 MB RAM (labeling excluded).

  32. Scale-up Experiments. [Figure: execution time (sec.) vs. number of sample points, 500 to 5,000, for p = 1, 2, 5.] Using a Sun Ultra-2/200 machine with 512 MB RAM (labeling excluded).
