Chapter 5-2: Clustering


Chapter 5-2: Clustering. Jilles Vreeken. Revision 1, November 20th: typos fixed (dendrogram). Revision 2, December 10th: clarified that we do consider a point x as a member of its own Ξ΅-neighborhood. IRDM '15/16, 12 Nov 2015


  1. Chapter 5-2: Clustering Jilles Vreeken Revision 1, November 20th: typos fixed (dendrogram) Revision 2, December 10th: clarified that we do consider a point x as a member of its own Ξ΅-neighborhood IRDM '15/16, 12 Nov 2015

  2. The First Midterm Test November 19th 2015 Where: GΓΌnter-Hotz-HΓΆrsaal (E2.2) Material: the first four lectures, the first two homeworks You are allowed to bring one (1) sheet of A4 paper with handwritten or printed notes on both sides. No other material (notes, books, course materials) or devices (calculator, notebook, cell phone, toothbrush, etc.) allowed. Bring an ID; either your UdS card or your passport.

  3. The Final Exam Preliminary dates: February 15th and 16th 2016 Oral exam. Can only be taken if you have passed two out of three mid-term tests. More details later.

  4. IRDM Chapter 5, overview 1. Basic idea 2. Representative-based clustering 3. Probabilistic clustering 4. Validation 5. Hierarchical clustering 6. Density-based clustering 7. Clustering high-dimensional data You'll find this covered in Aggarwal Ch. 6, 7 and Zaki & Meira, Ch. 13–15

  5. IRDM Chapter 5, today 1. Basic idea 2. Representative-based clustering 3. Probabilistic clustering 4. Validation 5. Hierarchical clustering 6. Density-based clustering 7. Clustering high-dimensional data You'll find this covered in Aggarwal Ch. 6, 7 and Zaki & Meira, Ch. 13–15

  6. Chapter 5.5: Hierarchical Clustering Aggarwal Ch. 6.4

  7. The basic idea Create a clustering for each number of clusters k = 1, 2, …, n. The clusterings must be hierarchical: β€’ every cluster of a k-clustering is a union of some clusters of an l-clustering, for all k < l β€’ i.e. for every l and every k > l, every cluster of the k-clustering is a subset of some cluster of the l-clustering Example: k = 6

  8. The basic idea (same definition as on slide 7) Example: k = 5

  9. The basic idea (same definition as on slide 7) Example: k = 4

  10. The basic idea (same definition as on slide 7) Example: k = 3

  11. The basic idea (same definition as on slide 7) Example: k = 2

  12. The basic idea (same definition as on slide 7) Example: k = 1
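The nesting property can be made concrete with a small sketch. This is not from the slides: it assumes made-up 2D data and uses SciPy's linkage and cut_tree to build one merge tree, read off a k-clustering for every k = 1, …, n, and check that each finer clustering refines the coarser one.

```python
# Sketch only: made-up data; SciPy is an assumed tool, not the course's own code.
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # 20 made-up points in 2D
n = len(X)

Z = linkage(X, method='average')      # one merge tree serves every k
labels = cut_tree(Z)                  # column j holds the (n - j)-clustering

# Nesting: every cluster of a (k+1)-clustering lies inside one cluster
# of the k-clustering.
for k in range(1, n):
    coarse = labels[:, n - k]         # k clusters
    fine = labels[:, n - (k + 1)]     # k + 1 clusters
    for c in np.unique(fine):
        assert len(np.unique(coarse[fine == c])) == 1
```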

  13. Dendrograms The difference in height between a tree and its subtrees shows the distance between the two branches (in the example figure, the distance is β‰ˆ 0.7)

  14. Dendrograms and clusters

  15. Dendrograms, revisited Dendrograms show the hierarchy of the clustering β€’ the number of clusters can be deduced from a dendrogram (higher branches) β€’ outliers can be detected from a dendrogram (single points that are far from the others)
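A minimal sketch of drawing such a dendrogram, assuming made-up data and SciPy/matplotlib (not part of the slides); the merge heights are the cluster distances one reads off the plot.

```python
# Sketch only: made-up data, assumed libraries.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),
               rng.normal(3, 0.3, (10, 2))])   # two well-separated groups

Z = linkage(X, method='single')
print("height of the top merge:", Z[-1, 2])    # distance between the two main branches

dendrogram(Z)
plt.ylabel("cluster distance")
plt.show()
```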

  16. Agglomerative and Divisive Agglomerative: bottom-up β€’ start with n clusters β€’ combine the two closest clusters into one bigger cluster Divisive: top-down β€’ start with 1 cluster β€’ divide a cluster into two β€’ divide the largest (per diameter) cluster into smaller clusters
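A bare-bones sketch of the bottom-up procedure; the data layout, the single-link rule, and the O(n^3) loop are illustrative assumptions, not the implementation used in the course.

```python
# Sketch only: naive O(n^3) agglomeration with single link as cluster distance.
import numpy as np

def agglomerate(X):
    clusters = [[i] for i in range(len(X))]        # start: every point on its own
    merges = []                                    # (members of A, members of B, distance)
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):             # find the closest pair of clusters
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]    # merge the pair ...
        del clusters[b]                            # ... and drop the absorbed cluster
    return merges
```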

  17. Cluster distances The distance between two points x and y is d(x, y). What is the distance between two clusters? Many intuitive definitions, no universal truth: β€’ different cluster distances yield different clusterings β€’ the choice of cluster distance depends on the application Some distances between clusters B and C: β€’ minimum distance d(B, C) = min{ d(x, y) : x ∈ B and y ∈ C } β€’ maximum distance d(B, C) = max{ d(x, y) : x ∈ B and y ∈ C } β€’ average distance d(B, C) = avg{ d(x, y) : x ∈ B and y ∈ C } β€’ distance of centroids d(B, C) = d(ΞΌ_B, ΞΌ_C), where ΞΌ_B is the centroid of B and ΞΌ_C is the centroid of C
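The four definitions above, written out as a sketch; the function names and the NumPy point sets are illustrative assumptions, not from the slides.

```python
# Sketch only: the four cluster distances for two point sets B and C.
import numpy as np
from itertools import product

def pairwise(B, C):
    return [np.linalg.norm(x - y) for x, y in product(B, C)]

def d_min(B, C):        # minimum distance (single link)
    return min(pairwise(B, C))

def d_max(B, C):        # maximum distance (complete link)
    return max(pairwise(B, C))

def d_avg(B, C):        # average distance (group average)
    return float(np.mean(pairwise(B, C)))

def d_centroid(B, C):   # distance of centroids (mean distance)
    return float(np.linalg.norm(np.mean(B, axis=0) - np.mean(C, axis=0)))
```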

  18. Single link The distance between two clusters is the distance between the closest points: β€’ d(B, C) = min{ d(x, y) : x ∈ B and y ∈ C }

  19. Strength of single-link Can handle non-spherical clusters of unequal size

  20. Weaknesses of single-link Sensitive to noise and outliers β€’ Produces elongated clusters

  21. Complete link The distance between two clusters is the distance between the furthest points: β€’ d(B, C) = max{ d(x, y) : x ∈ B and y ∈ C }

  22. Strengths of complete link Less susceptible to noise and outliers

  23. Weaknesses of complete-link Breaks largest clusters β€’ Biased towards spherical clusters

  24. Group average and mean distance Group average is the average of pairwise distances d(x, y): β€’ d(B, C) = avg{ d(x, y) : x ∈ B and y ∈ C } = ( βˆ‘_{x ∈ B, y ∈ C} d(x, y) ) / ( |B| Β· |C| ) Mean distance is the distance of the cluster centroids: β€’ d(B, C) = d(ΞΌ_B, ΞΌ_C)
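A tiny made-up example, not from the slides, showing that group average and mean distance are genuinely different quantities:

```python
# Sketch only: two toy point sets where the two distances disagree.
import numpy as np

B = np.array([[0.0, 0.0], [2.0, 0.0]])
C = np.array([[1.0, 1.0]])

group_avg = np.mean([np.linalg.norm(x - y) for x in B for y in C])  # = sqrt(2) ~ 1.41
mean_dist = np.linalg.norm(B.mean(axis=0) - C.mean(axis=0))         # = 1.0
print(group_avg, mean_dist)
```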

  25. Properties of group average A compromise between single and complete link β€’ Less susceptible to noise and outliers (similar to complete link) β€’ Biased towards spherical clusters (similar to complete link)

  26. Ward's method Ward's distance between clusters A and B is the increase in the sum of squared errors (SSE) when the two clusters are merged: β€’ the SSE of cluster A is SSE_A = βˆ‘_{x ∈ A} ||x βˆ’ ΞΌ_A||Β² β€’ the difference for merging clusters A and B into cluster C is then d(A, B) = Ξ”SSE_C = SSE_C βˆ’ SSE_A βˆ’ SSE_B β€’ or, equivalently, the weighted mean distance d(A, B) = ( |A| Β· |B| / (|A| + |B|) ) Β· ||ΞΌ_A βˆ’ ΞΌ_B||Β²
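A small sketch, on made-up clusters A and B, checking that the Ξ”SSE form and the weighted centroid form of Ward's distance give the same number:

```python
# Sketch only: made-up clusters; verifies the equivalence stated above.
import numpy as np

def sse(P):
    mu = P.mean(axis=0)
    return float(((P - mu) ** 2).sum())

rng = np.random.default_rng(2)
A = rng.normal(0, 1, (5, 2))
B = rng.normal(3, 1, (8, 2))

delta_sse = sse(np.vstack([A, B])) - sse(A) - sse(B)
weighted = (len(A) * len(B)) / (len(A) + len(B)) * \
           float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)) ** 2)
print(delta_sse, weighted)   # agree up to floating-point error
```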

  27. Discussion of Ward's method Less susceptible to noise and outliers β€’ Biased towards spherical clusters β€’ The hierarchical analogue of k-means: many shared pros and cons; can be used to initialise k-means

  28. Comparison: single link, complete link, group average, Ward's method (figure)

  29. Comparison: single link, complete link, group average, Ward's method (figure)

  30. Comparison: single link, complete link, group average, Ward's method (figure)

  31. Comparison: single link, complete link, group average, Ward's method (figure)
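A sketch of such a comparison, assuming made-up Gaussian data and SciPy's linkage/fcluster (not the data shown on the slides): the same points are clustered with each of the four rules and cut at k = 3.

```python
# Sketch only: made-up data; compares the four linkage rules from these slides.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in (0, 3, 6)])   # three blobs

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")               # cut into 3 clusters
    print(method, np.bincount(labels)[1:])                        # cluster sizes
```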
