

  1. INTRODUCTION TO MACHINE LEARNING Clustering with k-means

  2. Introduction to Machine Learning Clustering, what? ● Cluster: collection of objects ● Similar within cluster ● Dissimilar between clusters ● Clustering: grouping objects in clusters ● No labels: unsupervised classification ● Plenty of possible clusterings

  3. Introduction to Machine Learning Clustering, why? ● Pattern Analysis ● Targeted Marketing Programs ● Visualise Data ● Student Segmentations ● Pre-processing Step ● Data Mining ● Outlier Detection ● …

  4. Introduction to Machine Learning Clustering, how? ● Measure of Similarity: d(…, …) ● Numerical variables: metrics such as Euclidean, Manhattan, … ● Categorical variables: construct your own distance ● Clustering Methods: k-means, hierarchical, many variations, …
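
To make the similarity measures above concrete, here is a small sketch using base R's dist() on two made-up points; the points and the two metrics shown are purely illustrative.

  # Two toy points to compare common metrics for numerical variables
  pts <- rbind(c(0, 0), c(3, 4))
  dist(pts, method = "euclidean")   # sqrt(3^2 + 4^2) = 5
  dist(pts, method = "manhattan")   # |3| + |4| = 7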

  5. Introduction to Machine Learning Compactness and Separation ● Within Cluster Sums of Squares (WSS): $\mathrm{WSS} = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, c_i)^2$, with cluster $C_i$, cluster centroid $c_i$, object $x$ and $k$ clusters. Measure of compactness: minimise WSS. ● Between Cluster Sums of Squares (BSS): $\mathrm{BSS} = \sum_{i=1}^{k} n_i \, d(c_i, \bar{x})^2$, with $n_i$ the number of objects in cluster $i$ and $\bar{x}$ the sample mean. Measure of separation: maximise BSS.
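
Both quantities can be computed directly from these definitions. A minimal sketch, assuming a made-up matrix pts and a hand-picked cluster assignment cl (both hypothetical):

  pts <- matrix(c(1, 1, 2, 1, 8, 9, 9, 8), ncol = 2, byrow = TRUE)
  cl  <- c(1, 1, 2, 2)            # objects 1-2 in cluster 1, objects 3-4 in cluster 2
  grand_mean <- colMeans(pts)     # sample mean over all objects

  wss <- 0; bss <- 0
  for (i in unique(cl)) {
    members  <- pts[cl == i, , drop = FALSE]
    centroid <- colMeans(members)                      # cluster centroid
    wss <- wss + sum(sweep(members, 2, centroid)^2)    # squared distances to the centroid
    bss <- bss + nrow(members) * sum((centroid - grand_mean)^2)
  }
  wss; bss   # WSS (minimise) and BSS (maximise); together they add up to the total SS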

  6. Introduction to Machine Learning k-Means Algorithm Goal: Partition data in k disjoint subsets. Let's take k = 3. [Scatter plot of the data, x vs. y]

  7. Introduction to Machine Learning k-Means Algorithm Goal: Partition data in k disjoint subsets, k = 3. 1. Randomly assign k centroids [Scatter plot, x vs. y]

  8. Introduction to Machine Learning k-Means Algorithm Goal: Partition data in k disjoint subsets, k = 3. 1. Randomly assign k centroids 2. Assign data to closest centroid [Scatter plot, x vs. y]

  9. Introduction to Machine Learning k-Means Algorithm Goal: Partition data in k disjoint subsets, k = 3. 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Move centroids to average location [Scatter plot, x vs. y]

  10. Introduction to Machine Learning k-Means Algorithm Goal: Partition data in k disjoint subsets, k = 3. 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Move centroids to average location 4. Repeat steps 2 and 3 [Scatter plot, x vs. y]

  11. Introduction to Machine Learning k-Means Algorithm Goal: Partition data in k disjoint subsets, k = 3. 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Move centroids to average location 4. Repeat steps 2 and 3 [Scatter plot, x vs. y] The algorithm has converged!
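
The four steps above translate almost line for line into R. The following is only a sketch of the loop on made-up data (no empty-cluster handling), not a replacement for the kmeans() function shown later.

  set.seed(1)
  pts <- matrix(rnorm(60), ncol = 2)   # made-up data
  k <- 3

  # 1. Randomly assign k centroids (here: k randomly chosen observations)
  centroids <- pts[sample(nrow(pts), k), ]

  for (iter in 1:20) {
    # 2. Assign each object to its closest centroid (Euclidean distance)
    assignment <- apply(pts, 1, function(p) which.min(colSums((t(centroids) - p)^2)))
    # 3. Move each centroid to the average location of its objects
    new_centroids <- t(sapply(1:k, function(i) colMeans(pts[assignment == i, , drop = FALSE])))
    # 4. Repeat steps 2 and 3 until the centroids stop moving
    if (all(abs(new_centroids - centroids) < 1e-9)) break
    centroids <- new_centroids
  }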

  12. Introduction to Machine Learning Choosing k ● Goal: Find k that minimizes WSS ● Problem: WSS keeps decreasing as k increases! ● Solution: Fix k where WSS starts decreasing slowly, e.g. where WSS / TSS < 0.2

  13. Introduction to Machine Learning Choosing k: Scree Plot Scree Plot: Visualizing the ratio WSS / TSS as a function of k. Look for the elbow in the plot. Here: choose k = 3. [Scree plot: WSS / TSS against k = 1, …, 7]
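
A sketch of how such a scree plot could be built in R, assuming data holds the numeric dataset to cluster (the elbow position will of course depend on the data):

  ratios <- sapply(1:7, function(k) {
    km <- kmeans(data, centers = k, nstart = 20)
    km$tot.withinss / km$totss            # WSS / TSS for this k
  })
  plot(1:7, ratios, type = "b", xlab = "k", ylab = "WSS / TSS")   # look for the elbow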

  14. Introduction to Machine Learning k-Means in R > my_km <- kmeans(data, centers, nstart) ● centers: starting centroids or #clusters ● nstart: #times R restarts with different centroids Distance: Euclidean metric > my_km$tot.withinss (WSS) > my_km$betweenss (BSS)
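
A usage sketch, assuming data is a numeric data frame or matrix; the values of centers and nstart are arbitrary choices for the example:

  set.seed(100)                                   # results depend on the random starts
  my_km <- kmeans(data, centers = 3, nstart = 20)
  my_km$cluster        # cluster membership of each object
  my_km$tot.withinss   # WSS
  my_km$betweenss      # BSS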

  15. INTRODUCTION TO MACHINE LEARNING Let’s practice!

  16. INTRODUCTION TO MACHINE LEARNING Performance and Scaling

  17. Introduction to Machine Learning Cluster Evaluation Not trivial! There is no ground truth: ● No true labels ● No true response Evaluation methods? Depends on the goal. Goal: compact and separated clusters, which is measurable!

  18. Introduction to Machine Learning Cluster Measures ● WSS and BSS give a good indication ● Underlying idea: compare the separation between clusters with the variance within clusters ● Alternative: ● Diameter ● Intercluster Distance

  19. Introduction to Machine Learning Diameter [Scatter plot: objects, cluster, distance between objects] Measure of Compactness

  20. Introduction to Machine Learning Intercluster Distance [Scatter plot: objects, clusters, distance between objects] Measure of Separation

  21. Introduction to Machine Learning Dunn's Index [Scatter plot illustrating Dunn's Index]

  22. Introduction to Machine Learning Dunn's Index Higher Dunn: better separated / more compact. Notes: ● High computational cost ● Worst-case indicator
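
Dunn's index is commonly defined as the minimal intercluster distance divided by the maximal cluster diameter; the sketch below assumes that definition and uses a made-up matrix pts with a hand-picked partitioning cl.

  pts <- matrix(c(1, 1, 2, 1, 8, 9, 9, 8, 1, 9, 2, 8), ncol = 2, byrow = TRUE)
  cl  <- c(1, 1, 2, 2, 3, 3)
  d   <- as.matrix(dist(pts))

  # Diameter: maximal distance between objects within one cluster (compactness)
  diameters <- sapply(unique(cl), function(i) max(d[cl == i, cl == i]))

  # Intercluster distance: minimal distance between objects of two clusters (separation)
  pairs <- combn(unique(cl), 2)
  separations <- apply(pairs, 2, function(p) min(d[cl == p[1], cl == p[2]]))

  min(separations) / max(diameters)   # Dunn's index: a worst-case ratio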

  23. Introduction to Machine Learning Alternative measures ● Internal Validation: based on intrinsic knowledge ● BIC Index ● Silhouette's Index ● External Validation: based on previous knowledge ● Hubert's Correlation ● Jaccard's Coefficient

  24. Introduction to Machine Learning Evaluating in R Libraries: cluster and clValid Dunn's Index: > dunn(clusters = my_km$cluster, Data = ...) ● clusters: cluster partitioning vector ● Data: original dataset
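
A usage sketch, assuming my_km is the result of kmeans() and data is the original dataset it was fitted on:

  library(clValid)
  dunn(clusters = my_km$cluster, Data = data)   # higher is better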

  25. Introduction to Machine Learning Scale Issues Metrics are often scale dependent! Which pair is most similar? (Age, Income, IQ) ● X1 = (28, 72000, 120) ● X2 = (56, 73000, 80) ● X3 = (29, 74500, 118) ● Intuition: (X1, X3) ● Euclidean: (X1, X2) Solution: Rescale income, e.g. income / 1000 $

  26. Introduction to Machine Learning Standardizing Problem: Multiple variables on different scales. Solution: Standardize your data: 1. Subtract the mean 2. Divide by the standard deviation > scale(data) Note: standardized variables have a different interpretation.
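
A sketch that checks the Age / Income / IQ example above: with the raw values the income column dominates the Euclidean distance, and after scale() the intuitive pair (X1, X3) becomes the closest one.

  X <- rbind(X1 = c(28, 72000, 120),
             X2 = c(56, 73000,  80),
             X3 = c(29, 74500, 118))

  dist(X)          # d(X1, X2) ~ 1001 < d(X1, X3) ~ 2500: income dominates
  dist(scale(X))   # after standardizing, d(X1, X3) is the smallest distance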

  27. INTRODUCTION TO MACHINE LEARNING Let’s practice!

  28. INTRODUCTION TO MACHINE LEARNING Hierarchical Clustering

  29. Introduction to Machine Learning Hierarchical Clustering Hierarchy: ● Which objects cluster first? ● Which cluster pairs merge? When? Bottom-up: ● Starts from the objects ● Builds a hierarchy of clusters

  30. Introduction to Machine Learning Bottom-Up: Algorithm Pre: Calculate distances between objects [Diagram: the objects]

  31. Introduction to Machine Learning Bottom-Up: Algorithm Pre: Calculate distances between objects [Diagram: objects and their pairwise distances]

  32. Introduction to Machine Learning Bottom-Up: Algorithm 1. Put every object in its own cluster

  33. Introduction to Machine Learning Bottom-Up: Algorithm 2. Find the closest pair of clusters and merge them

  34. Introduction to Machine Learning Bottom-Up: Algorithm 3. Compute distances between new cluster and old ones

  35. Introduction to Machine Learning Bottom-Up: Algorithm 4. Repeat steps two and three

  36. Introduction to Machine Learning Bottom-Up: Algorithm 4. Repeat steps two and three

  37. Introduction to Machine Learning Bottom-Up: Algorithm 4. Repeat steps two and three

  38. Introduction to Machine Learning Bottom-Up: Algorithm 4. Repeat steps two and three until only one cluster remains

  39. Introduction to Machine Learning Linkage Methods ● Single-Linkage: minimal distance between clusters ● Complete-Linkage: maximal distance between clusters ● Average-Linkage: average distance between clusters ● Different linkage methods give different clusterings
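
For illustration, here is a compact sketch of the bottom-up loop with single linkage on a tiny made-up dataset; hclust() (shown later) does the same job far more efficiently.

  pts <- matrix(c(0, 0, 1, 0, 5, 5, 6, 5, 10, 0), ncol = 2, byrow = TRUE)
  d   <- as.matrix(dist(pts))                 # pre: distances between objects

  # 1. Put every object in its own cluster
  clusters <- as.list(seq_len(nrow(pts)))

  while (length(clusters) > 1) {
    # 2. Find the closest pair of clusters (single linkage: minimal pairwise distance)
    pairs <- combn(length(clusters), 2)
    link  <- apply(pairs, 2, function(p) min(d[clusters[[p[1]]], clusters[[p[2]]]]))
    best  <- pairs[, which.min(link)]
    # ... merge them; 3.-4. distances to the merged cluster follow from d, then repeat
    merged   <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters <- c(clusters[-best], list(merged))
  }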

  40. Introduction to Machine Learning Single-Linkage Minimal distance between objects of the two clusters

  41. Introduction to Machine Learning Complete-Linkage Maximal distance between objects of the two clusters

  42. Introduction to Machine Learning Single-Linkage: Chaining ● Often undesired

  43. Introduction to Machine Learning Single-Linkage: Chaining ● Often undesired

  44. Introduction to Machine Learning Single-Linkage: Chaining ● Often undesired

  45. Introduction to Machine Learning Single-Linkage: Chaining ● Often undesired ● Can be a great outlier detector

  46. Introduction to Machine Learning Dendrogram [Dendrogram: merges at increasing heights; leaves are the objects; a horizontal cut selects the clusters]

  47. Introduction to Machine Learning Hierarchical Clustering in R Library: stats > dist(x, method) ● x: dataset ● method: distance > hclust(d, method) ● d: distance matrix ● method: linkage
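
A usage sketch tying the two functions together, assuming data is the numeric dataset; cutree() (also in stats) turns the dendrogram into an actual partitioning:

  d  <- dist(data, method = "euclidean")   # pairwise distances between objects
  hc <- hclust(d, method = "complete")     # complete linkage
  plot(hc)                                 # dendrogram: merges, heights, leaves
  memb <- cutree(hc, k = 3)                # cut the tree into 3 clusters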

  48. Introduction to Machine Learning Hierarchical: Pros and Cons ● Pros ● In-depth analysis ● Different patterns ● Linkage methods ● Cons ● High computational cost ● Can never undo merges

  49. Introduction to Machine Learning k-Means: Pros and Cons ● Pros ● Can undo merges ● Fast computations ● Cons ● Fixed #Clusters ● Dependent on starting centroids

  50. INTRODUCTION TO MACHINE LEARNING Let’s practice!
