Machine Learning 2 (DS 4420, Spring 2020): Clustering I - Byron C. Wallace


  1. Machine Learning 2 DS 4420 - Spring 2020 Clustering I Byron C. Wallace

  2. Unsupervised learning
  • So far we have reviewed some fundamentals, discussed Maximum Likelihood Estimation (MLE) for probabilistic models, and covered neural networks / backprop with SGD
  • We have mostly considered supervised settings (implicitly), although the above methods are general; we will shift focus to unsupervised learning for a few weeks
  • Both the probabilistic and neural perspectives will continue to be relevant here, and we will consider the former explicitly for clustering next week

  5. Clustering

  6. Clustering
  Unsupervised learning (no labels for training)
  Group data into clusters so as to
  • Maximize intra-cluster similarity (points within a cluster are similar)
  • Minimize inter-cluster similarity (points in different clusters are dissimilar)

  8. What is a natural grouping?
  The choice of clustering criterion can be task-dependent: the same set of people could be grouped as Simpson's Family vs. School Employees, or as Females vs. Males.

  11. Defining Distance Measures
  Dissimilarity/distance: d(x1, x2)
  Similarity: s(x1, x2)
  Both are referred to as proximity: p(x1, x2)
  [Figure: example distances between the names "Peter" and "Piotr" under different measures: 3, 0.2, 342.7]

  14. Distance Measures
  Euclidean distance: $\sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$
  Manhattan distance: $\sum_{i=1}^{k} |x_i - y_i|$
  Minkowski distance: $\left( \sum_{i=1}^{k} |x_i - y_i|^q \right)^{1/q}$
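To make these three formulas concrete, here is a minimal NumPy sketch (not part of the original deck; the function names are my own) computing each distance between two k-dimensional vectors:

    import numpy as np

    def euclidean(x, y):
        # Square root of the sum of squared coordinate differences (Minkowski with q = 2).
        return np.sqrt(np.sum((x - y) ** 2))

    def manhattan(x, y):
        # Sum of absolute coordinate differences (Minkowski with q = 1).
        return np.sum(np.abs(x - y))

    def minkowski(x, y, q):
        # General form: (sum_i |x_i - y_i|^q)^(1/q).
        return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 0.0, 3.0])
    print(euclidean(x, y), manhattan(x, y), minkowski(x, y, q=3))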

  17. Similarity over functions of inputs
  • The preceding measures are distances defined on the original input space X
  • A better representation may be some function of these features, φ(x)

  18. Similarity: Kernels. Linear (inner product), polynomial, Radial Basis Function (RBF)

  19. [Figure from the MML book: the same 2-D data (first feature vs. second feature) clustered with a linear kernel and with an RBF kernel]

  20. Why kernels?
  "The key insight in kernel-based learning is that you can rewrite many linear models in a way that doesn't require you to ever explicitly compute φ(x)."
  - Daume, CIML
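The deck does not include code here; as a hedged illustration of the three kernels named on slide 18, a minimal NumPy sketch (the function names and the degree/gamma defaults are my own choices):

    import numpy as np

    def linear_kernel(x, z):
        # Plain inner product: k(x, z) = x . z
        return np.dot(x, z)

    def polynomial_kernel(x, z, degree=3, c=1.0):
        # k(x, z) = (x . z + c)^degree; implicitly uses all monomial features up to 'degree'.
        return (np.dot(x, z) + c) ** degree

    def rbf_kernel(x, z, gamma=1.0):
        # k(x, z) = exp(-gamma * ||x - z||^2); the implicit feature map is infinite-dimensional,
        # which is why we never compute φ(x) explicitly (the point of the quote above).
        return np.exp(-gamma * np.sum((x - z) ** 2))

    x = np.array([1.0, 2.0])
    z = np.array([2.0, 0.5])
    print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))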

  25. Similarities vs. Distance Measures
  Distance measure (metric):
  • D(A, B) = D(B, A)  (symmetry)
  • D(A, A) = 0  (reflexivity)
  • D(A, B) ≥ 0, with D(A, B) = 0 iff A = B  (positivity / separation)
  • D(A, B) ≤ D(A, C) + D(B, C)  (triangle inequality)
  Similarity functions:
  • Less formal; encode some notion of similarity but are not necessarily well defined
  • Can be negative
  • May not satisfy the triangle inequality

  26. Cosine similarity
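The slide is a bare title; cosine similarity between vectors x and y is (x · y) / (||x|| ||y||). A minimal NumPy sketch (the function name is my own):

    import numpy as np

    def cosine_similarity(x, y):
        # cos(theta) = (x . y) / (||x|| * ||y||); lies in [-1, 1] and depends
        # only on the direction of the vectors, not their magnitude.
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    x = np.array([1.0, 2.0, 0.0])
    y = np.array([2.0, 4.0, 0.0])
    print(cosine_similarity(x, y))  # 1.0: same direction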

  27. Four Types of Clustering 1. Centroid-based (K-means, K-medoids) Notion of Clusters: Points grouped around their nearest cluster center

  28. Four Types of Clustering 2. Connectivity-based (Hierarchical) Notion of Clusters: Cut off dendrogram at some depth

  29. Four Types of Clustering 3. Density-based (DBSCAN, OPTICS) Notion of Clusters: Connected regions of high density

  30. Four Types of Clustering 4. Distribution-based (Mixture Models) Notion of Clusters: Distributions on features

  31. K-Means clustering (board)

  32. K-means Algorithm
  Input: data $X = \{x_1, x_2, \ldots, x_N\}$ and the number of clusters $K$
  Initialize: $K$ random centroids $\mu_1, \mu_2, \ldots, \mu_K$
  Repeat until convergence:
    1. For $i = 1, \ldots, K$: $C_i = \{x \in X \mid i = \arg\min_{1 \le j \le K} \|x - \mu_j\|^2\}$  (assign each point to its closest centroid)
    2. For $i = 1, \ldots, K$: $\mu_i = \arg\min_z \sum_{x \in C_i} \|z - x\|^2$  (update each centroid; the minimizer is the mean of the points in $C_i$)
  Output: $C_1, C_2, \ldots, C_K$

  36. K-means Clustering (Algorithm: K-means, Distance metric: Euclidean distance). [Figure: 2-D points on a 0-5 grid with three randomly placed centroids μ1, μ2, μ3.] Randomly initialize K centroids μ_k

  37. K-means Clustering. [Figure: same grid; points assigned to centroids μ1, μ2, μ3.] Assign each point to the closest centroid, then update centroids to the average of their points

  38. K-means Clustering. [Figure: assignments and centroid positions after another iteration.] Assign each point to the closest centroid, then update centroids to the average of their points

  39. K-means Clustering. [Figure: assignments and centroid positions after a further iteration.] Repeat until convergence (no points reassigned, means unchanged)

  40. K-means Clustering. [Figure: final assignments and centroid positions.] Repeat until convergence (no points reassigned, means unchanged)

  41. K-means Algorithm (same pseudocode as slide 32, with two choices for the centroid update)
  • K-means: set μ_i to the mean of the points in C_i
  • K-medoids: set μ_i = x for the point x in C_i with minimum SSE to the other points in the cluster

  42. Let's see some examples in Python
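No Python code survives in the transcript; the following is a minimal NumPy sketch of the K-means procedure from slide 32, not the instructor's actual demo (the synthetic blob data and K = 3 are my own choices):

    import numpy as np

    def kmeans(X, K, n_iters=100, seed=0):
        """Alternate the assignment and mean-update steps until the means stop changing."""
        rng = np.random.default_rng(seed)
        # Initialize centroids as K randomly chosen data points.
        centroids = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iters):
            # Assignment step: each point goes to its nearest centroid (squared Euclidean distance).
            dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each centroid becomes the mean of its assigned points.
            new_centroids = np.array([
                X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
                for k in range(K)
            ])
            if np.allclose(new_centroids, centroids):
                break  # converged: means unchanged
            centroids = new_centroids
        return labels, centroids

    # Toy data: three Gaussian blobs in 2-D.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in [(0, 0), (3, 3), (0, 4)]])
    labels, centroids = kmeans(X, K=3)
    print(centroids)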

  43. “Good” Initialization of Centroids. [Figure: six panels showing iterations 1-6 of K-means on the same 2-D data, with centroids marked by '+' converging to the cluster centers.]
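The slides do not name a library, but as a practical aside on initialization: scikit-learn's KMeans runs several initializations and keeps the lowest-inertia solution, and its k-means++ scheme spreads the initial centroids apart. A minimal usage sketch on synthetic data (the data and parameter values are my own choices):

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy data: three Gaussian blobs, as in the 2-D examples above.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in [(0, 0), (3, 3), (0, 4)]])

    # n_init=10 runs K-means from 10 different initializations and keeps the
    # solution with the lowest within-cluster sum of squares (inertia);
    # init="k-means++" chooses initial centroids that are spread apart.
    km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_, km.inertia_)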
