Unsupervised learning: latent space analysis and clustering


  1. Introduction to Machine Learning. Unsupervised learning: latent space analysis and clustering. Yifeng Tao, School of Computer Science, Carnegie Mellon University. Slides adapted from Tom Mitchell, David Sontag, and Ziv Bar-Joseph.

  2. Outline
    o Dimension reduction / latent space analysis
      o PCA
      o ICA
      o t-SNE
    o Clustering
      o K-means
      o GMM
      o Hierarchical/agglomerative clustering

  3. Unsupervised mapping to lower dimension
    o Instead of choosing a subset of features, create new features (dimensions) defined as functions over all features
    o Don't consider class labels, just the data points

  4. Principal Components Analysis (PCA)
    o Given data points in d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible
    o E.g., find the best planar approximation to 3D data
    o E.g., find the best planar approximation to 10^4-dimensional data
    o In particular, choose the projection that minimizes the squared error in reconstructing the original data
    [Slide from Tom Mitchell]

  5. PCA: Find Projections to Minimize Reconstruction Error
    o Assume the data is a set of d-dimensional vectors, where the n-th vector is x^(n)
    o We can represent these in terms of any d orthogonal basis vectors
    [Slide from Tom Mitchell]

  6. PCA
    o Note that we get zero error if M = d, so all error is due to the missing components.
    [Slide from Tom Mitchell]
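A minimal sketch of this projection view, assuming NumPy and synthetic data (not code from the slides): compute the principal directions from the SVD of the centered data, project onto the top M of them, and watch the squared reconstruction error fall to zero as M approaches d.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))         # toy data: 200 points in d = 5 dimensions
Xc = X - X.mean(axis=0)               # PCA operates on centered data

# The principal directions are the right singular vectors of the centered data.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

for M in range(1, 6):
    W = Vt[:M]                        # top-M principal directions, shape (M, d)
    Z = Xc @ W.T                      # coordinates in the M-dimensional latent space
    X_hat = Z @ W                     # reconstruction back in the original space
    err = np.sum((Xc - X_hat) ** 2)   # squared reconstruction error
    print(f"M = {M}: squared reconstruction error = {err:.4f}")
# The error shrinks as M grows and is (numerically) zero at M = d.
```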

  7. PCA
    o A more rigorous derivation can be found in the Bishop book.
    [Slide from Tom Mitchell]

  8-11. PCA Example (figures) [Slides from Tom Mitchell]

  12. Independent Components Analysis (ICA)
    o PCA seeks directions <Y_1 ... Y_M> in feature space X that minimize reconstruction error
    o ICA seeks directions <Y_1 ... Y_M> that are most statistically independent, i.e., that minimize I(Y), the mutual information between the Y_j:
      I(Y) = sum_j H(Y_j) - H(Y), where H(Y) is the entropy of Y
    o Widely used in signal processing
    [Slide from Tom Mitchell]

  13. ICA example
    o Both PCA and ICA try to find a set of vectors, a basis, for the data, so that any point (vector) in the data can be written as a linear combination of the basis.
    o In PCA, the basis you want to find is the one that best explains the variability of your data.
    o In ICA, the basis you want to find is the one in which each vector is an independent component of your data.
    [Slide from https://www.quora.com/What-is-the-difference-between-PCA-and-ICA]
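As an illustration of the contrast (my own sketch, assuming scikit-learn; the source mixture below is made up), FastICA recovers two statistically independent source signals from their linear mixture, while PCA merely rotates to directions of maximal variance:

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                         # independent source 1: sine wave
s2 = np.sign(np.sin(3 * t))                # independent source 2: square wave
S = np.c_[s1, s2]
A = np.array([[1.0, 0.5], [0.5, 2.0]])     # mixing matrix
X = S @ A.T                                # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)               # recovered (statistically independent) sources

pca = PCA(n_components=2)
Y = pca.fit_transform(X)                   # orthogonal directions of maximal variance
# S_est approximates the original sources (up to scale and permutation); Y generally does not.
```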

  14. t-Distributed Stochastic Neighbor Embedding (t-SNE)
    o Nonlinear dimensionality reduction technique
    o Manifold learning
    [Figure from https://scikit-learn.org/stable/auto_examples/manifold/plot_t_sne_perplexity.html#sphx-glr-auto-examples-manifold-plot-t-sne-perplexity-py]

  15. t-SNE
    o Two stages:
      o First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked while dissimilar points have an extremely small probability of being picked.
      o Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback-Leibler divergence between the two distributions with respect to the locations of the points in the map.
    o The KL divergence is minimized using gradient descent.
    [Slide from https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding]
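A minimal usage sketch (assuming scikit-learn's TSNE and toy blob data; the slide describes the algorithm itself, not this particular API):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy high-dimensional data: two well-separated Gaussian blobs in 50 dimensions.
X = np.vstack([rng.normal(0, 1, size=(100, 50)),
               rng.normal(5, 1, size=(100, 50))])

# Perplexity controls the effective number of neighbors used when building the
# high-dimensional similarity distribution; the KL divergence between the high-
# and low-dimensional distributions is then minimized by gradient descent.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)  # (200, 2)
```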

  16. t-SNE example
    o Visualizing MNIST
    [Figure from https://lvdmaaten.github.io/tsne/]

  17. Clustering
    o Unsupervised learning
    o Requires data, but no labels
    o Detects patterns, e.g., in:
      o Grouping emails or search results
      o Customer shopping patterns
      o Regions of images
    o Useful when you don't know what you're looking for
    [Slide from David Sontag]

  18. Clustering
    o Basic idea: group together similar instances
    o Example: 2D point patterns
    [Slide from David Sontag]

  19. The clustering result can differ substantially depending on the grouping rule used. [Slide from David Sontag]

  20. Distance measure
    o What could "similar" mean?
    o One option: small (squared) Euclidean distance
    o Clustering results depend crucially on the measure of similarity (or distance) between the "points" to be clustered
    o What properties should a distance measure have?
      o Symmetry: D(A, B) = D(B, A)
        o Otherwise, we could say A looks like B but B does not look like A
      o Positivity and self-similarity: D(A, B) >= 0, and D(A, B) = 0 iff A = B
        o Otherwise there would be different objects that we cannot tell apart
      o Triangle inequality: D(A, B) + D(B, C) >= D(A, C)
        o Otherwise one could say "A is like B, B is like C, but A is not like C at all"
    [Slide from David Sontag]
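A tiny sketch of these properties on made-up points, using the plain Euclidean distance (my own example). It also notes in passing that the squared Euclidean option above keeps symmetry and positivity but can violate the triangle inequality:

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

A, B, C = [0.0, 0.0], [3.0, 4.0], [6.0, 8.0]

assert euclidean(A, B) == euclidean(B, A)                    # symmetry
assert euclidean(A, B) >= 0 and euclidean(A, A) == 0         # positivity / self-similarity
assert euclidean(A, B) + euclidean(B, C) >= euclidean(A, C)  # triangle inequality

# The *squared* Euclidean distance breaks the triangle inequality:
sq = lambda a, b: euclidean(a, b) ** 2
print(sq(A, B) + sq(B, C), "<", sq(A, C))   # 50.0 < 100.0
```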

  21. Clustering algorithms
    o Partition algorithms
      o K-means
      o Mixture of Gaussians
      o Spectral clustering (graph-based; not discussed in this lecture)
    o Hierarchical algorithms
      o Bottom-up: agglomerative (see the sketch below)
      o Top-down: divisive (not discussed in this lecture)
    [Slide from David Sontag]
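For the bottom-up (agglomerative) branch, a minimal sketch using SciPy's hierarchical-clustering utilities (an assumed tool choice, not one named in the slides): every point starts as its own cluster and the two closest clusters are merged repeatedly.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two toy 2D clusters.
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
               rng.normal(3, 0.3, size=(20, 2))])

# Agglomerative (bottom-up): repeatedly merge the two closest clusters,
# here with "average" linkage as the between-cluster distance.
Z = linkage(X, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
```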

  22. Clustering examples
    o Image segmentation
    o Goal: break up the image into meaningful or perceptually similar regions
    [Slide from David Sontag]

  23. Clustering examples
    o Clustering gene expression data

  24-25. K-Means
    o An iterative clustering algorithm
    o Initialize: pick K random points as cluster centers
    o Alternate:
      o Assign each data point to the closest cluster center
      o Change each cluster center to the average of its assigned points
    o Stop when no points' assignments change
    [Slides from David Sontag]
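A compact NumPy sketch of this alternate-and-stop loop on toy two-blob data (illustrative code, not the slides' own):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Plain K-means: alternate assignment and mean-update until assignments stop changing."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()  # K random points as initial centers
    assign = None
    for _ in range(max_iters):
        # Assignment step: send each point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # shape (N, K)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                                                  # no assignment changed: stop
        assign = new_assign
        # Update step: move each center to the mean of its assigned points.
        for k in range(K):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)
    return centers, assign

# Toy data: two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
centers, labels = kmeans(X, K=2)
print(centers)
```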

  26. K-means clustering: Example
    o Pick K random points as cluster centers (means)
    o Shown here for K = 2
    [Slide from David Sontag]

  27. K-means clustering: Example
    o Iterative step 1: assign data points to the closest cluster center
    [Slide from David Sontag]

  28. K-means clustering: Example
    o Iterative step 2: change the cluster center to the average of the assigned points
    [Slide from David Sontag]

  29. K-means clustering: Example
    o Repeat until convergence
    [Slide from David Sontag]

  30-32. K-means clustering: Example (continued; figures) [Slides from David Sontag]

  33. Properties of the K-means algorithm
    o Guaranteed to converge in a finite number of iterations
    o Running time per iteration:
      o Assigning data points to the closest cluster center: O(KN) time
      o Changing each cluster center to the average of its assigned points: O(N) time
    [Slide from David Sontag]

  34. K-means convergence [Slide from David Sontag]

  35-36. Example: K-Means for Segmentation (figures) [Slides from David Sontag]

  37. Initialization
    o The K-means algorithm is a heuristic
      o It requires initial means
      o It does matter what you pick!
    o What can go wrong?
    o Various schemes exist for preventing this kind of thing: variance-based split/merge, initialization heuristics
      o E.g., multiple initializations, k-means++ (see the sketch below)
    [Slide from David Sontag]
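A sketch of the k-means++ seeding mentioned above (illustrative code, assuming NumPy): the first center is chosen uniformly at random, and each later center is sampled with probability proportional to its squared distance from the nearest center chosen so far, which spreads the initial means out.

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """k-means++ style seeding: spread the initial centers out."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                  # first center: uniform at random
    for _ in range(K - 1):
        # Squared distance from every point to its nearest already-chosen center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])   # far-away points are more likely
    return np.array(centers)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
print(kmeans_pp_init(X, K=2))
```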

  38. K-Means Getting Stuck
    o A local optimum:
    [Slide from David Sontag]

  39. K-means not able to properly cluster
    o Spectral clustering will help in this case.
    [Slide from David Sontag]

  40. Changing the features (distance function) can help [Slide from David Sontag]

  41. Reconsidering "hard assignments"?
    o Clusters may overlap
    o Some clusters may be "wider" than others
    o Distances can be deceiving
    [Slide from Ziv Bar-Joseph]

  42. Gaussian Mixture Models [Slide from Ziv Bar-Joseph]
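A minimal soft-assignment sketch, assuming scikit-learn's GaussianMixture (the deck introduces GMMs conceptually; this API choice is mine): unlike K-means' hard assignments, each point receives a posterior probability of belonging to each possibly overlapping, differently sized Gaussian component.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping clusters with different widths.
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(2, 1.5, (100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
resp = gmm.predict_proba(X)     # soft (probabilistic) assignments, one row per point
hard = gmm.predict(X)           # hard assignment = most probable component
print(resp[:3], hard[:3])
```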
