

1. Machine Learning Fall 2017: Unsupervised Learning (Clustering: k-means, EM, mixture models). Professor Liang Huang (Chaps. 15-16 of CIML)

2. Roadmap
• so far: (large-margin) supervised learning (CIML Chaps. 3, 4, 5, 7, 11, 17, 18)
  • online learning: avg perceptron/MIRA, convergence proof
  • SVMs: formulation, KKT, dual, convex, QP, SGD (Pegasos) [figure: separating hyperplane with y = +1 / y = -1]
  • kernels and kernelized perceptron in dual; kernel SVM
  • briefly: k-NN and leave-one-out; RL and imitation learning
  • structured perceptron/SVM, HMM, MLE, Viterbi (e.g. "the man bit the dog" => DT NN VBD DT NN)
• what we left out: many classical algorithms (CIML Chaps. 1, 9, 10, 13)
  • decision trees, logistic regression, linear regression, boosting, ...
• next up: unsupervised learning (CIML Chaps. 15, 16)
  • clustering: k-means, EM, mixture models, hierarchical
  • dimensionality reduction: PCA, non-linear (LLE, etc.)

3. CIML book (chapters, with weeks covered and extra topics)
  1. Decision Trees
  2. Limits of Learning
  3. Geometry and Nearest Neighbors (week 1)
  4. The Perceptron (week 2)
  5. Practical Issues
  6. Beyond Binary Classification (weeks 3, 4)
  7. Linear Models
  8. Bias and Fairness
  9. Probabilistic Modeling ("important in DL")
  10. Neural Networks (week 5)
  11. Kernel Methods
  12. Learning Theory
  13. Ensemble Methods
  14. Efficient Learning
  15. Unsupervised Learning (next: week 8b)
  16. Expectation Maximization (next: weeks 8b, 9a)
  17. Structured Prediction (weeks 7, 8a)
  18. Imitation Learning (week 5b)
• extra topics covered: MIRA, aggressive MIRA (week 5b), convex programming, quadratic programming, Pegasos, dual Pegasos, structured Pegasos
• in retrospect: should start with k-NN; should cover logistic regression

4. Sup => Unsup: k-NN => k-means
• let's review a supervised learning method: nearest neighbor
• SVM, perceptron (in dual) and NN are all instance-based learning
• instance-based learning: store a subset of examples for classification
• compression rate: SVM: very high, perceptron: medium-high, NN: 0 (it stores every training example)

5. k-Nearest Neighbor
• using k > 1 neighbors (majority vote) is one way to prevent overfitting => more stable results
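A minimal sketch of the idea (my own illustration, not course code; the name knn_predict is mine): with k > 1, the prediction is a majority vote over the k nearest stored training points, which smooths out the noisy decisions of 1-NN.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from the query x to every stored training example
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]            # majority vote over the k labels
```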

6. NN Voronoi in 2D and 3D [figure slide]

7. Voronoi for Euclidean and Manhattan distances [figure slide]

8. Unsupervised Learning
• cost of supervised learning
  • labeled data: expensive to annotate!
  • but there exists huge data w/o labels
• unsupervised learning
  • can only hallucinate the labels
  • infer some "internal structures" of the data
• still the "compression" view of learning
  • too much data => reduce it!
  • clustering: reduce # of examples
  • dimensionality reduction: reduce # of dimensions
[figure: unlabeled 2D data, panel (a)]

9. Challenges in Unsupervised Learning
• how to evaluate the results?
  • there is no gold-standard data!
  • internal metric?
• how to interpret the results?
  • how to "name" the clusters?
• how to initialize the model/guess?
  • a bad initial guess can lead to very bad results
  • unsup is very sensitive to initialization (unlike supervised)
• how to do optimization => in general no longer convex!
[figure: unlabeled 2D data, panel (a)]

10. k-means
• (randomly) pick k points to be initial centroids
• repeat the two steps until convergence:
  • assignment to centroids: Voronoi, like NN
  • recomputation of centroids based on the new assignment
[figure: initial centroids and data, panel (a)]
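A minimal sketch of these two steps in Python (my own illustration, not the course's implementation; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, n_iters=100, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    # randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # assignment step: each point goes to its nearest centroid (Voronoi / 1-NN)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # recomputation step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, assign
```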

11.-18. k-means (animation frames): the same two steps repeated: assign each point to its nearest centroid (Voronoi, like 1-NN), then recompute each centroid from its assigned points; the figure advances through panels (b)-(i) as the algorithm converges.

19. k-means
• (randomly) pick k points to be initial centroids
• repeat the two steps until convergence:
  • assignment to centroids: Voronoi, like NN
  • recomputation of centroids based on the new assignment
• how to define convergence?
  • after a fixed number of iterations, or
  • assignments do not change, or
  • centroids do not change (equivalent?), or
  • change in objective function falls below a threshold
[figure: converged clustering, panel (i)]

20. k-means objective function
• residual sum of squares (RSS): sum of squared distances from points to their centroids
• guaranteed to decrease monotonically
• convergence proof: monotonic decrease + finite # of possible clusterings
[figure: objective J decreasing over iterations]
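Written out (standard form, not copied from the slide), the objective is

```latex
J \;=\; \sum_{j=1}^{N} \bigl\lVert x_j - \mu_{c(j)} \bigr\rVert^2 ,
```

where c(j) is the cluster currently assigned to point x_j and \mu_k is the centroid of cluster k. The assignment step can only lower J (each point moves to a closer centroid), and the recomputation step can only lower J (the mean minimizes the sum of squared distances within a cluster), which gives the monotonic decrease the slide relies on.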

21. k-means for image segmentation [figure slide]
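One common recipe (assumed here; the slide only shows the result): treat each pixel's color as a point, cluster in color space with k-means, and recolor each pixel with its centroid. A sketch reusing the hypothetical kmeans() from the earlier block:

```python
import numpy as np

def segment_image(img, k=4):
    """img: (H, W, 3) array of RGB values; returns the recolored image."""
    h, w, c = img.shape
    pixels = img.reshape(-1, c).astype(float)   # one row per pixel, in color space
    centroids, assign = kmeans(pixels, k)       # kmeans() from the sketch above
    return centroids[assign].reshape(h, w, c)   # recolor each pixel with its centroid
```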

22. Problems with k-means
• problem: sensitive to initialization
  • the objective function is non-convex: many local minima
  • why?
• k-means works well if:
  • clusters are spherical
  • clusters are well separated
  • clusters have similar volumes
  • clusters have similar # of examples

23. Better ("soft") k-means?
• random restarts -- definitely helps (see the sketch after this slide)
• soft clusters => EM with a Gaussian Mixture Model
[figure: hard clustering vs. mixture densities p(x), panels (a) and (b)]
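A sketch of the random-restart idea (again reusing the hypothetical kmeans() from above): run from several random initializations and keep the clustering with the lowest RSS.

```python
import numpy as np

def kmeans_restarts(X, k, restarts=10):
    best_rss, best = np.inf, None
    for seed in range(restarts):
        centroids, assign = kmeans(X, k, rng=np.random.default_rng(seed))
        rss = ((X - centroids[assign]) ** 2).sum()   # residual sum of squares J
        if rss < best_rss:
            best_rss, best = rss, (centroids, assign)
    return best
```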

24. k-means (in EM terms)
• randomize k initial centroids
• repeat the two steps until convergence:
  • E-step: assignment of each example to its nearest centroid (Voronoi)
  • M-step: recomputation of centroids (based on the new assignment)
[figure: panel (a)]

25. EM for Gaussian Mixtures
• randomize k means, covariances, mixing coefficients
• repeat the two steps until convergence:
  • E-step: evaluate the responsibilities using the current parameters
  • M-step: re-estimate the parameters using the current responsibilities
[figure: initial data and Gaussians, panel (a)]
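For reference, the standard update equations behind these two steps (textbook form, e.g. Bishop; the notation is mine rather than the slide's): the E-step computes responsibilities, and the M-step re-estimates the parameters from them.

```latex
\text{E-step:}\qquad
\gamma_{jk} \;=\; \frac{\pi_k \, \mathcal{N}(x_j \mid \mu_k, \Sigma_k)}
                       {\sum_{i=1}^{K} \pi_i \, \mathcal{N}(x_j \mid \mu_i, \Sigma_i)}
\\[1em]
\text{M-step:}\qquad
N_k = \sum_j \gamma_{jk}, \quad
\mu_k = \frac{1}{N_k} \sum_j \gamma_{jk}\, x_j, \quad
\Sigma_k = \frac{1}{N_k} \sum_j \gamma_{jk}\, (x_j - \mu_k)(x_j - \mu_k)^{\top}, \quad
\pi_k = \frac{N_k}{N}
```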

26.-31. EM for Gaussian Mixtures (animation frames): the same E-step / M-step repeated; the responsibilities act as "fractional assignments" of each point to each Gaussian, and the figure shows the fit after L = 1, 2, 5, and 20 iterations (panels (b)-(f)).

32. EM for Gaussian Mixtures [figure: panels (b) and (c) side by side, L = 1]

33. Convergence
• EM converges much slower than k-means
• can't use "assignments don't change" as the convergence test (assignments are fractional)
• use the log likelihood of the data:
  $$\mathcal{L} = \log P(\text{data}) = \log \prod_j P(x_j) = \sum_j \log P(x_j) = \sum_j \log \sum_i P(c_i)\, P(x_j \mid c_i)$$
• stop if the increase in log likelihood is smaller than a threshold,
• or a maximum # of iterations has been reached
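A sketch of that convergence test (my tooling choice, scipy, not the course's; params is assumed to be a list of (pi_k, mu_k, Sigma_k) triples):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_likelihood(X, params):
    # log P(data) = sum_j log sum_i  pi_i * N(x_j | mu_i, Sigma_i)
    log_terms = np.stack(
        [np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=cov)
         for pi, mu, cov in params],
        axis=1)                                  # shape (num_points, k)
    return logsumexp(log_terms, axis=1).sum()

# stop when the gain falls below a threshold (or after a max # of iterations)
def has_converged(prev_ll, new_ll, tol=1e-4):
    return new_ll - prev_ll < tol
```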
