 
Machine Learning, Fall 2017
Unsupervised Learning (Clustering: k-means, EM, mixture models)
Professor Liang Huang
(Chaps. 15-16 of CIML)
Roadmap
• so far: (large-margin) supervised learning (CIML Chaps. 3, 4, 5, 7, 11, 17, 18)
  • online learning: avg perceptron/MIRA, convergence proof
  • SVMs: formulation, KKT, dual, convex, QP, SGD (Pegasos)
  • kernels and kernelized perceptron in dual; kernel SVM
  • briefly: k-NN and leave-one-out; RL and imitation learning
  • structured perceptron/SVM, HMM, MLE, Viterbi (e.g. "the man bit the dog" tagged DT NN VBD DT NN)
• what we left out: many classical algorithms (CIML Chaps. 1, 9, 10, 13)
  • decision trees, logistic regression, linear regression, boosting, ...
• next up: unsupervised learning (CIML Chaps. 15, 16)
  • clustering: k-means, EM, mixture models, hierarchical
  • dimensionality reduction: PCA, non-linear (LLE, etc.)
CIML book
 1 Decision Trees
 2 Limits of Learning
 3 Geometry and Nearest Neighbors    week 5b
 4 The Perceptron                    week 1
 5 Practical Issues                  week 2
 6 Beyond Binary Classification
 7 Linear Models                     weeks 3,4
 8 Bias and Fairness
 9 Probabilistic Modeling            important in DL (chapters 9-10)
10 Neural Networks
11 Kernel Methods                    week 5
12 Learning Theory
13 Ensemble Methods
14 Efficient Learning
15 Unsupervised Learning             next: week 8b,9a
16 Expectation Maximization          next: week 8b
17 Structured Prediction             weeks 7,8a
18 Imitation Learning                week 5b
extra topics covered: MIRA, aggressive MIRA, convex programming, quadratic programming, Pegasos, dual Pegasos, structured Pegasos
in retrospect:
• should start with k-NN
• should cover logistic regression
Sup => Unsup: k-NN => k-means
• let’s review a supervised learning method: nearest neighbor
• SVM, perceptron (in dual), and NN are all instance-based learning
• instance-based learning: store a subset of examples for classification
• compression rate: SVM: very high, perceptron: medium high, NN: 0
k-Nearest Neighbor
• one way to prevent overfitting => more stable results
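The stabilizing effect of voting among k neighbors can be sketched as follows. This is a minimal illustration, not the course's implementation: the function name `knn_predict`, the toy data, and the choice of Euclidean distance are all assumptions made here.

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """Majority vote among the k training points closest to x.

    train is a list of (point, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration.
train = [((0.0, 0.0), +1), ((0.2, 0.1), +1), ((0.1, 0.3), +1),
         ((0.15, 0.15), -1),            # a noisy label inside the +1 region
         ((2.0, 2.0), -1), ((2.1, 1.9), -1)]
query = (0.16, 0.16)

pred_1nn = knn_predict(train, query, k=1)   # trusts the single noisy neighbor
pred_3nn = knn_predict(train, query, k=3)   # the noisy neighbor is outvoted
```

With k = 1 the query inherits the noisy label; with k = 3 the two clean neighbors outvote it, which is the sense in which larger k gives more stable results.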
NN Voronoi in 2D and 3D
Voronoi for Euclidean and Manhattan
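A small sketch of why the two distances give different Voronoi cells: the same query point can have a different nearest site under each metric. The helper names and the two example sites are assumptions chosen for illustration.

```python
def euclidean(a, b):
    """Straight-line (L2) distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def manhattan(a, b):
    """City-block (L1) distance."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def nearest(sites, x, dist):
    """Index of the site closest to x under the given distance function."""
    return min(range(len(sites)), key=lambda i: dist(sites[i], x))

sites = [(3.0, 0.0), (2.0, 2.0)]
x = (0.0, 0.0)

cell_l2 = nearest(sites, x, euclidean)   # distances 3.0 vs about 2.83
cell_l1 = nearest(sites, x, manhattan)   # distances 3.0 vs 4.0
```

Here the origin falls in the cell of site 1 under Euclidean distance but in the cell of site 0 under Manhattan distance.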
Unsupervised Learning
• cost of supervised learning
  • labeled data: expensive to annotate!
  • but there exists huge data w/o labels
• unsupervised learning
  • can only hallucinate the labels
  • infer some “internal structures” of data
• still the “compression” view of learning
  • too much data => reduce it!
  • clustering: reduce # of examples
  • dimensionality reduction: reduce # of dimensions
[Figure: panel (a), 2-D plot with both axes from −2 to 2]
Challenges in Unsupervised Learning
• how to evaluate the results?
  • there is no gold standard data!
  • internal metric?
• how to interpret the results?
  • how to “name” the clusters?
• how to initialize the model/guess?
  • a bad initial guess can lead to very bad results
  • unsup is very sensitive to initialization (unlike supervised)
• how to do optimization => in general no longer convex!
[Figure: panel (a), 2-D plot with both axes from −2 to 2]
k-means
• (randomly) pick k points to be initial centroids
• repeat the two steps until convergence
  • assignment to centroids: Voronoi, like 1-NN
  • recomputation of centroids based on the new assignment
[Figure: panels (a)-(i), successive k-means steps on 2-D data, both axes from −2 to 2]
k-means: how to define convergence?
• after a fixed number of iterations, or
• assignments do not change, or
• centroids do not change (equivalent?), or
• change in objective function falls below threshold
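The two alternating steps, with the "assignments do not change" stopping rule plus an iteration cap, can be sketched like this. It is a minimal sketch under stated assumptions: the function name `kmeans`, the `seed` and `max_iters` parameters, the handling of empty clusters, and the toy data are choices made here, not something the slides specify.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Basic k-means on a list of equal-length tuples.

    Returns (centroids, assignments). Stops when assignments no longer
    change or after max_iters iterations, whichever comes first.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k data points as initial centroids
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid (1-NN / Voronoi).
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:     # convergence: no point changed cluster
            break
        assignments = new_assignments
        # Recomputation step: each centroid becomes the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                        # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, assignments

# Two well-separated square blobs (toy data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centroids, assignments = kmeans(points, k=2)
```

On data this well separated the result should not depend on which two points are drawn as initial centroids; that is not true in general.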
k-means objective function
• residual sum of squares (RSS)
  • sum of squared distances from points to their centroids
• guaranteed to decrease monotonically (it never increases from one step to the next)
• convergence proof: monotonic decrease + finite # of clusterings
[Figure: objective J vs. iteration]
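Written out, the objective and the two updates look roughly as follows. The symbols are assumptions of this note rather than the slides' notation: x_j are the data points, \mu_c the centroids, and a(j) the index of the centroid currently assigned to x_j.

```latex
J = \sum_{j} \lVert x_j - \mu_{a(j)} \rVert^2, \qquad
a(j) \leftarrow \arg\min_{c} \lVert x_j - \mu_c \rVert^2, \qquad
\mu_c \leftarrow \frac{1}{\lvert \{ j : a(j) = c \} \rvert} \sum_{j : a(j) = c} x_j
```

Each update minimizes J over one set of variables while the other is held fixed, so J cannot increase; since there are only finitely many assignments a, the procedure must stop.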
k-means for image segmentation
Problems with k-means
• problem: sensitive to initialization
  • the objective function is non-convex: many local minima
  • why?
• k-means works well if
  • clusters are spherical
  • clusters are well separated
  • clusters have similar volumes
  • clusters have similar # of examples
Better (“soft”) k-means?
• random restarts -- definitely helps
• soft clusters => EM with Gaussian Mixture Model
[Figure: density p(x) plotted against x; panels (a) and (b) with axes from 0 to 1]
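Random restarts can be sketched as: run k-means from several random initializations and keep the run with the lowest residual sum of squares. The 1-D setting, the function names `kmeans_1d` and `best_of_restarts`, the restart count, and the toy data are all assumptions made for this illustration.

```python
import random

def kmeans_1d(xs, init, max_iters=100):
    """Run k-means on 1-D data from the given initial centroids.

    Returns (centroids, rss), where rss is the residual sum of squares.
    """
    centroids = list(init)
    assign = None
    for _ in range(max_iters):
        # Assignment step: nearest centroid for every point.
        new_assign = [min(range(len(centroids)), key=lambda c: abs(x - centroids[c]))
                      for x in xs]
        if new_assign == assign:
            break
        assign = new_assign
        # Recomputation step: mean of each non-empty cluster.
        for c in range(len(centroids)):
            members = [x for x, a in zip(xs, assign) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    rss = sum((x - centroids[a]) ** 2 for x, a in zip(xs, assign))
    return centroids, rss

def best_of_restarts(xs, k, n_restarts=50, seed=0):
    """Run k-means from n_restarts random initializations; keep the lowest-RSS run."""
    rng = random.Random(seed)
    runs = [kmeans_1d(xs, rng.sample(xs, k)) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1])

xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0]
_, rss_bad = kmeans_1d(xs, init=[0.0, 1.0, 2.0])   # all three centroids start in one clump
centroids, rss_best = best_of_restarts(xs, k=3)
```

Starting with all three centroids in the leftmost clump, k-means gets stuck in a local minimum with a much larger RSS than the clustering that puts one centroid on each clump. Keeping the best of many restarts makes it very likely, though not guaranteed, that the better solution is found.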
k-means
• randomize k initial centroids
• repeat the two steps until convergence
  • E-step: assign each example to its nearest centroid (Voronoi)
  • M-step: recompute the centroids (based on the new assignment)
[Figure: panel (a), 2-D plot with both axes from −2 to 2]
EM for Gaussian Mixtures
• randomize k means, covariances, mixing coefficients
• repeat the two steps until convergence
  • E-step: evaluate the responsibilities (“fractional assignments”) using current parameters
  • M-step: reestimate parameters using current responsibilities
[Figure: panels (a)-(f), EM progress on 2-D data, both axes from −2 to 2; panels (c)-(f) labeled L = 1, 2, 5, 20]
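The E-step and M-step can be sketched for the simplest case, a 1-D mixture, where each covariance reduces to a scalar variance. This is an illustrative sketch under assumptions made here: the function names, the fixed iteration count, the variance floor, and the toy data are not from the slides.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, means, variances, weights, n_iters=50):
    """EM for a 1-D Gaussian mixture, starting from the given parameters."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(n_iters):
        # E-step: responsibility resp[j][i] = P(component i | x_j) under current parameters.
        resp = []
        for x in xs:
            joint = [weights[i] * normal_pdf(x, means[i], variances[i]) for i in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate parameters from the fractional assignments.
        for i in range(k):
            n_i = sum(r[i] for r in resp)                 # effective # of points in component i
            weights[i] = n_i / len(xs)
            means[i] = sum(r[i] * x for r, x in zip(resp, xs)) / n_i
            var = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, xs)) / n_i
            variances[i] = max(var, 1e-6)                 # floor to avoid a collapsing component
    return means, variances, weights

# Two tight groups around -2 and +2 (toy data), with a deliberately rough initial guess.
xs = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0, 2.2, 1.8]
means, variances, weights = em_gmm_1d(
    xs, means=[-1.0, 1.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
```

Unlike k-means, every point contributes to every component's update, weighted by its responsibility; with hard 0/1 responsibilities the mean update reduces to the k-means centroid update.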
EM for Gaussian Mixtures
[Figure: panels (b) and (c) side by side, both axes from −2 to 2; panel (c) labeled L = 1]
Convergence
• EM converges much more slowly than k-means
• can’t use “assignment doesn’t change” for convergence
• use log likelihood of the data
  • stop if increase in log likelihood is smaller than threshold
  • or a maximum # of iterations has been reached

L = \log P(\text{data}) = \log \prod_j P(x_j)
  = \sum_j \log P(x_j)
  = \sum_j \log \sum_i P(c_i) \, P(x_j \mid c_i)
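The log likelihood and the threshold test can be sketched as below for a 1-D Gaussian mixture. The function names, the tolerance value, and the use of the log-sum-exp trick for numerical stability are assumptions added here, not part of the slides.

```python
import math

def log_likelihood(xs, means, variances, weights):
    """L = sum_j log sum_i P(c_i) P(x_j | c_i) for a 1-D Gaussian mixture."""
    total = 0.0
    for x in xs:
        # log of each term P(c_i) P(x_j | c_i), computed directly in log space
        log_terms = [
            math.log(w) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for m, v, w in zip(means, variances, weights)
        ]
        # log-sum-exp: subtract the max before exponentiating to avoid underflow
        top = max(log_terms)
        total += top + math.log(sum(math.exp(t - top) for t in log_terms))
    return total

def has_converged(prev_ll, curr_ll, tol=1e-6):
    """Stop when the increase in log likelihood falls below tol."""
    return curr_ll - prev_ll < tol

# One standard normal component evaluated at x = 0: L = -0.5 * log(2 * pi).
ll_single = log_likelihood([0.0], means=[0.0], variances=[1.0], weights=[1.0])
# Splitting that component into two identical halves leaves L unchanged.
ll_split = log_likelihood([0.0], means=[0.0, 0.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
```

In an EM loop one would compute L after each M-step and stop once `has_converged` returns true or the iteration cap is hit.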