Machine Learning Fall 2017: Unsupervised Learning (Clustering: k-means, EM, mixture models)


SLIDE 1

Machine Learning

Fall 2017

Professor Liang Huang

Unsupervised Learning

(Clustering: k-means, EM, mixture models)

(Chaps. 15-16 of CIML)

SLIDE 2

Roadmap

  • so far: (large-margin) supervised learning
  • online learning: avg perceptron/MIRA, convergence proof
  • SVMs: formulation, KKT, dual, convex, QP, SGD (Pegasos)

  • kernels and kernelized perceptron in dual; kernel SVM
  • briefly: k-NN and leave-one-out; RL and imitation learning
  • structured perceptron/SVM, HMM, MLE, Viterbi

  • what we left out: many classical algorithms
  • decision trees, logistic regression, linear regression, boosting, ...
  • next up: unsupervised learning
  • clustering: k-means, EM, mixture models, hierarchical
  • dimensionality reduction: PCA, non-linear (LLE, etc)

[Figure: example supervised outputs: binary labels y = −1 / y = +1; POS tags for “the man bit the dog”: DT NN VBD DT NN. Chapter map: so far CIML Chaps. 3, 4, 5, 7, 11, 17, 18; next CIML Chaps. 15, 16; left out CIML Chaps. 1, 9, 10, 13]

SLIDE 3

CIML book


1 Decision Trees
2 Limits of Learning
3 Geometry and Nearest Neighbors
4 The Perceptron
5 Practical Issues
6 Beyond Binary Classification
7 Linear Models
8 Bias and Fairness
9 Probabilistic Modeling
10 Neural Networks
11 Kernel Methods
12 Learning Theory
13 Ensemble Methods
14 Efficient Learning
15 Unsupervised Learning
16 Expectation Maximization
17 Structured Prediction
18 Imitation Learning

(week annotations: week 5b; week 1; week 2; weeks 3,4; week 5; weeks 7,8a; week 5b; next: week 8b,9a; next: week 8b)

extra topics covered: MIRA, aggressive MIRA, convex programming, quadratic programming, Pegasos, dual Pegasos, structured Pegasos; in retrospect: should start with k-NN, should cover logistic regression (important in DL)

SLIDE 4

Sup=>Unsup: k-NN => k-means

  • let’s review a supervised learning method: nearest neighbor
  • SVM, perceptron (in dual) and NN are all instance-based learning
  • instance-based learning: store a subset of examples for classification
  • compression rate: SVM: very high (keeps only support vectors), perceptron: medium-high (keeps only mistake examples), NN: 0 (keeps every example)


SLIDE 5

k-Nearest Neighbor

  • using k > 1 neighbors (majority vote) is one way to prevent overfitting => more stable results (see the sketch below)
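A minimal sketch of the k-NN vote (assuming NumPy arrays and Euclidean distance; the function name is illustrative, not from the slides):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training examples."""
    # Euclidean distance from x to every stored training example
    dists = np.linalg.norm(X_train - x, axis=1)
    # indices of the k closest examples
    nearest = np.argsort(dists)[:k]
    # majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```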

SLIDE 6

NN Voronoi in 2D and 3D

SLIDE 7

Voronoi for Euclidean and Manhattan distances

SLIDE 8

Unsupervised Learning

  • cost of supervised learning
  • labeled data: expensive to annotate!
  • but there exist huge amounts of data w/o labels
  • unsupervised learning
  • can only hallucinate the labels
  • infer some “internal structures” of data
  • still the “compression” view of learning
  • too much data => reduce it!
  • clustering: reduce # of examples
  • dimensionality reduction: reduce # of dimensions

[Figure: panel (a), axes −2 to 2]

SLIDE 9

Challenges in Unsupervised Learning

  • how to evaluate the results?
  • there is no gold standard data!
  • internal metric?
  • how to interpret the results?
  • how to “name” the clusters?
  • how to initialize the model/guess?
  • a bad initial guess can lead to very bad results
  • unsup is very sensitive to initialization (unlike supervised)
  • how to do optimization => in general no longer convex!

[Figure: panel (a), axes −2 to 2]

SLIDE 10

k-means

  • (randomly) pick k points to be initial centroids
  • repeat the two steps until convergence
  • assignment to centroids: voronoi, like NN
  • recomputation of centroids based on the new assignment

[Figure: k-means, panel (a)]

SLIDE 11

k-means

  • (randomly) pick k points to be initial centroids
  • repeat the two steps until convergence
  • assignment to centroids: voronoi, like NN
  • recomputation of centroids based on the new assignment

[Figure: k-means, panel (b)]

SLIDE 12

k-means

  • (randomly) pick k points to be initial centroids
  • repeat the two steps until convergence
  • assignment to centroids: voronoi, like 1-NN
  • recomputation of centroids based on the new assignment

[Figure: k-means, panel (c)]

SLIDE 13

k-means

  • (randomly) pick k points to be initial centroids
  • repeat the two steps until convergence
  • assignment to centroids: voronoi, like NN
  • recomputation of centroids based on the new assignment

[Figure: k-means, panel (d)]

SLIDE 14

k-means

  • (randomly) pick k points to be initial centroids
  • repeat the two steps until convergence
  • assignment to centroids: voronoi, like NN
  • recomputation of centroids based on the new assignment

[Figure: k-means, panel (e)]

SLIDE 15

k-means

  • (randomly) pick k points to be initial centroids
  • repeat the two steps until convergence
  • assignment to centroids: voronoi, like NN
  • recomputation of centroids based on the new assignment

[Figure: k-means, panel (f)]

SLIDE 16

k-means

  • (randomly) pick k points to be initial centroids
  • repeat the two steps until convergence
  • assignment to centroids: voronoi, like NN
  • recomputation of centroids based on the new assignment

[Figure: k-means, panel (g)]

SLIDE 17

k-means

  • (randomly) pick k points to be initial centroids
  • repeat the two steps until convergence
  • assignment to centroids: voronoi, like NN
  • recomputation of centroids based on the new assignment

[Figure: k-means, panel (h)]

SLIDE 18

k-means

  • (randomly) pick k points to be initial centroids
  • repeat the two steps until convergence
  • assignment to centroids: voronoi, like NN
  • recomputation of centroids based on the new assignment

[Figure: k-means, panel (i)]

SLIDE 19

k-means

  • (randomly) pick k points to be initial centroids
  • repeat the two steps until convergence
  • assignment to centroids: voronoi, like NN
  • recomputation of centroids based on the new assignment
  • how to define convergence?
  • after a fixed number of iterations, or
  • assignments do not change, or
  • centroids do not change (equivalent?) or
  • change in objective function falls below a threshold (see the sketch below)
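A minimal NumPy sketch of the loop, using “assignments do not change” as the stopping test (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Alternate assignment and centroid recomputation until assignments stop changing."""
    rng = np.random.default_rng(seed)
    # (randomly) pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    assign = None
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest centroid (Voronoi / 1-NN)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # assignments unchanged => converged
        assign = new_assign
        # recomputation step: each centroid becomes the mean of its assigned points
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    return centroids, assign
```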

[Figure: k-means, panel (i)]

SLIDE 20

k-means objective function

  • residual sum of squares (RSS)
  • sum of squared distances from points to their centroids (formula below)
  • guaranteed to decrease monotonically
  • convergence proof: decrease + finite # of clusterings
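In symbols (writing μ_k for the centroid of cluster C_k; this notation is assumed here, and J matches the objective plotted on the slide):

```latex
J \;=\; \mathrm{RSS} \;=\; \sum_{k=1}^{K} \;\sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
```

Both the assignment step and the recomputation step can only lower (or keep) J, and there are only finitely many possible clusterings, so the procedure must converge.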

[Figure: objective J vs. iteration 1-4]

SLIDE 21

k-means for image segmentation
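A common recipe for this (a sketch, assuming scikit-learn is available and the image is already loaded as an (H, W, 3) NumPy array; the function name is illustrative): cluster the pixel colors with k-means and repaint each pixel with its centroid color.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_colors(img, k=4):
    """Segment an (H, W, 3) image by clustering its pixel colors with k-means."""
    h, w, c = img.shape
    pixels = img.reshape(-1, c).astype(float)
    km = KMeans(n_clusters=k, n_init=10).fit(pixels)
    # repaint every pixel with the centroid color of its cluster
    segmented = km.cluster_centers_[km.labels_]
    return segmented.reshape(h, w, c)
```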

SLIDE 22

Problems with k-means

  • problem: sensitive to initialization
  • the objective function is non-convex: many local minima
  • why?
  • k-means works well if
  • clusters are spherical
  • clusters are well separated
  • clusters of similar volumes
  • clusters have similar # of examples

SLIDE 23

Better (“soft”) k-means?

  • random restarts -- definitely helps
  • soft clusters => EM with Gaussian Mixture Model

[Figures: k-means result, panel (i); a 1-D density p(x) over x, panels (a) and (b), with component weights 0.5, 0.3, 0.2]

SLIDE 24

k-means

  • randomize k initial centroids
  • repeat the two steps until convergence
  • E-step: assign each example to its nearest centroid (Voronoi)
  • M-step: recompute centroids (based on the new assignment)

[Figure: panel (a)]

SLIDE 25

EM for Gaussian Mixtures

  • randomize k means, covariances, mixing coefficients
  • repeat the two steps until convergence
  • E-step: evaluate the responsibilities using current parameters
  • M-step: reestimate parameters using current responsibilities (see the sketch below)
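A minimal NumPy/SciPy sketch of the two alternating steps (the initialization scheme and the tiny regularization added to the covariances are assumptions of this sketch, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=50, seed=0):
    """EM for a Gaussian mixture: E-step computes responsibilities, M-step reestimates parameters."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # initialize: means at random data points, shared data covariance, uniform mixing coefficients
    means = X[rng.choice(n, size=k, replace=False)].copy()
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    pis = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of component i for example j (fractional assignment)
        dens = np.column_stack([
            pis[i] * multivariate_normal.pdf(X, means[i], covs[i]) for i in range(k)
        ])
        resp = dens / dens.sum(axis=1, keepdims=True)      # shape (n, k)
        # M-step: reestimate parameters from the responsibilities
        Nk = resp.sum(axis=0)                              # effective counts per component
        means = (resp.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (resp[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
        pis = Nk / n
    return means, covs, pis, resp
```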

[Figure: panel (a)]

SLIDE 26

EM for Gaussian Mixtures

  • randomize k means, covariances, mixing coefficients
  • repeat the two steps until convergence
  • E-step: evaluate the responsibilities using current parameters
  • M-step: reestimate parameters using current responsibilities

[Figure: panel (b)]
“fractional assignments”
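These “fractional assignments” are the E-step responsibilities; in the notation used later on slide 33 (P(c_i) for the prior of component c_i), they can be written as follows (the symbol γ_{ji} is introduced here for illustration, not on the slide):

```latex
\gamma_{ji} \;=\; P(c_i \mid x_j) \;=\; \frac{P(c_i)\, P(x_j \mid c_i)}{\sum_{i'} P(c_{i'})\, P(x_j \mid c_{i'})}
```

so each example x_j is split across all components in proportion to how well each component explains it.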

SLIDE 27

EM for Gaussian Mixtures

  • randomize k means, covariances, mixing coefficients
  • repeat the two steps until convergence
  • E-step: evaluate the responsibilities using current parameters
  • M-step: reestimate parameters using current responsibilities

[Figure: EM, panel (c), L = 1]

SLIDE 28

EM for Gaussian Mixtures

  • randomize k means, covariances, mixing coefficients
  • repeat the two steps until convergence
  • E-step: evaluate the responsibilities using current parameters
  • M-step: reestimate parameters using current responsibilities

[Figure: EM, panel (d), L = 2]

SLIDE 29

EM for Gaussian Mixtures

  • randomize k means, covariances, mixing coefficients
  • repeat the two steps until convergence
  • E-step: evaluate the responsibilities using current parameters
  • M-step: reestimate parameters using current responsibilities

[Figure: EM, panel (e), L = 5]

SLIDE 30

EM for Gaussian Mixtures

  • randomize k means, covariances, mixing coefficients
  • repeat the two steps until convergence
  • E-step: evaluate the responsibilities using current parameters
  • M-step: reestimate parameters using current responsibilities

[Figure: EM, panel (f), L = 20]

SLIDE 31

EM for Gaussian Mixtures

  • randomize k means, covariances, mixing coefficients
  • repeat the two steps until convergence
  • E-step: evaluate the responsibilities using current parameters
  • M-step: reestimate parameters using current responsibilities

[Figure: EM, panel (f), L = 20]

SLIDE 32

EM for Gaussian Mixtures

[Figure: panels (b) and (c), L = 1]

SLIDE 33

Convergence

  • EM converges much slower than k-means
  • can’t use “assignment doesn’t change” for convergence (assignments are fractional)
  • use log likelihood of the data
  • stop if increase in log likelihood smaller than threshold
  • or a maximum # of iterations has been reached

$$L = \log P(\text{data}) = \log \prod_j P(x_j) = \sum_j \log P(x_j) = \sum_j \log \sum_i P(c_i)\, P(x_j \mid c_i)$$

SLIDE 34

EM: pros and cons (vs. k-means)

  • EM: pros
  • doesn’t need the data to be spherical
  • doesn’t need the data to be well-separated
  • doesn’t need the clusters to be in similar sizes/volumes
  • EM: cons
  • converges much slower than k-means
  • per-iteration computation also slower
  • (to speed up EM): use k-means as burn-in (initialize EM from a k-means run)
  • (same as k-means) local minima!

SLIDE 35

k-means is a special case of EM

  • k-means is “hard” EM
  • covariance matrices are spherical (σ²I, a special case of diagonal)
  • with the variance σ² approaching 0 (see the limit below)
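A sketch of why, assuming a single spherical covariance σ²I shared by all components (an assumption of this sketch): the responsibilities become

```latex
\gamma_{ji}
\;=\;
\frac{\pi_i \exp\!\big(-\lVert x_j - \mu_i \rVert^2 / 2\sigma^2\big)}
     {\sum_{i'} \pi_{i'} \exp\!\big(-\lVert x_j - \mu_{i'} \rVert^2 / 2\sigma^2\big)}
\;\longrightarrow\;
\begin{cases}
1 & \text{if } i = \arg\min_{i'} \lVert x_j - \mu_{i'} \rVert^2 \\
0 & \text{otherwise}
\end{cases}
\qquad \text{as } \sigma^2 \to 0,
```

i.e., in the zero-variance limit each example is assigned entirely to its nearest centroid, which is exactly the hard assignment of k-means.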

[Figure: k-means result, panel (i), alongside EM result, panel (f), L = 20]

SLIDE 36

Why does EM increase p(data) iteratively?

SLIDE 37

Why does EM increase p(data) iteratively?

(annotations: convex auxiliary function; converges to a local maximum; KL-divergence = D)
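The picture behind these annotations is the standard EM lower-bound decomposition; a sketch in the usual notation (hidden variables Z and an arbitrary distribution q over them are assumptions of this write-up, not text from the slide):

```latex
\ln p(X \mid \theta) \;=\; \mathcal{L}(q, \theta) \;+\; \mathrm{KL}\!\big(q(Z) \,\Vert\, p(Z \mid X, \theta)\big), \qquad \mathrm{KL} \ge 0 .
```

Since the KL term is non-negative, L(q, θ) is a lower bound (auxiliary function) on ln p(X | θ); the E-step closes the gap at θold by setting q(Z) = p(Z | X, θold), and the M-step maximizes L over θ, so the data log-likelihood can never decrease from one iteration to the next.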

SLIDE 38

How to maximize the auxiliary function?

[Figure: the lower bound L(q, θ) and ln p(X|θ), with θold and θnew]

SLIDE 39

Part II: Dimensionality Reduction

SLIDE 40

Dimensionality Reduction

SLIDE 41

Dimensionality Reduction

[Figure: axes “Wrist rotation” and “Fingers extension”; from the Isomap paper (Tenenbaum, de Silva, Langford, Science 2000)]

SLIDE 42

Algorithms

  • linear methods
  • PCA - principal ...
  • ICA - independent ...
  • CCA - canonical ...
  • MDS - multidim. scaling
  • LEM - Laplacian eigenmaps
  • LDA1 - linear discriminant analysis
  • LDA2 - latent Dirichlet allocation


  • non-linear methods
  • kernelized PCA
  • isomap
  • LLE - locally linear embedding
  • SDE - semidefinite embedding

all are spectral methods! -- i.e., using eigenvalues

SLIDE 43

PCA

  • greedily find d orthogonal axes onto which the variance under projection is maximal
  • the “max variance subspace” formulation
  • 1st PC: direction of greatest variability in the data
  • 2nd PC: the next orthogonal max-var direction
  • remove all variance along the 1st PC, redo max-var
  • another equivalent formulation: “minimum reconstruction error”
  • find orthogonal vectors onto which the projection yields min MSE reconstruction

SLIDE 44

PCA optimization: max-var proj.

  • first translate data to zero mean
  • compute the covariance matrix
  • find top d eigenvalues and eigenvectors of covar matrix
  • project data onto those eigenvectors (see the sketch below)
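A minimal NumPy sketch of these four steps (the function name and return values are illustrative assumptions):

```python
import numpy as np

def pca(X, d):
    """Project X (n x D) onto its top-d principal components (max-variance directions)."""
    # 1. translate the data to zero mean
    Xc = X - X.mean(axis=0)
    # 2. compute the covariance matrix (D x D)
    cov = np.cov(Xc, rowvar=False)
    # 3. top-d eigenvalues/eigenvectors of the symmetric covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:d]]   # columns = top-d principal directions
    # 4. project the centered data onto those eigenvectors
    return Xc @ top, top
```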

SLIDE 45

PCA for k-means and whitening

  • rescaling to zero mean and unit variance as preprocessing
  • we did that in perceptron HW1/2 also!
  • but PCA can do more: whitening (sphering); see the sketch below
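A sketch of whitening along the same lines (assuming NumPy; the small eps that guards against division by zero is an assumption of this sketch, not from the slides):

```python
import numpy as np

def whiten(X, eps=1e-8):
    """Rotate into the PCA basis and rescale every direction to unit variance."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    # divide each principal direction by the square root of its variance
    return (Xc @ eigvecs) / np.sqrt(eigvals + eps)
```

Whitened data has unit variance in every direction, which matches the spherical-cluster setting where k-means works well.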

SLIDE 46

Eigendigits

SLIDE 47

Eigenfaces

SLIDE 48

Linear vs. non-Linear

[Figure: LLE or isomap]

SLIDE 49

Linear vs. non-Linear

[Figure: PCA]

SLIDE 50

Linear vs. non-Linear

[Figure: PCA]

SLIDE 51

Linear vs. non-Linear

[Figure: LLE]

SLIDE 52

Linear vs. non-Linear

[Figure: PCA]

SLIDE 53

Linear vs. non-Linear

[Figure: LLE]

SLIDE 54

PCA vs Kernel PCA
