Introduction to Machine Learning CMU-10701: 19. Clustering and EM



SLIDE 1

Introduction to Machine Learning CMU-10701

  • 19. Clustering and EM

Barnabás Póczos

SLIDE 2

Contents

 Clustering K-means Mixture of Gaussians  Expectation Maximization  Variational Methods

Many of these slides are taken from

  • Aarti Singh,
  • Eric Xing,
  • Carlos Guestrin
SLIDE 3

Clustering

SLIDE 4

What is clustering?

Clustering:

The process of grouping a set of objects into classes of similar objects:
  • high intra-class similarity
  • low inter-class similarity
Clustering is the most common form of unsupervised learning.

SLIDE 5

What is Similarity?

Hard to define! But we know it when we see it. The real meaning of similarity is a philosophical question. We will take a more pragmatic approach: think in terms of a distance (rather than a similarity) between random variables.

SLIDE 6

The K-means Clustering Problem

SLIDE 7

K-means Clustering Problem

Partition the n observations into K sets (K ≤ n), S = {S1, S2, …, SK}, such that the sets minimize the within-cluster sum of squares:

    min_S  Σ_{i=1..K} Σ_{x ∈ Si} ‖x − µi‖²,   where µi is the mean of the points in Si.

K-means clustering problem: K= 3
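As an aside (not from the original slides), here is a minimal numpy sketch of the quantity being minimized, the within-cluster sum of squares of a given partition; the function name and argument layout are my own.

```python
# Within-cluster sum of squares for a given partition (illustrative sketch).
import numpy as np

def within_cluster_ss(X, labels, K):
    """X: (n, d) data matrix; labels: length-n array of cluster indices 0..K-1."""
    total = 0.0
    for i in range(K):
        cluster = X[labels == i]
        if len(cluster) == 0:
            continue
        mu = cluster.mean(axis=0)                # centroid of cluster i
        total += ((cluster - mu) ** 2).sum()     # squared distances to the centroid
    return total
```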

SLIDE 8

K-means Clustering Problem

Partition the n observations into K sets (K ≤ n), S = {S1, S2, …, SK}, such that the sets minimize the within-cluster sum of squares defined above.

How hard is this problem? The problem is NP-hard, but there are good heuristic algorithms that seem to work well in practice:

  • the K-means algorithm
  • mixture of Gaussians

SLIDE 9

K-means Clustering Alg: Step 1

  • Given n objects.
  • Guess the cluster centers k1, k2, k3.

(They were µ1,…,µ3 in the previous slide)

SLIDE 10

K-means Clustering Alg: Step 2

  • Build a Voronoi diagram based on the cluster centers k1, k2, k3.
  • Decide the class memberships of the n objects by assigning them to the nearest cluster centers k1, k2, k3.

SLIDE 11

K-means Clustering Alg: Step 3

  • Re-estimate the cluster centers (aka the centroid or mean), by assuming the memberships found above are correct.

SLIDE 12

K-means Clustering Alg: Step 4

  • Build a new Voronoi diagram.
  • Decide the class memberships of the n objects based on this diagram.
SLIDE 13

K-means Clustering Alg: Step 5

  • Re-estimate the cluster centers.
SLIDE 14

K-means Clustering Alg: Step 6

  • Stop when everything is settled.

(The Voronoi diagrams don’t change anymore)

SLIDE 15

Algorithm

Input: the data and the desired number of clusters, K.
Initialize: the K cluster centers (randomly, if necessary).
Iterate:
  1. Decide the class memberships of the n objects by assigning them to the nearest cluster centers.
  2. Re-estimate the K cluster centers (aka the centroids or means), assuming the memberships found above are correct.
Termination: if none of the n objects changed membership in the last iteration, exit. Otherwise go to 1.

K-means Clustering Algorithm
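A compact numpy sketch of the algorithm above (illustrative only; the names and tie-breaking details are my own, and empty clusters simply keep their previous centers):

```python
# Minimal K-means: alternate assignment and re-centering until memberships settle.
import numpy as np

def kmeans(X, K, n_iter=100, seed=None):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # initial centers
    labels = None
    for _ in range(n_iter):
        # Step 1: assign each object to the nearest cluster center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                            # nothing changed: stop
        labels = new_labels
        # Step 2: re-estimate each center as the mean of its members.
        for i in range(K):
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
    return centers, labels
```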

SLIDE 16

K-means Algorithm: Computational Complexity

At each iteration:

  • Computing the distance between each of the n objects and the K cluster centers is O(Kn).
  • Computing the cluster centers: each object gets added once to some cluster, so this is O(n).

Assume these two steps are each done once for l iterations: O(lKn) in total.

Can you prove that the K-means algorithm is guaranteed to terminate?

SLIDE 17

Seed Choice

SLIDE 18

Seed Choice

SLIDE 19

Seed Choice

The results of the K-means algorithm can vary based on the random seed selection.

  • Some seeds can result in a poor convergence rate, or in convergence to a sub-optimal clustering.
  • The K-means algorithm can easily get stuck in local minima.

Possible remedies (one practical combination is sketched below):
  • Select good seeds using a heuristic (e.g., pick an object least similar to any existing mean).
  • Try out multiple starting points (very important!).
  • Initialize with the results of another method.
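One way to combine a seeding heuristic with multiple starting points (my example; it uses scikit-learn, which is not part of the slides): k-means++ initialization plus several restarts, keeping the run with the smallest within-cluster sum of squares.

```python
# Seeding heuristic (k-means++) plus multiple restarts via scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # toy data, for illustration

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)           # best within-cluster sum of squares over the 10 runs
print(km.cluster_centers_)   # corresponding cluster centers
```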

SLIDE 20

Alternating Optimization

SLIDE 21

K-means Algorithm (more formally)

  • Randomly initialize the K centers µ1(0), …, µK(0).
  • Classify: at iteration t, assign each point j ∈ {1, …, n} to the nearest center:

        C(t)(j) = argmin_i ‖µi(t−1) − xj‖²

  • Recenter: µi(t) is the centroid of the newly formed set, i.e., the mean of the points assigned to cluster i at iteration t.

SLIDE 22

What is K-means optimizing?

Define the following potential function F of the centers µ and the point allocation C:

    F(µ, C) = Σ_{j=1..n} ‖µ_C(j) − xj‖²

Optimal solution of the K-means problem, in two equivalent versions:

    min_µ min_C F(µ, C)     or, equivalently,     min_µ Σ_{j=1..n} min_i ‖µi − xj‖²

SLIDE 23

K-means Algorithm

K-means algorithm: optimize the potential function F(µ, C) by alternating two steps.

  (1) Holding the centers µ fixed, minimize over C: assign each point to the nearest cluster center. This is exactly the first step of K-means.
  (2) Holding the allocation C fixed, minimize over µ. This is exactly the second step of K-means (re-centering).

SLIDE 24

K-means Algorithm

K-means algorithm: coordinate descent on F. Optimize the potential function F(µ, C) by alternating between
  (1) minimizing over the allocation C, and
  (2) minimizing over the centers µ.

Today, we will see a generalization of this approach, the EM algorithm, whose two steps are the
  (1) Expectation step and
  (2) Maximization step.

SLIDE 25

Gaussian Mixture Model

SLIDE 26

Density Estimation

Generative approach:

  • There is a latent parameter Θ.
  • For all i, draw the observed xi given Θ.

What if the basic model doesn’t fit all data?

⇒ Mixture modelling, partitioning algorithms: different parameters for different parts of the domain.

SLIDE 27

Partitioning Algorithms

  • K-means
      – hard assignment: each object belongs to only one cluster
  • Mixture modeling
      – soft assignment: probability that an object belongs to a cluster

SLIDE 28

Gaussian Mixture Model

Mixture of K Gaussian distributions (a multi-modal distribution):

  • There are K components.
  • Component i has an associated mean vector µi.
  • Component i generates data from a Gaussian centered at µi.

Each data point is generated using this process: first pick a component i with probability P(y = i), then draw x from component i’s Gaussian.
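A small sketch of this generative process with toy parameters of my own (not taken from the slides): pick a component according to the mixture proportions, then draw the point from that component's Gaussian.

```python
# Toy generative process for a mixture of Gaussians (illustrative parameters only).
import numpy as np

rng = np.random.default_rng(0)
pis = np.array([0.5, 0.3, 0.2])                        # mixture proportions P(y = i)
mus = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 3.0]])  # component means mu_i
sigma = 1.0                                            # shared standard deviation

def sample_point():
    i = rng.choice(len(pis), p=pis)           # 1) pick a component i with prob. P(y = i)
    x = rng.normal(loc=mus[i], scale=sigma)   # 2) draw x from component i's Gaussian
    return i, x

data = np.array([sample_point()[1] for _ in range(500)])   # 500 sampled points
```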

SLIDE 29

Gaussian Mixture Model

Mixture of K Gaussian distributions (a multi-modal distribution):

    p(x | θ) = Σ_{i=1..K} P(y = i) p(x | y = i, θ)

Here x is the observed data, y is the hidden variable, P(y = i) is the mixture proportion, and p(x | y = i, θ) is the mixture component.

SLIDE 30

Mixture of Gaussians Clustering

Assume that all components share the same covariance. Cluster x based on the posteriors P(y = i | x). This gives a “linear decision boundary”, since the second-order terms cancel out.
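A minimal sketch of clustering by posteriors under this shared-covariance assumption (toy parameters of my own; the Gaussian densities come from scipy):

```python
# Soft clustering of a point x via the posteriors P(y = i | x), shared covariance.
import numpy as np
from scipy.stats import multivariate_normal

pis = np.array([0.5, 0.5])                          # mixture proportions
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # component means
cov = np.eye(2)                                     # covariance shared by all components

def posteriors(x):
    # Unnormalized posteriors: P(y = i) * N(x | mu_i, cov), then normalize.
    p = np.array([pi * multivariate_normal(mean=mu, cov=cov).pdf(x)
                  for pi, mu in zip(pis, mus)])
    return p / p.sum()

x = np.array([1.4, 1.6])
print(posteriors(x))            # soft assignment of x
print(posteriors(x).argmax())   # hard assignment: the most probable cluster
```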

SLIDE 31

MLE for GMM

Maximum likelihood estimate (MLE): choose the parameters that maximize the likelihood of the observed data.

What if we don't know the hidden class labels?

SLIDE 32

K-means and GMM

  • Assume the data comes from a mixture of K Gaussian distributions with the same variance σ².
  • Assume a hard assignment: P(yj = i) = 1 if i = C(j), and 0 otherwise.

Maximize the marginal likelihood (MLE): under these assumptions, this is the same as K-means!
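To see why this reduces to K-means, here is the standard calculation (my notation; it additionally assumes uniform mixing proportions, which the slide does not state explicitly):

```latex
\max_{\mu,\,C}\ \sum_{j=1}^{n}\log\Big[\,P\big(y_j=C(j)\big)\,
      \mathcal N\!\big(x_j \mid \mu_{C(j)},\,\sigma^2 I\big)\Big]
\;=\;
\mathrm{const}\;-\;\frac{1}{2\sigma^2}\,
      \min_{\mu,\,C}\ \sum_{j=1}^{n}\big\|x_j-\mu_{C(j)}\big\|^2 ,
```

so maximizing the hard-assignment likelihood is exactly minimizing the K-means within-cluster sum of squares (the constant absorbs the Gaussian normalization terms and the uniform log-proportions).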

SLIDE 33

General GMM

General GMM: Gaussian Mixture Model (a multi-modal distribution).

  • There are K components.
  • Component i has an associated mean vector µi.
  • Each component generates data from a Gaussian with mean µi and covariance matrix Σi.

Each data point is generated according to the following recipe:
  1) Pick a component at random: choose component i with probability P(y = i).
  2) Draw the data point x ~ N(µi, Σi).

SLIDE 34

General GMM

GMM: Gaussian Mixture Model (a multi-modal distribution):

    p(x | θ) = Σ_{i=1..K} P(y = i) N(x | µi, Σi)

Here P(y = i) is the mixture proportion and N(x | µi, Σi) is the mixture component.

SLIDE 35

General GMM

Assume that component i has its own covariance matrix Σi. Cluster x based on the posteriors P(y = i | x). This gives a “quadratic decision boundary”, since the second-order terms don’t cancel out.

SLIDE 36

General GMM MLE Estimation

What if we don't know the hidden class labels? Maximize the marginal likelihood (MLE):

    argmax_θ Π_j p(xj | θ) = argmax_θ Π_j Σ_{i=1..K} P(y = i) N(xj | µi, Σi)

This is doable, but often slow: the objective is non-linear and not analytically solvable.

SLIDE 37

Expectation-Maximization (EM)

A general algorithm to deal with hidden data, but we will study it in the context of unsupervised learning (hidden class labels = clustering) first.

  • EM is an optimization strategy for objective functions that can be interpreted as likelihoods in the presence of missing data.
  • EM is much simpler than gradient methods: there is no need to choose a step size.
  • EM is an iterative algorithm with two linked steps:
      – E-step: fill in the hidden values using inference.
      – M-step: apply the standard MLE/MAP method to the completed data.
  • We will prove that this procedure monotonically improves the likelihood (or leaves it unchanged). EM always converges to a local optimum of the likelihood.

SLIDE 38

Expectation-Maximization (EM)

A simple case:

  • We have unlabeled data x1, x2, …, xm.
  • We know there are K classes.
  • We know P(y = 1) = π1, P(y = 2) = π2, …, P(y = K) = πK.
  • We know the common variance σ².
  • We don’t know µ1, µ2, …, µK, and we want to learn them.

We can write the likelihood of the data (the data points are independent, and we marginalize over the class):

    p(x1, …, xm | µ1, …, µK) = Π_{j=1..m} Σ_{i=1..K} πi p(xj | y = i, µi)

⇒ learn µ1, µ2, …, µK by maximizing this marginal likelihood.

SLIDE 39

Expectation (E) step

We want to learn the parameters θ = (µ1, …, µK). Our estimator at the end of iteration t−1 is θ(t−1).

E step: at iteration t, construct the function Q:

    Q(θ | θ(t−1)) = Σ_j Σ_i P(yj = i | xj, θ(t−1)) log p(xj, yj = i | θ)

This is equivalent to assigning clusters to each data point in K-means, but in a soft way.

SLIDE 40

Maximization (M) step

This is equivalent to updating the cluster centers in K-means.

M step: at iteration t, maximize the function Q in θ:

    θ(t) = argmax_θ Q(θ | θ(t−1))

The weights P(yj = i | xj, θ(t−1)) were already calculated in the E step, and the joint distribution p(xj, yj | θ) is simple, so this maximization is easy.

SLIDE 41

EM for spherical, same variance GMMs

E-step

Compute the “expected” classes of all data points for each class:

    P(yj = i | xj, µ(t−1)) ∝ πi exp( −‖xj − µi(t−1)‖² / (2σ²) )

In the K-means “E-step” we do a hard assignment; EM does a soft assignment.

M-step

Compute the maximum likelihood µ given our data’s class membership distributions (weights):

    µi(t) = Σ_j P(yj = i | xj) xj / Σ_j P(yj = i | xj)

  • Iterate. This is exactly the same as MLE with weighted data.
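A compact numpy sketch of these two steps for the spherical, same-variance case (my own layout; the proportions π and the variance σ² are treated as known, as on the earlier "simple case" slide, and only the means are learned):

```python
# EM for a mixture of K spherical Gaussians with known shared variance sigma2
# and known mixing proportions pis; only the means are estimated (illustrative).
import numpy as np

def em_spherical_gmm(X, pis, sigma2, n_iter=50, seed=None):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    K = len(pis)
    mus = X[rng.choice(n, size=K, replace=False)].astype(float)   # initial means
    for _ in range(n_iter):
        # E-step: responsibilities r[j, i] = P(y_j = i | x_j, current means).
        sq = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
        logw = np.log(pis)[None, :] - sq / (2.0 * sigma2)
        logw -= logw.max(axis=1, keepdims=True)   # subtract max for numerical stability
        r = np.exp(logw)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted mean of the data, i.e. MLE with weighted data.
        mus = (r.T @ X) / r.sum(axis=0)[:, None]
    return mus, r
```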
SLIDE 42

EM for general GMMs

The more general case:

  • We have unlabeled data x1, x2, …, xm.
  • We know there are K classes.
  • We don’t know P(y = 1) = π1, P(y = 2) = π2, …, P(y = K) = πK.
  • We don’t know Σ1, …, ΣK.
  • We don’t know µ1, µ2, …, µK.

We want to learn θ = (π1, …, πK, µ1, …, µK, Σ1, …, ΣK). Our estimator at the end of iteration t−1 is θ(t−1).

The idea is the same: at iteration t, construct the function Q (E step) and maximize it in θ (M step).

SLIDE 43

EM for general GMMs

At iteration t, construct the function Q (E step) and maximize it in θ (M step).

E-step: compute the “expected” classes of all data points for each class:

    rji = P(yj = i | xj, θ(t−1)) ∝ πi(t−1) N(xj | µi(t−1), Σi(t−1))

M-step: compute the MLEs given our data’s class membership distributions (the weights rji):

    πi(t) = (1/m) Σ_j rji
    µi(t) = Σ_j rji xj / Σ_j rji
    Σi(t) = Σ_j rji (xj − µi(t)) (xj − µi(t))ᵀ / Σ_j rji
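As an aside (not from the slides), updates of exactly this kind are what scikit-learn's GaussianMixture runs under the hood; a short usage sketch:

```python
# Fitting a general GMM (full covariance matrices) with EM, via scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))   # toy data, for illustration

gmm = GaussianMixture(n_components=3, covariance_type="full",
                      n_init=5, random_state=0).fit(X)

print(gmm.weights_)           # estimated mixture proportions pi_i
print(gmm.means_)             # estimated means mu_i
print(gmm.covariances_)       # estimated covariance matrices Sigma_i
resp = gmm.predict_proba(X)   # E-step responsibilities P(y_j = i | x_j)
```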

SLIDE 44

EM for general GMMs: Example

SLIDE 45

EM for general GMMs: Example

After 1st iteration

SLIDE 46

EM for general GMMs: Example

After 2nd iteration

SLIDE 47

EM for general GMMs: Example

After 3rd iteration

SLIDE 48

EM for general GMMs: Example

After 4th iteration

SLIDE 49

EM for general GMMs: Example

After 5th iteration

SLIDE 50

EM for general GMMs: Example

After 6th iteration

SLIDE 51

EM for general GMMs: Example

After 20th iteration

SLIDE 52

GMM for Density Estimation

SLIDE 53

General EM algorithm

What is EM in the general case, and why does it work?

SLIDE 54

General EM algorithm

Notation:
  • Observed data: x.
  • Unknown (hidden) variables: z (for example, in clustering, the hidden class labels).
  • Parameters: θ (for example, in MoG, the mixture proportions, means, and covariances).

Goal: maximize the marginal likelihood p(x | θ) = Σ_z p(x, z | θ) in θ.

SLIDE 55

General EM algorithm

Observed data: x. Unknown variables: z. Parameters: θ. Goal: maximize the marginal likelihood p(x | θ) in θ.

Other examples: Hidden Markov Models, where the hidden states are the unknown variables and the parameters are the
  • initial probabilities,
  • transition probabilities, and
  • emission probabilities.

SLIDE 56

General EM algorithm

Goal: maximize the marginal log-likelihood log p(x | θ) in θ.

Free energy:

    F(q, θ) = E_q[ log p(x, z | θ) ] + H(q) = log p(x | θ) − KL( q(z) ‖ p(z | x, θ) )

E step: q(t) = argmax_q F(q, θ(t−1))
M step: θ(t) = argmax_θ F(q(t), θ)
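For reference, one standard way to see that the two forms of the free energy written above are the same (the notation is mine, the identity is the usual one):

```latex
F(q,\theta)
  \;=\; \sum_{z} q(z)\,\log\frac{p(x,z\mid\theta)}{q(z)}
  \;=\; \mathbb{E}_{q}\big[\log p(x,z\mid\theta)\big] + H(q)
  \;=\; \log p(x\mid\theta) \;-\; \mathrm{KL}\big(q(z)\,\big\|\,p(z\mid x,\theta)\big),
```

so F(q, θ) ≤ log p(x | θ) for every q, with equality exactly when q(z) = p(z | x, θ).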

SLIDE 57

General EM algorithm

Free energy:

    F(q, θ) = E_q[ log p(x, z | θ) ] + H(q)

E step: q(t) = argmax_q F(q, θ(t−1))
M step: θ(t) = argmax_θ F(q(t), θ). In the M step we maximize only the first term, E_q[ log p(x, z | θ) ], in θ, since the entropy H(q) does not depend on θ!

SLIDE 58

General EM algorithm

Free energy: F(q, θ) = log p(x | θ) − KL( q(z) ‖ p(z | x, θ) ) ≤ log p(x | θ)

Theorem: During the EM algorithm the marginal likelihood is not decreasing!

Proof:
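A sketch of the standard argument, using the decomposition of the free energy above (notation as in the previous slides):

```latex
\begin{aligned}
\log p(x\mid\theta^{t})
  &\;\ge\; F(q^{t},\theta^{t})
      &&\text{(for any $q$, } F(q,\theta)\le\log p(x\mid\theta)\text{)}\\
  &\;\ge\; F(q^{t},\theta^{t-1})
      &&\text{(M step: $\theta^{t}$ maximizes $F(q^{t},\,\cdot\,)$)}\\
  &\;=\;   \log p(x\mid\theta^{t-1})
      &&\text{(E step: $q^{t}(z)=p(z\mid x,\theta^{t-1})$ makes the KL gap zero).}
\end{aligned}
```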

SLIDE 59

General EM algorithm

Goal: maximize log p(x | θ) in θ.
E step: q(t) = argmax_q F(q, θ(t−1))
M step: θ(t) = argmax_θ F(q(t), θ)

During the EM algorithm the marginal likelihood is not decreasing!

SLIDE 60

Convergence of EM

Sequence of EM lower bound F-functions

EM monotonically converges to a local maximum of the likelihood!

SLIDE 61

Convergence of EM

Use multiple, randomized initializations in practice

Different sequence of EM lower bound F-functions depending on initialization

SLIDE 62

Variational Methods

SLIDE 63

Variational methods

Free energy: F(q, θ) = log p(x | θ) − KL( q(z) ‖ p(z | x, θ) )

Variational methods might decrease the marginal likelihood!

SLIDE 64

Variational methods

Free energy: F(q, θ) = E_q[ log p(x, z | θ) ] + H(q)

Partial E step: increase (but do not necessarily maximize) F in q.
Partial M step: increase (but do not necessarily maximize) F in θ.

These partial steps still increase the free energy, but not necessarily up to the best achievable value, max_q F(q, θ) = log p(x | θ). As a consequence, variational methods might decrease the marginal likelihood!

SLIDE 65

Summary: EM Algorithm

A way of maximizing the likelihood function for hidden-variable models. It finds the MLE of the parameters when the original (hard) problem can be broken up into two (easy) pieces:

  1. Estimate some “missing” or “unobserved” data from the observed data and the current parameters.
  2. Using this “complete” data, find the MLE parameter estimates.

Alternate between filling in the latent variables using the best guess (the posterior) and updating the parameters based on this guess:

  • In the M-step we optimize a lower bound F on the likelihood L.
  • In the E-step we close the gap, making the bound F equal to the likelihood L.

EM performs coordinate ascent on F and can get stuck in local optima.