Contents
- Clustering
- K-means
- Mixture of Gaussians
- Expectation Maximization
- Variational Methods
Introduction to Machine Learning CMU-10701
Clustering and EM
Barnabás Póczos & Aarti Singh
Clustering
What is clustering?
Clustering: the process of grouping a set of objects into classes of similar objects
- high intra-class similarity
- low inter-class similarity
- the most common form of unsupervised learning
What is Similarity?
Hard to define, but we know it when we see it. The real meaning of similarity is a philosophical question; we will take a more pragmatic approach and think in terms of a distance (rather than similarity) between random variables.
The K-means Clustering Problem
K-means Clustering Problem
Partition the n observations into K sets (K ≤ n), S = {S1, S2, …, SK}, so that the sets minimize the within-cluster sum of squares. (Figure: a K-means clustering problem with K = 3.)
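In standard notation, the within-cluster sum of squares to be minimized is

\min_{S} \sum_{i=1}^{K} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j,

where µi is the mean of the points assigned to set Si.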
K-means Clustering Problem
Partition the n observations into K sets (K ≤ n), S = {S1, S2, …, SK}, so that the sets minimize the within-cluster sum of squares (the objective above). The problem is NP-hard, but there are good heuristic algorithms that seem to work well in practice:
- K-means algorithm
- mixture of Gaussians
How hard is this K-means clustering problem?
K-means Clustering Alg: Step 1
- Given n objects.
- Guess the cluster centers k1, k2, k3.
(They were µ1,…,µ3 in the previous slide)
K-means Clustering Alg: Step 2
- Build a Voronoi diagram based on the cluster centers k1, k2, k3.
- Decide the class memberships of the n objects by assigning them to the
nearest cluster centers k1, k2, k3.
K-means Clustering Alg: Step 3
- Re-estimate the cluster centers (aka the centroid or mean), by
assuming the memberships found above are correct.
K-means Clustering Alg: Step 4
- Build a new Voronoi diagram.
- Decide the class memberships of the n objects based on this diagram
K-means Clustering Alg: Step 5
- Re-estimate the cluster centers.
K-means Clustering Alg: Step 6
- Stop when everything is settled.
(The Voronoi diagrams don’t change anymore)
Algorithm
Input: data + the desired number of clusters, K.
Initialize: the K cluster centers (randomly if necessary).
Iterate:
- 1. Decide the class memberships of the n objects by assigning them to the
nearest cluster centers
- 2. Re-estimate the K cluster centers (aka the centroid or mean), by
assuming the memberships found above are correct.
Termination: if none of the n objects changed membership in the last iteration, exit; otherwise go to 1.
K-means Clustering Algorithm
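A minimal NumPy sketch of this algorithm (an illustrative implementation, not code from the lecture; the function and variable names are our own), assuming Euclidean distance and initialization by sampling K of the data points:

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate classification and re-centering."""
    rng = np.random.default_rng(seed)
    # Initialize: pick K distinct data points as the initial centers.
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(max_iter):
        # Classify: assign each point to its nearest center (Voronoi partition).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Re-center: each center becomes the mean of the points assigned to it.
        new_centers = np.array([X[assign == i].mean(axis=0) if np.any(assign == i)
                                else centers[i] for i in range(K)])
        if np.allclose(new_centers, centers):
            break  # nothing changed any more: everything is settled
        centers = new_centers
    return centers, assign

The classification line corresponds to Steps 2 and 4 above (building the Voronoi partition) and the re-centering line to Steps 3 and 5.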
K-means Algorithm: Computational Complexity
At each iteration:
– Computing the distance between each of the n objects and the K cluster centers is O(Kn).
– Computing the cluster centers: each object gets added once to some cluster, so O(n).
Assuming these two steps are each done once in each of l iterations, the total cost is O(lKn).
Can you prove that the K-means algorithm is guaranteed to terminate?
Seed Choice
The results of the K-means algorithm can vary based on the random seed selection. Some seeds can result in a poor convergence rate, or in convergence to a sub-optimal clustering. The K-means algorithm can easily get stuck in local minima.
– Select good seeds using a heuristic (e.g., the object least similar to any existing mean).
– Try out multiple starting points (very important!!!).
– Initialize with the results of another method.
Alternating Optimization
K-means Algorithm (more formally)
Randomly initialize the k centers.
Classify: at iteration t, assign each point (j ∈ {1, …, n}) to the nearest center (classification at iteration t).
Recenter: µi becomes the centroid of the new sets (re-estimate the cluster centers at iteration t).
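In standard notation, the two updates are

\text{Classify:}\quad C^{(t)}(j) = \arg\min_{i} \lVert x_j - \mu_i^{(t-1)} \rVert^2,
\qquad
\text{Recenter:}\quad \mu_i^{(t)} = \frac{1}{|\{ j : C^{(t)}(j) = i \}|} \sum_{j : C^{(t)}(j) = i} x_j .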
What is K-means optimizing?
Define the following potential function F of the centers µ and the point allocation C; the optimal solution of the K-means problem is the minimizer of F.
Two equivalent versions
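One standard way to write the potential function and its two equivalent formulations is

F(\mu, C) = \sum_{j=1}^{n} \lVert x_j - \mu_{C(j)} \rVert^2,
\qquad
\min_{\mu, C} F(\mu, C) \;=\; \min_{\mu} \Big( \min_{C} F(\mu, C) \Big).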
K-means Algorithm
The K-means algorithm optimizes the potential function F by alternating two steps:
(1) Assign each point to the nearest cluster center (exactly the first step of the algorithm).
(2) Re-estimate the cluster centers (exactly the second step, re-centering).
K-means Algorithm
The K-means algorithm optimizes the potential function F by coordinate descent: step (1) plays the role of the Expectation step and step (2) the role of the Maximization step. Today we will see a generalization of this approach: the EM algorithm.
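In formulas, the two coordinate-descent steps on F are

\text{(1)}\;\; C^{(t)} = \arg\min_{C} F(\mu^{(t-1)}, C),
\qquad
\text{(2)}\;\; \mu^{(t)} = \arg\min_{\mu} F(\mu, C^{(t)}).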
Gaussian Mixture Model
Density Estimation
Generative approach:
- There is a latent parameter Θ.
- For all i, draw the observed xi given Θ.
What if the basic model doesn’t fit all data?
⇒ Mixture modelling, partitioning algorithms: different parameters for different parts of the domain.
Partitioning Algorithms
- K-means
–hard assignment: each object belongs to only one cluster
- Mixture modeling
–soft assignment: probability that an object belongs to a cluster
Gaussian Mixture Model
Mixture of K Gaussian distributions (a multi-modal distribution):
- There are K components
- Component i has an associated mean vector µi
- Component i generates data from a Gaussian centered at µi (in this first version of the model, all components share a spherical covariance σ²I).
Each data point is generated using this process: first pick a component i at random with probability P(y = i), then draw x from component i.
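Written out, assuming the shared spherical covariance σ²I used in this first version of the model (the general case with Σi comes later):

p(x) = \sum_{i=1}^{K} P(y = i)\, \mathcal{N}(x \mid \mu_i, \sigma^2 I),
\qquad
y \sim P(y), \quad x \mid y = i \;\sim\; \mathcal{N}(\mu_i, \sigma^2 I).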
Gaussian Mixture Model
Mixture of K Gaussian distributions (a multi-modal distribution). In the formula above, N(x | µi, σ²I) is the mixture component, P(y = i) is the mixture proportion, x is the observed data, and y is the hidden variable.
Mixture of Gaussians Clustering
Cluster x based on the posteriors P(y = i | x). Assume that all components share the same spherical covariance σ²I. For a given x we want to decide whether it belongs to cluster i or cluster j.
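By Bayes' rule, the posterior and the log-odds between clusters i and j are

P(y = i \mid x) = \frac{P(y = i)\, \mathcal{N}(x \mid \mu_i, \sigma^2 I)}{\sum_{k} P(y = k)\, \mathcal{N}(x \mid \mu_k, \sigma^2 I)},
\qquad
\log \frac{P(y = i \mid x)}{P(y = j \mid x)}
= \log \frac{P(y = i)}{P(y = j)} - \frac{\lVert x - \mu_i \rVert^2 - \lVert x - \mu_j \rVert^2}{2\sigma^2},

and the log-odds is linear in x because the quadratic term \lVert x \rVert^2 cancels, which is what gives the piecewise linear decision boundary on the next slide.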
Mixture of Gaussians Clustering
Assume that all components share the same spherical covariance σ²I; the decision boundary between any two clusters is then linear.
Piecewise linear decision boundary
MLE for GMM
What if we don't know the parameters? Use the Maximum Likelihood Estimate (MLE).
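For the spherical model above, with the mixture proportions πi and the variance σ² treated as known (as in the simple case studied later), the MLE of the means picks the µ's that maximize the probability of the observed data under the mixture:

\hat{\mu}_1, \dots, \hat{\mu}_K
= \arg\max_{\mu_1, \dots, \mu_K} \prod_{j=1}^{n} \sum_{i=1}^{K} \pi_i\, \mathcal{N}(x_j \mid \mu_i, \sigma^2 I).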
K-means and GMM
- What happens if we assume hard assignment?
P(yj = i) = 1 if i = C(j), and 0 otherwise. In this case the MLE estimation is the same as K-means!!!
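Spelled out: with hard assignments and spherical Gaussians, maximizing the likelihood over the means is exactly the K-means problem,

\arg\max_{\mu} \prod_{j} \mathcal{N}\!\left(x_j \mid \mu_{C(j)}, \sigma^2 I\right)
= \arg\min_{\mu} \sum_{j} \lVert x_j - \mu_{C(j)} \rVert^2,
\qquad
\hat{\mu}_i = \frac{1}{|\{ j : C(j) = i \}|} \sum_{j : C(j) = i} x_j .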
General GMM
- There are K components
- Component i has an associated
mean vector µi
- Each component generates data
from a Gaussian with mean µi and covariance matrix Σi.
General GMM: Gaussian Mixture Model (a multi-modal distribution). Each data point is generated according to the following recipe:
1) Pick a component at random: choose component i with probability P(y = i).
2) Draw the datapoint x ~ N(µi, Σi).
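In a formula:

p(x) = \sum_{i=1}^{K} P(y = i)\, \mathcal{N}(x \mid \mu_i, \Sigma_i).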
General GMM
GMM: Gaussian Mixture Model (a multi-modal distribution). In the formula above, N(x | µi, Σi) is the mixture component and P(y = i) is the mixture proportion.
General GMM
Clustering is again based on the posteriors P(y = i | x). Assume that each component has its own covariance Σi; the result is a “quadratic decision boundary”, because the second-order terms don’t cancel out.
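The log-odds behind this statement is

\log \frac{P(y = i \mid x)}{P(y = j \mid x)}
= \log \frac{P(y = i)}{P(y = j)}
+ \frac{1}{2} \log \frac{|\Sigma_j|}{|\Sigma_i|}
- \frac{1}{2} (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i)
+ \frac{1}{2} (x - \mu_j)^{\top} \Sigma_j^{-1} (x - \mu_j),

which is quadratic in x whenever \Sigma_i \neq \Sigma_j.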
General GMM MLE Estimation
What if we don't know the parameters? Maximize the marginal likelihood (MLE). This is doable, but often slow: the objective is non-linear and cannot be solved analytically.
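The objective being maximized is

\hat{\theta}_{\text{MLE}}
= \arg\max_{\theta} \prod_{j=1}^{n} p(x_j \mid \theta)
= \arg\max_{\theta} \prod_{j=1}^{n} \sum_{i=1}^{K} \pi_i\, \mathcal{N}(x_j \mid \mu_i, \Sigma_i).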
Expectation-Maximization (EM)
EM is a general algorithm for dealing with hidden data, but we will study it first in the context of unsupervised learning (hidden class labels = clustering).
- EM is an optimization strategy for objective functions that can be interpreted
as likelihoods in the presence of missing data.
- EM is “simpler” than gradient methods:
No need to choose step size.
- EM is an iterative algorithm with two linked steps:
- E-step: fill in hidden values using inference
- M-step: apply standard MLE/MAP method to completed data
- We will prove that this procedure monotonically improves the likelihood (or
leaves it unchanged). EM always converges to a local optimum of the likelihood.
Expectation-Maximization (EM)
A simple case:
- We have unlabeled data x1, x2, …, xn
- We know there are K classes
- We know P(y = 1) = π1, P(y = 2) = π2, …, P(y = K) = πK
- We know the common variance σ²
- We don’t know µ1, µ2, …, µK, and we want to learn them
We can write the likelihood of the data by marginalizing over the class; since the data points are independent, the likelihood factorizes over them, and maximizing it lets us learn µ1, µ2, …, µK.
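In standard notation,

p(x_1, \dots, x_n \mid \mu_1, \dots, \mu_K)
= \prod_{j=1}^{n} p(x_j \mid \mu_1, \dots, \mu_K)
= \prod_{j=1}^{n} \sum_{i=1}^{K} \pi_i\, \mathcal{N}(x_j \mid \mu_i, \sigma^2 I),

where the product comes from independence and the sum marginalizes over the class.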
Expectation (E) step
This is equivalent to assigning clusters to each data point in K-means, but in a soft way. We want to learn the parameters θ; our estimator at the end of iteration t-1 is θ^(t-1). At iteration t, we construct the function Q below.
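In standard notation, the function Q constructed in the E step is

Q(\theta \mid \theta^{(t-1)})
= \sum_{j=1}^{n} \sum_{i=1}^{K}
P\!\left(y_j = i \mid x_j, \theta^{(t-1)}\right)\,
\log p\!\left(x_j, y_j = i \mid \theta\right).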
Maximization (M) step
Equivalent to updating cluster centers in K-means
At iteration t, maximize the function Q in θ to obtain θ^(t). We calculated the weights P(yj = i | xj, θ^(t-1)) in the E step, and the joint distribution p(xj, yj | θ) is simple, so this maximization is easy.
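In a formula,

\theta^{(t)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t-1)}),

which is easy precisely because log p(xj, yj = i | θ) is the log of a simple joint (a mixture weight times a Gaussian).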
EM for spherical, same variance GMMs
E-step: compute the “expected” classes of all datapoints for each class. (In the K-means “E-step” we do hard assignment; EM does soft assignment.)
M-step: compute the maximum of the function Q, i.e. update µ given our data’s class membership distributions (the weights).
- Iterate. Exactly the same as MLE with weighted data.
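For this spherical, same-variance case the two steps work out to

\text{E-step:}\quad
w_{ij} = P\!\left(y_j = i \mid x_j, \mu^{(t-1)}\right)
= \frac{\pi_i \exp\!\left(-\lVert x_j - \mu_i^{(t-1)} \rVert^2 / 2\sigma^2\right)}
       {\sum_{k} \pi_k \exp\!\left(-\lVert x_j - \mu_k^{(t-1)} \rVert^2 / 2\sigma^2\right)},
\qquad
\text{M-step:}\quad
\mu_i^{(t)} = \frac{\sum_{j} w_{ij}\, x_j}{\sum_{j} w_{ij}}.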
EM for general GMMs
The more general case:
- We have unlabeled data x1, x2, …, xm
- We know there are K classes
- We don’t know P(y = 1) = π1, P(y = 2) = π2, …, P(y = K) = πK
- We don’t know Σ1, …, ΣK
- We don’t know µ1, µ2, …, µK
The idea is the same. We want to learn θ = (π1, …, πK, µ1, …, µK, Σ1, …, ΣK); our estimator at the end of iteration t-1 is θ^(t-1). At iteration t, construct the function Q (E step) and maximize it in θ (M step).
EM for general GMMs
At iteration t, construct the function Q (E step) and maximize it in θ (M step).
E-step: compute the “expected” classes of all datapoints for each class.
M-step: compute the MLEs of π, µ, and Σ given our data’s class membership distributions (the weights).
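A minimal NumPy sketch of one such iteration (an illustrative implementation, not code from the lecture; all names are our own), assuming a data matrix X of shape (m, d) and the current estimates pi, mus, Sigmas, with the M-step MLE formulas written as comments:

import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """N(mu, Sigma) density evaluated at every row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, inv, diff)) / norm

def em_step(X, pi, mus, Sigmas):
    """One EM iteration for a general GMM: soft E-step, then MLE M-step."""
    m, K = len(X), len(pi)
    # E-step: responsibilities w[j, i] = P(y_j = i | x_j, theta^(t-1)).
    w = np.stack([pi[i] * gaussian_pdf(X, mus[i], Sigmas[i]) for i in range(K)], axis=1)
    w /= w.sum(axis=1, keepdims=True)
    # M-step: weighted MLEs given the class membership distributions (weights).
    Nk = w.sum(axis=0)                    # effective number of points per class
    pi_new = Nk / m                       # pi_i    = (1/m) sum_j w_ij
    mus_new = (w.T @ X) / Nk[:, None]     # mu_i    = sum_j w_ij x_j / sum_j w_ij
    Sigmas_new = np.array([               # Sigma_i = sum_j w_ij (x_j - mu_i)(x_j - mu_i)^T / sum_j w_ij
        (w[:, i, None] * (X - mus_new[i])).T @ (X - mus_new[i]) / Nk[i]
        for i in range(K)])
    return pi_new, mus_new, Sigmas_new

Iterating em_step until the log-likelihood stops improving produces fits like the ones shown in the example slides below.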
EM for general GMMs: Example
(Figures: the fitted mixture after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th EM iterations.)
GMM for Density Estimation
General EM algorithm
What is EM in the general case, and why does it work?
General EM algorithm
Notation. Observed data: x = (x1, …, xn). Unknown (hidden) variables: y (for example, the class labels in clustering). Parameters: θ (for example, the mixture proportions, means, and covariances in MoG). Goal: maximize the marginal likelihood of the observed data, p(x | θ), in θ.
General EM algorithm
Other examples: Hidden Markov Models. Observed data: the emission sequence. Unknown variables: the hidden state sequence. Parameters: the initial probabilities, the transition probabilities, and the emission probabilities. Goal: again, maximize the marginal likelihood of the observations.
General EM algorithm
Goal: maximize the marginal likelihood log p(x | θ) in θ.
Free energy: a lower bound F(q, θ) on the marginal log-likelihood, defined for any distribution q over the hidden variables.
E Step: maximize F in q with θ fixed. M Step: maximize F in θ with q fixed.
We are going to discuss why this approach works.
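In the standard form (consistent with the summary slide at the end),

F(q, \theta)
= \mathbb{E}_{q(y)}\!\left[ \log \frac{p(x, y \mid \theta)}{q(y)} \right]
= \log p(x \mid \theta) - \mathrm{KL}\!\left( q(y) \,\Vert\, p(y \mid x, \theta) \right)
\;\le\; \log p(x \mid \theta),

\text{E step:}\quad q^{(t)} = \arg\max_{q} F(q, \theta^{(t-1)}) = p(y \mid x, \theta^{(t-1)}),
\qquad
\text{M step:}\quad \theta^{(t)} = \arg\max_{\theta} F(q^{(t)}, \theta).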
General EM algorithm
Free energy: the lower bound F(q, θ) defined above.
E Step: maximize F in q.
M Step: maximize F in θ. We maximize only here in θ!!!
Let us see why this works!
General EM algorithm
Theorem: during the EM algorithm the marginal likelihood is not decreasing! The proof uses the free energy F.
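The standard argument, using the definitions above, is the chain

\log p(x \mid \theta^{(t)})
\;\ge\; F(q^{(t)}, \theta^{(t)})
\;\ge\; F(q^{(t)}, \theta^{(t-1)})
\;=\; \log p(x \mid \theta^{(t-1)}),

where the first inequality is F ≤ log p(x | θ), the second holds because the M step maximizes F over θ, and the final equality holds because the E step chooses q^(t) = p(y | x, θ^(t-1)), which closes the KL gap.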
General EM algorithm
With the goal, E step, and M step as above: during the EM algorithm the marginal likelihood is not decreasing!
Convergence of EM
(Figure: the sequence of EM lower-bound F-functions.) EM monotonically converges to a local maximum of the likelihood!
Convergence of EM
(Figure: different sequences of EM lower-bound F-functions, depending on the initialization.) Use multiple, randomized initializations in practice.
Variational Methods
Variational methods
Free energy: the same lower bound F(q, θ) as before. Variational methods might decrease the marginal likelihood!
Variational methods
Free energy: F(q, θ) as before.
Partial E Step: increase F in q. Partial M Step: increase F in θ.
But these are not necessarily the best maximizers, which would be the exact E and M steps; as a result, variational methods might decrease the marginal likelihood!
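Concretely (a sketch using the free-energy identity from before): if the partial E step only searches a restricted, tractable family Q,

q^{(t)} = \arg\max_{q \in \mathcal{Q}} F(q, \theta^{(t-1)}),
\qquad
\log p(x \mid \theta) - F(q^{(t)}, \theta) = \mathrm{KL}\!\left( q^{(t)} \,\Vert\, p(y \mid x, \theta) \right) > 0 \text{ in general},

so increasing F no longer guarantees that the marginal likelihood itself increases.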
Summary: EM Algorithm
A way of maximizing the likelihood function for hidden-variable models. It finds the MLE of the parameters when the original (hard) problem can be broken up into two (easy) pieces:
- 1. Estimate some “missing” or “unobserved” data from the observed data and the current parameters.
- 2. Using this “complete” data, find the MLE parameter estimates.
Alternate between filling in the latent variables using the best guess (the posterior) and updating the parameters based on this guess. In the M-step we optimize a lower bound F on the likelihood L; in the E-step we close the gap, making the bound F equal to the likelihood L. EM performs coordinate ascent on F and can get stuck in local optima.