Clustering: K-Means & Mixture Models


SLIDE 1

Clustering: K-Means & Mixture Models

Prof. Mike Hughes

Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2019s/

Many ideas/slides attributable to: Emily Fox (UW), Erik Sudderth (UCI)
SLIDE 2


What will we learn?

[Course overview figure: data examples $\{x_n\}_{n=1}^N$; the three paradigms (Supervised Learning, Unsupervised Learning, Reinforcement Learning); a task summary; and a performance measure.]

SLIDE 3


Task: Clustering

[Paradigm diagram: Supervised, Unsupervised, Reinforcement Learning, with clustering highlighted as an unsupervised task.]

SLIDE 4

Clustering: Unit Objectives

  • Understand key challenges
    • How to choose the number of clusters?
    • How to choose the shape of clusters?
  • K-means clustering (deep dive)
    • Shape: linear boundaries (nearest Euclidean centroid)
    • Explain the algorithm as an instance of "coordinate descent": update some variables while holding others fixed
    • Need smart init and multiple restarts to avoid local optima
  • Mixture models (primer)
    • Advantages of soft assignments and covariances

SLIDE 5

Examples of Clustering


SLIDE 6

Clustering Animals by Features


SLIDE 7

Clustering Images


SLIDE 8

Image Compression


The image on the right achieves a compression factor of around 1 million in the number of possible pixel values!

Original: possible pixel values (R, G, B) = 256 × 256 × 256 ≈ 16.8 million. Compressed: each pixel takes one of 16 fixed (R, G, B) values.
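A sketch of how such compression works via k-means color quantization (my own illustration, not the course's code; `img` is an assumed `(H, W, 3)` uint8 array):

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_colors(img, n_colors=16, seed=0):
    """Replace every pixel with the nearest of n_colors learned (R, G, B) values."""
    H, W, _ = img.shape
    pixels = img.reshape(-1, 3).astype(float)
    km = KMeans(n_clusters=n_colors, random_state=seed, n_init=10).fit(pixels)
    # Map each pixel to the centroid color of its assigned cluster
    palette = km.cluster_centers_.astype(np.uint8)
    return palette[km.labels_].reshape(H, W, 3)
```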

SLIDE 9


Understanding Genes

SLIDE 10

How to cluster these points?


SLIDE 11

How to cluster these points?


SLIDE 12

Key Questions


$$\min_{m \in \mathbb{R}^F} \; \sum_{n=1}^{N} (x_n - m)^T (x_n - m)$$
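As context for this key question (a standard derivation, not text from the slide): setting the gradient to zero shows the best single center is the sample mean.

$$\nabla_m \sum_{n=1}^{N} (x_n - m)^T (x_n - m) = -2 \sum_{n=1}^{N} (x_n - m) = 0 \quad\Longrightarrow\quad m^\star = \frac{1}{N} \sum_{n=1}^{N} x_n$$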

SLIDE 13

K-Means


SLIDE 14

Input:

  • Dataset of N example feature vectors
  • Number of clusters K


SLIDE 15

K-Means Goals

  • Assign each example to one of K clusters
    • Assumption: clusters are exclusive
  • Minimize Euclidean distance from examples to cluster centers
    • Assumption: isotropic Euclidean distance (all features weighted equally, no covariance modeled) is a good metric for your data

SLIDE 16

K-Means output

  • Centroid vectors (one per cluster k in 1, … K): real-valued, length F (# features)
  • Assignments (one per example n in 1, … N): a one-hot vector indicating which of the K clusters example n is assigned to

SLIDE 17

Use Euclidean distance

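The formula on the slide did not survive extraction; the standard Euclidean distance between example $x_n$ and centroid $\mu_k$ that it refers to is:

$$d(x_n, \mu_k) = \|x_n - \mu_k\|_2 = \sqrt{\sum_{f=1}^{F} (x_{nf} - \mu_{kf})^2}$$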

SLIDE 18

K-means Optimization Problem

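The slide's equation was lost in extraction; the standard K-means objective it refers to has the form:

$$\min_{\{r_{nk}\}, \{\mu_k\}} \; \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \|x_n - \mu_k\|_2^2 \qquad \text{s.t.} \quad r_{nk} \in \{0, 1\}, \;\; \sum_{k=1}^{K} r_{nk} = 1$$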

SLIDE 19

K-Means Algorithm


Initialize cluster means. Repeat until converged:
  1) Update per-example assignments: for each n in 1:N, find the cluster k* whose centroid minimizes the distance to x_n; set the assignment of n to indicate k*.
  2) Update per-cluster centroids: for each k in 1:K, set the centroid to the mean of the data vectors assigned to k.
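A minimal NumPy sketch of this loop (my own illustration; names like `mu` and `z` are assumptions, not the course's starter code):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    N, F = X.shape
    mu = X[rng.choice(N, size=K, replace=False)].copy()  # init: K random examples
    for _ in range(n_iters):
        # 1) Assignment step: nearest centroid for every example (N x K distances)
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        # 2) Centroid step: mean of the examples assigned to each cluster
        new_mu = np.stack([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # converged: centroids stopped moving
            break
        mu = new_mu
    cost = dists[np.arange(N), z].sum()  # sum of squared distances to centroids
    return mu, z, cost
```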

SLIDE 20

K-Means Algorithm


Initialize cluster means. Repeat until converged:
  1) Update per-example assignments
  2) Update per-cluster centroids

SLIDE 21

Each update improves the cost


SLIDE 22

K-Means Algo: Coordinate Descent


E-step (per-example step): update assignments.
M-step (per-centroid step): update centroid locations.
Each step yields a cost equal to or lower than before.

Credit: Jake VanderPlas

SLIDE 23

Demo!

http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html


SLIDE 24

Demo 2 (Choose initial clusters)

https://www.naftaliharris.com/blog/visualizing-k-means-clustering/


Pick a dataset and fix a K value (e.g., 2 clusters). Can you find a different fixed-point solution from your neighbor? What does this mean about the objective?

SLIDE 25

K-means Boundaries are Linear

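Why the boundaries are linear (a standard argument, not text from the slide): the region where cluster k beats cluster j is a half-space, because comparing squared distances reduces to a linear inequality in $x$:

$$\|x - \mu_k\|^2 \le \|x - \mu_j\|^2 \;\;\Longleftrightarrow\;\; 2(\mu_j - \mu_k)^T x \le \|\mu_j\|^2 - \|\mu_k\|^2$$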

SLIDE 26

Decisions when applying k-means

  • How to initialize the clusters?
  • How to choose K?


SLIDE 27

Initialization: K-means++


SLIDE 28

Possible Initializations


  • Draw K random centroid locations
  • Choose K data vectors as centroids
    • Uniformly at random

What can go wrong?

SLIDE 29
  • Toy Example: Cluster these 4 points with K=2

[Figure: four points to cluster, with separations of D units in one direction and 1 unit in the other.]

SLIDE 30

No Guarantees on Cost!


BAD solution: cost scales with the distance D, which could be arbitrarily larger than 1. OPTIMAL solution: cost is O(1).

SLIDE 31

Better init: k-means++


Arthur & Vassilvitskii SODA ‘07

Step 1: choose an example uniformly at random as the first centroid.
Repeat for k = 2, 3, … K: choose an example with probability proportional to its squared distance from the nearest existing centroid.
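A NumPy sketch of this seeding procedure (an illustration under assumed names, not the paper's reference code):

```python
import numpy as np

def kmeanspp_init(X, K, seed=0):
    """k-means++ seeding: later centroids are drawn far from earlier ones."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    centroids = [X[rng.integers(N)]]  # step 1: uniform random first centroid
    for _ in range(1, K):
        C = np.array(centroids)
        # Squared distance from each example to its nearest chosen centroid
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d^2
        centroids.append(X[rng.choice(N, p=d2 / d2.sum())])
    return np.array(centroids)
```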

SLIDE 32

k-means++: Guarantees on Quality


Arthur & Vassilvitskii SODA ‘07

Theorem: in expectation, this initialization achieves a cost within an O(log K) factor of the optimal cost.

Step 1: choose an example uniformly at random as the first centroid.
Repeat for k = 2, 3, … K: choose an example with probability proportional to its squared distance from the nearest centroid.

SLIDE 33

Use cost to decide among multiple runs of k-means

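A sketch of this practice, assuming the `kmeans` helper from the earlier sketch and an `(N, F)` array `X`:

```python
# Run k-means from several random initializations; keep the lowest-cost run.
runs = [kmeans(X, K=3, seed=s) for s in range(10)]
mu_best, z_best, cost_best = min(runs, key=lambda run: run[2])
```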

SLIDE 34

How to pick K in K-means?


SLIDE 35

Same data. Which K is best?


SLIDE 36

Use cost function? No!


As K grows, the globally optimal cost always decreases (local optima may not). In the limit K → N, the cost is zero.

SLIDE 37

Add complexity penalty!


We want adding clusters to increase the penalized cost unless the extra clusters help "enough".
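One illustrative form of such a penalized score (the exact penalty on the slide is an assumption here):

$$\text{score}(K) = \sum_{n=1}^{N} \min_{k} \|x_n - \mu_k\|_2^2 \;+\; \lambda K$$

where a larger penalty weight $\lambda$ favors fewer clusters.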

SLIDE 38

Computation Issues


SLIDE 39

K-Means Computation

  • Most expensive step: updating assignments
    • N × K distance calculations
  • Scalable?
    • Don't need to update all examples; just grab a minibatch
    • Can do stochastic learning-rate updates too
  • Parallelizable?
    • Yes. Given fixed centroids, minibatches of examples (the assignment step) can be processed in parallel
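For example, scikit-learn's minibatch variant implements this idea (a usage sketch, assuming `X` is an `(N, F)` array):

```python
from sklearn.cluster import MiniBatchKMeans

# Each iteration updates centroids using a small random batch of examples
mbk = MiniBatchKMeans(n_clusters=8, batch_size=256, random_state=0)
mbk.fit(X)
centroids = mbk.cluster_centers_  # (8, F) centroid vectors
cost = mbk.inertia_               # within-cluster sum-of-squares cost
```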

SLIDE 40

Improved clustering: Gaussian mixture model


SLIDE 41

Improving K-Means

  • Assign each example to one of K clusters
    • Assumption: clusters are exclusive
    • Improvement: soft probabilistic assignment
  • Minimize Euclidean distance from examples to cluster centers
    • Assumption: isotropic Euclidean distance (all features weighted equally, no covariance modeled) is a good metric for your data
    • Improvement: model cluster covariance

SLIDE 42

Gaussian Mixture Model


SLIDE 43

Gaussian Mixture Model


  • Mean vectors (one per cluster k in 1, … K): real-valued, length F (# features)
  • Covariance matrices (one per cluster k in 1, … K): F × F square symmetric matrix, positive definite (invertible)
  • Soft assignments (one per example n in 1, … N): probabilistic! A vector that sums to one
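For reference, the density these parameters define (standard GMM form; the mixture weights $\pi_k$ are not listed on the slide and are an addition here):

$$p(x_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1$$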

SLIDE 44

Covariance Models


[Figure: covariance options, from spherical (most similar to k-means) to full (more flexible).]

Credit: Jake VanderPlas

SLIDE 45

GMM Training


Maximize the likelihood of the data.

Beyond this course: one can show this looks a lot like K-means' simplified objective.

Algorithm: coordinate ascent!
  E-step: update soft assignments r
  M-step: update means and covariances
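scikit-learn's GaussianMixture implements this EM loop (a usage sketch, assuming `X` is an `(N, F)` array):

```python
from sklearn.mixture import GaussianMixture

# EM training: E-step updates soft assignments, M-step updates parameters
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)

r = gmm.predict_proba(X)          # soft assignments: (N, 3), rows sum to one
means = gmm.means_                # (3, F) cluster means
covs = gmm.covariances_           # (3, F, F) cluster covariances
```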

SLIDE 46

Special Case

  • K-means is a GMM with:
    • Hard winner-take-all assignments
    • Spherical covariance constraints

SLIDE 47

Clustering: Unit Objectives

  • Understand key challenges
    • How to choose the number of clusters?
    • How to choose the shape of clusters?
  • K-means clustering (deep dive)
    • Shape: linear boundaries (nearest Euclidean centroid)
    • Explain the algorithm as an instance of "coordinate descent": update some variables while holding others fixed
    • Need smart init and multiple restarts to avoid local optima
  • Mixture models (primer)
    • Advantages of soft assignments and covariances