SLIDE 1

CS 6316 Machine Learning

Clustering

Yangfeng Ji

Department of Computer Science, University of Virginia

SLIDE 2

Clustering

SLIDE 3

Clustering

Clustering is the task of grouping a set of objects such that similar objects end up in the same group and dissimilar objects are separated into different groups.

[Shalev-Shwartz and Ben-David, 2014, Page 307]

SLIDE 4

Motivation

A good clustering can help us understand the data [MacKay, 2003, Chap 20]


SLIDE 5

Motivation (II)

A good clustering has predictive power and can be useful to build better classifiers [MacKay, 2003, Chap 20]

SLIDE 6

Motivation (III)

Failures of a cluster model may highlight interesting properties of data or a single data point [MacKay, 2003, Chap 20]


SLIDE 7

Challenges

◮ Lack of ground truth, like any other unsupervised learning task [Shalev-Shwartz and Ben-David, 2014, Page 307]

SLIDE 8

Challenges

◮ Lack of ground truth, like any other unsupervised learning task
◮ Definition of the similarity measure
  ◮ Two images are similar
  ◮ Two documents are similar

[Shalev-Shwartz and Ben-David, 2014, Page 307]

SLIDE 9

K-Means Clustering

SLIDE 10

K-Means Clustering

◮ A data set $S = \{x_1, \ldots, x_m\}$ with $x_i \in \mathbb{R}^d$
◮ Partition the data set into some number $K$ of clusters
◮ $K$ is a hyper-parameter given before learning
◮ Another example task of unsupervised learning

SLIDE 11

Objective Function

◮ Introduce $r_i \in [K]$ for each data point $x_i$, which is a deterministic variable
◮ The objective function of k-means clustering

$$J(r, \mu) = \sum_{i=1}^{m} \sum_{k=1}^{K} \delta(r_i = k)\, \|x_i - \mu_k\|_2^2 \qquad (1)$$

where $\{\mu_k\}_{k=1}^{K} \subset \mathbb{R}^d$. Each $\mu_k$ is called a prototype associated with the k-th cluster.

SLIDE 12

Objective Function

◮ Introduce $r_i \in [K]$ for each data point $x_i$, which is a deterministic variable
◮ The objective function of k-means clustering

$$J(r, \mu) = \sum_{i=1}^{m} \sum_{k=1}^{K} \delta(r_i = k)\, \|x_i - \mu_k\|_2^2 \qquad (1)$$

where $\{\mu_k\}_{k=1}^{K} \subset \mathbb{R}^d$. Each $\mu_k$ is called a prototype associated with the k-th cluster.

◮ Learning: minimize equation (1)

$$\operatorname*{argmin}_{r,\, \mu} J(r, \mu) \qquad (2)$$
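To make equation (1) concrete, here is a minimal NumPy sketch (not from the slides) that evaluates the objective for a given assignment and set of prototypes; the names `X`, `r`, and `mu` are illustrative.

```python
import numpy as np

def kmeans_objective(X, r, mu):
    """Evaluate J(r, mu) from equation (1).

    X  : (m, d) data matrix, one point per row
    r  : (m,) integer assignments with values in {0, ..., K-1}
    mu : (K, d) matrix of prototypes
    """
    # delta(r_i = k) simply selects mu_{r_i} for each point, so the double
    # sum over i and k collapses to a single sum over data points.
    diffs = X - mu[r]            # residuals x_i - mu_{r_i}, shape (m, d)
    return float(np.sum(diffs ** 2))
```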

SLIDE 13

Learning: Initialization

Randomly initialize $\{\mu_k\}_{k=1}^{K}$

SLIDE 14

Learning: Assignment Step

Given $\{\mu_k\}_{k=1}^{K}$, for each $x_i$, finding the value of $r_i$ is equivalent to assigning the data point to a cluster

$$r_i \leftarrow \operatorname*{argmin}_{k'} \|x_i - \mu_{k'}\|_2^2 \qquad (3)$$
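A vectorized sketch of the assignment step in equation (3), under the same illustrative naming (`X` for the data matrix, `mu` for the prototypes):

```python
import numpy as np

def assign_clusters(X, mu):
    """Assignment step (3): r_i <- argmin_k ||x_i - mu_k||_2^2."""
    # Broadcasting gives all pairwise squared distances at once:
    # (m, 1, d) - (1, K, d) -> (m, K, d), summed over d -> (m, K).
    dists = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)
    return np.argmin(dists, axis=1)
```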

SLIDE 15

Learning: Update Step

Given $\{r_i\}_{i=1}^{m}$, the algorithm updates $\mu_k$ as

$$\mu_k = \frac{\sum_{i=1}^{m} \delta(r_i = k)\, x_i}{\sum_{i=1}^{m} \delta(r_i = k)} \qquad (4)$$

◮ The updated $\mu_k$ equals the mean of all data points assigned to the k-th cluster
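A matching sketch of the update step in equation (4). The guard against empty clusters is an added assumption; the slide does not specify the behavior when no point is assigned to a cluster.

```python
import numpy as np

def update_prototypes(X, r, K):
    """Update step (4): mu_k <- mean of all points with r_i = k."""
    counts = np.bincount(r, minlength=K)          # sum_i delta(r_i = k)
    sums = np.zeros((K, X.shape[1]))
    np.add.at(sums, r, X)                         # sum_i delta(r_i = k) x_i
    # Assumption: leave empty clusters at zero rather than dividing by zero.
    return sums / np.maximum(counts, 1)[:, None]
```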

SLIDE 16

Algorithm

With some randomly initialized $\{\mu_k\}_{k=1}^{K}$, iterate the following two steps until convergence:

Assignment Step: Assign $r_i$ for each $x_i$

$$r_i \leftarrow \operatorname*{argmin}_{k'} \|x_i - \mu_{k'}\|_2^2 \qquad (5)$$

Update Step: Update $\mu_k$ with $\{r_i\}_{i=1}^{m}$

$$\mu_k = \frac{\sum_{i=1}^{m} \delta(r_i = k)\, x_i}{\sum_{i=1}^{m} \delta(r_i = k)} \qquad (6)$$
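Putting the two steps together, a minimal end-to-end loop following equations (5) and (6). Initializing prototypes from randomly chosen data points and stopping once assignments stabilize are common choices, not something the slide prescribes.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: alternate the assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialization: use K randomly chosen data points as prototypes.
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    r = None
    for _ in range(n_iters):
        # Assignment step, equation (5)
        dists = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)
        new_r = np.argmin(dists, axis=1)
        if r is not None and np.array_equal(new_r, r):
            break                     # assignments stable: converged
        r = new_r
        # Update step, equation (6): each prototype becomes its cluster mean.
        for k in range(K):
            if np.any(r == k):
                mu[k] = X[r == k].mean(axis=0)
    return r, mu
```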

SLIDE 17

Example (Cont.)


SLIDE 18

From GMMs to K-means

SLIDE 19

Gaussian Mixture Models

Consider a GMM with two components

$$q(x, z) = q(z)\, q(x \mid z) = \left[\alpha \cdot \mathcal{N}(x; \mu_1, \Sigma_1)\right]^{\delta(z=1)} \cdot \left[(1 - \alpha) \cdot \mathcal{N}(x; \mu_2, \Sigma_2)\right]^{\delta(z=2)} \qquad (7)$$

SLIDE 20

Gaussian Mixture Models

Consider a GMM with two components

$$q(x, z) = q(z)\, q(x \mid z) = \left[\alpha \cdot \mathcal{N}(x; \mu_1, \Sigma_1)\right]^{\delta(z=1)} \cdot \left[(1 - \alpha) \cdot \mathcal{N}(x; \mu_2, \Sigma_2)\right]^{\delta(z=2)} \qquad (7)$$

And the marginal probability $q(x)$ is

$$q(x) = q(z = 1)\, q(x \mid z = 1) + q(z = 2)\, q(x \mid z = 2) = \alpha \cdot \mathcal{N}(x; \mu_1, \Sigma_1) + (1 - \alpha) \cdot \mathcal{N}(x; \mu_2, \Sigma_2) \qquad (8)$$
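A small SciPy sketch (not from the slides) that evaluates the marginal density in equation (8); parameter names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, alpha, mu1, Sigma1, mu2, Sigma2):
    """q(x) = alpha * N(x; mu1, Sigma1) + (1 - alpha) * N(x; mu2, Sigma2)."""
    return (alpha * multivariate_normal.pdf(x, mean=mu1, cov=Sigma1)
            + (1 - alpha) * multivariate_normal.pdf(x, mean=mu2, cov=Sigma2))
```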

SLIDE 21

A Special Case

Consider the first component in this GMM with parameters $\mu_1$ and $\Sigma_1$

◮ Assume $\Sigma_1 = \epsilon I$; then

$$|\Sigma_1| = \epsilon^d \qquad (9)$$

$$(x - \mu_1)^{\mathsf{T}} \Sigma_1^{-1} (x - \mu_1) = \frac{1}{\epsilon} \|x - \mu_1\|_2^2 \qquad (10)$$

SLIDE 22

A Special Case

Consider the first component in this GMM with parameters $\mu_1$ and $\Sigma_1$

◮ Assume $\Sigma_1 = \epsilon I$; then

$$|\Sigma_1| = \epsilon^d \qquad (9)$$

$$(x - \mu_1)^{\mathsf{T}} \Sigma_1^{-1} (x - \mu_1) = \frac{1}{\epsilon} \|x - \mu_1\|_2^2 \qquad (10)$$

◮ A Gaussian component can be simplified as

$$q(x_i \mid z_i = 1) = \frac{1}{(2\pi)^{d/2} |\Sigma_1|^{1/2}} \exp\!\left(-\frac{1}{2} (x_i - \mu_1)^{\mathsf{T}} \Sigma_1^{-1} (x_i - \mu_1)\right) = \frac{1}{(2\pi\epsilon)^{d/2}} \exp\!\left(-\frac{1}{2\epsilon} \|x_i - \mu_1\|_2^2\right) \qquad (11)$$

SLIDE 23

A Special Case

Consider the first component in this GMM with parameters $\mu_1$ and $\Sigma_1$

◮ Assume $\Sigma_1 = \epsilon I$; then

$$|\Sigma_1| = \epsilon^d \qquad (9)$$

$$(x - \mu_1)^{\mathsf{T}} \Sigma_1^{-1} (x - \mu_1) = \frac{1}{\epsilon} \|x - \mu_1\|_2^2 \qquad (10)$$

◮ A Gaussian component can be simplified as

$$q(x_i \mid z_i = 1) = \frac{1}{(2\pi)^{d/2} |\Sigma_1|^{1/2}} \exp\!\left(-\frac{1}{2} (x_i - \mu_1)^{\mathsf{T}} \Sigma_1^{-1} (x_i - \mu_1)\right) = \frac{1}{(2\pi\epsilon)^{d/2}} \exp\!\left(-\frac{1}{2\epsilon} \|x_i - \mu_1\|_2^2\right) \qquad (11)$$

◮ Similar results hold for the second component with $\Sigma_2 = \epsilon I$
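Equation (11) is easy to check numerically; a quick sketch, where the specific values of `x`, `mu`, and `eps` are arbitrary:

```python
import numpy as np
from scipy.stats import multivariate_normal

# With Sigma = eps * I, the full Gaussian density should match the
# simplified isotropic form in equation (11).
d, eps = 2, 0.5
x = np.array([0.3, -0.7])
mu = np.array([1.0, 0.2])

full = multivariate_normal.pdf(x, mean=mu, cov=eps * np.eye(d))
simplified = (np.exp(-np.sum((x - mu) ** 2) / (2 * eps))
              / (2 * np.pi * eps) ** (d / 2))
print(np.isclose(full, simplified))  # True
```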

SLIDE 24

A Special Case (II)

From the previous discussion, we know that, given $\theta$, $q(z_i \mid x_i)$ is computed as

$$q(z_i = 1 \mid x_i) = \frac{\alpha \cdot \mathcal{N}(x_i; \mu_1, \Sigma_1)}{\alpha \cdot \mathcal{N}(x_i; \mu_1, \Sigma_1) + (1 - \alpha) \cdot \mathcal{N}(x_i; \mu_2, \Sigma_2)} = \frac{\alpha \exp(-\frac{1}{2\epsilon} \|x_i - \mu_1\|_2^2)}{\alpha \exp(-\frac{1}{2\epsilon} \|x_i - \mu_1\|_2^2) + (1 - \alpha) \exp(-\frac{1}{2\epsilon} \|x_i - \mu_2\|_2^2)}$$

SLIDE 25

A Special Case (II)

From the previous discussion, we know that, given $\theta$, $q(z_i \mid x_i)$ is computed as

$$q(z_i = 1 \mid x_i) = \frac{\alpha \cdot \mathcal{N}(x_i; \mu_1, \Sigma_1)}{\alpha \cdot \mathcal{N}(x_i; \mu_1, \Sigma_1) + (1 - \alpha) \cdot \mathcal{N}(x_i; \mu_2, \Sigma_2)} = \frac{\alpha \exp(-\frac{1}{2\epsilon} \|x_i - \mu_1\|_2^2)}{\alpha \exp(-\frac{1}{2\epsilon} \|x_i - \mu_1\|_2^2) + (1 - \alpha) \exp(-\frac{1}{2\epsilon} \|x_i - \mu_2\|_2^2)}$$

◮ When $\epsilon \to 0$

$$q(z_i = 1 \mid x_i) \to \begin{cases} 1 & \|x_i - \mu_1\|_2 < \|x_i - \mu_2\|_2 \\ 0 & \|x_i - \mu_1\|_2 > \|x_i - \mu_2\|_2 \end{cases} \qquad (12)$$

SLIDE 26

A Special Case (II)

From the previous discussion, we know that, given $\theta$, $q(z_i \mid x_i)$ is computed as

$$q(z_i = 1 \mid x_i) = \frac{\alpha \cdot \mathcal{N}(x_i; \mu_1, \Sigma_1)}{\alpha \cdot \mathcal{N}(x_i; \mu_1, \Sigma_1) + (1 - \alpha) \cdot \mathcal{N}(x_i; \mu_2, \Sigma_2)} = \frac{\alpha \exp(-\frac{1}{2\epsilon} \|x_i - \mu_1\|_2^2)}{\alpha \exp(-\frac{1}{2\epsilon} \|x_i - \mu_1\|_2^2) + (1 - \alpha) \exp(-\frac{1}{2\epsilon} \|x_i - \mu_2\|_2^2)}$$

◮ When $\epsilon \to 0$

$$q(z_i = 1 \mid x_i) \to \begin{cases} 1 & \|x_i - \mu_1\|_2 < \|x_i - \mu_2\|_2 \\ 0 & \|x_i - \mu_1\|_2 > \|x_i - \mu_2\|_2 \end{cases} \qquad (12)$$

◮ $r_i$ in K-means is a very special case of $z_i$ in GMM
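A short numerical sketch of the limit in equation (12): as `eps` shrinks, the responsibility q(z_i = 1 | x_i) hardens into the K-means assignment. The point and means below are arbitrary illustrative values.

```python
import numpy as np

def responsibility(x, alpha, mu1, mu2, eps):
    """q(z = 1 | x) for the isotropic GMM with Sigma_1 = Sigma_2 = eps * I.

    Uses the simplified form of equation (11); the shared (2*pi*eps)^(-d/2)
    normalizers cancel between numerator and denominator.
    """
    w1 = alpha * np.exp(-np.sum((x - mu1) ** 2) / (2 * eps))
    w2 = (1 - alpha) * np.exp(-np.sum((x - mu2) ** 2) / (2 * eps))
    return w1 / (w1 + w2)

x = np.array([0.4, 0.0])                    # closer to mu1 than to mu2
mu1, mu2 = np.array([1.5, 0.0]), np.array([-1.5, 0.0])
for eps in [10.0, 1.0, 0.1, 0.01]:
    print(eps, responsibility(x, 0.5, mu1, mu2, eps))
# The responsibility climbs toward 1 as eps -> 0, i.e. a hard assignment.
```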

SLIDE 27

When Will K-means Fail?

Recall that K-means is an extreme case of GMM with $\Sigma = \epsilon I$ and $\epsilon \to 0$. Consider the parameters

$$\mu_1 = [1.5, 0]^{\mathsf{T}}, \quad \mu_2 = [-1.5, 0]^{\mathsf{T}}, \quad \Sigma_1 = \Sigma_2 = \mathrm{diag}(0.1, 10.0) \qquad (13)$$

SLIDE 28

When Will K-means Fail? (II)

Recall that K-means is an extreme case of GMM with $\Sigma = \epsilon I$ and $\epsilon \to 0$

SLIDE 29

How About GMM?

With the following setup¹:

◮ Randomly initialize the GMM parameters (instead of using K-means to initialize)
◮ Set covariance_type to be tied

¹Please refer to the demo code for more detail
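The referenced demo code is not reproduced here, but a minimal scikit-learn sketch of the described setup might look as follows; the synthetic data, which mirrors the parameters in equation (13), is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data following equation (13): two elongated Gaussians that
# K-means tends to split incorrectly (illustrative assumption).
rng = np.random.default_rng(0)
cov = np.diag([0.1, 10.0])
X = np.vstack([
    rng.multivariate_normal([1.5, 0.0], cov, size=200),
    rng.multivariate_normal([-1.5, 0.0], cov, size=200),
])

# Random initialization (init_params='random' instead of the default
# 'kmeans') and a single covariance shared across components
# (covariance_type='tied'), as described on the slide.
gmm = GaussianMixture(n_components=2, covariance_type='tied',
                      init_params='random', n_init=5, random_state=0)
labels = gmm.fit_predict(X)
```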

SLIDE 30

Spectral Clustering

Instead of computing distances between data points and a set of prototypes, spectral clustering is based purely on the similarity between data points, which can address failure cases like the one in the previous example [Shalev-Shwartz and Ben-David, 2014, Section 22.3]
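For comparison, a minimal scikit-learn sketch of spectral clustering, which works from pairwise similarities rather than prototype distances; it reuses the synthetic `X` from the previous sketch, and the RBF affinity is an illustrative choice.

```python
from sklearn.cluster import SpectralClustering

# Build an RBF similarity graph over the points and cluster its spectral
# embedding; no prototypes are involved, only pairwise similarities.
sc = SpectralClustering(n_clusters=2, affinity='rbf', gamma=1.0,
                        random_state=0)
labels = sc.fit_predict(X)   # X from the GMM sketch above
```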

SLIDE 31

Reference

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

MacKay, D. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press.

Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.