 
CS 6316 Machine Learning: Clustering
Yangfeng Ji, Department of Computer Science, University of Virginia
Clustering
Clustering is the task of grouping a set of objects such that similar objects end up in the same group and dissimilar objects are separated into different groups [Shalev-Shwartz and Ben-David, 2014, Page 307]
Motivation (I)
A good clustering can help us understand the data [MacKay, 2003, Chap 20]
Motivation (II)
A good clustering has predictive power and can be useful for building better classifiers [MacKay, 2003, Chap 20]
Motivation (III)
Failures of a cluster model may highlight interesting properties of the data or of a single data point [MacKay, 2003, Chap 20]
Challenges
◮ Lack of ground truth, as in any other unsupervised learning task
◮ Definition of the similarity measurement
  ◮ What makes two images similar?
  ◮ What makes two documents similar?
[Shalev-Shwartz and Ben-David, 2014, Page 307]
K-Means Clustering
◮ A data set $S = \{x_1, \dots, x_m\}$ with $x_i \in \mathbb{R}^d$
◮ Partition the data set into some number $K$ of clusters
◮ $K$ is a hyper-parameter given before learning
◮ Another example task of unsupervised learning
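Objective Function
◮ Introduce $r_i \in [K]$ for each data point $x_i$, which is a deterministic variable giving the index of the cluster that $x_i$ is assigned to
◮ The objective function of $K$-means clustering is
$$J(r, \mu) = \sum_{i=1}^{m} \sum_{k=1}^{K} \delta(r_i = k)\, \|x_i - \mu_k\|_2^2 \tag{1}$$
where $\mu_k \in \mathbb{R}^d$ for $k = 1, \dots, K$, and $\delta(\cdot)$ is 1 when its argument holds and 0 otherwise. Each $\mu_k$ is called a prototype associated with the $k$-th cluster.
◮ Learning: minimize equation (1)
$$\operatorname*{argmin}_{r, \mu} J(r, \mu) \tag{2}$$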
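As a small illustration of equation (1), the sketch below evaluates the objective in plain Python. Because $\delta(r_i = k)$ is nonzero only for $k = r_i$, the inner sum reduces to the squared distance between $x_i$ and its own prototype. The function name `kmeans_objective`, the 0-based cluster indices, and the toy data are assumptions made for this example only.

```python
def kmeans_objective(points, assignments, prototypes):
    """Evaluate J(r, mu) = sum_i ||x_i - mu_{r_i}||_2^2.

    The double sum with delta(r_i = k) collapses to a single term per
    point, namely the squared distance to the assigned prototype.
    """
    total = 0.0
    for x, r in zip(points, assignments):
        mu = prototypes[r]
        total += sum((xj - mj) ** 2 for xj, mj in zip(x, mu))
    return total


# Toy data (hypothetical): three 2-D points and two prototypes.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0)]
prototypes = [(0.0, 1.0), (4.0, 0.0)]
assignments = [0, 0, 1]  # 0-based indices, unlike r_i in [K] on the slide

J = kmeans_objective(points, assignments, prototypes)
```

The first two points are each at squared distance 1 from the first prototype and the third point coincides with the second prototype, so `J` evaluates to 2.0.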
Learning: Initialization
Randomly initialize $\{\mu_k\}_{k=1}^{K}$
Learning: Assignment Step
Given $\{\mu_k\}_{k=1}^{K}$, finding the value of $r_i$ for each $x_i$ is equivalent to assigning the data point to the cluster with the closest prototype
$$r_i \leftarrow \operatorname*{argmin}_{k'} \|x_i - \mu_{k'}\|_2^2 \tag{3}$$
Learning: Update Step
Given $\{r_i\}_{i=1}^{m}$, the algorithm updates $\mu_k$ as
$$\mu_k = \frac{\sum_{i=1}^{m} \delta(r_i = k)\, x_i}{\sum_{i=1}^{m} \delta(r_i = k)} \tag{4}$$
◮ The updated $\mu_k$ equals the mean of all data points assigned to cluster $k$
Algorithm
With some randomly initialized $\{\mu_k\}_{k=1}^{K}$, iterate the following two steps until convergence
Assignment Step: assign $r_i$ for each $x_i$
$$r_i \leftarrow \operatorname*{argmin}_{k'} \|x_i - \mu_{k'}\|_2^2 \tag{5}$$
Update Step: update $\mu_k$ with $\{r_i\}_{i=1}^{m}$
$$\mu_k = \frac{\sum_{i=1}^{m} \delta(r_i = k)\, x_i}{\sum_{i=1}^{m} \delta(r_i = k)} \tag{6}$$
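The two alternating steps can be sketched in plain Python as follows. This is a minimal illustration rather than the course's demo code: the names `kmeans` and `squared_distance`, the choice to initialize the prototypes from randomly sampled data points, the rule of keeping the old prototype when a cluster becomes empty, and the stopping test (assignments no longer change) are all assumptions of this example.

```python
import random


def squared_distance(x, mu):
    """Squared Euclidean distance ||x - mu||_2^2."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))


def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Initialization (assumed variant): use k distinct data points as prototypes.
    prototypes = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iters):
        # Assignment step: r_i <- argmin_k ||x_i - mu_k||^2
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(x, prototypes[c]))
            for x in points
        ]
        if new_assignments == assignments:
            break  # assignments are stable, so the prototypes would not move
        assignments = new_assignments
        # Update step: mu_k <- mean of the points assigned to cluster k
        for c in range(k):
            members = [x for x, r in zip(points, assignments) if r == c]
            if members:  # an empty cluster keeps its previous prototype
                prototypes[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, prototypes


# Toy data (hypothetical): two visually separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignments, prototypes = kmeans(points, k=2)
```

When the loop stops, every point is assigned to its nearest prototype and every non-empty cluster's prototype is the mean of its members. Which partition is reached can depend on the initialization.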
Example (Cont.) [figure]
From GMMs to K-means
Gaussian Mixture Models
Consider a GMM with two components
$$q(x, z) = q(z)\, q(x \mid z) = \alpha^{\delta(z=1)} \cdot \mathcal{N}(x; \mu_1, \Sigma_1)^{\delta(z=1)} \cdot (1-\alpha)^{\delta(z=2)} \cdot \mathcal{N}(x; \mu_2, \Sigma_2)^{\delta(z=2)} \tag{7}$$
And the marginal probability $q(x)$ is
$$q(x) = q(z=1)\, q(x \mid z=1) + q(z=2)\, q(x \mid z=2) = \alpha \cdot \mathcal{N}(x; \mu_1, \Sigma_1) + (1-\alpha) \cdot \mathcal{N}(x; \mu_2, \Sigma_2) \tag{8}$$
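To make equations (7) and (8) concrete, here is a minimal sketch for the one-dimensional case, where each $\Sigma_k$ reduces to a scalar variance. The restriction to one dimension, the parameter values, and the function names are assumptions chosen for illustration. The final lines approximate the integral of $q(x)$ with a Riemann sum as a sanity check that the mixture is a valid density.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x; mean, var)."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def joint(x, z, alpha, params):
    """q(x, z) = q(z) q(x | z) for z in {1, 2}, as in equation (7)."""
    prior = alpha if z == 1 else 1.0 - alpha
    mean, var = params[z]
    return prior * normal_pdf(x, mean, var)


def marginal(x, alpha, params):
    """q(x) = sum over z of q(x, z), as in equation (8)."""
    return joint(x, 1, alpha, params) + joint(x, 2, alpha, params)


# Hypothetical parameters: component -> (mean, variance).
alpha = 0.3
params = {1: (1.5, 0.5), 2: (-1.5, 2.0)}

# Riemann-sum approximation of the integral of q(x) over [-20, 20].
step = 0.01
area = sum(marginal(-20 + i * step, alpha, params) * step for i in range(4001))
```

The value of `area` should be very close to 1, since the mixture weights $\alpha$ and $1 - \alpha$ sum to one.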
A Special Case
Consider the first component in this GMM with parameters $\mu_1$ and $\Sigma_1$
◮ Assume $\Sigma_1 = \epsilon I$, then
$$|\Sigma_1| = \epsilon^d \tag{9}$$
$$(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) = \frac{1}{\epsilon} \|x - \mu_1\|_2^2 \tag{10}$$
◮ A Gaussian component can be simplified as
$$q(x_i \mid z_i = 1) = \frac{1}{(2\pi)^{d/2} |\Sigma_1|^{1/2}} \exp\left(-\frac{1}{2} (x_i - \mu_1)^T \Sigma_1^{-1} (x_i - \mu_1)\right) = \frac{1}{(2\pi\epsilon)^{d/2}} \exp\left(-\frac{1}{2\epsilon} \|x_i - \mu_1\|_2^2\right) \tag{11}$$
◮ Similar results hold for the second component with $\Sigma_2 = \epsilon I$
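One way to check the simplified density in equation (11) numerically is to use the fact that a Gaussian with covariance $\epsilon I$ factorizes into a product of $d$ univariate Gaussians, each with variance $\epsilon$. The sketch below compares the two forms; the function names and the test values are assumptions for this example.

```python
import math


def isotropic_gaussian_pdf(x, mu, eps):
    """Equation (11): (2*pi*eps)^(-d/2) * exp(-||x - mu||^2 / (2*eps))."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mu))
    return math.exp(-sq_dist / (2 * eps)) / (2 * math.pi * eps) ** (d / 2)


def univariate_gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x; mean, var)."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


# Hypothetical 3-D test point, mean, and variance scale.
x, mu, eps = (0.5, -1.0, 2.0), (0.0, 0.0, 1.0), 0.3

closed_form = isotropic_gaussian_pdf(x, mu, eps)
# With covariance eps * I the coordinates are independent,
# so the joint density is the product of the per-coordinate densities.
product_form = math.prod(univariate_gaussian_pdf(a, b, eps) for a, b in zip(x, mu))
```

The two values agree up to floating-point rounding, which is consistent with equations (9) and (10).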
A Special Case (II)
From the previous discussion, we know that, given $\theta$, $q(z_i \mid x_i)$ is computed as
$$q(z_i = 1 \mid x_i) = \frac{\alpha \cdot \mathcal{N}(x_i; \mu_1, \Sigma_1)}{\alpha \cdot \mathcal{N}(x_i; \mu_1, \Sigma_1) + (1-\alpha) \cdot \mathcal{N}(x_i; \mu_2, \Sigma_2)} = \frac{\alpha \exp\left(-\frac{1}{2\epsilon} \|x_i - \mu_1\|_2^2\right)}{\alpha \exp\left(-\frac{1}{2\epsilon} \|x_i - \mu_1\|_2^2\right) + (1-\alpha) \exp\left(-\frac{1}{2\epsilon} \|x_i - \mu_2\|_2^2\right)}$$
◮ When $\epsilon \to 0$
$$q(z_i = 1 \mid x_i) \to \begin{cases} 1 & \|x_i - \mu_1\|_2 < \|x_i - \mu_2\|_2 \\ 0 & \|x_i - \mu_1\|_2 > \|x_i - \mu_2\|_2 \end{cases} \tag{12}$$
◮ $r_i$ in $K$-means is a very special case of $z_i$ in GMM
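The limit in equation (12) can be observed numerically. The sketch below computes $q(z_i = 1 \mid x_i)$ for two isotropic components sharing the covariance $\epsilon I$, working in log space so that very small $\epsilon$ does not cause a division by zero. The function name `responsibility_1`, the mixing weight, and the example points are assumptions made for illustration.

```python
import math


def responsibility_1(x, mu1, mu2, alpha, eps):
    """q(z = 1 | x) for two Gaussians that share the covariance eps * I."""
    d1 = sum((a - b) ** 2 for a, b in zip(x, mu1))
    d2 = sum((a - b) ** 2 for a, b in zip(x, mu2))
    # Log of the unnormalized posterior weights; the common factor
    # (2*pi*eps)^(-d/2) cancels between numerator and denominator.
    log_w1 = math.log(alpha) - d1 / (2 * eps)
    log_w2 = math.log(1 - alpha) - d2 / (2 * eps)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_w1, log_w2)
    w1 = math.exp(log_w1 - m)
    w2 = math.exp(log_w2 - m)
    return w1 / (w1 + w2)


# Hypothetical setup: x is closer to mu1, but the prior favors component 2.
mu1, mu2, alpha = (1.5, 0.0), (-1.5, 0.0), 0.3
x = (0.4, 0.0)

posteriors = {eps: responsibility_1(x, mu1, mu2, alpha, eps)
              for eps in (10.0, 1.0, 0.1, 0.001)}
```

For large $\epsilon$ the prior weight $\alpha$ still influences the posterior, while for small $\epsilon$ the posterior approaches a hard assignment to the nearest mean, which is the $K$-means assignment step.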
When Will K-means Fail?
Recall that K-means is an extreme case of GMM with $\Sigma = \epsilon I$ and $\epsilon \to 0$
Parameters
$$\mu_1 = [1.5, 0]^T, \quad \mu_2 = [-1.5, 0]^T, \quad \Sigma_1 = \Sigma_2 = \operatorname{diag}(0.1, 10.0) \tag{13}$$
When Will K-means Fail? (II)
Recall that K-means is an extreme case of GMM with $\Sigma = \epsilon I$ and $\epsilon \to 0$ [figure]
How About GMM?
With the following setup¹
◮ Randomly initialize the GMM parameters (instead of using K-means to initialize them)
◮ Set covariance_type to tied
¹ Please refer to the demo code for more detail
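The setup above can be sketched with scikit-learn as shown below. This is not the course's demo code: the use of `sklearn.mixture.GaussianMixture` and `sklearn.cluster.KMeans`, the sample size of 200 points per component, and the random seeds are assumptions of this example. The data are drawn from the two Gaussians with the parameters given in equation (13).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Sample data from the two elongated Gaussians of equation (13).
rng = np.random.default_rng(0)
cov = np.diag([0.1, 10.0])
X = np.vstack([
    rng.multivariate_normal([1.5, 0.0], cov, size=200),   # assumed sample size
    rng.multivariate_normal([-1.5, 0.0], cov, size=200),
])

# K-means for comparison: it only uses Euclidean distance to prototypes.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# GMM with random initialization and one covariance matrix shared by both components.
gmm = GaussianMixture(
    n_components=2,
    covariance_type="tied",
    init_params="random",
    random_state=0,
)
gmm_labels = gmm.fit_predict(X)
```

With `covariance_type="tied"`, `gmm.covariances_` holds a single 2 x 2 matrix shared by both components, which can be compared against $\operatorname{diag}(0.1, 10.0)$ after fitting. Plotting `kmeans_labels` and `gmm_labels` over `X` is one way to compare the two clusterings.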
Spectral Clustering
Instead of computing the distance between data points and some prototypes, spectral clustering is based purely on the similarity between data points, which can address problems like this one [Shalev-Shwartz and Ben-David, 2014, Section 22.3]
References
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
MacKay, D. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press.
Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.