  1. COMS 4721: Machine Learning for Data Science
Lecture 16, 3/28/2017
Prof. John Paisley
Department of Electrical Engineering & Data Science Institute
Columbia University

  2. SOFT CLUSTERING VS HARD CLUSTERING MODELS

  3. HARD CLUSTERING MODELS
Review: K-means clustering algorithm
Given: Data $x_1, \dots, x_n$, where $x \in \mathbb{R}^d$.
Goal: Minimize $\mathcal{L} = \sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}\{c_i = k\}\,\|x_i - \mu_k\|^2$.
◮ Iterate until the values no longer change:
1. Update $c$: For each $i$, set $c_i = \arg\min_k \|x_i - \mu_k\|^2$.
2. Update $\mu$: For each $k$, set $\mu_k = \frac{\sum_i x_i \mathbb{1}\{c_i = k\}}{\sum_i \mathbb{1}\{c_i = k\}}$.
K-means is an example of a hard clustering algorithm because it assigns each observation to exactly one cluster. In other words, $c_i = k$ for some $k \in \{1, \dots, K\}$. There is no accounting for the "boundary cases" by hedging on the corresponding $c_i$.
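A minimal NumPy sketch of this iteration (the function name, random initialization, and fixed number of passes are my own choices, not from the lecture):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Hard clustering: alternate the assignment (c) and mean (mu) updates."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # init centers at random data points
    for _ in range(n_iters):
        # Update c: assign each x_i to its nearest center
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, K) squared distances
        c = dists.argmin(axis=1)
        # Update mu: mean of the points currently assigned to each cluster
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return c, mu
```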

  4. SOFT CLUSTERING MODELS
A soft clustering algorithm breaks the data across clusters intelligently.
[Figure: (left) True cluster assignments of data from three Gaussians. (middle) The data as we see it. (right) A soft clustering of the data accounting for borderline cases.]

  5. WEIGHTED K-MEANS (SOFT CLUSTERING EXAMPLE)
Weighted K-means clustering algorithm
Given: Data $x_1, \dots, x_n$, where $x \in \mathbb{R}^d$.
Goal: Minimize $\mathcal{L} = \sum_{i=1}^{n} \sum_{k=1}^{K} \phi_i(k)\,\|x_i - \mu_k\|^2 - \beta \sum_i H(\phi_i)$ over $\phi_i$ and $\mu_k$.
Conditions: $\phi_i(k) > 0$, $\sum_{k=1}^{K} \phi_i(k) = 1$, $H(\phi_i)$ = entropy. Set $\beta > 0$.
◮ Iterate the following:
1. Update $\phi$: For each $i$, update the cluster allocation weights
$\phi_i(k) = \frac{\exp\{-\frac{1}{\beta}\|x_i - \mu_k\|^2\}}{\sum_j \exp\{-\frac{1}{\beta}\|x_i - \mu_j\|^2\}}, \quad \text{for } k = 1, \dots, K$
2. Update $\mu$: For each $k$, update $\mu_k$ with the weighted average
$\mu_k = \frac{\sum_i x_i\,\phi_i(k)}{\sum_i \phi_i(k)}$
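A corresponding NumPy sketch of weighted K-means (the log-domain normalization of the weights and the initialization are my own choices):

```python
import numpy as np

def weighted_kmeans(X, K, beta=1.0, n_iters=100, seed=0):
    """Soft clustering: phi[i, k] is the weight of point i on cluster k."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Update phi: softmax of -||x_i - mu_k||^2 / beta, computed stably
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, K)
        logits = -d / beta
        logits -= logits.max(axis=1, keepdims=True)               # numerical stability
        phi = np.exp(logits)
        phi /= phi.sum(axis=1, keepdims=True)
        # Update mu: weighted average of the data under phi
        mu = (phi.T @ X) / phi.sum(axis=0)[:, None]
    return phi, mu
```

As $\beta \to 0$ the weights approach one-hot vectors and the procedure reduces to K-means, while larger $\beta$ spreads each point over more clusters.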

  6. SOFT CLUSTERING WITH WEIGHTED K-MEANS
[Figure: example soft assignments: $\phi_i = 0.75$ on the green cluster and $0.25$ on the blue cluster for a borderline point $x$, versus $\phi_i = 1$ on the blue cluster for a point inside the $\beta$-defined region around $\mu_1$.]
When $\phi_i$ is binary, we get back the hard clustering model.

  7. MIXTURE MODELS

  8. PROBABILISTIC SOFT CLUSTERING MODELS
Probabilistic vs. non-probabilistic soft clustering
The weight vector $\phi_i$ is like a probability of $x_i$ being assigned to each cluster. A mixture model is a probabilistic model where $\phi_i$ actually is a probability distribution according to the model.
Mixture models work by defining:
◮ A prior distribution on the cluster assignment indicator $c_i$
◮ A likelihood distribution on the observation $x_i$ given the assignment $c_i$
Intuitively, we can connect a mixture model to the Bayes classifier:
◮ Class prior → cluster prior. This time, we don't know the "label."
◮ Class-conditional likelihood → cluster-conditional likelihood

  9. MIXTURE MODELS
[Figure: (a) A probability distribution on $\mathbb{R}^2$. (b) Data sampled from this distribution.]
Before introducing the math, some key features of a mixture model are:
1. It is a generative model (it defines a probability distribution on the data).
2. It is a weighted combination of simpler distributions.
◮ Each simple distribution is in the same distribution family (e.g., Gaussian).
◮ The "weighting" is defined by a discrete probability distribution.

  10. MIXTURE MODELS
Generating data from a mixture model
Data: $x_1, \dots, x_n$, where each $x_i \in \mathcal{X}$ (can be complicated, but think $\mathcal{X} = \mathbb{R}^d$).
Model parameters: A $K$-dimensional distribution $\pi$ and parameters $\theta_1, \dots, \theta_K$.
Generative process: For observation number $i = 1, \dots, n$,
1. Generate the cluster assignment: $c_i \stackrel{iid}{\sim} \text{Discrete}(\pi)$, so $\text{Prob}(c_i = k \mid \pi) = \pi_k$.
2. Generate the observation: $x_i \sim p(x \mid \theta_{c_i})$.
Some observations about this procedure:
◮ First, each $x_i$ is randomly assigned to a cluster using the distribution $\pi$.
◮ $c_i$ indexes the cluster assignment for $x_i$.
◮ This picks out the index of the parameter $\theta$ used to generate $x_i$.
◮ If two $x$'s share a parameter, they are clustered together.
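A minimal sketch of this generative process. The lecture leaves $p(x \mid \theta)$ generic; here each $\theta_k$ is taken to be a (mean, standard deviation) pair for a 1-D Gaussian purely for illustration:

```python
import numpy as np

def sample_mixture(n, pi, thetas, seed=0):
    """Draw c_i ~ Discrete(pi), then x_i ~ p(x | theta_{c_i})."""
    rng = np.random.default_rng(seed)
    c = rng.choice(len(pi), size=n, p=pi)                  # cluster assignments
    x = np.array([rng.normal(*thetas[k]) for k in c])      # observations given assignments
    return c, x

# e.g., three components with uneven mixing weights
c, x = sample_mixture(500, pi=[0.7, 0.2, 0.1], thetas=[(0, 1), (5, 0.5), (-4, 2)])
```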

  11. MIXTURE MODELS
[Figure: (a) Uniform mixing weights. (b) Data sampled from this distribution. (c) Uneven mixing weights. (d) Data sampled from this distribution.]

  12. GAUSSIAN MIXTURE MODELS

  13. ILLUSTRATION
Gaussian mixture models are mixture models where $p(x \mid \theta)$ is Gaussian.
Mixture of two Gaussians: the red line is the density function, with $\pi = [0.5, 0.5]$, $(\mu_1, \sigma_1^2) = (0, 1)$, $(\mu_2, \sigma_2^2) = (2, 0.5)$.
Influence of mixing weights: the red line is the density function, with $\pi = [0.8, 0.2]$, $(\mu_1, \sigma_1^2) = (0, 1)$, $(\mu_2, \sigma_2^2) = (2, 0.5)$.
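The plotted densities are just weighted sums of Gaussian pdfs; a small sketch with the parameters above (variable names and the evaluation grid are mine), assuming SciPy is available:

```python
import numpy as np
from scipy.stats import norm

# Mixture density p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)
pi = np.array([0.5, 0.5])     # change to [0.8, 0.2] to see the influence of the mixing weights
mu = np.array([0.0, 2.0])
var = np.array([1.0, 0.5])

xs = np.linspace(-4.0, 6.0, 500)
density = sum(pi[k] * norm.pdf(xs, loc=mu[k], scale=np.sqrt(var[k])) for k in range(len(pi)))
```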

  14. GAUSSIAN MIXTURE MODELS (GMM)
The model
Parameters: Let $\pi$ be a $K$-dimensional probability distribution and $(\mu_k, \Sigma_k)$ be the mean and covariance of the $k$th Gaussian in $\mathbb{R}^d$.
Generate data: For the $i$th observation,
1. Assign the $i$th observation to a cluster: $c_i \sim \text{Discrete}(\pi)$.
2. Generate the value of the observation: $x_i \sim N(\mu_{c_i}, \Sigma_{c_i})$.
Definitions: $\mu = \{\mu_1, \dots, \mu_K\}$ and $\Sigma = \{\Sigma_1, \dots, \Sigma_K\}$.
Goal: We want to learn $\pi$, $\mu$, and $\Sigma$.

  15. GAUSSIAN MIXTURE MODELS (GMM)
Maximum likelihood
Objective: Maximize the likelihood over the model parameters $\pi$, $\mu$, and $\Sigma$ by treating the $c_i$ as auxiliary data using the EM algorithm:
$p(x_1, \dots, x_n \mid \pi, \mu, \Sigma) = \prod_{i=1}^{n} p(x_i \mid \pi, \mu, \Sigma) = \prod_{i=1}^{n} \sum_{k=1}^{K} p(x_i, c_i = k \mid \pi, \mu, \Sigma)$
The summation over the values of each $c_i$ "integrates out" this variable. We can't simply take derivatives with respect to $\pi$, $\mu_k$, and $\Sigma_k$ and set them to zero to maximize this, because there is no closed-form solution. We could use gradient methods, but EM is cleaner.
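Although this likelihood has no closed-form maximizer, it is easy to evaluate, which is useful for monitoring EM later. A sketch of a numerically stable evaluation in the log domain (assuming SciPy; the function name is mine):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k), computed in the log domain."""
    n, K = len(X), len(pi)
    log_p = np.empty((n, K))
    for k in range(K):
        log_p[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
    return logsumexp(log_p, axis=1).sum()
```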

  16. EM ALGORITHM
Q: Why not instead just include each $c_i$ and maximize $\prod_{i=1}^{n} p(x_i, c_i \mid \pi, \mu, \Sigma)$, since (we can show) this is easy to do using coordinate ascent?
A: We would end up with a hard clustering model where $c_i \in \{1, \dots, K\}$. Our goal here is to have soft clustering, which EM does.
EM and the GMM
We will not derive everything from scratch. However, we can treat $c_1, \dots, c_n$ as the auxiliary data that we integrate out. Therefore, we use EM to maximize $\sum_{i=1}^{n} \ln p(x_i \mid \pi, \mu, \Sigma)$ by using $\sum_{i=1}^{n} \ln p(x_i, c_i \mid \pi, \mu, \Sigma)$.
Let's look at the outline of how to derive this.

  17. THE EM ALGORITHM AND THE GMM
From the last lecture, the generic EM objective is
$\ln p(x \mid \theta_1) = \int q(\theta_2) \ln \frac{p(x, \theta_2 \mid \theta_1)}{q(\theta_2)}\, d\theta_2 + \int q(\theta_2) \ln \frac{q(\theta_2)}{p(\theta_2 \mid x, \theta_1)}\, d\theta_2$
The EM objective for the Gaussian mixture model is
$\sum_{i=1}^{n} \ln p(x_i \mid \pi, \mu, \Sigma) = \sum_{i=1}^{n} \sum_{k=1}^{K} q(c_i = k) \ln \frac{p(x_i, c_i = k \mid \pi, \mu, \Sigma)}{q(c_i = k)} + \sum_{i=1}^{n} \sum_{k=1}^{K} q(c_i = k) \ln \frac{q(c_i = k)}{p(c_i = k \mid x_i, \pi, \mu, \Sigma)}$
Because $c_i$ is discrete, the integral becomes a sum.

  18. EM SETUP (ONE ITERATION)
First: Set $q(c_i = k) \Leftarrow p(c_i = k \mid x_i, \pi, \mu, \Sigma)$ using Bayes rule:
$p(c_i = k \mid x_i, \pi, \mu, \Sigma) \propto p(c_i = k \mid \pi)\, p(x_i \mid c_i = k, \mu, \Sigma)$
We can solve for the posterior of $c_i$ given $\pi$, $\mu$, and $\Sigma$:
$q(c_i = k) = \frac{\pi_k N(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j N(x_i \mid \mu_j, \Sigma_j)} \;\Longrightarrow\; \phi_i(k)$
E-step: Take the expectation using the updated $q$'s:
$\mathcal{L} = \sum_{i=1}^{n} \sum_{k=1}^{K} \phi_i(k) \ln p(x_i, c_i = k \mid \pi, \mu_k, \Sigma_k) + \text{constant w.r.t. } \pi, \mu, \Sigma$
M-step: Maximize $\mathcal{L}$ with respect to $\pi$ and each $\mu_k, \Sigma_k$.
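A sketch of the $\phi_i(k)$ computation; normalizing in the log domain is my own numerical-stability choice, not part of the slide:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def e_step(X, pi, mus, Sigmas):
    """phi[i, k] = pi_k N(x_i | mu_k, Sigma_k) / sum_j pi_j N(x_i | mu_j, Sigma_j)."""
    K = len(pi)
    log_w = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                      for k in range(K)], axis=1)          # (n, K) unnormalized log weights
    return np.exp(log_w - logsumexp(log_w, axis=1, keepdims=True))
```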

  19. M-STEP CLOSE UP
Aside: How has EM made this easier?
The original objective function is
$\mathcal{L} = \sum_{i=1}^{n} \ln \sum_{k=1}^{K} p(x_i, c_i = k \mid \pi, \mu_k, \Sigma_k) = \sum_{i=1}^{n} \ln \sum_{k=1}^{K} \pi_k N(x_i \mid \mu_k, \Sigma_k)$
The log-sum form makes optimizing $\pi$ and each $\mu_k$ and $\Sigma_k$ difficult.
Using EM here, we have the M-step objective
$\mathcal{L} = \sum_{i=1}^{n} \sum_{k=1}^{K} \phi_i(k)\, \underbrace{\{\ln \pi_k + \ln N(x_i \mid \mu_k, \Sigma_k)\}}_{\ln p(x_i, c_i = k \mid \pi, \mu_k, \Sigma_k)} + \text{constant w.r.t. } \pi, \mu, \Sigma$
The sum-log form is easier to optimize. We can take derivatives and solve.
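As a worked example of "take derivatives and solve" (this derivation is not spelled out on the slide), maximizing the M-step objective over $\pi$ with a Lagrange multiplier for the constraint $\sum_k \pi_k = 1$ gives the $\pi_k$ update that appears on the next slide:

```latex
\begin{aligned}
\frac{\partial}{\partial \pi_k}\Big[\sum_{i=1}^{n}\sum_{j=1}^{K}\phi_i(j)\ln\pi_j
    + \lambda\Big(\sum_{j=1}^{K}\pi_j - 1\Big)\Big]
  &= \frac{\sum_{i=1}^{n}\phi_i(k)}{\pi_k} + \lambda = 0
  \;\Rightarrow\; \pi_k = \frac{n_k}{-\lambda}, \qquad n_k := \sum_{i=1}^{n}\phi_i(k),\\[4pt]
\sum_{k=1}^{K}\pi_k = 1
  &\;\Rightarrow\; -\lambda = \sum_{k=1}^{K} n_k = n
  \;\Rightarrow\; \pi_k = \frac{n_k}{n}.
\end{aligned}
```

The $\mu_k$ and $\Sigma_k$ updates follow similarly by setting the corresponding gradients to zero, which is a weighted Gaussian maximum likelihood problem.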

  20. EM FOR THE GMM
Algorithm: Maximum likelihood EM for the GMM
Given: $x_1, \dots, x_n$, where $x \in \mathbb{R}^d$.
Goal: Maximize $\mathcal{L} = \sum_{i=1}^{n} \ln p(x_i \mid \pi, \mu, \Sigma)$.
◮ Iterate until the incremental improvement to $\mathcal{L}$ is "small":
1. E-step: For $i = 1, \dots, n$, set
$\phi_i(k) = \frac{\pi_k N(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j N(x_i \mid \mu_j, \Sigma_j)}, \quad \text{for } k = 1, \dots, K$
2. M-step: For $k = 1, \dots, K$, define $n_k = \sum_{i=1}^{n} \phi_i(k)$ and update the values
$\pi_k = \frac{n_k}{n}, \qquad \mu_k = \frac{1}{n_k}\sum_{i=1}^{n} \phi_i(k)\, x_i, \qquad \Sigma_k = \frac{1}{n_k}\sum_{i=1}^{n} \phi_i(k)\,(x_i - \mu_k)(x_i - \mu_k)^T$
Comment: The updated value of $\mu_k$ is used when updating $\Sigma_k$.
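A self-contained sketch of this algorithm (the function name, initialization, fixed iteration count, and the small ridge added to each covariance are my own choices, not part of the lecture):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """Maximum likelihood EM for the GMM, following the E- and M-steps above."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                       # uniform mixing weights
    mus = X[rng.choice(n, size=K, replace=False)]  # means initialized at random data points
    Sigmas = np.stack([np.eye(d) for _ in range(K)])
    log_liks = []
    for _ in range(n_iters):
        # E-step: phi[i, k] proportional to pi_k N(x_i | mu_k, Sigma_k)
        log_w = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                          for k in range(K)], axis=1)
        log_norm = logsumexp(log_w, axis=1, keepdims=True)
        phi = np.exp(log_w - log_norm)
        log_liks.append(log_norm.sum())            # marginal log-likelihood at current parameters
        # M-step: closed-form updates for pi, mu, Sigma
        nk = phi.sum(axis=0)                       # (K,)
        pi = nk / n
        mus = (phi.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mus[k]                      # uses the updated mu_k
            Sigmas[k] = (phi[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)
    return pi, mus, Sigmas, phi, log_liks
```

The tracked log-likelihood values should increase toward a plateau, which is the usual way to decide that the "incremental improvement is small", as in the example run below.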

  21. GAUSSIAN MIXTURE MODEL: EXAMPLE RUN
[Figure (a): A random initialization.]

  22. GAUSSIAN MIXTURE MODEL: EXAMPLE RUN
[Figure (b): Iteration 1 (E-step). Assign data to clusters.]

  23. GAUSSIAN MIXTURE MODEL: EXAMPLE RUN
[Figure (c): Iteration 1 (M-step), L = 1. Update the Gaussians.]

  24. GAUSSIAN MIXTURE MODEL: EXAMPLE RUN
[Figure (d): Iteration 2, L = 2. Assign data to clusters and update the Gaussians.]

  25. GAUSSIAN MIXTURE MODEL: EXAMPLE RUN
[Figure (e): Iteration 5 (skipping ahead), L = 5. Assign data to clusters and update the Gaussians.]

  26. GAUSSIAN MIXTURE MODEL: EXAMPLE RUN
[Figure (f): Iteration 20 (convergence), L = 20. Assign data to clusters and update the Gaussians.]
