
CS480/680 Machine Learning Lecture 12: February 13th, 2020



  1. CS480/680 Machine Learning Lecture 12: February 13th, 2020
     Expectation-Maximization
     Zahra Sheikhbahaee, University of Waterloo

  2. Outline
     • K-means clustering
     • Gaussian mixture model
     • EM for Gaussian mixture model
     • EM algorithm

  3. K-means clustering
     • The organization of unlabeled data into similarity groups called clusters.
     • A cluster is a collection of data items which are similar to one another and dissimilar to data items in other clusters.

  4. K-means clustering
     • K-means clustering has been used for image segmentation.
     • In image segmentation, one partitions an image into regions, each of which has a reasonably homogeneous visual appearance or corresponds to objects or parts of objects.
     • In data compression, for an RGB image with N pixels, each of the three channel values is stored with 8-bit precision, so transmitting the original image costs 24N bits. Transmitting the identity of the nearest centroid for each pixel costs N log₂ K bits, and transmitting the K centroid vectors requires 24K bits, so the compressed image costs 24K + N log₂ K bits.
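
As a quick check of this arithmetic, the following sketch computes both costs; the example image size N = 240 × 180 and codebook size K = 10 are our own illustrative choices, not from the slide.

```python
import math

def compression_cost(N, K):
    """Transmission cost in bits for an RGB image, original vs. K-means compressed."""
    original = 24 * N                        # 3 channels x 8 bits per pixel
    compressed = 24 * K + N * math.log2(K)   # K centroid vectors + per-pixel index
    return original, compressed

orig, comp = compression_cost(N=240 * 180, K=10)
print(f"original: {orig} bits, compressed: {comp:.0f} bits")
```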

  5. K-means clustering
     • Let a data set be y_1, …, y_N: N observations of a random D-dimensional Euclidean variable y.
     • The K-means algorithm partitions the given data into K clusters (K is known):
       – Each cluster has a cluster center, called a centroid (μ_k, where k = 1, …, K).
       – The sum of the squares of the distances of each data point to its closest centroid μ_k is a minimum.
       – Each data point y_n has a corresponding set of binary indicator variables r_nk which represent whether data point y_n belongs to cluster k or not:
         r_nk = 1 if y_n is assigned to cluster k, 0 otherwise

  6. K-means clustering
     • Let a data set be y_1, …, y_N: N observations of a random D-dimensional Euclidean variable y.
     • The K-means algorithm partitions the given data into K clusters (K is known):
       – Each cluster has a cluster center, called a centroid (μ_k, where k = 1, …, K).
       – The sum of the squares of the distances of each data point to its closest centroid μ_k is a minimum.
       – Each data point y_n has binary indicator variables r_nk which represent whether y_n belongs to cluster k or not, giving the objective
         J = ∑_{n=1}^{N} ∑_{k=1}^{K} r_nk ∥y_n − μ_k∥²

  7. K-means clustering
     Algorithm:
     Initialize μ_1, …, μ_K
     Iterations:
     • Minimize J with respect to the r_nk, keeping the μ_k fixed, by assigning each data point to the closest centroid.
     • Minimize J with respect to the μ_k, keeping the r_nk fixed, by recomputing the centroids using the current cluster memberships.
     Repeat until convergence.

  8. K-means clustering
     Algorithm:
     Initialize μ_1, …, μ_K
     Iterations:
     • Minimize J with respect to the r_nk, keeping the μ_k fixed, by assigning each data point to the closest centroid:
       r_nk = 1 if k = arg min_j ∥y_n − μ_j∥², 0 otherwise
     • Minimize J with respect to the μ_k, keeping the r_nk fixed, by recomputing the centroids using the current cluster memberships:
       ∂J/∂μ_k = −2 ∑_{n=1}^{N} r_nk (y_n − μ_k) = 0  →  μ_k = ∑_n r_nk y_n / ∑_n r_nk
     Repeat until convergence.
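
The two alternating steps translate almost directly into code. Below is a minimal NumPy sketch; the update equations follow the slide, while the initialization (sampling K data points) and the stopping test (centroid movement) are our own choices.

```python
import numpy as np

def kmeans(Y, K, max_iter=100, seed=0):
    """K-means on an (N, D) data matrix Y; returns centroids and assignments."""
    rng = np.random.default_rng(seed)
    mu = Y[rng.choice(len(Y), size=K, replace=False)]  # initialize centroids
    for _ in range(max_iter):
        # Step 1: minimize J w.r.t. r_nk -- assign each point to its closest centroid
        d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        r = d2.argmin(axis=1)
        # Step 2: minimize J w.r.t. mu_k -- recompute centroids from current members
        new_mu = np.array([Y[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):   # converged: centroids stopped moving
            break
        mu = new_mu
    return mu, r
```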

  9. K-means clustering (cons)
     • The K-means algorithm may converge to a local rather than a global minimum of J.
     • The K-means algorithm is based on the use of squared Euclidean distance as the measure of dissimilarity between a data point and a prototype vector.
     • In the K-means algorithm, every data point is assigned uniquely to one, and only one, of the clusters (hard assignment to the nearest cluster).

  10. Gaussian Mixture Model
     • Let a data set of observations be 𝒀 = {y_1, …, y_N}. A linear superposition of Gaussians:
       p(y) = ∑_{k=1}^{K} π_k 𝒩(y | μ_k, Σ_k)
       π_k: mixing coefficient
     • Let 𝒛 be a K-dimensional binary random variable which satisfies the following condition (a 1-of-K representation):
       ∑_{k=1}^{K} z_k = 1, where z_k ∈ {0, 1}
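
To make the superposition concrete, here is a small sketch that evaluates p(y) = ∑_k π_k 𝒩(y | μ_k, Σ_k) with SciPy; the mixture parameters below are made up for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.5, 0.3, 0.2])   # mixing coefficients, must sum to 1
mus = [np.zeros(2), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]

def mixture_density(y):
    """p(y) as a weighted sum of Gaussian densities."""
    return sum(pi_k * multivariate_normal.pdf(y, mean=mu_k, cov=Sigma_k)
               for pi_k, mu_k, Sigma_k in zip(pi, mus, Sigmas))

print(mixture_density(np.array([1.0, 1.0])))
```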

  11. Gaussian Mixture Model
     • The joint distribution of the observed 𝒚 and the hidden variable 𝒛 is
       p(𝒚, 𝒛) = p(𝒚 | 𝒛) p(𝒛)
       The marginal distribution over 𝒛:
       p(z_k = 1) = π_k, where 0 ≤ π_k ≤ 1
       p(𝒛) = ∏_{k=1}^{K} π_k^{z_k} = Cat(𝒛 | 𝝅)

  12. Gaussian Mixture Model
     • The joint distribution of the observed 𝒚 and the hidden variable 𝒛 is
       p(𝒚, 𝒛) = p(𝒚 | 𝒛) p(𝒛)
       The marginal distribution over 𝒛:
       p(𝒛) = ∏_{k=1}^{K} π_k^{z_k} = Cat(𝒛 | 𝝅)
       The conditional distribution of 𝒚 given a particular value for 𝒛:
       p(𝒚 | z_k = 1) = 𝒩(𝒚 | μ_k, Σ_k)

  13. Gaussian Mixture Model
     • The joint distribution of the observed 𝒚 and the hidden variable 𝒛 is
       p(𝒚, 𝒛) = p(𝒚 | 𝒛) p(𝒛)
       The marginal distribution over 𝒛:
       p(𝒛) = ∏_{k=1}^{K} π_k^{z_k} = Cat(𝒛 | 𝝅)
       The conditional distribution of 𝒚 given 𝒛:
       p(𝒚 | 𝒛) = ∏_{k=1}^{K} 𝒩(𝒚 | μ_k, Σ_k)^{z_k}

  14. Gaussian Mixture Model
     • The joint distribution of the observed 𝒚 and the hidden variable 𝒛 is
       p(𝒚, 𝒛) = p(𝒚 | 𝒛) p(𝒛)
       The marginal distribution over 𝒚:
       p(𝒚) = ∑_𝒛 p(𝒚 | 𝒛) p(𝒛) = ∑_{k=1}^{K} π_k 𝒩(𝒚 | μ_k, Σ_k)
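
The factorization p(𝒚, 𝒛) = p(𝒚 | 𝒛) p(𝒛) also gives a recipe for sampling: draw z from Cat(𝝅), then draw y from the selected Gaussian (ancestral sampling). A minimal sketch, reusing the parameter names from the earlier density snippet; the function name and sample size are our own.

```python
import numpy as np

def sample_gmm(n, pi, mus, Sigmas, seed=0):
    """Draw n samples: z_i ~ Cat(pi), then y_i ~ N(mu_{z_i}, Sigma_{z_i})."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(pi), size=n, p=pi)   # sample the latent component
    Y = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
    return Y, ks
```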

  15. Gaussian Mixture Model
     • The posterior distribution of 𝒛:
       p(z_k = 1 | 𝒚) = p(𝒚 | z_k = 1) p(z_k = 1) / ∑_{j=1}^{K} p(𝒚 | z_j = 1) p(z_j = 1)
                      = π_k 𝒩(𝒚 | μ_k, Σ_k) / ∑_{j=1}^{K} π_j 𝒩(𝒚 | μ_j, Σ_j)
     • Assume an i.i.d. data set. The log-likelihood function is
       ln p(𝒀 | 𝝅, 𝝁, 𝚺) = ln ∏_{n=1}^{N} ∑_{z_n} p(y_n | z_n) p(z_n) = ∑_{n=1}^{N} ln ∑_{k=1}^{K} π_k 𝒩(y_n | μ_k, Σ_k)
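
The posterior (the "responsibility" of component k for point y_n) and the log-likelihood start from the same weighted densities π_k 𝒩(y_n | μ_k, Σ_k), so both can be computed together. A sketch for an (N, D) data matrix Y; the function name is ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(Y, pi, mus, Sigmas):
    """Return gamma[n, k] = p(z_nk = 1 | y_n) and the log-likelihood of Y."""
    # dens[n, k] = pi_k * N(y_n | mu_k, Sigma_k)
    dens = np.column_stack([pi_k * multivariate_normal.pdf(Y, mean=mu_k, cov=Sigma_k)
                            for pi_k, mu_k, Sigma_k in zip(pi, mus, Sigmas)])
    log_likelihood = np.log(dens.sum(axis=1)).sum()   # sum_n ln sum_k pi_k N(...)
    gamma = dens / dens.sum(axis=1, keepdims=True)    # normalize over components
    return gamma, log_likelihood
```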

  16. Gaussian Mixture Model
     The log-likelihood function is
       ln p(𝒀 | 𝝅, 𝝁, 𝚺) = ∑_{n=1}^{N} ln ∑_{k=1}^{K} π_k 𝒩(y_n | μ_k, Σ_k)
     • Problems:
       • Singularities: the likelihood can become arbitrarily large when a Gaussian explains a single point (whenever one of the Gaussian components collapses onto a specific data point).
       • Identifiability: the solution is invariant to permutations. There are a total of K! equivalent solutions because of the K! ways of assigning K sets of parameters to K components.
       • Non-convexity.

  17. Expectation-Maximization for GMM
     • Assume that we do not have access to the complete data set {𝒀, 𝒛}. The actually observed 𝒀 is then considered incomplete data, so we cannot use the complete-data log-likelihood
       ℒ(θ | 𝒀, 𝒛) = ln p(𝒀, 𝒛 | θ)
       Instead, we consider the expected value of the complete-data log-likelihood under the posterior distribution of the latent variable,
       𝔼_{p(𝒛 | 𝒀, θ^old)}[ℒ(θ | 𝒀, 𝒛)] = ∑_𝒛 p(𝒛 | 𝒀, θ^old) ln p(𝒀, 𝒛 | θ)
       which is the Expectation step of the EM algorithm. In the Maximization step, we maximize this expectation.

  18. Expectation-Maximization for GMM
     Algorithm:
     Initialize θ^(0)
     Iterations:
     E step: Evaluate the posterior distribution of the latent variables 𝒛 and compute
       Q(θ, θ^old) = ∑_𝒛 p(𝒛 | 𝒀, θ^old) ln p(𝒀, 𝒛 | θ)
     M step: Evaluate θ^new:
       θ^new = arg max_θ Q(θ, θ^old)
     Check for the convergence of either the log-likelihood or the parameter values; otherwise set θ^old ← θ^new.
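
For a GMM both steps have closed forms, giving the familiar EM loop. The sketch below assembles them; the initialization, the small ridge added to the covariances (which also guards against the singularities noted on slide 16), and the tolerance are our own choices, not prescribed by the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(Y, K, max_iter=200, tol=1e-6, seed=0):
    """Fit a K-component GMM to an (N, D) data matrix Y by EM."""
    rng = np.random.default_rng(seed)
    N, D = Y.shape
    pi = np.full(K, 1.0 / K)
    mus = Y[rng.choice(N, size=K, replace=False)]
    Sigmas = np.stack([np.cov(Y.T) + 1e-6 * np.eye(D)] * K)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E step: responsibilities gamma[n, k] = p(z_nk = 1 | y_n, theta_old)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(Y, mus[k], Sigmas[k])
                                for k in range(K)])
        ll = np.log(dens.sum(axis=1)).sum()
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: closed-form maximization of Q(theta, theta_old)
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mus = (gamma.T @ Y) / Nk[:, None]
        for k in range(K):
            diff = Y - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        if abs(ll - prev_ll) < tol:   # convergence check on the log-likelihood
            break
        prev_ll = ll
    return pi, mus, Sigmas
```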

  19. Expectation-Maximization for GMM
     • The likelihood function of the complete data:
       p(𝒀, 𝒛 | 𝝁, 𝚺, 𝝅) = ∏_{n=1}^{N} ∏_{k=1}^{K} π_k^{z_nk} 𝒩(y_n | μ_k, Σ_k)^{z_nk}
       The log-likelihood:
       ℒ(θ | 𝒀, 𝒛) = ln p(𝒀, 𝒛 | 𝝁, 𝚺, 𝝅) = ∑_{n=1}^{N} ∑_{k=1}^{K} z_nk {ln π_k + ln 𝒩(y_n | μ_k, Σ_k)}
       Compared with the incomplete-data log-likelihood, the summation over k and the logarithm have been interchanged.
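
With 𝒛 available as an (N, K) one-hot matrix, this complete-data log-likelihood is a direct sum, and the logarithm acts on each Gaussian term individually. A short sketch (function name ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def complete_data_log_likelihood(Y, z, pi, mus, Sigmas):
    """sum_n sum_k z_nk { ln pi_k + ln N(y_n | mu_k, Sigma_k) } for one-hot z."""
    K = len(pi)
    log_terms = np.column_stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(Y, mus[k], Sigmas[k])
         for k in range(K)])
    return (z * log_terms).sum()   # z selects the active component per point
```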

