CS480/680 Machine Learning
Lecture 12: February 13, 2020
Expectation-Maximization
Zahra Sheikhbahaee, University of Waterloo
Outline
• K-means Clustering
• Gaussian Mixture Model
• EM for the Gaussian Mixture Model
• The EM Algorithm
K-means Clustering
• Clustering is the organization of unlabeled data into similarity groups called clusters.
• A cluster is a collection of data items that are similar to one another and dissimilar to data items in other clusters.
K-means Clustering
• K-means clustering has been used for image segmentation and data compression.
• In image segmentation, one partitions an image into regions, each of which has a reasonably homogeneous visual appearance or corresponds to objects or parts of objects.
• In data compression, consider an RGB image with $N$ pixels, where each of the three colour channels is stored with 8-bit precision:
  – Transmitting the original image costs $24N$ bits.
  – Transmitting the identity of the nearest centroid for each pixel costs $N \log_2 K$ bits.
  – Transmitting the $K$ centroid vectors costs $24K$ bits.
  – The compressed image therefore costs $24K + N \log_2 K$ bits.
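As a quick sanity check of these costs, here is a minimal Python sketch with illustrative numbers; the image size and the value of $K$ are made up for demonstration and are not from the slides.

```python
import math

# Illustrative numbers (not from the slides): a 1024x768 RGB image
# compressed with K = 16 centroids (vector quantization per pixel).
N = 1024 * 768          # number of pixels
K = 16                  # number of centroids

original_bits = 24 * N                       # 8 bits per channel, 3 channels
compressed_bits = 24 * K + N * math.log2(K)  # centroid table + per-pixel indices

print(f"original:   {original_bits:,} bits")
print(f"compressed: {compressed_bits:,.0f} bits")
print(f"compression ratio: {original_bits / compressed_bits:.1f}x")
```

With these numbers the per-pixel index cost ($N \log_2 K$) dominates, so the compression ratio is roughly $24 / \log_2 K$.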
K-means Clustering
• Let the data set be $x_1, \dots, x_N$: $N$ observations of a random $D$-dimensional Euclidean variable $x$.
• The K-means algorithm partitions the given data into $K$ clusters ($K$ is known):
  – Each cluster has a cluster center, called the centroid ($\mu_k$, where $k = 1, \dots, K$).
  – The sum of squared distances of each data point to its closest centroid $\mu_k$ is minimized.
  – Each data point $x_n$ has a corresponding set of binary indicator variables $r_{nk}$ that represent whether $x_n$ belongs to cluster $k$ or not:
$$r_{nk} = \begin{cases} 1 & \text{if } x_n \text{ is assigned to cluster } k \\ 0 & \text{otherwise} \end{cases}$$
K-means Clustering
• The objective (distortion measure) is the sum of squared distances of each data point to its assigned centroid:
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2$$
• The goal is to find values for the $\{r_{nk}\}$ and the $\{\mu_k\}$ that minimize $J$.
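A minimal NumPy sketch of the distortion measure $J$, assuming X is an (N, D) data array, mu is a (K, D) centroid array, and r is an (N, K) one-hot assignment matrix (these variable names are just for illustration):

```python
import numpy as np

def distortion(X, mu, r):
    """J = sum_n sum_k r_nk * ||x_n - mu_k||^2.

    X  : (N, D) data matrix
    mu : (K, D) centroids
    r  : (N, K) one-hot assignment matrix
    """
    # (N, K) matrix of squared distances between every point and every centroid
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    return float((r * sq_dists).sum())
```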
K-means Clustering
Algorithm: Initialize $\mu_1, \dots, \mu_K$.
Iterate:
• Minimize $J$ with respect to the $r_{nk}$, keeping the $\mu_k$ fixed, by assigning each data point to its closest centroid.
• Minimize $J$ with respect to the $\mu_k$, keeping the $r_{nk}$ fixed, by recomputing the centroids from the current cluster memberships.
Repeat until convergence.
K-means Clustering
Algorithm: Initialize $\mu_1, \dots, \mu_K$.
Iterate:
• Minimize $J$ with respect to the $r_{nk}$, keeping the $\mu_k$ fixed, by assigning each data point to the closest centroid:
$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \| x_n - \mu_j \|^2 \\ 0 & \text{otherwise} \end{cases}$$
• Minimize $J$ with respect to the $\mu_k$, keeping the $r_{nk}$ fixed, by recomputing the centroids from the current cluster memberships:
$$\frac{\partial J}{\partial \mu_k} = -2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) = 0 \;\Rightarrow\; \mu_k = \frac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}$$
Repeat until convergence.
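A minimal NumPy sketch of these two alternating steps; initializing the centroids with K randomly chosen data points is an assumption of this sketch, not something specified on the slides.

```python
import numpy as np

def kmeans(X, K, max_iter=100, tol=1e-6, seed=0):
    """Plain K-means: alternate the assignment (r_nk) and centroid (mu_k) updates."""
    rng = np.random.default_rng(seed)
    N, _ = X.shape
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)  # K random data points

    prev_J = np.inf
    for _ in range(max_iter):
        # Step 1: assign each point to its closest centroid (mu fixed, update r)
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
        labels = sq_dists.argmin(axis=1)

        # Step 2: recompute each centroid as its cluster mean (r fixed, update mu)
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:          # keep the old centroid if a cluster is empty
                mu[k] = members.mean(axis=0)

        # Convergence check on the distortion J
        J = sq_dists[np.arange(N), labels].sum()
        if abs(prev_J - J) < tol:
            break
        prev_J = J
    return mu, labels
```

Note that the assignment step reuses the same (N, K) squared-distance matrix that the distortion $J$ is built from, so each iteration costs essentially one pass over those distances.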
K-means Clustering (Cons)
• The K-means algorithm may converge to a local rather than a global minimum of $J$.
• The algorithm relies on squared Euclidean distance as the measure of dissimilarity between a data point and a prototype vector.
• Every data point is assigned uniquely to one, and only one, cluster (a hard assignment to the nearest cluster).
Gaussian Mixture Model
• Let the data set be $X = \{x_1, \dots, x_N\}$. A Gaussian mixture is a linear superposition of Gaussians:
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
where $\pi_k$ is the mixing coefficient.
• Introduce a binary random variable $z$ with $K$ dimensions (a 1-of-$K$ representation) that satisfies
$$\sum_{k=1}^{K} z_k = 1, \qquad z_k \in \{0, 1\}$$
Gaussian Mixture Model
• The joint distribution of the observed $x$ and the hidden variable $z$ is
$$p(x, z) = p(x \mid z)\, p(z)$$
• The marginal distribution over $z$:
$$p(z_k = 1) = \pi_k, \quad \text{where } 0 \le \pi_k \le 1, \qquad p(z) = \prod_{k=1}^{K} \pi_k^{z_k} = \mathrm{Cat}(z \mid \pi)$$
Gaussian Mixture Model
• The joint distribution of the observed $x$ and the hidden variable $z$ is $p(x, z) = p(x \mid z)\, p(z)$.
• The marginal distribution over $z$: $p(z) = \prod_{k=1}^{K} \pi_k^{z_k} = \mathrm{Cat}(z \mid \pi)$.
• The conditional distribution of $x$ given a particular value for $z$:
$$p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
Gaussian Mixture Model
• The joint distribution of the observed $x$ and the hidden variable $z$ is $p(x, z) = p(x \mid z)\, p(z)$.
• The marginal distribution over $z$: $p(z) = \prod_{k=1}^{K} \pi_k^{z_k} = \mathrm{Cat}(z \mid \pi)$.
• The conditional distribution of $x$ given $z$ can be written as
$$p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}$$
Gaussian Mixture Model
• The joint distribution of the observed $x$ and the hidden variable $z$ is $p(x, z) = p(x \mid z)\, p(z)$.
• The marginal distribution over $x$ is obtained by summing the joint over all states of $z$:
$$p(x) = \sum_{z} p(x \mid z)\, p(z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
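A minimal sketch of evaluating this mixture density in Python; the two-component parameters below are made up purely for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-component mixture in 2-D (parameters are made up)
pi = np.array([0.4, 0.6])                        # mixing coefficients, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

def gmm_density(x):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi[k] * multivariate_normal.pdf(x, mean=mus[k], cov=Sigmas[k])
               for k in range(len(pi)))

print(gmm_density(np.array([1.0, 1.0])))
```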
Gaussian Mixture Model
• The posterior distribution of $z$ (often called the responsibility of component $k$ for $x$):
$$p(z_k = 1 \mid x) = \frac{p(x \mid z_k = 1)\, p(z_k = 1)}{\sum_{j=1}^{K} p(x \mid z_j = 1)\, p(z_j = 1)} = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$$
• Assume the data set is i.i.d. The log-likelihood function is
$$\ln p(X \mid \pi, \mu, \Sigma) = \ln \prod_{n=1}^{N} \sum_{z_n} p(x_n \mid z_n)\, p(z_n) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
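A minimal sketch computing the responsibilities and the log-likelihood for a whole data set, reusing the illustrative parameters above; working in log space with log-sum-exp is an implementation choice for numerical stability, not something from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def responsibilities_and_loglik(X, pi, mus, Sigmas):
    """Return the (N, K) responsibility matrix and the data log-likelihood."""
    K = len(pi)
    # log[ pi_k * N(x_n | mu_k, Sigma_k) ] for every n, k
    log_weighted = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(K)
    ])                                                 # shape (N, K)
    log_norm = logsumexp(log_weighted, axis=1)         # ln sum_k pi_k N(x_n | ...)
    gamma = np.exp(log_weighted - log_norm[:, None])   # posterior p(z_k = 1 | x_n)
    return gamma, log_norm.sum()
```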
Gaussian Mixture Model
The log-likelihood function is
$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
• Problems with maximizing it directly:
  – Singularities: the likelihood can become arbitrarily large when a Gaussian explains a single point, i.e. whenever one of the Gaussian components collapses onto a specific data point.
  – Identifiability: the solution is invariant to permutations of the components; there are $K!$ equivalent solutions because of the $K!$ ways of assigning $K$ sets of parameters to $K$ components.
  – The objective is non-convex.
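A quick numerical illustration of the singularity problem, assuming a 1-D component whose mean sits exactly on one data point; the data point and the variances below are made up. Its log-density at that point grows without bound as the variance shrinks, so the corresponding term in the log-likelihood diverges.

```python
from scipy.stats import norm

x = 1.7   # a single made-up data point; put one component's mean exactly on it
for sigma in [1.0, 0.1, 0.01, 1e-4]:
    # log N(x | mu=x, sigma^2) = -0.5 * log(2*pi*sigma^2) -> +infinity as sigma -> 0
    print(f"sigma={sigma:g}: log-density at the point = "
          f"{norm.logpdf(x, loc=x, scale=sigma):.2f}")
```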
Expectation-Maximization for GMM
• Assume we do not have access to the complete data set $\{X, Z\}$; the actually observed $X$ is incomplete data, so we cannot use the complete-data log-likelihood
$$\ell(\theta; X, Z) = \ln p(X, Z \mid \theta)$$
directly.
• Instead we consider its expected value under the posterior distribution of the latent variables:
$$\mathbb{E}_{p(Z \mid X, \theta^{\text{old}})}\big[\ell(\theta; X, Z)\big] = \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \ln p(X, Z \mid \theta)$$
which is the Expectation step of the EM algorithm. In the Maximization step, we maximize this expectation with respect to $\theta$.
Expectation-Maximization for GMM
Algorithm: Initialize $\theta^{\text{old}}$.
Iterate:
• E step: Evaluate the posterior distribution of the latent variables $p(Z \mid X, \theta^{\text{old}})$ and compute
$$Q(\theta, \theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \ln p(X, Z \mid \theta)$$
• M step: Evaluate $\theta^{\text{new}}$:
$$\theta^{\text{new}} = \arg\max_{\theta} Q(\theta, \theta^{\text{old}})$$
• Check for convergence of either the log-likelihood or the parameter values; otherwise set $\theta^{\text{old}} \leftarrow \theta^{\text{new}}$ and repeat.
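A minimal sketch of this loop for a GMM, assuming the responsibilities_and_loglik helper from the earlier sketch as the E step and an m_step helper (sketched after the next slide) as the M step; the convergence test on the log-likelihood is one of the two criteria the slide mentions.

```python
import numpy as np

def em_gmm(X, pi, mus, Sigmas, max_iter=200, tol=1e-6):
    """Generic EM loop: alternate E and M steps until the log-likelihood stalls."""
    prev_ll = -np.inf
    for _ in range(max_iter):
        gamma, ll = responsibilities_and_loglik(X, pi, mus, Sigmas)  # E step
        pi, mus, Sigmas = m_step(X, gamma)                           # M step
        if abs(ll - prev_ll) < tol:   # convergence check on the log-likelihood
            break
        prev_ll = ll
    return pi, mus, Sigmas, ll
```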
Expectation-Maximization for GMM
• The likelihood function of the complete data:
$$p(X, Z \mid \mu, \Sigma, \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}} \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}$$
• The complete-data log-likelihood:
$$\ell(\theta; X, Z) = \ln p(X, Z \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \{ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \}$$
• Compared with the incomplete-data log-likelihood, the summation over $k$ and the logarithm have been interchanged, which makes the maximization tractable.
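Taking the expectation of this complete-data log-likelihood under the responsibilities and maximizing it gives the standard closed-form GMM M-step updates; the slide does not derive them, so the sketch below simply states them, with gamma being the (N, K) responsibility matrix from the E step and a small diagonal jitter added as a pragmatic guard against the singularity problem mentioned earlier (the jitter is an assumption of this sketch).

```python
import numpy as np

def m_step(X, gamma):
    """Closed-form GMM M-step given responsibilities gamma (N, K)."""
    N, D = X.shape
    Nk = gamma.sum(axis=0)                         # effective number of points per component
    pi = Nk / N                                    # pi_k = N_k / N
    mus = (gamma.T @ X) / Nk[:, None]              # mu_k = (1/N_k) sum_n gamma_nk x_n
    Sigmas = []
    for k in range(len(Nk)):
        diff = X - mus[k]                          # (N, D)
        Sigma_k = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        Sigmas.append(Sigma_k + 1e-6 * np.eye(D))  # jitter keeps covariances non-singular
    return pi, list(mus), Sigmas
```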