

SLIDE 1

Machine Learning

Clustering I

Hamid R. Rabiee

Jafar Muhammadi, Nima Pourdamghani

Spring 2015

http://ce.sharif.edu/courses/93-94/2/ce717-1

SLIDE 2

Agenda

 Unsupervised Learning
 Quality Measurement
 Similarity Measures
 Major Clustering Approaches
 Distance Measuring
 Partitioning Methods
 Hierarchical Methods
 Density Based Methods
 Spectral Clustering
 Other Methods
 Constraint Based Clustering
 Clustering as Optimization

SLIDE 3

Unsupervised Learning

 Clustering, or unsupervised classification, is aimed at discovering natural groupings in a set of data
 Note: all samples in the training set are unlabeled
 Applications of clustering:
 Spatial data analysis: create thematic maps in GIS by clustering feature space
 Image processing: segmentation
 Economic science: discover distinct groups in customer bases
 Internet: document classification
 Gaining insight into the structure of the data prior to classifier design

SLIDE 4

Quality Measurement

 High-quality clusters must have
 high intra-class similarity
 low inter-class similarity
 Some other measures
 Ability to discover hidden patterns
 Judged by the user
 Purity
 Suppose we know the labels of the data; assign to each cluster its most frequent class
 Purity is the number of correctly assigned points divided by the total number of data points
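As a concrete illustration, purity can be computed in a few lines; the following is a minimal NumPy sketch (the label arrays in the comment are illustrative):

```python
import numpy as np

def purity(true_labels, cluster_ids):
    """Purity: assign each cluster its most frequent class, then return
    the number of correctly assigned points divided by the total count."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    correct = 0
    for c in np.unique(cluster_ids):
        members = true_labels[cluster_ids == c]
        # the cluster "votes" for its most frequent class
        correct += np.bincount(members).max()
    return correct / len(true_labels)

# e.g. purity([0, 0, 1, 1, 2], [0, 0, 0, 1, 1]) == 3/5
```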

SLIDE 5

Similarity Measures

 Distances are normally used to measure the similarity or dissimilarity between two data objects
 Some popular distances are Minkowski and Mahalanobis
 Distance between binary strings (Hamming distance):
$d(S_1, S_2) = |\{\, i : s_{1,i} \neq s_{2,i} \,\}|$
 Distance between vector objects (the cosine measure):
$d(X, Y) = \dfrac{X^T Y}{\lVert X \rVert \, \lVert Y \rVert}$
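The measures named on this slide can be written out directly; a hedged NumPy sketch (the function names are illustrative, not the course's reference code):

```python
import numpy as np

def minkowski(x, y, p=2):
    # p = 1: Manhattan, p = 2: Euclidean
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def mahalanobis(x, y, cov):
    # Euclidean distance rescaled by the data covariance matrix
    d = x - y
    return np.sqrt(d @ np.linalg.inv(cov) @ d)

def hamming(s1, s2):
    # number of positions where two binary strings differ
    return int(np.sum(np.asarray(s1) != np.asarray(s2)))

def cosine_measure(x, y):
    # the vector measure above: x^T y / (||x|| ||y||)
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
```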

SLIDE 6

Major Clustering Approaches

 Partitioning approach

 Construct various partitions and then evaluate them by some criterion (ex. k-means, c-means, k-medoids)

 Hierarchical approach

 Create a hierarchical decomposition of the set of data using some criterion (ex. AGNES)

 Density-based approach

 Based on connectivity and density functions (ex. DBSCAN, OPTICS)

 Graph-based approach (Spectral Clustering)

 Approximately optimizes the normalized cut criterion

 Grid-based approach

 based on a multiple-level granularity structure (ex. STING, WaveCluster, CLIQUE)

 Model-based

 A model is hypothesized for each of the clusters, and the algorithm tries to find the best fit of that model to the data (ex. EM, SOM)

SLIDE 7

Distance Measuring

 Single link
 smallest distance between an element in one cluster and an element in the other
 Complete link
 largest distance between an element in one cluster and an element in the other
 Average
 average distance between an element in one cluster and an element in the other
 Centroid
 distance between the centroids of two clusters
 Used in k-means
 Medoid
 distance between the medoids of two clusters
 Medoid: a representative object whose average dissimilarity to all the objects in the cluster is minimal
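A compact sketch of these five cluster-to-cluster distances, assuming clusters are given as NumPy arrays of points:

```python
import numpy as np

def pairwise(A, B):
    # matrix of Euclidean distances between every a in A and b in B
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

single_link   = lambda A, B: pairwise(A, B).min()
complete_link = lambda A, B: pairwise(A, B).max()
average_link  = lambda A, B: pairwise(A, B).mean()

def centroid_dist(A, B):
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

def medoid(A):
    # point with minimal average dissimilarity to the rest of its cluster
    return A[pairwise(A, A).mean(axis=1).argmin()]

def medoid_dist(A, B):
    return np.linalg.norm(medoid(A) - medoid(B))
```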

SLIDE 8

Partitioning Methods

 Construct a partition of n data points into a set of k clusters such that the sum of squared distances is minimized:
$\min \sum_{m=1}^{k} \sum_{x_j \in \text{Cluster}_m} (x_j - C_m)^2$
where the $C_m$ are cluster representatives
 Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
 Global optimum: exhaustively enumerate all partitions
 Heuristic methods: the k-means, c-means, and k-medoids algorithms
 k-means: each cluster is represented by the center of the cluster
 c-means: the fuzzy version of k-means
 k-medoids: each cluster is represented by one of the samples in the cluster

SLIDE 9

Partitioning Methods: k-means

 k-means
 Suppose we know there are K categories and each category is represented by its sample mean
 Given a set of unlabeled training samples, how do we estimate the means?
 Algorithm k-means (k) (a sketch follows below)
  • 1. Partition the samples into k non-empty subsets (random initialization)
  • 2. Compute the mean points of the clusters of the current partition
  • 3. Assign each sample to the cluster with the nearest mean point
  • 4. Go back to step 2; stop when there are no new assignments
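A minimal NumPy sketch of the algorithm above; the random partition of step 1 is replaced by choosing k random samples as initial means, a common variant:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialization: k random samples as the initial means
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    assign = None
    for _ in range(max_iter):
        # assign each sample to the cluster with the nearest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=-1)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # stop: no new assignments
        assign = new_assign
        # recompute the mean point of each cluster
        for j in range(k):
            if np.any(assign == j):
                means[j] = X[assign == j].mean(axis=0)
    return means, assign
```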
SLIDE 10

Partitioning Methods: k-means

 Some notes on k-means
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers (Why?)
 Not suitable for discovering clusters with non-convex shapes (Why?)
 The algorithm is sensitive to
 the number of cluster centers,
 the choice of initial cluster centers,
 the sequence in which the data are processed (Why?)
 Convergence to the global optimum is not guaranteed, but the results are acceptable if there are well-separated clusters

SLIDE 11

Partitioning Methods: c-means

 The membership function μ_il expresses to what degree x_l belongs to class C_i
 Crisp clustering: x_l can belong to one class only:
$\mu_{il} = \begin{cases} 1 & \text{if } x_l \in C_i \\ 0 & \text{if } x_l \notin C_i \end{cases}$
 Fuzzy clustering: x_l belongs to all classes simultaneously, with varying degrees of membership:
$\mu_{il} = \dfrac{\bigl( 1 / d(z_i^{(m)}, x_l) \bigr)^{1/(q-1)}}{\sum_{j=1}^{k} \bigl( 1 / d(z_j^{(m)}, x_l) \bigr)^{1/(q-1)}}, \qquad \sum_{i=1}^{k} \mu_{il} = 1, \quad \text{for } l = 1, 2, \ldots, N$
 where the $z^{(m)}$ are cluster means
 q is a fuzziness index with 1 < q < 2
 Fuzzy clustering becomes crisp clustering when q → 1
 Observe that c-means minimizes (a sketch of one iteration follows below)
$J_e = \sum_{i=1}^{k} J_i^f, \qquad J_i^f = \sum_{l=1}^{N} \mu_{il}^{\,q} \, \lVert z_i^{(m)} - x_l \rVert^2$
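A sketch of a single c-means iteration implementing the membership and mean updates above; squared Euclidean distance is assumed for d(·,·), and the function name is illustrative:

```python
import numpy as np

def cmeans_step(X, Z, q=1.5, eps=1e-12):
    """One fuzzy c-means iteration: update memberships, then cluster means.
    X: (N, d) data, Z: (k, d) current cluster means, fuzziness q > 1."""
    # d(z_i, x_l): squared Euclidean distance (an assumption in this sketch)
    d = ((Z[:, None, :] - X[None, :, :]) ** 2).sum(-1) + eps   # shape (k, N)
    inv = (1.0 / d) ** (1.0 / (q - 1.0))
    mu = inv / inv.sum(axis=0, keepdims=True)   # memberships: columns sum to 1
    # weighted means with weights mu^q (minimizes J_e for fixed memberships)
    w = mu ** q
    Z_new = (w @ X) / w.sum(axis=1, keepdims=True)
    return mu, Z_new
```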

SLIDE 12

Partitioning Methods: k-medoids

 k-medoids
 Instead of taking the mean value of the samples in a cluster as a reference point, medoids can be used
 Note that choosing the new medoids is slightly different from choosing the new means in the k-means algorithm
 Algorithm k-medoids (k) (a sketch follows below)
  • 1. Select k representative samples arbitrarily
  • 2. Associate each data point with the closest medoid
  • 3. For each medoid m and data point o:
swap m and o and compute the total cost of the configuration
  • 4. Select the configuration with the lowest cost
  • 5. Repeat steps 2-4 until there is no change
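A PAM-style sketch of this swap procedure with brute-force cost evaluation (fine for the small data sets the next slide mentions), assuming a precomputed distance matrix D:

```python
import numpy as np

def kmedoids(D, k, max_iter=50, seed=0):
    """D: (n, n) pairwise distance matrix. Returns medoid indices."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)

    def cost(meds):
        # total distance of every point to its closest medoid
        return D[:, meds].min(axis=1).sum()

    for _ in range(max_iter):
        best = (cost(medoids), medoids.copy())
        # try swapping each medoid m with each non-medoid o
        for i in range(k):
            for o in range(len(D)):
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o
                c = cost(trial)
                if c < best[0]:
                    best = (c, trial)
        if np.array_equal(best[1], medoids):
            break  # no improving swap: converged
        medoids = best[1]
    return medoids
```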
SLIDE 13

Partitioning Methods: k-medoids

 Some notes on k-medoids
 k-medoids is more robust than k-means in the presence of noise and outliers (Why?)
 Works effectively for small data sets, but does not scale well to large data sets
 For large data sets we can use sampling-based methods (How?)

SLIDE 14

Hierarchical Methods

 Clusters have sub-clusters, and sub-clusters can have sub-sub-clusters, …
 Uses a distance matrix as the clustering criterion
 This method does not require the number of clusters k as an input, but needs a termination condition

(Figure: points a, b, c, d, e are clustered agglomeratively (AGNES) from step 0 to step 3, e.g. {a, b}, {d, e}, then {c, d, e}, then {a, b, c, d, e}; divisive clustering (DIANA) runs the same steps in reverse, from step 3 back to step 0.)

SLIDE 15

Hierarchical Methods

 Agglomerative hierarchical clustering
 AGNES (Agglomerative Nesting)
 Uses the single-link method
 Merges nodes (clusters) that have the maximum similarity
 Divisive hierarchical clustering
 DIANA (Divisive Analysis)
 Inverse order of AGNES
 Eventually each node forms a cluster on its own

SLIDE 16

Hierarchical Methods

 Dendrogram
 Shows how the clusters are merged
 Decomposes the samples into several levels of nested partitioning (a tree of clusters), called a dendrogram
 A clustering of the samples is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster (see the sketch below)
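With SciPy, the dendrogram and the cut can be obtained directly from the data; a short sketch with illustrative random data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))   # illustrative data
Z = linkage(X, method='single')   # AGNES-style merging with single link
labels = fcluster(Z, t=3, criterion='maxclust')   # cut into 3 clusters
dendrogram(Z)   # draw the merge tree (plotting requires matplotlib)
```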

SLIDE 17

Density-Based Methods

 Clustering based on density (a local cluster criterion), such as density-connected points
 Major features:
 Discover clusters of arbitrary shape
 Handle noise
 Need density parameters as a termination condition

SLIDE 18

Density-Based Methods

 Main concepts:
 Parameters:
 Eps: maximum radius of the neighbourhood
 MinPts: minimum number of points in an Eps-neighbourhood of a point
 Sample q is directly density-reachable from sample p if d(p, q) ≤ Eps and p has at least MinPts points in its Eps-neighbourhood
 Sample q is density-reachable from sample p if there is a chain of points p1, …, pn, with p1 = p and pn = q, such that each pi+1 is directly density-reachable from pi
 Sample p is density-connected to sample q if there is a sample o such that both p and q are density-reachable from o

SLIDE 19

Density-Based Methods: DBSCAN

 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
 Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial data with noise
 Algorithm DBSCAN (Eps, MinPts) (a sketch follows below)
 Arbitrarily select a sample p
 Retrieve all samples density-reachable from p w.r.t. Eps and MinPts
 If p is a core sample (some samples are density-reachable from p), a cluster is formed
 If p is a border sample (no samples are density-reachable from p), DBSCAN visits the next sample of the database
 Continue the process until all of the samples have been processed
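A compact sketch of DBSCAN following the definitions of the previous slide; region queries are brute force, and the label −1 marks noise or not-yet-visited samples:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)          # -1 = noise / unvisited
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    cluster = 0
    for p in range(n):
        if labels[p] != -1:
            continue                 # already assigned to a cluster
        neigh = list(np.flatnonzero(D[p] <= eps))
        if len(neigh) < min_pts:
            continue                 # border or noise sample for now
        labels[p] = cluster          # p is a core sample: start a cluster
        queue = neigh
        while queue:                 # expand the density-reachable set
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster
                q_neigh = np.flatnonzero(D[q] <= eps)
                if len(q_neigh) >= min_pts:   # q is also a core sample
                    queue.extend(q_neigh)
        cluster += 1
    return labels
```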

SLIDE 20

Graph-based Clustering

 Represent data points as the vertices V of a graph G.
 All pairs of vertices are connected by an edge E.
 Edges have weights W.
 Large weights mean that the adjacent vertices are very similar; small weights imply dissimilarity.

SLIDE 21

Graph-based Clustering

 Clustering on a graph is equivalent to partitioning the vertices of the graph
 A loss function for a partition of V into sets A and B:
$\mathrm{cut}(A, B) = \sum_{u \in A,\, v \in B} W_{u,v}$
 In a good partition, vertices in different partitions will be dissimilar
 Mincut criterion: find a partition A, B that minimizes cut(A, B)
 The mincut criterion ignores the size of the subgraphs formed

SLIDE 22

Graph-based Clustering

 The normalized cut criterion favors balanced partitions:
$\mathrm{Ncut}(A, B) = \dfrac{\mathrm{cut}(A, B)}{\sum_{u \in A,\, v \in V} W_{u,v}} + \dfrac{\mathrm{cut}(A, B)}{\sum_{u \in B,\, v \in V} W_{u,v}}$
 Minimizing the normalized cut criterion exactly is NP-hard
 One way of approximately optimizing the normalized cut criterion leads to spectral clustering

SLIDE 23

Spectral Clustering

 Spectral clustering
 Looks for a new representation of the original data points such that
 the edge weights are preserved, and
 convex cluster shapes in the new space represent non-convex ones in the original space
 Cluster the points in the new space using any clustering scheme (say, k-means)
 We only describe the resulting algorithm here
 For more information about the derivations, refer to U. von Luxburg, “A Tutorial on Spectral Clustering”

SLIDE 24

Spectral Clustering

 Inputs
 Set of points $S = \{s_1, \ldots, s_n\}$, $s_i \in \mathbb{R}^l$, and number of clusters k
 Algorithm
 Form the edge-weight matrix $W \in \mathbb{R}^{n \times n}$, for example
$W_{ij} = \begin{cases} e^{-\lVert s_i - s_j \rVert^2 / 2\sigma^2} & \text{if } i \neq j \\ 0 & \text{else} \end{cases}$
 $\sigma$ is a scaling parameter chosen by the user
 Define D, a diagonal matrix whose (i, i) element is the sum of W’s row i
 Form the matrix $L = D^{-1/2} W D^{-1/2}$
 Find the k largest eigenvectors of L to form the matrix $X \in \mathbb{R}^{n \times k}$
SLIDE 25

Spectral Clustering

 Algorithm (cont.)
 Normalize the rows of $X_{n \times k}$ to unit length to form the matrix $Y_{n \times k}$:
$Y_{ij} = X_{ij} \big/ \bigl( \textstyle\sum_j X_{ij}^2 \bigr)^{1/2}$
 Treat each row of Y as a point in $\mathbb{R}^k$ (dimensionality reduction from n to k)
 Cluster the new data into k clusters via k-means (see the sketch below)
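Putting the last two slides together, a NumPy sketch of the whole spectral clustering procedure; SciPy's kmeans2 stands in for the final k-means step:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(S, k, sigma=1.0):
    # edge-weight matrix W with Gaussian weights and zero diagonal
    sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # D: diagonal matrix of row sums; L = D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # k largest eigenvectors of the symmetric matrix L
    vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    X = vecs[:, -k:]
    # normalize rows of X to unit length to get Y
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    # cluster the rows of Y with k-means
    _, labels = kmeans2(Y, k, seed=0)
    return labels
```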

SLIDE 26

Spectral Clustering

 Example
 A simple edge-weight matrix ($d(x_i, x_j)$ denotes the Euclidean distance between points $x_i$, $x_j$, and $\theta = 1$):
$W_{i,j} = W_{j,i} = \begin{cases} 1 & \text{if } d(x_i, x_j) \le \theta \\ 0 & \text{otherwise} \end{cases}$
 For four points a, b, c, d falling into two pairs, W is a block matrix of 1s, and the top eigenvectors indicate the clusters, e.g.
$e_1 = (0.7, 0.7, 0, 0)^T$, $e_2 = (0, 0, 0.7, 0.7)^T$ for the point ordering (a, b, c, d), versus
$e_1 = (0.7, 0, 0.7, 0)^T$, $e_2 = (0, 0.7, 0, 0.7)^T$ for the point ordering (a, c, b, d)

SLIDE 27

Spectral Clustering

 Another example

SLIDE 28

Other Methods

 Grid-based methods
 Use a multi-resolution grid data structure:

  • 1. Create the grid structure, i.e., partition the data space into a finite number of cells
  • 2. Calculate the cell density for each cell
  • 3. Sort the cells according to their densities
  • 4. Identify cluster centers
  • 5. Traverse the neighbor cells
SLIDE 29

Other Methods

 Model-based methods
 Attempt to optimize the fit between the given data and some mathematical model
 Based on the assumption that the data are generated by a mixture of underlying probability distributions
 Typical methods:
 Statistical approach: EM (Expectation Maximization) – will be discussed later
 Neural network approach: SOM (Self-Organizing Feature Map)

SLIDE 30

Constraint-Based Clustering

 Why constraint-based clustering?
 Need user feedback: users know their applications best
 Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem: obstacle & desired clusters
 Different constraints in cluster analysis:
 Constraints on individual samples (do selection first)
 Cluster on samples which …
 Constraints on distance or similarity functions
 Weighted functions, obstacles
 Constraints on the selection of clustering parameters
 Number of clusters, limitation of each cluster's size
 User-specified constraints
 Some samples must be in a given cluster and some others must not!
 Semi-supervised: giving small training sets as “constraints” or hints

SLIDE 31

Constraint-Based Clustering

 A sample data set and two answers (one taking the constraints into account and one not)
 Constraint: the data on different sides of each “wall” should be in different clusters

SLIDE 32

Clustering as Optimization

 Clustering can be posed as the optimization of a criterion function:
 The sum-of-squared-error criterion
 Scatter criteria
 The given criterion function is optimized through iterative optimization

SLIDE 33

Any Question?

End of Lecture 19 Thank you!

Spring 2015

http://ce.sharif.edu/courses/93-94/2/ce717-1

SLIDE 34

Machine Learning

Clustering II

Hamid R. Rabiee [Slides are based on Bishop Book]

Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1

SLIDE 35

K-means Clustering

 Problem of identifying groups, or clusters, of data points in a multidimensional space
 Partitioning the data set into some number K of clusters
 Cluster: a group of data points whose inter-point distances are small compared with the distances to points outside the cluster
 Goal: an assignment of data points to clusters such that the sum of the squares of the distances from each data point to its closest prototype vector (the center of its cluster) is a minimum

 Objective function, called the distortion measure:
$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2$

SLIDE 36

K-means Clustering

 Two-stage optimization
 In the 1st stage: minimize J with respect to the r_nk, keeping the μ_k fixed:
$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert x_n - \mu_j \rVert^2 \\ 0 & \text{otherwise} \end{cases}$
 In the 2nd stage: minimize J with respect to the μ_k, keeping the r_nk fixed:
$\mu_k = \dfrac{\sum_n r_{nk} \, x_n}{\sum_n r_{nk}}$
(the mean of all of the data points assigned to cluster k)

SLIDE 37

K-means Clustering

SLIDE 38

Mixtures of Gaussians

 A Gaussian mixture distribution can be written as a linear superposition of Gaussians:
$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$ …(*)
 An equivalent formulation of the Gaussian mixture involves an explicit latent variable
 Graphical representation of the mixture model: a binary random variable z having a 1-of-K representation
 The marginal distribution of x is a Gaussian mixture of the form (*) (for every observed data point x_n, there is a corresponding latent variable z_n)

SLIDE 39

Mixtures of Gaussians

 γ(z_k) can also be viewed as the responsibility that component k takes for explaining the observation x:
$\gamma(z_k) \equiv p(z_k = 1 \mid x) = \dfrac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$

SLIDE 40

Mixtures of Gaussians

 Generating random samples distributed according to the Gaussian mixture model:
 generate a value for z (denote it ẑ) from the marginal distribution p(z), and then generate a value for x from the conditional distribution p(x | ẑ)

SLIDE 41

Mixtures of Gaussians

a. The three states of z, corresponding to the three components of the mixture, are depicted in red, green, and blue
b. The corresponding samples from the marginal distribution p(x)
c. The same samples, where the colors represent the values of the responsibilities γ(z_nk) associated with each data point
 Illustrating the responsibilities by evaluating the posterior probability for each component of the mixture distribution from which this data set was generated

SLIDE 42

Maximum Likelihood

 Graphical representation of a Gaussian mixture model for a set of N i.i.d. data points {x_n}, with corresponding latent points {z_n}
 The log of the likelihood function:
$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \Bigl\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \Bigr\}$ ….(*1)

SLIDE 43

Maximum Likelihood

 For simplicity, consider a Gaussian mixture whose components have covariance matrices given by $\Sigma_k = \sigma_k^2 I$
 Suppose that one of the components of the mixture model has its mean μ_j exactly equal to one of the data points, so that μ_j = x_n
 This data point contributes a term to the likelihood function of the form $\mathcal{N}(x_n \mid x_n, \sigma_j^2 I) \propto \sigma_j^{-D}$, which diverges as $\sigma_j \to 0$
 Once there are at least two components in the mixture, one component can have a finite variance and therefore assign finite probability to all of the data points, while the other component can shrink onto one specific data point and thereby contribute an ever-increasing additive value to the log likelihood → over-fitting problem

SLIDE 44

Maximum Likelihood

 Over-fitting problem
 In applying maximum likelihood to Gaussian mixture models, there should be steps to avoid finding such pathological solutions, and instead to seek local maxima of the likelihood function that are well behaved
 Identifiability problem
 A K-component mixture will have a total of K! equivalent solutions, corresponding to the K! ways of assigning K sets of parameters to K components
 Difficulty of maximizing the log likelihood function
 The presence of the summation over k inside the logarithm gives no closed-form solution, unlike the single-Gaussian case

SLIDE 45

EM for Gaussian Mixtures

I. Assign some initial values to the means, covariances, and mixing coefficients
II. Expectation, or E step
  • Use the current values of the parameters to evaluate the posterior probabilities, or responsibilities
III. Maximization, or M step
  • Use the result of II to re-estimate the means, covariances, and mixing coefficients
 It is common to run the K-means algorithm first in order to find suitable initial values:
  • the covariance matrices → the sample covariances of the clusters found by the K-means algorithm
  • the mixing coefficients → the fractions of data points assigned to the respective clusters

SLIDE 46

EM for Gaussian Mixtures

 Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters
1. Initialize the means μ_k, covariances Σ_k, and mixing coefficients π_k
2. E step: evaluate the responsibilities using the current parameter values:
$\gamma(z_{nk}) = \dfrac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$
3. M step: re-estimate the parameters using the current responsibilities:
$\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n, \qquad \Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^T, \qquad \pi_k^{\text{new}} = \frac{N_k}{N}, \quad N_k = \sum_{n=1}^{N} \gamma(z_{nk})$

SLIDE 47

EM for Gaussian Mixtures

 Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters
4. Evaluate the log likelihood
$\ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \Bigl\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \Bigr\}$ …(*2)
and check for convergence of either the parameters or the log likelihood; if the convergence criterion is not satisfied, return to step 2
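A sketch of the full EM loop for a Gaussian mixture following steps 1-4 above; K-means initialization is omitted for brevity, and SciPy's multivariate_normal supplies the Gaussian density:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # 1. initialize means, covariances, mixing coefficients
    mu = X[rng.choice(N, K, replace=False)].copy()
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)
    ll_old = -np.inf
    for _ in range(n_iter):
        # 2. E step: responsibilities gamma(z_nk)
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)        # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # 3. M step: re-estimate the parameters
        Nk = gamma.sum(axis=0)                               # effective counts
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # 4. evaluate the log likelihood and test convergence
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - ll_old) < tol:
            break
        ll_old = ll
    return pi, mu, Sigma, gamma
```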

SLIDE 48

EM for Gaussian Mixtures

 Setting the derivatives of (*2) with respect to the means of the Gaussian components to zero gives
$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})$
that is, a weighted mean of all of the points in the data set:
  • each data point is weighted by the corresponding posterior probability (the responsibility γ(z_nk))
  • the denominator is the effective number of points associated with the corresponding component
 Setting the derivatives of (*2) with respect to the covariances of the Gaussian components to zero gives
$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T$

SLIDE 49

EM for Gaussian Mixtures

SLIDE 50

An Alternative View of EM

 In maximizing the log likelihood function, the summation over k prevents the logarithm from acting directly on the joint distribution
 Instead, the log likelihood function for the complete data set {X, Z} is straightforward to maximize
 In practice, since we are not given the complete data set, we consider instead its expected value Q under the posterior distribution p(Z | X, Θ) of the latent variables

SLIDE 51

An Alternative View of EM

General EM
1. Choose an initial setting for the parameters Θ^old
2. E step: evaluate p(Z | X, Θ^old)
3. M step: evaluate Θ^new given by Θ^new = argmax_Θ Q(Θ, Θ^old), where Q(Θ, Θ^old) = Σ_Z p(Z | X, Θ^old) ln p(X, Z | Θ)
4. If the convergence criterion is not satisfied, let Θ^old ← Θ^new and return to step 2

SLIDE 52

Gaussian Mixtures Revisited

 Maximizing the likelihood for the complete data {X, Z}:
$\ln p(X, Z \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \bigl\{ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \bigr\}$
 The logarithm acts directly on the Gaussian distribution → a much simpler maximum likelihood problem: the maximization with respect to a mean or a covariance is exactly as for a single Gaussian (closed form)

SLIDE 53

Gaussian Mixtures Revisited

 Unknown latent variables → consider the expectation of the complete-data log likelihood with respect to the posterior distribution of the latent variables
 Posterior distribution (ref. (9.10), (9.11)):
$p(Z \mid X, \mu, \Sigma, \pi) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \bigl[ \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \bigr]^{z_{nk}}$
 The expected value of the indicator variable under this posterior distribution:
$\mathbb{E}[z_{nk}] = \gamma(z_{nk}) = \dfrac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$
 The expected value of the complete-data log likelihood function:
$\mathbb{E}_Z[\ln p(X, Z \mid \mu, \Sigma, \pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \bigl\{ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \bigr\}$ …(*3)

SLIDE 54

Relation to K-means

 K-means performs a hard assignment of data points to clusters (each data point is associated uniquely with one cluster)
 EM makes a soft assignment based on the posterior probabilities
 K-means can be derived as a particular limit of EM for Gaussian mixtures with shared covariances $\Sigma_k = \epsilon I$:
$\gamma(z_{nk}) = \dfrac{\pi_k \exp\{ -\lVert x_n - \mu_k \rVert^2 / 2\epsilon \}}{\sum_j \pi_j \exp\{ -\lVert x_n - \mu_j \rVert^2 / 2\epsilon \}}$
As ε gets smaller, the terms in the denominator for which $\lVert x_n - \mu_j \rVert^2$ is largest go to zero most quickly; hence the responsibilities all go to zero except for the term k with the smallest $\lVert x_n - \mu_k \rVert^2$, whose responsibility goes to unity

SLIDE 55

Relation to K-means

 In this limit,
$p(x \mid \mu_k, \Sigma_k) = \dfrac{1}{(2\pi\epsilon)^{D/2}} \exp\Bigl\{ -\dfrac{\lVert x - \mu_k \rVert^2}{2\epsilon} \Bigr\}, \qquad \mathbb{E}_Z[\ln p(X, Z \mid \mu, \Sigma, \pi)] \to -\frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2 + \text{const}$
 Thus maximizing the expected complete-data log likelihood is equivalent to minimizing the distortion measure J of K-means
 (In elliptical K-means, the covariance is estimated as well)

SLIDE 56

(1 1 ) (2 2 )

Mi Mixtures of B xtures of Bernoull ernoulli di i distr stributi ibutions

  • ns

SLIDE 57

Mixtures of Bernoulli Distributions

 Single component (x and μ are D-dimensional, $x = (x_1, \ldots, x_D)^T$, $\mu = (\mu_1, \ldots, \mu_D)^T$):
$p(x \mid \mu) = \prod_{i=1}^{D} \mu_i^{x_i} (1 - \mu_i)^{1 - x_i}, \qquad \mathbb{E}[x] = \mu, \qquad \mathrm{Cov}[x] = \mathrm{diag}\{ \mu_i (1 - \mu_i) \}$
 Mixture ($\mu = \{\mu_1, \ldots, \mu_K\}$, $\pi = \{\pi_1, \ldots, \pi_K\}$):
$p(x \mid \mu, \pi) = \sum_{k=1}^{K} \pi_k \, p(x \mid \mu_k), \qquad p(x \mid \mu_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}$
$\mathbb{E}[x] = \sum_{k=1}^{K} \pi_k \mu_k, \qquad \mathrm{Cov}[x] = \sum_{k=1}^{K} \pi_k \bigl\{ \Sigma_k + \mu_k \mu_k^T \bigr\} - \mathbb{E}[x]\,\mathbb{E}[x]^T, \quad \Sigma_k = \mathrm{diag}\{ \mu_{ki} (1 - \mu_{ki}) \}$
 * Because the covariance matrix Cov[x] is no longer diagonal, the mixture distribution can capture correlations between the variables, unlike a single Bernoulli distribution

SLIDE 58

Mixtures of Bernoulli Distributions

 Log likelihood:
$\ln p(X \mid \mu, \pi) = \sum_{n=1}^{N} \ln \Bigl\{ \sum_{k=1}^{K} \pi_k \, p(x_n \mid \mu_k) \Bigr\}$
 Complete-data log likelihood ($z = (z_1, \ldots, z_K)^T$ is a binary indicator variable):
$\ln p(X, Z \mid \mu, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \Bigl\{ \ln \pi_k + \sum_{i=1}^{D} \bigl[ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln (1 - \mu_{ki}) \bigr] \Bigr\}$
 E step:
$\gamma(z_{nk}) = \mathbb{E}[z_{nk}] = \dfrac{\pi_k \, p(x_n \mid \mu_k)}{\sum_{j=1}^{K} \pi_j \, p(x_n \mid \mu_j)}$
$\mathbb{E}_Z[\ln p(X, Z \mid \mu, \pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \Bigl\{ \ln \pi_k + \sum_{i=1}^{D} \bigl[ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln (1 - \mu_{ki}) \bigr] \Bigr\}$
 M step:
$\mu_k = \bar{x}_k = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad \pi_k = \dfrac{N_k}{N}$
 * In contrast to the mixture of Gaussians, there are no singularities in which the likelihood goes to infinity (a sketch follows below)
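A sketch of EM for a Bernoulli mixture following the equations above; X is an N×D binary matrix, and a small clamp on μ keeps the logarithms finite:

```python
import numpy as np

def em_bernoulli_mixture(X, K, n_iter=50, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mu = rng.uniform(0.25, 0.75, size=(K, D))   # component means mu_k
    pi = np.full(K, 1.0 / K)                    # mixing coefficients
    for _ in range(n_iter):
        # E step: log p(x_n | mu_k) summed over pixels, then responsibilities
        log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T   # (N, K)
        log_p += np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: mu_k = weighted mean of the data, pi_k = N_k / N
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        mu = np.clip(mu, 1e-6, 1 - 1e-6)            # keep the logs finite
        pi = Nk / N
    return pi, mu, gamma
```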

SLIDE 59

Mixtures of Bernoulli Distributions

 N = 600 digit images, three mixture components
 A mixture of K = 3 Bernoulli distributions fitted by 10 EM iterations
 Parameters shown for each of the three components, and for a single multivariate Bernoulli fitted to the same data
 The analysis of Bernoulli mixtures can be extended to the case of multinomial variables having M > 2 states (Ex. 9.19)

SLIDE 60

References

 Slides of Chapter 9 of the Bishop book, adapted from the Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/

SLIDE 61

Any Question?

End of Lecture 20 Thank you!

Spring 2015

http://ce.sharif.edu/courses/93-94/2/ce717-1