

  1. DATA MINING LECTURE 8 The EM Algorithm Clustering Validation Sequence segmentation

  2. CLUSTERING

  3. What is a Clustering? • In general, a grouping of objects such that the objects in a group (cluster) are similar (or related) to one another and different from (or unrelated to) the objects in other groups • Intra-cluster distances are minimized • Inter-cluster distances are maximized

  4. Clustering Algorithms • K-means and its variants • Hierarchical clustering • DBSCAN

  5. MIXTURE MODELS AND THE EM ALGORITHM

  6. Model-based clustering • In order to understand our data, we will assume that there is a generative process (a model) that creates/describes the data, and we will try to find the model that best fits the data. • Models of different complexity can be defined, but we will assume that our model is a distribution from which data points are sampled • Example: the data is the height of all people in Greece • In most cases, a single distribution is not good enough to describe all data points: different parts of the data follow a different distribution • Example: the data is the height of all people in Greece and China • We need a mixture model • Different distributions correspond to different clusters in the data.

  7. Gaussian Distribution • Example: the data is the height of all people in Greece • Experience has shown that this data follows a Gaussian (Normal) distribution • Reminder, the Normal distribution: $P(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ • $\mu$ = mean, $\sigma$ = standard deviation
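The density above is straightforward to evaluate directly. A minimal sketch in Python (NumPy assumed; the height figures are illustrative, not values from the lecture):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, following the formula above."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Illustrative: density of a 180 cm height under a hypothetical N(177, 7)
print(normal_pdf(180.0, mu=177.0, sigma=7.0))
```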

  8. Gaussian Model • What is a model? • A Gaussian distribution is fully defined by the mean $\mu$ and the standard deviation $\sigma$ • We define our model as the pair of parameters $\theta = (\mu, \sigma)$ • This is a general principle: a model is defined as a vector of parameters $\theta$

  9. Fitting the model • We want to find the normal distribution that best fits our data • Find the best values for $\mu$ and $\sigma$ • But what does "best fit" mean?

  10. Maximum Likelihood Estimation (MLE) • Suppose that we have a vector $X = (x_1, \dots, x_n)$ of values and we want to fit a Gaussian $N(\mu, \sigma)$ model to the data • Probability of observing point $x_i$: $P(x_i) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$ • Probability of observing all points (assuming independence): $P(X) = \prod_{i=1}^{n} P(x_i) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$ • We want to find the parameters $\theta = (\mu, \sigma)$ that maximize the probability $P(X|\theta)$

  11. Maximum Likelihood Estimation (MLE) • The probability $P(X|\theta)$, viewed as a function of $\theta$, is called the Likelihood function: $L(\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$ • It is usually easier to work with the Log-Likelihood function: $LL(\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2 - \frac{n}{2}\log 2\pi - n\log\sigma$ • Maximum Likelihood Estimation: find the parameters $\mu, \sigma$ that maximize $LL(\theta)$: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i = \mu_X$ (the sample mean) and $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 = \sigma_X^2$ (the sample variance)
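Because the MLE has a closed form here, fitting reduces to computing the sample mean and variance. A sketch, assuming NumPy and synthetic data (the generating parameters below are made up for illustration):

```python
import numpy as np

def gaussian_mle(x):
    """Closed-form MLE for a 1-D Gaussian: sample mean and sample std."""
    mu = x.mean()
    sigma = np.sqrt(((x - mu) ** 2).mean())  # note 1/n, not the unbiased 1/(n-1)
    return mu, sigma

def log_likelihood(x, mu, sigma):
    """LL(theta) exactly as in the formula above."""
    n = len(x)
    return (-((x - mu) ** 2).sum() / (2 * sigma ** 2)
            - 0.5 * n * np.log(2 * np.pi)
            - n * np.log(sigma))

x = np.random.default_rng(0).normal(177.0, 7.0, size=1000)  # synthetic heights
mu_hat, sigma_hat = gaussian_mle(x)
print(mu_hat, sigma_hat, log_likelihood(x, mu_hat, sigma_hat))
```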

  12. MLE • Note: these are also the most likely parameters given the data: $P(\theta|X) = \frac{P(X|\theta)\,P(\theta)}{P(X)}$ • If we have no prior information about $\theta$, then maximizing $P(X|\theta)$ is the same as maximizing $P(\theta|X)$

  13. Mixture of Gaussians • Suppose that you have the heights of people from Greece and China and the distribution looks like the figure below (dramatization)

  14. Mixture of Gaussians • In this case the data is the result of the mixture of two Gaussians • One for Greek people, and one for Chinese people • Identifying for each value which Gaussian is most likely to have generated it will give us a clustering.

  15. Mixture model • A value $x_i$ is generated according to the following process: • First select the nationality: with probability $\pi_G$ select Greece, with probability $\pi_C$ select China ($\pi_G + \pi_C = 1$) • We can also think of this as a hidden variable $Z$ that takes two values: Greece and China • Given the nationality, generate the point from the corresponding Gaussian: $P(x_i|\theta_G) \sim N(\mu_G, \sigma_G)$ if Greece, $P(x_i|\theta_C) \sim N(\mu_C, \sigma_C)$ if China • $\theta_G$: parameters of the Greek distribution, $\theta_C$: parameters of the Chinese distribution
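The generative story can be simulated directly: draw the hidden nationality $Z$ first, then a height from that component. A sketch with made-up parameter values (the lecture does not specify any):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_mixture(n, pi_G, mu_G, sigma_G, mu_C, sigma_C):
    """Draw n points: first the hidden Z (nationality), then x from that Gaussian."""
    z = rng.random(n) < pi_G                    # True -> Greece, False -> China
    x = np.where(z,
                 rng.normal(mu_G, sigma_G, n),  # Greek component
                 rng.normal(mu_C, sigma_C, n))  # Chinese component
    return x, z

# Illustrative parameters only
x, z = sample_mixture(1000, pi_G=0.5, mu_G=177.0, sigma_G=7.0,
                      mu_C=170.0, sigma_C=6.0)
```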

  16. Mixture Model • Our model has the following parameters: $\Theta = (\pi_G, \pi_C, \mu_G, \sigma_G, \mu_C, \sigma_C)$ • $\pi_G, \pi_C$: mixture probabilities • $\theta_G = (\mu_G, \sigma_G)$: parameters of the Greek distribution, $\theta_C = (\mu_C, \sigma_C)$: parameters of the Chinese distribution

  17. Mixture Model • Our model has the following parameters: $\Theta = (\pi_G, \pi_C, \mu_G, \sigma_G, \mu_C, \sigma_C)$ (mixture probabilities and distribution parameters) • For value $x_i$, we have: $P(x_i|\Theta) = \pi_G\,P(x_i|\theta_G) + \pi_C\,P(x_i|\theta_C)$ • For all values $X = (x_1, \dots, x_n)$: $P(X|\Theta) = \prod_{i=1}^{n} P(x_i|\Theta)$ • We want to estimate the parameters that maximize the likelihood of the data
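Evaluating the mixture likelihood is a direct translation of the two formulas above. A sketch (the normal_pdf helper from the earlier snippet is repeated so this stands alone):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def mixture_pdf(x, pi_G, mu_G, sigma_G, pi_C, mu_C, sigma_C):
    """P(x_i | Theta) = pi_G * P(x_i | theta_G) + pi_C * P(x_i | theta_C)."""
    return (pi_G * normal_pdf(x, mu_G, sigma_G)
            + pi_C * normal_pdf(x, mu_C, sigma_C))

def mixture_log_likelihood(x, *params):
    """log P(X | Theta): the product over points becomes a sum of logs."""
    return np.log(mixture_pdf(x, *params)).sum()
```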

  19. Mixture Models • Once we have the parameters $\Theta = (\pi_G, \pi_C, \mu_G, \mu_C, \sigma_G, \sigma_C)$ we can estimate the membership probabilities $P(G|x_i)$ and $P(C|x_i)$ for each point $x_i$ • This is the probability that point $x_i$ belongs to the Greek or the Chinese population (cluster) • $P(G|x_i) = \frac{P(x_i|G)\,P(G)}{P(x_i|G)\,P(G) + P(x_i|C)\,P(C)} = \frac{P(x_i|\theta_G)\,\pi_G}{P(x_i|\theta_G)\,\pi_G + P(x_i|\theta_C)\,\pi_C}$, where $P(x_i|G)$ is given by the Gaussian distribution $N(\mu_G, \sigma_G)$ for Greece
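This is Bayes' rule applied per point. A sketch, assuming the normal_pdf helper defined above:

```python
def membership_prob_G(x, pi_G, mu_G, sigma_G, pi_C, mu_C, sigma_C):
    """P(G | x_i): posterior probability that x_i came from the Greek component.
    P(C | x_i) is simply 1 - P(G | x_i)."""
    wG = pi_G * normal_pdf(x, mu_G, sigma_G)
    wC = pi_C * normal_pdf(x, mu_C, sigma_C)
    return wG / (wG + wC)
```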

  20. EM (Expectation Maximization) Algorithm • Initialize the values of the parameters in $\Theta$ to some random values • Repeat until convergence • E-Step: Given the parameters $\Theta$, estimate the membership probabilities $P(G|x_i)$ and $P(C|x_i)$ • M-Step: Compute the parameter values that (in expectation) maximize the data likelihood: $\pi_G = \frac{1}{n}\sum_{i=1}^{n} P(G|x_i)$ and $\pi_C = \frac{1}{n}\sum_{i=1}^{n} P(C|x_i)$ (the fraction of the population in each cluster); $\mu_G = \sum_{i=1}^{n} \frac{P(G|x_i)}{n\,\pi_G}\, x_i$ and $\mu_C = \sum_{i=1}^{n} \frac{P(C|x_i)}{n\,\pi_C}\, x_i$; $\sigma_G^2 = \sum_{i=1}^{n} \frac{P(G|x_i)}{n\,\pi_G} (x_i - \mu_G)^2$ and $\sigma_C^2 = \sum_{i=1}^{n} \frac{P(C|x_i)}{n\,\pi_C} (x_i - \mu_C)^2$ (these are the MLE estimates if the $\pi$'s were fixed)
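Putting the two steps together gives the full loop. A self-contained sketch for the two-component 1-D case; the min/max initialization heuristic and the fixed iteration count stand in for the "random values" and convergence test on the slide:

```python
import numpy as np

def em_two_gaussians(x, iters=100):
    """EM for a mixture of two 1-D Gaussians (the Greece/China example).
    A sketch only: no guard against degenerate (zero-weight) components."""
    # Initialization heuristic: pull the means apart, share the overall spread
    pi_G = pi_C = 0.5
    mu_G, mu_C = x.min(), x.max()
    sigma_G = sigma_C = x.std()
    n = len(x)

    def pdf(v, mu, sigma):
        return np.exp(-(v - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

    for _ in range(iters):
        # E-step: membership probabilities P(G|x_i) and P(C|x_i)
        wG = pi_G * pdf(x, mu_G, sigma_G)
        wC = pi_C * pdf(x, mu_C, sigma_C)
        pG = wG / (wG + wC)
        pC = 1.0 - pG

        # M-step: weighted re-estimates, matching the formulas on the slide
        pi_G, pi_C = pG.mean(), pC.mean()
        mu_G = (pG * x).sum() / (n * pi_G)
        mu_C = (pC * x).sum() / (n * pi_C)
        sigma_G = np.sqrt((pG * (x - mu_G) ** 2).sum() / (n * pi_G))
        sigma_C = np.sqrt((pC * (x - mu_C) ** 2).sum() / (n * pi_C))

    return pi_G, mu_G, sigma_G, pi_C, mu_C, sigma_C
```

On the synthetic sample from the earlier sketch this should recover parameters close to the generating ones, up to which component ends up labeled G or C; in practice one would monitor the log-likelihood and stop when it plateaus rather than run a fixed number of iterations.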

  21. Relationship to K-means • E-Step: Assignment of points to clusters • K-means: hard assignment, EM: soft assignment • M-Step: Computation of centroids • K-means assumes common fixed variance (spherical clusters) • EM: can change the variance for different clusters or different dimensions (ellipsoid clusters) • If the variance is fixed then both minimize the same error function

  22. CLUSTERING EVALUATION

  23. Clustering Evaluation • How do we evaluate the "goodness" of the resulting clusters? • But "clustering lies in the eye of the beholder"! • Then why do we want to evaluate them? • To avoid finding patterns in noise • To compare clusterings, or clustering algorithms • To compare against a "ground truth"

  24. Clusters found in Random Data • [Figure: four scatter plots of the same random points in the unit square: the original random points, and the "clusters" found in them by DBSCAN, K-means, and complete-link hierarchical clustering]

  25. Different Aspects of Cluster Validation 1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data. 2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels. 3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data). 4. Comparing the results of two different sets of cluster analyses to determine which is better; this includes determining the "correct" number of clusters. 5. For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
