Clustering Ciira Maina Dedan Kimathi University of Technology 17th June 2015
Introduction ◮ In most data science applications we start off with a large collection of objects which form our data set. ◮ Clustering is often an initial exploratory operation applied to the data. ◮ The aim of clustering is to group the objects into subsets, or clusters, so that closely related objects end up in the same cluster.
Introduction Sheep vs. Goats [Source: Wikipedia]
Introduction Apples vs. Oranges [Source: http://www.microassist.com/ ]
Introduction ◮ Clustering has a number of applications such as: ◮ Image segmentation for lossy image compression ◮ Audio processing applications like diarization and voice activity detection ◮ Clustering gene expression data ◮ Wireless network base station cooperation
Introduction ◮ Here we will consider a number of clustering algorithms: ◮ K-means clustering ◮ Gaussian mixture modelling ◮ Hierarchical clustering
K-means ◮ Given a set of $N$ data points, the goal of K-means clustering is to assign each data point to one of $K$ groups ◮ Each cluster is characterised by a cluster mean $\mu_k$, $k = 1, \ldots, K$ ◮ The data points are assigned to the clusters such that the average dissimilarity of data points in the cluster from the cluster mean is minimized ◮ In K-means clustering the dissimilarity is measured using Euclidean distance
K-means, Example ◮ Consider 2D data from two distinct clusters. K-means does a good job of discovering these clusters. Figure: Data with two distinct clusters. Figure: Result of K-means clustering.
K-means, The Theory ◮ Consider the $N$ data points $\{x_1, \ldots, x_N\}$ which we would like to partition into $K$ clusters. ◮ We introduce $K$ cluster centers $\mu_k$, $k = 1, \ldots, K$, and corresponding indicator variables $r_{n,k} \in \{0, 1\}$, where $r_{n,k} = 1$ if $x_n$ belongs to cluster $k$. ◮ The objective function is the sum of squared distances of the data points to their assigned cluster centers. That is $$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{n,k} \, \| x_n - \mu_k \|^2$$
K-means, The Theory 1. The K-means algorithm proceeds iteratively. Starting with an initial set of cluster centers, the variables $r_{n,k}$ are determined: $$r_{n,k} = \begin{cases} 1 & \text{if } k = \arg\min_j \| x_n - \mu_j \|^2 \\ 0 & \text{otherwise} \end{cases}$$ 2. In the next step, the cluster centers are updated based on the current assignment: $$\mu_k = \frac{\sum_n r_{n,k} \, x_n}{\sum_n r_{n,k}}$$ 3. Steps 1 and 2 are repeated until the assignment remains unchanged or the relative change in $J$ is small.
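A minimal NumPy sketch of this procedure, assuming the data are in an (N, d) array; the function name, tolerance, and random initialization are illustrative choices, not the lecture's code:

```python
import numpy as np

def kmeans(X, K, max_iter=100, tol=1e-6, seed=0):
    """Minimal K-means on an (N, d) data array X with K clusters."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the centers with K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    J_old = np.inf
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest center (the r_{n,k} variables)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        labels = d2.argmin(axis=1)
        J = d2[np.arange(len(X)), labels].sum()   # objective J after the assignment step
        # Step 2: recompute each center as the mean of its assigned points
        for k in range(K):
            if np.any(labels == k):
                mu[k] = X[labels == k].mean(axis=0)
        # Step 3: stop when the relative change in J is small
        if abs(J_old - J) < tol * abs(J_old):
            break
        J_old = J
    return labels, mu, J
```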
K-means, Example Figure: Data with two distinct clusters. Figure: Randomly initialize the cluster centers.
K-means, Example Figure: Assign data points to cluster centers. Figure: Recompute cluster centers.
K-means, Example Figure: Assign data points to cluster centers. Figure: Recompute cluster centers.
K-means, Example ◮ To determine when to stop K-means, we monitor the cost function $J$. ◮ In this case, 3 iterations are sufficient. Figure: Cost function $J$ against iteration number.
K-means, Image compression Example ◮ K-means clustering can be used for image compression via vector quantization. ◮ This algorithm takes advantage of the fact that several nearby pixels of an image often appear the same. ◮ The image is divided into blocks which are then clustered using K-means. ◮ The blocks are then represented using the centroids of the clusters to which they belong.
K-means, Image compression Example ◮ In this example we start with a 196-by-196 pixel image of Mzee Jomo Kenyatta ◮ We divide the image into 2-by-2 blocks and treat these blocks as vectors in $\mathbb{R}^4$ ◮ These vectors are clustered with K = 100 and K = 10 ◮ The resulting image shows degradation but uses fewer bytes for storage. Figure: Original image. Figure: VQ with 100 classes. Figure: VQ with 10 classes.
K-means, Image compression Example ◮ The original image requires $196 \times 196 \times 8$ bits. ◮ To store the cluster to which each $2 \times 2$ block belongs we require $\log_2(K)$ bits per block ◮ To store the cluster centers we need $K \times 4$ real numbers ◮ The total storage for the compressed image (neglecting the small cost of storing the cluster centers) is $\log_2(K) \times \#\text{blocks} = \log_2(K) \times \frac{196^2}{4}$ bits ◮ When $K = 10$, we can compress the image to $\frac{\log_2(10)}{32} \approx 0.103$ of its original size
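As a rough illustration of this vector quantization scheme, the sketch below clusters the 2-by-2 blocks of a grayscale image (a 2D NumPy array) with scikit-learn's KMeans and reconstructs each block from its cluster centroid. The function name, block handling, and parameters are assumptions, not the code used to produce the figures above.

```python
import numpy as np
from sklearn.cluster import KMeans

def compress_blocks(img, K=10, block=2):
    """Vector-quantise a grayscale image by clustering its block x block patches."""
    h, w = img.shape  # assumes h and w are multiples of `block`
    # Rearrange the image into (num_blocks, block*block) vectors
    patches = (img.reshape(h // block, block, w // block, block)
                  .transpose(0, 2, 1, 3)
                  .reshape(-1, block * block))
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(patches)
    # Reconstruct: replace each patch by the centroid of the cluster it belongs to
    recon = km.cluster_centers_[km.labels_]
    recon = (recon.reshape(h // block, w // block, block, block)
                  .transpose(0, 2, 1, 3)
                  .reshape(h, w))
    return recon, km.labels_, km.cluster_centers_
```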
K-means, Practical Issues 1. To avoid poor local minima, use multiple random initializations and keep the solution with the lowest $J$. 2. The initial cluster centers are chosen randomly from the data points. 3. Choosing $K$: the elbow method, i.e. plot $J$ against $K$ and look for the point where the decrease flattens out (see the sketch below).
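A possible sketch of the elbow method using scikit-learn; the data array X and the range of K values are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_max=10):
    """Plot the K-means objective J against K to look for an 'elbow'."""
    ks = list(range(1, k_max + 1))
    # inertia_ is the sum of squared distances to the closest center, i.e. J
    costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
    plt.plot(ks, costs, "o-")
    plt.xlabel("K")
    plt.ylabel("J")
    plt.show()
```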
Gaussian Mixture Models ◮ So far we have considered situations where each data point is assigned to only one cluster. ◮ This is sometimes referred to as hard clustering ◮ In several cases it may be more appropriate to assign each data point a probability of membership in each cluster. ◮ This is soft clustering ◮ Gaussian Mixture Models are useful for soft clustering
Gaussian Mixture Models ◮ GMMs are ideal for modelling continuous data that can be grouped into distinct clusters. ◮ For example, consider a speech signal which contains regions with speech and other regions with silence ◮ We could use a GMM to decide which category a certain segment belongs to. Figure: Speech waveform plotted against time (seconds).
Gaussian Mixture Models, VAD Example ◮ Voice activity detection is a useful signal processing application ◮ It involves deciding whether a segment of an audio signal contains speech or silence ◮ We divide the speech into short segments and compute the logarithm of the energy of each segment. ◮ We see that the log energy shows distinct clusters. Figure: Histogram of the logarithm of block energy.
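A minimal sketch of the log-energy feature described above, assuming the signal is a 1D NumPy array; the frame length and the small offset guarding against log(0) are illustrative assumptions.

```python
import numpy as np

def log_block_energy(x, frame_len=400):
    """Split the signal into non-overlapping frames and return the log energy of each frame."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)
    return np.log10(energy + 1e-12)  # small constant avoids log(0) on silent frames
```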
Gaussian Mixture Models, VAD Example ◮ A single Gaussian does not fit the data well. Figure: Histogram of the logarithm of block energy with a single-Gaussian fit.
Gaussian Mixture Models, VAD Example ◮ Two Gaussians do a better job. Figure: Histogram of the logarithm of block energy with a two-Gaussian fit.
Gaussian Mixture Models, VAD Example ◮ Are three Gaussians even better? Figure: Histogram of the logarithm of block energy with a three-Gaussian fit.
Gaussian Mixture Models, Theory ◮ The Gaussian distribution function for a 1D variable is given by $$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$$ ◮ The distribution is governed by two parameters ◮ The mean $\mu$ ◮ The variance $\sigma^2$ ◮ The mean determines where the distribution is centered and the variance determines the spread of the distribution around this mean.
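A small sketch evaluating this density with NumPy; the function name is illustrative and sigma2 denotes the variance, as in the formula above.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
```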
Gaussian Mixture Models, Theory Figure: Univariate Gaussian with $\mu = 0$ and $\sigma = 1$. Figure: Univariate Gaussian with $\mu = 1$ and $\sigma = 0.5$.
Gaussian Mixture Models, Theory ◮ The Gaussian density cannot be used to model data with more than one distinct ‘clump’, like the log energy of the speech frames. ◮ Linear combinations of more than one Gaussian can capture this structure. ◮ These distributions are known as Gaussian Mixture Models (GMMs) or Mixtures of Gaussians
Gaussian Mixture Models, Theory ◮ The GMM density takes the form $$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k)$$ ◮ $\pi_k$ is known as a mixing coefficient. We have $$\sum_{k=1}^{K} \pi_k = 1 \quad \text{and} \quad 0 \le \pi_k \le 1$$
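A sketch evaluating this mixture density with SciPy; the function name is illustrative and the sigmas argument is taken to be the component standard deviations.

```python
import numpy as np
from scipy.stats import norm

def gmm_pdf(x, pis, mus, sigmas):
    """Evaluate p(x) = sum_k pi_k N(x | mu_k, sigma_k) for a univariate GMM."""
    x = np.asarray(x, dtype=float)
    return sum(pi * norm.pdf(x, loc=mu, scale=s) for pi, mu, s in zip(pis, mus, sigmas))

# Hypothetical usage: a three-component mixture evaluated on a grid
# p = gmm_pdf(np.linspace(-6, 6, 200), pis=[0.3, 0.5, 0.2], mus=[-2, 0, 3], sigmas=[0.5, 1.0, 1.5])
```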
Gaussian Mixture Models, Theory ◮ A GMM with three mixture components. Figure: $p(x)$ for a GMM with three mixture components.
Gaussian Mixture Models, Theory ◮ The mixing coefficients can be viewed as the prior probabilities of the components of the mixture ◮ We can then use the sum and product rules to write $$p(x) = \sum_{k=1}^{K} p(k) \, p(x \mid k)$$ where $p(k) = \pi_k$ and $p(x \mid k) = \mathcal{N}(x \mid \mu_k, \sigma_k)$
Gaussian Mixture Models, Theory ◮ Given an observation $x$, we will be interested in computing the posterior probability of each component, that is, $p(k \mid x)$ ◮ We use Bayes’ rule: $$p(k \mid x) = \frac{p(x \mid k)\, p(k)}{p(x)} = \frac{p(x \mid k)\, p(k)}{\sum_i p(x \mid i)\, p(i)}$$ ◮ We can use this posterior to build a classifier
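A sketch of this posterior computation for a univariate GMM; the function name is illustrative and sigmas are again taken as standard deviations.

```python
import numpy as np
from scipy.stats import norm

def component_posteriors(x, pis, mus, sigmas):
    """Posterior probabilities p(k | x) of each mixture component via Bayes' rule."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # p(k) p(x | k) for every component k: shape (len(x), K)
    joint = np.column_stack([pi * norm.pdf(x, loc=mu, scale=s)
                             for pi, mu, s in zip(pis, mus, sigmas)])
    return joint / joint.sum(axis=1, keepdims=True)   # divide by p(x)
```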
Gaussian Mixture Models, Learning the model ◮ Given a set of observations $X = \{x_1, x_2, \ldots, x_N\}$, where the observations are assumed to be drawn independently from a GMM, the log likelihood function is given by $$\ell(\theta; X) = \sum_{n=1}^{N} \log\left(\sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \sigma_k)\right)$$ where $\theta = \{\pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \sigma_1^2, \ldots, \sigma_K^2\}$ are the parameters of the GMM. ◮ To obtain a maximum likelihood estimate of the parameters, we use the expectation maximization (EM) algorithm
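A sketch of the log likelihood itself, which EM maximises; the function name and the use of standard deviations for the components are assumptions.

```python
import numpy as np
from scipy.stats import norm

def gmm_log_likelihood(X, pis, mus, sigmas):
    """Log likelihood of independent observations X under a univariate GMM."""
    X = np.asarray(X, dtype=float)
    # p(x_n) = sum_k pi_k N(x_n | mu_k, sigma_k) for every observation
    px = sum(pi * norm.pdf(X, loc=mu, scale=s) for pi, mu, s in zip(pis, mus, sigmas))
    return np.log(px).sum()
```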
Gaussian Mixture Models, Returning to the VAD Example ◮ In the VAD example we use the implementation of EM in scikit-learn. ◮ We can then compute, for each segment, the posterior probability that it belongs to the component with the highest mean. ◮ Segments where this probability is greater than a threshold can be classified as speech.
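A hedged sketch of this step using the current scikit-learn API (GaussianMixture); the function name, the 0.5 threshold, and the assumption that the log-energy features arrive as a 1D array are illustrative, not the lecture's code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def vad_from_log_energy(log_energy, threshold=0.5):
    """Label frames as speech (True) or silence (False) from their log energies."""
    feats = np.asarray(log_energy, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(feats)
    speech_comp = int(np.argmax(gmm.means_.ravel()))   # component with the higher mean
    post = gmm.predict_proba(feats)[:, speech_comp]    # posterior of the high-energy component
    return post > threshold
```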