

SLIDE 1

Clustering methods

R.W. Oldford

SLIDE 2

Interactive data visualization

An important advantage of data visualization is that much structure (e.g. density, groupings, regular patterns, relationships, outliers, connections across dimensions, etc.) can be easily seen visually, even though it might be more difficult to describe mathematically. Moreover, the structure observed need not have been anticipated. Interaction, which allows fairly arbitrary changes to the plot via direct manipulation (e.g. mouse gestures) and also via the command line (i.e. programmatically), further enables the analyst by providing quick and easy data queries, marking of structure, and, when the visualizations are themselves data structures, quick setting and extraction of observed information. Direct interaction amplifies the advantage of data visualization and creates a powerful tool for uncovering structure. In contrast, we might choose to have some statistical algorithm search for structure in the data. This would of course require specifying in advance how that structure might be described mathematically.

SLIDE 3

Interactive data visualization

The two approaches naturally complement one another.

◮ Structure searched for algorithmically must be precisely characterised mathematically, and so is necessarily determined prior to the analysis.

◮ Interactive data visualization depends on the human visual system, which has evolved over millions of years to be able to see patterns, both anticipated and not.

In the hands of an experienced analyst, one complements and amplifies the other; the two are worked together to give much greater insight than either approach could alone. We have already seen the value of using both in conjunction with one another in, for example, hypothesis testing, density estimation, and smoothing.

SLIDE 4

Finding groups in data

Consider the “Old Faithful” geyser data (from the MASS package), centred and scaled as follows.

library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
##     select
xrange <- diff(range(geyser$duration))
yrange <- diff(range(geyser$waiting))
data <- as.data.frame(scale(geyser[, c("duration", "waiting")], scale = c(xrange, yrange)))

data is now centred at the average in each direction and scaled so that the ranges of the two directions are identical. We do this so that, when we consider the clustering methods, they will work on data for which visual distances observed on any (square) scatterplot correspond to Euclidean distances in the space of measurements (which any clustering method would use).

SLIDE 5

Finding groups in data

Oftentimes we observe that the data have grouped together in patterns. In a scatterplot, for example, we might notice that the observations concentrate more in some areas than they do in others.

A simple scatterplot with larger point sizes and alpha blending shows
◮ 3 regions of concentration,
◮ 3 vertical lines,
◮ and a few outliers.

Contours of constant kernel density estimate show
◮ two modes at right,
◮ a higher mode at left, and
◮ a smooth continuous mathematical function.

Perhaps the points could be automatically grouped by using the contours?

[Figure: scatterplot of waiting versus duration with contours of the kernel density estimate (levels 0.5 to 4.5).]

SLIDE 6

Finding groups in data - K-means

A great many methods exist (and continue to be developed) to automatically find groups in data. These have historically been called clustering methods by data analysts and more recently are sometimes called unsupervised learning methods (in the sense that we do not know the “classes” of the observations as in “supervised learning”) by many artificial intelligence researchers.

One of the earliest clustering methods is “K-means”. In its simplest form, it begins with the knowledge that there are exactly K clusters to be determined. The idea is to identify K clusters, C_1, ..., C_K, where every multivariate observation x_i for i = 1, ..., n in the data set appears in one and only one cluster C_k. The clusters are to be chosen so that the total within-cluster spread is as small as possible. For every cluster, the total spread for that cluster is measured by the sum of squared Euclidean distances from the cluster “centroid”, namely

$$SSE_k = \sum_{i \in C_k} d^2(i, k) = \sum_{i \in C_k} \|x_i - c_k\|^2$$

where c_k is the cluster “centroid”. Typically, the cluster average

$$\bar{x}_k = \frac{1}{n_k} \sum_{i \in C_k} x_i$$

(where n_k denotes the cardinality of cluster C_k) is chosen as the cluster centroid (i.e. choose c_k = x̄_k). The K clusters are chosen to minimize

$$\sum_{k=1}^{K} SSE_k.$$

Algorithms typically begin with “seed” centroids c1, . . . , cK, possibly randomly chosen, then assign every observation to its nearest centroid. Each centroid is then recalculated based on the values of xi ∀ i ∈ Ck (e.g. ck = xk), and the data are reassigned to the new centroids. Repeat until there is no change in the clustering.
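The iteration just described can be written out directly. Below is a minimal sketch of that simple (Lloyd-style) iteration, for illustration only; it is not the algorithm kmeans() uses by default (Hartigan-Wong), and the function and variable names are our own. It assumes the data form a numeric matrix or data frame and that no cluster becomes empty.

simple_kmeans <- function(x, K, max_iter = 100) {
  x <- as.matrix(x)
  # "seed" centroids: K observations chosen at random
  centroids <- x[sample(nrow(x), K), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # assign every observation to its nearest centroid (squared Euclidean distance)
    d2 <- sapply(seq_len(K), function(k) colSums((t(x) - centroids[k, ])^2))
    new_cluster <- max.col(-d2)
    if (all(new_cluster == cluster)) break   # no change in the clustering; stop
    cluster <- new_cluster
    # recalculate each centroid as its cluster average
    for (k in seq_len(K)) centroids[k, ] <- colMeans(x[cluster == k, , drop = FALSE])
  }
  list(cluster = cluster, centers = centroids)
}

For example, simple_kmeans(data, K = 3)$cluster gives one such clustering of the scaled geyser data.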

SLIDE 7

Finding groups in data - K-means

There are several implementations of K-means in R. A number of them are available through the base R function kmeans() (via its algorithm argument).

result <- kmeans(data, centers = 3)
str(result)
## List of 9
##  $ cluster     : Named int [1:299] 2 3 1 2 2 3 1 2 3 1 ...
##   ..- attr(*, "names")= chr [1:299] "1" "2" "3" "4" ...
##  $ centers     : num [1:3, 1:2] 0.21 0.1392 -0.3166 -0.2655 0.0968 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:2] "duration" "waiting"
##  $ totss       : num 32
##  $ withinss    : num [1:3] 1.33 1.19 1.58
##  $ tot.withinss: num 4.09
##  $ betweenss   : num 27.9
##  $ size        : int [1:3] 101 91 107
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

The cluster component identifies to which of the three clusters the corresponding observation has been assigned.
SLIDE 8

Finding groups in data - K-means

Plotting this information in loon

library(loon)
p <- l_plot(data, linkingGroup = "geyser",
            showScales = FALSE, showLabels = FALSE,
            showGuides = FALSE)   # the showGuides value was cut off in the original; FALSE assumed
# Add the density contours
l_layer_contourLines(p, kde2d(data$duration, data$waiting, n = 100), color = "grey")
## loon layer "lines" of type lines of plot .l0.plot
## [1] "layer0"
# Colour the clusters
p['color'] <- result$cluster

SLIDE 9

Finding groups in data - K-means

Plotting this information in loon

plot(p)

which looks pretty good.

SLIDE 10

Finding groups in data - K-means

Had we selected only K = 2: which we might not completely agree with.

SLIDE 11

Finding groups in data - K-means

How about K = 4?: with which, again, we might or might not agree.

SLIDE 12

Finding groups in data - K-means

How about K = 5?: with which, again, we might or might not agree.

SLIDE 13

Finding groups in data - K-means

How about K = 6?: with which, again, we might or might not agree.

SLIDE 14

Finding groups in data - K-means

Let’s try K = 6 again: which is different!

SLIDE 15

Finding groups in data - K-means

Some comments and questions:

◮ K-means depends on total squared Euclidean distance to the centroids
◮ K-means implicitly presumes that the clusters will be "globular" or "spherical"
◮ different clusters might arise on different calls (random starting positions for the centroids; see the sketch below)
◮ how do we choose K?
◮ should we enforce a hierarchy on the clusters?
◮ with an interactive visualization, we should be able to readjust the clusters by changing their colours
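Because the seed centroids are random, repeated calls can indeed return different clusterings, as happened with K = 6 above. A minimal sketch of one common mitigation, using the same scaled geyser data: fix the random seed for reproducibility and/or ask kmeans() for several random starts via its nstart argument (the particular seed and nstart value here are illustrative).

set.seed(314)                                            # make the run reproducible
result_a <- kmeans(data, centers = 6)
result_b <- kmeans(data, centers = 6)
table(result_a$cluster, result_b$cluster)                # the two clusterings may well disagree
result_best <- kmeans(data, centers = 6, nstart = 25)    # keep the best of 25 random starts
result_best$tot.withinss                                 # total within-cluster sum of squares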

SLIDE 16

Finding groups in data - model based clustering

A related approach to K-means, but one that is much more general and which comes from a different reasoning base, is that of so-called “model-based clustering”. Here, the main idea is that the data x_i are a sample of independently and identically distributed (iid) multivariate observations from some multivariate mixture distribution. That is,

$$X_1, \ldots, X_n \sim f_p(x\,;\,\Theta)$$

where f_p(x ; Θ) is a p-variate continuous density parameterized by some collection of parameters Θ that can be expressed as a finite mixture of individual p-variate densities g_p():

$$f_p(x\,;\,\Theta) = \sum_{k=1}^{K} \alpha_k\, g_p(x\,;\,\theta_k).$$

Here α_k ≥ 0, Σ_{k=1}^K α_k = 1, and the individual densities g_p(x ; θ_k) are of known shape and are identical up to differences given by their individual parameter vectors θ_k. Neither α_k nor θ_k are known for any k = 1, ..., K and must be estimated from the observations (i.e. Θ = {α_1, ..., α_K, θ_1, ..., θ_K}).

Typically the g_p(x ; θ_k) are taken to be multivariate Gaussian densities of the form g_p(x ; θ_k) = φ_p(x ; μ_k, Σ_k) with

$$\phi_p(x\,;\,\mu_k, \Sigma_k) = (2\pi)^{-p/2}\, |\Sigma_k|^{-1/2}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}$$

and θ_k = (μ_k, Σ_k).

SLIDE 17

Finding groups in data - model based clustering

We can imagine fitting the mixture model to the data for fixed K and α_k > 0 ∀ k via, say, maximum likelihood. This can be accomplished by introducing latent variates z_ik which are 1 when x_i came from the kth mixture component and zero otherwise. Suffice it to say that the z_ik are treated as “missing” and that an “EM” or “Expectation-Maximization” algorithm is then used to perform maximum likelihood estimation on the finite mixture. The parameters μ_k and Σ_k can also be constrained (the eigen-decomposition Σ_k = O D_σ O^T is useful in this) to restrict the problem further. An information criterion like the “Bayesian Information Criterion”, or “BIC”, is used to compare values of K across models. This adds a penalty to the log-likelihood that penalizes larger models. Look for models that have high information (as measured by BIC). (Note that some writers (e.g. Wikipedia) use minus this and hence minimize their objective function.)

Note that by using a Gaussian model, the clusters are inherently taken to be elliptically shaped. An implementation of model-based clustering can be found in the R package mclust.
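To make the role of the latent z_ik concrete, here is a minimal sketch of the E-step that computes the expected z_ik (the “responsibility” of component k for observation x_i) given current parameter values. This is an illustration only, not mclust's implementation, and it assumes the mvtnorm package for the Gaussian density φ_p.

library(mvtnorm)   # assumed here for dmvnorm(); not part of these slides

# E-step: expected z_ik given current mixing weights alpha, a list of mean
# vectors mu, and a list of covariance matrices Sigma
responsibilities <- function(x, alpha, mu, Sigma) {
  x <- as.matrix(x)
  K <- length(alpha)
  num <- sapply(seq_len(K), function(k)
    alpha[k] * dmvnorm(x, mean = mu[[k]], sigma = Sigma[[k]]))
  num / rowSums(num)   # each row sums to 1: P(component k | x_i)
}

The M-step would then re-estimate α_k, μ_k and Σ_k from these weights, and the two steps alternate until the log-likelihood stops increasing.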

SLIDE 18

Finding groups in data - model based clustering

For the geyser data

library(mclust)
resultmc <- Mclust(data, G = 1:10)  # G = number of mixtures to consider
plot(resultmc, what = "BIC")        # (VVV is the most general)

[Figure: BIC (roughly 100 to 600) versus number of components (1 to 10) for the mclust covariance models EII, VII, EEI, VEI, EVI, VVI, EEE, EVE, VEE, VVE, EEV, VEV, EVV, VVV.]

SLIDE 19

Finding groups in data - model based clustering

For the geyser data

summary(resultmc)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVI (diagonal, varying volume and shape) model with 4 components:
##
##  log-likelihood   n df      BIC      ICL
##        375.3983 299 19 642.4882 612.5996
##
## Clustering table:
##  1  2  3  4
## 90 17 98 94

We could check the agreement with K-means

table(result4$cluster, resultmc$classification)   # result4: presumably the earlier kmeans(data, centers = 4) fit (code not shown)
##
##      1  2  3  4
##   1  0  0 60  0
##   2  0 13  0 94
##   3 82  4  0  0
##   4  8  0 38  0

SLIDE 20

Finding groups in data - model based clustering

And update the loon plot

p['color'] <- resultmc$classification

which is interesting . . .

SLIDE 21

Finding groups in data - model based clustering

Plotting the mixtures

plot(resultmc, what = "classification")

[Figure: mclust classification plot of waiting versus duration showing the fitted mixture components.]

The model is imposing its structure on the data.

SLIDE 22

Dissimilarity based methods

There are a number of clustering methods which are based entirely on some measure of the dissimilarity δij = δ(xi, xj) between every pair (i, j) of observations xi and xj (for all i ≠ j).

Usually (but not always) the dissimilarity δij is a distance function in that it obeys the axioms of a metric. Namely, for all vectors x, y, z, a function d(x, y) is a metric or distance if it satisfies

1. d(x, y) ≥ 0
2. d(x, y) = 0 ⟺ x = y
3. d(x, y) = d(y, x)
4. d(x, z) ≤ d(x, y) + d(y, z)

Dissimilarity measures δ(x, y) typically obey the first three of these; some do not additionally obey the fourth (the triangle inequality). When all four hold, so that the measure is a distance, we will denote the dissimilarity by dij rather than by δij.

SLIDE 23

Example distances - some common choices

There are numerous functions that could be used, depending on the application. Some examples for x, y ∈ R^p include:

◮ Euclidean distance:

$$d(x, y) = \left\{ (x - y)^T (x - y) \right\}^{1/2}$$

◮ Minkowski distance or k-norm distance (k > 0):

$$d(x, y) = \left( \sum_{j=1}^{p} |x_j - y_j|^k \right)^{1/k}$$

◮ City block distance (or Manhattan, or Taxicab):

$$d(x, y) = \sum_{j=1}^{p} |x_j - y_j|$$

◮ Infinity norm, or supremum, or maximum, distance:

$$d(x, y) = \lim_{k \to \infty} \left( \sum_{j=1}^{p} |x_j - y_j|^k \right)^{1/k} = \max_{j = 1, \ldots, p} |x_j - y_j|$$

The dist() function in R will calculate these and others.
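For the scaled geyser data, for instance, these can be computed directly with dist(); the method names below are the ones dist() accepts, and the object names are ours.

d_euclid <- dist(data, method = "euclidean")
d_city   <- dist(data, method = "manhattan")          # city block / taxicab
d_sup    <- dist(data, method = "maximum")            # infinity norm
d_mink3  <- dist(data, method = "minkowski", p = 3)   # k-norm distance with k = 3
round(as.matrix(d_euclid)[1:3, 1:3], 4)               # pairwise distances among the first three observations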

SLIDE 24

Example distances - x ∉ R^p

There are also numerous other distances/dissimilarities that arise even when x, y ∉ R^p. Some examples are

◮ Cosine similarity. Often used for measuring similarity of text documents where each element of x contains the count of some word for that document, implying x, y ∈ R^p_+. If the vector (i.e. document) length is to be ignored, then x and y are more similar the larger is cos ∠(x, y). A distance could be d(x, y) = 1 − cos ∠(x, y).

◮ Jaccard distance. Here x and y are finite sets. The Jaccard index J(x, y) = |x ∩ y| / |x ∪ y| is a measure of how similar the two sets x and y are. A corresponding dissimilarity measure would be d(x, y) = 1 − J(x, y).

◮ Hamming distance. Here x and y are strings having the same number of characters (the jth element of the vector is the jth character in the string). The Hamming distance is the minimum number of substitutions required to turn one string into the other. That is, it is the number of positions at which the two strings differ. (See also Levenshtein distance for strings of unequal length.)
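As a small illustration, each of these can be computed in a few lines of R; the helper functions below are our own, not from any package mentioned in these slides.

# cosine dissimilarity for numeric (e.g. word-count) vectors
cosine_dist <- function(x, y) 1 - sum(x * y) / sqrt(sum(x^2) * sum(y^2))
# Jaccard distance for finite sets
jaccard_dist <- function(x, y) 1 - length(intersect(x, y)) / length(union(x, y))
# Hamming distance for equal-length character strings
hamming_dist <- function(x, y) sum(strsplit(x, "")[[1]] != strsplit(y, "")[[1]])

cosine_dist(c(2, 0, 1), c(4, 1, 2))
jaccard_dist(c("a", "b", "c"), c("b", "c", "d"))   # 1 - 2/4 = 0.5
hamming_dist("karolin", "kathrin")                 # 3 positions differ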

SLIDE 25

Euclidean squared distances and the Gram matrix

Recall how point configurations x_1, ..., x_n, their matrix D⋆ of squared Euclidean distances with d²_ij = ||x_i − x_j||², and the Gram matrix G of inner products x_i^T x_j are related. Namely,

$$D^\star = \mathbf{1}g^T - 2G + g\mathbf{1}^T \qquad \text{and} \qquad G = -\tfrac{1}{2}\,(I_n - P)\, D^\star\, (I_n - P)$$

where P = 1(1^T 1)^{-1} 1^T and g is the vector of diagonal elements of G.

Or, since the Gram matrix is a matrix of inner products, we could begin with kernel functions K(x_i, x_j) as the inner product of some transformation ψ(x) of the original xs. The kernel matrix (row and column centred) then acts as the Gram matrix.

Finally, given a Gram matrix, a point configuration (and hence possibly a lower dimensional embedding) can always be determined from an eigen-decomposition of G = XX^T = U D_λ U^T (i.e. principal components).

Together, these relationships open up a lot of possibilities for clustering based on distances.
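A quick numerical check of the double-centring relationship, using the (already centred) scaled geyser data as the point configuration; this is a sketch for illustration only.

X     <- as.matrix(data)                 # centred point configuration (299 x 2)
G     <- X %*% t(X)                      # Gram matrix of inner products
Dstar <- as.matrix(dist(X))^2            # matrix of squared Euclidean distances
n     <- nrow(X)
P     <- matrix(1/n, n, n)               # P = 1 (1^T 1)^{-1} 1^T
G2    <- -0.5 * (diag(n) - P) %*% Dstar %*% (diag(n) - P)
max(abs(G - G2))                         # essentially zero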

SLIDE 26

Dissimilarity based methods - hierarchical clustering

Perhaps the most popular way in which distance (or dissimilarity) methods are used is to construct a hierarchical clustering of the observations based on the distances between points. Typically, the hierarchy is produced in one of two ways:

1. Top-down, or divisive:
Begin with all observations in a single cluster C_1^{(0)}, then split this into two clusters C_1^{(1)} and C_2^{(1)} such that the distance between clusters, d(C_1^{(1)}, C_2^{(1)}), is maximized. Repeat this for each cluster, until every cluster consists of a single location x.

2. Bottom-up, or agglomerative:
Begin with each observation in its own cluster. Join the two clusters Ci and Cj that are closest to one another in that they have the smallest between-cluster distance d(Ci, Cj). Repeat this, joining clusters, until only a single cluster containing all observations is constructed.

In either case, the clustering method is hierarchical in that the clusters can be nested in a tree whose nodes are clusters (called a dendrogram). Note that the clustering which results will depend upon the definition of the between-cluster distances.

SLIDE 27

Dissimilarity based methods - hierarchical clustering

Some common choices of distance between clusters A and B are

1. Single linkage

$$d(A, B) = \min\{\, d(x, y) : x \in A,\ y \in B \,\}$$

2. Complete linkage

$$d(A, B) = \max\{\, d(x, y) : x \in A,\ y \in B \,\}$$

3. Average linkage

$$d(A, B) = \frac{1}{|A| \times |B|} \sum_{x \in A} \sum_{y \in B} d(x, y)$$

4. Centroid linkage

$$d(A, B) = \| c_A - c_B \|$$

where c_A and c_B are the centroids of the clusters A and B, respectively.

The function hclust() in R implements these and other methods. There is also a package called cluster which implements a variety of agglomerative and divisive hierarchical methods, as well as partitioning methods like K-means.
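To see how the four definitions differ, each can be computed directly for two small (arbitrarily chosen) groups of observations. This is an illustration only, not how hclust() computes linkages internally.

A   <- as.matrix(data[1:5, ])                    # two arbitrary "clusters"
B   <- as.matrix(data[6:10, ])
dAB <- as.matrix(dist(rbind(A, B)))[1:5, 6:10]   # all between-cluster distances d(x, y)

min(dAB)                                 # single linkage d(A, B)
max(dAB)                                 # complete linkage d(A, B)
mean(dAB)                                # average linkage d(A, B)
sqrt(sum((colMeans(A) - colMeans(B))^2)) # centroid linkage ||c_A - c_B||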

SLIDE 28

Hierarchical clustering - example

Suppose we try single-linkage clustering on the geyser data.

d <- dist(data, method = "euclidean")
single <- hclust(d, method = "single")
str(single)
## List of 7
##  $ merge      : int [1:298, 1:2] -3 -64 -4 -5 -207 -6 -266 -23 -24 -297 ...
##  $ height     : num [1:298] 0 0 0 0 0 0 0 0 0 0 ...
##  $ order      : int [1:299] 149 61 243 58 12 62 169 265 247 187 ...
##  $ labels     : chr [1:299] "1" "2" "3" "4" ...
##  $ method     : chr "single"
##  $ call       : language hclust(d = d, method = "single")
##  $ dist.method: chr "euclidean"
##  - attr(*, "class")= chr "hclust"

class(single)
## [1] "hclust"

SLIDE 29

Hierarchical clustering - example

The result can be displayed as a tree, or dendrogram

plot(single, cex = 0.25)

[Figure: "Cluster Dendrogram" from hclust (*, "single") on d; observation labels along the bottom, Height from 0.00 to 0.20 on the vertical axis.]

SLIDE 30

Hierarchical clustering - example

Clusters can be determined by cutting across the dendrogram at any specified height

plot(single, cex = 0.25)
abline(h = c(0.04, 0.05, 0.06), col = c("red", "blue", "forestgreen"), lty = c(2, 1, 3))

[Figure: the same single-linkage "Cluster Dendrogram" with horizontal cut lines at heights 0.04, 0.05, and 0.06.]

SLIDE 31

Hierarchical clustering - example

Clusters can be determined by cutting across the dendrogram at any specified height

p['color'] <- cutree(single, h = 0.06)

SLIDE 32

Hierarchical clustering - example

Clusters can be determined by cutting across the dendrogram at any specified height

p['color'] <- cutree(single, h = 0.05)

SLIDE 33

Hierarchical clustering - example

Clusters can be determined by cutting across the dendrogram at any specified height

p['color'] <- cutree(single, h = 0.04)

SLIDE 34

Hierarchical clustering - example

Or by specifying the number of clusters

p['color'] <- cutree(single, k = 3)

SLIDE 35

Hierarchical clustering - example

Or by specifying the number of clusters

p['color'] <- cutree(single, k = 10)

SLIDE 36

Hierarchical clustering - example

Or by specifying the number of clusters

SLIDE 37

Hierarchical clustering - example

Or by specifying the number of clusters

SLIDE 38

Hierarchical clustering - example

Average-linkage:

avelinkage <- hclust(d, method = "average")
plot(avelinkage, cex = 0.25)
abline(h = c(0.2, 0.3), col = c("red", "blue"), lty = c(2, 1))

[Figure: "Cluster Dendrogram" from hclust (*, "average") on d; Height from 0.0 to 0.6, with horizontal cut lines at 0.2 and 0.3.]

SLIDE 39

Hierarchical clustering - example

Average-linkage:

p['color'] <- cutree(avelinkage, h = 0.3)

SLIDE 40

Hierarchical clustering - example

Average-linkage:

p['color'] <- cutree(avelinkage, h = 0.2)

SLIDE 41

Hierarchical clustering - example

Complete-linkage:

completeLinkage <- hclust(d, method = "complete")
plot(completeLinkage, cex = 0.25)
abline(h = c(0.25, 0.5, 0.8), col = c("red", "blue", "forestgreen"), lty = c(2, 1, 3))

[Figure: "Cluster Dendrogram" from hclust (*, "complete") on d; Height from 0.0 to 1.2, with horizontal cut lines at 0.25, 0.5, and 0.8.]

SLIDE 42

Hierarchical clustering - example

Complete-linkage:

p['color'] <- cutree(completeLinkage, h=0.8)

SLIDE 43

Hierarchical clustering - example

Complete-linkage:

p['color'] <- cutree(completeLinkage, h=0.5)

SLIDE 44

Hierarchical clustering - example

Complete-linkage:

p['color'] <- cutree(completeLinkage, h=0.25)

SLIDE 45

Hierarchical clustering - linkage methods

Note that depending on how clusters are joined/split, different linkage methods may favour different shaped clusters. For example, single linkage corresponds to finding the minimal spanning tree of the (complete) geometric graph of the points. It will therefore track "stringy" structure as clusters. In contrast, complete linkage will prefer compact, small-diameter clusters over stringy ones. Similarly, average linkage will have a tendency to avoid stringy clusters.

An advantage of any of these hierarchical methods over partitional methods like K-means is that they provide a nested sequence of potential clusterings.
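That nested sequence is easy to extract: cutree() accepts a vector of cluster counts and returns one column of memberships per count. A small illustration on the single-linkage tree above:

cuts <- cutree(single, k = 2:6)   # one column of cluster memberships for each k
head(cuts)                        # the partitions are nested as k increases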

SLIDE 46

Density based clustering - dbscan

Data clusters might also be defined to be points in those regions of the space where there is high density. This naturally leads to a cluster tree, where points separate whenever high-density regions separate (a low-density region appearing between them) and which, at least in principle, has no reason to favour any cluster shape over any other.

For example, the contours of our original density estimate on the geyser data actually define (by their level curves) a sequence of clusters where each set of clusters has fewer points. The resulting cluster tree would have only three branches (two binary branchings).

One way of determining density in higher dimensional space is by the average (or maximum, or some other size measure) of the distances from any location to its k nearest neighbours: the smaller the value, the greater the local density. The package dbscan ("Density-Based Spatial Clustering of Applications with Noise") contains the functions kNN() and kNNdist() which will calculate the distances of the k nearest neighbours to each point in the data set and return these in a matrix:

library(dbscan)
kDists <- kNNdist(data, 5)
head(kDists)
##          1          2          3          4          5          6
## 0.01580251 0.06321004 0.01538462 0.01538462 0.01538462 0.01538462

SLIDE 47

Density based clustering - dbscan

A plot of these distances is produced by kNNdistplot() (to which we have added three horizontal lines).

kNNdistplot(data, 5)
abline(h = c(0.025, 0.05, 0.075), col = c("red", "blue", "forestgreen"), lty = c(2, 1, 3))

[Figure: 5-NN distance (0.00 to 0.20) against points (sample) sorted by distance (1 to 299), with the three horizontal lines added.]

Note that this is just a quantile plot with the x axis relabelled by the number (rather than the proportion) of distances whose value is less than or equal to the vertical value. The vertical axis records the actual nearest neighbour distances. The horizontal lines were placed after visual inspection of the graph to mark (roughly) the "knee" of the graph beyond which the nearest neighbour distances become relatively huge. This is an important parameter for dbscan(). Small nearest neighbour distances correspond to high density.
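As a check on that description, essentially the same picture can be drawn by sorting the 5-NN distances ourselves; a sketch, assuming kNNdist() returns one 5-NN distance per observation as in the output above.

nn5 <- kNNdist(data, 5)              # 5-NN distance for each observation
plot(sort(nn5), type = "l",
     xlab = "Points (sample) sorted by distance", ylab = "5-NN distance")
abline(h = c(0.025, 0.05, 0.075), col = c("red", "blue", "forestgreen"), lty = c(2, 1, 3))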

SLIDE 48

Density based clustering - dbscan

Density based clustering uses the horizontal cutoff as a parameter ε (argument eps) to determine a local neighbourhood for each point. Density around each point is now determined to be proportional to the number of points within the radius ε of each point. In addition to determining the number of clusters, dbscan() also identifies "noise" points which it does not assign to any cluster. A function to update our plot based on the data and ε is

updatePlot <- function(plot, data, eps){
  results <- dbscan(data, eps = eps)
  clusters <- results$cluster
  plot["color"] <- clusters
  # Noise points
  plot["glyph"] <- "ccircle"
  plot["glyph"][clusters == 0] <- "ocircle"   # Noise points are in cluster 0
}

This can now be used to find density based clusters for different ε.

SLIDE 49

Density based clustering - dbscan

updatePlot(p, data, 0.075)

SLIDE 50

Density based clustering - dbscan

updatePlot(p, data, 0.05)

SLIDE 51

Density based clustering - dbscan

updatePlot(p, data, 0.025)

SLIDE 52

Other clustering methods

There are many other clustering methods and R packages which implement them. For example, the package cluster contains numerous functions and graphical tools for determining variations on traditional partitional, agglomerative, and divisive methods. The package kernlab contains functions for clustering methods based on using kernel functions in traditional methods like K-means. kernlab also contains functions to apply "spectral clustering" methods (also in conjunction with kernels) to data. Any clustering method should be used in conjunction with an interactive data visualization system like loon to determine whether the clustering makes sense for the data in hand.
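For instance, a hedged sketch of spectral clustering on the same scaled geyser data with kernlab's specc(); the choice of three centers and reliance on the default (Gaussian) kernel are illustrative assumptions, not recommendations from these slides.

library(kernlab)
sc <- specc(as.matrix(data), centers = 3)   # spectral clustering with the default rbf kernel
p['color'] <- sc@.Data                      # the integer cluster memberships, used to colour the loon plot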
