Clustering Lecture notes
Clustering is
Exploratory, unsupervised method
Finding structure in data
Understanding and/or summarising data
Data within a cluster are similar to each other and dissimilar to data in other clusters
Hard vs. soft/fuzzy vs. exclusive

Hard vs. Fuzzy/Soft clustering
Hard: each point belongs to exactly one cluster; soft/fuzzy: each point has a degree of membership in every cluster
Types of clusters
Well separated vs. overlapping
Centre-based – based on distance to cluster centres
Contiguity – points are closer to others in their cluster than to any other cluster
Density – high-density areas separated by low-density areas
Conceptual – sharing some general attribute; intersections belong to both
The notion of a cluster is ambiguous
Applications
Biology – compare species in different environments
Medicine – group variants of illnesses or similar genes; PET scans: differentiate tissues
Marketing – consumers with similar shopping habits, grouping shopping items
Social sciences – crime analysis (greatest incidences of different crimes), poll data
Computer science – data compression, anomaly detection
Internet – web searches (categories of results), social media analysis
Other – location of network towers for optimum signal coverage; Netflix (cluster similar viewing habits); grouping skulls from archaeological digs based on measurements
"Distance" metric
Minkowski metric – for n features and points x, y:
d(x, y) = ( Σ_{i=1..n} |x_i − y_i|^q )^(1/q)
Two common cases: Manhattan, q = 1 (cityblock); Euclidean, q = 2 ("as the crow flies") – see the sketch below
Bad for high-dimensional data
Magnitude and units affect the result (e.g. body height vs. toe length)
-> standardise! (mean = 0, std = 1), but this may affect variability

Other metrics
- Mahalanobis distance – absolute distance with redundancies (correlated features) accounted for
- Pearson correlation (unit-independent) – covariance(x, y) / [std(x) std(y)]
- Binary data – Russell, Dice, Yule index, …
- Cosine (documents: keywords)
- Gower's distance (mixed data types)
- Alternatives (squared distances): squared Euclidean, squared Pearson
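A minimal Python sketch of the Minkowski family and of standardisation; it is not from the lecture, and the example points and feature values are made up purely for illustration.

```python
# Minkowski distance: d(x, y) = ( sum_i |x_i - y_i|^q )^(1/q)
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance between two feature vectors."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

x = np.array([1.0, 2.0, 3.0])   # made-up points
y = np.array([4.0, 0.0, 3.0])

print("Manhattan (q=1):", minkowski(x, y, 1))   # 3 + 2 + 0 = 5
print("Euclidean (q=2):", minkowski(x, y, 2))   # sqrt(9 + 4 + 0) ~ 3.61

# Standardise features (mean = 0, std = 1) so magnitude/units
# (e.g. body height vs. toe length) do not dominate the distance.
X = np.array([[180.0, 2.5],
              [170.0, 2.7],
              [165.0, 2.2]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std)
```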
Linkage criteria
Represent cluster location:
- Nearest (single) neighbour – sensitive to noise and outliers
- Farthest (complete) neighbour – sensitive to noise and outliers, favours globular shapes
- Average – average over all pairwise point distances (a middle ground between single and complete)
- Centroid – virtual "average" point based on each feature
- Medoid – real "median" point
Example calculation
[Figure: two clusters of points, A and B, plotted on a grid (x: 1–6, y: 1–4)]

Distance comparison (a code sketch follows the table)
Linkage     Manhattan      Euclidean
Single      4              sqrt(8)  = 2.83
Complete    7              sqrt(29) = 5.39
Average     48/9 = 5.33    4.00
Centroid    16/3 = 5.33    sqrt(153/9) = 4.12
Medoid      6              sqrt(18) = 4.24
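The coordinates of clusters A and B are not reproduced here, so the sketch below uses two made-up clusters; it only illustrates how each linkage in the table can be computed (with SciPy's cdist), and its numbers will not match the table.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two made-up clusters (the slide's A and B coordinates are not given here).
A = np.array([[1.0, 1.0], [2.0, 1.0], [2.0, 2.0]])
B = np.array([[5.0, 3.0], [6.0, 3.0], [6.0, 4.0]])

for metric in ("cityblock", "euclidean"):        # Manhattan / Euclidean
    D = cdist(A, B, metric=metric)               # all pairwise A-to-B distances
    single   = D.min()                           # nearest (single) neighbour
    complete = D.max()                           # farthest (complete) neighbour
    average  = D.mean()                          # average over all pairs
    centroid = cdist(A.mean(axis=0, keepdims=True),
                     B.mean(axis=0, keepdims=True), metric=metric)[0, 0]
    # medoid = real point minimising total distance to its own cluster
    med_A = A[cdist(A, A, metric=metric).sum(axis=1).argmin()]
    med_B = B[cdist(B, B, metric=metric).sum(axis=1).argmin()]
    medoid = cdist(med_A[None, :], med_B[None, :], metric=metric)[0, 0]
    print(f"{metric}: single={single:.2f} complete={complete:.2f} "
          f"average={average:.2f} centroid={centroid:.2f} medoid={medoid:.2f}")
```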
Hierarchical
(Hierarchical) tree of nested clusters
Levels are steps in the clustering process
Agglomerative vs. divisive – "bottom-up" vs. "top-down"
Time complexity at least n²; divisive is often more computationally demanding
Dendrogram
Stepwise merge or split
[Figure: dendrogram over points a–e, annotated with Root, Branch points, Leaves and a Threshold line]
Root – starting point, containing all points
Branch point – splitting/merging point
Leaf – a single point in its own cluster
Threshold – selected limit giving the best number of clusters (see the sketch below)
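A sketch of agglomerative clustering and a dendrogram with SciPy, on made-up data. Average linkage with the cityblock (Manhattan) metric is used here because SciPy's centroid method assumes Euclidean distances; the threshold value 3.0 is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(5, 2)),     # two made-up groups
               rng.normal(4, 0.5, size=(5, 2))])

# Agglomerative ("bottom-up") clustering; Z encodes the tree of nested clusters.
Z = linkage(X, method="average", metric="cityblock")

# Cut the tree at a chosen threshold to get flat cluster labels.
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)

dendrogram(Z)                        # leaves at the bottom, root at the top
plt.axhline(3.0, linestyle="--")     # the chosen threshold
plt.show()
```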
Example calculation – Revisited
Manhattan metric with centroid linkage

Initial distance matrix for points a–e:
      a    b    c    d    e
a     -    1    4    5    5
b     1    -    2    6    6
c     4    2    -    7    7
d     5    6    7    -    2
e     5    6    7    2    -
The smallest distance is d(a, b) = 1, so a and b are merged into cluster ab.
[Figure: points a–e on a grid (x: 1–6, y: 1–4)]
Example calculation – Revisited (continued)
Manhattan metric with centroid linkage

After merging a and b:
      ab    c     d     e
ab    -     3.5   5.5   5.5
c     3.5   -     7     7
d     5.5   7     -     2
e     5.5   7     2     -
The smallest distance is d(d, e) = 2, so d and e are merged into cluster de.
[Figure: clusters ab, c, d, e on the grid]
Example calculation – Revisited (continued)
Manhattan metric with centroid linkage

After merging d and e:
      ab    c     de
ab    -     4.5   5
c     4.5   -     7
de    5     7     -
The smallest distance is d(ab, c) = 4.5, so ab and c are merged into cluster abc.
[Figure: clusters ab, c, de on the grid]
Example calculation – Revisited (continued)
Manhattan metric with centroid linkage
After merging ab and c, two clusters remain: abc and de (a code sketch of this stepwise procedure follows).
[Figure: clusters abc and de on the grid]
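A sketch of the stepwise procedure above (Manhattan metric, centroid linkage, merge the closest pair each step). The coordinates of a–e are not given in the slides, so the ones below are made up; the printed distance matrices will therefore not match the numbers above.

```python
import numpy as np

# Made-up coordinates for a-e (the slide's actual points are not reproduced).
points = {"a": np.array([1.0, 1.0]), "b": np.array([2.0, 1.0]),
          "c": np.array([3.0, 2.0]), "d": np.array([5.0, 4.0]),
          "e": np.array([6.0, 3.0])}
clusters = {name: [name] for name in points}     # start: every point is a cluster

def centroid(members):
    """Virtual 'average' point of a cluster, feature by feature."""
    return np.mean([points[m] for m in members], axis=0)

def manhattan(p, q):
    return np.abs(p - q).sum()

while len(clusters) > 2:                         # stop at two clusters, as above
    names = list(clusters)
    best = None
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            d = manhattan(centroid(clusters[names[i]]),
                          centroid(clusters[names[j]]))
            print(f"d({names[i]}, {names[j]}) = {d:.2f}")
            if best is None or d < best[0]:
                best = (d, names[i], names[j])
    d, ci, cj = best
    print(f"-> merge {ci} and {cj} (distance {d:.2f})\n")
    clusters[ci + cj] = clusters.pop(ci) + clusters.pop(cj)

print("final clusters:", clusters)
```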
More detailed example
k-means
Partitional clustering method
Specify the number of clusters k; find the most compact solution
Ordinarily time complexity O(n^(dk+1)) for d features and k clusters
Faster algorithms exist (Lloyd's algorithm is linear in the number of points per iteration!)
But k-means falls into local minima (several trials needed), only handles numeric data, and redundancies are not excluded
k-means – Initialise
Initialise k cluster centroids:
- Randomly – may give different clusters each run, so do multiple runs
- First centroid random (or the average point), then the remaining centroids as the most distant points – may select outliers
- k-means++ – the first centroid is fully random, the others are chosen randomly with probability proportional to D², the squared distance to the nearest existing centroid (see the sketch below)
then …
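A sketch of the k-means++ initialisation described above, on made-up data; the function name kmeans_pp_init is mine, not from any library.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++: first centroid fully random, the rest drawn with
    probability proportional to D^2 (squared distance to nearest centroid)."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) + 3 * rng.integers(0, 4, size=(100, 1))  # made-up data
print(kmeans_pp_init(X, k=4, rng=rng))
```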
k-means – Iterate
Iterate:
- Calculate centroids
- Calculate distances to centroids
- Move points to the closest centroid
Repeat until either: no points move clusters, fewer than X% of points move, or the maximum number of iterations is reached (see the sketch below)
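A minimal sketch of the iteration loop (Lloyd's algorithm) on made-up data with a simple random initialisation; the stopping rule here is "no points moved" or a maximum number of iterations.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in (0, 3, 6)])  # made-up data
k, max_iter = 3, 100

centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initialisation
labels = np.full(len(X), -1)

for it in range(max_iter):
    # distances from every point to every centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    new_labels = dists.argmin(axis=1)             # move points to closest centroid
    if np.array_equal(new_labels, labels):
        break                                     # no points moved: converged
    labels = new_labels
    # recompute each centroid as the mean of its points (keep old one if empty)
    centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else centroids[j] for j in range(k)])

print("iterations:", it, "\ncentroids:\n", centroids)
```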
k-means – Issues
Empty clusters – replace the centroid with the point farthest from any centroid, or with a point from the cluster with the highest Sum of Squared Errors (SSE)
Outliers – points with a high contribution to the cluster SSE, and more; but in some fields, like finance, outliers are important
Visualisation – k=10, white crosses are centroids
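A sketch reproducing that kind of plot on synthetic data: points coloured by cluster with white crosses at the k = 10 centroids, plus the per-point SSE contributions that flag outlier candidates (the data and parameters are made up).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = 0.7 * rng.normal(size=(500, 2)) + rng.integers(0, 10, size=(500, 1))  # synthetic

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# Each point's squared distance to its own centroid = its contribution to the SSE.
sse_contrib = np.sum((X - km.cluster_centers_[km.labels_]) ** 2, axis=1)
print("outlier candidates (largest SSE contributions):",
      np.argsort(sse_contrib)[-5:])

plt.scatter(X[:, 0], X[:, 1], c=km.labels_, s=10)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            marker="x", c="white", s=100)
plt.gca().set_facecolor("0.4")   # darker background so the white crosses show
plt.show()
```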
Silhouette
What k should we use? Validates performance based on intra- and inter-cluster distances
a(i) – average dissimilarity of point i to the other data in its cluster
b(i) – lowest average dissimilarity of point i to any cluster it does not belong to
s(i) = (b(i) − a(i)) / max(a(i), b(i))
Values lie in [-1, 1]
Calculated for each point, so very time-demanding!
Calinski-Harabasz Index
Faster, better for large data
Performance based on average intra- and inter-cluster dispersion (traces of the scatter matrices):
CH = [Tr(B_k) / (k − 1)] / [Tr(W_k) / (n − k)]
where B_k and W_k are the between- and within-cluster dispersion matrices, n the number of points and k the number of clusters
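A sketch of choosing k with scikit-learn's silhouette_score and calinski_harabasz_score on made-up data; both are computed for several candidate values of k and the best-scoring k is kept.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.6, size=(100, 2)) for m in (0, 4, 8)])  # made-up data

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}  silhouette={silhouette_score(X, labels):.3f}  "
          f"CH={calinski_harabasz_score(X, labels):.1f}")
```

Higher is better for both indices; the silhouette is the slower of the two, since it needs pairwise distances for every point.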
Postprocessing
We may still want to improve the SSE of our results
Increase the number of clusters (decreases SSE):
- Split the cluster with the largest SSE or standard deviation (see the sketch after this list)
- Open a new cluster using the point most distant from any cluster
Decrease the number of clusters (with the smallest SSE increase):
- Disperse a cluster and reassign its points to the clusters whose SSE increases the least
- Merge the two clusters with the closest centroids
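A sketch of the first post-processing step listed above (split the cluster with the largest SSE), done here by re-running 2-means on that cluster's points only; the data, cluster count, and helper logic are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, s, size=(100, 2))
               for m, s in ((0, 0.4), (5, 0.4), (10, 1.5))])  # one "wide" cluster

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Per-cluster SSE: sum of squared distances of each cluster's points to its centroid.
sse = np.array([np.sum((X[km.labels_ == j] - km.cluster_centers_[j]) ** 2)
                for j in range(km.n_clusters)])
worst = int(np.argmax(sse))
print("per-cluster SSE:", sse.round(1), "-> splitting cluster", worst)

# Split the worst cluster by running 2-means on its points only.
sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[km.labels_ == worst])
print("new sub-centroids:\n", sub.cluster_centers_)
```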