
SLIDE 1

Clustering

kMeans, Expectation Maximization, Self-Organizing Maps

SLIDE 2

Outline

  • K-means clustering
  • Hierarchical clustering
  • Incremental clustering
  • Probability-based clustering
  • Self-Organising Maps
SLIDE 3

Classification vs. Clustering

Classification: Supervised learning (labels given)

SLIDE 4

Classification vs. Clustering

Clustering: Unsupervised learning. No labels; find a “natural” grouping of instances.

[Figure: scatter plot of instances, labels unknown]

SLIDE 5

Many Applications!

  • Basically, everywhere labels are unknown, uncertain, or too expensive
  • Marketing: find groups of similar customers
  • Astronomy: find groups of similar stars, galaxies
  • Earthquake studies: cluster earthquake epicenters along continent faults
  • Genomics: find groups of genes with similar expressions
SLIDE 6

Clustering Methods: Terminology

Non-overlapping vs. overlapping

SLIDE 7

Clustering Methods: Terminology

Top-down vs. bottom-up (agglomerative)

SLIDE 8

Clustering Methods: Terminology

Hierarchical

SLIDE 9

Clustering Methods: Terminology

Deterministic vs. probabilistic

SLIDE 10

K-Means Clustering

SLIDE 11–12

K-means clustering (k=3)

[Figure: data points in the X–Y plane; three initial centers k1, k2, k3]

Pick k random points: initial cluster centers

SLIDE 13

K-means clustering (k=3)

[Figure: each point assigned to the nearest of k1, k2, k3]

Assign each point to nearest cluster center

SLIDE 14–17

K-means clustering (k=3)

[Figure: centers k1, k2, k3 moving step by step to the mean of their clusters]

Move cluster centers to mean of each cluster

SLIDE 18–22

K-means clustering (k=3)

[Figure: points switching, one by one, to a nearer center k1, k2, k3]

Reassign points to nearest cluster center

SLIDE 23–26

K-means clustering (k=3)

[Figure: assignments and centers stabilizing over the final iterations]

Repeat steps 3–4 until cluster centers converge (they stop moving, or hardly move)

SLIDE 27

K-means

Works with numeric data only

1) Pick K random points: initial cluster centers
2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2–3 until convergence (change in cluster assignments less than a threshold)
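To make the four steps concrete, here is a minimal NumPy sketch of the loop above; the function name and the exact convergence test (assignments no longer changing) are illustrative choices, not from the slides:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means; X is an (n, d) array of numeric data."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # 1) pick K random points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignments = np.full(len(X), -1)
    for _ in range(max_iter):
        # 2) assign every item to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # 4) converged: cluster assignments no longer change
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # 3) move each center to the mean of its assigned items
        for j in range(k):
            if np.any(assignments == j):
                centers[j] = X[assignments == j].mean(axis=0)
    return centers, assignments
```

Testing assignment changes rather than center movement is just one of several reasonable stopping criteria.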

SLIDE 28

K-means clustering: another example

http://www.youtube.com/watch?v=zaKjh2N8jN4#!

SLIDE 29

Discussion

  • Result can vary significantly depending on the initial choice of centers
  • Can get trapped in a local minimum
  • Example: [Figure: instances and initial cluster centers leading to a poor local optimum]
  • To increase the chance of finding the global optimum: restart with different random seeds (see the sketch below)
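scikit-learn's KMeans builds in exactly this restart strategy; a one-line sketch (assuming the data sit in an array X):

```python
from sklearn.cluster import KMeans

# n_init=10 restarts k-means from 10 different random initializations
# and keeps the run with the lowest within-cluster sum of squares
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
```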

SLIDE 30

K-means clustering summary

Disadvantages

  • Must pick the number of clusters beforehand
  • All items forced into exactly one cluster
  • Sensitive to outliers

Advantages

  • Simple, understandable
  • Items automatically assigned to clusters

SLIDE 31–33

K-means: variations

  • K-medoids – instead of the mean, use the median of each cluster
  • Mean of 1, 3, 5, 7, 1009 is 205
  • Median of 1, 3, 5, 7, 1009 is 5
  • For large databases, use sampling
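A quick check of those two numbers in plain Python, showing why the median shrugs off the outlier:

```python
import statistics

data = [1, 3, 5, 7, 1009]
print(statistics.mean(data))    # 205 -- dragged away from the bulk by the outlier 1009
print(statistics.median(data))  # 5   -- unaffected by the single extreme value
```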

SLIDE 34

Hierarchical Clustering

SLIDE 35

Bottom-up vs top-down clustering

  • Bottom-up / agglomerative
    • Start with single-instance clusters
    • At each step, join the two “closest” clusters
  • Top-down
    • Start with one universal cluster
    • Split it into two clusters
    • Proceed recursively on each subset

[Figure: dendrogram over instances A–F; B,C merge into BC; D,E into DE; DE,F into DEF; BC,DEF into BCDEF; finally ABCDEF]

SLIDE 36

Hierarchical clustering

  • Hierarchical clustering is represented as a dendrogram
    • a tree structure containing hierarchical clusters
    • clusters in the leaves, unions of child clusters in the inner nodes
SLIDE 37

Distance Between Clusters

  • Centroid: distance between centroids
    • Sometimes hard to compute (e.g. mean of molecules?)
  • Single Link: smallest distance between points
  • Complete Link: largest distance between points
  • Average Link: average distance between points

[Figure: two clusters {A, B} and {C, D}; single link distance = 1, complete link distance = 2, average link distance = 1.5, where the average link is (d(A,C) + d(A,D) + d(B,C) + d(B,D)) / 4]

SLIDE 38

Distance Between Clusters

  • Centroid: distance between centroids
    • Sometimes hard to compute (e.g. mean of molecules?)
  • Single Link: smallest distance between points
  • Complete Link: largest distance between points
  • Average Link: average distance between points
  • Group-average: group two clusters into one, then take the average distance between all points (incl. d(A,B) & d(C,D))
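These criteria correspond directly to the `method` argument of SciPy's agglomerative clustering; a small sketch (the 10×2 random data are just for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.rand(10, 2)                        # ten 2-D instances
Z = linkage(X, method="average")                 # also: "single", "complete", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
dendrogram(Z)                                    # draw the tree (requires matplotlib)
```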

SLIDE 39

Incremental Clustering

SLIDE 40

Clustering weather data

ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
E   Rainy     Cool   Normal    False
F   Rainy     Cool   Normal    True
G   Overcast  Cool   Normal    True
H   Sunny     Mild   High      False
I   Sunny     Cool   Normal    False
J   Rainy     Mild   Normal    False
K   Sunny     Mild   Normal    True
L   Overcast  Mild   High      True
M   Overcast  Hot    Normal    False
N   Rainy     Mild   High      True

[Figure: clustering tree after the first instance]

SLIDE 41

Clustering weather data

[Weather data table as on slide 40]

[Figure: clustering tree with clusters 1 and 2]

Start new clusters, up to a point

SLIDE 42

Category Utility

  • Category utility: overall quality of a clustering
  • A quadratic loss function
    • nominal: clusters Cl, attributes ai, values vij:

      CU(C1, ..., Ck) = (1/k) Σl Pr[Cl] Σi Σj ( Pr[ai=vij | Cl]^2 − Pr[ai=vij]^2 )

    • numeric: similar, assume Gaussian distribution
  • Intuitively:
    • good clusters let us predict the values of new data points: Pr[ai=vij | Ci] > Pr[ai=vij]
    • the 1/k factor: a penalty for using many clusters (avoids overfitting)
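A direct transcription of that formula for nominal attributes; the data layout (each cluster a list of dicts mapping attribute to value) is an assumption for illustration:

```python
from collections import Counter

def category_utility(clusters, attributes):
    """CU = (1/k) * sum_l P(Cl) * sum_i sum_j (P(ai=vij|Cl)^2 - P(ai=vij)^2)."""
    n = sum(len(c) for c in clusters)
    k = len(clusters)
    # Pr[ai=vij]^2 summed over all values vij, per attribute (the baseline term)
    baseline = {}
    for a in attributes:
        counts = Counter(x[a] for c in clusters for x in c)
        baseline[a] = sum((c / n) ** 2 for c in counts.values())
    total = 0.0
    for cluster in clusters:
        p_cl = len(cluster) / n                      # Pr[Cl]
        for a in attributes:
            counts = Counter(x[a] for x in cluster)  # value counts within Cl
            within = sum((c / len(cluster)) ** 2 for c in counts.values())
            total += p_cl * (within - baseline[a])
    return total / k                                 # penalty for many clusters
```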

SLIDE 43

Clustering weather data

[Weather data table as on slide 40]

[Figure: clustering tree, cluster 1]

SLIDE 44

Clustering weather data

[Weather data table as on slide 40]

[Figure: clustering tree, clusters 1 and 2]

  • Max. number depends on k

SLIDE 45

Clustering weather data

[Weather data table as on slide 40]

[Figure: clustering tree, clusters 1, 2 and 3]

  • Max. number depends on k
  • Join with most similar leaf: new cluster

SLIDE 46

Clustering weather data

[Weather data table as on slide 40]

[Figure: clustering tree, stage 4]

SLIDE 47

Clustering weather data

[Weather data table as on slide 40]

[Figure: clustering tree, stages 4 and 5]

A and D equally good: merge

SLIDE 48

Clustering weather data

[Weather data table as on slide 40]

[Figure: clustering tree, stages 4 and 5]

A and D equally good: merge
Consider splitting the best host if merging doesn’t help

SLIDE 49

Final hierarchy

[Weather data rows A–D, as on slide 40]

[Figure: final hierarchy]

Note that A and B are actually very similar, but end up in different clusters.

SLIDE 50

Incremental clustering

  • For large, regularly updated databases
  • Start with a tree consisting of an empty root node
  • Add instances one by one, updating the tree appropriately at each stage:
    • form a new leaf, or
    • join the instance with the most similar leaf: a new node (cluster), or
    • merge existing leaves (move down one level), or
    • split a node into leaves (move up one level)
  • Best decision: category utility
SLIDE 51

Probability-based Clustering

SLIDE 52

Probability-based Clustering

  • Given k clusters, each instance belongs to all clusters with a certain probability
    • mixture model: a set of k distributions (one per cluster)
    • also: each cluster has a prior probability
  • If the correct clustering were known, we would know the parameters and P(Ci) of each cluster: calculate P(Ci|x) using Bayes’ rule
  • Here we must estimate the unknown parameters instead
    • How?
SLIDE 53

EM: Expectation Maximization

  • Finds the parameters of the distributions and the cluster memberships
  • (Random) initialization
    • initial parameters and P(Ci) for each cluster
  • Iterative algorithm:
    • Expectation step: with the current parameters, calculate P(C|x)
    • Maximization step: update the parameters using P(C|x): new parameters and P(Ci)
  • Iterate until converged to a local optimum
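A minimal sketch of these two steps for a 1-D mixture of two Gaussians; the initialization and the fixed iteration count are illustrative simplifications:

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, n_iter=50):
    """x: 1-D numpy array. Returns means, std devs, priors of 2 components."""
    # initialization: means at the extremes, shared spread, equal priors P(Ci)
    mu = np.array([x.min(), x.max()], dtype=float)
    sigma = np.array([x.std(), x.std()])
    prior = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E step: P(Ci | x) for every point, via Bayes' rule
        dens = np.stack([p * norm.pdf(x, m, s)
                         for p, m, s in zip(prior, mu, sigma)])  # shape (2, n)
        resp = dens / dens.sum(axis=0)
        # M step: update parameters using the soft memberships P(Ci | x)
        weight = resp.sum(axis=1)
        mu = (resp * x).sum(axis=1) / weight
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / weight)
        prior = weight / len(x)
    return mu, sigma, prior
```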
SLIDE 54

EM vs K-means

http://www.youtube.com/watch?v=1CWDWmF0i2s

SLIDE 55–56

Quiz

SLIDE 57

Clustering Evaluation

  • Manual inspection
  • Benchmarking on existing labels
  • Cluster quality measures
    • distance measures
    • high similarity within a cluster, low across clusters
SLIDE 58

Goodness of Fit

Given a function that defines the cluster, calculate for each point how well it fits the cluster.
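One standard measure of this kind is the silhouette score, which quantifies how well each point fits its own cluster relative to the nearest other cluster; a sketch with scikit-learn (assuming data X and cluster labels are at hand):

```python
from sklearn.metrics import silhouette_score, silhouette_samples

overall = silhouette_score(X, labels)       # one score for the whole clustering
per_point = silhouette_samples(X, labels)   # goodness of fit for each point
```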

SLIDE 59

How to choose k?

  • One important parameter: k. But how to choose it?
    • domain dependent: we may simply want k clusters
  • Alternative: repeat for several values of k and choose the best (see the sketch below)
  • Example:
    • cluster mammal properties into k different clusters
    • use an MDL-based encoding (an alternative to category utility)
    • each additional cluster introduces a penalty
    • optimal for k=6
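In the same select-over-several-k spirit (though using inertia rather than the MDL encoding from the slide), a common quick check sweeps k and looks for the “elbow”; assuming data X:

```python
from sklearn.cluster import KMeans

# Sweep k and record the within-cluster sum of squares (inertia);
# look for the elbow where adding clusters stops paying off.
scores = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = km.inertia_
```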
SLIDE 60

Self-Organizing Maps

SLIDE 61–62

Self Organizing Map

http://www.youtube.com/watch?v=71wmOT4lHWc

SLIDE 63

Self Organizing Map

  • Applications
    • group similar data together
    • dimensionality reduction
    • data visualization technique
  • Similar to neural networks
    • neurons try to mimic the input vectors
    • the winning neuron (and its neighborhood) gets updated
    • topology preserving, using a neighborhood function

SLIDE 64

SOM Learning Algorithm

  • Initialize the SOM (randomly, or such that dissimilar inputs are mapped far apart)
  • For t from 0 to N:
    • randomly select a training instance
    • find the best matching neuron, e.g. by smallest Euclidean distance
    • scale the neighbors: who counts as a neighbor (hexagons, squares, Gaussian) decreases over time
    • update the neighbors towards the training instance
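A compact NumPy sketch of this loop for a 2-D grid; the grid size, linear decay schedules, and Gaussian neighborhood are illustrative choices:

```python
import numpy as np

def train_som(X, grid=(10, 10), n_steps=1000, lr0=0.5, radius0=5.0, seed=0):
    """Train a 2-D SOM on data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h, w, X.shape[1]))  # random initialization
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w),
                                  indexing="ij"), axis=-1)
    for t in range(n_steps):
        x = X[rng.integers(len(X))]           # randomly select a training instance
        # best matching unit: smallest Euclidean distance to x
        d = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(d.argmin(), d.shape)
        # learning rate and neighborhood radius both decrease over time
        frac = 1.0 - t / n_steps
        lr, radius = lr0 * frac, max(radius0 * frac, 1e-3)
        # Gaussian neighborhood around the BMU on the grid
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
        influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
        # update neighbors towards the training instance
        weights += lr * influence[..., None] * (x - weights)
    return weights
```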
SLIDE 65

Self Organizing Map

  • Neighborhood function to preserve topological properties of the input space
  • Neighbors share the prize (but slightly less)

[Figure: n-dimensional input mapped to the output grid; the winner neuron highlighted]

SLIDE 66

Self Organizing Map

  • Input: uniformly randomly distributed points
  • Output: a map of 20² neurons
  • Training: starting with a large learning rate and neighborhood size, both are gradually decreased to facilitate convergence
  • After learning, neurons with similar weights tend to cluster on the map

SLIDE 67

Discussion

  • Clusters can be interpreted by using supervised learning
    • learn a classifier based on the clusters
  • Decrease dependence between attributes?
    • as a pre-processing step
    • e.g. use principal component analysis
  • Can be used to fill in missing values
  • Key advantage of probabilistic clustering:
    • can estimate the likelihood of the data
    • use it to compare different models objectively