Hierarchical and Ensemble Clustering
Ke Chen Reading: [7.8-7.10, EA], [25.5, KPM], [Fred & Jain, 2005]
COMP24111 Machine Learning
Outline
– Introduction
– Cluster Distance Measures
– Agglomerative Algorithm
– Example and …
– Hierarchical clustering: a typical clustering analysis approach that partitions a data set sequentially
– Constructs nested partitions layer by layer by grouping objects into a tree of clusters (without the need to know the number of clusters in advance)
– Uses a (generalised) distance matrix as the clustering criterion
– Agglomerative: a bottom-up strategy
– Divisive: a top-down strategy
– Ensemble clustering: uses multiple clustering results for robustness and to overcome the weaknesses of single clustering algorithms.
Agglomerative and divisive clustering on the data set {a, b, c, d, e}
[Figure: dendrogram over {a, b, c, d, e}. The agglomerative direction runs from Step 0 (five singleton clusters) to Step 4 (one cluster), merging {a, b} and {d, e}, then {c, d, e}, and finally {a, b, c, d, e}; the divisive direction traverses the same tree from Step 4 back to Step 0 by splitting.]
– Single link (min): the smallest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = min{ d(xip, xjq) }
– Complete link (max): the largest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = max{ d(xip, xjq) }
– Average: the average distance between elements in one cluster and elements in the other, i.e., d(Ci, Cj) = avg{ d(xip, xjq) }
– The distance of a cluster to itself is zero: d(C, C) = 0
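The three measures are easy to express directly. Below is a minimal Python sketch; the function names and the 1-D absolute-difference distance d are illustrative choices, not part of the slides, and any distance function could be substituted:

```python
from itertools import product

def d(x, y):
    """Distance between two single-feature objects (absolute difference)."""
    return abs(x - y)

def single_link(Ci, Cj):
    """Smallest distance between an element of Ci and an element of Cj."""
    return min(d(x, y) for x, y in product(Ci, Cj))

def complete_link(Ci, Cj):
    """Largest distance between an element of Ci and an element of Cj."""
    return max(d(x, y) for x, y in product(Ci, Cj))

def average_link(Ci, Cj):
    """Average distance between elements of Ci and elements of Cj."""
    return sum(d(x, y) for x, y in product(Ci, Cj)) / (len(Ci) * len(Cj))
```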
Example: given a data set of five objects characterised by a single continuous feature, assume that there are two clusters: C1 = {a, b} and C2 = {c, d, e}.
Feature values:
          a  b  c  d  e
  Feature 1  2  4  5  6

Distance matrix (absolute differences):
      a  b  c  d  e
  a   0  1  3  4  5
  b   1  0  2  3  4
  c   3  2  0  1  2
  d   4  3  1  0  1
  e   5  4  2  1  0

Single link, complete link and average distances between C1 and C2:
Single link:
  dist(C1, C2) = min{ d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e) } = min{3, 4, 5, 2, 3, 4} = 2

Complete link:
  dist(C1, C2) = max{ d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e) } = max{3, 4, 5, 2, 3, 4} = 5

Average:
  dist(C1, C2) = ( d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e) ) / 6 = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 21 / 6 = 3.5
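The same numbers can be checked with a few lines of Python, using the feature values from the table above (the variable names are illustrative):

```python
from itertools import product

feature = {"a": 1, "b": 2, "c": 4, "d": 5, "e": 6}
C1, C2 = ["a", "b"], ["c", "d", "e"]

# All pairwise distances between C1 and C2 on the single feature.
pairs = [abs(feature[p] - feature[q]) for p, q in product(C1, C2)]

print(min(pairs))               # single link   -> 2
print(max(pairs))               # complete link -> 5
print(sum(pairs) / len(pairs))  # average       -> 3.5
```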
The agglomerative algorithm (sketched in code below):
1) Convert all object features into a distance matrix
2) Set each object as a cluster (thus if we have N objects, we will have N clusters at the beginning)
3) Repeat until the number of clusters is one (or equals a known number of clusters): merge the two closest clusters and update the distance matrix
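A minimal sketch of this loop in Python/NumPy, assuming a precomputed distance matrix and single-link updates; the function name, the single-link choice and the implementation details are illustrative assumptions:

```python
import numpy as np

def agglomerative(D, k=1):
    """Single-link agglomerative clustering on an n x n distance matrix D.
    Merges the two closest clusters until only k clusters remain and
    returns the clusters as lists of original object indices."""
    clusters = [[i] for i in range(len(D))]   # step 2: every object starts as a cluster
    D = np.asarray(D, dtype=float).copy()
    np.fill_diagonal(D, np.inf)               # ignore self-distances

    while len(clusters) > k:                  # step 3: repeat until k clusters remain
        i, j = np.unravel_index(np.argmin(D), D.shape)
        i, j = min(i, j), max(i, j)
        clusters[i].extend(clusters.pop(j))   # merge cluster j into cluster i
        # Single link: distance to the merged cluster is the minimum of the two rows.
        D[i, :] = np.minimum(D[i, :], D[j, :])
        D[:, i] = D[i, :]
        D[i, i] = np.inf
        D = np.delete(np.delete(D, j, axis=0), j, axis=1)
    return clusters
```

For instance, agglomerative(D, k=2) on the 5 x 5 distance matrix from the worked example above groups {a, b} and {c, d, e}.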
[Example: the data matrix is converted into a distance matrix using the Euclidean distance.]
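A small sketch of this conversion in Python/NumPy; the data matrix below is a hypothetical placeholder (one object per row), not the values shown on the slide:

```python
import numpy as np

# Hypothetical data matrix: one object per row, one feature per column.
X = np.array([[1.0, 2.0],
              [2.5, 4.5],
              [2.0, 2.0],
              [4.0, 1.5],
              [4.0, 2.5]])

# Pairwise Euclidean distance matrix: D[i, j] = ||X[i] - X[j]||.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(D, 2))
```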
[The example then proceeds iteration by iteration, recomputing the distance matrix after each merge (iteration 3, iteration 4, ...).]
– Start with six singleton clusters: A, B, C, D, E and F
– Merge D and F into cluster (D, F) at distance 0.50
– Merge A and B into (A, B) at distance 0.71
– Merge (D, F) and E into ((D, F), E) at distance 1.00
– Merge ((D, F), E) and C into (((D, F), E), C) at distance 1.41
– Merge (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50, which concludes the computation
[Dendrogram of the six objects; the vertical axis shows the merge distance, which determines cluster lifetimes.]
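With SciPy the same kind of dendrogram can be produced directly. The 2-D coordinates below are hypothetical, chosen only so that single-link merges occur at the distances quoted above (0.50, 0.71, 1.00, 1.41, 2.50); they are not the coordinates used on the slides:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

labels = ["A", "B", "C", "D", "E", "F"]
# Hypothetical 2-D coordinates for the six objects.
X = np.array([[4.0, 2.0],   # A
              [4.5, 2.5],   # B
              [1.5, 2.0],   # C
              [0.0, 0.0],   # D
              [0.5, 1.0],   # E
              [0.5, 0.0]])  # F

Z = linkage(pdist(X), method="single")   # rows: (cluster_i, cluster_j, distance, size)
dendrogram(Z, labels=labels)
plt.ylabel("merge distance (lifetime)")
plt.show()
```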
Lifetime of a cluster: the difference between the distance at which the cluster is created and the distance at which it disappears (merges with another cluster during clustering). E.g. the lifetimes of A, B, C, D, E and F are 0.71, 0.71, 1.41, 0.50, 1.00 and 0.50 respectively, and the lifetime of (A, B) is 2.50 - 0.71 = 1.79, ...
K-cluster lifetime: the distance range from the point where K clusters emerge to the point where they vanish (due to the reduction to K-1 clusters). E.g. the 5-cluster lifetime is 0.71 - 0.50 = 0.21, the 4-cluster lifetime is 1.00 - 0.71 = 0.29, the 3-cluster lifetime is 1.41 - 1.00 = 0.41, and the 2-cluster lifetime is 2.50 - 1.41 = 1.09 (computed in the sketch below).
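A short sketch that computes the K-cluster lifetimes from the sorted merge distances and applies the maximum-lifetime heuristic to pick K; the merge distances are those of the example above:

```python
# Merge distances read off the dendrogram, in increasing order.
merge_distances = [0.50, 0.71, 1.00, 1.41, 2.50]
n = len(merge_distances) + 1          # number of objects (6 here)

# Between the i-th and the (i+1)-th merge there are n - i clusters,
# so the K-cluster lifetime is the gap between consecutive merge distances.
lifetimes = {}
for i, (d_prev, d_next) in enumerate(zip(merge_distances, merge_distances[1:]), start=1):
    lifetimes[n - i] = d_next - d_prev

print(lifetimes)   # roughly {5: 0.21, 4: 0.29, 3: 0.41, 2: 1.09}
best_k = max(lifetimes, key=lifetimes.get)
print(best_k)      # 2 -- cut the dendrogram where the lifetime is largest
```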
Agglomerative Demo
– If the number of clusters is known, the termination condition is given!
– The K-cluster lifetime is the range of threshold values on the dendrogram tree that leads to the identification of K clusters
– Heuristic rule: cut the dendrogram tree at the maximum lifetime to find a "proper" K
– Can never undo what was done previously
– Sensitive to cluster distance measures and noise/outliers
– Less efficient: O(n² log n), where n is the total number of objects
– BIRCH: scalable to large data sets
– ROCK: clustering categorical data
– CHAMELEON: hierarchical clustering using dynamic modelling
– A single clustering algorithm may be affected by various factors, e.g. its initialisation and parameter settings
– An effective treatment: the clustering ensemble
– A simple clustering ensemble algorithm that overcomes the main weaknesses of different clustering methods by exploiting their synergy via evidence accumulation [Fred & Jain, 2005]
– Initial clustering analysis using either different clustering algorithms or a single clustering algorithm run under different conditions, leading to multiple partitions, e.g. K-means with various initial centroid settings and different K, or the agglomerative algorithm with different distance metrics forced to terminate with different numbers of clusters, ...
– Convert the clustering results on the different partitions into binary "distance" matrices
– Evidence accumulation: form a collective "distance" matrix from all the binary "distance" matrices
– Apply a hierarchical clustering algorithm (with a proper cluster distance metric) to the collective "distance" matrix and use the maximum K-cluster lifetime to decide K (see the sketch below)
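A minimal end-to-end sketch of this recipe, assuming scikit-learn's KMeans for the initial partitions and SciPy's single-link agglomerative clustering for the final step; every parameter choice below is illustrative rather than prescribed by the slides:

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def evidence_accumulation(X, k_range=range(2, 12), runs_per_k=3, seed=0):
    """Clustering ensemble by evidence accumulation on a data matrix X (n x d)."""
    n = len(X)
    rng = np.random.RandomState(seed)
    collective = np.zeros((n, n))

    # 1) Initial clustering analysis: many K-means partitions under different conditions.
    for k in k_range:
        for _ in range(runs_per_k):
            labels = KMeans(n_clusters=k, n_init=1,
                            random_state=rng.randint(10**6)).fit_predict(X)
            # 2) Binary "distance" matrix: 1 where a pair falls in different clusters.
            # 3) Evidence accumulation: sum into the collective "distance" matrix.
            collective += (labels[:, None] != labels[None, :]).astype(float)

    # 4) Single-link agglomerative clustering on the collective "distance" matrix.
    Z = linkage(squareform(collective), method="single")

    # Choose K by the maximum K-cluster lifetime: gaps between consecutive merge distances.
    gaps = np.diff(Z[:, 2])               # gaps[i] is the lifetime of (n - 1 - i) clusters
    best_k = n - 1 - int(np.argmax(gaps))
    return fcluster(Z, t=best_k, criterion="maxclust"), best_k
```

With the defaults above, evidence_accumulation(X) mirrors the configuration of the later example: 30 K-means partitions (K = 2, ..., 11, three runs each), one collective matrix, single-link clustering, and K chosen by the maximum lifetime.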
Example: four objects A, B, C and D are grouped by one clustering run into two clusters, Cluster 1 (C1) and Cluster 2 (C2). This partition is converted into a binary "distance" matrix D1: the entry for a pair of objects is 1 if they fall in different clusters and 0 if they fall in the same cluster (here D1 contains eight 1-entries, as each cluster holds two objects).
A second clustering run groups the same objects into three clusters, Cluster 1 (C1), Cluster 2 (C2) and Cluster 3 (C3), and is converted in the same way into a second binary "distance" matrix D2 (containing ten 1-entries, as more pairs are now separated).
Evidence accumulation: the binary "distance" matrices are summed element-wise into the collective "distance" matrix DC = D1 + D2. A pair of objects separated in both partitions receives the value 2, a pair separated in only one partition receives 1, and a pair that always shares a cluster receives 0.
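A tiny sketch of this conversion for the four objects; the cluster assignments below are hypothetical (the exact assignments on the slides are not reproduced here), and only the construction of D1, D2 and DC is the point:

```python
import numpy as np

objects = ["A", "B", "C", "D"]
# Hypothetical partitions: run 1 has two clusters, run 2 has three clusters.
labels_1 = np.array([0, 0, 1, 1])      # e.g. C1 = {A, B}, C2 = {C, D}
labels_2 = np.array([0, 0, 1, 2])      # e.g. C1 = {A, B}, C2 = {C}, C3 = {D}

def binary_distance(labels):
    """Entry (i, j) is 1 if objects i and j are in different clusters, else 0."""
    return (labels[:, None] != labels[None, :]).astype(int)

D1 = binary_distance(labels_1)          # eight 1-entries
D2 = binary_distance(labels_2)          # ten 1-entries
DC = D1 + D2                            # collective "distance" matrix
print(DC)
```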
– Data set of 400 data points
– Initial clustering analysis: K-means (K = 2, ..., 11) with 3 initial settings per K, giving 30 partitions in total
– Convert the clustering results to binary "distance" matrices and form the collective "distance" matrix
– Apply the agglomerative algorithm (single link) to the collective "distance" matrix
– Cut the dendrogram tree at the maximum K-cluster lifetime to decide K
– Hierarchical clustering uses a distance matrix to construct a tree of clusters (dendrogram)
– It gives a hierarchical representation without the need to know the number of clusters (a termination condition can be set when the number of clusters is known)
– Weaknesses: can never undo what was done previously; sensitive to cluster distance measures and noise/outliers; less efficient, O(n² log n), where n is the total number of objects
– Clustering ensemble: initial clustering under different conditions, e.g. K-means with different K and initialisations
– Evidence accumulation builds a "collective" distance matrix
– Apply the agglomerative algorithm to the "collective" distance matrix and choose K by the maximum K-cluster lifetime
Online tutorial: how to use hierarchical clustering functions in Matlab:
https://www.youtube.com/watch?v=aYzjenNNOcc