Machine Learning (AIMS) - MT 2017: 2. Clustering - Varun Kanade


SLIDE 1

Machine Learning (AIMS) - MT 2017

2. Clustering

Varun Kanade, University of Oxford, November 7, 2017

SLIDE 2

Outline

This week, we will study some approaches to clustering

◮ Defining an objective function for clustering
◮ k-Means formulation for clustering
◮ Multidimensional Scaling
◮ Hierarchical clustering
◮ Spectral clustering

SLIDE 3

England pushed towards Test defeat by India
France election: Socialists scramble to avoid split after Fillon win
Giants Add to the Winless Browns’ Misery
Strictly Come Dancing: Ed Balls leaves programme
Trump Claims, With No Evidence, That ‘Millions of People’ Voted Illegally
Vive ‘La Binoche’, the reigning queen of French cinema

SLIDE 4

Sports: England pushed towards Test defeat by India
Politics: France election: Socialists scramble to avoid split after Fillon win
Sports: Giants Add to the Winless Browns’ Misery
Film&TV: Strictly Come Dancing: Ed Balls leaves programme
Politics: Trump Claims, With No Evidence, That ‘Millions of People’ Voted Illegally
Film&TV: Vive ‘La Binoche’, the reigning queen of French cinema

SLIDE 5

England: England pushed towards Test defeat by India
France: France election: Socialists scramble to avoid split after Fillon win
USA: Giants Add to the Winless Browns’ Misery
England: Strictly Come Dancing: Ed Balls leaves programme
USA: Trump Claims, With No Evidence, That ‘Millions of People’ Voted Illegally
France: Vive ‘La Binoche’, the reigning queen of French cinema

SLIDE 6

Clustering

Often data can be grouped into subsets that are coherent; however, the grouping may be subjective, and it is hard to define a single general framework. Two types of clustering algorithms:

  • 1. Feature-based - Points are represented as vectors in R^D
  • 2. (Dis)similarity-based - Only know pairwise (dis)similarities

Two types of clustering methods

  • 1. Flat - Partition the data into k clusters
  • 2. Hierarchical - Organise data as clusters, clusters of clusters, and so on

SLIDE 7

Defining Dissimilarity

◮ Weighted dissimilarity between (real-valued) attributes:

  d(x, x') = f\left( \sum_{i=1}^{D} w_i \, d_i(x_i, x'_i) \right)

◮ In the simplest setting w_i = 1, d_i(x_i, x'_i) = (x_i - x'_i)^2 and f(z) = z, which corresponds to the squared Euclidean distance
◮ Weights allow us to emphasise features differently
◮ If features are ordinal or categorical then define distance suitably
◮ Standardisation (mean 0, variance 1) may or may not help

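Below is a minimal Python sketch of this weighted dissimilarity (my own illustration, not from the slides); with the default arguments it reduces to the squared Euclidean distance described above.

    import numpy as np

    def weighted_dissimilarity(x, x2, w=None, d=None, f=None):
        """d(x, x') = f(sum_i w_i * d_i(x_i, x'_i)) -- a sketch of the slide's definition."""
        x, x2 = np.asarray(x, float), np.asarray(x2, float)
        w = np.ones_like(x) if w is None else w               # default weights w_i = 1
        d = (lambda a, b: (a - b) ** 2) if d is None else d   # default d_i: squared difference
        f = (lambda z: z) if f is None else f                 # default f(z) = z
        return f(np.sum(w * d(x, x2)))

    # With the defaults this is exactly the squared Euclidean distance:
    # weighted_dissimilarity([1.0, 2.0], [3.0, 0.0]) == 8.0
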
SLIDE 8

Helpful Standardisation

SLIDE 9

Unhelpful Standardisation

SLIDE 10

Partition Based Clustering

Want to partition the data into subsets C_1, ..., C_k, where k is fixed in advance.

Define the quality of a partition by

  W(C) = \frac{1}{2} \sum_{j=1}^{k} \frac{1}{|C_j|} \sum_{i, i' \in C_j} d(x_i, x_{i'})

If we use d(x, x') = \|x - x'\|^2, then

  W(C) = \sum_{j=1}^{k} \sum_{i \in C_j} \|x_i - \mu_j\|^2,   where   \mu_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i

The objective is minimising the sum of squared distances to the mean within each cluster.

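A small NumPy sketch (my own illustration) of this within-cluster sum of squares for a given partition, with clusters encoded as an integer label per point:

    import numpy as np

    def within_cluster_ss(X, labels):
        """W(C) = sum_j sum_{i in C_j} ||x_i - mu_j||^2 for points X (N x D) and cluster labels."""
        W = 0.0
        for j in np.unique(labels):
            cluster = X[labels == j]
            mu_j = cluster.mean(axis=0)              # cluster mean
            W += np.sum((cluster - mu_j) ** 2)       # squared distances to the mean
        return W
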
SLIDE 11

Outline

Clustering Objective
k-Means Formulation of Clustering
Multidimensional Scaling
Hierarchical Clustering
Spectral Clustering

SLIDE 12

Partition Based Clustering : k-Means Objective

Minimise jointly over partitions C_1, ..., C_k and means \mu_1, ..., \mu_k:

  W(C) = \sum_{j=1}^{k} \sum_{i \in C_j} \|x_i - \mu_j\|^2

This problem is NP-hard even for k = 2 for points in R^D.

If we fix \mu_1, ..., \mu_k, finding a partition (C_j)_{j=1}^{k} that minimises W is easy:

  C_j = \{ i \mid \|x_i - \mu_j\| = \min_{j'} \|x_i - \mu_{j'}\| \}

If we fix the clusters C_1, ..., C_k, minimising W with respect to (\mu_j)_{j=1}^{k} is easy:

  \mu_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i

Iteratively run these two steps: assignment and update.

SLIDES 13-17

(Figure-only slides.)

SLIDE 18

Ground Truth Clusters k-Means Clusters (k = 3)

SLIDE 19

The k-Means Algorithm

  • 1. Initialise means \mu_1, ..., \mu_k "randomly"
  • 2. Repeat until convergence:
  • a. Find assignments of data to clusters, assigning each point to the cluster whose mean is closest, to obtain C_1, ..., C_k:

    C_j = \{ i \mid j = \operatorname{argmin}_{j'} \|x_i - \mu_{j'}\|^2 \}

  • b. Update means using the current cluster assignments:

    \mu_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i

Note 1: Ties can be broken arbitrarily.
Note 2: Choosing k random datapoints as the initial means is a good idea.

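The following NumPy sketch implements these two alternating steps (an illustration under my own naming, not code from the course); running it several times with different seeds and keeping the solution with the lowest W(C) guards against bad local minima (see the next slide).

    import numpy as np

    def k_means(X, k, n_iters=100, seed=0):
        """Lloyd's algorithm sketch: alternate assignment and mean-update steps."""
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        # Note 2 from the slide: initialise with k random datapoints
        mu = X[rng.choice(len(X), size=k, replace=False)].copy()
        labels = np.full(len(X), -1)
        for _ in range(n_iters):
            # Assignment step: each point joins the cluster with the closest mean
            dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break                                  # assignments unchanged: converged
            labels = new_labels
            # Update step: recompute each mean from its assigned points
            for j in range(k):
                if np.any(labels == j):
                    mu[j] = X[labels == j].mean(axis=0)
        return labels, mu
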
SLIDE 20

The k-Means Algorithm

Does the algorithm always converge? Yes: the objective

  W(C) = \sum_{j=1}^{k} \sum_{i \in C_j} \|x_i - \mu_j\|^2

decreases every time the partition changes, and there are only finitely many partitions.

Convergence may be very slow in the worst case, but is typically fast on real-world instances.
Convergence may only be to a local minimum, so run the algorithm multiple times with random initialisation.
Other criteria can also be used: k-medoids, k-centres, etc.
Selecting the right k is not easy: plot W against k and identify a "kink".

SLIDE 21

Ground Truth Clusters k-Means Clusters (k = 4)

SLIDE 22

Choosing the number of clusters k

(Figure: MSE on test vs K for K-means.)

◮ As in the case of PCA, larger k will give a better value of the objective
◮ Choose a suitable k by identifying a "kink" or "elbow" in the curve

(Source: Kevin Murphy, Chap 11)

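A short sketch of this heuristic, assuming the hypothetical k_means and within_cluster_ss helpers sketched earlier:

    import numpy as np

    def elbow_curve(X, k_values):
        """Run k-means for each k and record the objective W(C); plot it and look for the elbow."""
        scores = []
        for k in k_values:
            labels, mu = k_means(X, k)
            scores.append(within_cluster_ss(X, labels))
        return np.array(scores)

    # e.g. W = elbow_curve(X, range(2, 17)), then plot W against k and pick the "kink"
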
SLIDE 23

Outline

Clustering Objective
k-Means Formulation of Clustering
Multidimensional Scaling
Hierarchical Clustering
Spectral Clustering

SLIDE 24

Multidimensional Scaling (MDS)

In certain cases, it may be easier to define (dis)similarity between objects than to embed them in Euclidean space.
Algorithms such as k-means require points to be in Euclidean space.

Ideal setting: suppose for some N points in R^D we are given all pairwise Euclidean distances in a matrix D.
Can we reconstruct x_1, ..., x_N, i.e., all of X?

SLIDE 25

Multidimensional Scaling

Distances are preserved under translation, rotation, reflection, etc. We cannot recover X exactly; we can only aim to determine X up to these transformations.

If D_ij is the distance between points x_i and x_j, then

  D_{ij}^2 = \|x_i - x_j\|^2
           = x_i^T x_i - 2 x_i^T x_j + x_j^T x_j
           = M_{ii} - 2 M_{ij} + M_{jj}

Here M = X X^T is the N × N matrix of dot products.

Exercise: Show that, assuming \sum_i x_i = 0, M can be recovered from D.

SLIDE 26

Multidimensional Scaling

Consider the (full) SVD: X = U \Sigma V^T. We can write M as M = X X^T = U \Sigma \Sigma^T U^T.

Starting from M, we can reconstruct \tilde{X} using the eigendecomposition of M:

  M = U \Lambda U^T

Because M is symmetric and positive semi-definite, U^T = U^{-1} and all entries of the diagonal matrix \Lambda are non-negative. Let \tilde{X} = U \Lambda^{1/2}.

If we are satisfied with an approximate reconstruction, we can use a truncated eigendecomposition.

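A NumPy sketch of this classical MDS reconstruction (my own illustration); the double-centering step is one way to recover M from D under the zero-mean assumption of the exercise on the previous slide.

    import numpy as np

    def classical_mds(D, l=2):
        """Embed N objects in R^l from an N x N matrix D of pairwise Euclidean distances."""
        N = D.shape[0]
        J = np.eye(N) - np.ones((N, N)) / N        # centering matrix
        M = -0.5 * J @ (D ** 2) @ J                # recover dot products M = X X^T (up to centering)
        eigvals, eigvecs = np.linalg.eigh(M)       # M is symmetric PSD (up to numerical noise)
        order = np.argsort(eigvals)[::-1][:l]      # keep the l largest eigenvalues
        lam = np.clip(eigvals[order], 0.0, None)   # clip tiny negative values from round-off
        return eigvecs[:, order] * np.sqrt(lam)    # X_tilde = U Lambda^{1/2}, truncated to l columns
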
SLIDE 27

Multidimensional Scaling: Additional Comments

In general, if we define (dis)similarities on objects such as text documents, genetic sequences, etc., we cannot be sure that the resulting similarity matrix M will be positive semi-definite, or that the dissimilarity matrix D is a valid squared Euclidean distance.

In such cases, we cannot always find a Euclidean embedding that recovers the (dis)similarities exactly.

Minimise a stress function instead: find z_1, ..., z_N that minimise

  S(Z) = \sum_{i \neq j} \left( D_{ij} - \|z_i - z_j\| \right)^2

Several other types of stress functions can be used.

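One possible way to minimise such a stress function numerically, sketched with SciPy's general-purpose optimiser (the slides do not prescribe any particular method):

    import numpy as np
    from scipy.optimize import minimize

    def mds_by_stress(D, l=2, seed=0):
        """Find z_1..z_N in R^l minimising S(Z) = sum_{i != j} (D_ij - ||z_i - z_j||)^2."""
        N = D.shape[0]
        def stress(z_flat):
            Z = z_flat.reshape(N, l)
            diff = Z[:, None, :] - Z[None, :, :]
            dist = np.linalg.norm(diff, axis=2)          # pairwise distances of the embedding
            mask = ~np.eye(N, dtype=bool)                # exclude the i == j terms
            return np.sum((D - dist)[mask] ** 2)
        z0 = np.random.default_rng(seed).normal(size=N * l)
        res = minimize(stress, z0, method="L-BFGS-B")    # gradient approximated by finite differences
        return res.x.reshape(N, l)
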
SLIDE 28

Multidimensional Scaling: Summary

◮ In certain applications, it may be easier to define pairwise similarities or distances than to construct a Euclidean embedding of discrete objects, e.g., genetic data, text data, etc.

◮ Many machine learning algorithms require (or are more naturally expressed with) data in some Euclidean space.

◮ Multidimensional Scaling gives a way to find an embedding of the data in Euclidean space that (approximately) respects the original distance/similarity values.

SLIDE 29

Outline

Clustering Objective
k-Means Formulation of Clustering
Multidimensional Scaling
Hierarchical Clustering
Spectral Clustering

SLIDE 30

Hierarchical Clustering

Hierarchically structured data exists all around us:

◮ Measurements of different species and individuals within species
◮ Top-level and low-level categories in news articles
◮ Country, county, town level data

Two Algorithmic Strategies for Clustering

◮ Agglomerative: Bottom-up, clusters formed by merging smaller clusters
◮ Divisive: Top-down, clusters formed by splitting larger clusters

Visualise this as a dendrogram or tree

SLIDE 31

Measuring Dissimilarity at Cluster Level

To find hierarchical clusters we need to define dissimilarity at the cluster level, not just between datapoints.
Suppose we have a dissimilarity at the datapoint level, e.g., d(x, x') = \|x - x'\|.
There are different ways to define dissimilarity between clusters, say C and C':

◮ Single Linkage

  D(C, C') = \min_{x \in C, x' \in C'} d(x, x')

◮ Complete Linkage

  D(C, C') = \max_{x \in C, x' \in C'} d(x, x')

◮ Average Linkage

  D(C, C') = \frac{1}{|C| \cdot |C'|} \sum_{x \in C, x' \in C'} d(x, x')

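A small NumPy sketch of these three cluster-level dissimilarities (illustration only; the function name is my own):

    import numpy as np

    def cluster_dissimilarity(C, C2, linkage="single"):
        """D(C, C') under single, complete or average linkage, with d(x, x') = ||x - x'||."""
        C, C2 = np.asarray(C, float), np.asarray(C2, float)
        # all pairwise distances d(x, x') for x in C, x' in C'
        pairwise = np.linalg.norm(C[:, None, :] - C2[None, :, :], axis=2)
        if linkage == "single":
            return pairwise.min()        # closest pair
        if linkage == "complete":
            return pairwise.max()        # farthest pair
        return pairwise.mean()           # average linkage: mean over all |C|*|C'| pairs
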
SLIDE 32

Measuring Dissimilarity at Cluster Level

◮ Single Linkage

  D(C, C') = \min_{x \in C, x' \in C'} d(x, x')

◮ Complete Linkage

  D(C, C') = \max_{x \in C, x' \in C'} d(x, x')

◮ Average Linkage

  D(C, C') = \frac{1}{|C| \cdot |C'|} \sum_{x \in C, x' \in C'} d(x, x')

SLIDE 33

Linkage-based Clustering Algorithm

  • 1. Initialise clusters as singletons: C_i = \{i\}
  • 2. Initialise the set of clusters available for merging: S = \{1, ..., N\}
  • 3. Repeat:
  • a. Pick the two most similar clusters: (j, k) = \operatorname{argmin}_{j, k \in S} D(j, k)
  • b. Let C_l = C_j \cup C_k
  • c. If C_l = \{1, ..., N\}, break
  • d. Set S = (S \setminus \{j, k\}) \cup \{l\}
  • e. Update D(i, l) for all i \in S (using the desired linkage property)

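A compact sketch of this agglomerative procedure (illustration only; it recomputes linkages with the hypothetical cluster_dissimilarity helper above rather than using the incremental update of step e). In practice a library routine such as scipy.cluster.hierarchy.linkage does the same job efficiently.

    import numpy as np

    def agglomerative_clustering(X, linkage="single"):
        """Merge the two closest clusters until one cluster remains; record the merge order."""
        X = np.asarray(X, float)
        clusters = {i: [i] for i in range(len(X))}        # step 1: every point starts as a singleton
        merges = []
        while len(clusters) > 1:                          # step 3: repeat until everything is merged
            keys = list(clusters)
            # step a: pick the two most similar clusters under the chosen linkage
            j, k = min(
                ((a, b) for a in keys for b in keys if a < b),
                key=lambda pair: cluster_dissimilarity(X[clusters[pair[0]]], X[clusters[pair[1]]], linkage),
            )
            merges.append((j, k))
            clusters[max(keys) + 1] = clusters.pop(j) + clusters.pop(k)   # steps b, d: merge and re-label
        return merges
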
SLIDE 34

Hierarchical Clustering: Dendrogram

Outputs of hierarchical clustering algorithms are typically represented using dendrograms.
A dendrogram is a binary tree, representing clusters as they were merged.
The height of a node represents the dissimilarity between the clusters it merges.
Cutting the dendrogram at some level gives a partition of the data.

SLIDE 35

Outline

Clustering Objective
k-Means Formulation of Clustering
Multidimensional Scaling
Hierarchical Clustering
Spectral Clustering

SLIDE 36

Spectral Clustering

SLIDE 37

Spectral Clustering: Limitations of k-Means

SLIDE 38

Limitations of k-means

k-means will typically form clusters that are spherical, elliptical, or convex.
Kernel PCA followed by k-means can result in better clusters.
Spectral clustering is a (related) alternative that often works better.

SLIDE 39

Spectral Clustering

Construct a graph from the data, with one node for every point in the dataset.
Use a similarity measure, e.g., s_{i,j} = \exp(-\|x_i - x_j\|^2 / \sigma).
Construct a mutual K-nearest neighbour graph, i.e., (i, j) is an edge if either i is among the K nearest neighbours of j or vice versa.
The weight of edge (i, j), if it exists, is s_{i,j}.

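A sketch of this graph construction (my own illustration; K and sigma are free parameters):

    import numpy as np

    def knn_similarity_graph(X, K=10, sigma=1.0):
        """Weighted adjacency matrix of a symmetrised K-nearest-neighbour similarity graph."""
        X = np.asarray(X, float)
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
        S = np.exp(-sq_dists / sigma)                       # Gaussian similarities s_ij
        W = np.zeros_like(S)
        order = np.argsort(sq_dists, axis=1)[:, 1:K + 1]    # K nearest neighbours (excluding self)
        for i, nbrs in enumerate(order):
            W[i, nbrs] = S[i, nbrs]
        W = np.maximum(W, W.T)                              # keep (i, j) if i is a neighbour of j or vice versa
        return W
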
SLIDE 40

Spectral Clustering

SLIDE 41

Spectral Clustering

Use graph partitioning algorithms.
Mincut can give bad cuts (e.g., only one node on one side of the cut).
Multi-way cuts and balanced cuts are typically NP-hard to compute.
Relaxations of these problems give eigenvectors of the Laplacian.

W is the weighted adjacency matrix.
D is the (diagonal) degree matrix: D_{ii} = \sum_j W_{ij}
Laplacian: L = D - W
Normalised Laplacian: \tilde{L} = I - D^{-1} W

SLIDE 42

Spectral Clustering: Simple Example

(Figure: a graph on nodes 1-6 forming two disjoint triangles, {1, 2, 3} and {4, 5, 6}.)

Suppose all edge weights are 1 (0 for missing edges). The weighted adjacency matrix, the degree matrix and the Laplacian are given by

  W = \begin{pmatrix}
    0 & 1 & 1 & 0 & 0 & 0 \\
    1 & 0 & 1 & 0 & 0 & 0 \\
    1 & 1 & 0 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & 1 & 1 \\
    0 & 0 & 0 & 1 & 0 & 1 \\
    0 & 0 & 0 & 1 & 1 & 0
  \end{pmatrix},   D = \operatorname{diag}(2, 2, 2, 2, 2, 2),

  L = D - W = \begin{pmatrix}
     2 & -1 & -1 &  0 &  0 &  0 \\
    -1 &  2 & -1 &  0 &  0 &  0 \\
    -1 & -1 &  2 &  0 &  0 &  0 \\
     0 &  0 &  0 &  2 & -1 & -1 \\
     0 &  0 &  0 & -1 &  2 & -1 \\
     0 &  0 &  0 & -1 & -1 &  2
  \end{pmatrix}

SLIDE 43

Spectral Clustering: Simple Example

(Figure: the same two-triangle graph; all edge weights are 1, 0 for missing edges.)

Let us consider some eigenvectors of L:

  L = D - W = \begin{pmatrix}
     2 & -1 & -1 &  0 &  0 &  0 \\
    -1 &  2 & -1 &  0 &  0 &  0 \\
    -1 & -1 &  2 &  0 &  0 &  0 \\
     0 &  0 &  0 &  2 & -1 & -1 \\
     0 &  0 &  0 & -1 &  2 & -1 \\
     0 &  0 &  0 & -1 & -1 &  2
  \end{pmatrix}

v_1 = [1, 1, 1, 1, 1, 1]^T is an eigenvector with eigenvalue 0.
v_2 = [1, 1, 1, -1, -1, -1]^T is also an eigenvector with eigenvalue 0.
\alpha_1 v_1 + \alpha_2 v_2 is also an eigenvector with eigenvalue 0, for any \alpha_1, \alpha_2 (not both zero).
We can use the matrix [v_1 \; v_2] as the N × 2 feature matrix and perform k-means.

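A quick NumPy check of this example (my own illustration): build L for the two disjoint triangles and confirm the two zero eigenvalues.

    import numpy as np

    # Weighted adjacency of two disjoint triangles {1,2,3} and {4,5,6}, all weights 1
    block = np.ones((3, 3)) - np.eye(3)
    W = np.block([[block, np.zeros((3, 3))], [np.zeros((3, 3)), block]])
    D = np.diag(W.sum(axis=1))
    L = D - W

    eigvals, eigvecs = np.linalg.eigh(L)
    print(np.round(eigvals, 6))           # two zero eigenvalues: one per connected component
    features = eigvecs[:, :2]             # N x 2 features spanning the same space as [v1 v2]
    # Running k-means with k = 2 on `features` separates {1,2,3} from {4,5,6}.
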
SLIDE 44

Spectral Clustering: Simple Example

(Figure: the two-triangle graph; all edge weights are 1, 0 for missing edges.)

SLIDE 45

Spectral Clustering: Simple Example

(Figure: the two-triangle graph with slightly perturbed edge weights 1.1, 1, 0.9 within each triangle, plus weak cross edges of weight 0.1 and 0.2.)

Let us consider some eigenvectors of L:

  L = D - W = \begin{pmatrix}
     2   & -1.1 & -0.9 &  0   &  0   &  0   \\
    -1.1 &  2.2 & -1   & -0.1 &  0   &  0   \\
    -0.9 & -1   &  2.1 &  0   & -0.2 &  0   \\
     0   & -0.1 &  0   &  2.1 & -1.1 & -0.9 \\
     0   &  0   & -0.2 & -1.1 &  2.3 & -1   \\
     0   &  0   &  0   & -0.9 & -1   &  1.9
  \end{pmatrix}

When the weights are slightly perturbed, v_1 = [1, ..., 1]^T is still an eigenvector, with eigenvalue 0.
We can't compute the second eigenvector v_2 by hand.
Nevertheless, we expect the eigenspace corresponding to nearby eigenvalues to be relatively stable.
We can still use the matrix [v_1 \; v_2] as the N × 2 feature matrix and perform k-means.

SLIDE 46

Spectral Clustering: Simple Example

(Figure: the perturbed two-triangle graph with edge weights 1.1, 1, 0.9 within each triangle and cross edges of weight 0.1 and 0.2.)

SLIDE 47

Spectral Clustering Algorithm

Input: Weighted graph with weighted adjacency matrix W

  • 1. Construct the Laplacian L = D - W
  • 2. Find the eigenvectors v_1 = \mathbf{1}, v_2, ..., v_{l+1} corresponding to the smallest eigenvalues of L
  • 3. Construct the N × l feature matrix V_l = [v_2, ..., v_{l+1}]
  • 4. Apply a clustering algorithm using V_l as features, e.g., k-means

Note: If the degrees of the nodes are not balanced, using the normalised Laplacian \tilde{L} = I - D^{-1} W may be a better idea.

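A sketch of this pipeline using the unnormalised Laplacian (illustration only; it reuses the hypothetical knn_similarity_graph and k_means helpers sketched earlier):

    import numpy as np

    def spectral_clustering(W, k, l=2):
        """Cluster the nodes of a weighted graph (adjacency W) into k groups via the graph Laplacian."""
        D = np.diag(W.sum(axis=1))
        L = D - W                                   # step 1: unnormalised Laplacian
        eigvals, eigvecs = np.linalg.eigh(L)        # step 2: eigh returns eigenvalues in ascending order
        V = eigvecs[:, 1:l + 1]                     # step 3: v_2, ..., v_{l+1} as the N x l feature matrix
        labels, _ = k_means(V, k)                   # step 4: cluster in the spectral feature space
        return labels

    # Usage sketch: labels = spectral_clustering(knn_similarity_graph(X, K=10, sigma=1.0), k=2)
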
SLIDE 48

Spectral Clustering

SLIDE 49

Summary: Clustering

Clustering is grouping together similar data in a larger collection of heterogeneous data.
The definition of a good clustering is often user-dependent.
Clustering algorithms in feature space, e.g., k-Means.
Clustering algorithms that only use (dis)similarities: k-Medoids, hierarchical clustering.
Spectral clustering can be used when clusters may be non-convex.
