

SLIDE 1

Clustering - Classification non-supervisée

Alexandre Gramfort alexandre.gramfort@inria.fr

Inria - Université Paris-Saclay

Huawei Mathematical Coffee, March 16, 2018

SLIDE 2


Outline

1. Clustering: Challenges and a formal model
2. Algorithms
3. References


SLIDE 3


What is clustering?

One of the most widely used techniques for exploratory data analysis: get intuition about the data by identifying meaningful groups among the data points (knowledge discovery). Examples:

• Identify groups of customers for targeted marketing
• Identify groups of similar individuals in a social network
• Identify groups of genes based on their expressions (phenotypes)


SLIDE 4


A fuzzy definition

Definition (Clustering): the task of grouping a set of objects such that similar objects end up in the same group and dissimilar objects are separated into different groups.

A more rigorous definition is not so obvious: belonging to the same cluster is a transitive relation, while similarity is not. Imagine x1, . . . , xm such that each xi is very similar to its two neighbors xi−1 and xi+1, but x1 and xm are very dissimilar.


SLIDE 5


Illustration


SLIDE 6


Absence of ground truth

Clustering is an unsupervised learning problem (learning from unlabeled data).

• For supervised learning, the metric of performance is clear.
• For clustering, there is no clear success evaluation procedure.
• For clustering, there is no ground truth.
• For clustering, it is unclear what the correct answer is.


SLIDE 7


Absence of ground truth

Both of these solutions are equally justifiable:


SLIDE 8


To sum up

Summary: there may be several very different conceivable clustering solutions for a given data set. As a result, there is a wide variety of clustering algorithms that, on some input data, will output very different clusterings.


SLIDE 9


Zoology of clustering methods

Source: http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html

SLIDE 10


A clustering model

Input: a set of elements X and a distance function over it, that is, a function d : X × X → R+ that is symmetric, satisfies d(x, x) = 0 for all x ∈ X, and often also satisfies the triangle inequality. Alternatively, the function could be a similarity function s : X × X → [0, 1] that is symmetric and satisfies s(x, x) = 1 for all x ∈ X. Clustering algorithms typically also require:

• a parameter k (determining the number of required clusters), or
• a bandwidth / threshold parameter ε (determining how close points in the same cluster should be).


SLIDE 11


A clustering model

Output: a partition of the domain set X into subsets C = (C1, . . . , Ck), where ∪_{i=1}^{k} Ci = X and Ci ∩ Cj = ∅ for all i ≠ j.

In some situations the clustering is “soft”, and the output is a probabilistic assignment to each domain point: ∀x ∈ X we get (p1(x), . . . , pk(x)), where pi(x) = P[x ∈ Ci] is the probability that x belongs to cluster Ci.
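As an illustration of such a soft output, here is a minimal scikit-learn sketch using a Gaussian mixture model (the model choice and variable names are ours, not from the slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two Gaussian blobs in 2D.
rng = np.random.RandomState(42)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [4, 4]])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
proba = gm.predict_proba(X)  # proba[i, j] plays the role of p_j(x_i)
print(proba[:3].round(3))
```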

Another possible output is a clustering dendrogram, which is a hierarchical tree of domain subsets, having the singleton sets in its leaves, and the full domain as its root.


SLIDE 12


Outline

1. Clustering: Challenges and a formal model
2. Algorithms
   • K-Means and other cost minimization clusterings
   • DBSCAN: Density based clustering
3. References


SLIDE 13


History

k-means is certainly the best-known clustering algorithm. The algorithm is attributed to Lloyd (1957) but was only published in a journal in 1982. There is a lot of misunderstanding about its underlying hypotheses . . . and its limitations. There is still a lot of research on speeding it up: k-means++ initialization [Arthur et al. 2007], online k-means [Sculley 2010], the triangle inequality trick [Elkan ICML 2003], Yinyang k-means [Ding et al. ICML 2015], better initialization [Bachem et al. NIPS 2016].


SLIDE 14


Cost minimization clusterings

Find a partition C = (C1, . . . , Ck) of minimal cost, where G((X, d), C) is the objective to be minimized.

Note: most of the resulting optimization problems are NP-hard, and some are even NP-hard to approximate. Consequently, when people talk about, say, k-means clustering, they often refer to some particular common approximation algorithm rather than the cost function or the corresponding exact solution of the minimization problem.


SLIDE 15


The k-means objective function

Data is partitioned into disjoint sets C1, . . . , Ck, where each Ci is represented by a centroid μi. We assume that the input set X is embedded in some larger metric space (X′, d), such as Rp (so that X ⊆ X′), and that centroids are members of X′. The k-means objective function measures the squared distance between each point in X and the centroid of its cluster. Formally:

μi(Ci) = argmin_{μ ∈ X′} Σ_{x ∈ Ci} d(x, μ)²

G_{k-means}((X, d), (C1, . . . , Ck)) = Σ_{i=1}^{k} Σ_{x ∈ Ci} d(x, μi(Ci))²

Note: G_{k-means} is often referred to as inertia.


SLIDE 16


The k-means objective function

This can be rewritten as:

G_{k-means}((X, d), (C1, . . . , Ck)) = min_{μ1, . . . , μk ∈ X′} Σ_{i=1}^{k} Σ_{x ∈ Ci} d(x, μi)²

[Figure: data samples and the centroids found by KMeans]
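A minimal NumPy sketch of this objective in the Euclidean case, checked against scikit-learn's inertia_ attribute (the helper name kmeans_inertia is ours):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + c for c in ([0, 0], [5, 0], [0, 5])])

def kmeans_inertia(X, labels, centroids):
    """Sum of squared Euclidean distances of each point to its centroid."""
    return np.sum((X - centroids[labels]) ** 2)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(np.isclose(kmeans_inertia(X, km.labels_, km.cluster_centers_),
                 km.inertia_))  # True
```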


SLIDE 17


The k-medoids objective function

Similar to the k-means objective, except that the cluster centroids are required to be members of the input set:

G_{k-medoids}((X, d), (C1, . . . , Ck)) = min_{μ1, . . . , μk ∈ X} Σ_{i=1}^{k} Σ_{x ∈ Ci} d(x, μi)²
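For a given cluster, the inner minimization simply picks the member minimizing the sum of squared distances to all members; a small sketch from a pairwise distance matrix (the helper name medoid_index is ours):

```python
import numpy as np
from scipy.spatial.distance import cdist

def medoid_index(X_cluster):
    """Index of the cluster member minimizing the sum of squared
    distances to all members, i.e. the argmin over mu in X."""
    D = cdist(X_cluster, X_cluster)  # pairwise Euclidean distances
    return int(np.argmin((D ** 2).sum(axis=1)))
```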


SLIDE 18


The k-median objective function

Similar to the k-medoids objective, except that the “distortion” between a data point and the centroid of its cluster is measured by the distance rather than by its square:

G_{k-median}((X, d), (C1, . . . , Ck)) = min_{μ1, . . . , μk ∈ X} Σ_{i=1}^{k} Σ_{x ∈ Ci} d(x, μi)

Example: the facility location problem. Consider the task of locating k fire stations in a city. One can model houses as data points and aim to place the stations so as to minimize the average distance between a house and its closest fire station.


SLIDE 19


Remarks

The latter objective functions are center based:

G_f((X, d), (C1, . . . , Ck)) = min_{μ1, . . . , μk ∈ X′} Σ_{i=1}^{k} Σ_{x ∈ Ci} f(d(x, μi))

Some objective functions are not center based, for example the sum of in-cluster distances (SOD):

G_SOD((X, d), (C1, . . . , Ck)) = Σ_{i=1}^{k} Σ_{x,y ∈ Ci} d(x, y)
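The center-based costs above differ only in the choice of f (and of the candidate set for the centers); a hedged sketch of both families of objectives (helper names are ours):

```python
import numpy as np
from scipy.spatial.distance import cdist

def center_cost(X, labels, centers, f=np.square):
    """Center-based cost G_f: f=np.square gives k-means/k-medoids,
    f=lambda d: d gives k-median."""
    d = np.linalg.norm(X - centers[labels], axis=1)
    return f(d).sum()

def sod_cost(X, labels):
    """Sum of in-cluster pairwise distances (not center based)."""
    return sum(cdist(X[labels == c], X[labels == c]).sum()
               for c in np.unique(labels))
```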


SLIDE 20


k-means algorithm

We describe the algorithm with respect to the Euclidean distance function d(x, y) = ‖x − y‖.

Algorithm 1: (Vanilla) k-Means

procedure KMEANS(X ⊂ Rn, number of clusters k)
  Initialize: randomly choose initial centroids μ1, . . . , μk
  repeat until convergence:
    ∀i ∈ [k]: set Ci = {x ∈ X : i = argmin_j ‖x − μj‖}
    ∀i ∈ [k]: update μi = (1/|Ci|) Σ_{x ∈ Ci} x
end procedure
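A runnable NumPy sketch of this vanilla algorithm, with plain random initialization as in the slide (no k-means++; the function name is ours):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # Random initialization: k distinct data points as centroids.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid.
        labels = np.linalg.norm(X[:, None] - mu[None], axis=2).argmin(axis=1)
        # Update step: centroid = mean of its cluster (empty clusters kept).
        new_mu = np.array([X[labels == i].mean(axis=0)
                           if np.any(labels == i) else mu[i]
                           for i in range(k)])
        if np.allclose(new_mu, mu):  # converged
            break
        mu = new_mu
    return labels, mu
```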


SLIDE 21


k-means algorithm

Theorem (the k-means algorithm converges monotonically): each iteration of the k-means algorithm does not increase the k-means objective function.

Remarks:

• There is no guarantee on the number of iterations needed to reach convergence.
• There is no nontrivial guarantee on the gap between the k-means objective value of the algorithm's output and the minimum possible value of that objective function.
• k-means might converge to a point which is not even a local minimum!
• To improve the results of k-means it is recommended to repeat the procedure several times with different randomly chosen initial centroids.
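This restart strategy is built into common implementations; a hedged scikit-learn example, where n_init controls the number of random restarts and init="k-means++" uses the careful seeding of Arthur and Vassilvitskii [2007]:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) + c for c in ([0, 0], [5, 0], [0, 5])])

# 10 restarts with different seeds; the run with the lowest inertia wins.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)
```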


SLIDE 22


DBSCAN: Density based clustering

“Density-based spatial clustering of applications with noise” (DBSCAN) is a very popular, simple and powerful algorithm, first proposed by Ester et al. (1996) at the KDD conference (> 11,000 citations). DBSCAN is one of the most common clustering algorithms and among the most cited in the scientific literature. In 2014 it received the test-of-time award at KDD, the leading data mining conference.


SLIDE 23


DBSCAN Algorithm

Two parameters: ε and the minimum number of points required to form a dense region, q.

• Start from an arbitrary point not yet visited and retrieve its ε-neighborhood. If it contains sufficiently many points, a cluster is started; otherwise the point is labeled as noise.¹
• If a point is found to be a dense part of a cluster, its ε-neighborhood is also part of that cluster. All points found within the ε-neighborhood are added, and so are their own ε-neighborhoods when they are also dense.
• The process continues until the density-connected cluster is completely found. Then start again with a new point, until all points have been visited.

¹ A point marked as noise might later be found in a sufficiently sized ε-neighborhood of a different point and hence be made part of a cluster.


SLIDE 24


DBSCAN Illustration

With q = 4 in 2D. Red: core points; yellow: non-core points belonging to a cluster; blue: noise.

Source: https://en.wikipedia.org/wiki/DBSCAN

SLIDE 25

Algorithm 2: DBSCAN

procedure DBSCAN(X, ε, q)
  Initialize: C = 0
  for each point x in X do
    if x is visited then
      continue to next point
    end if
    mark x as visited
    neighbors = regionQuery(x, ε)
    if |neighbors| < q then
      mark x as noise
    else
      C = next cluster
      expandCluster(x, neighbors, C, ε, q)
    end if
  end for
  Output: all produced clusters
end procedure

SLIDE 26

procedure expandCluster(x, neighbors, C, ε, q)
  add x to C
  for each y in neighbors do
    if y is not visited then
      mark y as visited
      neighbors_y = regionQuery(y, ε)
      if |neighbors_y| ≥ q then
        neighbors = neighbors joined with neighbors_y
      end if
    end if
    if y is not yet a member of any cluster then
      add y to cluster C
    end if
  end for
end procedure

procedure regionQuery(x, ε)
  Output: all points within x’s ε-neighborhood (including x)
end procedure
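In practice one rarely reimplements this; a hedged scikit-learn usage sketch, where eps plays the role of ε and min_samples the role of q:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-convex clusters that defeat k-means.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=4).fit(X)
print(np.unique(db.labels_))  # cluster indices; -1 marks noise points
```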

SLIDE 27


DBSCAN Pros

• No need to specify the number of clusters in the data a priori, as opposed to k-means.
• It can find arbitrarily shaped clusters. It can even find a cluster completely surrounded by (but not connected to) a different cluster.
• Due to the q parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced.
• It has a notion of noise, and is robust to outliers.


SLIDE 28


DBSCAN Cons

• It is not entirely deterministic: the output depends on the order in which the points are processed.
• It still requires specifying a distance measure (like k-means or spectral clustering).
• It cannot cluster data sets with large differences in densities, since the (q, ε) combination cannot then be chosen appropriately for all clusters.


SLIDE 29


Beyond DBSCAN

• Ordering points to identify the clustering structure (OPTICS) [Ankerst et al. ACM SIGMOD 1999], which can detect clusters in data of varying density.²
• Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [Campello et al. 2013, McInnes et al. 2017]³. It performs DBSCAN over varying ε values and keeps the most stable clustering. Like OPTICS it can find clusters of varying densities, and it is more robust to parameter selection. A usage sketch follows after the footnotes below.

² Close to the Local Outlier Factor (LOF) algorithm for anomaly detection.
³ https://github.com/scikit-learn-contrib/hdbscan
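A hedged usage sketch with the hdbscan package from footnote 3 (the min_cluster_size parameter name follows that package's documentation; treat the exact API as an assumption):

```python
import hdbscan
from sklearn.datasets import make_blobs

# Blobs with very different spreads: hard for one fixed (q, eps) in DBSCAN.
X, _ = make_blobs(n_samples=300, centers=3,
                  cluster_std=[0.4, 1.0, 2.0], random_state=0)
clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
labels = clusterer.fit_predict(X)  # -1 again marks noise
```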

SLIDE 30


Outline

1. Clustering: Challenges and a formal model
2. Algorithms
3. References


SLIDE 31


Food for thought

[Kleinberg “An Impossibility Theorem for Clustering”, NIPS 2002]


SLIDE 32


References I

1. Lloyd, S. P. (1957). “Least squares quantization in PCM”. Bell Telephone Laboratories Paper. Published in a journal much later as: Lloyd, S. P. (1982). “Least squares quantization in PCM”. IEEE Transactions on Information Theory 28 (2): 129–137.
2. Elkan, C. (2003). “Using the triangle inequality to accelerate k-means”. Proceedings of the Twentieth International Conference on Machine Learning (ICML).
3. Arthur, D. and Vassilvitskii, S. (2007). “k-means++: the advantages of careful seeding”. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp. 1027–1035.
4. Ding, Y., et al. (2015). “Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup”. Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 579–587.


SLIDE 33


References II

5. Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). “A density-based algorithm for discovering clusters in large spatial databases with noise”. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD), pp. 226–231.
6. Campello, R., Moulavi, D., and Sander, J. (2013). “Density-Based Clustering Based on Hierarchical Density Estimates”. In: Advances in Knowledge Discovery and Data Mining, Springer, pp. 160–172.
7. McInnes, L. and Healy, J. (2017). “Accelerated Hierarchical Density Based Clustering”. In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp. 33–42.
8. Sculley, D. (2010). “Web-Scale K-Means Clustering”. Proceedings of the 19th International Conference on World Wide Web (WWW ’10), pp. 1177–1178.
9. Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J. (1999). “OPTICS: Ordering Points To Identify the Clustering Structure”. ACM SIGMOD International Conference.


SLIDE 34


References III

10. Kleinberg, J. M. (2002). “An Impossibility Theorem for Clustering”. Advances in Neural Information Processing Systems 15, pp. 463–470.
