SLIDE 1

Clustering

Léon Bottou

NEC Labs America

COS 424 – 3/4/2010

SLIDE 2

Agenda

Goals
  • Classification, clustering, regression, other.
Representation
  • Parametric vs. kernels vs. nonparametric
  • Probabilistic vs. nonprobabilistic
  • Linear vs. nonlinear
  • Deep vs. shallow
Capacity Control
  • Explicit: architecture, feature selection
  • Explicit: regularization, priors
  • Implicit: approximate optimization
  • Implicit: Bayesian averaging, ensembles
Operational Considerations
  • Loss functions
  • Budget constraints
  • Online vs. offline
Computational Considerations
  • Exact algorithms for small datasets.
  • Stochastic algorithms for big datasets.
  • Parallel algorithms.

SLIDE 3

Introduction

Clustering
  • Assigning observations to subsets with similar characteristics.
Applications
  • medicine, biology
  • market research, data mining
  • image segmentation
  • search results
  • topics, taxonomies
  • communities
Why is clustering so attractive?
  • An embodiment of Descartes' philosophy ("Discourse on the Method of Rightly Conducting One's Reason"):
    ". . . divide each of the difficulties under examination . . . as might be necessary for its adequate solution."

SLIDE 4

Summary

  • 1. What is a cluster?
  • 2. K-Means
  • 3. Hierarchical clustering
  • 4. Simple Gaussian mixtures

SLIDE 5

What is a cluster?

  • Two neatly separated classes leave a trace in the marginal distribution P{X}.

SLIDE 6

Input space transformations

The definition of the input space is often an arbitrary decision: for instance, camera pixels versus retina pixels. What happens if we apply a reversible transformation to the inputs?

SLIDE 7

Input space transformations

The Bayes optimal decision boundary moves with the transformation. The Bayes optimal error rate is unchanged. The neatly separated clusters are gone!

  • Clustering depends on the arbitrary definition of the input space!

This is very different from classification, regression, etc.

SLIDE 8

K-Means

The K-Means problem
  • Given observations $x_1 \ldots x_n$, determine K centroids $w_1 \ldots w_K$ that minimize the distortion

    $C(w) = \sum_{i=1}^{n} \min_k \| x_i - w_k \|^2 .$

Interpretation
  • Minimize the discretization error.
Properties
  • Non-convex objective.
  • Finding the global minimum is NP-hard in general.
  • Finding acceptable local minima is surprisingly easy.
  • Initialization dependent.

SLIDE 9

Offline K-Means

Lloyd's algorithm
  initialize centroids $w_k$
  repeat
    • assign points to clusters:
      $\forall i,\; s_i \leftarrow \arg\min_k \|x_i - w_k\|^2, \qquad S_k \leftarrow \{\, i : s_i = k \,\}$
    • recompute centroids:
      $\forall k,\; w_k \leftarrow \arg\min_w \sum_{i \in S_k} \|x_i - w\|^2 = \frac{1}{\mathrm{card}(S_k)} \sum_{i \in S_k} x_i$
  until convergence.
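Below is a minimal NumPy sketch of Lloyd's algorithm as written above. The function name lloyd_kmeans, the initialization from K random data points, and the stopping test on stable assignments are illustration choices, not part of the slides.

    import numpy as np

    def lloyd_kmeans(X, K, max_iter=100, seed=0):
        """Offline K-Means (Lloyd's algorithm): alternate assignment and centroid steps."""
        rng = np.random.default_rng(seed)
        # Initialize the centroids with K distinct observations (one common choice).
        w = X[rng.choice(len(X), size=K, replace=False)].astype(float)
        s = None
        for _ in range(max_iter):
            # Assignment step: s_i = argmin_k ||x_i - w_k||^2
            d2 = ((X[:, None, :] - w[None, :, :]) ** 2).sum(axis=2)
            new_s = d2.argmin(axis=1)
            if s is not None and np.array_equal(new_s, s):
                break                                # assignments are stable: converged
            s = new_s
            # Centroid step: w_k = mean of the points currently assigned to cluster k
            for k in range(K):
                if np.any(s == k):
                    w[k] = X[s == k].mean(axis=0)
        distortion = ((X - w[s]) ** 2).sum()         # C(w) = sum_i min_k ||x_i - w_k||^2
        return w, s, distortion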

SLIDE 10

Lloyd’s algorithm – Illustration

Initial state: – Squares = data points. – Circles = centroids.

SLIDE 11

Lloyd’s algorithm – Illustration

  • 1. Assign data points to clusters.

SLIDE 12

Lloyd’s algorithm – Illustration

  • 2. Recompute centroids.

SLIDE 13

Lloyd’s algorithm – Illustration

Assign data points to clusters. . .

SLIDE 14

Why does Lloyd’s algorithm work?

Consider an arbitrary cluster assignment $s_i$, and define

$L(s, w) = \sum_{i=1}^{n} \|x_i - w_{s_i}\|^2, \qquad D(s, w) = \sum_{i=1}^{n} \Big( \|x_i - w_{s_i}\|^2 - \min_k \|x_i - w_k\|^2 \Big) \ge 0.$

Then

$C(w) = \sum_{i=1}^{n} \min_k \|x_i - w_k\|^2 = L(s, w) - D(s, w).$

Lloyd's algorithm alternates between the two terms:
  • the assignment step picks $s$ so that $D(s, w) = 0$, hence $C(w) = L(s, w)$;
  • the centroid step picks the $w$ that minimizes $L(s, w)$ for the current assignment, and since $C(w) \le L(s, w)$ for any $s$, the distortion cannot increase.
The distortion is bounded below by zero, so it converges (usually to a local minimum).

SLIDE 15

Online K-Means

MacQueen's algorithm
  initialize centroids $w_k$ and counts $n_k = 0$
  repeat
    • pick an observation $x_t$ and determine its cluster
      $s_t = \arg\min_k \|x_t - w_k\|^2$
    • update centroid $s_t$:
      $n_{s_t} \leftarrow n_{s_t} + 1, \qquad w_{s_t} \leftarrow w_{s_t} + \frac{1}{n_{s_t}} \big( x_t - w_{s_t} \big)$
  until satisfaction.

Comments
  • MacQueen's algorithm finds decent clusters much faster.
  • Final convergence could be slow. Do we really care?
  • Just perform one or two passes over the randomly shuffled observations.
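A matching sketch of MacQueen's online update, under the same assumptions as the Lloyd sketch above (NumPy arrays, centroids initialized from random observations; the name macqueen_kmeans is made up for illustration). The single shuffled pass follows the comment on the slide.

    import numpy as np

    def macqueen_kmeans(X, K, passes=1, seed=0):
        """Online K-Means (MacQueen): one running-average update per observation."""
        rng = np.random.default_rng(seed)
        w = X[rng.choice(len(X), size=K, replace=False)].astype(float)
        n = np.zeros(K)                                # per-cluster observation counts
        for _ in range(passes):
            for t in rng.permutation(len(X)):          # randomly shuffled pass over the data
                s = ((X[t] - w) ** 2).sum(axis=1).argmin()   # closest centroid s_t
                n[s] += 1
                w[s] += (X[t] - w[s]) / n[s]           # running average of assigned points
        return w, n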

SLIDE 16

Why does MacQueen’s algorithm work?

Explanation 1: Recursive averages.
  • Let $u_n = \frac{1}{n} \sum_{i=1}^{n} x_i$. Then $u_n = u_{n-1} + \frac{1}{n} (x_n - u_{n-1})$.

Explanation 2: Stochastic gradient.
  • Apply stochastic gradient descent to $C(w) = \frac{1}{2n} \sum_{i=1}^{n} \min_k \|x_i - w_k\|^2$:
    $w_{s_t} \leftarrow w_{s_t} + \gamma_t \big( x_t - w_{s_t} \big)$

Explanation 3: Stochastic gradient + Newton.
  • The Hessian $H$ of $C(w)$ is diagonal and contains the fraction of observations assigned to each cluster.
    $w_{s_t} \leftarrow w_{s_t} + \frac{1}{t} H^{-1} \big( x_t - w_{s_t} \big) = w_{s_t} + \frac{1}{n_{s_t}} \big( x_t - w_{s_t} \big)$
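A quick numeric check of Explanation 1, reusing the NumPy setup of the earlier sketches: the recursion $u_n = u_{n-1} + \frac{1}{n}(x_n - u_{n-1})$ reproduces the batch mean exactly, which is why MacQueen's update with step $1/n_{s_t}$ keeps each centroid equal to the mean of the observations that have been assigned to it so far.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(1000, 3))

    u = np.zeros(3)
    for n, xn in enumerate(x, start=1):
        u += (xn - u) / n                      # u_n = u_{n-1} + (x_n - u_{n-1}) / n

    print(np.allclose(u, x.mean(axis=0)))      # True: recursive and batch means agree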

SLIDE 17

Example: Color quantization of images

Problem
  • Convert a 24-bit RGB image into an indexed image with a palette of K colors.
Solution
  • The (r, g, b) values of the pixels are the observations $x_i$.
  • The (r, g, b) values of the K palette colors are the centroids $w_k$.
  • Initialize the $w_k$ with an all-purpose palette, or alternatively with the colors of random pixels.
  • Perform one pass of MacQueen's algorithm.
  • Eliminate centroids with no observations.
  • You are done.
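A sketch of this recipe using the macqueen_kmeans function from the earlier sketch (the helper name quantize_colors and the (H, W, 3) uint8 image layout are assumptions for illustration):

    import numpy as np

    def quantize_colors(image, K=256, seed=0):
        """Index an (H, W, 3) uint8 RGB image with a K-color palette learned by one
        MacQueen pass over its pixels."""
        pixels = image.reshape(-1, 3).astype(float)        # observations x_i
        palette, counts = macqueen_kmeans(pixels, K, passes=1, seed=seed)
        palette = palette[counts > 0]                      # drop centroids with no observations
        # Map each pixel to the index of its nearest palette color.
        d2 = ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(axis=2)
        indices = d2.argmin(axis=1).reshape(image.shape[:2])
        return indices, palette.round().astype(np.uint8)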

SLIDE 18

How many clusters?

Rules of thumb?
  • $K = 10$, $K = \sqrt{n}$, . . .
The elbow method?
  • Measure the distortion on a validation set.
  • The distortion decreases when K increases.
  • Sometimes there is no elbow, or several elbows.
  • Local minima mess up the picture.
Rate-distortion
  • Each additional cluster reduces the distortion.
  • Cost of an additional cluster vs. cost of the distortion.
  • Just another way to select K.
Conclusion
  • Clustering is a very subjective matter.
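A minimal sketch of the elbow heuristic, assuming the lloyd_kmeans sketch above and a held-out validation split (both are illustration choices): compute the validation distortion for several values of K and look for a bend in the curve.

    import numpy as np

    def validation_distortion(X_train, X_val, ks, seed=0):
        """Validation-set distortion for several values of K (elbow heuristic)."""
        curve = {}
        for K in ks:
            w, _, _ = lloyd_kmeans(X_train, K, seed=seed)
            d2 = ((X_val[:, None, :] - w[None, :, :]) ** 2).sum(axis=2)
            curve[K] = d2.min(axis=1).sum()    # sum_i min_k ||x_i - w_k||^2 on validation data
        return curve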

SLIDE 19

Hierarchical clustering

Agglomerative clustering
  • Initialization: each observation is its own cluster.
  • Repeatedly merge the two closest clusters, where closeness can be measured by
    – single linkage: $D(A, B) = \min_{x \in A,\, y \in B} d(x, y)$
    – complete linkage: $D(A, B) = \max_{x \in A,\, y \in B} d(x, y)$
    – distortion estimates, etc.

Divisive clustering
  • Initialization: one cluster contains all observations.
  • Repeatedly divide the largest cluster, e.g. with 2-Means.
  • Lots of variants.
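A naive agglomerative sketch implementing the two linkages above. The quadratic pairwise-distance matrix and the repeated full scan for the closest pair are deliberate simplifications; practical implementations use priority queues (e.g. scipy.cluster.hierarchy.linkage).

    import numpy as np

    def agglomerate(X, n_clusters, linkage="single"):
        """Merge clusters until n_clusters remain, using single or complete linkage."""
        d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))  # pairwise d(x, y)
        clusters = [[i] for i in range(len(X))]          # start: one cluster per observation
        reduce_fn = np.min if linkage == "single" else np.max
        while len(clusters) > n_clusters:
            # Find the pair (a, b) with the smallest linkage distance D(A, B).
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    D = reduce_fn(d[np.ix_(clusters[a], clusters[b])])
                    if best is None or D < best[0]:
                        best = (D, a, b)
            _, a, b = best
            clusters[a] = clusters[a] + clusters[b]      # merge B into A
            del clusters[b]
        return clusters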

SLIDE 20

K-Means plus Agglomerative Clustering

Algorithm
  • Run K-Means with a large K.
  • Count the number of observations in each cluster.
  • Merge the closest clusters according to the following metric.

Let A be a cluster with $n_A$ members and centroid $w_A$.
Let B be a cluster with $n_B$ members and centroid $w_B$.
The putative center of $A \cup B$ is $w_{AB} = (n_A w_A + n_B w_B) / (n_A + n_B)$.

Quick estimate of the distortion increase:

$d(A, B) = \sum_{x \in A \cup B} \|x - w_{AB}\|^2 - \sum_{x \in A} \|x - w_A\|^2 - \sum_{x \in B} \|x - w_B\|^2 = n_A \|w_A - w_{AB}\|^2 + n_B \|w_B - w_{AB}\|^2$
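A small helper for this merge cost (the name merge_cost is an assumption). Note that only the centroids and the counts are needed, not the data points, which is what makes the estimate quick.

    import numpy as np

    def merge_cost(n_a, w_a, n_b, w_b):
        """Distortion increase d(A, B) incurred by merging clusters A and B."""
        w_ab = (n_a * w_a + n_b * w_b) / (n_a + n_b)    # putative center of the union
        return n_a * ((w_a - w_ab) ** 2).sum() + n_b * ((w_b - w_ab) ** 2).sum()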

SLIDE 21

Dendrogram

SLIDE 22

Simple Gaussian mixture (1)

Clustering via density estimation
  • Pick a parametric model $P_\theta(X)$.
  • Maximize the likelihood.
The parametric model
  • There are K components.
  • To generate an observation:
    a.) pick a component k with probabilities $\lambda_1 \ldots \lambda_K$;
    b.) generate x from component k with distribution $\mathcal{N}(\mu_k, \sigma)$.
Notes
  • Same standard deviation σ for all components (for now).
  • That's why I write "Simple GMM".
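A sketch of this generative process with spherical components sharing one σ, as stated above; the helper name and the parameter values at the bottom are made up for illustration.

    import numpy as np

    def sample_simple_gmm(n, lam, mu, sigma, seed=0):
        """Draw n observations from a simple (shared-sigma, spherical) Gaussian mixture."""
        rng = np.random.default_rng(seed)
        y = rng.choice(len(lam), size=n, p=lam)                 # a.) pick components
        x = mu[y] + sigma * rng.normal(size=(n, mu.shape[1]))   # b.) Gaussian around mu_k
        return x, y

    # Illustrative parameters: three components in the plane.
    lam = np.array([0.5, 0.3, 0.2])
    mu = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
    x, y = sample_simple_gmm(1000, lam, mu, sigma=0.8)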

SLIDE 23

Simple Gaussian mixture (2)

Parameters: $\theta = (\lambda_1, \mu_1, \ldots, \lambda_K, \mu_K)$

Model:

$P_\theta(Y = y) = \lambda_y, \qquad P_\theta(X = x \mid Y = y) = \frac{1}{\sigma^d (2\pi)^{d/2}} \, e^{-\frac{1}{2} \left\| \frac{x - \mu_y}{\sigma} \right\|^2}.$

Likelihood:

$\log L(\theta) = \sum_{i=1}^{n} \log P_\theta(X = x_i) = \sum_{i=1}^{n} \log \sum_{y=1}^{K} P_\theta(Y = y) \, P_\theta(X = x_i \mid Y = y) = \ldots$

Maximize!
  • This is non-convex.
  • There are K! copies of each minimum (local or global).
  • Conjugate gradients or Newton works.
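A direct NumPy evaluation of this log-likelihood (the helper name is an assumption; the log-sum-exp shift is only there for numerical stability), e.g. as the objective handed to a generic optimizer:

    import numpy as np

    def simple_gmm_loglik(X, lam, mu, sigma):
        """log L(theta) = sum_i log sum_y lambda_y N(x_i; mu_y, sigma^2 I)."""
        n, d = X.shape
        # Log of each weighted component density, shape (n, K).
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2) / (2 * sigma ** 2)
        log_comp = np.log(lam)[None, :] - 0.5 * d * np.log(2 * np.pi * sigma ** 2) - sq
        # Stable log-sum-exp over the components.
        m = log_comp.max(axis=1, keepdims=True)
        return (m.squeeze(1) + np.log(np.exp(log_comp - m).sum(axis=1))).sum()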

SLIDE 24

Expectation-Maximization

Fortunately there is a simpler solution.
  • We observe X.
  • We do not observe Y.
  • Things would be simpler if we knew Y.

Decomposition
  • For a given X, guess a distribution $Q(Y \mid X)$.
  • Regardless of our guess,

    $\log L(\theta) = L(Q, \theta) - M(Q, \theta) + D(Q, \theta)$

with

$L(Q, \theta) = \sum_{i=1}^{n} \sum_{y=1}^{K} Q(y \mid x_i) \, \log P_\theta(x_i \mid y)$   (Gaussian log-likelihood)

$M(Q, \theta) = \sum_{i=1}^{n} \sum_{y=1}^{K} Q(y \mid x_i) \, \log \frac{Q(y \mid x_i)}{P_\theta(y)}$   (KL divergence $D(Q_{Y|X} \,\|\, P_Y)$)

$D(Q, \theta) = \sum_{i=1}^{n} \sum_{y=1}^{K} Q(y \mid x_i) \, \log \frac{Q(y \mid x_i)}{P_\theta(y \mid x_i)}$   (KL divergence $D(Q_{Y|X} \,\|\, P_{Y|X})$)
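Expanding the algebra for a single observation $x_i$ confirms the decomposition (this check is not on the slide):

$L_i - M_i + D_i = \sum_{y} Q(y \mid x_i) \Big[ \log P_\theta(x_i \mid y) - \log\frac{Q(y \mid x_i)}{P_\theta(y)} + \log\frac{Q(y \mid x_i)}{P_\theta(y \mid x_i)} \Big] = \sum_{y} Q(y \mid x_i) \, \log \frac{P_\theta(x_i \mid y)\, P_\theta(y)}{P_\theta(y \mid x_i)} = \sum_{y} Q(y \mid x_i) \, \log P_\theta(x_i) = \log P_\theta(x_i),$

and summing over i gives $\log L(\theta) = L(Q, \theta) - M(Q, \theta) + D(Q, \theta)$ for any choice of Q.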

SLIDE 25

Expectation-Maximization

Remember Lloyd's algorithm for K-Means?
  • EM is the same alternating scheme, applied to the decomposition above:
    – the E-Step chooses Q to make $D(Q, \theta) = 0$, i.e. sets $Q(y \mid x_i) = P_\theta(y \mid x_i)$, so that $\log L(\theta) = L(Q, \theta) - M(Q, \theta)$;
    – the M-Step keeps Q fixed and chooses θ to maximize the bound $L(Q, \theta) - M(Q, \theta)$.

E-Step: soft assignments

$q_{ik} \;\propto\; \lambda_k \, e^{-\frac{1}{2} \left\| \frac{x_i - \mu_k}{\sigma} \right\|^2}$, normalized so that $\sum_k q_{ik} = 1$.

M-Step: update parameters

$\mu_k \leftarrow \frac{\sum_i q_{ik} \, x_i}{\sum_i q_{ik}}, \qquad \lambda_k \leftarrow \frac{\sum_i q_{ik}}{\sum_{i,y} q_{iy}}.$
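A sketch of these two steps for the simple mixture, with σ held fixed as on the slide; the initialization from random observations and the helper name em_simple_gmm are illustration choices.

    import numpy as np

    def em_simple_gmm(X, K, sigma, n_iter=100, seed=0):
        """EM for a simple Gaussian mixture with a fixed, shared sigma."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        mu = X[rng.choice(n, size=K, replace=False)].astype(float)
        lam = np.full(K, 1.0 / K)
        for _ in range(n_iter):
            # E-Step: q_ik proportional to lambda_k exp(-||x_i - mu_k||^2 / (2 sigma^2))
            sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2) / (2 * sigma ** 2)
            q = lam[None, :] * np.exp(-(sq - sq.min(axis=1, keepdims=True)))  # shifted for stability
            q /= q.sum(axis=1, keepdims=True)          # normalize so that sum_k q_ik = 1
            # M-Step: weighted means and mixing proportions
            mu = (q[:, :, None] * X[:, None, :]).sum(axis=0) / q.sum(axis=0)[:, None]
            lam = q.sum(axis=0) / q.sum()
        return lam, mu, q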

SLIDE 26

Simple Gaussian mixture (3)

Relation with K-Means
  • Like K-Means, but with soft assignments.
  • Limit to K-Means when σ → 0.
In practice
  • Clearly slower than K-Means.
  • More robust to local minima.
  • Annealing σ helps.
Subtleties
  • Relation between σ and the number of clusters. . .
  • Relation between EM and Newton.

SLIDE 27

Next Lecture

Expectation-Maximization in general
  • EM for general Gaussian Mixture Models.
  • EM for all kinds of mixtures.
  • EM for dealing with missing data.
