  1. Clustering
Léon Bottou, NEC Labs America
COS 424 – 3/4/2010

  2. Agenda
– Goals: classification, clustering, regression, other.
– Representation: parametric vs. kernels vs. nonparametric; probabilistic vs. nonprobabilistic; linear vs. nonlinear; deep vs. shallow.
– Capacity control: explicit (architecture, feature selection); explicit (regularization, priors); implicit (approximate optimization); implicit (Bayesian averaging, ensembles).
– Loss functions.
– Operational considerations: budget constraints; online vs. offline.
– Computational considerations: exact algorithms for small datasets; stochastic algorithms for big datasets; parallel algorithms.

  3. Introduction
Clustering: assigning observations to subsets with similar characteristics.
Applications:
– medicine, biology
– market research, data mining
– image segmentation
– search results
– topics, taxonomies
– communities
Why is clustering so attractive? It is an embodiment of Descartes’ philosophy (“Discourse on the Method of Rightly Conducting One’s Reason”): “. . . divide each of the difficulties under examination . . . as might be necessary for its adequate solution.”

  4. Summary
1. What is a cluster?
2. K-Means
3. Hierarchical clustering
4. Simple Gaussian mixtures

  5. What is a cluster?
[Figure: the density of X shows two well-separated modes.]
Two neatly separated classes leave a trace in P{X}.

  6. Input space transformations
The input space is often an arbitrary decision. For instance: camera pixels versus retina pixels.
What happens if we apply a reversible transformation to the inputs?

  7. Input space transformations
The Bayes optimal decision boundary moves with the transformation. The Bayes optimal error rate is unchanged. The neatly separated clusters are gone!
[Figure: the same two classes after a reversible transformation; the gap in the density has disappeared.]
Clustering depends on the arbitrary definition of the input space! This is very different from classification, regression, etc.
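A minimal numeric sketch of this point (not from the slides; assumes NumPy): a reversible, strictly monotonic transformation keeps every point's class membership but can flatten the density, so the cluster structure visible in P{X} disappears.

    import numpy as np

    rng = np.random.default_rng(0)
    # Two well-separated 1-D clusters: the density of X shows a clear gap.
    x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(10.0, 1.0, 500)])

    # A reversible (strictly monotonic on the sample) transformation: the rank map.
    ranks = np.argsort(np.argsort(x)) / (len(x) - 1.0)

    print(np.histogram(x, bins=20)[0])      # near-empty bins in the middle: the gap
    print(np.histogram(ranks, bins=20)[0])  # roughly flat counts: the gap is gone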

  8. K-Means
The K-Means problem:
– Given observations $x_1, \dots, x_n$, determine K centroids $w_1, \dots, w_K$ that minimize the distortion $C(w) = \sum_{i=1}^{n} \min_k \lVert x_i - w_k \rVert^2$.
Interpretation:
– Minimize the discretization error.
Properties:
– Non-convex objective.
– Finding the global minimum is NP-hard in general.
– Finding acceptable local minima is surprisingly easy.
– Initialization dependent.
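The distortion translates directly into code. A small sketch (not from the slides; assumes NumPy arrays X of shape (n, d) and W of shape (K, d)):

    import numpy as np

    def distortion(X, W):
        """C(w) = sum_i min_k ||x_i - w_k||^2 for observations X (n, d) and centroids W (K, d)."""
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)  # (n, K) squared distances
        return d2.min(axis=1).sum()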

  9. Offline K-Means
Lloyd's algorithm:
initialize centroids $w_k$
repeat
– assign points to clusters: $\forall i,\ s_i \leftarrow \arg\min_k \lVert x_i - w_k \rVert^2$, and $S_k \leftarrow \{\, i : s_i = k \,\}$
– recompute centroids: $\forall k,\ w_k \leftarrow \arg\min_w \sum_{i \in S_k} \lVert x_i - w \rVert^2 = \frac{1}{\mathrm{card}(S_k)} \sum_{i \in S_k} x_i$
until convergence.
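A compact NumPy sketch of Lloyd's algorithm as described above (an illustration, not the course code; initialization from random data points is one common choice):

    import numpy as np

    def lloyd_kmeans(X, K, n_iter=100, seed=0):
        """Offline K-Means: alternate the assignment and centroid steps until nothing moves."""
        rng = np.random.default_rng(seed)
        W = X[rng.choice(len(X), K, replace=False)].astype(float)   # initialize centroids
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
            s = d2.argmin(axis=1)                                    # assignment step
            W_new = np.array([X[s == k].mean(axis=0) if np.any(s == k) else W[k]
                              for k in range(K)])                    # centroid step
            if np.allclose(W_new, W):
                break
            W = W_new
        return W, s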

  10. Lloyd's algorithm – Illustration
Initial state:
– Squares = data points.
– Circles = centroids.

  11. Lloyd's algorithm – Illustration
1. Assign data points to clusters.

  12. Lloyd's algorithm – Illustration
2. Recompute centroids.

  13. Lloyd's algorithm – Illustration
Assign data points to clusters...

  14. Why does Lloyd's algorithm work?
Consider an arbitrary cluster assignment $s_i$:
$C(w) = \sum_{i=1}^{n} \min_k \lVert x_i - w_k \rVert^2 = \underbrace{\sum_{i=1}^{n} \lVert x_i - w_{s_i} \rVert^2}_{L(s,w)} - \underbrace{\left( \sum_{i=1}^{n} \lVert x_i - w_{s_i} \rVert^2 - \sum_{i=1}^{n} \min_k \lVert x_i - w_k \rVert^2 \right)}_{D(s,w)\ \ge\ 0}$
[Diagram: the assignment step drives D(s, w) to zero and the centroid step decreases L(s, w); L never increases and is bounded below, so the algorithm converges.]

  15. Online K-Means
MacQueen's algorithm:
initialize centroids $w_k$ and counts $n_k = 0$
repeat
– pick an observation $x_t$ and determine its cluster: $s_t = \arg\min_k \lVert x_t - w_k \rVert^2$
– update centroid $s_t$: $n_{s_t} \leftarrow n_{s_t} + 1$, $\ w_{s_t} \leftarrow w_{s_t} + \frac{1}{n_{s_t}} (x_t - w_{s_t})$
until satisfaction.
Comments:
– MacQueen's algorithm finds decent clusters much faster.
– Final convergence could be slow. Do we really care?
– Just perform one or two passes over the randomly shuffled observations.
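A one-pass NumPy sketch of MacQueen's algorithm (an illustration of the update above, not the original course code):

    import numpy as np

    def macqueen_kmeans(X, K, seed=0):
        """Online K-Means: a single pass over randomly shuffled observations."""
        rng = np.random.default_rng(seed)
        X = X[rng.permutation(len(X))]
        W = X[:K].astype(float).copy()                    # initialize with the first K points
        n = np.zeros(K)
        for x in X:
            s = ((W - x) ** 2).sum(axis=1).argmin()       # nearest centroid
            n[s] += 1
            W[s] += (x - W[s]) / n[s]                     # recursive-average update
        return W, n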

  16. Why does MacQueen's algorithm work?
Explanation 1: Recursive averages.
– Let $u_n = \frac{1}{n} \sum_{i=1}^{n} x_i$. Then $u_n = u_{n-1} + \frac{1}{n} (x_n - u_{n-1})$.
Explanation 2: Stochastic gradient.
– Apply stochastic gradient to $C(w) = \frac{1}{2n} \sum_{i=1}^{n} \min_k \lVert x_i - w_k \rVert^2$: $\ w_{s_t} \leftarrow w_{s_t} + \gamma_t (x_t - w_{s_t})$.
Explanation 3: Stochastic gradient + Newton.
– The Hessian H of C(w) is diagonal and contains the fraction of observations assigned to each cluster:
$w_{s_t} \leftarrow w_{s_t} + \frac{1}{t} H^{-1} (x_t - w_{s_t}) = w_{s_t} + \frac{1}{n_{s_t}} (x_t - w_{s_t})$
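Explanation 1 is easy to verify numerically; a tiny check (not from the slides; assumes NumPy):

    import numpy as np

    x = np.random.default_rng(1).normal(size=1000)
    u = 0.0
    for n, xn in enumerate(x, start=1):
        u += (xn - u) / n                 # u_n = u_{n-1} + (x_n - u_{n-1}) / n
    print(np.isclose(u, x.mean()))        # True: the recursive average equals the batch mean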

  17. Example: Color quantization of images
Problem:
– Convert a 24-bit RGB image into an indexed image with a palette of K colors.
Solution:
– The (r, g, b) values of the pixels are the observations $x_i$.
– The (r, g, b) values of the K palette colors are the centroids $w_k$.
– Initialize the $w_k$ with an all-purpose palette, or alternatively with the colors of random pixels.
– Perform one pass of MacQueen's algorithm.
– Eliminate centroids with no observations.
– You are done.
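A sketch of this recipe (hypothetical helper, not the course code; assumes NumPy, a smallish image, and random-pixel initialization rather than an all-purpose palette):

    import numpy as np

    def quantize_colors(image, K=16, seed=0):
        """Index an (H, W, 3) RGB image with a K-color palette via one MacQueen pass."""
        rng = np.random.default_rng(seed)
        pixels = image.reshape(-1, 3).astype(float)
        palette = pixels[rng.choice(len(pixels), K, replace=False)]   # random-pixel init
        counts = np.zeros(K)
        for x in pixels[rng.permutation(len(pixels))]:                # one online pass
            s = ((palette - x) ** 2).sum(axis=1).argmin()
            counts[s] += 1
            palette[s] += (x - palette[s]) / counts[s]
        palette = palette[counts > 0]                                 # drop empty centroids
        d2 = np.stack([((pixels - p) ** 2).sum(axis=1) for p in palette], axis=1)
        return d2.argmin(axis=1).reshape(image.shape[:2]), palette.astype(np.uint8)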

  18. How many clusters?
Rules of thumb?
– K = 10, K = √n, ...
The elbow method?
– Measure the distortion on a validation set.
– The distortion decreases when K increases.
– Sometimes there is no elbow, or several elbows.
– Local minima mess up the picture.
Rate-distortion:
– Each additional cluster reduces the distortion.
– Weigh the cost of an additional cluster against the cost of the distortion.
– Just another way to select K.
Conclusion:
– Clustering is a very subjective matter.
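A small helper for the elbow method (a sketch, not from the slides; `lloyd_kmeans` refers to the earlier K-Means sketch):

    import numpy as np

    def validation_distortion(X_train, X_valid, ks, seed=0):
        """Held-out distortion as a function of K; look for where the decrease flattens."""
        curve = {}
        for K in ks:
            W, _ = lloyd_kmeans(X_train, K, seed=seed)
            d2 = ((X_valid[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
            curve[K] = d2.min(axis=1).sum()
        return curve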

  19. Hierarchical clustering
Agglomerative clustering:
– Initialization: each observation is its own cluster.
– Repeatedly merge the closest clusters, where closeness is measured by
  single linkage $D(A, B) = \min_{x \in A,\, y \in B} d(x, y)$,
  complete linkage $D(A, B) = \max_{x \in A,\, y \in B} d(x, y)$,
  distortion estimates, etc.
Divisive clustering:
– Initialization: one cluster contains all observations.
– Repeatedly divide the largest cluster, e.g. with 2-Means.
– Lots of variants.
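For illustration, SciPy's agglomerative routines implement the single- and complete-linkage rules above (the library choice is an assumption; the slides do not name one):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.default_rng(0).normal(size=(100, 2))
    Z_single   = linkage(X, method="single")      # D(A, B) = min over pairs of d(x, y)
    Z_complete = linkage(X, method="complete")    # D(A, B) = max over pairs of d(x, y)
    labels = fcluster(Z_complete, t=5, criterion="maxclust")   # cut the tree into 5 clusters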

  20. K-Means plus Agglomerative Clustering
Algorithm:
– Run K-Means with a large K.
– Count the number of observations in each cluster.
– Merge the closest clusters according to the following metric.
Let A be a cluster with $n_A$ members and centroid $w_A$, and B a cluster with $n_B$ members and centroid $w_B$. The putative center of $A \cup B$ is $w_{AB} = (n_A w_A + n_B w_B) / (n_A + n_B)$.
Quick estimate of the distortion increase:
$d(A, B) = \sum_{x \in A \cup B} \lVert x - w_{AB} \rVert^2 - \sum_{x \in A} \lVert x - w_A \rVert^2 - \sum_{x \in B} \lVert x - w_B \rVert^2 = n_A \lVert w_A - w_{AB} \rVert^2 + n_B \lVert w_B - w_{AB} \rVert^2$
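The distortion-increase estimate is a one-liner; a sketch (hypothetical helper, assuming NumPy centroid vectors):

    import numpy as np

    def merge_cost(nA, wA, nB, wB):
        """Distortion increase d(A, B) when merging clusters A and B."""
        wAB = (nA * wA + nB * wB) / (nA + nB)
        # equivalently: nA * nB / (nA + nB) * ||wA - wB||^2
        return nA * ((wA - wAB) ** 2).sum() + nB * ((wB - wAB) ** 2).sum()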

  21. Dendrogram

  22. Simple Gaussian mixture (1)
Clustering via density estimation:
– Pick a parametric model $P_\theta(X)$.
– Maximize the likelihood.
The parametric model:
– There are K components.
– To generate an observation:
  a) pick a component k with probabilities $\lambda_1, \dots, \lambda_K$;
  b) generate x from component k, i.e. from $\mathcal{N}(\mu_k, \sigma)$.
Notes:
– Same standard deviation σ for all components (for now).
– That's why I write "simple GMM".
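A sketch of the generative process a)–b) above (hypothetical helper names, assuming NumPy):

    import numpy as np

    def sample_simple_gmm(lambdas, mus, sigma, n, seed=0):
        """Draw n observations from the simple (shared isotropic sigma) Gaussian mixture."""
        rng = np.random.default_rng(seed)
        mus = np.asarray(mus, dtype=float)                       # (K, d) component means
        y = rng.choice(len(lambdas), size=n, p=lambdas)          # a) pick components
        x = mus[y] + sigma * rng.normal(size=(n, mus.shape[1]))  # b) draw from N(mu_y, sigma^2 I)
        return x, y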

  23. Simple Gaussian mixture (2)
Parameters: $\theta = (\lambda_1, \mu_1, \dots, \lambda_K, \mu_K)$.
Model: $P_\theta(Y = y) = \lambda_y$, $\quad P_\theta(X = x \mid Y = y) = \frac{1}{(2\pi)^{d/2} \sigma^d} \, e^{-\frac{\lVert x - \mu_y \rVert^2}{2\sigma^2}}$.
Likelihood:
$\log L(\theta) = \sum_{i=1}^{n} \log P_\theta(X = x_i) = \sum_{i=1}^{n} \log \sum_{y=1}^{K} P_\theta(Y = y) \, P_\theta(X = x_i \mid Y = y)$
Maximize!
– This is non-convex.
– There are K! copies of each minimum (local or global).
– Conjugate gradients or Newton's method work.
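The log-likelihood above can be evaluated stably with a log-sum-exp; a sketch (not from the slides; assumes NumPy/SciPy and the isotropic Gaussian written above):

    import numpy as np
    from scipy.special import logsumexp

    def gmm_log_likelihood(X, lambdas, mus, sigma):
        """log L(theta) = sum_i log sum_y lambda_y N(x_i; mu_y, sigma^2 I)."""
        n, d = X.shape
        d2 = ((X[:, None, :] - np.asarray(mus)[None, :, :]) ** 2).sum(axis=2)     # (n, K)
        log_gauss = -0.5 * d2 / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)
        return logsumexp(np.log(lambdas) + log_gauss, axis=1).sum()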
