Some mathematics for k-means clustering


  1. Some mathematics for k-means clustering. Rachel Ward. Berlin, December 2015.

  2. Part 1: Joint work with Pranjal Awasthi, Afonso Bandeira, Moses Charikar, Ravi Krishnaswamy, and Soledad Villar. Part 2: Joint work with Dustin Mixon and Soledad Villar.

  3. The basic geometric clustering problem. Given a finite dataset P = {x_1, x_2, ..., x_N} and a target number of clusters k, find a good partition so that the data within each part are "similar". "Geometric": we assume the points are embedded in a Hilbert space. Sometimes this is easy.


  5. The basic geometric clustering problem. But often it is not so clear (especially for data in R^d with d large).

  6. k-means clustering. The most popular unsupervised clustering method; points are embedded in Euclidean space.
  ◮ x_1, x_2, ..., x_N in R^d, with pairwise squared Euclidean distances ‖x_i − x_j‖_2^2.
  ◮ The k-means optimization problem: among all k-partitions C_1 ∪ C_2 ∪ ⋯ ∪ C_k = P, find one that minimizes
  $$\min_{C_1 \cup C_2 \cup \cdots \cup C_k = P} \; \sum_{i=1}^{k} \sum_{x \in C_i} \Big\| x - \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j \Big\|^2 .$$
  ◮ Works well for roughly spherical cluster shapes and uniform cluster sizes.
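To make the objective concrete, here is a minimal NumPy sketch (not from the slides) that evaluates the k-means cost of a given partition; the names X, labels, and kmeans_cost are illustrative assumptions.

```python
import numpy as np

def kmeans_cost(X, labels):
    """Sum over clusters of squared distances to the cluster centroid.

    X      : (N, d) array of points x_1, ..., x_N in R^d
    labels : length-N array assigning each point to a cluster 0, ..., k-1
    """
    cost = 0.0
    for c in np.unique(labels):
        cluster = X[labels == c]            # points in C_i
        centroid = cluster.mean(axis=0)     # (1/|C_i|) * sum_{x_j in C_i} x_j
        cost += np.sum((cluster - centroid) ** 2)
    return cost
```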

  7. k-means clustering. ◮ Classic application: RGB color quantization. ◮ In general, used as a simple and (nearly) parameter-free pre-processing step for feature learning; these features are then used for classification.

  8. Lloyd’s algorithm (’57) (a.k.a. "the" k-means algorithm). A simple algorithm for locally minimizing the k-means objective; responsible for the popularity of k-means:
  $$\min_{C_1 \cup C_2 \cup \cdots \cup C_k = P} \; \sum_{i=1}^{k} \sum_{x \in C_i} \Big\| x - \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j \Big\|^2$$
  ◮ Initialize k "means" at random from among the data points.
  ◮ Iterate until convergence between (a) assigning each point to its nearest mean and (b) recomputing each mean as the average of the points in its cluster.
  ◮ Only guaranteed to converge to local minimizers (k-means is NP-hard).
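A minimal NumPy sketch of Lloyd's iteration as described on this slide; the function name lloyd and the stopping rule (stop when assignments no longer change) are illustrative assumptions, not the slide's own code.

```python
import numpy as np

def lloyd(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate (a) nearest-mean assignment and (b) mean update."""
    rng = np.random.default_rng(seed)
    # Initialize k "means" at random from among the data points.
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # (a) assign each point to its nearest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing: a local minimizer
        labels = new_labels
        # (b) recompute each mean as the average of the points in its cluster
        for c in range(k):
            if np.any(labels == c):
                means[c] = X[labels == c].mean(axis=0)
    return labels, means
```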

  9. Lloyd’s algorithm (’57) (a.k.a. "the" k-means algorithm).
  ◮ Lloyd’s method often converges to poor local minima.
  ◮ [Arthur, Vassilvitskii ’07] k-means++: better initialization through non-uniform sampling, but still limited in high dimensions. Default initialization in Matlab's kmeans().
  ◮ [Kannan, Kumar ’10] Initialize Lloyd’s via a spectral embedding.
  ◮ For these methods, there is no "certificate" of optimality.
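The k-means++ seeding of [Arthur, Vassilvitskii ’07] samples each new center with probability proportional to its squared distance from the centers chosen so far. A sketch of that rule, as a generic illustration rather than the Matlab implementation:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: new centers are sampled with probability proportional
    to the squared distance from the nearest center chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]        # first center: uniform over the data
    for _ in range(k - 1):
        # squared distance of every point to its nearest already-chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```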

  10. [Figure] Points drawn from a Gaussian mixture model in R^5; initialization for k-means++ via Matlab 2014b kmeans(), seed 1. Panels compare k-means with k-means++ initialization, spectral initialization, and the semidefinite relaxation.

  11. Outline of Talk
  ◮ Part 1: Generative clustering models and exact recovery guarantees for an SDP relaxation of k-means
  ◮ Part 2: Stability results for the SDP relaxation of k-means

  12. Generative models for clustering. [Nellore, W ’2013]: Consider the "stochastic ball model":
  ◮ µ is an isotropic probability measure on R^d supported in the unit ball.
  ◮ Centers c_1, c_2, ..., c_k ∈ R^d such that ‖c_i − c_j‖_2 > ∆.
  ◮ µ_j is the translation of µ to c_j.
  ◮ Draw n points x_{ℓ,1}, x_{ℓ,2}, ..., x_{ℓ,n} from µ_ℓ, for ℓ = 1, ..., k; in total N = nk points.
  ◮ σ² = E(‖x_{ℓ,j} − c_ℓ‖_2²) ≤ 1.
  ◮ D ∈ R^{N×N} is the squared-distance matrix: D_{(ℓ,i),(m,j)} = ‖x_{ℓ,i} − x_{m,j}‖_2².
  Note: unlike the Stochastic Block Model, the edge weights here are not independent.
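A minimal sketch of sampling from the stochastic ball model and forming the squared-distance matrix D, using only the definitions on this slide; the uniform-in-the-ball choice of µ and the helper names are illustrative assumptions.

```python
import numpy as np

def sample_unit_ball(n, d, rng):
    """Draw n points uniformly from the unit ball in R^d (one isotropic choice of mu)."""
    g = rng.standard_normal((n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)   # uniform direction on the sphere
    r = rng.random(n) ** (1.0 / d)                  # radius giving uniform density in the ball
    return g * r[:, None]

def stochastic_ball_model(centers, n, seed=0):
    """n points per cluster, drawn from translated copies of mu centered at each c_l."""
    rng = np.random.default_rng(seed)
    k, d = centers.shape
    X = np.vstack([c + sample_unit_ball(n, d, rng) for c in centers])   # N = n*k points
    labels = np.repeat(np.arange(k), n)
    # Squared-distance matrix D_{(l,i),(m,j)} = ||x_{l,i} - x_{m,j}||_2^2
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return X, labels, D
```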


  14. Stochastic ball model. Benchmark for the "easy" clustering regime: ∆ ≥ 4. Points within the same cluster are closer to each other than points in different clusters, so simple thresholding of the distance matrix succeeds (see the sketch below). Existing clustering guarantees in this regime: [Kumar, Kannan ’10], [Elhamifar, Sapiro, Vidal ’12], [Nellore, W. ’13]. (Figure: an example with ∆ = 3.75.)
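The thresholding remark can be made concrete: when ∆ > 4, connect pairs of points whose squared distance is at most 4 (the squared diameter of a unit ball) and read off connected components. A hedged sketch under that assumption; the threshold and function name are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def threshold_cluster(D, thresh=4.0):
    """Cluster by thresholding the squared-distance matrix D: connect i and j
    whenever D[i, j] <= thresh (4 = squared diameter of a unit ball), then
    return the connected components of the resulting graph."""
    adjacency = csr_matrix(D <= thresh)
    n_clusters, labels = connected_components(adjacency, directed=False)
    return n_clusters, labels
```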

  15. Generative models for clustering. Benchmark for the "nontrivial" clustering case? 2 < ∆ < 4: the pairwise distance matrix D no longer looks much like E[D], where
  $$\mathbb{E}\big[ D_{(\ell,i),(m,j)} \big] = \| c_\ell - c_m \|_2^2 + 2\sigma^2 .$$
  ◮ Need a minimal number of points, n > d, where d is the ambient dimension.
  ◮ Take care with the distribution µ generating the points.

  16. Subtleties in the k-means objective.
  ◮ In one dimension, the k-means optimal solution (k = 2) switches at ∆ = 2.75.
  ◮ [Iguchi, Mixon, Peterson, Villar ’15] A similar phenomenon occurs in 2D for the distribution µ supported on the boundary of the ball, with the switch at ∆ ≈ 2.05.

  17. k-means clustering.
  ◮ Recall the k-means optimization problem:
  $$\min_{P = C_1 \cup C_2 \cup \cdots \cup C_k} \; \sum_{i=1}^{k} \sum_{x \in C_i} \Big\| x - \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j \Big\|^2$$
  ◮ Equivalent optimization problems (the identity behind the equivalence is worked out below):
  $$\min_{P = C_1 \cup C_2 \cup \cdots \cup C_k} \; \sum_{i=1}^{k} \frac{1}{|C_i|} \sum_{x, y \in C_i} \| x - y \|^2 \;=\; \min_{P = C_1 \cup C_2 \cup \cdots \cup C_k} \; \sum_{\ell=1}^{k} \frac{1}{|C_\ell|} \sum_{(i,j) \in C_\ell} D_{i,j}$$
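The step behind the first equivalence is a standard identity relating within-cluster variance to pairwise distances. For a single cluster C with mean x̄ = (1/|C|) Σ_{x∈C} x, summing over ordered pairs (a short derivation, not shown on the slide):

$$\sum_{x, y \in C} \| x - y \|^2 = \sum_{x, y \in C} \big( \|x\|^2 + \|y\|^2 - 2 \langle x, y \rangle \big) = 2|C| \sum_{x \in C} \|x\|^2 - 2 \Big\| \sum_{x \in C} x \Big\|^2 = 2|C| \sum_{x \in C} \| x - \bar{x} \|^2 .$$

Dividing by |C| and summing over clusters shows the pairwise form is exactly twice the centroid form, so both problems have the same minimizing partition; writing the inner sum in terms of D_{i,j} = ‖x_i − x_j‖² gives the third form.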


  19. k-means clustering ... equivalent to:
  $$\min_{Z \in \mathbb{R}^{N \times N}} \; \langle D, Z \rangle \quad \text{subject to} \quad \mathrm{Rank}(Z) = k, \;\; \lambda_1(Z) = \cdots = \lambda_k(Z) = 1, \;\; Z \mathbf{1} = \mathbf{1}, \;\; Z \geq 0 .$$
  Spectral clustering relaxation / spectral clustering: take the top k eigenvectors, followed by clustering in the reduced space.
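One common way to realize "top k eigenvectors, then cluster in the reduced space" starting from the squared-distance matrix D is classical multidimensional scaling followed by Lloyd's algorithm. The sketch below is an illustrative choice of embedding, not necessarily the one used on the slides.

```python
import numpy as np

def spectral_embed(D, k):
    """Classical MDS: double-center the squared-distance matrix D to obtain a Gram
    matrix, then keep the top-k eigenvectors as low-dimensional coordinates."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    G = -0.5 * J @ D @ J                     # Gram matrix of the centered points
    vals, vecs = np.linalg.eigh(G)           # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]         # indices of the k largest eigenvalues
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

# Cluster in the reduced space, e.g. with the lloyd() sketch above:
# Y = spectral_embed(D, k); labels, _ = lloyd(Y, k)
```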


  21. Our approach: semidefinite relaxation for k-means. [Peng, Wei ’05] proposed the k-means semidefinite relaxation:
  $$\min_{Z} \; \langle D, Z \rangle \quad \text{subject to} \quad \mathrm{Tr}(Z) = k, \;\; Z \succeq 0, \;\; Z \mathbf{1} = \mathbf{1}, \;\; Z \geq 0 .$$
  Note: the only parameter in the SDP is k, the number of clusters, even though the generative model assumes an equal number of points n in each cluster.
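A minimal sketch of the Peng–Wei SDP in CVXPY (an assumed off-the-shelf modeling tool, not part of the talk). On instances where the relaxation is tight, the optimizer Z should be close to the block matrix with value 1/n on each planted cluster block and 0 elsewhere.

```python
import numpy as np
import cvxpy as cp

def kmeans_sdp(D, k):
    """Peng-Wei SDP relaxation of k-means:
       minimize <D, Z>  subject to  Tr(Z) = k,  Z PSD,  Z 1 = 1,  Z >= 0 entrywise."""
    N = D.shape[0]
    Z = cp.Variable((N, N), PSD=True)            # Z is symmetric positive semidefinite
    constraints = [
        cp.trace(Z) == k,
        Z @ np.ones(N) == np.ones(N),            # Z 1 = 1
        Z >= 0,                                  # entrywise nonnegativity
    ]
    objective = cp.Minimize(cp.sum(cp.multiply(D, Z)))   # <D, Z>
    cp.Problem(objective, constraints).solve()
    return Z.value
```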

  22. k-means SDP – recovery guarantees.
  ◮ µ is an isotropic probability measure on R^d supported in the unit ball.
  ◮ Centers c_1, c_2, ..., c_k ∈ R^d such that ‖c_i − c_j‖_2 > ∆.
  ◮ µ_j is the translation of µ to c_j; σ² = E(‖x_{ℓ,j} − c_ℓ‖_2²) ≤ 1.
  Theorem (with A., B., C., K., V. ’14). Suppose
  $$\Delta \;\geq\; \sqrt{\frac{8\sigma^2}{d} + 8}\,.$$
  Then the k-means SDP recovers the clusters as its unique optimal solution with probability at least
  $$1 - 2dk \exp\!\left( - \frac{c\, n}{\log^2(n)\, d} \right).$$
  Proof idea: construct a dual certificate matrix that is PSD, orthogonal to the rank-k matrix with entries ‖x_i − c_j‖_2², and satisfies the dual constraints; then bound the largest eigenvalue of the residual "noise" matrix [Vershynin ’10].


  24. k-means SDP – cluster recovery guarantees.
  Theorem (with A., B., C., K., V. ’14). Suppose
  $$\Delta \;\geq\; \sqrt{\frac{8\sigma^2}{d} + 8}\,.$$
  Then the k-means SDP recovers the clusters as its unique optimal solution with probability at least
  $$1 - 2dk \exp\!\left( - \frac{c\, n}{\log^2(n)\, d} \right).$$
  ◮ In fact, a deterministic dual-certificate condition is sufficient; the "stochastic ball model" satisfies this condition with high probability.
  ◮ [Iguchi, Mixon, Peterson, Villar ’15]: recovery also for ∆ ≥ 2σ√(k/d), constructing a different dual certificate.
