SLIDE 1
Some mathematics for k-means clustering
Rachel Ward, Berlin, December 2015
Part 1: joint work with Pranjal Awasthi, Afonso Bandeira, Moses Charikar, Ravi Krishnaswamy, and Soledad Villar
Part 2: joint work with Dustin Mixon and Soledad Villar
SLIDE 2
SLIDE 3
The basic geometric clustering problem
Given a finite dataset P = {x1, x2, . . ., xN} and a target number of clusters k, find a good partition so that data within any given part are “similar”. “Geometric”: assume the points are embedded in a Hilbert space. Sometimes this is easy.
SLIDE 5
The basic geometric clustering problem
But often it is not so clear (especially with data in R^d for d large) ...
SLIDE 6
k-means clustering
Most popular unsupervised clustering method. Points embedded in Euclidean space.
◮ Points x1, x2, . . ., xN in R^d, with pairwise squared Euclidean distances ‖xi − xj‖²₂
◮ k-means optimization problem: among all k-partitions C1 ∪ C2 ∪ · · · ∪ Ck = P, find one that minimizes

min_{C1∪C2∪···∪Ck=P} ∑_{i=1}^{k} ∑_{x∈Ci} ‖ x − (1/|Ci|) ∑_{xj∈Ci} xj ‖²₂
◮ Works well for roughly spherical cluster shapes and uniform cluster sizes (an objective-evaluation sketch follows below)
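For concreteness, a minimal NumPy sketch (not from the slides; the function and variable names are illustrative) that evaluates this objective for a given partition:

```python
import numpy as np

def kmeans_objective(X, labels, k):
    """k-means objective: total squared distance of points to their cluster mean.

    X: (N, d) data matrix; labels: length-N array with values in {0, ..., k-1}.
    """
    total = 0.0
    for i in range(k):
        Ci = X[labels == i]              # points assigned to cluster i
        if len(Ci) == 0:
            continue
        mu = Ci.mean(axis=0)             # centroid (1/|Ci|) * sum of points in Ci
        total += ((Ci - mu) ** 2).sum()  # sum of squared distances to the centroid
    return total
```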
SLIDE 7
k-means clustering
◮ Classic application: RGB color quantization
◮ More generally, a simple and (nearly) parameter-free pre-processing step for feature learning; the learned features are then used for classification
SLIDE 8
Lloyd’s algorithm (’57) (a.k.a. “the” k-means algorithm)
A simple algorithm for locally minimizing the k-means objective; responsible for the popularity of k-means:

min_{C1∪C2∪···∪Ck=P} ∑_{i=1}^{k} ∑_{x∈Ci} ‖ x − (1/|Ci|) ∑_{xj∈Ci} xj ‖²₂
◮ Initialize k “means” at random from among the data points
◮ Iterate until convergence between (a) assigning each point to its nearest mean and (b) recomputing each mean as the average of the points in its cluster
◮ Only guaranteed to converge to local minimizers (k-means is NP-hard); a sketch of the iteration follows below
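A minimal NumPy sketch of Lloyd's iteration, following the slide's description (initialization and empty-cluster handling are simplified; names are illustrative):

```python
import numpy as np

def lloyd(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate (a) nearest-mean assignment, (b) mean update."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]  # k random data points as initial means
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # (a) assign each point to its nearest mean
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # (b) recompute each mean as the average of the points in its cluster
        new_means = np.array([X[labels == i].mean(axis=0) if (labels == i).any() else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):
            break  # fixed point reached: a local minimizer of the objective
        means = new_means
    return means, labels
```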
SLIDE 9
Lloyd’s algorithm (’57) (a.k.a. “the” k-means algorithm)
◮ Lloyd’s method often gets stuck in poor local minima
◮ [Arthur, Vassilvitskii ’07] k-means++: better initialization through non-uniform sampling, but still limited in high dimensions; the default initialization in Matlab’s kmeans() (a seeding sketch follows below)
◮ [Kannan, Kumar ’10] Initialize Lloyd’s via spectral embedding
◮ For these methods, no “certificate” of optimality
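For reference, a sketch of the k-means++ seeding rule of Arthur and Vassilvitskii (each new center is sampled with probability proportional to its squared distance from the centers chosen so far); the implementation details here are illustrative:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: sample centers with probability proportional to D(x)^2."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center chosen uniformly at random
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen center
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])  # non-uniform sampling
    return np.array(centers)
```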
SLIDE 10
[Figure: points drawn from a Gaussian mixture model in R^5; k-means++ initialization via Matlab 2014b kmeans(), seed 1. Panels: k-means++, spectral initialization + k-means, semidefinite relaxation.]
SLIDE 11
Outline of Talk
◮ Part 1: Generative clustering models and exact recovery guarantees for SDP relaxation of k-means
◮ Part 2: Stability results for SDP relaxation of k-means
SLIDE 12
Generative models for clustering
[Nellore, W. ’13]: Consider the “stochastic ball model”:
◮ µ is an isotropic probability measure on R^d supported in the unit ball
◮ Centers c1, c2, . . ., ck ∈ R^d with ‖ci − cj‖₂ > ∆
◮ µj is the translation of µ to cj
◮ Draw n points xℓ,1, xℓ,2, . . ., xℓ,n from µℓ, ℓ = 1, . . ., k; N = nk points in total
◮ σ² = E ‖xℓ,j − cℓ‖²₂ ≤ 1

D ∈ R^{N×N} with D(ℓ,i),(m,j) = ‖xℓ,i − xm,j‖²₂ (a sampler sketch follows below)
Note: unlike in the Stochastic Block Model, the edge weights here are not independent
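A sketch of a sampler for this model; the choice of µ as the uniform measure on the unit ball is an illustrative assumption (it is isotropic, with σ² = d/(d+2) ≤ 1):

```python
import numpy as np

def stochastic_ball_model(centers, n, seed=0):
    """Draw n points per center from the uniform unit-ball measure translated to each center.

    centers: (k, d) array, assumed to satisfy pairwise distances > Delta.
    Returns X of shape (n*k, d), grouped cluster by cluster.
    """
    rng = np.random.default_rng(seed)
    k, d = centers.shape
    blocks = []
    for c in centers:
        g = rng.standard_normal((n, d))
        g /= np.linalg.norm(g, axis=1, keepdims=True)  # uniformly random directions
        r = rng.random(n) ** (1.0 / d)                 # radii for the uniform ball measure
        blocks.append(c + r[:, None] * g)
    return np.vstack(blocks)

# pairwise squared-distance matrix D
X = stochastic_ball_model(np.array([[0.0, 0.0], [4.0, 0.0]]), n=50)
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
```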
SLIDE 14
Stochastic ball model
Benchmark for the “easy” clustering regime: ∆ ≥ 4. Points within the same cluster are closer to each other than points in different clusters, so simple thresholding of the distance matrix succeeds (see the sketch below). Existing clustering guarantees in this regime: [Kumar, Kannan ’10], [Elhamifar, Sapiro, Vidal ’12], [Nellore, W. ’13]
[Figure: example with ∆ = 3.75]
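A sketch of the thresholding idea (illustrative; assumes scipy): link pairs whose distance is below a cutoff and read off connected components. For ∆ > 4 and unit balls, intra-cluster squared distances are ≤ 4 < (∆ − 2)², so any cutoff between the two separates the clusters.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def threshold_cluster(D, cutoff):
    """Clusters = connected components of the graph of pairs with D below cutoff."""
    A = csr_matrix(D < cutoff)  # adjacency matrix of "close" pairs
    n_components, labels = connected_components(A, directed=False)
    return n_components, labels
```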
SLIDE 15
Generative models for clustering
Benchmark for the “nontrivial” clustering case? 2 < ∆ < 4. The pairwise distance matrix D no longer looks much like E[D], where

E[D(ℓ,i),(m,j)] = ‖cℓ − cm‖²₂ + 2σ²

◮ Need a minimal number of points n > d, where d is the ambient dimension
◮ Take care with the distribution µ generating the points
SLIDE 16
Subtleties in k-means objective
◮ In one dimension, the k-means optimal solution (k = 2) switches at ∆ = 2.75
◮ [Iguchi, Mixon, Peterson, Villar ’15] Similar phenomenon in 2D for a distribution µ supported on the boundary of the ball; the switch occurs at ∆ ≈ 2.05
SLIDE 17
k-means clustering
◮ Recall the k-means optimization problem:

min_{P=C1∪C2∪···∪Ck} ∑_{i=1}^{k} ∑_{x∈Ci} ‖ x − (1/|Ci|) ∑_{xj∈Ci} xj ‖²₂

◮ Equivalent optimization problem:

min_{P=C1∪C2∪···∪Ck} ∑_{i=1}^{k} (1/|Ci|) ∑_{x,y∈Ci} ‖x − y‖² = min_{P=C1∪C2∪···∪Ck} ∑_{ℓ=1}^{k} (1/|Cℓ|) ∑_{(i,j)∈Cℓ} Di,j

A quick numerical check of this identity follows below.
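An illustrative check of the equivalence (the pairwise sum runs over unordered pairs within each cluster):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 3))
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # a fixed 2-partition

# centroid form: sum of squared distances to cluster means
lhs = sum(((X[labels == i] - X[labels == i].mean(axis=0)) ** 2).sum() for i in range(2))

# pairwise form: (1/|C|) * sum over unordered pairs within each cluster
rhs = sum(sum(((X[a] - X[b]) ** 2).sum()
              for a, b in combinations(np.flatnonzero(labels == i), 2)) / (labels == i).sum()
          for i in range(2))

print(np.isclose(lhs, rhs))  # True
```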
SLIDE 19
k-means clustering
... equivalent to:

min_{Z∈R^{N×N}} ⟨D, Z⟩ subject to Rank(Z) = k, λ1(Z) = · · · = λk(Z) = 1, Z1 = 1, Z ≥ 0

Spectral clustering relaxation: get the top k eigenvectors, followed by clustering in the reduced space.
SLIDE 21
Our approach: Semidefinite relaxation for k-means
[Peng, Wei ’05] proposed the k-means semidefinite relaxation:

min ⟨D, Z⟩ subject to Tr(Z) = k, Z ⪰ 0, Z1 = 1, Z ≥ 0

Note: the only parameter in the SDP is k, the number of clusters, even though the generative model assumes an equal number of points n in each cluster. (A solver sketch follows below.)
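A minimal sketch of this SDP in cvxpy (an assumed, illustrative setup, not the authors' code; practical sizes need faster solvers, cf. the complexity remark at the end of the talk):

```python
import cvxpy as cp
import numpy as np

def kmeans_sdp(D, k):
    """Peng-Wei SDP: min <D, Z> s.t. Tr(Z) = k, Z PSD, Z1 = 1, Z >= 0."""
    N = D.shape[0]
    Z = cp.Variable((N, N), PSD=True)  # symmetric positive semidefinite Z
    constraints = [cp.trace(Z) == k,
                   Z @ np.ones(N) == np.ones(N),  # Z1 = 1
                   Z >= 0]                        # entrywise nonnegativity
    prob = cp.Problem(cp.Minimize(cp.trace(D @ Z)), constraints)  # <D, Z>
    prob.solve(solver=cp.SCS)
    return Z.value
```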
SLIDE 22
k-means SDP – recovery guarantees
◮ µ is an isotropic probability measure on R^d supported in the unit ball
◮ Centers c1, c2, . . ., ck ∈ R^d with ‖ci − cj‖₂ > ∆
◮ µj is the translation of µ to cj; σ² = E ‖xℓ,j − cℓ‖²₂ ≤ 1
Theorem (with A., B., C., K., V. ’14)
Suppose ∆ ≥ √(8σ²/d + 8). Then the k-means SDP recovers the clusters as its unique optimal solution with probability ≥ 1 − 2dk exp(−cn / (log²(n) d)).
Proof: construct a dual certificate matrix that is PSD, orthogonal to the rank-k matrix with entries ‖xi − cj‖²₂, and satisfies the dual constraints; then bound the largest eigenvalue of the residual “noise” matrix [Vershynin ’10]
SLIDE 24
k-means SDP – cluster recovery guarantees
Theorem (with A., B., C., K., V. ’14)
Suppose ∆ ≥ √(8σ²/d + 8). Then the k-means SDP recovers the clusters as its unique optimal solution with probability ≥ 1 − 2dk exp(−cn / (log²(n) d)).
◮ In fact, a deterministic dual certificate condition is sufficient; the “stochastic ball model” satisfies it with high probability
◮ [Iguchi, Mixon, Peterson, Villar ’15]: recovery also for ∆ ≥ 2σ√(k/d), via a different dual certificate construction
SLIDE 25
Inspirations
◮ [Candes, Romberg, Tao ’04; Donoho ’04] Compressive sensing
◮ Matrix factorizations:
◮ [Recht, Fazel, Parrilo ’10] Low-rank matrix recovery
◮ [Chandrasekaran, Sanghavi, Parrilo, Willsky ’09] Robust PCA
◮ [Bittorf, Recht, Re, Tropp ’12] Nonnegative matrix factorization
◮ [Oymak, Hassibi, Jalali, Chen, Sanghavi, Xu, Fazel, Ames, Mossel, Neeman, Sly, Abbe, Bandeira, ...] Community detection, stochastic block model
◮ Many more...
SLIDE 26
Stability of k-means SDP
SLIDE 27
Stability of k-means SDP
Recall the SDP:

min_{Z∈R^{N×N}} ⟨D, Z⟩ subject to Rank(Z) = k, λ1(Z) = · · · = λk(Z) = 1, Z1 = 1, Z ≥ 0

◮ For data X = [x1, x2, . . ., xN] “close” to being separated into k clusters, the SDP solution X Z_opt = [ĉ1, ĉ2, . . ., ĉN] should be “close” to a cluster solution X Z_C
◮ “Clustering is only hard when the data does not fit the clustering model”
SLIDE 28
Stability of k-means SDP
Gaussian mixture model with “even” weights:
◮ Centers c1, c2, . . ., ck ∈ R^d
◮ For each t ∈ {1, 2, . . ., k}, draw xt,1, xt,2, . . ., xt,n i.i.d. from N(ct, σ²I); N = nk points in total
◮ ∆ = min_{a≠b} ‖ca − cb‖₂
◮ Want stability results in the regime ∆ = Cσ for small C > 1
◮ Note: now E ‖xt,j − ct‖² = dσ² (a sampler sketch follows below)
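A sketch of this generative model (illustrative names):

```python
import numpy as np

def gaussian_mixture(centers, n, sigma, seed=0):
    """Draw n i.i.d. points from N(c_t, sigma^2 I) for each center c_t, t = 1..k."""
    rng = np.random.default_rng(seed)
    k, d = centers.shape
    return np.vstack([c + sigma * rng.standard_normal((n, d)) for c in centers])
```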
SLIDE 30
Observed tightness of SDP
[Animation: points in R^5, projected onto the first 2 coordinates; courtesy of Soledad Villar]
SLIDE 31
Stability of k-means SDP
min ⟨D, Z⟩ subject to Tr(Z) = k, Z ⪰ 0, Z1 = 1, Z ≥ 0
Theorem (with D. Mixon and S. Villar, 2016)
Consider N = nk points xj,ℓ generated via the Gaussian mixture model with centers c1, c2, . . ., ck. Then with probability ≥ 1 − η, the SDP-optimal centers [ĉ1,1, ĉ1,2, . . ., ĉj,ℓ, . . ., ĉk,n] satisfy

(1/N) ∑_{j=1}^{k} ∑_{ℓ=1}^{n} ‖ĉj,ℓ − cj‖²₂ ≤ C (kσ² + log(1/η)) / ∆²

where C is not too big.
◮ Since E[‖xj,ℓ − cj‖²₂] = dσ², this is noise reduction in expectation
◮ Apply Markov’s inequality to get a rounding scheme (a sketch follows below)
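One plausible way to realize the denoising and rounding steps in code (illustrative; kmeans_sdp and lloyd refer to the sketches above, and the rounding here simply reclusters the denoised points):

```python
import numpy as np

def denoise_and_round(X, Z_opt, k):
    """Denoise with the SDP solution, then round to hard clusters.

    X: (N, d) data matrix (rows are points); Z_opt: (N, N) solution of kmeans_sdp.
    In row convention, the rows of Z_opt @ X are the estimated centers
    [c-hat_1, ..., c-hat_N]: each row is a Z-weighted average of the data.
    """
    C_hat = Z_opt @ X                # denoised point for each data point
    means, labels = lloyd(C_hat, k)  # round by clustering the denoised points
    return C_hat, labels
```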
SLIDE 33
Observed tightness of SDP
[Animation: points in R^5, projected onto the first 2 coordinates; courtesy of Soledad Villar]
Observation: when the SDP is not tight after one iteration, it is tight after two or three iterations:

[x1, x2, . . ., xN] → [ĉ1, ĉ2, . . ., ĉN] → [ĉ′1, ĉ′2, . . ., ĉ′N]
SLIDE 34
Summary
◮ We analyzed a convex relaxation of the k-means optimization problem and showed that it can recover globally optimal k-means solutions when the underlying data can be partitioned into separated balls
◮ In the same setting, popular heuristics like Lloyd’s algorithm can get stuck at locally optimal solutions
◮ We also showed that the k-means SDP is stable, providing noise reduction for Gaussian mixture models
◮ Philosophy: it is OK, and in fact better, that the k-means SDP does not always return hard clusters; the denoising level indicates the “clusterability” of the data
SLIDE 35
Future directions
◮ The SDP relaxation for k-means clustering is not fast: complexity scales at least like N⁶, where N is the number of points. Fast solvers are needed
◮ Guarantees for kernel k-means for non-spherical data
◮ Make dual-certificate-based clustering algorithms interactive (semi-supervised)
SLIDE 36
Thanks!
Mentioned papers:
1. Relax, no need to round: integrality of clustering formulations. With P. Awasthi, A. Bandeira, M. Charikar, R. Krishnaswamy, and S. Villar. ITCS, 2015.
2. Stability of an SDP relaxation of k-means. D. Mixon, S. Villar, R. Ward. Preprint, 2016.