Streaming algorithms for k-center clustering with outliers and with anonymity
Richard Matthew McCutchen and Samir Khuller, University of Maryland
{rmccutch,samir}@cs.umd.edu
k-center clustering problem
- Input: n points in an arbitrary metric space.
- Goal: Partition them into k clusters and assign each a center point to minimize the maximum distance from an input point to its cluster center.
[Figure: example clusterings with k = 3 and k = 2]
Greedy 2-approximation
- Greedily make clusters of radius 2R centered at uncovered points.
- Take the smallest R for which ≤ k clusters suffice.
- If we use k clusters but points are left uncovered, R was too small: start over with a bigger guess.
- We are guaranteed to succeed with ≤ k clusters whenever R ≥ OPT, no matter how badly we choose centers, because each new cluster of radius 2R covers the entire optimal cluster containing its center. Hence the R we end up taking satisfies R ≤ OPT, and the radius-2R clusters form a 2-approximation (see the sketch below).
[Figure: example run with k = 3, comparing the greedy balls of radius 2R to the optimal radius R_OPT]
(Hochbaum and Shmoys, 1985)
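A minimal offline sketch of this guess-and-cover scheme (illustrative, not the authors' code). It assumes the common discrete setting in which centers are input points, so the optimal radius is one of the pairwise distances and the smallest workable guess can be found by trying them in increasing order; `dist` is any metric.

    def greedy_cover(points, k, R, dist):
        # Try to cover every point with at most k balls of radius 2R, each
        # centered at a point that was still uncovered when it was chosen.
        centers, uncovered = [], list(points)
        while uncovered:
            if len(centers) == k:
                return None              # k clusters used but points left: R too small
            c = uncovered[0]             # any uncovered point can serve as the center
            centers.append(c)
            uncovered = [p for p in uncovered if dist(p, c) > 2 * R]
        return centers

    def k_center_2_approx(points, k, dist):
        # Whenever R >= OPT, each new ball of radius 2R swallows the entire
        # optimal cluster of its center, so greedy_cover succeeds with <= k
        # centers.  Hence the smallest successful guess satisfies R <= OPT.
        candidates = sorted({dist(points[i], points[j])
                             for i in range(len(points))
                             for j in range(i + 1, len(points))})
        for R in candidates:
            centers = greedy_cover(points, k, R, dist)
            if centers is not None:
                return centers, 2 * R    # radius at most 2 * OPT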
Streaming model
- Data set too large to fit in memory
- Receive points one at a time (can't start over!)
- Maintain small state, incl. solution for input so far
- Return solution when end of input is reached
[Diagram: a large data set is streamed point by point into a small in-memory state that holds the solution for the input so far]
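A hypothetical interface capturing this model (illustrative only; the names are not from the talk): the algorithm consumes each point exactly once, keeps small state, and can report a solution at any time.

    class StreamingKCenter:
        """Hypothetical streaming interface: small state, one pass over the points."""
        def process(self, point):
            # Consume one input point, updating the small state.
            raise NotImplementedError
        def current_solution(self):
            # Return (centers, radius) valid for all points seen so far.
            raise NotImplementedError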
Doubling Algorithm
- State:
  – Lower bound R on the optimal radius
  – ≤ k "stored centers" such that every input point read so far is within 8R of a stored center
  – Stored centers pairwise separated by at least 4R
  – ⇒ Stored centers give an 8-approximation at any time
- If an input point is within 8R of a stored center, then drop it; otherwise store it as a new center.
[Figure: k = 2 example showing stored centers and their coverage balls of radius 8R]
(Charikar et al., STOC 1997)
Doubling Algorithm: raising R
- Oops, we have > k stored centers!
  – Must drop some and account for the input points they covered within distance 8R.
  – Obs: Some optimal cluster must cover two stored centers, so OPT ≥ (shortest pairwise distance)/2.
  – Assuming that stored centers are always separated by 4R, we can raise R to Rnew = (4R)/2 = 2R.
[Figure: k = 2 example; with three stored centers, some optimal cluster contains two of them, so OPT ≥ 2R = Rnew]
Doubling Algorithm: merging step
- Oops, we have > k stored centers!
  – Restore the separation invariant by letting each center greedily subsume others within 4Rnew.
  – Every input point belonging to a subsumed center is within 4Rnew + 8R = 8Rnew of a kept center ⇒ still covered.
[Figure: k = 2 example; a kept center's ball of radius 8Rnew absorbs the points of the centers it subsumes]
Doubling Algorithm: conclusion
- Proceed through the rest of the stream, repeating the raising and merging steps as needed.
- When the end of input is reached, return clusters of radius 8R at the stored centers: an 8-approximation. (A sketch of the whole algorithm follows.)
[Figure: final clusters of radius 8Rnew for k = 2]
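A minimal sketch of the Doubling Algorithm as described above (illustrative code, not the paper's pseudocode; `dist`, the bootstrap on k + 1 distinct points, and the choice of initial R are assumptions made here for self-containment).

    class DoublingKCenter:
        """Streaming k-center via the Doubling Algorithm (sketch)."""

        def __init__(self, k, first_points, dist):
            # Bootstrap on k+1 distinct points: two of them must share an optimal
            # cluster, so half their smallest pairwise distance lower-bounds OPT.
            # Starting R at a quarter of that distance also makes the initial
            # centers at least 4R apart, as the separation invariant requires.
            self.k, self.dist = k, dist
            self.centers = list(first_points)[:k + 1]
            d_min = min(dist(a, b)
                        for i, a in enumerate(self.centers)
                        for b in self.centers[i + 1:])
            self.R = d_min / 4
            self._reduce()

        def process(self, p):
            # Invariant: every point seen so far is within 8R of a stored center.
            if all(self.dist(p, c) > 8 * self.R for c in self.centers):
                self.centers.append(p)      # p is uncovered: store it as a center
                self._reduce()

        def _reduce(self):
            while len(self.centers) > self.k:
                # Raising R: some optimal cluster covers two stored centers,
                # which are at least 4R apart, so OPT >= 2R.
                self.R *= 2
                # Merging: greedily keep centers that are > 4R_new apart; a
                # subsumed center's points lie within 4R_new + 8R_old = 8R_new
                # of the center that subsumed it, so coverage is preserved.
                kept = []
                for c in self.centers:
                    if all(self.dist(c, kc) > 4 * self.R for kc in kept):
                        kept.append(c)
                self.centers = kept

        def current_solution(self):
            # Clusters of radius 8R at the stored centers: an 8-approximation.
            return self.centers, 8 * self.R

Usage: construct with the first k + 1 points of the stream, call process on each remaining point, and query current_solution at any time.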
k-center clustering with outliers
- Application: Noisy data
- Clustering can miss up to z input points
[Figure: example with k = 3, z = 2; two far-away points are marked as outliers]
k-center clustering “with anonymity”
- Application: Publish per-cluster statistics without revealing too much about any single input point
- Each cluster gets ≥ b points
  – Each point can “belong” to only one cluster, even if it is within the radii of several
[Figure: example with k = 3, b = 3; every cluster must have ≥ 3 points]
Currently known results
(Entries list the approximation factor in cluster radius, the approximation factor in number of outliers where relevant, and the memory used by streaming algorithms. Algorithms marked * are our contributions. With enumerable centers, all 4s become 3.)
- Basic k-center
  – Offline: Greedy (Hochbaum-Shmoys '85): 2; Farthest-point (Gonzalez '85): 2
  – Streaming: Doubling (Charikar et al. STOC '97): 8, memory k; Parallelized scaling*: 2+ε, memory ε^-1 ln(ε^-1) k
- Outliers
  – Offline: Greedy (Charikar et al. SODA '01): radius 4, outliers 1
  – Streaming: Sampling (Charikar et al. STOC '03): radius 4, outliers 1+ε, memory ε^-2 k(n/z); Scaling w/ support points*: radius 4+ε, outliers 1, memory ε^-1 kz
- Anonymity
  – Offline: Flow (Aggarwal et al. PODS '06): 2
  – Streaming: Add scaling pass (not in this talk)*: 6+ε, memory ε^-1 ln(ε^-1) k + k^2
“Scaling” Algorithm
- Doubling Algorithm generalized to an arbitrary scaling factor α:
  – Rnew = αR when we rule out an optimal solution of radius < αR
  – Centers separated by 2αR
  – Every point within ηR = [2α^2/(α − 1)]R of a stored center
- Merging: 2αRnew + ηR = 2α^2[1 + 1/(α − 1)]R = ηRnew
- α = 2 minimizes η, giving η = 8 (the Doubling Algorithm); the derivation below spells out the merging identity.
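A short worked check of the merging identity and of why α = 2 is optimal (just the algebra behind the bullets above, with Rnew = αR and η = 2α²/(α − 1)):

    \[
    2\alpha R_{\mathrm{new}} + \eta R
      = 2\alpha^{2}R + \frac{2\alpha^{2}}{\alpha-1}R
      = \frac{2\alpha^{2}}{\alpha-1}\,\alpha R
      = \eta R_{\mathrm{new}},
    \qquad
    \frac{d\eta}{d\alpha} = \frac{2\alpha(\alpha-2)}{(\alpha-1)^{2}} = 0
      \iff \alpha = 2,\quad \eta(2) = 8.
    \]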
Parallelized Scaling Algorithm
- m instances with interleaved sequences of R values (generated as in the snippet below).
- Example: k = 2, m = 3, α = 3, η = 2(3^2)/(3 − 1) = 9. The three instances use the R sequences 1, 3, 9, 27, 81; 1.44, 4.33, 12.98, 38.94; and 2.08, 6.24, 18.72, 56.16, so consecutive values across instances differ by a factor of 3^(1/3).
(Similar construction found independently by Guha, 2007)
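A small illustration of how the interleaved R sequences can be generated (illustrative only; instance i steps through the values α^(i/m) · α^j):

    # Interleaved lower-bound sequences for m parallel Scaling instances.
    # Across all instances, consecutive guesses differ only by alpha**(1/m).
    def r_sequences(alpha, m, steps):
        return [[alpha ** (i / m) * alpha ** j for j in range(steps)]
                for i in range(m)]

    for seq in r_sequences(alpha=3, m=3, steps=5):
        print([round(r, 2) for r in seq])
    # [1.0, 3.0, 9.0, 27.0, 81.0]
    # [1.44, 4.33, 12.98, 38.94, 116.82]
    # [2.08, 6.24, 18.72, 56.16, 168.49]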
Benefit of parallelization
- Take the best solution produced by any instance.
- The instance whose R sequence contains a suitable value R1 will give a 2[1 + 1/(α − 1)]·α^(1/m)-approximation.
- Good approximation: make α large and m even larger!
- Obtain a (2 + ε)-approximation with α = O(ε^-1) and m = O(ε^-1 ln(ε^-1)). (A numeric check follows.)
[Figure: OPT marked on the interleaved R sequences; for α = 3, m = 3, η = 9 this gives a 4.33-approximation]
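A quick numeric check of the parallelized approximation factor (illustrative):

    # Approximation factor of the parallelized Scaling Algorithm:
    # 2 * (1 + 1/(alpha - 1)) * alpha**(1/m)
    def factor(alpha, m):
        return 2 * (1 + 1 / (alpha - 1)) * alpha ** (1 / m)

    print(round(factor(3, 3), 2))      # 4.33, the example above
    print(round(factor(100, 700), 2))  # 2.03, approaching 2 + eps for large alpha and m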
Streaming k-center with outliers
- Commonalities with the Scaling Algorithm:
  – Keep s ≤ k stored centers and a lower bound R on OPT
  – Solution contains clusters of radius ηR at stored centers; drop input points arriving in those clusters
  – Set Rnew = αR when the solution becomes infeasible
- Problem: OPT is no longer guaranteed to cover all our stored centers ⇒ having > k well-separated stored centers no longer means we can raise R.
[Figure: k = 2, z = 4 example; OPT does not cover one stored center, so we cannot conclude OPT ≥ 2Rnew]
Coping with outliers
- Solution: Don't store a center until it has z + 1 “support points” within βR (for some parameter β). OPT can't mark all the support points as outliers, so it must cover at least one of them.
- Keep each input point as a “free point” until it can be stored as a center or dropped. (A sketch of this rule follows.)
[Figure: k = 3, z = 4 examples; a group of ≤ z nearby free points could all be outliers, so none is stored yet; once z + 1 points lie within βR, one becomes a stored center and s grows from 1 to 2]
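A minimal sketch of the point-handling rule just described (illustrative; `dist` is assumed, and the decision of when to raise R is handled separately, as on the later slides):

    def process_point(p, centers, free_points, R, z, beta, eta, dist):
        """One streaming step of the outlier algorithm's point-handling rule.

        centers is a list of (center, support_points) pairs; free_points is a
        list of points that are neither covered nor supporting a center."""
        # 1. Drop p if it falls inside an existing cluster of radius eta*R.
        if any(dist(p, c) <= eta * R for c, _ in centers):
            return
        # 2. Otherwise p becomes a free point ...
        free_points.append(p)
        # 3. ... and if some free point q now has z + 1 free points within
        #    beta*R of it, promote q to a stored center: OPT cannot declare
        #    all z + 1 of its support points to be outliers.
        for q in list(free_points):
            support = [x for x in free_points if dist(x, q) <= beta * R]
            if len(support) >= z + 1:
                centers.append((q, support))
                # Free points covered by the new cluster are no longer free.
                free_points[:] = [x for x in free_points
                                  if dist(x, q) > eta * R]
                break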
Separation of support, free points
- Assuming ηR ≥ βR + 2αR, support points of different centers will be separated by 2αR, so an optimal cluster of radius < αR can satisfy at most one stored center.
- ⇒ OPT must set aside s clusters to satisfy the stored centers. (If s > k, raise R.)
  – These clusters can't additionally cover free points.
[Figure: k = 3, z = 4, s = 2 example; support points of different stored centers are ≥ 2αR apart, so a cluster of radius < αR cannot reach both, and a separate optimal cluster is needed per stored center]
Covering free points
- Can OPT cover the free points with the other k – s clusters of radius < αR and z outliers?
- Try the 4-approximation offline algorithm on the free points with guess αR.
  – Success: we have a feasible solution at ηR, assuming ηR ≥ 4αR.
  – Failure: OPT can't do it ⇒ raise R.
[Figure: k = 3, z = 4, s = 2 example; the offline algorithm either covers the free points with clusters of radius 4αR or fails, showing no solution of radius < αR exists]
Controlling memory usage
- Must limit the number of free points to obtain an O(kz) memory bound.
- Assuming βR ≥ 2αR, any optimal cluster of radius < αR that covers more than z free points will have one of its points stored as a center.
[Figure: z = 4 example; a cluster of diameter < 2αR ≤ βR containing > z free points triggers the center-storing rule]
Controlling memory usage (2)
- An optimal solution of radius < αR must cover the free points with k – s clusters and z outliers.
- After center-storing, each of those clusters covers ≤ z free points.
- ⇒ If there are > (k – s)z + z free points, such a solution is impossible, so we raise R; this also maintains the memory bound (see the check below).
  – Even though we may still have a feasible solution at the current radius, we must raise R to keep within O(kz) memory.
[Figure: k = 3, z = 4, s = 2 example; each remaining optimal cluster covers ≤ 4 free points]
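The conditions under which the streaming outlier algorithm raises R, collected in one place (an illustrative summary, not the paper's exact pseudocode; `offline_cover_succeeds` stands for the result of running the offline 4-approximation on the free points with guess αR):

    def must_raise_R(s, k, num_free, z, offline_cover_succeeds):
        # Raise R (to alpha * R) if anything rules out an optimal solution of
        # radius < alpha*R, or if memory would otherwise grow too large:
        if s > k:                           # more stored centers than clusters available
            return True
        if num_free > (k - s) * z + z:      # too many free points to cover or discard
            return True                     #   (this also enforces the O(kz) memory bound)
        if not offline_cover_succeeds:      # offline 4-approx can't cover the free points
            return True
        return False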
Merging step
- Must restore the separation of 2αR between support points of different centers after raising R.
- Greedy merging pass as in the Scaling Algorithm:
  – Each center c subsumes any centers with support points closer than 2αRnew to a support point of c.
- Assume βR + 2α(αR) + βR + ηR ≤ η(αR).
[Figure: a kept center c subsumes a nearby center; the subsumed center's points stay within ηRnew of c]
Streaming k-center w/ outliers: result
- Constraints: ηR ≥ βR + 2αR;  ηR ≥ 4αR;  βR ≥ 2αR;  βR + 2α(αR) + βR + ηR ≤ η(αR)
- Satisfied by α = 4, β = 8, η = 16 (checked below)
- m-instance parallelization: a 4^(1 + 1/m)-approximation
- ⇒ A (4 + ε)-approximation in radius, no more than z outliers, with O(ε^-1 kz) memory
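A quick check that α = 4, β = 8, η = 16 satisfy all four constraints (illustrative):

    alpha, beta, eta = 4, 8, 16
    assert eta >= beta + 2 * alpha                               # support-point separation
    assert eta >= 4 * alpha                                      # room for the offline 4-approximation
    assert beta >= 2 * alpha                                     # dense clusters trigger center-storing
    assert beta + 2 * alpha * alpha + beta + eta <= eta * alpha  # merging stays within eta * R_new
    print("alpha = 4, beta = 8, eta = 16 satisfy all constraints")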
Future work
- Outliers (clustering can miss up to z points):
  – Reduce memory usage to O(ε^-1 (k + z))
    - We think we have a (14 + ε)-approximation (not fully verified)
    - Would match the Ω(k + z) lower bound for deterministic algorithms
  – Memory requirement when neither z nor n/z is small?
- Anonymity (each cluster gets ≥ b points):
  – Reduce the approximation factor from 6 + ε to 2 + ε?
- Streaming algorithm for outliers + anonymity
  – Offline 4-approximation is known
- Multi-pass algorithms