Distributed Partial Clustering. Sudipto Guha (UPenn), Qin Zhang (IUB), Yi Li (NTU). SPAA 2017 slide deck.

SLIDE 1

Distributed Partial Clustering

Sudipto Guha (UPenn), Qin Zhang (IUB), Yi Li (NTU)

SPAA 2017, July 25, 2017

SLIDE 2

Clustering

  • Metric space (X, d)
  • n input points A; want to find k centers
  • Objective function (k-median):
      min_{K⊆A: |K|=k} Σ_{p∈A} d(p, K),  where d(p, K) = min_{c∈K} d(p, c)
  • k-means: min_{K⊆A: |K|=k} Σ_{p∈A} d²(p, K)
  • k-center: min_{K⊆A: |K|=k} max_{p∈A} d(p, K)
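The three objectives can be sketched directly for a fixed center set K (a minimal Python sketch; `dist`, `kmedian_cost`, etc. are illustrative names, not from the paper):

```python
# Sketch: evaluating the three clustering objectives for a fixed
# center set K, where d(p, K) is the distance of p to its nearest center.

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def d_to_centers(p, K):
    return min(dist(p, c) for c in K)

def kmedian_cost(A, K):
    return sum(d_to_centers(p, K) for p in A)

def kmeans_cost(A, K):
    return sum(d_to_centers(p, K) ** 2 for p in A)

def kcenter_cost(A, K):
    return max(d_to_centers(p, K) for p in A)

A = [(0.0,), (1.0,), (2.0,), (10.0,)]
K = [(0.0,), (10.0,)]
print(kmedian_cost(A, K))  # 1 + 2 = 3.0
print(kcenter_cost(A, K))  # 2.0
```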

SLIDE 3

Clustering with outliers

  • Metric space (X, d)
  • n input points A; want to find k centers, t outliers
  • Objective function ((k, t)-median):
      min_{K,O⊆A: |K|=k, |O|≤t} Σ_{p∈A\O} d(p, K)
  • (k, t)-means: min_{K,O⊆A: |K|=k, |O|≤t} Σ_{p∈A\O} d²(p, K)
  • (k, t)-center: min_{K,O⊆A: |K|=k, |O|≤t} max_{p∈A\O} d(p, K)

Motivation: partial optimization gives much better results
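Note that for a fixed center set K, the best outlier set O is simply the t points farthest from K, so the objective is easy to evaluate once K is chosen (a hedged sketch; names are illustrative):

```python
# Sketch: for fixed centers K, the (k, t)-median cost is the sum of
# the n - t smallest point-to-center distances (the t largest are the
# excluded outliers).

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def kt_median_cost(A, K, t):
    d = sorted(min(dist(p, c) for c in K) for p in A)
    return sum(d[: len(A) - t])  # drop the t largest distances

A = [(0.0,), (1.0,), (2.0,), (100.0,)]   # 100.0 plays the role of noise
K = [(0.0,)]
print(kt_median_cost(A, K, 0))  # 103.0: the noise point dominates
print(kt_median_cost(A, K, 1))  # 3.0: excluding one outlier fixes it
```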

SLIDE 4

Distributed clustering

  • s sites, coordinator model
  • Site i gets Ai; the parties want to cluster A = A1 ∪ . . . ∪ As
  • Want to minimize comm. cost and #comm. rounds
  • For simplicity, assume each point takes Õ(1) bits

Motivation: data is inherently distributed / data is big and does not fit on one machine

[Figure: the coordinator model: sites S1, S2, S3, . . . , Ss holding A1, A2, A3, . . . , As, each connected to a coordinator C (initially ∅); the label "one round" marks one message exchange]

SLIDES 5–6

Clustering on uncertain data

Motivation: data is noisy; a subfield of databases

  • Each data item j is a distribution; call it a node.
    Let σ(j) denote a realization, and π(j) the center point to which j is attached.
  • Objective function (k-median):
      min_{K⊆A: |K|=k} Σ_{j∈A} E_σ[d(σ(j), π(j))]
  • k-means: replace d with d².
  • k-center has two versions, since E and max do not commute:
    – max_{j∈A} E_σ[d(σ(j), π(j))]
    – E_σ[max_{j∈A} d(σ(j), π(j))]

SLIDE 7

Clustering with outliers on uncertain data

  • Each data item j is a distribution; call it a node.
    Let σ(j) denote a realization, and π(j) the center point to which j is attached.
  • Objective function ((k, t)-median):
      min_{K,O⊆A: |K|=k, |O|≤t} Σ_{j∈A\O} E_σ[d(σ(j), π(j))]
  • (k, t)-means: replace d with d².
  • (k, t)-center has two versions, since E and max do not commute:
    – (k, t)-center-pp: max_{j∈A\O} E_σ[d(σ(j), π(j))]
    – (k, t)-center-global: E_σ[max_{j∈A\O} d(σ(j), π(j))]

SLIDES 8–9

Old and New Problems

Problems studied before

  • Clustering [??, XXXX]
  • Clustering with outliers [CKMN, 2001]
  • Clustering on uncertain data [CM, 2008]
  • Distributed clustering [??, XXXX]
  • Distributed clustering with outliers for k-center [MKCWM, 2015]; implicitly also in [GMMMO, 2003]

New problems (this paper)

  • Distributed clustering with outliers for k-median/means
  • Distributed clustering (with outliers) for uncertain data

SLIDES 10–15

Main results

Bicriteria: (α, β)-approx if the cost of SOL is at most αC while excluding βt points, where C is OPT for excluding t points

Main results (all in 2 rounds; under the same framework):

  • (O(1), 1)-approx with Õ(sk + t) comm. for (k, t)-median/center
  • ((1 + 1/ε), 1 + ε)-approx with Õ(sk + t) comm. for (k, t)-median and (k, t)-means, with quadratic local time
  • ((1 + 1/ε), 1 + ε)-approx with Õ(sk + t) comm. for uncertain (k, t)-median/means and (k, t)-center-pp
  • ((1 + 1/ε), 1 + ε)-approx with Õ(sk + tI + s log ∆) comm. for uncertain (k, t)-center-global, where I is the information needed to encode a distribution, and ∆ is the ratio of max-distance to min-distance

Also leads to subquadratic-time (O(1), O(1))-approx centralized algorithms (open for many years)

SLIDES 16–18

Previous results on distributed clustering with outliers

  • Õ(sk + st) bits in 2 rounds for k-center
    (Malkomes, Kusner, Chen, Weinberger, Moseley, 2015)
  • Õ(sk + st) bits in 1 round for k-median/means/center
    Can be derived from (Guha, Meyerson, Mishra, Motwani, O'Callaghan, 2003)
  • Interesting range of parameters: n ≫ t ≫ k, s.
    Consider a modest data set, say n = 10^8. Suppose that 0.1% is noise, so t = 0.001 × n = 10^5. Say s = 1000 and k = 100. Then sk = 10^5 and st = 10^8. Consequently sk + st = 10^8, while sk + t = 10^5.
  • Goal: reduce the st term to t, since the difference translates directly into your data/energy/time bill.
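The parameter example above can be checked with a few lines of arithmetic:

```python
# Sketch: the slide's parameter example, checked numerically.
n = 10**8
t = int(0.001 * n)   # 0.1% of the points are noise
s = 1000
k = 100
print(s * k)          # 10^5
print(s * t)          # 10^8: the st term dominates
print(s * k + t)      # 2 * 10^5: three orders of magnitude smaller
```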

SLIDES 19–21

More related work

(Centralized) clustering with outliers

  • 3-approx for (k, t)-center, (O(1), O(1))-approx for (k, t)-median
    (Charikar, Khuller, Mount, Narasimhan, 2001)
  • O(1)-approx for (k, t)-median by Ke Chen (2008)
  • (k, t)-median with different loss functions (Feldman, Schulman, 2012)

Uncertain data

  • Uncertain k-center/median/means (Cormode, McGregor, 2008)
  • Better results for k-center (Guha, Munagala, 2009)

Distributed clustering (coordinator model)

  • O(1)-approx with Õ(kd + sk) comm. for k-median/means in d-dim Euclidean space (Balcan, Ehrlich, Liang, 2013)
  • Better results for k-means by (Liang, Balcan, Kanchanapally, Woodruff, 2014) and (Cohen, Elder, Musco, Musco, Persu, 2015)

SLIDE 22

Distributed (k, t)-median and the Algorithm Framework

SLIDES 23–27

Two-level distributed clustering (GMMMO, 2003)

  • Consider k-median. Let A = A1 ∪ . . . ∪ As.
  • Site i computes Mi, the set of centers of a local k-median solution on Ai.
    The weight of each p ∈ Mi is the number of points assigned to p.
    Let M = M1 ∪ . . . ∪ Ms, and let L be the sum of the costs of the local solutions.
  • Coordinator computes a weighted clustering on M
    (any O(1)-approx works, but assume it is optimal for now)
  • It holds that Csol(M, k) ≤ O(1) · (L + Copt(A, k)) + L, and L = O(Copt(A, k)); we thus get an O(1)-approx
  • A similar result holds for (k, t)-median:
    – Each site computes a local (k, t)-median solution, then sends both the k centers (with their weights) and the t outliers to the coordinator
    – The coordinator performs a second-level clustering
  • Comm. cost: Õ(sk + st)
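The two-level scheme can be sketched as follows. This is only an illustration of the communication pattern: as a stand-in for a real O(1)-approx k-median solver it uses the farthest-point heuristic, and the coordinator step ignores weights, which a real weighted solver would use; all names are illustrative, not the paper's code.

```python
# Sketch of two-level distributed clustering: each site clusters
# locally and ships k weighted centers; the coordinator clusters the
# union M of all shipped centers.

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def farthest_point_centers(points, k):
    """Stand-in local solver (Gonzalez-style), NOT the paper's solver."""
    centers = [points[0]]
    while len(centers) < k and len(centers) < len(points):
        centers.append(max(points,
                           key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def local_summary(Ai, k):
    """Weighted centers (center -> #points assigned) for one site."""
    centers = farthest_point_centers(Ai, k)
    weights = {c: 0 for c in centers}
    for p in Ai:
        weights[min(centers, key=lambda c: dist(p, c))] += 1
    return weights

def two_level_kmedian(sites, k):
    M = {}                      # first level: Õ(sk) communication total
    for Ai in sites:
        for c, w in local_summary(Ai, k).items():
            M[c] = M.get(c, 0) + w
    # second level: clustering on M at the coordinator
    return farthest_point_centers(list(M), k)

sites = [[(0.0,), (0.5,), (10.0,)], [(10.5,), (0.2,), (9.8,)]]
print(two_level_kmedian(sites, 2))   # one center near 0, one near 10
```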
SLIDES 28–29

Local solutions

  • Let t*_i be the number of excluded points in Ai under OPT(A, k, t)
  • One can show that Σ_{i∈[s]} Copt(Ai, k, t*_i) ≤ O(1) · Copt(A, k, t)
  • Thus if we knew the t*_i, we could apply the result on the previous slide (but of course we do not know the t*_i)
  • We therefore want to minimize Σ_{i∈[s]} Csol(Ai, k, ti) s.t. Σ_{i∈[s]} ti = t.

SLIDES 30–34

Waterfilling

  • Minimize Σ_{i∈[s]} Csol(Ai, k, ti) s.t. Σ_{i∈[s]} ti = t.
  • Convexify the functions: let the curve fi(·) be the lower convex hull of the curve Csol(Ai, k, ·).
    (It takes some work to show that fi(·) is a good approximation of Csol(Ai, k, ·).)
  • Sort all "slopes" ℓ(i, q) = fi(q − 1) − fi(q) (i ∈ [s], q ∈ [t]) and choose the value η of rank t as the threshold.
    Note: for a fixed i, the ℓ(i, q) (q ∈ [t]) are non-increasing.
  • ti is the number of slopes ℓ(i, ·) at Site i that are at least η.

[Figure: for each site Si, the cost curve Csol(Ai, k, ·) (red) on the domain [0, t], its lower convex hull fi(·) (blue), and the resulting allocation t1, t2, t3]
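The waterfilling step can be sketched as follows. For brevity the sketch assumes each cost curve is already convex (the paper first replaces it by its lower convex hull fi); then the slopes are non-increasing per site and waterfilling amounts to taking the t largest slopes overall. Names are illustrative.

```python
# Sketch of waterfilling on convex per-site cost curves:
# slope l(i, q) = C_i(q-1) - C_i(q), threshold eta = slope of rank t,
# allocation t_i = #slopes at site i that are >= eta.
# (Ties at eta can make the total exceed t; the algorithm's bicriteria
# slack absorbs this.)

def waterfill(curves, t):
    """curves[i][q] = cost at site i when excluding q points, q = 0..t."""
    slopes = []
    for i, C in enumerate(curves):
        for q in range(1, t + 1):
            slopes.append((C[q - 1] - C[q], i))
    slopes.sort(reverse=True)
    eta = slopes[t - 1][0]                  # threshold: slope of rank t
    t_alloc = [sum(1 for sl, j in slopes if j == i and sl >= eta)
               for i in range(len(curves))]
    return eta, t_alloc

# Two sites, t = 3; site 0's outliers are much more valuable to remove.
curves = [[100, 50, 20, 5], [30, 25, 21, 18]]   # both convex
eta, t_alloc = waterfill(curves, 3)
print(eta, t_alloc)   # 15 [3, 0]
```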

SLIDES 35–36

Two-round algorithm

  • Each site i sends ℓ(i, q) to the coordinator for q = 1, 2, 3, . . . , t
  • Coordinator determines the threshold (the rank-t element) and sends it to the sites
  • Each site i determines ti and sends its local centers (with their weights) and the ti outliers. Recall that Σ_{i∈[s]} ti = t
  • Coordinator solves the (k, t)-median problem on the (weighted) centers and outliers
  • Comm. cost Õ(sk + st). No improvement, hmm?

SLIDES 37–38

Two-round algorithm (cont.)

  • Each site i sends ℓ(i, q) to the coordinator for q = 1, 2, 4, 8, . . . , t
    (we can do this because the ℓ(i, q) are non-increasing for a fixed i)
  • Coordinator determines the threshold (the rank-2t element)
  • Each site i determines ti and sends its local centers (with their weights) and the ti outliers. Can show that Σ_{i∈[s]} ti ≤ 3t
  • Coordinator solves the (k, t)-median problem on the (weighted) centers and outliers
  • Comm. cost Õ(sk + t).
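The saving comes from the geometric subsampling of the slope sequence. A sketch of the site-side step, under the assumption (mine, for brevity) that t is a power of two: since the slopes are non-increasing in q, each sent slope can stand in for the whole dyadic block of positions it ends, and the coordinator compensates for the approximation by using rank 2t instead of rank t.

```python
# Sketch: send only slopes at q = 1, 2, 4, ..., t (O(log t) values per
# site instead of t), each paired with the length of the dyadic block
# it represents.

def sampled_slopes(slopes):
    """Return (slope value, block length) at q = 1, 2, 4, ..., t."""
    t = len(slopes)
    out = []
    q, prev = 1, 0
    while q <= t:
        out.append((slopes[q - 1], q - prev))  # slope, positions covered
        prev = q
        q *= 2
    return out

# One site with t = 8 non-increasing slopes:
print(sampled_slopes([50, 40, 30, 22, 15, 11, 8, 6]))
# [(50, 1), (40, 1), (22, 2), (6, 4)] -- 4 values sent instead of 8
```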

SLIDE 39

Two-round algorithm, bicriteria

  • Each site i sends ℓ(i, q) to the coordinator for q = 1, 2, 4, 8, . . . , t
  • Coordinator determines the threshold (the rank-2t element)
  • Each site i determines ti and sends its local centers (with the number of associated points) and the ti outliers. Can show that Σ_{i∈[s]} ti ≤ 3t
  • Coordinator solves the (k, (1 + ε)t)-median problem on the (weighted) centers and ignored points
  • Comm. cost Õ(sk + t). Quadratic time locally

SLIDES 40–42

Subquadratic-time centralized algorithm

The reduction (for (k, t)-median/means):
  a (γ, O(1))-approx centralized algo with time Õ(n^{1+α} k²)
  ⇒ a (O(γ), 2)-approx centralized algo with time Õ(t² + n^{(2+2α)/(2+α)} k²)

  – Apply the distributed (k, t)-median/means algo after dividing the set of points arbitrarily into s pieces of size n/s.
  – The sequential simulation of the s sites takes time Õ(s · (n/s)^{1+α} · k²).
  – The coordinator requires time Õ((sk + t)²) = Õ(s²k²) + Õ(t²).
  – Finally, balance by setting n^{1+α} = s^{2+α}.

Apply the reduction O(1) times to further reduce the running time to Õ(n^{1.01} k²) (assuming t ≤ √n), at the cost of a larger (but still O(1)) approximation.
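The balancing step can be verified directly: with s chosen so that n^{1+α} = s^{2+α}, i.e. s = n^{(1+α)/(2+α)}, the two dominant terms match.

```latex
% simulation term:
s\,(n/s)^{1+\alpha} \;=\; n^{1+\alpha} s^{-\alpha}
  \;=\; n^{\,1+\alpha - \frac{\alpha(1+\alpha)}{2+\alpha}}
  \;=\; n^{\frac{2+2\alpha}{2+\alpha}},
\qquad
% coordinator term:
s^2 \;=\; n^{\frac{2(1+\alpha)}{2+\alpha}} \;=\; n^{\frac{2+2\alpha}{2+\alpha}}.
```

So both the site simulation and the s²k² coordinator term are Õ(n^{(2+2α)/(2+α)} k²), giving the stated bound.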

SLIDE 43

Distributed (k, t)-Center

SLIDE 44

Gonzalez's algorithm

Gonzalez's algorithm for k-center:

  • Let S = {z1, . . . , zn} be a data set
  • Choose z1 ∈ S arbitrarily as the first center. Let Zi = {z1, . . . , zi}
  • For i = 2 to n, set zi = arg max_{x∈S} d(x, Zi−1)

This yields an ordering z1, . . . , zn of S
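The farthest-point ordering above can be sketched in a few lines (names are illustrative; the prefix of length k of the ordering is the classic 2-approx for k-center):

```python
# Sketch of Gonzalez's ordering: repeatedly pick the point farthest
# from the prefix chosen so far, maintaining each point's distance to
# the prefix incrementally.

def gonzalez_order(S, dist):
    order = [S[0]]                            # z1: arbitrary first center
    rest = S[1:]
    d_to = {p: dist(p, order[0]) for p in rest}   # distance to the prefix
    while rest:
        z = max(rest, key=lambda p: d_to[p])  # farthest remaining point
        order.append(z)
        rest.remove(z)
        for p in rest:                        # the new center may be closer
            d_to[p] = min(d_to[p], dist(p, z))
    return order

d = lambda p, q: abs(p - q)
print(gonzalez_order([0, 1, 9, 10, 5], d))
```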

SLIDE 45

Two-round algorithm for distributed (k, t)-center

  • Site i runs Gonzalez's algorithm to obtain a re-ordering {a1, . . . , a_{ni}} of the points in Ai
  • Site i, for each 1 ≤ q ≤ t, computes ℓ(i, q) ← min_{j<k+q} d(aj, a_{k+q})
  • Sites and coordinator sort the {ℓ(i, q)}, and then follow the subsequent steps of the previous framework. In the second-level clustering we use an algo for k-center with exactly t outliers.
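The site-side quantity ℓ(i, q), the distance of the (k+q)-th point in the Gonzalez ordering to all earlier points, can be sketched as follows (self-contained; names are illustrative):

```python
# Sketch: compute l(i, q) = min_{j < k+q} d(a_j, a_{k+q}) for q = 1..t
# from the farthest-point ordering of a site's points. By construction
# these values are non-increasing in q.

def gonzalez_order(S, dist):
    order, rest = [S[0]], S[1:]
    while rest:
        z = max(rest, key=lambda p: min(dist(p, c) for c in order))
        order.append(z)
        rest.remove(z)
    return order

def center_slopes(A, k, t, dist):
    a = gonzalez_order(A, dist)          # a[0] is a_1, a[i] is a_{i+1}
    return [min(dist(a[j], a[k + q - 1]) for j in range(k + q - 1))
            for q in range(1, t + 1)]

print(center_slopes([0, 1, 9, 10, 5], 2, 2, lambda p, q: abs(p - q)))
# [5, 1]
```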

SLIDE 46

Uncertain Data

SLIDE 47

Uncertain data – (k, t)-median/means/center-pp

  • Reduce the clustering problems to the deterministic case: collapse each node/cloud j to its optimal center yj = arg min_y Eσ[d(σ(j), y)]
  • Fully connect the yj's using the metric distance d(yi, yj)
  • Attach a pendant vertex pj to yj, with edge cost Eσ[d(σ(j), yj)]
  • Apply the previous framework on the compressed graph G; dG(u, v) for u, v ∈ G is the shortest-path distance
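The collapsing step can be sketched for discrete distributions on the line (my simplification for illustration: the sketch scans only support points as candidate centers, and all names are illustrative):

```python
# Sketch of the compression step: each node j, given as a list of
# (point, prob) pairs, is collapsed to a center y_j minimizing the
# expected distance, and the pendant edge cost is E[d(sigma(j), y_j)].

def collapse(node):
    """node: list of (point, prob). Return (y_j, E[d(sigma(j), y_j)])."""
    best = min(node,
               key=lambda cand: sum(p * abs(x - cand[0]) for x, p in node))
    y = best[0]
    return y, sum(p * abs(x - y) for x, p in node)

nodes = [
    [(0.0, 0.5), (2.0, 0.5)],      # mass split evenly around 1.0
    [(10.0, 0.9), (0.0, 0.1)],     # mostly concentrated near 10
]
for y, c in map(collapse, nodes):
    print(y, c)
```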

SLIDE 48

Uncertain data – (k, t)-center-global

  • Use the idea from [Guha, Munagala, 2009]: reduce center to median.
  • Use a truncated distance function:
      Lτ(u, v) = max{d(u, v) − τ, 0},  ρτ(j, u) = Eσ[Lτ(σ(j), u)]
  • Perform a parametric search on τ, and then apply our previous framework
  • Find a τ s.t. Σ_i Csol(Ai, 2k, ti(τ), ρτ) ≈ τ, where ti(τ) is the number of local outliers at Site i (after applying the previous framework)
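The truncated distance is simple to state in code (discrete 1-D distributions; names are illustrative):

```python
# Sketch: L_tau clips off the first tau units of distance, and rho_tau
# takes its expectation over the node's distribution.

def L(tau, u, v):
    return max(abs(u - v) - tau, 0.0)

def rho(tau, node, u):
    """node: list of (point, prob); expected truncated distance to u."""
    return sum(p * L(tau, x, u) for x, p in node)

node = [(0.0, 0.5), (6.0, 0.5)]
print(rho(0.0, node, 0.0))   # 3.0: plain expected distance
print(rho(2.0, node, 0.0))   # 2.0: each distance reduced by up to 2
```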

SLIDE 49

Concluding remarks

Open problems

  • Lower bounds. Ω(sk + t) holds for (k, t)-median/means/center if the algo needs to output all the outliers. What if not?
  • Better approximation ratios?

Summary

  • For (k, t)-median/means: Õ(sk + t) communication, 2 rounds, O(1)-approx ((1 + ε)t outliers for k-means).
  • A subquadratic-time (O(1), O(1))-approx centralized algorithm for (k, t)-median/means
  • Can handle the uncertain-data cases with similar comm. and round costs.

SLIDE 50

Thank you! Questions?