

  1. MapReduce and Streaming Algorithms for Center-Based Clustering in Doubling Spaces. Geppino Pucci, DEI, University of Padova, Italy. Based on joint works with: M. Ceccarello, A. Mazzetto, and A. Pietracaprina.

  2. Center-based clustering
     Center-based clustering in general metric spaces: given a pointset S in a metric space with distance d(·,·), determine a set C⋆ ⊆ S of k centers minimizing:
     ◮ max_{p ∈ S} d(p, C⋆)        (k-center)
     ◮ ∑_{p ∈ S} d(p, C⋆)          (k-median)
     ◮ ∑_{p ∈ S} d(p, C⋆)²         (k-means)
     Remark: on general metric spaces it makes sense to require that C⋆ ⊆ S. This assumption is often relaxed in Euclidean spaces (continuous vs. discrete version).
     Variant, center-based clustering with z outliers: disregard the z largest distances when computing the objective function.
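For concreteness, a minimal Python sketch of the three objectives (our own illustration, not from the slides; the function names and the generic dist argument are assumptions):

    def kcenter_cost(points, centers, dist, z=0):
        # Max distance from any point to its closest center,
        # disregarding the z largest distances (z = 0: no outliers).
        dists = sorted(min(dist(p, c) for c in centers) for p in points)
        return dists[len(dists) - 1 - z]

    def kmedian_cost(points, centers, dist):
        return sum(min(dist(p, c) for c in centers) for p in points)

    def kmeans_cost(points, centers, dist):
        return sum(min(dist(p, c) for c in centers) ** 2 for p in points)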

  3. Example: pointset instance.

  4. Example: solution to 4-center. Optimal radius r⋆(k) = max distance of x ∈ S from C⋆.

  5. Example: solution to 4-center with 2 outliers. Optimal radius r⋆(k, z) = max distance of a non-outlier x ∈ S from C⋆.

  6. Center-based clustering for big data
     1. Deal with very large pointsets:
        ◮ MapReduce distributed setting
        ◮ Streaming setting
     2. Aim: try to match the best sequential approximation ratios with limited local/working space
     3. Very simple algorithms with good practical performance
     4. Concentrate on k-center, with and without outliers [CeccarelloPietracaprinaP, VLDB 2019]
     5. End of the talk: sketch of very recent results for k-median and k-means [MazzettoPietracaprinaP, arXiv 2019]

  7. Outline
     ◮ Background
       ◮ MapReduce and Streaming models
       ◮ Previous work
       ◮ Doubling dimension
     ◮ k-center (with and without outliers):
       ◮ Summary of results
       ◮ Coreset selection: main idea
       ◮ MapReduce algorithms
       ◮ Porting to the Streaming setting
       ◮ Experiments
     ◮ Sketch of new results for k-median and k-means

  8. Background: MapReduce and Streaming models
     MapReduce
     ◮ Targets distributed cluster-based architectures
     ◮ Computation: a sequence of rounds where data (key-value pairs) are mapped by key into subsets and processed in parallel by reducers equipped with small local memory
     ◮ Goals: few rounds, (substantially) sublinear local memory, linear aggregate memory
     Streaming
     ◮ Data provided as a continuous stream and processed using small working memory
     ◮ Multiple passes over the data may be allowed
     ◮ Goals: 1 (or few) pass(es), (substantially) sublinear working memory
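As a toy illustration of the MapReduce model (our own sketch, not part of the talk): one round maps every input item to key-value pairs, groups them by key, and reduces each group independently. The map_fn and reduce_fn arguments are hypothetical user-supplied functions.

    from collections import defaultdict

    def mapreduce_round(data, map_fn, reduce_fn):
        # Map phase: every input item emits (key, value) pairs.
        groups = defaultdict(list)
        for item in data:
            for key, value in map_fn(item):
                groups[key].append(value)
        # Reduce phase: one reducer per key; in a real system each reducer
        # must process its group within small local memory.
        output = []
        for key, values in groups.items():
            output.extend(reduce_fn(key, values))
        return output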

  9. Background: Previous work
     ◮ Sequential algorithms for general metric spaces:
       ◮ k-center: 2-approximation in O(nk) time, and (2 − ǫ)-inapproximability [Gonzalez85]
       ◮ k-center with z outliers: 3-approximation in O(n² k log n) time [Charikar+01]
     ◮ MapReduce algorithms:

       Reference      | Rounds          | Approx. | Local Memory
       k-center problem
       [Ene+11]       | O(1/ǫ) (w.h.p.) | 10      | O(k² |S|^ǫ)
       [Malkomes+15]  | 2               | 4       | O((|S| k)^{1/2})
       k-center problem with z outliers
       [Malkomes+15]  | 2               | 13      | O((|S| (k + z))^{1/2})

  10. Background: Previous work (cont'd)
      ◮ Streaming algorithms:

        Reference       | Passes | Approx. | Working Memory
        k-center problem
        [McCutchen+08]  | 1      | 2 + ǫ   | O(k ǫ⁻¹ log ǫ⁻¹)
        k-center problem with z outliers
        [McCutchen+08]  | 1      | 4 + ǫ   | O(k z ǫ⁻¹)

  11. Background: doubling dimension
      Our algorithms are analyzed in terms of the doubling dimension D of the metric space:
      ∀ r: any ball of radius r is covered by ≤ 2^D balls of radius r/2.
      [Figure: a ball of radius r covered by balls of radius r/2]
      Spaces with low doubling dimension include:
      ◮ Euclidean spaces
      ◮ Shortest-path distances of mildly expanding topologies
      ◮ Low-dimensional pointsets of high-dimensional spaces
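For intuition: on the real line, any ball of radius r (an interval of length 2r) is covered by two balls of radius r/2, so D = 1; more generally, m-dimensional Euclidean space has doubling dimension Θ(m), which is the sense in which the analysis extends Euclidean bounds to general doubling metrics.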

  12. Summary of results
      Our algorithms:

      Model     | Rnd/Pass | Approx.             | Local/Working Memory
      k-center problem
      MapReduce | 2        | 2 + ǫ  (vs. 4)      | O(√(|S| k) · (4/ǫ)^D)
      k-center problem with z outliers
      MapReduce | 2        | 3 + ǫ  (vs. 13)     | O(√(|S| (k + z)) · (24/ǫ)^D)
      MapReduce | 2        | 3 + ǫ               | O(√(|S| (k + log |S|)) · (24/ǫ)^D + z)  (w.h.p.)
      Streaming | 1        | 3 + ǫ  (vs. 4 + ǫ)  | O((k + z) · (96/ǫ)^D + k z / ǫ)

      ◮ Substantial improvement in approximation quality at the expense of larger memory requirements (a constant factor for constant ǫ, D)
      ◮ The MR algorithms are oblivious to D
      ◮ The large constants are due to the analysis; experiments show the practicality of our approach

  13. Summary of results (cont'd)
      Main features
      ◮ (Composable) coreset approach: select a small T ⊆ S containing a good solution for S, and then run (an adaptation of) the best sequential approximation on T
      ◮ Flexibility: the coreset construction can be either distributed (MapReduce) or streamlined (Streaming)
      ◮ Adaptivity: memory/approximation tradeoffs are expressed in terms of the doubling dimension D of the pointset
      ◮ Quality: MR and Streaming algorithms using small memory and almost matching the best sequential approximations

  14. Coreset selection: main idea
      ◮ Let r⋆ = max distance of any (non-outlier) x ∈ S from its closest optimal center
      ◮ Select a coreset T ⊆ S ensuring that d(x, T) ≤ ǫ r⋆ for all x ∈ S − T, using a sequential h-center approximation for h suitably larger than k (a similar idea is used in [CeccarelloPietracaprinaPUpfal17] for diversity maximization → next talk)
      ◮ Observation: in general, T must contain outliers

  15. Example: pointset instance.

  16. Example: optimal solution for k = 4, z = 2.

  17. Example: 10-point coreset T (red points).

  18. MapReduce algorithms
      Basic primitive for coreset selection (based on [Gonzalez85]):

      Select(S′, h, ǫ):
        Input: subset S′ ⊆ S, parameters h, ǫ > 0
        Output: coreset T ⊆ S′ of size ≥ h
        T ← arbitrary point c_1 ∈ S′
        r(1) ← max distance of any x ∈ S′ from T
        for i = 2, 3, ... do
          find the farthest point c_i ∈ S′ from T, and add it to T
          r(i) ← max distance of any x ∈ S′ from T
          if (i ≥ h) AND (r(i) ≤ (ǫ/2) r(h)) then return T

      Lemma: let r⋆(h) be the optimal h-center radius for the entire set S, and let "last" be the index of the last iteration of Select. Then r(last) ≤ ǫ r⋆(h).
      Proof idea: by a simple adaptation of Gonzalez's proof, r(h) ≤ 2 r⋆(h); the stopping condition then gives r(last) ≤ (ǫ/2) r(h) ≤ ǫ r⋆(h).
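A direct Python transcription of the Select primitive (a sketch; dist is assumed to be the distance function of the metric space, and, as in Gonzalez's algorithm, the running time is O(|S′| · |T|)):

    def select(points, h, eps, dist):
        # Coreset selection via farthest-point traversal (Gonzalez),
        # stopped once the covering radius drops to (eps/2) * r(h).
        T = [points[0]]                            # arbitrary first point c_1
        d_to_T = [dist(p, T[0]) for p in points]   # distance of each point to T
        i, r_h = 1, None
        while True:
            r_i = max(d_to_T)                      # covering radius r(i)
            if i == h:
                r_h = r_i                          # remember r(h) for the stopping test
            if i >= h and r_i <= (eps / 2) * r_h:
                return T
            # Add the point farthest from T, then refresh the distances.
            far = max(range(len(points)), key=lambda j: d_to_T[j])
            T.append(points[far])
            d_to_T = [min(d_to_T[j], dist(points[j], points[far]))
                      for j in range(len(points))]
            i += 1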

  19. MapReduce algorithms: k-center
      [Algorithm: in round 1, partition S into ℓ subsets S_1, ..., S_ℓ and, in parallel, compute a coreset T_j = Select(S_j, k, ǫ) from each subset; in round 2, run Gonzalez's sequential algorithm on T = ∪_j T_j to obtain the k centers.]
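A sequential simulation of the two rounds, reusing the select sketch above (our own illustration of the scheme; in a real MapReduce deployment each select call runs on a separate reducer):

    import math

    def gonzalez(points, k, dist):
        # Gonzalez's farthest-first traversal: sequential 2-approximation for k-center.
        C = [points[0]]
        d = [dist(p, C[0]) for p in points]
        while len(C) < k:
            far = max(range(len(points)), key=lambda j: d[j])
            C.append(points[far])
            d = [min(d[j], dist(points[j], points[far])) for j in range(len(points))]
        return C

    def mr_kcenter(S, k, eps, dist):
        # Round 1 (simulated sequentially): partition S into l subsets
        # and extract a coreset from each one with Select.
        l = max(1, round(math.sqrt(len(S) / k)))
        subsets = [S[j::l] for j in range(l)]
        T = [t for Sj in subsets for t in select(Sj, k, eps, dist)]
        # Round 2: run Gonzalez's algorithm on the union of the coresets.
        return gonzalez(T, k, dist)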

  20. MapReduce algorithms: k-center (cont'd)
      Analysis
      ◮ Approximation quality: let C = {c_1, ..., c_k} be the returned centers. For any x ∈ S_j (arbitrary j), with t ∈ T_j the coreset point closest to x:
          d(x, C) ≤ d(x, t) + d(t, C) ≤ ǫ r⋆(k) + 2 r⋆(k) = (2 + ǫ) r⋆(k)
      ◮ Memory requirements: assume doubling dimension D
        ◮ set ℓ = √(|S| / k)
        ◮ Technical lemma: |T_j| ≤ k (4/ǫ)^D, for every 1 ≤ j ≤ ℓ
        ⇒ Local memory = O(√(|S| k) · (4/ǫ)^D)
      Remarks:
      ◮ For constant ǫ and D: a (2 + ǫ)-approximation with the same memory requirements as the 4-approximation in [Malkomes+15]
      ◮ Our algorithm is oblivious to D

  21. MapReduce algorithms: k-center with z outliers
      Similar approach to the case without outliers, but with some important differences:
      1. Each coreset T_j ⊆ S_j must contain ≥ k + z points (making room for outliers)
      2. Each t ∈ T_j gets a weight w(t) = number of points of S_j − T_j for which t is the proxy (i.e., the closest coreset point). Let T_j^w denote the set T_j with weights.
      3. On T^w = ∪_j T_j^w, a suitable weighted variant of the algorithm in [Charikar+01] (dubbed Charikar_w) is run, which:
         ◮ determines k suitable centers (the final solution) covering most points of T^w
         ◮ leaves uncovered points of T^w with aggregate weight z, which are the proxies of the outliers
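A sketch of the weighted-coreset construction for one subset, reusing the select sketch above (our own illustration; the weighted Charikar_w step that runs on the union T^w is omitted):

    def weighted_coreset(Sj, k, z, eps, dist):
        # Step 1: select at least k + z coreset points (room for the outliers).
        Tj = select(Sj, k + z, eps, dist)
        # Step 2: each point of Sj - Tj contributes 1 to the weight of its
        # proxy, i.e. its closest coreset point.
        w = [0] * len(Tj)
        for p in Sj:
            if p in Tj:
                continue
            proxy = min(range(len(Tj)), key=lambda idx: dist(p, Tj[idx]))
            w[proxy] += 1
        return list(zip(Tj, w))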

  22. MapReduce algorithms: k-center with z outliers (cont'd)

  23. MapReduce algorithms: k-center with z outliers (cont'd)
      Analysis
      ◮ Approximation quality: let C = {c_1, ..., c_k} be the returned centers. For any non-outlier x ∈ S_j (arbitrary j) with proxy t ∈ T_j^w:
          d(x, t) ≤ ǫ r⋆(k, z)  and  d(t, C) ≤ (3 + 5ǫ) r⋆(k, z)
        ⇒ a (3 + ǫ′)-approximation for every ǫ′ > 0
      ◮ Memory requirements: assume doubling dimension D
        ◮ set ℓ = √(|S| / (k + z))
        ◮ Technical lemma: |T_j| ≤ (k + z)(4/ǫ)^D, for every 1 ≤ j ≤ ℓ
        ⇒ Local memory = O(√(|S| (k + z)) · (4/ǫ)^D)
      Remarks:
      ◮ For constant ǫ and D: a (3 + ǫ)-approximation with the same memory requirements as the 13-approximation in [Malkomes+15]
      ◮ Our algorithm is oblivious to D

  24. MapReduce algorithms: k-center with z outliers (cont'd)
      Randomized variant
      ◮ Create S_1, S_2, ..., S_ℓ as a random partition (⇒ z′ = O(z/ℓ + log |S|) outliers per subset, w.h.p.)
      ◮ Execute the deterministic algorithm with z′ in lieu of z
      Analysis
      ◮ Approximation quality: (3 + ǫ′) (as before)
      ◮ Memory requirements: O(√(|S| (k + log |S|)) · (24/ǫ)^D + z)
      Remark:
      ◮ For constant ǫ and D: O(√(|S| (k + log |S|)) + z) local memory (a linear dependence on z is desirable)
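The random partition itself is elementary; a minimal sketch (our illustration, with a hypothetical seed parameter for reproducibility):

    import random

    def random_partition(S, l, seed=None):
        # Assign every point of S to one of l subsets uniformly at random;
        # w.h.p. each subset receives only O(z/l + log|S|) of the z outliers.
        rng = random.Random(seed)
        subsets = [[] for _ in range(l)]
        for p in S:
            subsets[rng.randrange(l)].append(p)
        return subsets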
