Streaming algorithms for k-center clustering with outliers and with anonymity

  1. Streaming algorithms for k-center clustering with outliers and with anonymity
     Richard Matthew McCutchen and Samir Khuller, University of Maryland
     {rmccutch,samir}@cs.umd.edu

  2. k-center clustering problem
     ● Input: n points in an arbitrary metric space.
     ● Goal: Partition them into k clusters and assign each cluster a center point so as to minimize the maximum distance from an input point to its cluster center.
     [Figure: an example clustering with k = 3]

     [Slides 3 and 4 repeat the definition, illustrating clusterings with k = 3 and k = 2.]
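The objective just defined is easy to evaluate for a given solution: the cost is the maximum, over input points, of the distance to the nearest chosen center. A minimal sketch; the helper names and the Euclidean example data are illustrative assumptions, not from the slides:

```python
import math

def kcenter_cost(points, centers, dist):
    """Maximum over input points of the distance to the nearest center."""
    return max(min(dist(p, c) for c in centers) for p in points)

def euclid(p, q):
    return math.dist(p, q)

points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
# With k = 2, one center per natural group keeps the radius small.
print(kcenter_cost(points, [(0, 0), (10, 10)], euclid))  # → 1.0
```

The k-center problem asks for the centers minimizing this quantity; the slides that follow give a classic 2-approximation.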

  5. Greedy 2-approximation (Hochbaum and Shmoys, 1985)
     ● Greedily make clusters of radius 2R centered at uncovered points.
     ● Take the smallest R for which ≤ k clusters suffice.
     [Figure: optimal clusters of radius OPT, k = 3]

     [Slides 6 and 7 animate the first greedy clusters of radius 2R.]

  8. Greedy 2-approximation (continued)
     ● If greedy produces k clusters but points are still left uncovered, then R was too small: start over with a bigger guess.
     [Figure: three clusters of radius 2R, with uncovered points remaining]

  9. Greedy 2-approximation (continued)
     ● When R ≥ OPT, a cluster of radius 2R centered at any point covers that point's entire optimal cluster, by the triangle inequality.
     [Slides 9–11 animate this: each greedy cluster of radius 2R covers an entire optimal cluster.]

  12. Greedy 2-approximation (continued)
     ● We are guaranteed to succeed with ≤ k clusters whenever R ≥ OPT, no matter how badly we choose centers; hence the R we take satisfies R ≤ OPT, and the returned clusters have radius 2R ≤ 2·OPT.
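The greedy scheme of slides 5–12 can be sketched as follows. This is a hedged reconstruction with hypothetical function names: it assumes centers are chosen among the input points (as the greedy rule itself does), in which case the optimal radius is one of the pairwise distances (or 0), so it suffices to try those values as guesses for R:

```python
import itertools
import math

def greedy_cover(points, R, dist):
    """Greedily cover points with balls of radius 2R centered at
    still-uncovered input points; return the chosen centers."""
    centers = []
    uncovered = list(points)
    while uncovered:
        c = uncovered[0]                      # any uncovered point will do
        centers.append(c)
        uncovered = [p for p in uncovered if dist(p, c) > 2 * R]
    return centers

def hochbaum_shmoys(points, k, dist):
    """Return (R, centers) for the smallest candidate R whose greedy
    2R-cover uses at most k clusters."""
    candidates = sorted({0.0} | {dist(p, q)
                                 for p, q in itertools.combinations(points, 2)})
    for R in candidates:
        centers = greedy_cover(points, R, dist)
        if len(centers) <= k:
            return R, centers

points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
R, centers = hochbaum_shmoys(points, 2, math.dist)
print(R, len(centers))  # → 1.0 2
```

Per the guarantee on slide 12, the smallest successful candidate satisfies R ≤ OPT, so the returned clusters of radius 2R are a 2-approximation.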

  13. Streaming model
     ● Data set too large to fit in memory.
     ● Receive points one at a time (can't start over!).
     ● Maintain small state, including a solution for the input so far.
     ● Return the solution when the end of input is reached.
     [Figure: a large data set streams point by point into a small state holding the solution so far]

  14. Doubling Algorithm (Charikar et al., STOC 1997)
     ● State:
       – A lower bound R on the optimal radius.
       – ≤ k “stored centers” such that every input point read so far is within 8R of a stored center.
       ⇒ The stored centers give an 8-approximation at any time.
     ● Insertion rule: if an arriving point is within 8R of a stored center, drop it; otherwise store it.
     [Figure: k = 2, two stored centers with balls of radius 8R]

     [Slides 15–20 animate the insertion rule: arriving points within 8R of a stored center are dropped; the others become new stored centers.]

  21. Doubling Algorithm: raising R
     ● Oops, we have > k stored centers!
       – We must drop some centers, yet still account for the input points they covered within 8R.
       – Observation: some optimal cluster must contain two stored centers, so OPT ≥ (shortest pairwise distance between stored centers) / 2.
       – Since stored centers are always separated by at least 4R, we can raise R to R_new = 4R / 2 = 2R.
     [Figure: k = 2, three stored centers with 8R balls; the closest pair of centers is at distance ≥ 2·R_new]

  22. Doubling Algorithm: merging step
     ● Restore the separation invariant by letting each center greedily subsume the other centers within 4·R_new of it.
     ● Every input point belonging to a subsumed center is within 4·R_new + 8R = 8·R_new of a kept center, so it stays covered.
     [Figure: k = 2, three centers with balls of radius 8R before the merge]

     [Slides 23–25 animate the merge: each kept center subsumes centers within 4·R_new, and the covering radius becomes 8·R_new.]

  26. Doubling Algorithm: conclusion
     ● Proceed through the stream, raising R and merging whenever more than k centers accumulate.
     ● When the end of input is reached, return clusters of radius 8R at the stored centers: an 8-approximation.
     [Figure: k = 2, two stored centers with balls of radius 8·R_new]
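Slides 14–26 can be combined into one streaming sketch. This is a hedged reconstruction, not the authors' code: the function names are hypothetical, and the initialization of R from the first k+1 points (a quarter of their closest-pair distance, so that R remains a lower bound on OPT through the first raising step) is an assumption the slides do not spell out:

```python
import itertools
import math

def doubling_stream(stream, k, dist):
    """One pass over the stream; keeps <= k stored centers such that every
    point seen so far lies within 8R of some center."""
    stream = iter(stream)
    first = list(itertools.islice(stream, k + 1))
    if len(first) <= k:
        return 0.0, first                    # <= k points: trivially exact
    # Some optimal cluster contains two of these k+1 points, so
    # OPT >= (closest-pair distance)/2; start R safely below that.
    R = min(dist(p, q) for p, q in itertools.combinations(first, 2)) / 4
    centers = first

    def shrink():
        nonlocal R, centers
        while len(centers) > k:
            R *= 2                           # raising step: R_new = 2R <= OPT
            kept = []
            for c in centers:                # merging step: greedy subsume
                if all(dist(c, c2) > 4 * R for c2 in kept):
                    kept.append(c)
            centers = kept

    shrink()
    for p in stream:
        if all(dist(p, c) > 8 * R for c in centers):
            centers.append(p)                # insertion rule: store it
            shrink()
    return R, centers

pts = [(0, 0), (1, 0), (10, 10), (11, 10), (0, 2)]
R, centers = doubling_stream(pts, 2, math.dist)
print(len(centers))
```

By the invariants on slides 14, 21, and 22, the clusters of radius 8R around the returned centers cover every point of the stream, and R ≤ OPT, giving the 8-approximation.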

  27. k-center clustering with outliers
     ● Application: noisy data.
     ● The clustering is allowed to miss up to z input points.
     [Figure: k = 3, z = 2]

     [Slide 28 repeats the figure with the two missed points labeled as outliers.]
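The outlier variant only relaxes the feasibility condition: a set of k centers with radius r is acceptable if it covers all but at most z points. A small hypothetical check (function name and data are illustrative):

```python
import math

def covers_with_outliers(points, centers, r, z, dist=math.dist):
    """True if balls of radius r around the centers miss at most z points."""
    missed = sum(1 for p in points
                 if all(dist(p, c) > r for c in centers))
    return missed <= z

points = [(0, 0), (1, 0), (0, 1), (50, 50), (10, 10), (11, 10)]
# (50, 50) is noise: with z = 1 it may be left uncovered.
print(covers_with_outliers(points, [(0, 0), (10, 10)], 1.5, 1))  # → True
print(covers_with_outliers(points, [(0, 0), (10, 10)], 1.5, 0))  # → False
```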

  29. k-center clustering “with anonymity”
     ● Application: publish per-cluster statistics without revealing too much about any single input point.
     ● Each cluster must receive ≥ b points.
     [Figure: k = 3, b = 3]

  30.–34. k-center clustering “with anonymity” (continued)
     ● [Slides 30–32 animate the requirement: each cluster must have ≥ b = 3 points.]
     ● [Slide 33: “If this point were not in the input...” shows how the clustering changes when a single point is removed.]
     ● Each point can “belong” to only one cluster, even if it is within the radii of several.
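The anonymity constraint is likewise a condition on the assignment rather than the geometry: every point is assigned to exactly one cluster (even if it lies within several cluster radii, as slide 34 emphasizes), and every cluster must receive at least b points. A hypothetical sketch of that check:

```python
from collections import Counter

def is_b_anonymous(assignment, k, b):
    """assignment maps each point id to the single cluster (in range(k))
    it belongs to; every cluster must receive at least b points."""
    sizes = Counter(assignment.values())
    return all(sizes.get(c, 0) >= b for c in range(k))

# 6 points, k = 2 clusters, b = 3: a 3/3 split is fine, a 4/2 split is not.
print(is_b_anonymous({0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}, 2, 3))  # → True
print(is_b_anonymous({0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1}, 2, 3))  # → False
```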
