SLIDE 1

Streaming algorithms for k-center clustering with outliers and with anonymity

Richard Matthew McCutchen and Samir Khuller
University of Maryland
{rmccutch,samir}@cs.umd.edu

SLIDES 2–3

k-center clustering problem

  • Input: n points in an arbitrary metric space.
  • Goal: Partition them into k clusters and assign each a center point, so as to minimize the maximum distance from an input point to its cluster center.

[Figure: example clustering with k = 3]


SLIDE 4

k-center clustering problem

  • Input: n points in an arbitrary metric space.
  • Goal: Partition them into k clusters and assign each a center point, so as to minimize the maximum distance from an input point to its cluster center.

[Figure: example clustering with k = 2]

SLIDES 5–7

Greedy 2-approximation

  • Greedily make clusters of radius 2R centered at uncovered points
  • Take smallest R for which ≤ k clusters suffice

[Figure: k = 3, guess R vs. OPT; clusters of radius 2R are grown one at a time around uncovered points. (Hochbaum and Shmoys, 1985)]


SLIDE 8

Greedy 2-approximation

  • Greedily make clusters of radius 2R centered at uncovered points
  • Take smallest R for which ≤ k clusters suffice

[Figure: k = 3; k clusters of radius 2R are built but points are left uncovered ⇒ R is too small, so start over with a bigger guess. (Hochbaum and Shmoys, 1985)]

SLIDES 9–11

Greedy 2-approximation

  • Greedily make clusters of radius 2R centered at uncovered points
  • Take smallest R for which ≤ k clusters suffice

[Figure: k = 3; when R ≥ OPT, each radius-2R cluster covers the entire optimal cluster of its center point. (Hochbaum and Shmoys, 1985)]


SLIDE 12

Greedy 2-approximation

  • Greedily make clusters of radius 2R centered at uncovered points
  • Take smallest R for which ≤ k clusters suffice

We are guaranteed to succeed with ≤ k clusters whenever R ≥ OPT, no matter how badly we choose centers; hence the smallest successful guess satisfies R ≤ OPT, and the returned radius 2R is at most twice optimal.

[Figure: k = 3; each radius-2R cluster covers the entire optimal cluster around its center. (Hochbaum and Shmoys, 1985)]
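The procedure is easy to state in code. Below is a minimal offline sketch (ours, not from the deck); `dist` can be any metric function, and we use the fact that trying each pairwise distance as the guess R suffices to find the smallest feasible one.

```python
def greedy_cover(points, k, R, dist):
    """Greedily cover points with clusters of radius 2R centered at
    uncovered points; return the centers, or None if > k are needed."""
    centers, uncovered = [], list(points)
    while uncovered:
        if len(centers) == k:
            return None            # k clusters did not suffice: R too small
        c = uncovered[0]           # any uncovered point becomes the next center
        centers.append(c)
        uncovered = [p for p in uncovered if dist(p, c) > 2 * R]
    return centers

def k_center_2approx(points, k, dist):
    """Take the smallest R for which <= k clusters suffice. Any guess
    R >= OPT succeeds, so the R found is <= OPT and radius 2R <= 2*OPT."""
    for R in sorted({dist(p, q) for p in points for q in points}):
        centers = greedy_cover(points, k, R, dist)
        if centers is not None:
            return R, centers
```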

SLIDE 13

Streaming model

  • Data set too large to fit in memory
  • Receive points one at a time (can't start over!)
  • Maintain small state, incl. solution for input so far
  • Return solution when end of input is reached

[Figure: points flow from a large data set into a small state that holds the solution for the input so far]

SLIDES 14–20

Doubling Algorithm

  • State:
    – Lower bound R on optimal radius
    – ≤ k "stored centers" such that every input point read so far is within 8R of a stored center
    ⇒ Stored centers give an 8-approximation at any time
  • If an input point is within 8R of a stored center, then drop it; otherwise store it.

[Figure: k = 2; arriving points are dropped when within 8R of a stored center and stored as new centers otherwise, until a third center appears. (Charikar et al., STOC 1997)]


SLIDE 21

Doubling Algorithm: raising R

  • Oops, we have > k stored centers!
    – Must drop some and account for the input points they covered within distance 8R.
    – Obs: Some optimal cluster must cover two stored centers, so OPT ≥ (shortest pairwise distance)/2.
    – Assuming that stored centers are always separated by 4R, we can raise R to Rnew = (4R)/2 = 2R.

[Figure: k = 2; three stored centers, pairwise ≥ 4R apart; some optimal cluster must contain two of them, so OPT ≥ 2R = Rnew]

SLIDES 22–25

Doubling Algorithm: merging step

  • Oops, we have > k stored centers!
    – Restore the separation invariant by letting each center greedily subsume others within 4Rnew.
    – Every input point belonging to a subsumed center is within 4Rnew + 8R = 8Rnew of a kept center ⇒ still covered.

[Figure: k = 2; nearby centers merge, and radius-8Rnew clusters around the kept centers absorb the subsumed centers' points]


SLIDE 26

Doubling Algorithm: conclusion

  • Proceed...
  • When end of input is reached, return clusters of radius 8R at stored centers. An 8-approximation.

[Figure: k = 2; final clusters of radius 8Rnew around the two stored centers]
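Here is a compact sketch of the whole algorithm (our phrasing; the lower-bound update applies the observation from slide 21, taking half the shortest pairwise distance over the current centers rather than blindly doubling):

```python
from itertools import combinations

def doubling_kcenter(points, k, dist):
    """Streaming 8-approximation for k-center, sketching the Doubling
    Algorithm of Charikar et al. (STOC 1997); dist is any metric."""
    centers, R = [], 0.0
    for p in points:
        if not any(dist(p, c) <= 8 * R for c in centers):
            centers.append(p)              # not covered: store as a center
        while len(centers) > k:
            # Some optimal cluster covers two stored centers (slide 21),
            # so OPT >= (shortest pairwise distance)/2: a valid new R.
            R = min(dist(a, b) for a, b in combinations(centers, 2)) / 2
            kept = []
            for c in centers:              # greedy merge (slides 22-25):
                if all(dist(c, m) > 4 * R for m in kept):
                    kept.append(c)         # keep centers pairwise > 4R apart
            centers = kept
    return centers, R   # every point read is within 8R of some center
```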

SLIDES 27–28

k-center clustering with outliers

  • Application: Noisy data
  • Clustering can miss up to z input points

[Figure: k = 3, z = 2; the clustering covers all but two points, which are left as outliers]
SLIDES 29–32

k-center clustering "with anonymity"

  • Application: Publish per-cluster statistics without revealing too much about any single input point
  • Each cluster gets ≥ b points

[Figure: k = 3, b = 3; every cluster must have ≥ 3 points]


SLIDES 33–34

k-center clustering "with anonymity"

  • Application: Publish per-cluster statistics without revealing too much about any single input point
  • Each cluster gets ≥ b points

[Figure: k = 3, b = 3; "If this point were not in the input..." its cluster would fall below b points. Each point can "belong" to only one cluster, even if it is within the radii of several.]

SLIDE 35

Currently known results

Problem         Model      Algorithm                                   f_r   f_z   Memory
Basic k-center  Offline    Greedy (Hochbaum-Shmoys '85)                2     –     –
                Offline    Farthest-point (Gonzalez '85)               2     –     –
                Streaming  Doubling (Charikar et al. STOC '97)         8     –     k
                Streaming  Parallelized scaling (ours)                 2+ε   –     ε⁻¹ln(ε⁻¹)k
Outliers        Offline    Greedy (Charikar et al. SODA '01)           4     1     –
                Streaming  Sampling (Charikar et al. STOC '03)         4+ε   1+ε   ε⁻²k(n/z)
                Streaming  Scaling w/ support points (ours)            4+ε   1     ε⁻¹kz
Anonymity       Offline    Flow (Aggarwal et al. PODS '06)             2     –     –
                Streaming  Add scaling pass (ours; not in this talk)   6+ε   –     ε⁻¹ln(ε⁻¹)k + k²

f_r / f_z = approximation factor in cluster radius / in number of outliers.
(With enumerable centers, all 4s become 3.)

SLIDE 36

"Scaling" Algorithm

  • Doubling Algorithm generalized to an arbitrary scaling factor α:
    – Rnew = αR when we rule out an optimal solution < R
    – Centers separated by 2αR
    – Every point within ηR = [2α²/(α − 1)]R of a stored center
  • Merging: 2αRnew + ηR = 2α[1 + 1/(α − 1)]Rnew = ηRnew
  • α = 2 minimizes η, giving η = 8
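Restating the slide's identities as a short derivation (consistent with the merging bound above):

```latex
% Coverage after a merge: a subsumed center lies within 2\alpha R_new of a
% kept center and covered its points within \eta R, so we need
\[
  2\alpha R_{\mathrm{new}} + \eta R \le \eta R_{\mathrm{new}},
  \qquad R_{\mathrm{new}} = \alpha R
  \;\Longrightarrow\;
  \eta \ge \frac{2\alpha^2}{\alpha - 1}.
\]
% Minimizing over \alpha > 1:
\[
  \frac{d}{d\alpha}\,\frac{2\alpha^2}{\alpha - 1}
  = \frac{2\alpha(\alpha - 2)}{(\alpha - 1)^2} = 0
  \;\Longrightarrow\; \alpha = 2,\quad \eta = 8.
\]
```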

SLIDES 37–45

Parallelized Scaling Algorithm

  • m instances with interleaved sequences of R values

[Figure: k = 2, m = 3, α = 3, η = 2(3²)/(3 − 1) = 9; the instances' R sequences interleave as 1, 1.44, 2.08, 3, 4.33, 6.24, 9, 12.98, 18.72, 27, 38.94, 56.16, 81, ..., each instance multiplying its own R by α = 3]

(Similar construction found independently by Guha, 2007)

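As a small illustration (not in the deck) of how the interleaved guesses arise: instance i starts at α^(i/m) and multiplies its own R by α, so together the m instances sweep all powers of α^(1/m).

```python
alpha, m = 3, 3
# Instance i uses the R sequence alpha**(i/m + j) for j = 0, 1, 2, ...
sequences = [[alpha ** (i / m + j) for j in range(5)] for i in range(m)]
for i, seq in enumerate(sequences):
    print(f"instance {i}:", [round(r, 2) for r in seq])
# Interleaved, the guesses form 1, 1.44, 2.08, 3, 4.33, 6.24, 9, ...
print(sorted(round(r, 2) for seq in sequences for r in seq))
```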

SLIDE 46

Benefit of parallelization

  • Take best solution produced by any instance.
  • The instance whose R sequence contains an R1 just below OPT will give a 2[1 + 1/(α − 1)]α^(1/m)-approximation.
  • Good approximation: make α large and m even larger!
  • Obtain a (2 + ε)-approximation with α = O(ε⁻¹) and m = O(ε⁻¹ln(ε⁻¹))

[Figure: the interleaved guesses 1, 1.44, 2.08, 3, 4.33, ..., 81 are spaced by a factor α^(1/m), so some instance's R1 lands within that factor of OPT; with α = 3, m = 3, η = 9 this yields a 4.33-approximation]
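A quick numeric check of the claimed factor; the function below simply evaluates the slide's formula.

```python
def approx_factor(alpha: float, m: int) -> float:
    """Approximation factor 2*[1 + 1/(alpha - 1)] * alpha**(1/m) of the
    parallelized scaling algorithm."""
    return 2 * (1 + 1 / (alpha - 1)) * alpha ** (1 / m)

print(round(approx_factor(3, 3), 2))      # 4.33, the slide's example
# Large alpha and much larger m push the factor toward 2:
print(round(approx_factor(100, 500), 2))  # ~2.04
```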

SLIDE 47

Streaming k-center with outliers

  • Commonalities with the Scaling Algorithm:
    – Keep s ≤ k stored centers and a lower bound R on OPT
    – Solution contains clusters of radius ηR at stored centers. Drop input points arriving in those clusters.
    – Set Rnew = αR when solution becomes infeasible
  • Problem: OPT is no longer guaranteed to cover all our stored centers ⇒ having > k well-separated stored centers no longer means we can raise R.

[Figure: k = 2, z = 4; OPT doesn't cover one stored center (it can call that center's points outliers), so we cannot conclude OPT ≥ 2Rnew]

SLIDES 48–51

Coping with outliers

  • Solution: Don't store a center until it has z + 1 "support points" within βR (for some parameter β). OPT can't mark all the support points as outliers, so it must cover at least one of them.
  • Keep each input point as a "free point" until it can be stored as a center or dropped.

[Figure: k = 3, z = 4; while a point has only ≤ z neighbors within βR, OPT could mark them all as outliers, so don't store a center yet; once z + 1 points gather within βR, OPT must cover one of them, and a center is stored]

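Pulling this slide's rules together with the feasibility and memory checks described on the following slides, a per-point update might look like the sketch below (structure and names are ours, not the paper's; the post-raise offline check and merging pass are only noted in comments).

```python
from dataclasses import dataclass, field

@dataclass
class OutlierState:
    R: float                               # current lower bound on OPT
    centers: list = field(default_factory=list)
    free: list = field(default_factory=list)

def process_point(p, st, k, z, alpha, beta, eta, dist):
    """One streaming step: drop p if a stored cluster covers it, else keep
    it as a free point; promote a free point to a stored center once z + 1
    support points (itself included) sit within beta*R of it."""
    if any(dist(p, c) <= eta * st.R for c in st.centers):
        return                             # arrived inside a stored cluster
    st.free.append(p)
    for q in st.free:                      # promotion check (slides 48-50)
        if sum(dist(x, q) <= beta * st.R for x in st.free) >= z + 1:
            st.centers.append(q)           # OPT can't outlier all z+1 supports
            st.free = [x for x in st.free if dist(x, q) > eta * st.R]
            break
    s = len(st.centers)
    # Raise R when > k centers are stored (slide 53) or the free points
    # exceed the (k - s)z + z memory bound (slide 59); afterwards run the
    # offline feasibility check (slides 55-57) and the merging pass (slide 61).
    if s > k or len(st.free) > (k - s) * z + z:
        st.R *= alpha
```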

SLIDES 52–54

Separation of support, free points

  • Assuming ηR ≥ βR + 2αR, support points of different centers will be separated by 2αR, so an optimal cluster of radius < αR can satisfy at most one stored center.
  • ⇒ OPT must set aside s clusters to satisfy stored centers. (If s > k, raise R.)
    – These clusters can't additionally cover free points.

[Figure: k = 3, z = 4, s = 2; support-point groups are ≥ 2αR apart, so a cluster of radius < αR can't reach two groups, and a cluster that satisfies a stored center doesn't reach the free points]


SLIDES 55–57

Covering free points

  • Can OPT cover the free points with the other k – s clusters of radius < αR and z outliers?
  • Try the 4-approximation offline algorithm with guess αR.
    – Success: Have a feasible solution at ηR, assuming ηR ≥ 4αR.
    – Failure: OPT can't do it ⇒ raise R.

[Figure: k = 3, z = 4, s = 2; the offline algorithm either covers the free points with k – s clusters of radius 4αR, or certifies that radius < αR is impossible]


SLIDE 58

Controlling memory usage

  • Must limit number of free points for O(kz) memory bound.
  • Assuming βR ≥ 2αR, any optimal cluster of radius < αR that covers more than z free points will have one of its points stored as a center.

[Figure: z = 4; a cluster of diameter < 2αR ≤ βR containing > z free points triggers center-storing]

SLIDES 59–60

Controlling memory usage (2)

  • An optimal solution of radius < αR must cover the free points with k – s clusters and z outliers.
  • After center-storing, each of those clusters covers ≤ z free points.
  • ⇒ If there are > (k – s)z + z free points, it's impossible, so we raise R to maintain the memory bound. (Even when we still have a feasible solution, we must raise R to keep within O(kz) memory.)

[Figure: k = 3, z = 4, s = 2; each of the k – s clusters covers ≤ 4 free points]


SLIDE 61

Merging step

  • Must restore separation of 2αR between support points of different centers after raising R.
  • Greedy merging pass as in Scaling Algorithm:
    – Each center c subsumes any centers with support points closer than 2αRnew to a support point of c.
  • Assume βR + 2α(αR) + βR + ηR ≤ η(αR).

[Figure: center c (support radius βR) subsumes a center whose support points lie within 2αRnew of its own; all of the subsumed center's points remain within ηRnew of c]
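A sketch of that subsumption rule in code (the `support` map, holding each center's z + 1 support points, is our assumed bookkeeping, not the paper's data structure):

```python
def merge_centers(centers, support, R_new, alpha, dist):
    """Greedy merging pass after raising R: a kept center c subsumes any
    later center whose support points come within 2*alpha*R_new of c's."""
    kept = []
    for c in centers:
        subsumed = any(dist(x, y) < 2 * alpha * R_new
                       for m in kept
                       for x in support[c] for y in support[m])
        if not subsumed:
            kept.append(c)     # c stays; subsumed centers' points remain
    return kept                # within eta*R_new of a kept center
```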

SLIDE 62

Streaming k-center w/ outliers: result

Constraints:
  ηR ≥ βR + 2αR
  ηR ≥ 4αR
  βR ≥ 2αR
  βR + 2α(αR) + βR + ηR ≤ η(αR)

Satisfied by α = 4, β = 8, η = 16.
m-instance parallelization: 4^(1 + 1/m)-approximation

(4 + ε)-approximation in radius, no more than z outliers, with O(ε⁻¹kz) memory
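A quick verification (not in the deck) that α = 4, β = 8, η = 16 satisfy all four constraints, each with equality (R divides out):

```python
alpha, beta, eta = 4, 8, 16
assert eta >= beta + 2 * alpha                             # 16 >= 16
assert eta >= 4 * alpha                                    # 16 >= 16
assert beta >= 2 * alpha                                   #  8 >=  8
assert beta + 2 * alpha ** 2 + beta + eta <= eta * alpha   # 64 <= 64
print("alpha = 4, beta = 8, eta = 16 satisfy all constraints")
```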

SLIDE 63

Future work

  • Outliers (clustering can miss up to z points):
    – Reduce memory usage to O(ε⁻¹(k + z))
      • We think we have a (14 + ε)-approximation (not fully verified)
      • Would match the Ω(k + z) lower bound for deterministic algorithms
    – Memory requirement when neither z nor n/z is small?
  • Anonymity (each cluster gets ≥ b points):
    – Reduce approximation factor from 6 + ε to 2 + ε?
  • Streaming algorithm for outliers + anonymity
    – Offline 4-approximation is known
  • Multi-pass algorithms

Recap of the outlier results (from slide 35):

Algorithm                            f_r   f_z   Memory
Sampling (Charikar et al. STOC '03)  4+ε   1+ε   ε⁻²k(n/z)
Scaling w/ support points (ours)     4+ε   1     ε⁻¹kz