Distributed Partial Clustering. Sudipto Guha (UPenn), Qin Zhang (IUB), Yi Li (NTU). SPAA 2017 slide deck.

SLIDE 1

Distributed Partial Clustering

Sudipto Guha (UPenn), Qin Zhang (IUB), Yi Li (NTU)

SPAA 2017, July 25, 2017

SLIDE 2

Clustering

  • Metric space (X, d)
  • n input points A; want to find k centers
  • Objective function (k-median):
      min_{K⊆A: |K|=k} Σ_{p∈A} d(p, K),  where d(p, K) = min_{c∈K} d(p, c)
  • k-means: min_{K⊆A: |K|=k} Σ_{p∈A} d²(p, K)
  • k-center: min_{K⊆A: |K|=k} max_{p∈A} d(p, K)
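The three objectives can be sketched directly for a fixed center set K (a minimal Python sketch; `dist`, `kmedian_cost`, etc. are illustrative names, not from the paper):

```python
# Sketch: evaluating the three clustering objectives for a fixed
# center set K, where d(p, K) is the distance of p to its nearest center.

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def d_to_centers(p, K):
    return min(dist(p, c) for c in K)

def kmedian_cost(A, K):
    return sum(d_to_centers(p, K) for p in A)

def kmeans_cost(A, K):
    return sum(d_to_centers(p, K) ** 2 for p in A)

def kcenter_cost(A, K):
    return max(d_to_centers(p, K) for p in A)

A = [(0.0,), (1.0,), (2.0,), (10.0,)]
K = [(0.0,), (10.0,)]
print(kmedian_cost(A, K))  # 1 + 2 = 3.0
print(kcenter_cost(A, K))  # 2.0
```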

SLIDE 3

Clustering with outliers

  • Metric space (X, d)
  • n input points A; want to find k centers, t outliers
  • Objective function ((k, t)-median):
      min_{K,O⊆A: |K|=k, |O|≤t} Σ_{p∈A\O} d(p, K)
  • (k, t)-means: min_{K,O⊆A: |K|=k, |O|≤t} Σ_{p∈A\O} d²(p, K)
  • (k, t)-center: min_{K,O⊆A: |K|=k, |O|≤t} max_{p∈A\O} d(p, K)

Motivation: partial optimization gives much better results
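Note that for a fixed center set K, the best outlier set O is simply the t points farthest from K, so the objective is easy to evaluate once K is chosen (a hedged sketch; names are illustrative):

```python
# Sketch: for fixed centers K, the (k, t)-median cost is the sum of
# the n - t smallest point-to-center distances (the t largest are the
# excluded outliers).

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def kt_median_cost(A, K, t):
    d = sorted(min(dist(p, c) for c in K) for p in A)
    return sum(d[: len(A) - t])  # drop the t largest distances

A = [(0.0,), (1.0,), (2.0,), (100.0,)]   # 100.0 plays the role of noise
K = [(0.0,)]
print(kt_median_cost(A, K, 0))  # 103.0: the noise point dominates
print(kt_median_cost(A, K, 1))  # 3.0: excluding one outlier fixes it
```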

SLIDE 4

Distributed clustering

  • s sites, coordinator model
  • Site i gets Ai; the parties want to cluster A = A1 ∪ . . . ∪ As
  • Want to minimize comm. cost and #comm. rounds
  • For simplicity, assume each point takes Õ(1) bits

Motivation: data is inherently distributed / data is big and does not fit on one machine

[Figure: the coordinator model: sites S1, S2, S3, . . . , Ss holding A1, A2, A3, . . . , As, each connected to a coordinator C (initially ∅); the label "one round" marks one message exchange]

SLIDES 5–6

Clustering on uncertain data

Motivation: data is noisy; a subfield of databases

  • Each data item j is a distribution; call it a node.
    Let σ(j) denote a realization, and π(j) the center point to which j is attached.
  • Objective function (k-median):
      min_{K⊆A: |K|=k} Σ_{j∈A} E_σ[d(σ(j), π(j))]
  • k-means: replace d with d².
  • k-center has two versions, since E and max do not commute:
    – max_{j∈A} E_σ[d(σ(j), π(j))]
    – E_σ[max_{j∈A} d(σ(j), π(j))]

SLIDE 7

Clustering with outliers on uncertain data

  • Each data item j is a distribution; call it a node.
    Let σ(j) denote a realization, and π(j) the center point to which j is attached.
  • Objective function ((k, t)-median):
      min_{K,O⊆A: |K|=k, |O|≤t} Σ_{j∈A\O} E_σ[d(σ(j), π(j))]
  • (k, t)-means: replace d with d².
  • (k, t)-center has two versions, since E and max do not commute:
    – (k, t)-center-pp: max_{j∈A\O} E_σ[d(σ(j), π(j))]
    – (k, t)-center-global: E_σ[max_{j∈A\O} d(σ(j), π(j))]

SLIDES 8–9

Old and New Problems

Problems studied before

  • Clustering [??, XXXX]
  • Clustering with outliers [CKMN, 2001]
  • Clustering on uncertain data [CM, 2008]
  • Distributed clustering [??, XXXX]
  • Distributed clustering with outliers for k-center [MKCWM, 2015]; implicitly also in [GMMMO, 2003]

New problems (this paper)

  • Distributed clustering with outliers for k-median/means
  • Distributed clustering (with outliers) for uncertain data

SLIDES 10–15

Main results

Bicriteria: (α, β)-approx if the cost of SOL is at most αC while excluding βt points, where C is OPT for excluding t points

Main results (all in 2 rounds; under the same framework):

  • (O(1), 1)-approx with Õ(sk + t) comm. for (k, t)-median/center
  • ((1 + 1/ε), 1 + ε)-approx with Õ(sk + t) comm. for (k, t)-median and (k, t)-means, with quadratic local time
  • ((1 + 1/ε), 1 + ε)-approx with Õ(sk + t) comm. for uncertain (k, t)-median/means and (k, t)-center-pp
  • ((1 + 1/ε), 1 + ε)-approx with Õ(sk + tI + s log ∆) comm. for uncertain (k, t)-center-global, where I is the information needed to encode a distribution, and ∆ is the ratio of max-distance to min-distance

Also leads to subquadratic-time (O(1), O(1))-approx centralized algorithms (open for many years)

SLIDES 16–18

Previous results on distributed clustering with outliers

  • Õ(sk + st) bits in 2 rounds for k-center
    (Malkomes, Kusner, Chen, Weinberger, Moseley, 2015)
  • Õ(sk + st) bits in 1 round for k-median/means/center
    Can be derived from (Guha, Meyerson, Mishra, Motwani, O'Callaghan, 2003)
  • Interesting range of parameters: n ≫ t ≫ k, s.
    Consider a modest data set, say n = 10^8. Suppose that 0.1% is noise, so t = 0.001 × n = 10^5. Say s = 1000 and k = 100. Then sk = 10^5 and st = 10^8. Consequently sk + st = 10^8, while sk + t = 10^5.
  • Goal: reduce the st term to t, since the difference translates directly into your data/energy/time bill.
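The parameter example above can be checked with a few lines of arithmetic:

```python
# Sketch: the slide's parameter example, checked numerically.
n = 10**8
t = int(0.001 * n)   # 0.1% of the points are noise
s = 1000
k = 100
print(s * k)          # 10^5
print(s * t)          # 10^8: the st term dominates
print(s * k + t)      # 2 * 10^5: three orders of magnitude smaller
```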

SLIDES 19–21

More related work

(Centralized) clustering with outliers

  • 3-approx for (k, t)-center, (O(1), O(1))-approx for (k, t)-median
    (Charikar, Khuller, Mount, Narasimhan, 2001)
  • O(1)-approx for (k, t)-median by Ke Chen (2008)
  • (k, t)-median with different loss functions (Feldman, Schulman, 2012)

Uncertain data

  • Uncertain k-center/median/means (Cormode, McGregor, 2008)
  • Better results for k-center (Guha, Munagala, 2009)

Distributed clustering (coordinator model)

  • O(1)-approx with Õ(kd + sk) comm. for k-median/means in d-dim Euclidean space (Balcan, Ehrlich, Liang, 2013)
  • Better results for k-means by (Liang, Balcan, Kanchanapally, Woodruff, 2014) and (Cohen, Elder, Musco, Musco, Persu, 2015)

SLIDE 22

Distributed (k, t)-median and the Algorithm Framework

SLIDES 23–27

Two-level distributed clustering (GMMMO, 2003)

  • Consider k-median. Let A = A1 ∪ . . . ∪ As.
  • Site i computes Mi, the set of centers of a local k-median solution on Ai.
    The weight of each p ∈ Mi is the number of points assigned to p.
    Let M = M1 ∪ . . . ∪ Ms, and let L be the sum of the costs of the local solutions.
  • Coordinator computes a weighted clustering on M
    (any O(1)-approx works, but assume it is optimal for now)
  • It holds that Csol(M, k) ≤ O(1) · (L + Copt(A, k)) + L, and L = O(Copt(A, k)); we thus get an O(1)-approx
  • A similar result holds for (k, t)-median:
    – Each site computes a local (k, t)-median solution, then sends both the k centers (with their weights) and the t outliers to the coordinator
    – The coordinator performs a second-level clustering
  • Comm. cost: Õ(sk + st)
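The two-level scheme can be sketched as follows. This is only an illustration of the communication pattern: as a stand-in for a real O(1)-approx k-median solver it uses the farthest-point heuristic, and the coordinator step ignores weights, which a real weighted solver would use; all names are illustrative, not the paper's code.

```python
# Sketch of two-level distributed clustering: each site clusters
# locally and ships k weighted centers; the coordinator clusters the
# union M of all shipped centers.

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def farthest_point_centers(points, k):
    """Stand-in local solver (Gonzalez-style), NOT the paper's solver."""
    centers = [points[0]]
    while len(centers) < k and len(centers) < len(points):
        centers.append(max(points,
                           key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def local_summary(Ai, k):
    """Weighted centers (center -> #points assigned) for one site."""
    centers = farthest_point_centers(Ai, k)
    weights = {c: 0 for c in centers}
    for p in Ai:
        weights[min(centers, key=lambda c: dist(p, c))] += 1
    return weights

def two_level_kmedian(sites, k):
    M = {}                      # first level: Õ(sk) communication total
    for Ai in sites:
        for c, w in local_summary(Ai, k).items():
            M[c] = M.get(c, 0) + w
    # second level: clustering on M at the coordinator
    return farthest_point_centers(list(M), k)

sites = [[(0.0,), (0.5,), (10.0,)], [(10.5,), (0.2,), (9.8,)]]
print(two_level_kmedian(sites, 2))   # one center near 0, one near 10
```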
SLIDES 28–29

Local solutions

  • Let t*_i be the number of excluded points in Ai under OPT(A, k, t)
  • One can show that Σ_{i∈[s]} Copt(Ai, k, t*_i) ≤ O(1) · Copt(A, k, t)
  • Thus if we knew the t*_i, we could apply the result on the previous slide (but of course we do not know the t*_i)
  • We therefore want to minimize Σ_{i∈[s]} Csol(Ai, k, ti) s.t. Σ_{i∈[s]} ti = t.

SLIDES 30–34

Waterfilling

  • Minimize Σ_{i∈[s]} Csol(Ai, k, ti) s.t. Σ_{i∈[s]} ti = t.
  • Convexify the functions: let the curve fi(·) be the lower convex hull of the curve Csol(Ai, k, ·).
    (It takes some work to show that fi(·) is a good approximation of Csol(Ai, k, ·).)
  • Sort all "slopes" ℓ(i, q) = fi(q − 1) − fi(q) (i ∈ [s], q ∈ [t]) and choose the value η of rank t as the threshold.
    Note: for a fixed i, the ℓ(i, q) (q ∈ [t]) are non-increasing.
  • ti is the number of slopes ℓ(i, ·) at Site i that are at least η.

[Figure: for each site Si, the cost curve Csol(Ai, k, ·) (red) on the domain [0, t], its lower convex hull fi(·) (blue), and the resulting allocation t1, t2, t3]
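The waterfilling step can be sketched as follows. For brevity the sketch assumes each cost curve is already convex (the paper first replaces it by its lower convex hull fi); then the slopes are non-increasing per site and waterfilling amounts to taking the t largest slopes overall. Names are illustrative.

```python
# Sketch of waterfilling on convex per-site cost curves:
# slope l(i, q) = C_i(q-1) - C_i(q), threshold eta = slope of rank t,
# allocation t_i = #slopes at site i that are >= eta.
# (Ties at eta can make the total exceed t; the algorithm's bicriteria
# slack absorbs this.)

def waterfill(curves, t):
    """curves[i][q] = cost at site i when excluding q points, q = 0..t."""
    slopes = []
    for i, C in enumerate(curves):
        for q in range(1, t + 1):
            slopes.append((C[q - 1] - C[q], i))
    slopes.sort(reverse=True)
    eta = slopes[t - 1][0]                  # threshold: slope of rank t
    t_alloc = [sum(1 for sl, j in slopes if j == i and sl >= eta)
               for i in range(len(curves))]
    return eta, t_alloc

# Two sites, t = 3; site 0's outliers are much more valuable to remove.
curves = [[100, 50, 20, 5], [30, 25, 21, 18]]   # both convex
eta, t_alloc = waterfill(curves, 3)
print(eta, t_alloc)   # 15 [3, 0]
```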

SLIDES 35–36

Two-round algorithm

  • Each site i sends ℓ(i, q) to the coordinator for q = 1, 2, 3, . . . , t
  • Coordinator determines the threshold (the rank-t element) and sends it to the sites
  • Each site i determines ti and sends its local centers (with their weights) and the ti outliers. Recall that Σ_{i∈[s]} ti = t
  • Coordinator solves the (k, t)-median problem on the (weighted) centers and outliers
  • Comm. cost Õ(sk + st). No improvement, hmm?

SLIDES 37–38

Two-round algorithm (cont.)

  • Each site i sends ℓ(i, q) to the coordinator for q = 1, 2, 4, 8, . . . , t
    (we can do this because the ℓ(i, q) are non-increasing for a fixed i)
  • Coordinator determines the threshold (the rank-2t element)
  • Each site i determines ti and sends its local centers (with their weights) and the ti outliers. Can show that Σ_{i∈[s]} ti ≤ 3t
  • Coordinator solves the (k, t)-median problem on the (weighted) centers and outliers
  • Comm. cost Õ(sk + t).
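The saving comes from the geometric subsampling of the slope sequence. A sketch of the site-side step, under the assumption (mine, for brevity) that t is a power of two: since the slopes are non-increasing in q, each sent slope can stand in for the whole dyadic block of positions it ends, and the coordinator compensates for the approximation by using rank 2t instead of rank t.

```python
# Sketch: send only slopes at q = 1, 2, 4, ..., t (O(log t) values per
# site instead of t), each paired with the length of the dyadic block
# it represents.

def sampled_slopes(slopes):
    """Return (slope value, block length) at q = 1, 2, 4, ..., t."""
    t = len(slopes)
    out = []
    q, prev = 1, 0
    while q <= t:
        out.append((slopes[q - 1], q - prev))  # slope, positions covered
        prev = q
        q *= 2
    return out

# One site with t = 8 non-increasing slopes:
print(sampled_slopes([50, 40, 30, 22, 15, 11, 8, 6]))
# [(50, 1), (40, 1), (22, 2), (6, 4)] -- 4 values sent instead of 8
```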

SLIDE 39

Two-round algorithm, bicriteria

  • Each site i sends ℓ(i, q) to the coordinator for q = 1, 2, 4, 8, . . . , t
  • Coordinator determines the threshold (the rank-2t element)
  • Each site i determines ti and sends its local centers (with the number of associated points) and the ti outliers. Can show that Σ_{i∈[s]} ti ≤ 3t
  • Coordinator solves the (k, (1 + ε)t)-median problem on the (weighted) centers and ignored points
  • Comm. cost Õ(sk + t). Quadratic time locally

SLIDES 40–42

Subquadratic-time centralized algorithm

The reduction (for (k, t)-median/means):
  a (γ, O(1))-approx centralized algo with time Õ(n^{1+α} k²)
  ⇒ a (O(γ), 2)-approx centralized algo with time Õ(t² + n^{(2+2α)/(2+α)} k²)

  – Apply the distributed (k, t)-median/means algo after dividing the set of points arbitrarily into s pieces of size n/s.
  – The sequential simulation of the s sites takes time Õ(s · (n/s)^{1+α} · k²).
  – The coordinator requires time Õ((sk + t)²) = Õ(s²k²) + Õ(t²).
  – Finally, balance by setting n^{1+α} = s^{2+α}.

Apply the reduction O(1) times to further reduce the running time to Õ(n^{1.01} k²) (assuming t ≤ √n), at the cost of a larger (but still O(1)) approximation.
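The balancing step can be verified directly: with s chosen so that n^{1+α} = s^{2+α}, i.e. s = n^{(1+α)/(2+α)}, the two dominant terms match.

```latex
% simulation term:
s\,(n/s)^{1+\alpha} \;=\; n^{1+\alpha} s^{-\alpha}
  \;=\; n^{\,1+\alpha - \frac{\alpha(1+\alpha)}{2+\alpha}}
  \;=\; n^{\frac{2+2\alpha}{2+\alpha}},
\qquad
% coordinator term:
s^2 \;=\; n^{\frac{2(1+\alpha)}{2+\alpha}} \;=\; n^{\frac{2+2\alpha}{2+\alpha}}.
```

So both the site simulation and the s²k² coordinator term are Õ(n^{(2+2α)/(2+α)} k²), giving the stated bound.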

SLIDE 43

Distributed (k, t)-Center

SLIDE 44

Gonzalez's algorithm

Gonzalez's algorithm for k-center:

  • Let S = {z1, . . . , zn} be a data set
  • Choose z1 ∈ S arbitrarily as the first center. Let Zi = {z1, . . . , zi}
  • For i = 2 to n, set zi = arg max_{x∈S} d(x, Zi−1)

This yields an ordering z1, . . . , zn of S
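The farthest-point ordering above can be sketched in a few lines (names are illustrative; the prefix of length k of the ordering is the classic 2-approx for k-center):

```python
# Sketch of Gonzalez's ordering: repeatedly pick the point farthest
# from the prefix chosen so far, maintaining each point's distance to
# the prefix incrementally.

def gonzalez_order(S, dist):
    order = [S[0]]                            # z1: arbitrary first center
    rest = S[1:]
    d_to = {p: dist(p, order[0]) for p in rest}   # distance to the prefix
    while rest:
        z = max(rest, key=lambda p: d_to[p])  # farthest remaining point
        order.append(z)
        rest.remove(z)
        for p in rest:                        # the new center may be closer
            d_to[p] = min(d_to[p], dist(p, z))
    return order

d = lambda p, q: abs(p - q)
print(gonzalez_order([0, 1, 9, 10, 5], d))
```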

SLIDE 45

Two-round algorithm for distributed (k, t)-center

  • Site i runs Gonzalez's algorithm to obtain a re-ordering {a1, . . . , a_{ni}} of the points in Ai
  • Site i, for each 1 ≤ q ≤ t, computes ℓ(i, q) ← min_{j<k+q} d(aj, a_{k+q})
  • Sites and coordinator sort the {ℓ(i, q)}, and then follow the subsequent steps of the previous framework. In the second-level clustering we use an algo for k-center with exactly t outliers.
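The site-side quantity ℓ(i, q), the distance of the (k+q)-th point in the Gonzalez ordering to all earlier points, can be sketched as follows (self-contained; names are illustrative):

```python
# Sketch: compute l(i, q) = min_{j < k+q} d(a_j, a_{k+q}) for q = 1..t
# from the farthest-point ordering of a site's points. By construction
# these values are non-increasing in q.

def gonzalez_order(S, dist):
    order, rest = [S[0]], S[1:]
    while rest:
        z = max(rest, key=lambda p: min(dist(p, c) for c in order))
        order.append(z)
        rest.remove(z)
    return order

def center_slopes(A, k, t, dist):
    a = gonzalez_order(A, dist)          # a[0] is a_1, a[i] is a_{i+1}
    return [min(dist(a[j], a[k + q - 1]) for j in range(k + q - 1))
            for q in range(1, t + 1)]

print(center_slopes([0, 1, 9, 10, 5], 2, 2, lambda p, q: abs(p - q)))
# [5, 1]
```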

SLIDE 46

Uncertain Data

SLIDE 47

Uncertain data – (k, t)-median/means/center-pp

  • Reduce the clustering problems to the deterministic case: collapse each node/cloud j to its optimal center yj = arg min_y Eσ[d(σ(j), y)]
  • Fully connect the yj's using the metric distance d(yi, yj)
  • Attach a pendant vertex pj to yj, with edge cost Eσ[d(σ(j), yj)]
  • Apply the previous framework on the compressed graph G; dG(u, v) for u, v ∈ G is the shortest-path distance
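The collapsing step can be sketched for discrete distributions on the line (my simplification for illustration: the sketch scans only support points as candidate centers, and all names are illustrative):

```python
# Sketch of the compression step: each node j, given as a list of
# (point, prob) pairs, is collapsed to a center y_j minimizing the
# expected distance, and the pendant edge cost is E[d(sigma(j), y_j)].

def collapse(node):
    """node: list of (point, prob). Return (y_j, E[d(sigma(j), y_j)])."""
    best = min(node,
               key=lambda cand: sum(p * abs(x - cand[0]) for x, p in node))
    y = best[0]
    return y, sum(p * abs(x - y) for x, p in node)

nodes = [
    [(0.0, 0.5), (2.0, 0.5)],      # mass split evenly around 1.0
    [(10.0, 0.9), (0.0, 0.1)],     # mostly concentrated near 10
]
for y, c in map(collapse, nodes):
    print(y, c)
```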

SLIDE 48

Uncertain data – (k, t)-center-global

  • Use the idea from [Guha, Munagala, 2009]: reduce center to median.
  • Use a truncated distance function:
      Lτ(u, v) = max{d(u, v) − τ, 0},  ρτ(j, u) = Eσ[Lτ(σ(j), u)]
  • Perform a parametric search on τ, and then apply our previous framework
  • Find a τ s.t. Σ_i Csol(Ai, 2k, ti(τ), ρτ) ≈ τ, where ti(τ) is the number of local outliers at Site i (after applying the previous framework)
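The truncated distance is simple to state in code (discrete 1-D distributions; names are illustrative):

```python
# Sketch: L_tau clips off the first tau units of distance, and rho_tau
# takes its expectation over the node's distribution.

def L(tau, u, v):
    return max(abs(u - v) - tau, 0.0)

def rho(tau, node, u):
    """node: list of (point, prob); expected truncated distance to u."""
    return sum(p * L(tau, x, u) for x, p in node)

node = [(0.0, 0.5), (6.0, 0.5)]
print(rho(0.0, node, 0.0))   # 3.0: plain expected distance
print(rho(2.0, node, 0.0))   # 2.0: each distance reduced by up to 2
```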

SLIDE 49

Concluding remarks

Open problems

  • Lower bounds. Ω(sk + t) holds for (k, t)-median/means/center if the algo needs to output all the outliers. What if not?
  • Better approximation ratios?

Summary

  • For (k, t)-median/means: Õ(sk + t) communication, 2 rounds, O(1)-approx ((1 + ε)t outliers for k-means).
  • A subquadratic-time (O(1), O(1))-approx centralized algorithm for (k, t)-median/means
  • Can handle the uncertain-data cases with similar comm. and round costs.

SLIDE 50

Thank you! Questions?