1-1
Distributed Partial Clustering
Sudipto Guha (UPenn), Yi Li (NTU), Qin Zhang (IUB)
SPAA 2017, July 25, 2017
2-1
Clustering
- Metric space (X, d)
- n input points A; want to find k centers
- Objective function (k-median): min_{K⊆A:|K|=k} Σ_{p∈A} d(p, K)
- k-means: min_{K⊆A:|K|=k} Σ_{p∈A} d²(p, K)
- k-center: min_{K⊆A:|K|=k} max_{p∈A} d(p, K)
3-1
Clustering with outliers
- Metric space (X, d)
- n input points A; want to find k centers and t outliers
- Objective function ((k, t)-median): min_{K,O⊆A:|K|=k,|O|≤t} Σ_{p∈A\O} d(p, K)
- (k, t)-means: min_{K,O⊆A:|K|=k,|O|≤t} Σ_{p∈A\O} d²(p, K)
- (k, t)-center: min_{K,O⊆A:|K|=k,|O|≤t} max_{p∈A\O} d(p, K)
Motivation: partial optimization gives much better results.
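As a sanity check on the objective, the (k, t)-median cost can be computed exactly by brute force on tiny inputs. This is an illustrative sketch only (the name `kt_median_cost` and the 1-D example are made up here, and the running time is exponential); the algorithms in the talk are far more efficient.

```python
from itertools import combinations

def kt_median_cost(points, k, t, d):
    """Exact (k, t)-median cost by exhaustive search: try every set of k
    centers, then exclude the t points farthest from their nearest center.
    Exponential time; for illustration only."""
    best = float("inf")
    for centers in combinations(points, k):
        dists = sorted(min(d(p, c) for c in centers) for p in points)
        best = min(best, sum(dists[:len(points) - t]))  # drop t outliers
    return best

# 1-D example: the single far-away noise point is absorbed as the outlier.
pts = [0.0, 1.0, 2.0, 10.0, 11.0, 100.0]
d = lambda p, q: abs(p - q)
print(kt_median_cost(pts, k=2, t=1, d=d))
```

With t = 0 the same code computes the plain k-median cost, which here would be dominated by the point at 100.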
4-1
Distributed clustering
- s sites, coordinator model
- Site i gets Ai; the parties want to cluster A = A1 ∪ … ∪ As
- Want to minimize the communication cost and the number of communication rounds
- For simplicity, assume each point takes Õ(1) bits
Motivation: data is inherently distributed, or data is big and does not fit on one machine.
[Figure: coordinator model. Sites S1, S2, S3, …, Ss hold A1, A2, A3, …, As; the coordinator C starts with ∅; one round of communication is shown.]
5-2
Clustering on uncertain data
- Each data item j is a distribution; call it a node.
  Motivation: data is noisy; a subfield in databases.
  Let σ(j) denote a realization of node j, and π(j) the center point to which j is attached.
- Objective function (k-median): min_{K⊆A:|K|=k} Σ_{j∈A} E_σ[d(σ(j), π(j))]
- k-means: replace d(p, K) with d²(p, K).
- k-center has two versions, since E and max do not commute:
  – max_{j∈A} E_σ[d(σ(j), π(j))]
  – E_σ[max_{j∈A} d(σ(j), π(j))]
6-1
Clustering with outliers on uncertain data
- Each data item j is a distribution; call it a node.
  Let σ(j) denote a realization of node j, and π(j) the center point to which j is attached.
- Objective function ((k, t)-median): min_{K,O⊆A:|K|=k,|O|≤t} Σ_{j∈A\O} E_σ[d(σ(j), π(j))]
- (k, t)-means: replace d(p, K) with d²(p, K).
- (k, t)-center has two versions, since E and max do not commute:
  – max_{j∈A\O} E_σ[d(σ(j), π(j))] (the (k, t)-center-pp version)
  – E_σ[max_{j∈A\O} d(σ(j), π(j))] (the (k, t)-center-global version)
7-2
Old and New Problems
Problems studied before:
- Clustering
- Clustering with outliers [CKMN, 2001]
- Clustering on uncertain data [CM, 2008]
- Distributed clustering [??, XXXX]
- Distributed clustering with outliers for k-center [MKCWM, 2015]; implicitly also in [GMMMO, 2003]
New problems (this paper):
- Distributed clustering with outliers for k-median/means
- Distributed clustering (with outliers) for uncertain data
8-6
Main results
Bicriteria: an (α, β)-approximation has solution cost at most αC while excluding βt points, where C is the optimal cost when excluding t points.
Main results (all in 2 rounds, under the same framework):
- (O(1), 1)-approx with Õ(sk + t) comm. for (k, t)-median/center
- ((1 + 1/ε), 1 + ε)-approx with Õ(sk + t) comm. for (k, t)-median and (k, t)-means, with quadratic local running time
- ((1 + 1/ε), 1 + ε)-approx with Õ(sk + t) comm. for uncertain (k, t)-median/means and (k, t)-center-pp
- ((1 + 1/ε), 1 + ε)-approx with Õ(sk + tI + s log ∆) comm. for uncertain (k, t)-center-global, where I is the information needed to encode a distribution and ∆ is the ratio of the maximum to the minimum distance
This also leads to subquadratic-time (O(1), O(1))-approx centralized algorithms (open for many years).
9-3
Previous results on distributed clustering with outliers
- Õ(sk + st) bits in 2 rounds for k-center (Malkomes, Kusner, Chen, Weinberger, Moseley, 2015)
- Õ(sk + st) bits in 1 round for k-median/means/center; can be derived from (Guha, Meyerson, Mishra, Motwani, O'Callaghan, 2003)
- Interesting range of parameters: n ≫ t ≫ k, s.
  Consider a modest data set, say n = 10^8. Suppose that 0.1% is noise, so t = 0.001 × n = 10^5. Say s = 1000 and k = 100. Then sk = 10^5 and st = 10^8. Consequently sk + st ≈ 10^8, while sk + t ≈ 10^5.
- Goal: reduce the st term to t; the difference shows up directly in your data/energy/time bill.
10-3
More related work
(Centralized) clustering with outliers
- 3-approx for (k, t)-center and (O(1), O(1))-approx for (k, t)-median (Charikar, Khuller, Mount, Narasimhan, 2001); O(1)-approx for (k, t)-median by Ke Chen (2008)
- (k, t)-median with different loss functions (Feldman, Schulman, 2012)
Uncertain data
- Uncertain k-center/median/means (Cormode, McGregor, 2008)
- Better results for k-center (Guha, Munagala, 2009)
Distributed clustering (coordinator model)
- O(1)-approx with Õ(kd + sk) comm. for k-median/means in d-dimensional Euclidean space (Balcan, Ehrlich, Liang, 2013)
- Better results for k-means (Liang, Balcan, Kanchanapally, Woodruff, 2014) and (Cohen, Elder, Musco, Musco, Persu, 2015)
11-1
Distributed (k, t)-median and the Algorithm Framework
12-5
Two-level distributed clustering (GMMMO, 2003)
- Consider k-median. Let A = A1 ∪ … ∪ As.
- Site i computes Mi, the set of centers of a local k-median solution on Ai (any O(1)-approx works; assume optimal for now). The weight of each p ∈ Mi is the number of points assigned to p. Let M = M1 ∪ … ∪ Ms, and let L be the sum of the costs of the local solutions.
- The coordinator computes a weighted clustering on M.
- It holds that Csol(M, k) ≤ O(1) · (L + Copt(A, k)) + L, and L = O(Copt(A, k)); we thus get an O(1)-approx.
- A similar result holds for (k, t)-median:
  – Each site computes a local (k, t)-median and then sends both the k centers (with their weights) and the t outliers to the coordinator.
  – The coordinator performs a second-level clustering.
- Comm. cost: Õ(sk + st).
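The two-level idea can be sketched in a few lines. This is a toy illustration, not the paper's algorithm: the helper `kmedian` below is an exhaustive-search stand-in for any O(1)-approx local solver, and all names are invented here.

```python
from itertools import combinations

def kmedian(points, weights, k, d):
    """Exact weighted k-median by exhaustive search (illustration only).
    Returns (cost, centers, assignment of each point to its center)."""
    best = (float("inf"), None, None)
    for centers in combinations(points, k):
        assign = [min(centers, key=lambda c: d(p, c)) for p in points]
        cost = sum(w * d(p, c) for p, w, c in zip(points, weights, assign))
        if cost < best[0]:
            best = (cost, centers, assign)
    return best

def two_level(sites, k, d):
    # Level 1: each site clusters locally, ships (center, weight) pairs.
    summary_pts, summary_wts = [], []
    for A in sites:
        _, centers, assign = kmedian(A, [1] * len(A), k, d)
        for c in centers:
            summary_pts.append(c)
            summary_wts.append(sum(1 for a in assign if a == c))
    # Level 2: the coordinator clusters the weighted summary.
    _, centers, _ = kmedian(summary_pts, summary_wts, k, d)
    return centers

d = lambda p, q: abs(p - q)
sites = [[0.0, 1.0, 9.0], [2.0, 10.0, 11.0]]
print(sorted(two_level(sites, k=2, d=d)))
```

Only the 2k weighted summary points per site cross the network; the guarantee on the slide says clustering this summary loses only a constant factor.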
13-2
Local solutions
- Let t*_i be the number of excluded points in Ai in OPT(A, k, t).
- One can show that Σ_{i∈[s]} Copt(Ai, k, t*_i) ≤ O(1) · Copt(A, k, t).
- Thus if we knew the t*_i, we could apply the result on the previous slide (but of course we do not know the t*_i).
- We want to minimize Σ_{i∈[s]} Csol(Ai, k, ti) s.t. Σ_{i∈[s]} ti = t.
14-5
Waterfilling
- Minimize Σ_{i∈[s]} Csol(Ai, k, ti) s.t. Σ_{i∈[s]} ti = t.
- Convexify the functions: let fi(·) be the lower convex hull of the curve Csol(Ai, k, ·). (It takes some work to show that fi(·) is a good approximation of Csol(Ai, k, ·).)
- Sort all "slopes" ℓ(i, q) = fi(q − 1) − fi(q) (i ∈ [s], q ∈ [t]) and choose the value η of rank t as the threshold. Note: for a fixed i, the slopes ℓ(i, q) (q ∈ [t]) are non-increasing.
- ti is the number of slopes ℓ(i, ·) at site i that are at least η.
[Figure: for each site S1, S2, S3, the cost curve Csol(Ai, k, ·) (red) over 0..t, its lower convex hull fi(·) (blue), and the chosen t1, t2, t3.]
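A minimal sketch of this convexify-and-threshold step, assuming each site's cost curve Csol(Ai, k, q) for q = 0..t is given as a list. The names `lower_convex_hull` and `waterfill` are invented here, and tie-breaking among equal slopes is arbitrary (the code caps the allocation at t).

```python
def lower_convex_hull(costs):
    """Lower convex hull f(0..t) of a cost curve given as a list of values,
    via the monotone-chain construction, interpolated back onto integers."""
    hull = []
    for p in enumerate(costs):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # pop hull[-1] if it lies on or above the segment hull[-2] -> p
            if (y2 - y1) * (p[0] - x1) >= (p[1] - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append(p)
    f = []
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        for x in range(x1, x2):
            f.append(y1 + (y2 - y1) * (x - x1) / (x2 - x1))
    f.append(hull[-1][1])
    return f

def waterfill(curves, t):
    """Pick (t1, ..., ts) summing to t by thresholding the slopes
    l(i, q) = f_i(q-1) - f_i(q) at the rank-t slope value eta."""
    slopes = []
    for i, c in enumerate(curves):
        f = lower_convex_hull(c)
        slopes += [(f[q - 1] - f[q], i) for q in range(1, t + 1)]
    slopes.sort(reverse=True)
    eta = slopes[t - 1][0]            # rank-t slope = threshold
    ti = [0] * len(curves)
    for val, i in slopes:
        if val >= eta and sum(ti) < t:
            ti[i] += 1
    return ti

# Hypothetical cost curves Csol(Ai, k, q) for q = 0..3 at two sites.
print(waterfill([[10, 4, 1, 0], [6, 5, 4, 3]], t=3))
```

Steep curves (sites whose cost drops fast when excluding points) receive more of the outlier budget, which is exactly the waterfilling intuition.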
15-2
Two-round algorithm
- Each site i sends ℓ(i, q) to the coordinator for q = 1, 2, 3, …, t.
- The coordinator determines the threshold (the rank-t element) and sends it to the sites.
- Each site i determines ti and sends its local centers (with their weights) and its ti outliers. Recall that Σ_{i∈[s]} ti = t.
- The coordinator solves the (k, t)-median problem on the (weighted) centers and outliers.
- Comm. cost: Õ(sk + st). No improvement, hmm?
16-2
Two-round algorithm (cont.)
- Each site i sends ℓ(i, q) to the coordinator only for q = 1, 2, 4, 8, …, t; we can do this because the ℓ(i, q) are non-increasing for a fixed i.
- The coordinator determines the threshold (the rank-2t element).
- Each site i determines ti and sends its local centers (with their weights) and its ti outliers. One can show that Σ_{i∈[s]} ti ≤ 3t.
- The coordinator solves the (k, t)-median problem on the (weighted) centers and outliers.
- Comm. cost: Õ(sk + t).
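One plausible way to instantiate the geometric subsampling (the paper's exact rule may differ): each sent slope ℓ(i, 2^j) stands in for the 2^(j−1) slopes it lower-bounds, and the coordinator thresholds at the weighted rank-2t value. The sketch below assumes t is a power of two and at least two sites; all names are invented for illustration.

```python
import math

def subsampled_threshold(slopes_per_site, t):
    """Each site sends only l(i, q) for q = 1, 2, 4, ..., t (O(log t)
    values). The coordinator gives l(i, 2^j) multiplicity 2^(j-1) and
    picks the weighted rank-2t value as the threshold eta."""
    qs = [2 ** j for j in range(int(math.log2(t)) + 1)]
    weighted = []
    for sl in slopes_per_site:
        prev = 0
        for q in qs:
            weighted.append((sl[q - 1], q - prev))  # (value, multiplicity)
            prev = q
    weighted.sort(reverse=True)
    eta, acc = weighted[-1][0], 0
    for val, w in weighted:
        acc += w
        if acc >= 2 * t:
            eta = val
            break
    # each site keeps ti = number of its slopes that are at least eta
    return [sum(1 for v in sl if v >= eta) for sl in slopes_per_site]

slopes = [[9, 7, 5, 3], [8, 2, 2, 1]]   # hypothetical, non-increasing
t = 4
ti = subsampled_threshold(slopes, t)
assert sum(ti) <= 3 * t   # the guarantee claimed on the slide
print(ti)
```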
17-1
Two-round algorithm, bicriteria
- Each site i sends ℓ(i, q) to the coordinator for q = 1, 2, 4, 8, …, t.
- The coordinator determines the threshold (the rank-2t element).
- Each site i determines ti and sends its local centers (with the number of associated points) and its ti outliers. One can show that Σ_{i∈[s]} ti ≤ 3t.
- The coordinator solves the (k, (1 + ε)t)-median problem on the (weighted) centers and ignored points.
- Comm. cost: Õ(sk + t). Quadratic time locally.
18-3
Subquadratic-time centralized algorithm
The reduction (for (k, t)-median/means): a (γ, O(1))-approx centralized algorithm with running time Õ(n^{1+α} k²) ⇒ an (O(γ), 2)-approx centralized algorithm with running time Õ(t² + n^{(2+2α)/(2+α)} k²).
– Apply the distributed (k, t)-median/means algorithm after dividing the set of points arbitrarily into s pieces of size n/s.
– The sequential simulation of the s sites takes time Õ(s · (n/s)^{1+α} · k²).
– The coordinator requires time Õ((sk + t)²) = Õ(s²k²) + Õ(t²).
– Finally, balance by setting n^{1+α} = s^{2+α}.
Apply the reduction O(1) times to further reduce the running time to Õ(n^{1.01} k²) (assuming t ≤ √n), at the cost of a larger (but still O(1)) approximation ratio.
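The balancing step can be written out explicitly. The site simulation costs Õ(s · (n/s)^{1+α} k²) = Õ(n^{1+α} s^{−α} k²); equating this with the coordinator's s²k² term fixes s and yields the stated exponent:

```latex
\frac{n^{1+\alpha}}{s^{\alpha}}\,k^2 = s^2 k^2
\;\Longleftrightarrow\;
n^{1+\alpha} = s^{2+\alpha}
\;\Longleftrightarrow\;
s = n^{\frac{1+\alpha}{2+\alpha}},
\qquad\text{giving total time}\;
\tilde O\!\Big(t^2 + n^{\frac{2+2\alpha}{2+\alpha}}\,k^2\Big).
```

Since (2 + 2α)/(2 + α) < 2 for every α < 2, the resulting time is subquadratic in n whenever t² is.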
19-1
Distributed (k, t)-Center
20-1
Gonzalez's algorithm
Gonzalez's algorithm for k-center:
- Let S = {z1, …, zn} be the data set.
- Choose z1 ∈ S arbitrarily as the first center. Let Zi = {z1, …, zi}.
- For i = 2 to n, set zi = arg max_{x∈S} d(x, Z_{i−1}).
This produces an ordering z1, …, zn of S.
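A standard implementation of this farthest-point ordering, as a sketch (names are illustrative; O(n²) distance evaluations, maintaining each point's distance to the current centers incrementally):

```python
def gonzalez_order(points, d):
    """Farthest-point (Gonzalez) ordering: repeatedly pick the point
    farthest from the centers chosen so far."""
    order = [points[0]]                        # z1: arbitrary first center
    rest = list(points[1:])
    dist = {p: d(p, order[0]) for p in rest}   # d(p, Z_i), kept up to date
    while rest:
        far = max(rest, key=lambda p: dist[p])
        order.append(far)
        rest.remove(far)
        for p in rest:                         # new center may be closer
            dist[p] = min(dist[p], d(p, far))
    return order

d = lambda p, q: abs(p - q)
print(gonzalez_order([0.0, 1.0, 2.0, 10.0], d))
```

The classical guarantee is that the first k points of this ordering are a 2-approximation for k-center.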
21-1
Two-round algorithm for distributed (k, t)-center
- Site i runs Gonzalez's algorithm to obtain a reordering {a1, …, ani} of the points in Ai.
- Site i, for each 1 ≤ q ≤ t, computes ℓ(i, q) ← min_{j<k+q} d(aj, a_{k+q}).
- The sites and the coordinator sort {ℓ(i, q)} and then follow the subsequent steps of the previous framework. In the second-level clustering we use an algorithm for k-center with exactly t outliers.
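Given a Gonzalez ordering, these slopes can be read off directly. A sketch (the name `outlier_slopes` is invented; indices are 0-based, so a_{k+q} is `order[k + q - 1]`):

```python
def outlier_slopes(order, k, t, d):
    """l(i, q) = min_{j < k+q} d(a_j, a_{k+q}) for q = 1..t, computed from
    a farthest-point ordering a_1, ..., a_n (stored 0-indexed in `order`).
    By the ordering's defining property these values are non-increasing."""
    return [min(d(order[j], order[k + q - 1]) for j in range(k + q - 1))
            for q in range(1, t + 1)]

order = [0.0, 10.0, 2.0, 1.0]   # a Gonzalez ordering of {0, 1, 2, 10}
d = lambda p, q: abs(p - q)
print(outlier_slopes(order, k=2, t=2, d=d))
```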
22-1
Uncertain Data
23-1
Uncertain data – (k, t)-median/means/center-pp
- Reduce the clustering problems to the deterministic case: collapse each node/cloud j to its optimal center yj = arg min_y E_σ[d(σ(j), y)].
- Fully connect the yj's using the usual metric distance d(yi, yj).
- Attach a vertex pj to yj, with edge cost E_σ[d(σ(j), yj)].
- Apply the previous framework on the compressed graph G; dG(u, v) for u, v ∈ G is the shortest-path distance.
24-1
Uncertain data – (k, t)-center-global
- Use the idea from [Guha, Munagala, 2009] to reduce center to median.
- Use a truncated distance function: Lτ(u, v) = max{d(u, v) − τ, 0} and ρτ(j, u) = E_σ[Lτ(σ(j), u)].
- Perform a parametric search on τ, and then apply our previous framework.
- Find a τ s.t. Σ_i Csol(Ai, 2k, ti(τ), ρτ) ≈ τ, where ti(τ) is the number of local outliers at site i (after applying the previous framework).
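The truncated distances are straightforward to compute for a node given as a discrete distribution. A small sketch (function names and the two-point node are made up for illustration):

```python
def L(tau, u, v, d):
    """Truncated distance L_tau(u, v) = max{d(u, v) - tau, 0}."""
    return max(d(u, v) - tau, 0.0)

def rho(tau, node, u, d):
    """rho_tau(j, u) = E_sigma[L_tau(sigma(j), u)] for a node given as a
    discrete distribution [(point, probability), ...]."""
    return sum(p * L(tau, x, u, d) for x, p in node)

d = lambda a, b: abs(a - b)
node = [(0.0, 0.5), (4.0, 0.5)]   # hypothetical two-point realization
print(rho(1.0, node, 1.0, d))
```

Truncation at τ means realizations within distance τ of the candidate center contribute nothing, which is what lets a median-type objective certify a center-type radius.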
25-1
Concluding remarks
Summary
- For (k, t)-median/means: Õ(sk + t) communication, 2 rounds, O(1)-approx ((1 + ε)t outliers for k-means).
- Subquadratic-time (O(1), O(1))-approx centralized algorithms for (k, t)-median/means.
- The uncertain-data cases can be handled with similar communication and round costs.
Open problems
- Lower bounds: Ω(sk + t) for (k, t)-median/means/center holds if the algorithm needs to output all the outliers. What if not?
- Better approximation ratios?