
Communication-Efficient Computation on Distributed Noisy Datasets

SPAA’15, June 15, 2015

Qin Zhang, Indiana University Bloomington


Model of computation

The coordinator model: k sites and 1 coordinator.
– Each site has a 2-way communication channel with the coordinator.
– Each site Si has a piece of data xi; the coordinator starts with ∅.
– Task: compute f(x1, ..., xk) together via communication. The coordinator reports the answer.
– Computation is divided into rounds.
– No constraint on the number of bits each site can send per round (usually balanced).
– Local computation is not counted (usually linear).
– Goal: minimize both
  • the total #bits of communication (o(Input); ideally polylog(Input)),
  • and the #rounds (O(1) or polylog(Input)).

[Figure: sites S1, S2, S3, ..., Sk, each holding data xi, connected by 2-way channels to the coordinator C, which holds ∅; one round of communication is highlighted.]
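To make the cost accounting concrete, here is a minimal Python sketch of the model (the class and method names are illustrative inventions, not from the paper): a coordinator talks to k sites in rounds, charging only communicated bits, never local computation.

```python
# Minimal sketch of the coordinator model (hypothetical names, for illustration):
# k sites, one coordinator, round-based communication, bit and round counters.

class Site:
    def __init__(self, data):
        self.data = data            # x_i: this site's piece of the input

    def answer(self, query):
        # Local computation is free in the model; only messages are charged.
        if query == "size":
            return len(self.data)
        raise ValueError("unknown query")

class Coordinator:
    def __init__(self, sites):
        self.sites = sites          # the k sites
        self.bits_sent = 0          # total communication, in bits
        self.rounds = 0             # number of communication rounds

    def broadcast_and_collect(self, query):
        """One round: send a query to every site, collect their replies."""
        self.rounds += 1
        replies = []
        for site in self.sites:
            reply = site.answer(query)
            # Charge the (approximate) size of both directions of the channel.
            self.bits_sent += len(repr(query)) * 8 + len(repr(reply)) * 8
            replies.append(reply)
        return replies

# Example: compute m = |S| = sum of |S_i| in one round.
sites = [Site([1, 2, 2]), Site([2, 3]), Site([4])]
coord = Coordinator(sites)
m = sum(coord.broadcast_and_collect("size"))
print(m, coord.rounds, coord.bits_sent)   # 6 items, 1 round, a small bit count
```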


The coordinator model (cont.)

The coordinator model is an abstraction of:
– the BSP model;
– the MapReduce model. [Figure: the MapReduce pipeline (Input → Map → Shuffle → Reduce → Output), drawn as equivalent to the coordinator model with sites S1, ..., Sk and coordinator C.]

Communication → time, energy, bandwidth, ...

It also models network monitoring, sensor networks, etc.


The distributed distinct elements (F0) problem

[Figure: sites S1, ..., Sk, each holding a bag of items, connected to the coordinator C.]

The function f can be: how many distinct elements (F0) are there in the union of the k bags?

We almost always allow a (1 + ε)-approximation.

Important in: traffic monitoring, query optimization, ...
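For contrast, the trivial solution simply ships everything to the coordinator. A two-line baseline (illustrative only) whose communication is Θ(input size), exactly what we want to beat with o(Input) bits:

```python
# Naive baseline: each site sends its entire bag; the coordinator counts
# distinct elements. Communication is Theta(input size), which is what
# the algorithms in this talk are designed to avoid.
bags = [[1, 2, 2, 5], [2, 3], [5, 5, 7]]      # k = 3 sites
f0 = len(set().union(*map(set, bags)))
print(f0)   # 5 distinct elements: {1, 2, 3, 5, 7}
```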


Existing solution: linear sketches

[Figure: each site Si computes a local linear sketch of its bag and ships it to the coordinator C; the global sketch is the sum of the local sketches.]

How many distinct elements (F0) are there in the union of the k bags?


Linear sketches

A random linear mapping M : R^n → R^d where d ≪ n.

[Figure: Mx = M · x; the data x (e.g., a frequency vector), the linear mapping M, and the sketching vector Mx, from which f(x) can be (approximately) recovered.]

Simple and useful: statistical/graph/algebraic problems in data streams, compressive sensing, ...

Perfect for distributed computation: the data is distributed as x = x1 + ... + xk, with xi on site i. Merge using linearity: Mx1 + ... + Mxk = M(x1 + ... + xk).
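A minimal numeric check of the merging identity (illustrative only: a dense Gaussian M rather than an actual F0 sketch, since only linearity matters here):

```python
import numpy as np

# Linearity demo: M(x1 + ... + xk) equals Mx1 + ... + Mxk, so each site can
# sketch locally and the coordinator just adds up the sketches.
rng = np.random.default_rng(0)
n, d, k = 1000, 20, 3                 # universe size, sketch size, #sites
M = rng.standard_normal((d, n))       # shared random linear map (same seed everywhere)

xs = [rng.integers(0, 3, size=n) for _ in range(k)]    # local frequency vectors
local_sketches = [M @ x for x in xs]                   # computed at the sites

global_from_locals = sum(local_sketches)               # coordinator merges
global_direct = M @ sum(xs)                            # sketch of the union
assert np.allclose(global_from_locals, global_direct)  # linearity holds
```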


Linear sketches cannot work for noisy datasets

Real-world distributed datasets are often noisy!

[Figure: sites S1, ..., Sk and coordinator C; the sites hold noisy records for the same person, such as:]
– John Smith, 800 Mountain Av springfield
– Joe Smith, 800 Mount Av Springfield
– Joseph Smith, 800 Mt. Road Springfield
– Joe Smith, 800 Mt. Road Springfield

We (have to) consider similar items as one element. Then how do we compute F0?

We cannot use linear sketches: the frequencies of items representing the same entity may be mapped into different coordinates of the sketching vector.
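To see the failure concretely, here is a small check (illustrative only) that hash-based indexing sends near-duplicate records to unrelated coordinates of the frequency vector:

```python
import hashlib

# Near-duplicates of the same entity hash to unrelated coordinates, so a
# hash-indexed linear sketch treats them as 4 different elements.
records = [
    "John Smith, 800 Mountain Av springfield",
    "Joe Smith, 800 Mount Av Springfield",
    "Joseph Smith, 800 Mt. Road Springfield",
    "Joe Smith, 800 Mt. Road Springfield",
]
d = 64  # sketch width
for r in records:
    coord = int(hashlib.sha1(r.encode()).hexdigest(), 16) % d
    print(coord, r)   # four (almost surely) distinct coordinates
```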


Noisy data is universal

Music, images, ...: after compression, resizing, reformatting, etc.

Queries with the same meaning sent to Google:
“SPAA 2015”
“27th ACM Symposium on Parallelism in Algorithms and Architectures”
“ACM FCRC SPAA’15”


Related to Entity Resolution

Entity Resolution: identify and link/group different manifestations of the same real-world object. It is very important in data cleaning/integration, and has been studied for 40 years in DB, also in AI, NT.

The classical setting is centralized: detect items representing the same entity, then merge/output all distinct entities.

This work: distributed, statistical estimations. We want more communication-efficient algorithms (o(input size)), without a comprehensive de-duplication.

See, e.g., [Gill & Goldacre’03, Koudas et al.’06, Elmagarmid et al.’07, Herzog et al.’07, Dong & Naumann’09, Willinger et al.’09, Christen’12] for introductions, and [Getoor and Machanavajjhala’12] for a tutorial.


Our goal and problem

Problem: how can we perform noise-resilient statistical estimation in the coordinator model communication-efficiently?

Assume all parties are provided with a pairwise distance metric and a threshold determining whether two items u, v represent the same entity (denoted u ∼ v) or not.

[Figure: sites S1, ..., Sk and coordinator C.]

Goal: minimize communication and #rounds.

The design of the distance metric is a separate issue. We design a framework so that users can plug in any “distance metric” at run time.
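As an illustration of the plug-in contract (the function names below are hypothetical, not the paper's API), a user might supply edit distance with a threshold, and the algorithms would only ever evaluate the induced predicate u ∼ v:

```python
# Sketch of a pluggable "same entity" predicate (hypothetical interface).
# The algorithms depend only on similar(u, v), never on the metric itself.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit (Levenshtein) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def make_similarity(dist, threshold):
    """Build the predicate u ~ v from any metric and a threshold."""
    return lambda u, v: dist(u, v) <= threshold

similar = make_similarity(edit_distance, threshold=3)
print(similar("Joe Smith", "Jo Smith"))    # True  (distance 1)
print(similar("Joe Smith", "IBM Corp"))    # False
```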


Remarks

Remark 1. We do not specify the distance function in our algorithms, for two reasons:
(1) It allows our algorithms to work with any distance function.
(2) It is often hard to assume that similarities between items can be expressed by a well-known distance function: “AT&T Corporation” is closer to “IBM Corporation” than to “AT&T Corp” under the edit distance!

Remark 2. We assume transitivity: if u ∼ v and v ∼ w, then u ∼ w. In other words, the noise is “well-shaped”. One may worry about the following problematic situation: we have a ∼ b, b ∼ c, ..., y ∼ z, and yet a ≁ z. Our algorithms still work if the number of such “outliers” is small.
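Under transitivity, ∼ behaves as an equivalence relation, so conceptually the groups are just connected components. A centralized union-find sketch (illustrative only; the paper's distributed algorithms never materialize the groups):

```python
from difflib import SequenceMatcher

# Toy predicate u ~ v; any user-supplied metric plus threshold would do.
def similar(u: str, v: str) -> bool:
    return SequenceMatcher(None, u, v).ratio() >= 0.85

def group(items, similar):
    """Group items into equivalence classes, relying on transitivity."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if similar(items[i], items[j]):
                parent[find(i)] = find(j)   # union the two components

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())

items = ["Joe Smith", "Jo Smith", "Joseph Smith", "IBM Corp"]
# "Jo Smith" and "Joseph Smith" are linked *through* "Joe Smith":
# exactly the transitivity being assumed.
print(group(items, similar))  # [['Joe Smith', 'Jo Smith', 'Joseph Smith'], ['IBM Corp']]
```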


Remarks (cont.)

Remark 3. There do exist approaches that avoid the transitivity assumption, e.g., assuming the so-called ICAR properties [BGM+09], or using clustering-based approaches [ACN08]. They are unlikely to yield communication-efficient algorithms in our setting.

Remark 4. Is there a magic hash function that maps (only) items in the same group into the same bucket and can be described succinctly? Answer: NO.


A few notations

• We have k sites (machines), each holding a multiset of items Si.
• Let the multiset S = ∪_{i∈[k]} Si, and let m = |S|.
• Under the transitivity assumption, S can be partitioned into a set of groups G = {G1, ..., Gn}. Each group Gi represents a distinct universe element.
• Õ(·) hides polylog(m/ε) factors.


Our results

Problem       | Noisy data: bits         | Rounds | Noise-free data: bits
------------- | ------------------------ | ------ | --------------------------------
F0            | Õ(min{k/ε^3, k^2/ε^2})   | Õ(1)   | Ω(k/ε^2) [WZ12, WZ14]
L0-sampling   | Õ(k)                     | Õ(1)   | Ω(k)
Fp (p ≥ 1)    | Õ((k^{p−1} + k^3)/ε^3)   | O(1)   | Ω(k^{p−1}/ε^2) [WZ12]
(φ, ε)-HH     | Õ(min{k/ε, 1/ε^2})       | O(1)   | Ω(min{√k/ε, 1/ε^2}) [HYZ12, WZ12]
Entropy       | Õ(k/ε^2)                 | O(1)   | Ω(k/ε^2) [WZ12]

1. The p-th frequency moment: Fp(S) = Σ_{i∈[n]} |Gi|^p. We consider F0 and Fp (p ≥ 1), and allow a (1 + ε)-approximation.
2. L0-sampling on S: return a group Gi (or an arbitrary item in Gi) uniformly at random from G.
3. (φ, ε)-heavy-hitter of S (0 < ε ≤ φ ≤ 1) (definition omitted).
4. Empirical entropy: Entropy(S) = Σ_{i∈[n]} (|Gi|/m) · log(m/|Gi|). We allow a (1 + ε)-approximation.
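As a reference point, here is what these quantities mean once the groups are given explicitly (a centralized illustration of the definitions, not the paper's estimators):

```python
import math
import random

# Reference (centralized) definitions of the statistics on the groups
# G = {G1, ..., Gn}; the distributed algorithms only *estimate* these.
groups = [["a", "a'", "a''"], ["b"], ["c", "c'"]]    # n = 3 groups, m = 6 items
m = sum(len(g) for g in groups)

f0 = len(groups)                                      # F0 = n
fp = lambda p: sum(len(g) ** p for g in groups)       # Fp = sum_i |Gi|^p
entropy = sum(len(g) / m * math.log2(m / len(g)) for g in groups)

# L0-sampling: a uniformly random *group* (not a uniformly random item).
sampled_group = random.choice(groups)

print(f0, fp(2), round(entropy, 3), sampled_group)    # 3, 14, ~1.459, ...
```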


Take-home message:

In the distributed setting, we can handle well-shaped noise in statistical estimations almost for free!


Rest of the talk: algorithms for F0

1. Simple-Sampling: simple; Õ(k^2/ε^2) communication, 2 rounds.
2. Local hierarchical partition + distributed rejection sampling: complicated; Õ(k/ε^3) communication, Õ(1) rounds.

The second algorithm is better than Õ(k^2/ε^2) bits because (1) we want to scale in k, and (2) it is later used in L0-sampling with ε = Θ(1).


Simple-Sampling

Algorithm Simple-Sampling:
1. Let m = |S| = Σ_{i∈[k]} |Si|.
2. For j = 1, ..., η = Θ(k/ε^2):
   (a) jointly sample a random item uj ∈ S; let G_{uj} be the group containing uj;
   (b) jointly compute |G_{uj}|, and set Xj = 1/|G_{uj}|.
3. Output (m/η) · Σ_{j∈[η]} Xj.

The estimator is unbiased: a uniform sample lands in group Gi with probability |Gi|/m, so E[Xj] = Σ_{i∈[n]} (|Gi|/m) · (1/|Gi|) = n/m, and the output has expectation n = F0.

Theorem. Simple-Sampling gives a (1 + ε)-approximation of F0 with probability 2/3, using Õ(k^2/ε^2) bits and 2 rounds.
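A centralized simulation of the algorithm (a sketch only: the hypothetical `group_of` callback stands in for the protocol's "jointly compute |G_u|" step, and the two joint steps each cost one round in the real protocol):

```python
import random

def simple_sampling_f0(bags, group_of, eps, c=4):
    """Centralized simulation of Simple-Sampling.

    bags:     list of k multisets (one per site)
    group_of: maps an item to (an id of) its group; emulates the protocol's
              'jointly compute |G_u|' step
    """
    S = [x for bag in bags for x in bag]          # the union multiset
    m = len(S)
    k = len(bags)
    eta = max(1, int(c * k / eps ** 2))           # eta = Theta(k / eps^2) samples

    group_sizes = {}                              # |G| for each group id
    for x in S:
        g = group_of(x)
        group_sizes[g] = group_sizes.get(g, 0) + 1

    total = 0.0
    for _ in range(eta):
        u = random.choice(S)                      # step (a): uniform item from S
        total += 1.0 / group_sizes[group_of(u)]   # step (b): X_j = 1/|G_u|
    return m / eta * total                        # unbiased estimator of F0 = n

# Toy run: x // 10 identifies the group (the last digit plays the "noise").
bags = [[10, 11, 20], [21, 30], [31, 32, 40]]
est = simple_sampling_f0(bags, group_of=lambda x: x // 10, eps=0.2)
print(est)   # close to F0 = 4 (groups 1, 2, 3, 4)
```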


Hierarchical partition + distributed rejection sampling

Õ(k/ε^3) bits, Õ(1) rounds.


Main idea: reduce the variance of Xj in Simple-Sampling.
– Suppose we could partition all groups in G into classes G0, ..., G_{log k} such that Gℓ = {G ∈ G : |G| ∈ (2^{ℓ−1}, 2^ℓ]}, and apply Simple-Sampling on each class individually. Within a class the group sizes differ by at most a factor of 2, which shaves a factor of k off the number of samples Xj needed (η : k/ε^2 → 1/ε^2).
– However, we cannot afford to partition the groups into classes in the distributed setting.
– Our techniques: a local hierarchical partition (which may be inconsistent across sites) + distributed rejection sampling (to resolve the inconsistency). A sketch of the idealized classing idea follows below.

Fairly complicated (uses Algorithm Simple-Sampling as a subroutine). See the paper for details.
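To fix intuition, here is the idealized (non-distributed) version of the classing idea, assuming the exact size class of every group is known, which is precisely the assumption that fails across sites and that the local partition plus rejection sampling machinery repairs:

```python
import math
import random
from collections import Counter

def classed_f0(items, group_of, eps, c=4):
    """Idealized variance-reduced F0 estimate: assumes the exact size class
    of every group is known (the hard part in the distributed algorithm)."""
    sizes = Counter(group_of(x) for x in items)        # |G| for each group

    est = 0.0
    num_classes = math.ceil(math.log2(max(sizes.values()))) + 1
    for l in range(num_classes):
        lo, hi = 2 ** l // 2, 2 ** l                   # class l: |G| in (2^(l-1), 2^l]
        cls = [x for x in items if lo < sizes[group_of(x)] <= hi]
        if not cls:
            continue
        # Within one class, X = 1/|G| varies by at most a factor of 2,
        # so Theta(1/eps^2) samples per class suffice.
        eta = max(1, int(c / eps ** 2))
        total = sum(1.0 / sizes[group_of(random.choice(cls))] for _ in range(eta))
        est += len(cls) / eta * total                  # ~ #groups in class l
    return est

items = [10, 11, 20, 21, 30, 31, 32, 40]               # group id = x // 10
print(classed_f0(items, lambda x: x // 10, eps=0.2))   # close to F0 = 4
```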


Other problems

1. L0-sampling: Õ(k) communication and Õ(1) rounds. Uses the algorithm for F0 as a subroutine.
2. p-th frequency moment: Õ((k^{p−1} + k^3)/ε^3) communication and Õ(1) rounds. Adapts a recent algorithm by Kannan, Vempala and Woodruff (COLT 2014).
3. (φ, ε)-heavy-hitter: Õ(min{k/ε, 1/ε^2}) communication and O(1) rounds. Easy.
4. Empirical entropy: Õ(k/ε^2) communication and O(1) rounds. Adapts a data-stream algorithm by Chakrabarti, Cormode and McGregor (SODA 2007).



Open problems

• A number of bounds can possibly be improved. For example:
  – Can we get a (better) upper bound of Õ(k/ε^2) for F0?
  – Can we improve the round complexities of F0 and L0-sampling from Õ(1) to O(1)?
  – Can we remove the k^3 factor in the communication cost for Fp?
• Can we obtain efficient algorithms for L2-heavy-hitters and Lp-sampling?
• Lower bounds?
• Relax/replace the transitivity assumption.

Thank you! Questions?