Sublinear Algorithms for Big Data, Part 4: Random Topics. Qin Zhang (PowerPoint presentation)



SLIDE 1

Qin Zhang

Sublinear Algorithms for Big Data Part 4: Random Topics

SLIDE 2

Topic 3: Random sampling in distributed data streams

(based on a paper with Cormode, Muthukrishnan and Yi, PODS’10, JACM’12)

SLIDE 3

Distributed streaming

  • Adaptive filters [Olston, Jiang, Widom, SIGMOD’03]
  • A generic geometric approach [Sharfman et al., SIGMOD’06]
  • Prediction models [Cormode, Garofalakis, Muthukrishnan, Rastogi, SIGMOD’05]

Motivated by database/networking applications:
  • environment monitoring
  • network monitoring
  • sensor networks
  • cloud computing

SLIDE 4

Reservoir sampling [Waterman ’??; Vitter ’85]

Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items. Every subset of size s has equal probability of being the sample.

SLIDE 5

Reservoir sampling [Waterman ’??; Vitter ’85]

Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items. Every subset of size s has equal probability of being the sample.

When the i-th item arrives:
  • with probability 1 − s/i, throw it away;
  • with probability s/i, use it to replace an item in the current sample chosen uniformly at random.
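The two-case rule above can be sketched in a few lines of Python (a minimal illustration of Algorithm R; the function name is my own):

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Uniform sample without replacement of size s from a stream."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            sample.append(item)              # first s items fill the reservoir
        elif rng.random() < s / i:           # keep the i-th item w.p. s/i
            sample[rng.randrange(s)] = item  # evict a uniformly chosen slot
    return sample
```

Note that the stream is consumed one item at a time and only O(s) memory is used, which is what makes the algorithm suitable for streams of unknown length.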

SLIDE 6

Reservoir sampling from distributed streams

[Figure: sites S1, S2, S3, …, Sk streaming over time to a coordinator C]

When k = 1, reservoir sampling has cost Θ(s log n). When k ≥ 2, reservoir sampling has cost O(n), because it is costly to track i (the total number of items seen so far).

SLIDE 7

Reservoir sampling from distributed streams

[Figure: sites S1, S2, S3, …, Sk streaming over time to a coordinator C]

When k = 1, reservoir sampling has cost Θ(s log n). When k ≥ 2, reservoir sampling has cost O(n), because it is costly to track i.

Tracking i approximately? Then the sampling won’t be uniform.

SLIDE 8

Reservoir sampling from distributed streams

[Figure: sites S1, S2, S3, …, Sk streaming over time to a coordinator C]

When k = 1, reservoir sampling has cost Θ(s log n). When k ≥ 2, reservoir sampling has cost O(n), because it is costly to track i.

Tracking i approximately? Then the sampling won’t be uniform.

Key observation: we don’t have to know the size of the population in order to sample!

SLIDE 9

Basic idea: binary Bernoulli sampling

SLIDE 10

[Figure: table of binary coin flips; each item is active in row j with probability 2^(−j)]

Basic idea: binary Bernoulli sampling

SLIDE 11

[Figure: table of binary coin flips; each item is active in row j with probability 2^(−j)]

Conditioned upon a row having ≥ s active items, we can draw a sample from the active items.

Basic idea: binary Bernoulli sampling

SLIDE 12

[Figure: table of binary coin flips; each item is active in row j with probability 2^(−j)]

Conditioned upon a row having ≥ s active items, we can draw a sample from the active items. The coordinator could maintain a Bernoulli sample of size between s and O(s).

Basic idea: binary Bernoulli sampling
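The binary Bernoulli idea can be sketched as follows (a hedged illustration; the helper names are my own): each item flips fair coins, and it is “active” in row j exactly when its first j flips are heads, i.e. with probability 2^(−j). Drawing s items uniformly from a row with ≥ s active items gives a uniform sample without replacement:

```python
import random

def coin_level(rng):
    """Number of leading heads; an item is active in row j w.p. 2**-j."""
    level = 0
    while rng.random() < 0.5:
        level += 1
    return level

def row_sample(items, j, s, rng):
    """Draw s items uniformly from row j's active items, if there are >= s."""
    levels = {x: coin_level(rng) for x in items}
    active = [x for x in items if levels[x] >= j]
    return rng.sample(active, s) if len(active) >= s else None
```

The point of the construction is that no count of the total population is needed: the row index plays the role that the stream position i plays in classic reservoir sampling.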

SLIDE 13

Random sampling – Algorithm

Initialize i = 0. In epoch i:
  • Sites send in every item w.pr. 2^(−i)

[Figure: sites S1, S2, S3, …, Sk and coordinator C]

[with Cormode, Muthu & Yi, PODS ’10, JACM ’12]

SLIDE 14

Random sampling – Algorithm

Initialize i = 0. In epoch i:
  • Sites send in every item w.pr. 2^(−i)
  • Coordinator maintains a lower sample and an upper sample: each received item goes to either with equal probability
    (each item is thus included in the lower sample w.pr. 2^(−(i+1)))

[Figure: sites S1, S2, S3, …, Sk and coordinator C with upper/lower samples]

[with Cormode, Muthu & Yi, PODS ’10, JACM ’12]

SLIDE 15

Random sampling – Algorithm

Initialize i = 0. In epoch i:
  • Sites send in every item w.pr. 2^(−i)
  • Coordinator maintains a lower sample and an upper sample: each received item goes to either with equal probability
    (each item is thus included in the lower sample w.pr. 2^(−(i+1)))
  • When the lower sample reaches size s, the coordinator
    - broadcasts to the k sites: advance to epoch i ← i + 1
    - discards the upper sample
    - randomly splits the lower sample into a new lower sample and a new upper sample

[Figure: sites S1, S2, S3, …, Sk and coordinator C with upper/lower samples]

[with Cormode, Muthu & Yi, PODS ’10, JACM ’12]

SLIDE 16

Random sampling – Algorithm

Initialize i = 0. In epoch i:
  • Sites send in every item w.pr. 2^(−i)
  • Coordinator maintains a lower sample and an upper sample: each received item goes to either with equal probability
    (each item is thus included in the lower sample w.pr. 2^(−(i+1)))
  • When the lower sample reaches size s, the coordinator
    - broadcasts to the k sites: advance to epoch i ← i + 1
    - discards the upper sample
    - randomly splits the lower sample into a new lower sample and a new upper sample

[Figure: sites S1, S2, S3, …, Sk and coordinator C with upper/lower samples]

Correctness:
  (1): In epoch i, each item is maintained in C w.pr. 2^(−i)

[with Cormode, Muthu & Yi, PODS ’10, JACM ’12]

SLIDE 17

Random sampling – Algorithm

Initialize i = 0. In epoch i:
  • Sites send in every item w.pr. 2^(−i)
  • Coordinator maintains a lower sample and an upper sample: each received item goes to either with equal probability
    (each item is thus included in the lower sample w.pr. 2^(−(i+1)))
  • When the lower sample reaches size s, the coordinator
    - broadcasts to the k sites: advance to epoch i ← i + 1
    - discards the upper sample
    - randomly splits the lower sample into a new lower sample and a new upper sample

[Figure: sites S1, S2, S3, …, Sk and coordinator C with upper/lower samples]

Correctness:
  (1): In epoch i, each item is maintained in C w.pr. 2^(−i)
  (2): Always ≥ s items are maintained in C

[with Cormode, Muthu & Yi, PODS ’10, JACM ’12]
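The epoch-based protocol can be sketched as a single-process simulation (a hedged illustration, not the paper’s implementation: the site-side filtering and the coordinator’s routing are collapsed into one loop, and `distributed_sample` is my own name):

```python
import random

def distributed_sample(items, s, rng):
    """Simulate the epoch protocol: in epoch i sites forward items w.p. 2**-i;
    the coordinator routes each arrival to the lower/upper sample uniformly,
    and advances the epoch once the lower sample reaches size s."""
    i = 0
    lower, upper = [], []
    messages = 0
    for item in items:                      # interleaved arrivals at the sites
        if rng.random() < 2.0 ** -i:        # site-side filter (w.p. 2**-i)
            messages += 1                   # one message: site -> coordinator
            (lower if rng.random() < 0.5 else upper).append(item)
            while len(lower) >= s:          # epoch change
                i += 1                      # broadcast i+1 to the k sites
                old_lower, lower, upper = lower, [], []
                for x in old_lower:         # discard upper, re-split old lower
                    (lower if rng.random() < 0.5 else upper).append(x)
    return lower, upper, i, messages
```

At any time every surviving item is retained with probability 2^(−i), matching correctness property (1), and after the first epoch change the coordinator always holds ≥ s items, matching (2).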

SLIDE 18

A running example

[Figure: k = 4 sites S1–S4 and coordinator C; Epoch 0 (p = 1); maintain s = 3 samples in the upper/lower samples at C]

SLIDE 19

A running example

[Figure: Epoch 0 (p = 1), s = 3; item 1 arrives at C]

SLIDE 20

A running example

[Figure: Epoch 0 (p = 1), s = 3; item 1 is placed in one of the samples at C]

SLIDE 21

A running example

[Figure: Epoch 0 (p = 1), s = 3; item 2 arrives at C]

SLIDE 22

A running example

[Figure: Epoch 0 (p = 1), s = 3; item 2 is placed in one of the samples at C]

SLIDE 23

A running example

[Figure: Epoch 0 (p = 1), s = 3; items 3 and 4 arrive and are placed in the samples at C]

SLIDE 24

A running example

[Figure: Epoch 0 (p = 1), s = 3; item 5 arrives at C]

SLIDE 25

A running example

[Figure: Epoch 0 (p = 1), s = 3; item 5 is placed in one of the samples at C]

SLIDE 26

A running example

[Figure: Epoch 0 (p = 1), s = 3; items 1–5 have arrived at C]

Now |lower sample| = 3:
  • discard the upper sample
  • split the lower sample
  • advance to Epoch 1
SLIDE 27

A running example

[Figure: the old lower sample is split into a new lower sample and a new upper sample; advance to Epoch 1]

SLIDE 28

A running example (cont.)

[Figure: Epoch 1 (p = 1/2), s = 3; sites now forward each item w.pr. 1/2]

SLIDE 29

A running example (cont.)

[Figure: Epoch 1 (p = 1/2); item 6 is discarded locally at its site]

SLIDE 30

A running example (cont.)

[Figure: Epoch 1 (p = 1/2); item 7 is forwarded to C]

SLIDE 31

A running example (cont.)

[Figure: Epoch 1 (p = 1/2); item 8 is forwarded to C]

SLIDE 32

A running example (cont.)

[Figure: Epoch 1 (p = 1/2); item 9 is discarded locally at its site]

SLIDE 33

A running example (cont.)

[Figure: Epoch 1 (p = 1/2); item 10 arrives at its site]

SLIDE 34

A running example (cont.)

[Figure: Epoch 1 (p = 1/2); item 10 is forwarded to C]

SLIDE 35

A running example (cont.)

[Figure: Epoch 1 (p = 1/2); items 6 and 9 were discarded locally, items 7, 8, and 10 reached C]

Again |lower sample| = 3:
  • discard the upper sample
  • split the lower sample
  • advance to Epoch 2

SLIDE 36

A running example (cont.)

[Figure: after the split the coordinator keeps items 1, 5, and 10; advance to Epoch 2]
SLIDE 37

A running example (cont.)

[Figure: Epoch 2 (p = 1/4); the coordinator keeps items 1, 5, and 10]

More items will be discarded locally.

SLIDE 38

A running example (cont.)

[Figure: Epoch 2 (p = 1/4); the coordinator keeps items 1, 5, and 10]

More items will be discarded locally.

Intuition: maintain a sampling probability p ≈ s/n at each site (n: total # of items) without knowing n.

SLIDE 39

Random sampling – Analysis

Analysis (messages sent):
  • messages sent per epoch: O(k + s)
  • # epochs: O(log n)
  • total: O(k + s) × O(log n) = O((k + s) log n)

[Algorithm recap and figure as on the previous slides]

SLIDE 40

Random sampling – Analysis and experiments

The bound can be
  • 1. improved to Θ(k log_{k/s} n + s log n), and
  • 2. extended to sliding-window cases.

SLIDE 41

Random sampling – Analysis and experiments

[Plots: experiments on the real data set from the 1998 World Cup logs; practice vs. theory.
  (1) Basic case, cost vs. sample size (n = 7,000,000, k = 128);
  (2) time-based sliding window, cost vs. sample size (n = 320,000, k = 128);
  (3) time-based sliding window, cost vs. window size (s = 128, k = 128)]

The bound can be
  • 1. improved to Θ(k log_{k/s} n + s log n), and
  • 2. extended to sliding-window cases.

SLIDE 42

Random sampling – Analysis and experiments

[Plot: basic case, cost vs. sample size; experiments on the 1998 World Cup logs]

  • total # items n = 7,000,000
  • # items sent ≈ 4,000
  • size of sample s = 128
  • # sites k = 128

The bound can be
  • 1. improved to Θ(k log_{k/s} n + s log n), and
  • 2. extended to sliding-window cases.
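As a rough sanity check (my own back-of-the-envelope computation, not from the slides), the O((k + s) log n) bound lands in the same ballpark as the ≈ 4,000 items reported for this experiment:

```python
import math

# parameters of the basic-case experiment
k, s, n = 128, 128, 7_000_000

# O((k + s) log n) with the hidden constant dropped
bound = (k + s) * math.log2(n)   # roughly 5.8e3, same order as ~4,000 observed
```

The observed count being a little below the constant-free bound is consistent with the theory-vs-practice gap shown in the plots.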

SLIDE 43

Sampling from a (time-based) sliding window

[Figure: timeline t showing expired windows, the frozen window, the current window, and the sliding window]

SLIDE 44

Sampling from a (time-based) sliding window

[Figure: timeline t with expired windows, the frozen window, and the current window]

Sample for the sliding window =
  (1) a subsample of the (unexpired) sample of the frozen window (this needs new ideas), plus
  (2) a subsample of the sample of the current window (by ISWoR)

SLIDE 45

Sampling from a (time-based) sliding window

[Figure: timeline t with expired windows, the frozen window, and the current window]

Sample for the sliding window =
  (1) a subsample of the (unexpired) sample of the frozen window (this needs new ideas), plus
  (2) a subsample of the sample of the current window (by ISWoR)

(1) and (2) may be sampled at different rates, but as long as both have size ≥ min{s, # live items}, that is fine.

SLIDE 46

Sampling from a (time-based) sliding window

[Figure: timeline t with expired windows, the frozen window, and the current window]

Sample for the sliding window =
  (1) a subsample of the (unexpired) sample of the frozen window (this needs new ideas), plus
  (2) a subsample of the sample of the current window (by ISWoR)

(1) and (2) may be sampled at different rates, but as long as both have size ≥ min{s, # live items}, that is fine.

The key issue: how to guarantee “both have size ≥ s” as items in the frozen window keep expiring?

SLIDE 47

Sampling from a (time-based) sliding window

[Figure: timeline t with expired windows, the frozen window, and the current window]

Sample for the sliding window =
  (1) a subsample of the (unexpired) sample of the frozen window (this needs new ideas), plus
  (2) a subsample of the sample of the current window (by ISWoR)

(1) and (2) may be sampled at different rates, but as long as both have size ≥ min{s, # live items}, that is fine.

The key issue: how to guarantee “both have size ≥ s” as items in the frozen window keep expiring?

Solution: in the frozen window, find a good sampling rate such that the sample size is ≥ s.
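Merging the two parts at a common rate can be sketched as follows (my own illustration of the standard thinning step; function names are hypothetical): a Bernoulli(p) sample is downsampled to a Bernoulli(q) one, for q ≤ p, by keeping each item independently with probability q/p:

```python
import random

def thin(sample, p, q, rng):
    """Downsample a Bernoulli(p) sample to a Bernoulli(q) one (q <= p)."""
    assert q <= p
    return [x for x in sample if rng.random() < q / p]

def window_sample(frozen_part, p_frozen, current_part, p_current, rng):
    """Combine the frozen-window and current-window samples at the lower
    of the two rates, so that the union is a single Bernoulli sample."""
    q = min(p_frozen, p_current)
    return thin(frozen_part, p_frozen, q, rng) + thin(current_part, p_current, q, rng)
```

The part already at the common rate is kept whole (q/p = 1), while the more densely sampled part is thinned, which is why both parts having size ≥ min{s, # live items} at their own rates suffices.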

SLIDE 48

Dealing with the frozen window

[Figure: timeline t with the frozen window divided into sampling levels]

Keep all the levels? That needs O(w) communication.

SLIDE 49

Dealing with the frozen window

[Figure: level-sampling structure over the timeline (s = 2)]

Keep all the levels? That needs O(w) communication.

Instead, keep the most recent sampled items in a level until s of them are also sampled at the next level. Total size: O(s log w).

SLIDE 50

Dealing with the frozen window

[Figure: level-sampling structure over the timeline (s = 2)]

Keep all the levels? That needs O(w) communication.

Instead, keep the most recent sampled items in a level until s of them are also sampled at the next level. Total size: O(s log w).

Guaranteed: there is a window (blue in the figure) with ≥ s sampled items that covers the unexpired portion of the frozen window.

SLIDE 51

Dealing with the frozen window: The algorithm

Each site builds its own level-sampling structure for the current window until it freezes.

[Figure: level-sampling structure (s = 2)]

This needs O(s log w) space and O(1) time per item.
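A sketch of such a level-sampling structure (a hedged illustration under the slides' description; the class and method names are my own): an item survives to level j with probability 2^(−j), and each level keeps only its most recent items, truncated once s of them also appear at the next level, for O(s log w) total size.

```python
import random

class LevelSampling:
    """Level j holds a suffix of the items sampled w.p. 2**-j, truncated
    once s of its items also appear at level j+1."""

    def __init__(self, s, max_level, rng):
        self.s, self.rng = s, rng
        self.levels = [[] for _ in range(max_level + 1)]

    def insert(self, item):
        # the item survives to level j w.p. 2**-j (consecutive fair coin flips)
        j = 0
        self.levels[0].append(item)
        while j + 1 < len(self.levels) and self.rng.random() < 0.5:
            j += 1
            self.levels[j].append(item)
        self._truncate()

    def _truncate(self):
        for j in range(len(self.levels) - 1):
            promoted = set(self.levels[j + 1])
            kept, seen = [], 0
            # keep the most recent items until s of them are also at level j+1
            for x in reversed(self.levels[j]):
                kept.append(x)
                if x in promoted:
                    seen += 1
                    if seen >= self.s:
                        break
            self.levels[j] = list(reversed(kept))
```

Each insert does O(1) expected coin flips, and regardless of where the frozen window's expiry point falls, some level still holds ≥ s unexpired sampled items.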

SLIDE 52

Dealing with the frozen window: The algorithm

Each site builds its own level-sampling structure for the current window until it freezes.

[Figure: level-sampling structure (s = 2)]

This needs O(s log w) space and O(1) time per item.

When the current window freezes:
  For each level, do a k-way merge to build that level of the global structure at the coordinator. Total communication: O((k + s) log w).