Sublinear Algorithms for Big Data Part 4: Random Topics. Qin Zhang.

  1. Sublinear Algorithms for Big Data Part 4: Random Topics. Qin Zhang.

  2. Topic 3: Random sampling in distributed data streams (based on a paper with Cormode, Muthukrishnan and Yi, PODS '10, JACM '12).

  3. Distributed streaming. Motivated by database/networking applications: sensor networks, network monitoring, environment monitoring, cloud computing. Representative prior approaches:
   • Adaptive filters [Olston, Jiang, Widom, SIGMOD '03]
   • A generic geometric approach [Sharfman et al., SIGMOD '06]
   • Prediction models [Cormode, Garofalakis, Muthukrishnan, Rastogi, SIGMOD '05]

  4. Reservoir sampling [Waterman '??; Vitter '85]. Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items: every subset of size s has equal probability of being the sample.

  5. Reservoir sampling [Waterman '??; Vitter '85]. Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items: every subset of size s has equal probability of being the sample. When the i-th item arrives:
   • with probability s/i, use it to replace an item in the current sample chosen uniformly at random;
   • with probability 1 − s/i, throw it away.
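A minimal Python sketch of this update rule (the function name and the stream representation are illustrative, not from the slides):

    import random

    def reservoir_sample(stream, s):
        """Keep a uniform size-s sample (w/o replacement) from a stream of unknown length."""
        sample = []
        for i, item in enumerate(stream, start=1):   # i = number of items seen so far
            if i <= s:
                sample.append(item)                   # the first s items fill the reservoir
            elif random.random() < s / i:             # keep the i-th item w.pr. s/i ...
                sample[random.randrange(s)] = item    # ... replacing a uniformly random slot
            # otherwise (w.pr. 1 - s/i) throw the item away
        return sample

    # Example: a size-10 sample from a stream of 1,000 items.
    print(reservoir_sample(range(1000), 10))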

  6. Reservoir sampling from distributed streams (k sites S1, ..., Sk sending items over time to a coordinator C). When k = 1, reservoir sampling has cost Θ(s log n). When k ≥ 2, reservoir sampling has cost O(n), because it is costly to track i.

  7. Reservoir sampling from distributed streams. When k = 1, reservoir sampling has cost Θ(s log n); when k ≥ 2, it has cost O(n), because it is costly to track i. Tracking i approximately? Then the sampling won't be uniform.

  8. Reservoir sampling from distributed streams. When k = 1, reservoir sampling has cost Θ(s log n); when k ≥ 2, it has cost O(n), because it is costly to track i. Tracking i approximately? Then the sampling won't be uniform. Key observation: we don't have to know the size of the population in order to sample!

  9. Basic idea: binary Bernoulli sampling.

  10. Basic idea: binary Bernoulli sampling. [Figure: a table of independent 0/1 coin flips over the stream items, one row per sampling level; 1 marks an active item.]

  11. Basic idea: binary Bernoulli sampling. Conditioned upon a row having ≥ s active items, we can draw a sample from the active items.

  12. Basic idea: binary Bernoulli sampling. Conditioned upon a row having ≥ s active items, we can draw a sample from the active items. The coordinator could maintain a Bernoulli sample of size between s and O(s).
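A minimal Python sketch of this idea, under my reading of the coin-flip table: each item independently gets a geometric "level" (the number of consecutive heads of a fair coin), so it is active in row j with probability 2^(-j); we then take the deepest row that still has at least s active items and return a uniform s-subset of them. Function names are illustrative.

    import random

    def coin_level():
        """Number of consecutive heads before the first tail, so P[level >= j] = 2^-j."""
        level = 0
        while random.random() < 0.5:
            level += 1
        return level

    def binary_bernoulli_sample(items, s):
        """Assumes len(items) >= s; returns a uniform sample of s items."""
        levels = [coin_level() for _ in items]
        j, active = 0, list(items)                  # row 0: every item is active
        while True:
            deeper = [x for x, lv in zip(items, levels) if lv >= j + 1]
            if len(deeper) < s:                     # row j is the deepest row with >= s actives
                break
            j, active = j + 1, deeper
        return random.sample(active, s)             # uniform s-subset of that row's active items

    print(binary_bernoulli_sample(list(range(25)), 3))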

  13. Random sampling – Algorithm [with Cormode, Muthu & Yi, PODS '10, JACM '12]. Setup: sites S1, S2, ..., Sk and a coordinator C. Initialize i = 0. In epoch i, sites send in every item w.pr. 2^(-i).

  14. The coordinator maintains a lower sample and an upper sample: each received item goes to either with equal probability. (Each item is therefore included in the lower sample w.pr. 2^(-(i+1)).)

  15. When the lower sample reaches size s, the coordinator:
   • broadcasts to the k sites: advance to epoch i ← i + 1;
   • discards the upper sample;
   • randomly splits the lower sample into a new lower sample and a new upper sample.

  16. Correctness (1): in epoch i, each item is maintained in C w.pr. 2^(-i).

  17. Correctness (2): always ≥ s items are maintained in C.
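A minimal single-process simulation of the protocol described above, as a sketch under the stated rules: the class and method names are my own, and each site reads the current epoch directly instead of waiting for a broadcast message.

    import random

    class Coordinator:
        def __init__(self, s):
            self.s = s
            self.epoch = 0      # epoch i: sites forward each item w.pr. 2^-i
            self.lower = []     # items kept w.pr. 2^-(i+1)
            self.upper = []     # items forwarded to C but not in the lower sample

        def receive(self, item):
            # Each forwarded item goes to the lower or upper sample with equal probability,
            # so it ends up in the lower sample w.pr. 2^-(epoch+1) overall.
            (self.lower if random.random() < 0.5 else self.upper).append(item)
            while len(self.lower) >= self.s:
                # "Broadcast" the epoch advance: discard the upper sample, re-split the lower one.
                self.epoch += 1
                old_lower, self.lower, self.upper = self.lower, [], []
                for x in old_lower:
                    (self.lower if random.random() < 0.5 else self.upper).append(x)

    class Site:
        def __init__(self, coordinator):
            self.c = coordinator

        def observe(self, item):
            # Forward the item w.pr. 2^-i for the current epoch i; otherwise discard it locally.
            if random.random() < 2.0 ** -self.c.epoch:
                self.c.receive(item)

    # Usage: k = 4 sites, target sample size s = 10, items routed to sites arbitrarily.
    coord = Coordinator(s=10)
    sites = [Site(coord) for _ in range(4)]
    for item in range(100_000):
        random.choice(sites).observe(item)
    # After the first epoch change, the coordinator always holds >= s items.
    print(coord.epoch, len(coord.lower) + len(coord.upper))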

  18. A running example. Maintain s = 3 samples. Epoch 0 (p = 1). The coordinator C holds an (initially empty) upper sample and lower sample; items arrive at sites S1, S2, S3, S4.

  19. Item 1 arrives at a site and is forwarded to C (p = 1).

  20. C places item 1 in the lower sample: lower = {1}, upper = {}.

  21. Item 2 arrives and is forwarded to C.

  22. C places item 2 in the upper sample: lower = {1}, upper = {2}.

  23. Items 3 and 4 arrive and are forwarded: lower = {1, 4}, upper = {2, 3}.

  24. Item 5 arrives and is forwarded to C.

  25. C places item 5 in the lower sample: lower = {1, 4, 5}, upper = {2, 3}.

  26. Now |lower sample| = 3, so the coordinator: • discards the upper sample • splits the lower sample • advances to Epoch 1.

  27. After the split: lower = {1, 5}, upper = {4}.

  28. A running example (cont.). Maintain s = 3 samples. Epoch 1 (p = 1/2). State at C: lower = {1, 5}, upper = {4}.

  29. Item 6 arrives and is discarded at its site (not forwarded).

  30. Item 7 is forwarded; C places it in the upper sample: lower = {1, 5}, upper = {4, 7}.

  31. Item 8 is forwarded; C places it in the upper sample: lower = {1, 5}, upper = {4, 7, 8}.

  32. Item 9 arrives and is discarded at its site.

  33. Item 10 is forwarded to C.

  34. C places item 10 in the lower sample: lower = {1, 5, 10}, upper = {4, 7, 8}.

  35. Again |lower sample| = 3, so the coordinator: • discards the upper sample • splits the lower sample • advances to Epoch 2.

  36. After the split: lower = {10}, upper = {1, 5}.

  37. Epoch 2 (p = 1/4): more items will be discarded locally.

  38. Intuition: the protocol maintains at each site a sampling probability p ≈ s/n (n: total # of items) without knowing n.
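To make the intuition concrete, here is a small self-contained check (a sketch with illustrative names, not code from the paper): simulating only the coordinator's epoch counter shows that the final forwarding probability 2^(-i) ends up on the order of s/n.

    import random

    def final_forwarding_prob(n, s):
        """Simulate the epoch / sample-size evolution over n items and return the final 2^-i."""
        epoch, lower, upper = 0, 0, 0
        for _ in range(n):
            if random.random() < 2.0 ** -epoch:           # the item reaches the coordinator
                if random.random() < 0.5:
                    lower += 1
                else:
                    upper += 1
                while lower >= s:                         # advance the epoch and re-split lower
                    epoch += 1
                    kept = sum(random.random() < 0.5 for _ in range(lower))
                    lower, upper = kept, lower - kept     # the old upper sample is discarded
        return 2.0 ** -epoch

    n, s = 100_000, 10
    print(final_forwarding_prob(n, s), s / n)             # the two values have the same order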
