  1. Random Sampling on Big Data: Techniques and Applications Ke Yi Hong Kong University of Science and Technology yike@ust.hk

  2. “Big Data” in one slide  The 3 V’s:  Volume  Velocity  Variety – Integers, real numbers – Points in a multi-dimensional space – Records in a relational database – Graph-structured data

  3. Dealing with Big Data  The first approach: scale up / out the computation  Many great technical innovations: – Distributed/parallel systems – Simpler programming models • MapReduce, Pregel, Dremel, Spark… • BSP – Failure tolerance and recovery – Drop certain features: ACID, CAP, NoSQL  This talk is not about this approach!

  4. Downsizing data  A second approach to computational scalability: scale down the data! – A compact representation of a large data set – Too much redundancy in big data anyway – What we finally want is small: human-readable analysis / decisions – Necessarily gives up some accuracy: approximate answers – Examples: samples, sketches, histograms, various transforms • See the tutorial by Graham Cormode for other data summaries  Complementary to the first approach – Can scale out computation and scale down data at the same time – Algorithms need to work under new system architectures • Good old RAM model no longer applies

  5. Outline for the talk  Simple random sampling – Sampling from a data stream – Sampling from distributed streams – Sampling for range queries  Not-so-simple sampling – Importance sampling: Frequency estimation on distributed data – Paired sampling: Medians and quantiles – Random walk sampling: SQL queries (joins)  Will jump back and forth between theory and practice

  6. Simple Random Sampling  Sampling without replacement – Randomly draw an element – Don’t put it back – Repeat s times  Sampling with replacement – Randomly draw an element – Put it back – Repeat s times  The statistical difference between the two is very small for n ≫ s  Trivial in the RAM model
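
As a point of reference, here is a minimal Python sketch of both schemes in the RAM model (function and variable names are my own, not from the talk):

```python
import random

def sample_without_replacement(data, s):
    # Randomly draw an element, don't put it back, repeat s times.
    pool = list(data)
    sample = []
    for _ in range(s):
        i = random.randrange(len(pool))
        sample.append(pool.pop(i))
    return sample

def sample_with_replacement(data, s):
    # Randomly draw an element, put it back, repeat s times.
    return [random.choice(data) for _ in range(s)]
```

For n ≫ s the two outputs are nearly indistinguishable, since repeated draws of the same element are unlikely.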

  7. Random Sampling from a Data Stream  A stream of elements coming in at high speed  Limited memory  Need to maintain the sample continuously  Applications – Data stored on disk – Network traffic


  9. Reservoir Sampling  Maintain a sample of size s drawn (without replacement) from all elements in the stream so far  Keep the first s elements in the stream, set n ← s  Algorithm for a new element – n ← n + 1 – With probability s/n, use it to replace an item in the current sample chosen uniformly at random – With probability 1 − s/n, throw it away  Perhaps the first “streaming” algorithm [Waterman ??; Knuth’s book]
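
The update rule transcribes almost line for line into Python; a sketch (names are mine):

```python
import random

def reservoir_sample(stream, s):
    sample = []   # the reservoir
    n = 0         # elements seen so far
    for x in stream:
        n += 1
        if n <= s:
            sample.append(x)                 # keep the first s elements
        elif random.random() < s / n:        # with probability s/n,
            sample[random.randrange(s)] = x  # replace a uniform victim
        # otherwise throw the new element away
    return sample
```

Only O(s) memory is used, and the stream is read once.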

  10. Correctness Proof  By induction on n – n = s: trivially correct – Assume each element so far is sampled with probability s/n – Consider n + 1: • The new element is sampled with probability s/(n+1) • Any element in the current sample remains sampled with probability (s/n) · (1 − s/(n+1) + (s/(n+1)) · ((s−1)/s)) = (s/n) · (n/(n+1)) = s/(n+1). Yeah!  This is a wrong (incomplete) proof  Each element being sampled with probability s/n is not a sufficient condition for random sampling – Counterexample: divide the elements into groups of s and pick one group randomly
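
The counterexample as code (a sketch, assuming n is a multiple of s): every element is returned with probability exactly s/n, yet only n/s of the C(n, s) possible size-s subsets can ever be output, so this is not a uniform random sample.

```python
import random

def grouped_pick(data, s):
    # Partition the data into n/s groups of size s and return one group
    # uniformly at random.  Each element appears with probability s/n,
    # but elements of the same group always appear together.
    groups = [data[i:i + s] for i in range(0, len(data), s)]
    return random.choice(groups)
```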


  12. Reservoir Sampling Correctness Proof  Many “proofs” found online are actually wrong – They only show that each item is sampled with probability s/n – Need to show that every subset of size s has the same probability of being the sample  The correct proof relates to the Fisher–Yates shuffle [figure: step-by-step Fisher–Yates shuffle of a, b, c, d with s = 2]
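
A quick empirical check of the stronger property (a sketch, with the reservoir sampler from above inlined): drawing s = 2 from four elements, all C(4, 2) = 6 subsets should appear with frequency close to 1/6.

```python
import random
from collections import Counter

def reservoir_sample(stream, s):
    sample, n = [], 0
    for x in stream:
        n += 1
        if n <= s:
            sample.append(x)
        elif random.random() < s / n:
            sample[random.randrange(s)] = x
    return sample

# Each of the 6 pairs should show up roughly 10,000 times.
counts = Counter(frozenset(reservoir_sample("abcd", 2)) for _ in range(60000))
for pair in sorted(counts, key=sorted):
    print(sorted(pair), counts[pair])
```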

  13. Sampling from Distributed Streams  One coordinator and k sites  Each site can communicate with the coordinator  Goal: Maintain a random sample of size s over the union of all streams with minimum communication  Difficulty: Don’t know n, so can’t run the reservoir sampling algorithm  Key observation: Don’t have to know n in order to sample! [Cormode, Muthukrishnan, Yi, Zhang, PODS’10, JACM’12] [Woodruff, Tirthapura, DISC’11]

  14. Reduction from Coin-Flip Sampling  Flip a fair coin for each element until we get “1”  An element is active on a level if its coin flip at that level is “0”  If a level has ≥ s active elements, we can draw a sample from those active elements  Key: The coordinator does not want all the active elements, which are too many! – Choose a level appropriately
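
A sketch of the reduction in Python (my naming): an element's level is the number of leading "0" flips, so it is active on level i with probability 2^−i, and the elements active on level i form a coin-flip sample at rate 2^−i.

```python
import random

def level():
    # Flip a fair coin until the first "1"; the number of "0"s seen
    # is the element's level.  Pr[level() >= i] = 2**-i.
    j = 0
    while random.random() < 0.5:
        j += 1
    return j

# Elements with level() >= i form exactly the level-i coin-flip sample.
stream = range(1000)
active = [x for x in stream if level() >= 3]  # about 1000 / 8 elements
```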

  15. The Algorithm  Initialize i ← 0  In round i: – Sites send in every item w.p. 2^−i (this is a coin-flip sample with prob. 2^−i) – Coordinator maintains a lower sample and a higher sample: each received item goes to either with equal prob. (the lower sample is a coin-flip sample with prob. 2^−(i+1)) – When the lower sample reaches size s, the coordinator broadcasts to advance to round i ← i + 1 – Discard the higher sample – Split the lower sample into a new lower sample and a higher sample
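
A single-process simulation sketch of this protocol (the k sites, the messages, and the broadcast are collapsed into direct function calls, and all names are mine):

```python
import random

class Coordinator:
    def __init__(self, s):
        self.s = s                       # target sample size
        self.i = 0                       # current round / level
        self.lower, self.higher = [], []

    def receive(self, x):
        # Each received item goes to the lower or higher sample with
        # equal prob., so lower is a coin-flip sample at rate 2^-(i+1).
        (self.lower if random.random() < 0.5 else self.higher).append(x)
        while len(self.lower) >= self.s:
            self._advance()

    def _advance(self):
        # Round ends: "broadcast" i <- i+1, discard the higher sample,
        # and split the old lower sample into a new lower and higher.
        self.i += 1
        old, self.lower, self.higher = self.lower, [], []
        for x in old:
            (self.lower if random.random() < 0.5 else self.higher).append(x)

def site_process(coord, x):
    # Each site forwards an arriving item with probability 2^-i.
    if random.random() < 0.5 ** coord.i:
        coord.receive(x)

coord = Coordinator(s=8)
for x in range(100000):
    site_process(coord, x)
# lower + higher is a coin-flip sample at rate 2^-i with at least s
# items, from which a without-replacement sample of size s is drawn.
print(coord.i, random.sample(coord.lower + coord.higher, 8))
```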

  16. Communication Cost of the Algorithm  Communication cost of each round: O(k + s) – Expect to receive O(s) sampled items before the round ends – Broadcast to end the round: O(k)  Number of rounds: O(log n) – In each round, need Θ(s) items being sampled to end the round – Each item has prob. 2^−i to contribute: need Θ(2^i · s) items  Total communication: O((k + s) log n) – Can be improved to O(k log_{k/s} n + s log n) – A matching lower bound  Sliding windows

  17. Random Sampling for Range Queries [Christensen, Wang, Li, Yi, Tang, Villa, SIGMOD’15 Best Demo Award]

  18. Online Range Sampling  Problem definition: Preprocess a set of points in the plane, so that for any range query, we can return samples (with or without replacement) drawn from all points in the range until user termination  Parameters: – n: data size – q: query size – s: sample size (not known beforehand) – typically n ≫ q ≫ s  Naïve solutions: – Query then sample: O(f(n) + q) – Sample then query: O(sn/q) (store the data in random order)  New solution: O(f(n) + s)  f(x): # canonical nodes in a tree of size x, between log x and x [Wang, Christensen, Li, Yi, VLDB’16]

  19. Indexing Spatial Data  Numerous spatial indexing structures in the literature [figure: an R-tree]

  20. RS-tree  Attach a sample to each node v, drawn from the leaves below v – Total space: O(n) – Construction time: O(n)

  21. RS-tree: A 1D Example  Report: 5 [figure: a binary tree over keys 1…16; the query’s active (canonical) nodes each carry an attached sample]


  23. RS-tree: A 1D Example  Report: 5 7  Pick 7 or 14 with equal prob. [figure]

  24. RS-tree: A 1D Example  Report: 5 7  Pick 3, 8, or 14 with prob. 1:1:2 [figure]


  26. RS-tree: A 1D Example  Report: 5 7 12  Pick 3, 8, or 12 with equal prob. [figure]
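
A simplified 1D sketch of the query procedure the example walks through (my code, not the paper's implementation): find the range's canonical ("active") nodes, then repeatedly pick an active node with probability proportional to its number of unreported points and report the next element of its pre-drawn random order. To keep the sketch short, each node stores a full random permutation of its subtree, which costs O(n log n) space; the RS-tree keeps only fixed-size attached samples to stay within O(n).

```python
import random

class Node:
    def __init__(self, keys, left=None, right=None):
        self.lo, self.hi = keys[0], keys[-1]        # keys arrive sorted
        self.left, self.right = left, right
        self.perm = random.sample(keys, len(keys))  # pre-drawn random order
        self.pos = 0                                # next unreported index

def build(keys):
    if len(keys) == 1:
        return Node(keys)
    mid = len(keys) // 2
    return Node(keys, build(keys[:mid]), build(keys[mid:]))

def canonical(v, lo, hi):
    # Maximal nodes whose key range lies entirely inside [lo, hi].
    if v is None or v.hi < lo or hi < v.lo:
        return []
    if lo <= v.lo and v.hi <= hi:
        return [v]
    return canonical(v.left, lo, hi) + canonical(v.right, lo, hi)

def range_sample(root, lo, hi, s):
    # Sample without replacement from the points in [lo, hi].
    active, out = canonical(root, lo, hi), []
    while len(out) < s and active:
        weights = [len(v.perm) - v.pos for v in active]
        v = random.choices(active, weights=weights)[0]
        out.append(v.perm[v.pos])
        v.pos += 1
        if v.pos == len(v.perm):
            active.remove(v)
    return out

root = build(list(range(1, 17)))
print(range_sample(root, 3, 14, 3))  # e.g. [5, 7, 12]
```

Picking an active node in proportion to its unreported count and then taking the next element of its permutation is exactly a uniform draw from the remaining points in the range, which matches the 1:1:2 weighting in the example.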

  27. Not-So-Simple Random Sampling: when simple random sampling is not optimal/feasible

  28. Frequency Estimation on Distributed Data  Given: A multiset S of n items drawn from the universe [u] – For example: IP addresses of network packets  S is partitioned arbitrarily and stored on k nodes – Local count x_ij: frequency of item i on node j – Global count y_i = Σ_j x_ij  Goal: Estimate y_i with additive error εn for all i – Can’t hope for relative error for all y_i – Heavy hitters are estimated well [Huang, Yi, Liu, Chen, INFOCOM’11]
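
The talk's outline pairs this problem with importance sampling; as a simple baseline for comparison, here is a uniform-sampling sketch on a single node (my code, not the paper's method): keep each item with probability p and scale counts by 1/p. The estimate of y_i is unbiased with standard deviation roughly sqrt(y_i / p), so heavy hitters are recovered accurately relative to an additive εn budget while rare items are noisy.

```python
import random
from collections import Counter

def estimate_frequencies(data, p):
    # Keep each item independently with probability p, then scale the
    # sampled counts back up by 1/p (unbiased; unseen items estimate 0).
    kept = Counter(x for x in data if random.random() < p)
    return {item: count / p for item, count in kept.items()}

data = ["a"] * 5000 + ["b"] * 200 + ["c"] * 10
print(estimate_frequencies(data, 0.05))  # "a" accurate, "c" often missing
```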
