Random Sampling on Big Data: Techniques and Applications Ke Yi Hong - PowerPoint PPT Presentation

Random Sampling on Big Data: Techniques and Applications Ke Yi Hong Kong University of Science and Technology yike@ust.hk

“Big Data” in one slide The 3 V’s :  Volume  Velocity  Variety – Integers, real numbers – Points in a multi-dimensional space – Records in relational database – Graph-structured data 2 Random Sampling on Big Data

Dealing with Big Data  The first approach: scale up / out the computation  Many great technical innovations: – Distributed/parallel systems – Simpler programming models • MapReduce, Pregel, Dremel, Spark… • BSP – Failure tolerance and recovery – Drop certain features: ACID, CAP, noSQL  This talk is not about this approach! 3 Random Sampling on Big Data

Downsizing data  A second approach to computational scalability: scale down the data! – A compact representation of a large data set – Too much redundancy in big data anyway – What we finally want is small: human readable analysis / decisions – Necessarily gives up some accuracy: approximate answers – Examples: samples, sketches, histograms, various transforms • See tutorial by Graham Cormode for other data summaries  Complementary to the first approach – Can scale out computation and scale down data at the same time – Algorithms need to work under new system architectures • Good old RAM model no longer applies 4 Random Sampling on Big Data

Outline for the talk  Simple random sampling – Sampling from a data stream – Sampling from distributed streams – Sampling for range queries  Not-so-simple sampling – Importance sampling: Frequency estimation on distributed data – Paired sampling: Medians and quantiles – Random walk sampling: SQL queries (joins)  Will jump back and forth between theory and practice 5 Random Sampling on Big Data

Simple Random Sampling  Sampling without replacement – Randomly draw an element – Don’t put it back – Repeat s times The statistical difference is  Sampling with replacement very small, for 𝑜 ≫ 𝑡 – Randomly draw an element – Put it back – Repeat s times  Trivial in the RAM model 6 Random Sampling on Big Data

Random Sampling from a Data Stream  A stream of elements coming in at high speed  Limited memory  Need to maintain the sample continuously  Applications – Data stored on disk – Network traffic 7 Random Sampling on Big Data

8 Random Sampling on Big Data

Reservoir Sampling  Maintain a sample of size 𝑡 drawn (without replacement) from all elements in the stream so far  Keep the first 𝑡 elements in the stream, set 𝑜 ← 𝑡  Algorithm for a new element – 𝑜 ← 𝑜 + 1 – With probability 𝑡/𝑜 , use it to replace an item in the current sample chosen uniformly at random – With probability 1 − 𝑡/𝑜 , throw it away  Perhaps the first “streaming” algorithm [Waterman ??; Knuth’s book] 9 Random Sampling on Big Data

Correctness Proof  By induction on 𝑜 – 𝑜 = 𝑡 : trivially correct – Assume each element so far is sampled with probability 𝑡/𝑜 – Consider 𝑜 + 1 : • The new element is sampled with probability 𝑡 𝑜+1 • Any element in the current sample is sampled with probability 𝑡 𝑡 𝑜+1 ⋅ 𝑡−1 𝑡 𝑡 𝑜 ⋅ 1 − 𝑜+1 + = 𝑜+1 . Yeah! 𝑡  This is a wrong (incomplete) proof 𝑡  Each element being sampled with probability 𝑜 is not a sufficient condition of random sampling – Counter example: Divide elements into groups of 𝑡 and pick one group randomly 10 Random Sampling on Big Data

11 Random Sampling on Big Data

Reservoir Sampling Correctness Proof  Many “proofs” found online are actually wrong – They only show that each item is sampled with probability 𝑡/𝑜 – Need to show that every subset of size 𝑡 has the same probability to be the sample  Correct proof relates with the Fisher-Yates shuffle s = 2 a a b b b b b a c d c c c a a d d d d c 12 Random Sampling on Big Data

Sampling from Distributed Streams  One coordinator and 𝑙 sites  Each site can communicate with the coordinator  Goal: Maintain a random sample of size 𝑡 over the union of all streams with minimum communication  Difficulty: Don’t know 𝑜 , so can’t run reservoir sampling algorithm  Key observation: Don’t have to know 𝑜 in order to sample! [Cormode, Muthukrishnan , Yi, Zhang, PODS’10, JACM’12] [Woodruff, Tirthapura , DISC’11] 13 Random Sampling on Big Data

Reduction from Coin Flip Sampling  Flip a fair coin for each element until we get “1”  An element is active on a level if it is “0”  If a level has ≥ 𝑡 active elements, we can draw a sample from those active elements  Key: The coordinator does not want all the active elements, which are too many! – Choose a level appropriately 14 Random Sampling on Big Data

The Algorithm  Initialize 𝑗 ← 0  In round 𝑗 : – Sites send in every item w.p. 2 −𝑗 (This is a coin-flip sample with prob. 2 −𝑗 ) – Coordinator maintains a lower sample and a higher sample: each received item goes to either with equal prob. (The lower sample is a sample with prob. 2 −(𝑗+1) ) – When the lower sample reaches size 𝑡 , the coordinator broadcasts to advance to round 𝑗 ← 𝑗 + 1 – Discard the upper sample – Split the lower sample into a new lower sample and a higher sample 15 Random Sampling on Big Data

Communication Cost of Algorithm  Communication cost of each round: 𝑃(𝑙 + 𝑡) – Expect to receive 𝑃(𝑡) sampled items before round ends – Broadcast to end round: 𝑃(𝑙)  Number of rounds: 𝑃(log 𝑜) – In each round, need Θ(𝑡) items being sampled to end round – Each item has prob. 2 −𝑗 to contribute: need Θ(2 𝑗 𝑡) items  Total communication: 𝑃( 𝑙 + 𝑡 log 𝑜) – Can be improved to 𝑃(𝑙 log 𝑙/𝑡 𝑜 + 𝑡 log 𝑜) – A matching lower bound  Sliding windows 16 Random Sampling on Big Data

Random Sampling for Range Queries [Christensen, Wang, Li, Yi, Tang, Villa, SIGMOD’15 Best Demo Award] 17 Random Sampling on Big Data

Online Range Sampling  Problem Definition: Preprocess a set of points in the plane, so that for any range query, we can return samples (with or without replacement) drawn from all points in the range until user termination.  Naïve solutions:  Parameters: – Query then sample: 𝑃 𝑔 𝑜 + 𝑟 – 𝑜: data size 𝑡𝑜 – Sample then query: 𝑃 𝑟 – 𝑟: query size (store data in random order) – 𝑡: sample size 𝑡𝑜 (not known beforehand)  New solution: 𝑃 𝑔 + 𝑡 𝑟 – 𝑜 ≫ 𝑟 ≫ 𝑡 𝑔(𝑦) : # canonical nodes in tree of size 𝑦 , between log 𝑦 and 𝑦 [Wang, Christensen, Li, Yi, VLDB’16] 18 Random Sampling on Big Data

Indexing Spatial Data  Numerous spatial indexing structures in the literature R-tree 19 Random Sampling on Big Data

RS-tree  Attach a sample to node 𝑣 drawn from leaves below 𝑣 – Total space: 𝑃(𝑜) – Construction time: 𝑃(𝑜) 20 Random Sampling on Big Data

RS-tree: A 1D Example Report: 5 Active nodes 5 7 14 12 14 3 8 1 4 5 7 9 12 14 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 21 Random Sampling on Big Data

RS-tree: A 1D Example Report: 5 Active nodes 5 7 14 12 14 3 8 1 4 5 7 9 12 14 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 22 Random Sampling on Big Data

RS-tree: A 1D Example Report: 5 7 Active nodes 5 Pick 7 or 14 with equal prob. 7 14 12 14 3 8 1 4 5 7 9 12 14 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 23 Random Sampling on Big Data

RS-tree: A 1D Example Report: 5 7 Active nodes 5 Pick 3, 8, or 14 7 14 with prob. 1:1:2 12 14 3 8 1 4 5 7 9 12 14 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 24 Random Sampling on Big Data

RS-tree: A 1D Example Report: 5 7 Active nodes 5 7 14 12 14 3 8 1 4 5 7 9 12 14 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 25 Random Sampling on Big Data

RS-tree: A 1D Example Report: 5 7 12 Active nodes 5 Pick 3, 8, or 12 7 14 with equal prob 12 14 3 8 1 4 5 7 9 12 14 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 26 Random Sampling on Big Data

Not-So-Simple Random Sampling When simple random sampling is not optimal/feasible

Frequency Estimation on Distributed Data  Given: A multiset 𝑇 of 𝑜 items drawn from the universe [𝑣] – For example: IP addresses of network packets  𝑇 is partitioned arbitrarily and stored on 𝑙 nodes – Local count 𝑦 𝑗𝑘 : frequency of item 𝑗 on node 𝑘 – Global count 𝑧 𝑗 = 𝑦 𝑗𝑘 𝑘  Goal: Estimate 𝑧 𝑗 with additive error 𝜁𝑜 for all 𝑗 – Can’t hope for relative error for all 𝑧 𝑗 – Heavy hitters are estimated well [Huang, Yi, Liu, Chen, INFOCOM’11] 28 Random Sampling on Big Data

Random Sampling on Big Data: Techniques and Applications Ke Yi Hong - PowerPoint PPT Presentation

Random Sampling on Big Data: Techniques and Applications Ke Yi Hong Kong University of Science and Technology yike@ust.hk Big Data in one slide The 3 Vs : Volume Velocity Variety Integers, real numbers Points in a

What is the strengths and weakness of these sampling methods? Sampling Strengths /

Chapter 7. Sampling Chapter 7. Sampling methods? methods? Two types of sampling methods Two

Sampling Overview R toy sampling Non-probability sampling Probability Methods (AKA random)

Sampling Methods Oliver Schulte - CMPT 419/726 Bishop PRML Ch. 11 Sampling Rejection Sampling

Multiple importance sampling Slides for CS6630 lecture 6 sampling the BRDF sampling the

Sampling Methods CMSC 678 UMBC Outline Recap Monte Carlo methods Sampling Techniques Uniform

Machine Learning Anders Holst SICS Big Data Analytics Analysis Big Data Big Value Big Data

Big Data Algorithms with Medical Applications Yixin Chen Outline Challenges to big data

CS786 Lecture 13: May 14, 2012 Sampling techniques [KF Chapter 12] CS786 P. Poupart 2012 1

Sampling Sediment and Sampling Sediment and Sampling Sediment and Porewater Sampling Sediment

Sampling Distributions Sampling Distribution of the Mean & Hypothesis Testing Sampling

Random Numbers RANDOM VS PSEUDO RANDOM Truly Random numbers From Wolfram: A random number

Global Illumination Multi-Sampling Path Tracing Simple Sampling Josef talked about all of

Covered Topics! v Big Graph Data Mining Sampling Ranking v Big Data Management Indexing v

Sublinear Algorithms for Big Data Part 4: Random Topics Qin Zhang 1-1 Topic 3: Random sampling

Newfound Water Quality Sampling: In Lake Sampling 8 Historic Sampling locations

Increasing the Convergence Domain of RGB-D Direct Registration Methods for Vision-based

Interactive Visualization of a News Clips Network: A Journalistic Research and Knowledge Discovery

High-Dimensional Pattern Recognition via Sparse Representation Allen Y. Yang Department of EECS,

Geospatial solutions for the Himalayas SERVIR Himalaya Sebastian Wesselman The International

Syed Aftab Rashid id, Geoffrey Nelissen and Eduardo Tovar 4/12/2016 Main CPU Cache Memory

Relating and Visualising CSP , VCR and Structural Traces Neil Brown 1 Marc Smith 2 1 Computing

Discovery of Personal Processes from Labeled Sensor Data An Application of Process Mining to

Object-Oriented Genetic Improvement for Improved Energy Consumption in Google Guava Nathan Burles

Random Sampling on Big Data: Techniques and Applications Ke Yi Hong - PowerPoint PPT Presentation

Random Sampling on Big Data: Techniques and Applications Ke Yi Hong Kong University of Science and Technology yike@ust.hk Big Data in one slide The 3 Vs : Volume Velocity Variety Integers, real numbers Points in a

What is the strengths and weakness of these sampling methods? Sampling Strengths /

Chapter 7. Sampling Chapter 7. Sampling methods? methods? Two types of sampling methods Two

Sampling Overview R toy sampling Non-probability sampling Probability Methods (AKA random)

Sampling Methods Oliver Schulte - CMPT 419/726 Bishop PRML Ch. 11 Sampling Rejection Sampling

Multiple importance sampling Slides for CS6630 lecture 6 sampling the BRDF sampling the

Sampling Methods CMSC 678 UMBC Outline Recap Monte Carlo methods Sampling Techniques Uniform

Machine Learning Anders Holst SICS Big Data Analytics Analysis Big Data Big Value Big Data

Big Data Algorithms with Medical Applications Yixin Chen Outline Challenges to big data

CS786 Lecture 13: May 14, 2012 Sampling techniques [KF Chapter 12] CS786 P. Poupart 2012 1

Sampling Sediment and Sampling Sediment and Sampling Sediment and Porewater Sampling Sediment

Sampling Distributions Sampling Distribution of the Mean &amp; Hypothesis Testing Sampling

Random Numbers RANDOM VS PSEUDO RANDOM Truly Random numbers From Wolfram: A random number

Global Illumination Multi-Sampling Path Tracing Simple Sampling Josef talked about all of

Covered Topics! v Big Graph Data Mining Sampling Ranking v Big Data Management Indexing v

Sublinear Algorithms for Big Data Part 4: Random Topics Qin Zhang 1-1 Topic 3: Random sampling

Newfound Water Quality Sampling: In Lake Sampling 8 Historic Sampling locations

Increasing the Convergence Domain of RGB-D Direct Registration Methods for Vision-based

Interactive Visualization of a News Clips Network: A Journalistic Research and Knowledge Discovery

High-Dimensional Pattern Recognition via Sparse Representation Allen Y. Yang Department of EECS,

Geospatial solutions for the Himalayas SERVIR Himalaya Sebastian Wesselman The International

Syed Aftab Rashid id, Geoffrey Nelissen and Eduardo Tovar 4/12/2016 Main CPU Cache Memory

Relating and Visualising CSP , VCR and Structural Traces Neil Brown 1 Marc Smith 2 1 Computing

Discovery of Personal Processes from Labeled Sensor Data An Application of Process Mining to

Object-Oriented Genetic Improvement for Improved Energy Consumption in Google Guava Nathan Burles

Sampling Distributions Sampling Distribution of the Mean & Hypothesis Testing Sampling