Random Sampling on Big Data: Techniques and Applications
Ke Yi
Hong Kong University of Science and Technology yike@ust.hk
Big Data in one slide. The 3 Vs: Volume, Velocity, Variety – Integers, real numbers – Points in a multi-dimensional space – Records in a relational database – Graph-structured data
The first approach: scale up / out the computation. Many great technical innovations: – Distributed/parallel systems – Simpler programming models – Failure tolerance and recovery – Relaxed guarantees: ACID, CAP, NoSQL. This talk is not about this approach!
A second approach to computational scalability: scale down the data. – A compact representation of a large data set – There is too much redundancy in big data anyway – What we finally want is small: human-readable analyses / decisions – Necessarily gives up some accuracy: approximate answers – Examples: samples, sketches, histograms, various transforms. Complementary to the first approach – Can scale out computation and scale down data at the same time – Algorithms need to work under new system architectures
Simple random sampling – Sampling from a data stream – Sampling from distributed streams – Sampling for range queries Not-so-simple sampling – Importance sampling: Frequency estimation on distributed data – Paired sampling: Medians and quantiles – Random walk sampling: SQL queries (joins) Will jump back and forth between theory and practice
Sampling without replacement – Randomly draw an element – Don't put it back – Repeat s times. Sampling with replacement – Randomly draw an element – Put it back – Repeat s times. Trivial in the RAM model
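In the RAM model both variants are indeed one-liners; a minimal sketch on a toy list, using only Python's standard library:

```python
import random

data = list(range(100))  # toy dataset
s = 10                   # sample size

# Without replacement: every size-s subset is equally likely.
without = random.sample(data, s)

# With replacement: s independent uniform draws; duplicates allowed.
with_repl = [random.choice(data) for _ in range(s)]

assert len(without) == len(set(without)) == s  # no duplicates possible
assert len(with_repl) == s
```

The streaming and distributed settings below are exactly about recovering these two primitives when the data no longer fits in RAM.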
A stream of elements coming in at high speed Limited memory Need to maintain the sample continuously Applications – Data stored on disk – Network traffic
Maintain a sample of size s drawn (without replacement) from the stream so far.
Keep the first s elements in the stream, set n ← s. Algorithm for a new element – n ← n + 1 – With probability s/n, use it to replace a uniformly random item in the current sample – With probability 1 − s/n, throw it away. Perhaps the first "streaming" algorithm
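A minimal sketch of the algorithm above, assuming the stream is any Python iterable:

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Maintain a uniform without-replacement sample of size s over a stream."""
    sample = []
    n = 0
    for x in stream:
        n += 1
        if n <= s:
            sample.append(x)              # keep the first s elements
        elif rng.random() < s / n:        # with probability s/n ...
            sample[rng.randrange(s)] = x  # ... replace a uniformly random slot
        # otherwise discard x
    return sample

sample = reservoir_sample(range(10**5), 20)
assert len(sample) == 20
```

The memory footprint is O(s) no matter how long the stream runs.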
[Waterman ??; Knuth’s book]
By induction on n – n = s: trivially correct – Assume each element so far is sampled with probability s/n – Consider n + 1: the new element is sampled with probability s/(n+1); an old element stays in the sample with probability
(s/n) · ((1 − s/(n+1)) + (s/(n+1)) · ((s−1)/s)) = (s/n) · (n/(n+1)) = s/(n+1). Yeah!
This is a wrong (incomplete) proof: each element being sampled with probability s/n is not a sufficient condition for a uniform random sample
– Counterexample: divide the elements into groups of s and pick one group uniformly at random; each element is sampled with probability s/n, but the sample is far from a uniform size-s subset
Many "proofs" found online are actually wrong – They only show that each item is sampled with probability s/n – Need to show that every subset of size s has the same probability of being the sample
The correct proof relates to the Fisher–Yates shuffle
[Figure: successive states of a Fisher–Yates shuffle of a b c d, with the first s = 2 positions forming the sample]
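The connection can be made concrete: running only the first s swap steps of a Fisher–Yates shuffle yields a uniform sample of size s without replacement. A sketch (the empirical uniformity check over all size-2 subsets is illustrative, not a proof):

```python
import random
from collections import Counter

def fisher_yates_sample(data, s, rng=random):
    """First s steps of a Fisher-Yates shuffle: uniform sample w/o replacement."""
    a = list(data)
    for i in range(s):
        j = rng.randrange(i, len(a))  # i-th sample: uniform over the remainder
        a[i], a[j] = a[j], a[i]
    return a[:s]

# Every size-2 subset of {a,b,c,d} should appear about equally often.
rng = random.Random(1)
counts = Counter(frozenset(fisher_yates_sample("abcd", 2, rng))
                 for _ in range(60000))
assert len(counts) == 6  # all C(4,2) subsets occur
```

Reservoir sampling can be seen as simulating exactly this process online, which is why the per-element probability argument alone is not enough.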
One coordinator and k sites. Each site can communicate with the coordinator.
Goal: maintain a random sample of size s over the union of the k streams.
Difficulty: we don't know n, so we can't run reservoir sampling directly.
Key observation: we don't have to know n.
[Cormode, Muthukrishnan, Yi, Zhang, PODS’10, JACM’12] [Woodruff, Tirthapura, DISC’11]
Flip fair coins for each element until we get a "1". An element is active on level j if its first j flips are all "0" (probability 2^−j). If a level has ≥ s active elements, we can draw a sample of size s from that level.
Key: the coordinator does not want all the active elements, only about s of them
– Choose a level appropriately
Initialize j ← 0. In round j: – Sites send in every item w.p. 2^−j
– Coordinator maintains a lower sample and a higher sample: each received item goes to either one with probability 1/2
– When the lower sample reaches size s, the coordinator broadcasts to advance to round j + 1
– Discard the higher sample – Split the lower sample into a new lower sample and a higher sample, with fresh coin flips
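A rough single-process simulation of the round structure may help. The site/coordinator message flow is collapsed into one loop and round-end details are simplified, so treat this as a sketch of the sampling logic, not of the actual distributed protocol:

```python
import random

def simulate(streams, s, rng):
    """Simplified simulation of the round-based distributed sampling protocol.
    streams: one item list per site. Returns (sample, messages_sent)."""
    j = 0                                   # current round: send w.p. 2**-j
    lower, higher = [], []
    messages = 0
    arrivals = [x for site in streams for x in site]
    rng.shuffle(arrivals)                   # mimic interleaved arrivals
    for x in arrivals:
        if rng.random() < 2.0 ** -j:        # the site forwards the item
            messages += 1
            (lower if rng.random() < 0.5 else higher).append(x)
        while len(lower) >= s:              # coordinator advances the round
            j += 1
            higher = []                     # discard the higher sample
            old, lower = lower, []
            for y in old:                   # re-split with fresh coin flips
                (lower if rng.random() < 0.5 else higher).append(y)
    active = lower + higher
    return rng.sample(active, min(s, len(active))), messages

rng = random.Random(7)
streams = [list(range(i * 1000, (i + 1) * 1000)) for i in range(4)]
sample, msgs = simulate(streams, 16, rng)
assert len(sample) == 16
```

The point of the simulation is the message counter: only a small fraction of the 4000 stream items is ever forwarded to the coordinator.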
Communication cost of each round: O(k + s) – Expect to receive O(s) sampled items before a round ends – Broadcast to end the round: O(k). Number of rounds: O(log n) – In each round, need Θ(s) items to be sampled to end the round – Each item has probability 2^−j to contribute: need Θ(2^j · s) items. Total communication: O((k + s) log n) – Can be improved to O(k log_{k/s} n + s log n) – A matching lower bound. Sliding windows can also be handled
[Christensen, Wang, Li, Yi, Tang, Villa, SIGMOD’15 Best Demo Award]
Problem definition: preprocess a set of points in the plane, so that given a query range, s random samples of the points inside the range can be returned.
Parameters: – n: data size – q: query size (number of points in the range) – s: sample size
– n ≫ q ≫ s
Naïve solutions: – Query then sample: O(f(n) + q), where f(n) is the cost of a range reporting query – Sample then query: O(s · n/q)
New solution: O(f(n) + s)
[Wang, Christensen, Li, Yi, VLDB’16]
Numerous spatial indexing structures in the literature
Attach a sample to each node v, drawn from the leaves below v – Total space: O(n) – Construction time: O(n)
[Figure: binary tree over leaves 1–16, each internal node storing a pre-drawn sample of the leaves below it; the active nodes are the maximal nodes fully inside the query range. First item reported: 5]
Pick 7 or 14 with equal probability; report 7.
Pick 3, 8, or 14 with probability 1:1:2.
Pick 3, 8, or 12 with equal probability; report 12.
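A sketch of the query procedure, with one simplification: instead of the pre-drawn per-node samples, this version draws lazily from an active (canonical) node chosen with probability proportional to its size, which yields the same output distribution:

```python
import random

class RangeSampler:
    """Balanced tree over sorted keys; each query draws uniform samples from
    the keys in [lo, hi] by weighting the active (canonical) nodes by size."""

    def __init__(self, keys):
        self.keys = sorted(keys)

    def _canonical(self, a, b, lo, hi, out):
        # The node covers self.keys[a:b]; collect maximal fully-inside nodes.
        if a >= b or self.keys[b - 1] < lo or self.keys[a] > hi:
            return                               # disjoint from the query
        if lo <= self.keys[a] and self.keys[b - 1] <= hi:
            out.append((a, b))                   # active node: fully inside
            return
        mid = (a + b) // 2
        self._canonical(a, mid, lo, hi, out)
        self._canonical(mid, b, lo, hi, out)

    def sample(self, lo, hi, s, rng=random):
        nodes = []
        self._canonical(0, len(self.keys), lo, hi, nodes)
        if not nodes:
            return []
        weights = [b - a for a, b in nodes]      # leaves under each node
        out = []
        for _ in range(s):                       # with replacement, for brevity
            a, b = rng.choices(nodes, weights)[0]
            out.append(self.keys[rng.randrange(a, b)])
        return out

rs = RangeSampler(range(1, 17))
smp = rs.sample(3, 12, 5, random.Random(0))
assert len(smp) == 5 and all(3 <= x <= 12 for x in smp)
```

Only O(log n) canonical nodes cover any range, so each sample costs O(log n) time plus O(1) per drawn item.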
Given: a multiset S of n items drawn from the universe [u] – For example: IP addresses of network packets. S is partitioned arbitrarily and stored on k nodes – Local count x_ij: frequency of item i on node j – Global count y_i = Σ_j x_ij
Goal: estimate every y_i with additive error εn – Can't hope for relative error for all y_i – Heavy hitters are estimated well
[Huang, Yi, Liu, Chen, INFOCOM’11]
Local heavy hitters
– Let n_j = Σ_i x_ij be the data size at node j
– Node j sends in all items with local frequency ≥ εn_j
– Total error is at most Σ_j εn_j = εn
– Communication cost: O(k/ε)
Simple random sampling
– A simple random sample of size O(1/ε²) can be used to estimate the frequency of any item with error εn
– Algorithm: communication cost O(k + 1/ε²)
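A sketch of the simple-random-sampling baseline on toy data; the assertion uses a loose multiple of εn as the tolerated error:

```python
import random
from collections import Counter

def estimate_freqs(items, sample_size, rng=random):
    """Estimate every item's global count from a simple random sample.
    A sample of size O(1/eps^2) gives additive error ~eps*n w.h.p."""
    n = len(items)
    sample = rng.sample(items, sample_size)
    counts = Counter(sample)
    # Scale sample counts back up to estimates of the global counts.
    return {x: c * n / sample_size for x, c in counts.items()}, n

rng = random.Random(42)
data = ["heavy"] * 5000 + [f"rare{i}" for i in range(5000)]  # n = 10000
est, n = estimate_freqs(data, 400, rng)     # eps ~ 1/sqrt(400) = 0.05
assert abs(est.get("heavy", 0) - 5000) <= 1500   # well within ~3*eps*n
```

Items missing from the sample get an implicit estimate of 0, which is fine under additive error εn.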
Estimator (Horvitz–Thompson): X̂_ij = x_ij / p(x_ij) if x_ij is sampled, 0 else
Ŷ_i = X̂_i,1 + ⋯ + X̂_i,k
Natural choice: sample with probability proportional to x/(εn) – More precisely: p₀(x) = min(√k · x/(εn), 1)
– Can show: Var[Ŷ_i] = O((εn)²)
– Communication cost: O(√k/ε)
– This is (worst-case) optimal. Interesting discovery: p₁(x) = p₀(x)²
– Also has Var[Ŷ_i] = O((εn)²)
– Also has communication cost O(√k/ε)
– But can be much lower than p₀(x) on some inputs
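A sketch of the Horvitz–Thompson estimator; the concrete sampling probability p(x) = min(√k · x/(εn), 1) is one plausible choice here, not necessarily the talk's exact function:

```python
import random
from collections import defaultdict
from math import sqrt

def importance_estimate(local_counts, eps, rng=random):
    """Each node samples item i w.p. p(x_ij) and reports (i, x_ij).
    Horvitz-Thompson: contribute x_ij / p(x_ij) when sampled, 0 otherwise.
    p(x) = min(sqrt(k) * x / (eps * n), 1) is an assumed concrete choice."""
    k = len(local_counts)
    n = sum(sum(node.values()) for node in local_counts)
    est = defaultdict(float)
    sent = 0
    for node in local_counts:
        for item, x in node.items():
            p = min(sqrt(k) * x / (eps * n), 1.0)
            if rng.random() < p:
                sent += 1
                est[item] += x / p          # unbiased contribution
    return dict(est), sent

rng = random.Random(3)
nodes = [{f"item{i}": 1 + (i % 5) for i in range(200)} for _ in range(9)]
nodes[0]["hot"] = 4000                      # one global heavy hitter
est, sent = importance_estimate(nodes, eps=0.05, rng=rng)
assert est.get("hot", 0) == 4000            # heavy hitters hit p = 1: exact
```

Heavy items are always reported (p clips at 1), while light items are mostly suppressed, so the message count stays far below the number of (item, node) pairs.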
Exact quantiles: F^−1(φ) for 0 < φ < 1, where F is the CDF. Approximate version: tolerate any answer between F^−1(φ − ε) and F^−1(φ + ε)
Simple random sampling – An ε-approximation needs a sample of size O(1/ε²)
Paired sampling – Divide the data into chunks of size s = O(1/ε) – Sort each chunk – Do binary merges into one chunk – Each merge keeps the odd-positioned or the even-positioned items of the merged chunk, with equal probability
– Similar ideas are used in discrepancy theory. This needs O(n log s) time; how is it useful?
Example (s = 5): merging the sorted chunks 1 5 6 7 8 and 2 3 4 9 10 and keeping the odd-positioned items yields 1 3 5 7 9.
Can merge chunks up as items arrive.
At any time, keep at most O(log n) chunks.
Space: O(1/ε · log n) – Can be improved to O(1/ε · log(1/ε)) by combining with random sampling
– Can find all quantiles [Felber, Ostrovsky '15]. Reservoir sampling needs O(1/ε²) space; the best deterministic algorithm needs O(1/ε · log n) space
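A sketch of the chunk-merging process, including the streaming "binary counter" organization that keeps only O(log n) chunks alive:

```python
import random

def paired_merge(a, b, rng=random):
    """Merge two sorted chunks; keep the odd- or even-positioned items
    (each with probability 1/2), halving the total size."""
    merged = sorted(a + b)
    return merged[rng.randrange(2)::2]

def quantile_sketch(stream, chunk_size, rng=random):
    """Streaming organization: like a binary counter, two equal-level
    chunks are merged into one at the next level."""
    levels = {}                          # level -> one sorted chunk
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) == chunk_size:
            chunk, lvl = sorted(buf), 0
            buf = []
            while lvl in levels:         # "carry": merge equal-level chunks
                chunk = paired_merge(chunk, levels.pop(lvl), rng)
                lvl += 1
            levels[lvl] = chunk
    return levels, buf

rng = random.Random(5)
assert paired_merge([1, 5, 6, 7, 8], [2, 3, 4, 9, 10], rng) in (
    [1, 3, 5, 7, 9], [2, 4, 6, 8, 10])
levels, buf = quantile_sketch(range(1024), 16, rng)
```

With 1024 items and chunks of 16, exactly 64 = 2^6 chunks are formed, so everything collapses into a single level-6 chunk of 16 items; a quantile query would then be answered from the surviving (weighted) chunks.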
[Wang, Luo, Yi, Cormode, SIGMOD’13]
Data stored on k nodes. Each node reduces its data to size O(1/(ε√k)) using paired sampling.
The coordinator reduces all the data received to a summary of size O(1/ε).
Communication cost: O(√k/ε). Looks familiar?
A sample that preserves the "density" of point sets – For any range (e.g., a circle), |fraction of sample points inside − fraction of all points inside| ≤ ε – A simple random sample needs size O(1/ε²)
– Paired sampling yields a smaller size, close to O(1/ε)
[Huang, Yi, FOCS’14]
Transactional (OLTP) – Deduct x dollars from account A, credit x dollars to account B – Challenge: efficiency and correctness (ACID). Analytical (OLAP) – Touches a large fraction of the data – Many tables – Complex conditions – Challenge: efficiency – Correctness?
Wander Join: Online Aggregation via Random Walks
SELECT SUM(l_extendedprice * (1 - l_discount))
FROM customer, lineitem, orders, nation, region
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND l_returnflag = 'R'
  AND c_nationkey = n_nationkey
  AND n_regionkey = r_regionkey
  AND r_name = 'ASIA'
SELECT ONLINE SUM(l_extendedprice * (1 - l_discount))
FROM customer, lineitem, orders, nation, region
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND l_returnflag = 'R'
  AND c_nationkey = n_nationkey
  AND n_regionkey = r_regionkey
  AND r_name = 'ASIA'
WITHTIME 60000 CONFIDENCE 95 REPORTINTERVAL 1000
Ripple join: store the tuples of each table in random order. In each step – Read the next tuple from a table, in round-robin fashion – Join it with the sampled tuples from the other tables. Works well for a full Cartesian product – But most joins are sparse …
Customer (Nation, CID): US 1 | US 2 | China 3 | UK 4 | China 5 | US 6 | China 7 | UK 8 | Japan 9 | UK 10
Item (OrderID, ItemID, Price): (4, 301, $2100), (2, 304, $100), (3, 201, $300), (4, 306, $500), (3, 401, $230), (1, 101, $800), (2, 201, $300), (5, 101, $200), (4, 301, $100), (2, 201, $600)
Order (BuyerID, OrderID): (4, 1), (3, 2), (1, 3), (5, 4), (5, 5), (5, 6), (3, 7), (5, 8), (3, 9), (7, 10)
Example walk: pick a China customer (CID 5, one of three China customers), then one of buyer 5's four orders (order 4), then one of order 4's three items (price $500).
Estimate: $500 / sampling prob. = $500 / (1/3 · 1/4 · 1/3) = $18,000
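A sketch of the wander join estimator on the toy tables above; dictionary indexes stand in for the B-tree indexes, and a walk that finds no matching tuple contributes 0, which keeps the estimator unbiased:

```python
import random

customer = {1: "US", 2: "US", 3: "China", 4: "UK", 5: "China",
            6: "US", 7: "China", 8: "UK", 9: "Japan", 10: "UK"}
orders = [(4, 1), (3, 2), (1, 3), (5, 4), (5, 5),
          (5, 6), (3, 7), (5, 8), (3, 9), (7, 10)]    # (BuyerID, OrderID)
items = [(4, 301, 2100), (2, 304, 100), (3, 201, 300), (4, 306, 500),
         (3, 401, 230), (1, 101, 800), (2, 201, 300), (5, 101, 200),
         (4, 301, 100), (2, 201, 600)]                # (OrderID, ItemID, Price)

# Indexes for random-neighbour lookups (a B-tree in a real system).
orders_by_buyer, items_by_order = {}, {}
for b, o in orders:
    orders_by_buyer.setdefault(b, []).append(o)
for o, i, p in items:
    items_by_order.setdefault(o, []).append(p)

china = [c for c, nat in customer.items() if nat == "China"]

def one_walk(rng):
    """One walk customer -> order -> item; Horvitz-Thompson estimate."""
    c = rng.choice(china)
    prob = 1 / len(china)
    os_ = orders_by_buyer.get(c)
    if not os_:
        return 0.0                       # walk rejected: counts as 0
    o = rng.choice(os_); prob /= len(os_)
    ps = items_by_order.get(o)
    if not ps:
        return 0.0
    price = rng.choice(ps); prob /= len(ps)
    return price / prob                  # unbiased for SUM(Price)

rng = random.Random(0)
N = 200_000
est = sum(one_walk(rng) for _ in range(N)) / N
true = sum(p for b, o in orders if customer[b] == "China"
           for p in items_by_order.get(o, []))
assert abs(est - true) < 0.1 * true
```

Averaging many walks converges to the exact join sum ($3,900 on these tables), and each walk only touches one tuple per table.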
Challenges: structure of the data graph. Selection predicates – On the starting table: use an index – On a table in the middle: reject the random walk. Data distribution – Non-uniformity
[Figure: two small instances of tables R1, R2, R3 with different degree distributions; in one, Var(R1 → R2) < Var(R2 → R1), in the other Var(R1 → R2) > Var(R2 → R1): the best walk order depends on the data]
Enumerate all plans. Conduct ~100 trial random walks using each plan. Measure the variance of each plan. Select the best plan. All trial runs are still useful
[Li, Wu, Yi, Zhao, SIGMOD’16 Best Paper Award]
Logarithmic growth due to B-tree lookup to find random neighbours
Insufficient memory incurs a heavy, one-time penalty; growth is still logarithmic. Fundamentally, random sampling is at odds with hard disks – But does it matter? Spark, in-memory DBs, RAMCloud … – The algorithm is embarrassingly parallel. Turbo DBO [Dobra, Jermaine, Rusu, Xu, VLDB'09]
Wander Join vs. Ripple Join:
– Sampling methodology: independent but non-uniform vs. uniform but non-independent
– Index needed? yes vs. index or random storage order
– Confidence interval computation: easy, O(n) time vs. complicated, O(n^k) time (k: # tables)
– Convergence time (20 GB data, 3 tables): ~3 s vs. ~50 s
– Scalability: logarithmic vs. slightly less than linear
– System implementation: PostgreSQL (finished), Oracle (in progress), SparkSQL (in progress) vs. Informix (internal project), DBO
Online Aggregation vs. Data Cube:
– Queries: online, ad hoc vs. offline, fixed
– Latency: seconds vs. hours, then milliseconds
– Query mode: one at a time vs. batch
– Accuracy: small error vs. no error
– Data schema: any (relational, graph) vs. multidimensional cube
– Work with OLTP: integrated vs. separate
– Target scenario: online, ad hoc, interactive data analytics vs. monthly report
[Figure: the Item and Order tables again, with the Order table (BuyerID, OrderID) stored in a B-tree]
Sampling from an aggregate (ranked) B-tree is easy. But it – incurs a heavy cost for transactions – needs modifications to existing B-tree implementations
[Figure: a B-tree whose nodes have fanouts 4, 2, and 3]
Imagine each node has the maximum fanout; at every node, pick one of the (imagined) children uniformly, and reject the walk as soon as it goes out of bounds.
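A sketch of this rejection trick on a small tree (internal nodes are Python lists, leaves are plain values, and all leaves sit at the same depth, as in a B-tree):

```python
import random
from collections import Counter

def try_uniform_leaf(node, max_fanout, rng=random):
    """One walk: pretend every internal node has max_fanout children and
    reject (return None) as soon as the walk steps out of bounds.
    Accepted walks return each leaf with equal probability."""
    while isinstance(node, list):
        i = rng.randrange(max_fanout)
        if i >= len(node):
            return None                  # walked out of bounds: reject
        node = node[i]
    return node

def uniform_leaf(root, max_fanout, rng=random):
    while True:                          # repeat until a walk is accepted
        leaf = try_uniform_leaf(root, max_fanout, rng)
        if leaf is not None:
            return leaf

rng = random.Random(2)
tree = [[1, 2, 3, 4], [5, 6], [7, 8, 9]]   # child fanouts 4, 2, 3
counts = Counter(uniform_leaf(tree, 4, rng) for _ in range(90000))
assert sorted(counts) == list(range(1, 10))  # every leaf shows up
```

Every accepted leaf is reached with probability (1/max_fanout)^depth, so conditioning on acceptance gives a uniform leaf without storing per-node counts.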
As long as we can compute the sampling probability of each path, wander join still works: weight each result by the inverse of its sampling probability.
[Figure: leaf sampling probabilities under non-uniform fanout: four leaves with probability 1/(3·4), two with 1/(3·2), three with 1/(3·3)]
Stoica, ’13]
Wander Join vs. BlinkDB:
– Methodology: query, then sampling vs. sampling, then query
– Sampling method: random walks vs. stratified sampling
– Joins supported: any vs. a big table joining a small table (no sampling on the small table)
– Error: reduces over time vs. fixed
– Data schema: any (relational, graph) vs. star / snowflake
– Work with OLTP: integrated vs. separate
– Group-by support: unbalanced vs. balanced