Random Sampling on Big Data: Techniques and Applications (PowerPoint presentation by Ke Yi)


SLIDE 1

Random Sampling on Big Data: Techniques and Applications

Ke Yi

Hong Kong University of Science and Technology yike@ust.hk

SLIDE 2

“Big Data” in one slide

The 3 V’s:

• Volume
• Velocity
• Variety
– Integers, real numbers
– Points in a multi-dimensional space
– Records in a relational database
– Graph-structured data

Random Sampling on Big Data

SLIDE 3

Dealing with Big Data

• The first approach: scale up / out the computation
• Many great technical innovations:
– Distributed/parallel systems
– Simpler programming models
  • MapReduce, Pregel, Dremel, Spark…
  • BSP
– Failure tolerance and recovery
– Dropping certain features: relaxed ACID guarantees (the CAP theorem, NoSQL)
• This talk is not about this approach!

SLIDE 4

Downsizing data

• A second approach to computational scalability: scale down the data!
– A compact representation of a large data set
– Too much redundancy in big data anyway
– What we finally want is small: human-readable analyses / decisions
– Necessarily gives up some accuracy: approximate answers
– Examples: samples, sketches, histograms, various transforms
  • See the tutorial by Graham Cormode for other data summaries
• Complementary to the first approach
– Can scale out computation and scale down data at the same time
– Algorithms need to work under new system architectures
  • The good old RAM model no longer applies

SLIDE 5

Outline for the talk

• Simple random sampling
– Sampling from a data stream
– Sampling from distributed streams
– Sampling for range queries
• Not-so-simple sampling
– Importance sampling: frequency estimation on distributed data
– Paired sampling: medians and quantiles
– Random walk sampling: SQL queries (joins)
• Will jump back and forth between theory and practice

SLIDE 6

Simple Random Sampling

• Sampling without replacement
– Randomly draw an element
– Don’t put it back
– Repeat s times
• Sampling with replacement
– Randomly draw an element
– Put it back
– Repeat s times
• Trivial in the RAM model

The statistical difference between the two is very small when n ≫ s.
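Both variants are one-liners in the RAM model. A minimal Python sketch (illustrative, not the talk’s code):

```python
import random

def sample_without_replacement(data, s):
    # Draw s distinct elements; every size-s subset is equally likely.
    return random.sample(data, s)

def sample_with_replacement(data, s):
    # Draw s elements independently and uniformly; duplicates allowed.
    return [random.choice(data) for _ in range(s)]
```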

SLIDE 7

Random Sampling from a Data Stream

• A stream of elements coming in at high speed
• Limited memory
• Need to maintain the sample continuously
• Applications
– Data stored on disk
– Network traffic

SLIDE 8

SLIDE 9

Reservoir Sampling [Waterman ??; Knuth’s book]

• Maintain a sample of size s drawn (without replacement) from all elements in the stream so far
• Keep the first s elements in the stream; set n ← s
• Algorithm for a new element:
– n ← n + 1
– With probability s/n, use it to replace an item in the current sample, chosen uniformly at random
– With probability 1 − s/n, throw it away
• Perhaps the first “streaming” algorithm
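The update rule above can be written directly in Python. A minimal sketch (illustrative, not the talk’s code):

```python
import random

def reservoir_sample(stream, s):
    """Uniform without-replacement sample of size s over a stream
    whose length n is not known in advance."""
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= s:
            sample.append(x)                  # keep the first s elements
        elif random.random() < s / n:         # w.p. s/n the new element
            sample[random.randrange(s)] = x   # replaces a uniform victim
        # otherwise the new element is thrown away
    return sample
```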

SLIDE 10

Correctness Proof

• By induction on n
– n = s: trivially correct
– Assume each element so far is sampled with probability s/n
– Consider n + 1:
  • The new element is sampled with probability s/(n+1)
  • Any element in the current sample is sampled with probability
    (s/n) · [(1 − s/(n+1)) + (s/(n+1)) · ((s−1)/s)] = s/(n+1). Yeah!
• This is a wrong (incomplete) proof
• Each element being sampled with probability s/n is not a sufficient condition for random sampling
– Counterexample: divide the elements into groups of s and pick one group at random

SLIDE 11

SLIDE 12

Reservoir Sampling Correctness Proof

• Many “proofs” found online are actually wrong
– They only show that each item is sampled with probability s/n
– Need to show that every subset of size s has the same probability of being the sample
• A correct proof relates reservoir sampling to the Fisher–Yates shuffle

(Figure: the permutations of a b c d produced by a Fisher–Yates shuffle, with s = 2.)
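The stronger property can be checked by brute force on a tiny instance. The sketch below (my own check, not from the talk) re-implements reservoir sampling and counts how often each 2-subset of 5 elements appears:

```python
import random
from collections import Counter

def reservoir_sample(stream, s):
    """Reservoir sampling as described on slide 9."""
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= s:
            sample.append(x)
        elif random.random() < s / n:
            sample[random.randrange(s)] = x
    return sample

# Empirical check: with n = 5 and s = 2, all C(5,2) = 10 subsets
# should occur with frequency close to 1/10 each.
random.seed(1)
counts = Counter(frozenset(reservoir_sample(range(5), 2))
                 for _ in range(50_000))
```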

SLIDE 13

Sampling from Distributed Streams
[Cormode, Muthukrishnan, Yi, Zhang, PODS’10, JACM’12] [Woodruff, Tirthapura, DISC’11]

• One coordinator and k sites
• Each site can communicate with the coordinator
• Goal: maintain a random sample of size s over the union of all streams with minimum communication
• Difficulty: we don’t know n, so we can’t run the reservoir sampling algorithm
• Key observation: we don’t have to know n in order to sample!

SLIDE 14

Reduction from Coin-Flip Sampling

• Flip a fair coin for each element until we get a “1”
• An element is active at level j if its first j flips are all “0”
• If a level has ≥ s active elements, we can draw a sample from those active elements
• Key: the coordinator does not want all the active elements, which are too many!
– Choose the level appropriately

SLIDE 15

The Algorithm

• Initialize j ← 0
• In round j:
– Sites send in every item w.p. 2^-j (this is a coin-flip sample with prob. 2^-j)
– The coordinator maintains a lower sample and a higher sample; each received item goes to either with equal probability (the lower sample is thus a sample with prob. 2^-(j+1))
– When the lower sample reaches size s, the coordinator broadcasts to advance to round j ← j + 1
– Discard the higher sample
– Split the lower sample into a new lower sample and a new higher sample
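A toy single-process simulation of the protocol (my own sketch: the k sites and the network are collapsed into one loop, and constants are illustrative):

```python
import random

def distributed_sample(stream, s):
    """Toy simulation of the coordinator protocol: in round j every
    site forwards each item w.p. 2^-j; the coordinator splits the
    received items into a lower and a higher sample, and advances the
    round when the lower sample reaches size s."""
    j = 0
    lower, higher = [], []
    for item in stream:
        if random.random() < 2.0 ** -j:       # site forwards w.p. 2^-j
            if random.random() < 0.5:
                lower.append(item)            # now held w.p. 2^-(j+1)
            else:
                higher.append(item)           # held w.p. 2^-j only
            if len(lower) == s:               # advance to round j+1
                j += 1
                higher = []                   # discard the higher sample
                new_lower, new_higher = [], []
                for x in lower:               # re-flip for each survivor
                    if random.random() < 0.5:
                        new_lower.append(x)
                    else:
                        new_higher.append(x)
                lower, higher = new_lower, new_higher
    # A final sample of size <= s can be drawn from the items still held.
    return lower + higher
```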

SLIDE 16

Communication Cost of the Algorithm

• Communication cost of each round: O(k + s)
– Expect to receive O(s) sampled items before the round ends
– Broadcast to end the round: O(k)
• Number of rounds: O(log n)
– In each round, Θ(s) items need to be sampled to end the round
– Each item has prob. 2^-j of contributing: need Θ(2^j · s) items
• Total communication: O((k + s) log n)
– Can be improved to O(k · log_{k/s} n + s · log n)
– A matching lower bound
• Also extends to sliding windows

SLIDE 17

Random Sampling for Range Queries
[Christensen, Wang, Li, Yi, Tang, Villa, SIGMOD’15 Best Demo Award]

SLIDE 18

Online Range Sampling
[Wang, Christensen, Li, Yi, VLDB’16]

• Problem definition: preprocess a set of points in the plane so that, for any range query, we can keep returning samples (with or without replacement) drawn from all points in the range until the user terminates.
• Parameters:
– n: data size
– q: query size
– s: sample size (not known beforehand)
– n ≫ q ≫ s
• Naïve solutions:
– Query then sample: O(f(n) + q)
– Sample then query: O(sn/q) (store the data in random order)
• New solution: O(f(sn/q) + s)

f(x): the number of canonical nodes in a tree of size x; between log x and √x, depending on the structure

SLIDE 19

Indexing Spatial Data

• Numerous spatial indexing structures exist in the literature (figure: an R-tree)

SLIDE 20

RS-tree

• Attach to each node v a sample drawn from the leaves below v
– Total space: O(n)
– Construction time: O(n)
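The RS-tree idea can be sketched in Python, assuming 1D points in a sorted array and a binary tree; the function names (`canonical`, `sample_range`) are mine, not from the paper:

```python
import random

class Node:
    """Binary tree over a sorted array; each node stores a pre-drawn
    random permutation of the leaves below it (its attached sample)."""
    def __init__(self, data):
        self.data = data
        self.buf = random.sample(data, len(data))   # pre-drawn sample
        self.left = self.right = None
        if len(data) > 1:
            mid = len(data) // 2
            self.left = Node(data[:mid])
            self.right = Node(data[mid:])

def canonical(node, lo, hi):
    """Maximal nodes whose leaves all lie inside [lo, hi)."""
    if node is None or node.data[-1] < lo or node.data[0] >= hi:
        return []
    if lo <= node.data[0] and node.data[-1] < hi:
        return [node]
    return canonical(node.left, lo, hi) + canonical(node.right, lo, hi)

def sample_range(root, lo, hi, s):
    """Draw s samples (with replacement) from the points in [lo, hi)."""
    nodes = canonical(root, lo, hi)
    total = sum(len(n.data) for n in nodes)
    if total == 0:
        return []
    out = []
    for _ in range(s):
        r = random.randrange(total)   # active node w.p. proportional to size
        for n in nodes:
            if r < len(n.data):
                out.append(random.choice(n.buf))   # uniform leaf below n
                break
            r -= len(n.data)
    return out
```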

SLIDES 21–26

RS-tree: A 1D Example

The leaves hold the points 1–16; each internal node stores a pre-drawn sample of its subtree (node samples, bottom-up: 1 4 5 7 9 12 14 16; 3 8 12 14; 7 14; root: 5). A query repeatedly reports one element from the active (canonical) nodes:
– Report 5, the root’s sample
– Pick 7 or 14 with equal probability; report 7
– Pick 3, 8, or 14 with probability 1:1:2
– Pick 3, 8, or 12 with equal probability; report 12

SLIDE 27

Not-So-Simple Random Sampling

When simple random sampling is not optimal/feasible

SLIDE 28

Frequency Estimation on Distributed Data
[Huang, Yi, Liu, Chen, INFOCOM’11]

• Given: a multiset S of n items drawn from the universe [u]
– For example: IP addresses of network packets
• S is partitioned arbitrarily and stored on k nodes
– Local count x_ij: frequency of item i on node j
– Global count y_i = Σ_j x_ij
• Goal: estimate y_i with additive error εn for all i
– Can’t hope for relative error for all y_i
– Heavy hitters are estimated well

SLIDE 29

Frequency Estimation: Standard Solutions

• Local heavy hitters
– Let n_j = Σ_i x_ij be the data size at node j
– Node j sends in all items with local frequency ≥ εn_j
– Total error is at most Σ_j εn_j = εn
– Communication cost: O(k/ε)
• Simple random sampling
– A simple random sample of size O(1/ε^2) can be used to estimate the frequency of any item with error εn
  • Extra log factor for all items
– Algorithm:
  • The coordinator first gets n_j for all j
  • Decides how many samples to take from each j
  • Gets the samples from the nodes
– Communication cost: O(k + 1/ε^2)

SLIDE 30

Importance Sampling

Each node j samples its local count x_ij with some probability h(x_ij).

Horvitz–Thompson estimator:
Y_ij = x_ij / h(x_ij) if x_ij is sampled, and 0 otherwise

Estimator for the global count y_i: Ŷ_i = Y_i1 + ⋯ + Y_ik
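The estimator can be sketched as follows (illustrative Python; `h` stands for the sampling-probability function whose choice the next slide discusses):

```python
import random

def ht_estimate(local_counts, h):
    """Unbiased Horvitz-Thompson estimate of y_i = sum_j x_ij:
    node j reports x_ij with probability h(x_ij); a reported count
    is inflated to x_ij / h(x_ij), an unreported one contributes 0."""
    est = 0.0
    for x in local_counts:
        p = h(x)
        if p > 0 and random.random() < p:
            est += x / p
    return est
```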

SLIDE 31

Importance Sampling: What is a good h(x)?

• Natural choice: h1(x) = (√k/(εn)) · x
– More precisely: h1(x) = min{(√k/(εn)) · x, 1}
– Can show: Var[Ŷ_i] = O((εn)^2) for any i
– Communication cost: O(√k/ε)
– This is (worst-case) optimal
• Interesting discovery: h2(x) = h1(x)^2
– Also has Var[Ŷ_i] = O((εn)^2) for any i
– Also has communication cost O(√k/ε) in the worst case
– But the cost can be much lower than with h1(x) on some inputs

SLIDE 32

h3(x) is Instance-Optimal

SLIDE 33

Median and Quantiles (order statistics)

• Exact quantiles: F^-1(φ) for 0 < φ < 1, where F is the CDF
• Approximate version: tolerate any answer between F^-1(φ − ε) and F^-1(φ + ε)

SLIDE 34

Estimating the Median by Random Sampling

• Simple random sampling
– An ε-approximation needs a sample of size Θ(1/ε^2)
• Paired sampling
– Divide the data into chunks of size s = O(1/ε)
– Sort each chunk
– Do binary merges down to one chunk
– Each merge keeps the odd-positioned or the even-positioned elements of the merged chunk with equal probability
– Similar ideas are used in discrepancy methods
• This needs O(n log s) time; how is it useful?

(Example: merging the sorted chunks 1 5 6 7 8 and 2 3 4 9 10 and keeping the odd positions yields 1 3 5 7 9.)
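A sketch of the paired-sampling merge in Python (my own code; it assumes a power-of-two number of equal-size sorted chunks):

```python
import random

def paired_merge(a, b):
    """Merge two sorted chunks, then keep the odd- or the even-positioned
    elements of the merged sequence with equal probability."""
    merged = sorted(a + b)
    return merged[random.randrange(2)::2]

def paired_sample(chunks):
    """Binary-merge a power-of-two number of equal-size sorted chunks
    down to a single chunk of the original chunk size."""
    while len(chunks) > 1:
        chunks = [paired_merge(chunks[i], chunks[i + 1])
                  for i in range(0, len(chunks), 2)]
    return chunks[0]
```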

SLIDE 35

Application 1: Streaming Computation
[Wang, Luo, Yi, Cormode, SIGMOD’13]

• Can merge chunks up as items arrive in the stream
• At any time, keep at most O(log n) chunks
• Space: O(1/ε · log n)
– Can be improved to O(1/ε · log(1/ε)) by combining with random sampling
– Can find all quantiles [Felber, Ostrovsky, ’15]
• Reservoir sampling needs O(1/ε^2) space
• The best deterministic algorithm needs O(1/ε · log n) space [Greenwald, Khanna, ’01]

SLIDE 36

Application 2: Distributed Data

• Data stored on k nodes
• Each node reduces its data to size O(1/(ε√k)) using paired sampling, and sends it to the coordinator
• The coordinator reduces all the data received to size O(1/ε) using paired sampling
• Communication cost: O(√k/ε)
• Looks familiar?

SLIDE 37

Generalization: ε-approximations

• A sample that preserves the “density” of point sets
– For any range (e.g., a circle),
  |fraction of sample points in the range − fraction of all points in the range| ≤ ε
– A simple random sample needs size O(1/ε^2)
– Paired sampling yields size O(1/ε^(2d/(d+1))), where d is the dimension

SLIDE 38

ε-approximations on distributed data
[Huang, Yi, FOCS’14]

SLIDE 39

Database Workloads

• Transactional (OLTP)
– Deduct x dollars from account A, credit x dollars to account B
– Challenge: efficiency and correctness (ACID)
• Analytical (OLAP)
– Touches a large fraction of the data
– Many tables
– Complex conditions
– Challenge: efficiency
– Correctness?

Wander Join: Online Aggregation via Random Walks

SLIDE 40

Complex Analytical Queries (TPC-H)

SELECT SUM(l_extendedprice * (1 - l_discount))
FROM customer, lineitem, orders, nation, region
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND l_returnflag = 'R'
  AND c_nationkey = n_nationkey
  AND n_regionkey = r_regionkey
  AND r_name = 'ASIA'

This query finds the total revenue loss due to returned orders in a given region.

SLIDE 41

Online Aggregation [Haas, Hellerstein, Wang, SIGMOD’97]

SELECT ONLINE SUM(l_extendedprice * (1 - l_discount))
FROM customer, lineitem, orders, nation, region
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND l_returnflag = 'R'
  AND c_nationkey = n_nationkey
  AND n_regionkey = r_regionkey
  AND r_name = 'ASIA'
WITHTIME 60000 CONFIDENCE 95 REPORTINTERVAL 1000

Confidence interval: Pr[Ŷ − ε < Y < Ŷ + ε] > 0.95, where Ŷ is the current estimate and Y the true answer

SLIDE 42

Ripple Join [Haas, Hellerstein, SIGMOD’99]

• Store the tuples of each table in random order
• In each step:
– Read the next tuple from a table, in a round-robin fashion
– Join it with the sampled tuples from the other tables
• Works well for a full Cartesian product
– But most joins are sparse…

SLIDE 43

A Running Example

Customer (Nation, CID):
US 1, US 2, China 3, UK 4, China 5, US 6, China 7, UK 8, Japan 9, UK 10

Order (BuyerID, OrderID):
(4,1), (3,2), (1,3), (5,4), (5,5), (5,6), (3,7), (5,8), (3,9), (7,10)

Item (OrderID, ItemID, Price):
(4, 301, $2100), (2, 304, $100), (3, 201, $300), (4, 306, $500), (3, 401, $230), (1, 101, $800), (2, 201, $300), (5, 101, $200), (4, 301, $100), (2, 201, $600)

What’s the total revenue of all orders from customers in China?

Ripple join analysis: N: size of each table, e.g., 10^9; n: # tuples taken from each table; s: # estimators, e.g., 10^3. Need n^3 · 1/N^2 = s, i.e., n = N^(2/3) · s^(1/3) = 10^7.

SLIDE 44

Join as a Graph

(Figure: the tuples of tables R1, R2, R3 drawn as three layers of a graph, with edges between joining tuples. Conceptual only; never materialized.)

SLIDE 46

Join as a Graph

(The running-example tables viewed as a join graph: Customer–Order via CID = BuyerID, Order–Item via OrderID.)

SLIDES 47–50

Sampling by Random Walks

Wander join draws a join result by a random walk over the join graph of the running example: pick a customer in China uniformly (3 choices), follow an index to a uniform order of that customer, then to a uniform item of that order. For the walk that ends at the $500 item (customer 5, order 4):

Unbiased estimator: $500 / sampling prob. = $500 / (1/3 · 1/4 · 1/3) = $18,000

N: size of each table, e.g., 10^9
n: # tuples taken from each table = # random walks
s: # estimators, e.g., 10^3
n = s = 10^3
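A self-contained toy version of wander join on the running example (table contents transcribed from the slides; plain dictionaries stand in for the B-tree index lookups):

```python
import random
from collections import defaultdict

# Toy instance of the running example (Customer, Order, Item).
customers = {1: 'US', 2: 'US', 3: 'China', 4: 'UK', 5: 'China',
             6: 'US', 7: 'China', 8: 'UK', 9: 'Japan', 10: 'UK'}
orders = [(4, 1), (3, 2), (1, 3), (5, 4), (5, 5),
          (5, 6), (3, 7), (5, 8), (3, 9), (7, 10)]     # (BuyerID, OrderID)
items = [(4, 301, 2100), (2, 304, 100), (3, 201, 300), (4, 306, 500),
         (3, 401, 230), (1, 101, 800), (2, 201, 300), (5, 101, 200),
         (4, 301, 100), (2, 201, 600)]                  # (OrderID, ItemID, Price)

# Indexes on the join attributes.
orders_by_buyer = defaultdict(list)
for buyer, oid in orders:
    orders_by_buyer[buyer].append(oid)
items_by_order = defaultdict(list)
for oid, iid, price in items:
    items_by_order[oid].append(price)

china = [c for c, nation in customers.items() if nation == 'China']

def one_walk():
    """One walk Customer -> Order -> Item; returns an unbiased
    Horvitz-Thompson estimate of the China revenue."""
    c = random.choice(china)                  # prob. 1/|china|
    oids = orders_by_buyer.get(c)
    if not oids:
        return 0.0                            # walk rejected
    o = random.choice(oids)                   # prob. 1/deg(c)
    prices = items_by_order.get(o)
    if not prices:
        return 0.0                            # walk rejected
    price = random.choice(prices)             # prob. 1/deg(o)
    p = (1 / len(china)) * (1 / len(oids)) * (1 / len(prices))
    return price / p

def wander_join(num_walks):
    return sum(one_walk() for _ in range(num_walks)) / num_walks
```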

SLIDE 51

Walk Plan Optimization

• Structure of the data graph
• Selection predicates
– Starting table: use an index
– Table in the middle: reject the random walk
• Data distribution
– Non-uniformity may not be a bad thing!

(Figure: two join graphs between R1 and R2 with skewed degrees, 1 1 1 1 on one side and 5 6 3 on the other; depending on the direction of the skew, Var[R1 → R2] < Var[R2 → R1] or Var[R1 → R2] > Var[R2 → R1].)

SLIDE 52

Walk Plan Optimizer

• Enumerate all plans
• Conduct ~100 trial random walks using each plan
• Measure the variance of each plan
• Select the best plan
• All trial runs are still useful

SLIDE 53

Convergence Comparison
[Li, Wu, Yi, Zhao, SIGMOD’16 Best Paper Award]

SLIDE 54

Wander Join in PostgreSQL

Logarithmic growth due to the B-tree lookups used to find random neighbours

SLIDE 55

Running on Insufficient Memory (4GB)

• Insufficient memory incurs a heavy, one-time penalty
• Growth is still logarithmic
• Fundamentally: random sampling is at odds with hard disks
– But does it matter? Spark, in-memory DBs, RAMCloud…
– The algorithm is embarrassingly parallel

Comparison system: Turbo DBO [Dobra, Jermaine, Rusu, Xu, VLDB’09]

SLIDE 56

Accuracy Achieved in 1/10 the Time of a Full Join

SLIDE 57

Wander Join vs Ripple Join (Wander Join first in each row):
– Sampling methodology: independent but non-uniform vs. uniform but non-independent
– Index needed? yes vs. index or random storage
– Confidence interval computation: easy, O(n) time vs. complicated, O(n^k) time (k: # tables)
– Convergence time (20GB data, 3 tables): ~3s vs. ~50s
– Scalability: logarithmic vs. slightly less than linear
– System implementation: PostgreSQL (finished), Oracle (in progress), SparkSQL (in progress) vs. Informix (internal project), DBO

SLIDE 58

Online Aggregation vs Data Cube (Online Aggregation first in each row):
– Queries: online, ad hoc vs. offline, fixed
– Latency: seconds vs. hours, then milliseconds
– Query mode: one at a time vs. batch
– Accuracy: small error vs. no error
– Data schema: any (relational, graph) vs. multidimensional cube
– Works with OLTP: integrated vs. separate
– Target scenario: online, ad hoc, interactive data analytics vs. monthly reports

SLIDE 59

Thank you!

SLIDE 60

Index Ripple Join [Lipton, Naughton, Schneider, SIGMOD’90]

(The running-example tables again; note the Order table is now stored in a different, random order: (4,8), (3,5), (1,3), (5,4), (5,2), (5,3), (3,7), (5,1), (3,9), (7,10).)

SLIDE 61

Sampling from a B-tree [Olken, ’93]

• Sampling from an aggregate (ranked) B-tree is easy
• But it
– incurs a heavy cost for transactions
– needs modifications to existing B-tree implementations

SLIDE 62

Rejection Sampling [Olken, ’93]

• Imagine every node has the maximum fanout
• Reject as soon as the walk steps out of bounds
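A sketch of the rejection-sampling walk in Python (my own code; the tree is nested lists, and uniformity assumes all leaves are at the same depth, as in a B-tree):

```python
import random

def rejection_sample(root, max_fanout):
    """Olken-style rejection sampling from a tree without aggregate
    counts: pretend every internal node has max_fanout children and
    restart the walk whenever it steps into a nonexistent slot. With
    all leaves at the same depth, accepted leaves are uniform."""
    while True:
        cur = root
        while isinstance(cur, list):          # internal node
            slot = random.randrange(max_fanout)
            if slot >= len(cur):              # walked out of bounds
                cur = None                    # reject and restart
                break
            cur = cur[slot]
        if cur is not None:
            return cur
```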

SLIDE 63

Non-Uniform Sampling

• As long as we can compute the sampling probability, wander join still works!

(Figure: each tuple annotated with its non-uniform sampling probability, products such as 1/(3·4), 1/(3·2), 1/(3·3).)

SLIDE 64

Compare with BlinkDB [Agarwal, Mozafari, Panda, Milner, Madden, Stoica, ’13]

(Wander Join first in each row:)
– Methodology: query, then sampling vs. sampling, then query
– Sampling method: random walks vs. stratified sampling
– Joins supported: any vs. a big table joining a small table (no sampling on the small table)
– Error: reduces over time vs. fixed
– Data schema: any (relational, graph) vs. star / snowflake
– Works with OLTP: integrated vs. separate
– Group-by support: unbalanced vs. balanced