ApproxJoin
Approximate Distributed Joins
10/2018
Do Le Quoc, Istemi Ekin Akkus, Pramod Bhatotia, Spyros Blanas, Ruichuan Chen, Christof Fetzer, Thorsten Strufe
1
[Figure: query plan with joins (⨝) over R1, R2, R3 and R4, followed by a projection πX]
Join is a critical operation in big data analytics systems, but it is very expensive.
Reduce the overhead of join operations using a sampling-based approach.
2
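[Figure: join example. R1 = {(B0, A1), (B1, A2), (B2, A2), …, (Bn, A2)}; R2 = {(C1, A1), (C2, A1), …, (Cm, A1), (C0, A2)}. R1 ⨝ R2 pairs B0 with each of C1, C2, …, Cm on key A1, and each of B1, B2, …, Bn with C0 on key A2.]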
3
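[Figure: sampling before the join. R1 = {(B0, A1), (B1, A2), (B2, A2), …, (Bn, A2)}; R2 = {(C1, A1), (C2, A1), …, (Cm, A1), (C0, A2)}. Sample(R1) = {(B2, A2), (B5, A2), …, (Bn-2, A2)} and Sample(R2) = {(C3, A1), (C4, A1), …, (Cm-1, A1)} share no join key.]
Sample(R1) ⨝ Sample(R2) = NULL ≠ Sample(R1 ⨝ R2)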
Sampling over joins is a challenging task regarding the output quality
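The gap between joining samples and sampling the join can be reproduced with a small sketch. The data sizes, the 10% sampling rate and the helper `hash_join` are assumptions made for illustration; they do not come from the slides.

```python
import random
from collections import defaultdict

def hash_join(r1, r2):
    """Equi-join two lists of (key, value) tuples on the key."""
    index = defaultdict(list)
    for key, value in r2:
        index[key].append(value)
    return [(key, v1, v2) for key, v1 in r1 for v2 in index.get(key, ())]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical inputs: 100 join keys, 10 tuples per key in each relation.
r1 = [(k, f"B{k}_{i}") for k in range(100) for i in range(10)]
r2 = [(k, f"C{k}_{i}") for k in range(100) for i in range(10)]
percent = 10

full_join = hash_join(r1, r2)  # 100 keys * 10 * 10 tuples

# Sample after the join: a uniform 10% sample of the join output.
sample_after = rng.sample(full_join, len(full_join) * percent // 100)

# Sample before the join: 10% of each input, then join the two samples.
s1 = rng.sample(r1, len(r1) * percent // 100)
s2 = rng.sample(r2, len(r2) * percent // 100)
sample_before = hash_join(s1, s2)

print(len(full_join), len(sample_after), len(sample_before))
```

Joining two independent 10% samples keeps a join pair only if both of its sides were drawn, so the expected output is roughly 1% of the join rather than 10%, and keys with few tuples can vanish entirely.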
4
[Figure: join of R1 and R2 in which only keys A1 and A2 match. R1 also holds (D1, A3), (D2, A3), …, (Dk, A3) and R2 also holds (E1, A4), (E2, A4), …, (El, A4); keys A3 and A4 have no partner in the other relation.]
Non-join items cause unnecessary data shuffling through the cluster
5
SparkSQL (SIGMOD’15), SnappyData (SIGMOD’16): use pre-existing samples to serve queries
RippleJoin (SIGMOD’99), WanderJoin (SIGMOD’16): use an online aggregation approach for joins
AQUA (SIGMOD’99), Sampling over joins (SIGMOD’99): require a priori knowledge of the inputs (statistical info, indices)
Designed for single-node systems
Do not support sampling over joins
7
8
ApproxJoin
Input datasets
Filtering (Bloom filters) + sampling over distributed join
Reduce shuffled data size
Achieve low latency
SELECT SUM(R1.V + R2.V + … + Rn.V)
FROM R1, R2, …, Rn
WHERE R1.A = R2.A = … = Rn.A
WITHIN 120 seconds OR ERROR 0.05 CONFIDENCE 95%
Approximate Result
192.68 ± 0.05 (95% confidence)
[Diagram labels: R1, R2, …, Rn]
9
Input datasets: R1, R2
Build Bloom filters: BF(R1), BF(R2); JoinBF = BF(R1) & BF(R2)
Filter to keep overlapping items: apply JoinBF to R1 and to R2
Sampling
Join result
10
[Figure: R1 = {(B0, A1), (B1, A2), (B2, A2), …, (Bn, A2), (D1, A3), (D2, A3), …, (Dk, A3)}; R2 = {(C1, A1), (C2, A1), …, (Cm, A1), (C0, A2), (E1, A4), (E2, A4), …, (El, A4)}. Only keys A1 and A2 contribute to R1 ⨝ R2.]
BF(R1) = {A1, A2, A3}
BF(R2) = {A1, A2, A4}
JoinBF = {A1, A2}
Use JoinBF to remove non-join items
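The filtering step can be sketched in a few lines of Python. This is a minimal illustration, not ApproxJoin's implementation: the `BloomFilter` class, its sizes, the SHA-256 based hashing and the sample tuples are all assumptions. The one property it relies on is that two Bloom filters built with the same size and hash functions can be intersected with a bitwise AND.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def intersect(self, other):
        # AND is only meaningful for filters with identical parameters.
        assert (self.num_bits, self.num_hashes) == (other.num_bits, other.num_hashes)
        result = BloomFilter(self.num_bits, self.num_hashes)
        result.bits = self.bits & other.bits
        return result

def build_filter(relation):
    bf = BloomFilter()
    for key, _value in relation:
        bf.add(key)
    return bf

# Hypothetical tuples mirroring the slide: (join key, payload).
r1 = [("A1", "B0"), ("A2", "B1"), ("A2", "B2"), ("A3", "D1"), ("A3", "D2")]
r2 = [("A1", "C1"), ("A1", "C2"), ("A2", "C0"), ("A4", "E1"), ("A4", "E2")]

join_bf = build_filter(r1).intersect(build_filter(r2))
r1_filtered = [t for t in r1 if t[0] in join_bf]  # drops the A3 tuples
r2_filtered = [t for t in r2 if t[0] in join_bf]  # drops the A4 tuples
print(r1_filtered)
print(r2_filtered)
```

A Bloom filter never reports a false negative, so no joinable tuple is lost; a false positive only lets a non-join tuple through to the shuffle, where the join itself still discards it.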
11
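[Figure: CoGroup gathers the tuples of R1 and R2 by join key: A1 groups B0 with C1, C2, …, Cm; A2 groups B1, B2, …, Bn with C0.]
Stratified sampling
[Figure: within each key group only a sample of the join pairs is produced: (B0, C1), (B0, C3), …, (B0, Cm-2) for A1 and (B2, C0), (B5, C0), …, (Bn-3, C0) for A2.]
The sampled output = Sample(R1 ⨝ R2)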
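One plausible way to sample during the join, treating each join key as a stratum, is sketched below. The slides do not give the exact procedure, so the with-replacement pair sampling, the scaling of each stratum by its full size, and the names `cogroup` and `sample_join_sum` are assumptions made for illustration.

```python
import random
from collections import defaultdict

def cogroup(r1, r2):
    """Group the values of both relations by join key."""
    groups = defaultdict(lambda: ([], []))
    for key, value in r1:
        groups[key][0].append(value)
    for key, value in r2:
        groups[key][1].append(value)
    return groups

def sample_join_sum(r1, r2, pairs_per_key, rng):
    """Estimate SUM(R1.V + R2.V) over the join without materializing it."""
    estimate = 0.0
    for left, right in cogroup(r1, r2).values():
        if not left or not right:
            continue  # key present on one side only: no join output
        stratum_size = len(left) * len(right)  # join pairs for this key
        # Draw join pairs uniformly at random, with replacement.
        sampled = [rng.choice(left) + rng.choice(right)
                   for _ in range(pairs_per_key)]
        # Scale the sample mean up to the size of the whole stratum.
        estimate += stratum_size * sum(sampled) / pairs_per_key
    return estimate

# Hypothetical data shaped like the slide: one B tuple and many C tuples
# for A1, many B tuples and one C tuple for A2.
r1 = [("A1", 1.0)] + [("A2", float(i)) for i in range(1, 51)]
r2 = [("A1", float(j)) for j in range(1, 51)] + [("A2", 1.0)]

exact = sum(v1 + v2 for k1, v1 in r1 for k2, v2 in r2 if k1 == k2)
estimate = sample_join_sum(r1, r2, pairs_per_key=10, rng=random.Random(0))
print(exact, estimate)
```

Because every key gets its own sample, no join key can drop out of the result, which is the failure mode of sampling the inputs before the join.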
12
SELECT SUM(R1.V + R2.V + … + Rn.V)
FROM R1, R2, …, Rn
WHERE R1.A = R2.A = … = Rn.A
WITHIN 120 seconds OR ERROR 0.05 CONFIDENCE 95%
Cluster configuration
Input datasets (HDFS)
Multi-way Bloom filter constructor
Sample size estimator (cost function)
Stratified sampling during the join operator
Aggregation engine (Apache Spark)
Error-bound estimator
Result: 192.68 ± 0.05 (95% confidence)
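A result of the form "value ± error (95% confidence)" can be produced from per-stratum samples with a textbook stratified-sampling bound. The sketch below uses a normal approximation and assumes each stratum was sampled with replacement; the slides do not specify ApproxJoin's estimator, and the function name and input layout are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def stratified_sum_with_error(strata, confidence=0.95):
    """strata: list of (stratum_size, sampled_values), one per join key.

    Returns (estimated_sum, margin) so that the true sum lies within
    estimated_sum ± margin at roughly the given confidence level.
    Each stratum needs at least two sampled values.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    total, var = 0.0, 0.0
    for size, values in strata:
        total += size * mean(values)
        # Variance of the scaled sample mean for this stratum.
        var += size ** 2 * variance(values) / len(values)
    return total, z * math.sqrt(var)

# Hypothetical samples: two join keys with 50 join pairs each.
strata = [
    (50, [12.0, 30.0, 7.0, 44.0, 25.0]),
    (50, [3.0, 41.0, 18.0, 26.0, 33.0]),
]
estimate, margin = stratified_sum_with_error(strata)
print(f"{estimate:.2f} ± {margin:.2f} (95% confidence)")
```

In a budget-driven setting, a margin larger than the requested error would call for a larger sample size in the strata that contribute most to the variance.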
13
14
See the paper for more results!
[Figure: latency (minutes, log scale) vs. overlap fraction (1% to 10%) for ApproxJoin, Spark repartition join and native Spark join]
15
Lower is better. At an overlap fraction of 1%, ApproxJoin is ~2.6X faster than Spark repartition join and ~8X faster than native Spark join.
[Figure: shuffled data size (MB, log scale) vs. overlap fraction (1% to 10%) for ApproxJoin, Spark repartition join and native Spark join]
16
Lower is better. At an overlap fraction of 1%, ApproxJoin shuffles ~29X less data than Spark repartition join and ~26X less than native Spark join.
[Figure: latency (minutes, log scale) vs. sampling fraction (10% to 90%) for ApproxJoin, Spark with sampling after join and Spark with sampling before join]
17
Lower is better. ApproxJoin is 3X to 7X faster than Spark with sampling after join, and 1.01X to 1.3X slower than Spark with sampling before join.
[Figure: accuracy loss (%, log scale) vs. sampling fraction (10% to 90%) for ApproxJoin (sample during join), Spark with sampling after join and Spark with sampling before join]
18
Lower is better. ApproxJoin achieves accuracy comparable to Spark with sampling after join and is ~42X more accurate than Spark with sampling before join.
19
20
ApproxJoin: Approximate Distributed Joins
Practical: adaptive execution based on query budget
Efficient: employs sketch and sampling techniques
Transparent: supports applications with minor code changes