
SLIDE 1

ApproxJoin

Approximate Distributed Joins

10/2018

Do Le Quoc, Istemi Ekin Akkus, Pramod Bhatotia, Spyros Blanas, Ruichuan Chen, Christof Fetzer, Thorsten Strufe

SLIDE 2

Motivation

[Figure: query plan combining relations R1, R2, R3, and R4 through join (⨝) operators and a projection πX]

Join is a critical operation in big data analytics systems, but it is very expensive.

Reduce the overhead of join operations using a sampling-based approach.

SLIDE 3

Motivation

R1: (B0, A1), (B1, A2), (B2, A2), …, (Bn, A2)

R2: (C1, A1), (C2, A1), …, (Cm, A1), (C0, A2)

R1 ⨝ R2 = (B0, A1, C1), (B0, A1, C2), …, (B0, A1, Cm), (B1, A2, C0), (B2, A2, C0), …, (Bn, A2, C0)

SLIDE 4

Motivation

R1: (B0, A1), (B1, A2), (B2, A2), …, (Bn, A2)

R2: (C1, A1), (C2, A1), …, (Cm, A1), (C0, A2)

Sample(R1): (B2, A2), (B5, A2), …, (Bn-2, A2)

Sample(R2): (C3, A1), (C4, A1), …, (Cm-1, A1)

Sample(R1) ⨝ Sample(R2) = NULL ≠ Sample(R1 ⨝ R2)

Sampling over joins is a challenging task with respect to output quality.
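The effect on this slide can be reproduced with a small Python sketch. The data sizes, the 1% sampling rate, the seed, and the helper name `join` are illustrative assumptions, not taken from the talk.

```python
import random

def join(r1, r2):
    """Equi-join two lists of (value, key) records on their key."""
    r2_by_key = {}
    for c, key in r2:
        r2_by_key.setdefault(key, []).append(c)
    return [(b, key, c) for b, key in r1 for c in r2_by_key.get(key, [])]

# Hypothetical data shaped like the slide: R1 is almost all A2, R2 almost all A1.
n = m = 1000
r1 = [("B0", "A1")] + [(f"B{i}", "A2") for i in range(1, n + 1)]
r2 = [(f"C{j}", "A1") for j in range(1, m + 1)] + [("C0", "A2")]

full_join = join(r1, r2)  # n + m tuples: B0 with every Ci, every Bi with C0

# Sample roughly 1% of each input independently, then join the samples.
rng = random.Random(42)  # fixed seed, chosen arbitrarily
sample_r1 = rng.sample(r1, len(r1) // 100)
sample_r2 = rng.sample(r2, len(r2) // 100)
join_of_samples = join(sample_r1, sample_r2)

# The join of the samples is non-empty only if B0 or C0 happens to be drawn.
print(len(full_join), len(join_of_samples))
```

Because almost every record in R1 only matches the single record C0 (and vice versa), independent input samples rarely contain a matching pair, which is why the join of the samples is usually empty or far from a representative sample of the join.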

SLIDE 5

Motivation

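R1: (B0, A1), (B1, A2), (B2, A2), …, (Bn, A2)

R2: (C1, A1), (C2, A1), …, (Cm, A1), (C0, A2)

R1 ⨝ R2 = (B0, A1, C1), (B0, A1, C2), …, (B0, A1, Cm), (B1, A2, C0), (B2, A2, C0), …, (Bn, A2, C0)

Non-join items: (D1, A3), (D2, A3), …, (Dk, A3); (E1, A4), (E2, A4), …, (El, A4)

Non-join items cause unnecessary data shuffling through the cluster.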

SLIDE 6

State-of-the-art Systems

SparkSQL (SIGMOD’15), SnappyData (SIGMOD’16): use pre-existing samples to serve queries

RippleJoin (SIGMOD’99), WanderJoin (SIGMOD’16): use an online aggregation approach for joins

AQUA (SIGMOD’99), Sampling over joins (SIGMOD’99): require a priori knowledge of inputs (statistical info, indices)

SLIDE 7

State-of-the-art Systems

SparkSQL (SIGMOD’15), SnappyData (SIGMOD’16): use pre-existing samples to serve queries

RippleJoin (SIGMOD’99), WanderJoin (SIGMOD’16): use an online aggregation approach for joins

AQUA (SIGMOD’99), Sampling over joins (SIGMOD’99): require a priori knowledge of inputs (statistical info, indices)

Designed for single-node systems

Do not support sampling over joins

SLIDE 8

Outline

Motivation
Design
Evaluation


SLIDE 9

ApproxJoin: System Overview

Input datasets: R1, R2, …, Rn

Query:

SELECT SUM(R1.V + R2.V + … + Rn.V)
FROM R1, R2, …, Rn
WHERE R1.A = R2.A = … = Rn.A
WITHIN 120 seconds OR ERROR 0.05 CONFIDENCE 95%

ApproxJoin: filtering (Bloom filters) + sampling over distributed join

Reduce shuffled data size

Achieve low latency

Approximate result: 192.68 ± 0.05 (95% confidence)

SLIDE 10

ApproxJoin: Core Idea

Input datasets: R1, R2

Build Bloom filter: JoinBF = BF(R1) & BF(R2)

Filter out overlap items: keep the items of R1 and R2 that pass JoinBF

Sampling: sample over the filtered R1 and R2 to produce the join result

SLIDE 11

ApproxJoin: Filtering

R1: (B0, A1), (B1, A2), (B2, A2), …, (Bn, A2)

R2: (C1, A1), (C2, A1), …, (Cm, A1), (C0, A2)

R1 ⨝ R2 = (B0, A1, C1), (B0, A1, C2), …, (B0, A1, Cm), (B1, A2, C0), (B2, A2, C0), …, (Bn, A2, C0)

Non-join items: (D1, A3), (D2, A3), …, (Dk, A3); (E1, A4), (E2, A4), …, (El, A4)

BF(R1) = {A1, A2, A3}

BF(R2) = {A1, A2, A4}

JoinBF = {A1, A2}

Use JoinBF to remove non-join items
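A minimal Python sketch of the filtering step, assuming a toy Bloom filter stored in a Python integer. The class `BloomFilter`, the helper `build_filter`, the filter size, and the hash construction are illustrative assumptions and not ApproxJoin's implementation. The join filter is the bitwise AND of the two per-relation filters.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter; the bit array is stored in a Python int."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def intersect(self, other):
        """Bitwise AND of two filters built with identical parameters."""
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share size and hash count")
        result = BloomFilter(self.num_bits, self.num_hashes)
        result.bits = self.bits & other.bits
        return result

def build_filter(relation):
    """Insert the join key of every (value, key) record into a new filter."""
    bf = BloomFilter()
    for _value, key in relation:
        bf.add(key)
    return bf

# Small instance of the slide's example: A3 occurs only in R1, A4 only in R2.
r1 = [("B0", "A1"), ("B1", "A2"), ("B2", "A2"), ("D1", "A3"), ("D2", "A3")]
r2 = [("C1", "A1"), ("C2", "A1"), ("C0", "A2"), ("E1", "A4"), ("E2", "A4")]

join_bf = build_filter(r1).intersect(build_filter(r2))

# Only records whose key passes JoinBF need to be shuffled for the join.
r1_filtered = [rec for rec in r1 if rec[1] in join_bf]
r2_filtered = [rec for rec in r2 if rec[1] in join_bf]
```

The AND of two Bloom filters never drops a key present in both inputs, so no join result is lost. It can let a few non-join keys through as false positives, and the join itself then discards them.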

SLIDE 12

ApproxJoin: Sampling

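R1: (B0, A1), (B1, A2), (B2, A2), …, (Bn, A2)

R2: (C1, A1), (C2, A1), …, (Cm, A1), (C0, A2)

CoGroup (by join key):

A1: {B0} from R1; {C1, C2, …, Cm} from R2

A2: {B1, B2, …, Bn} from R1; {C0} from R2

Stratified sampling:

Sample(R1 ⨝ R2) = (B0, A1, C1), (B0, A1, C3), …, (B0, A1, Cm-2), (B2, A2, C0), (B5, A2, C0), …, (Bn-3, A2, C0)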
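One way to picture sampling during the join is the following single-machine Python sketch. It treats each join key as a stratum and samples that key's cross product by index, so the full join is never materialized. The function names, the per-stratum sampling fraction, and sampling without replacement are assumptions made for illustration; the slides do not specify these details.

```python
import random
from collections import defaultdict

def cogroup(r1, r2):
    """Group both relations by join key: key -> (values from R1, values from R2)."""
    groups = defaultdict(lambda: ([], []))
    for b, key in r1:
        groups[key][0].append(b)
    for c, key in r2:
        groups[key][1].append(c)
    return groups

def sample_join(r1, r2, fraction, rng):
    """Stratified sample of R1 join R2, one stratum per join key."""
    sample = []
    for key, (left, right) in cogroup(r1, r2).items():
        if not left or not right:
            continue  # key occurs on one side only, so it produces no output
        stratum_size = len(left) * len(right)  # size of this key's cross product
        k = max(1, round(fraction * stratum_size))
        # Draw k distinct positions in the implicit cross product.
        for idx in rng.sample(range(stratum_size), k):
            b = left[idx // len(right)]
            c = right[idx % len(right)]
            sample.append((b, key, c))
    return sample

# Hypothetical data shaped like the slide, with n = m = 100.
n = m = 100
r1 = [("B0", "A1")] + [(f"B{i}", "A2") for i in range(1, n + 1)]
r2 = [(f"C{j}", "A1") for j in range(1, m + 1)] + [("C0", "A2")]

sampled = sample_join(r1, r2, fraction=0.1, rng=random.Random(7))
print(len(sampled), {key for _b, key, _c in sampled})
```

Because every join key contributes its own share of the sample, both A1 and A2 are represented, unlike when the inputs are sampled independently before the join.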

SLIDE 13

ApproxJoin: Implementation

Inputs: input datasets (HDFS), cluster configuration, and the query:

SELECT SUM(R1.V + R2.V + … + Rn.V)
FROM R1, R2, …, Rn
WHERE R1.A = R2.A = … = Rn.A
WITHIN 120 seconds OR ERROR 0.05 CONFIDENCE 95%

Components:

Sample sizes estimator (cost function)

Multi-way Bloom filter constructor

Stratified sampling during join operator

Aggregation engine (Apache Spark)

Error-bound estimator

Result: 192.68 ± 0.05 (95% confidence)
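To illustrate how a result such as "192.68 ± 0.05 (95% confidence)" can be derived from a stratified sample, here is a sketch of a textbook stratified estimator for SUM with a normal-approximation confidence interval. This is an assumption about the general technique, not the estimator ApproxJoin actually uses; the function name `estimate_sum` and its input layout are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def estimate_sum(strata, confidence=0.95):
    """Estimate a population SUM from a stratified sample.

    strata: list of (population_size, sampled_values) pairs, one per stratum
            (for a join, one per join key). Values are assumed to be drawn
            without replacement.
    Returns (estimate, margin) so that the result reads estimate ± margin.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    total = 0.0
    total_variance = 0.0
    for population_size, values in strata:
        b = len(values)
        # Scale the sample mean up to the size of the whole stratum.
        total += population_size * mean(values)
        if b > 1:
            s2 = variance(values)            # sample variance in the stratum
            fpc = 1.0 - b / population_size  # finite population correction
            total_variance += population_size ** 2 * s2 / b * fpc
    return total, z * math.sqrt(total_variance)

# Hypothetical strata: (number of join tuples for the key, sampled values).
estimate, margin = estimate_sum([(100, [2.0, 2.0, 2.0]), (50, [1.0, 3.0])])
print(f"{estimate:.2f} ± {margin:.2f} (95% confidence)")
```

The margin shrinks as the per-stratum sample sizes grow, which is the trade-off a sample size estimator has to balance against the latency budget in the query.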

SLIDE 14

Outline

Motivation
Design
Evaluation


SLIDE 15

Experimental Setup

Evaluation questions
Latency vs overlap fraction
Shuffled data size vs overlap fraction
Latency vs sampling fraction
Testbed
Cluster: 10 nodes
Datasets:
Synthetic: Poisson distribution datasets, TPC-H
CAIDA Network traffic traces; Netflix Prize


See the paper for more results!

SLIDE 16

Latency

[Figure: latency in minutes (log scale, 0.1 to 1000) vs. overlap fraction (1% to 10%) for ApproxJoin, Spark repartition join, and native Spark join]

Lower is better.

ApproxJoin is ~2.6X and ~8X faster than Spark repartition join and native Spark join, respectively, at an overlap fraction of 1%.

SLIDE 17

Shuffled Data Size

[Figure: shuffled data size in MB (log scale, 0.1 to 1000) vs. overlap fraction (1% to 10%) for ApproxJoin, Spark repartition join, and native Spark join]

Lower is better.

ApproxJoin shuffles ~29X and ~26X less data than Spark repartition join and native Spark join, respectively, at an overlap fraction of 1%.

SLIDE 18

Latency

[Figure: latency in minutes (log scale, 0.1 to 1000) vs. sampling fraction (10% to 90%) for ApproxJoin, Spark with sampling after join, and Spark with sampling before join]

Lower is better.

ApproxJoin is 3X to 7X faster than Spark with sampling after join, and 1.01X to 1.3X slower than Spark with sampling before join.

SLIDE 19

Accuracy

[Figure: accuracy loss in % (log scale, 0.001 to 100) vs. sampling fraction (10% to 90%) for ApproxJoin (sample during join), Spark (sample after join), and Spark (sample before join)]

Lower is better.

ApproxJoin achieves accuracy comparable to Spark with sampling after join and is ~42X more accurate than Spark with sampling before join.

SLIDE 20

Outline

Motivation
Our work
Conclusion


SLIDE 21

Conclusion


ApproxJoin: Approximate Distributed Joins

Transparent: supports applications with minor code changes

Practical: adaptive execution based on query budget

Efficient: employs sketch & sampling techniques

Thank you!