ApproxJoin: Approximate Distributed Joins

SLIDE 1

ApproxJoin

Approximate Distributed Joins

10/2018

Do Le Quoc, Istemi Ekin Akkus, Pramod Bhatotia, Spyros Blanas, Ruichuan Chen, Christof Fetzer, Thorsten Strufe

SLIDE 2

Motivation

[Figure: query plan joining R1, R2, R3 and R4 (⨝), followed by a projection πX]

Join is a critical operation in big data analytics systems, but it is very expensive.

Goal: reduce the overhead of join operations using a sampling-based approach.

SLIDE 3

Motivation

R1 = {(B0, A1), (B1, A2), (B2, A2), …, (Bn, A2)}

R2 = {(C1, A1), (C2, A1), …, (Cm, A1), (C0, A2)}

R1 ⨝ R2 = {(B0, A1, C1), (B0, A1, C2), …, (B0, A1, Cm), (B1, A2, C0), (B2, A2, C0), …, (Bn, A2, C0)}

SLIDE 4

Motivation

R1 = {(B0, A1), (B1, A2), (B2, A2), …, (Bn, A2)}

R2 = {(C1, A1), (C2, A1), …, (Cm, A1), (C0, A2)}

Sample(R1) = {(B2, A2), (B5, A2), …, (Bn-2, A2)}

Sample(R2) = {(C3, A1), (C4, A1), …, (Cm-1, A1)}

Sample(R1) ⨝ Sample(R2) = NULL ≠ Sample(R1 ⨝ R2)

Sampling over joins is challenging with respect to output quality.
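The empty result above can be reproduced with a small sketch. The helper `equi_join`, the relation sizes (n = m = 6) and the hand-picked samples are assumptions made for illustration only; they mirror the shape of the slide's example rather than any code from the ApproxJoin system.

```python
from collections import defaultdict

def equi_join(r1, r2):
    """Join two lists of (value, key) pairs on key; returns (v1, key, v2) triples."""
    by_key = defaultdict(list)
    for value, key in r2:
        by_key[key].append(value)
    return [(v1, key, v2) for v1, key in r1 for v2 in by_key.get(key, [])]

# Toy relations shaped like the slide (n = m = 6 is an arbitrary choice).
r1 = [("B0", "A1")] + [(f"B{i}", "A2") for i in range(1, 7)]
r2 = [(f"C{i}", "A1") for i in range(1, 7)] + [("C0", "A2")]

# Hand-picked samples like those on the slide: Sample(R1) only contains key A2,
# Sample(R2) only contains key A1, so the two samples share no join key.
sample_r1 = [("B2", "A2"), ("B5", "A2"), ("B4", "A2")]
sample_r2 = [("C3", "A1"), ("C4", "A1"), ("C5", "A1")]

full_join = equi_join(r1, r2)                       # 12 tuples
join_of_samples = equi_join(sample_r1, sample_r2)   # empty list
```

Each sample covers about half of its input, yet the join of the samples is empty while the full join has 12 tuples. Sampling each input independently before the join can therefore miss matching keys entirely.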

SLIDE 5

Motivation

R1 = {(B0, A1), (B1, A2), (B2, A2), …, (Bn, A2)}

R2 = {(C1, A1), (C2, A1), …, (Cm, A1), (C0, A2)}

R1 ⨝ R2 = {(B0, A1, C1), (B0, A1, C2), …, (B0, A1, Cm), (B1, A2, C0), (B2, A2, C0), …, (Bn, A2, C0)}

Non-join items: (D1, A3), (D2, A3), …, (Dk, A3) and (E1, A4), (E2, A4), …, (El, A4)

Non-join items cause unnecessary data shuffling through the cluster.

SLIDE 6

State-of-the-art Systems

  • SparkSQL (SIGMOD’15), SnappyData (SIGMOD’16): use pre-existing samples to serve queries
  • RippleJoin (SIGMOD’99), WanderJoin (SIGMOD’16): use an online aggregation approach for joins
  • AQUA (SIGMOD’99), Sampling over joins (SIGMOD’99): require a priori knowledge of inputs (statistical info, indices)

SLIDE 7

State-of-the-art Systems

  • SparkSQL (SIGMOD’15), SnappyData (SIGMOD’16): use pre-existing samples to serve queries
  • RippleJoin (SIGMOD’99), WanderJoin (SIGMOD’16): use an online aggregation approach for joins
  • AQUA (SIGMOD’99), Sampling over joins (SIGMOD’99): require a priori knowledge of inputs (statistical info, indices)

Limitations:

  • Designed for single-node systems
  • Do not support sampling over joins

SLIDE 8

Outline

  • Motivation
  • Design
  • Evaluation

SLIDE 9

ApproxJoin: System Overview

Input datasets: R1, R2, …, Rn

Query with budget:

SELECT SUM(R1.V + R2.V + … + Rn.V)
FROM R1, R2, …, Rn
WHERE R1.A = R2.A = … = Rn.A
WITHIN 120 seconds OR ERROR 0.05 CONFIDENCE 95%

ApproxJoin = Filtering (Bloom filters) + Sampling over distributed join

Goals:

  • Reduce shuffled data size
  • Achieve low latency

Approximate result: 192.68 ± 0.05 (95% confidence)

SLIDE 10

ApproxJoin: Core Idea

  • Input datasets: R1, R2
  • Build Bloom filters: BF(R1), BF(R2), and JoinBF = BF(R1) & BF(R2)
  • Filter each input with JoinBF to keep only the overlap items: R1 ∩ JoinBF, R2 ∩ JoinBF
  • Sampling during the join
  • Join result

SLIDE 11

ApproxJoin: Filtering

R1 = {(B0, A1), (B1, A2), (B2, A2), …, (Bn, A2)}

R2 = {(C1, A1), (C2, A1), …, (Cm, A1), (C0, A2)}

R1 ⨝ R2 = {(B0, A1, C1), (B0, A1, C2), …, (B0, A1, Cm), (B1, A2, C0), (B2, A2, C0), …, (Bn, A2, C0)}

Non-join items: (D1, A3), (D2, A3), …, (Dk, A3) and (E1, A4), (E2, A4), …, (El, A4)

BF(R1) = {A1, A2, A3}

BF(R2) = {A1, A2, A4}

JoinBF = BF(R1) & BF(R2) = {A1, A2}

Use JoinBF to remove non-join items.
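The filtering step can be sketched as follows. This is a minimal single-machine illustration, not the system's distributed implementation: the `BloomFilter` class, its size (1024 bits), the number of hash functions (3) and the SHA-256 based hashing are all assumptions chosen for brevity. The one property it relies on is that two Bloom filters built with the same size and hash functions can be intersected with a bitwise AND.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; the bit array is stored in a Python int."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def intersect(self, other):
        # Bitwise AND; only valid for filters with identical parameters.
        assert (self.num_bits, self.num_hashes) == (other.num_bits, other.num_hashes)
        result = BloomFilter(self.num_bits, self.num_hashes)
        result.bits = self.bits & other.bits
        return result

def build_filter(relation):
    """Insert the join key of every (value, key) row into a new filter."""
    bf = BloomFilter()
    for _value, key in relation:
        bf.add(key)
    return bf

# Toy inputs: A1 and A2 appear on both sides; A3 and A4 are non-join keys.
r1 = [("B0", "A1"), ("B1", "A2"), ("B2", "A2"), ("D1", "A3"), ("D2", "A3")]
r2 = [("C1", "A1"), ("C2", "A1"), ("C0", "A2"), ("E1", "A4"), ("E2", "A4")]

join_bf = build_filter(r1).intersect(build_filter(r2))
r1_filtered = [row for row in r1 if row[1] in join_bf]
r2_filtered = [row for row in r2 if row[1] in join_bf]
```

Rows whose key occurs in both inputs always pass the filter, because a Bloom filter has no false negatives. A non-join row can occasionally pass as a false positive; it is then discarded by the join itself, so the filter affects only the amount of shuffled data, not correctness.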

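SLIDE 12

ApproxJoin: Sampling

R1 = {(B0, A1), (B1, A2), (B2, A2), …, (Bn, A2)}

R2 = {(C1, A1), (C2, A1), …, (Cm, A1), (C0, A2)}

CoGroup (items grouped by join key):

  • A1: {B0} from R1; {C1, C2, …, Cm} from R2
  • A2: {B1, B2, …, Bn} from R1; {C0} from R2

Stratified sampling over the groups:

Sample(R1 ⨝ R2) = {(B0, A1, C1), (B0, A1, C3), …, (B0, A1, Cm-2), (B2, A2, C0), (B5, A2, C0), …, (Bn-3, A2, C0)}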
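A minimal sketch of sampling during the join, assuming in-memory lists instead of a distributed cogroup. The function names `cogroup` and `sample_join`, the per-key `fraction` parameter and the choice to sample with replacement are illustrative assumptions, not the system's actual interface. Each join key is treated as one stratum, and join pairs are drawn inside a stratum without materializing its full cross product.

```python
import random
from collections import defaultdict

def cogroup(r1, r2):
    """Group both relations by join key: key -> (values from r1, values from r2)."""
    groups = defaultdict(lambda: ([], []))
    for value, key in r1:
        groups[key][0].append(value)
    for value, key in r2:
        groups[key][1].append(value)
    return groups

def sample_join(r1, r2, fraction, rng):
    """Draw about `fraction` of each key's join pairs without building the full join."""
    sample = []
    for key, (left, right) in cogroup(r1, r2).items():
        total_pairs = len(left) * len(right)
        if total_pairs == 0:
            continue  # key present on one side only: contributes nothing
        k = max(1, round(fraction * total_pairs))
        for _ in range(k):
            # Sampling with replacement: pick one item per side uniformly.
            sample.append((rng.choice(left), key, rng.choice(right)))
    return sample

# Toy inputs shaped like the slide (n = m = 6 is an arbitrary choice).
r1 = [("B0", "A1")] + [(f"B{i}", "A2") for i in range(1, 7)]
r2 = [(f"C{i}", "A1") for i in range(1, 7)] + [("C0", "A2")]

sample = sample_join(r1, r2, fraction=0.5, rng=random.Random(0))
```

Both keys have six join pairs here, so a fraction of 0.5 yields three sampled pairs per key. Unlike sampling the inputs before the join, every sampled tuple is a valid join result and every join key is represented.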

SLIDE 13

ApproxJoin: Implementation

Inputs: input datasets (HDFS), cluster configuration, and the query with its budget:

SELECT SUM(R1.V + R2.V + … + Rn.V)
FROM R1, R2, …, Rn
WHERE R1.A = R2.A = … = Rn.A
WITHIN 120 seconds OR ERROR 0.05 CONFIDENCE 95%

Components:

  • Multi-way Bloom filter constructor
  • Sample sizes estimator (cost function)
  • Stratified sampling during the join operator
  • Aggregation engine (Apache Spark)
  • Error-bound estimator

Result: 192.68 ± 0.05 (95% confidence)
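How an error bound such as "± 0.05 (95% confidence)" can be attached to a SUM over a stratified sample is sketched below. This is a textbook stratified estimator under stated assumptions (values sampled uniformly with replacement inside each stratum, normal approximation for the confidence interval); the function name `estimate_sum` and its input format are hypothetical, and the estimator used by ApproxJoin itself may differ in detail.

```python
import math
from statistics import NormalDist, mean, variance

def estimate_sum(strata, confidence=0.95):
    """Estimate a SUM from a stratified sample.

    strata: list of (population_size, sampled_values), one entry per join key,
            where population_size is the number of join pairs for that key.
    Returns (estimate, error_bound) so that the result reads estimate ± error_bound.
    """
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    estimate = 0.0
    var = 0.0
    for population_size, values in strata:
        # Scale the stratum's sample mean up to the stratum's population.
        estimate += population_size * mean(values)
        # A stratum with a single sampled value is treated as having zero
        # variance here, which is a simplification.
        if len(values) > 1:
            var += population_size ** 2 * variance(values) / len(values)
    return estimate, z * math.sqrt(var)

# Two strata: 6 join pairs with 3 sampled values, 4 join pairs with 2 sampled values.
estimate, error_bound = estimate_sum([(6, [2.0, 2.0, 2.0]), (4, [1.0, 3.0])])
```

In this example the estimate is 6 × 2 + 4 × 2 = 20, and only the second stratum contributes to the error bound because the first stratum's sampled values are identical. Larger samples per stratum shrink the bound, which is the trade-off the query budget (latency or error) controls.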

SLIDE 14

Outline

  • Motivation
  • Design
  • Evaluation

SLIDE 15

Experimental Setup

  • Evaluation questions
      • Latency vs overlap fraction
      • Shuffled data size vs overlap fraction
      • Latency vs sampling fraction
  • Testbed
      • Cluster: 10 nodes
      • Datasets:
          • Synthetic: Poisson distribution datasets, TPC-H
          • CAIDA network traffic traces; Netflix Prize

See the paper for more results!

SLIDE 16

Latency

[Figure: latency in minutes (log scale, 0.1 to 1000) vs overlap fraction (1% to 10%) for ApproxJoin, Spark repartition join, and native Spark join. Lower is better.]

At an overlap fraction of 1%, ApproxJoin is ~2.6X faster than Spark repartition join and ~8X faster than native Spark join.

SLIDE 17

Shuffled Data Size

[Figure: shuffled data size in MB (log scale, 0.1 to 1000) vs overlap fraction (1% to 10%) for ApproxJoin, Spark repartition join, and native Spark join. Lower is better.]

At an overlap fraction of 1%, ApproxJoin shuffles ~29X less data than Spark repartition join and ~26X less than native Spark join.

SLIDE 18

Latency

[Figure: latency in minutes (log scale, 0.1 to 1000) vs sampling fraction (10% to 90%) for ApproxJoin, Spark with sampling after join, and Spark with sampling before join. Lower is better.]

  • 3X to 7X faster than Spark with sampling after join
  • 1.01X to 1.3X slower than Spark with sampling before join

SLIDE 19

Accuracy

[Figure: accuracy loss in % (log scale, 0.001 to 100) vs sampling fraction (10% to 90%) for ApproxJoin (sampling during join), Spark with sampling after join, and Spark with sampling before join. Lower is better.]

  • Comparable accuracy to Spark with sampling after join
  • ~42X more accurate than Spark with sampling before join

SLIDE 20

Outline

  • Motivation
  • Our work
  • Conclusion

SLIDE 21

Conclusion

ApproxJoin: Approximate Distributed Joins

  • Transparent: supports applications with minor code changes
  • Practical: adaptive execution based on query budget
  • Efficient: employs sketch and sampling techniques

Thank you!