Insights of Approximate Query Processing Systems Presented by: - - PowerPoint PPT Presentation

insights of approximate query processing systems
SMART_READER_LITE
LIVE PREVIEW

Insights of Approximate Query Processing Systems Presented by: - - PowerPoint PPT Presentation

Insights of Approximate Query Processing Systems Presented by: Huanyi Chen Ruoxi Zhang Agenda Introduction Background VerdictDB & SnappyData Experiment Setup Evaluation Insights Insights of Approximate Query Processing


slide-1
SLIDE 1

Insights of Approximate Query Processing Systems

Presented by: Huanyi Chen Ruoxi Zhang

slide-2
SLIDE 2

Agenda

§ Introduction § Background § VerdictDB & SnappyData § Experiment Setup § Evaluation § Insights

Insights of Approximate Query Processing Systems PAGE 2

slide-3
SLIDE 3

Why AQP?

Insights of Approximate Query Processing Systems PAGE 3

# of Day Income (CAD) 1 150 2 240 3 180 4 200 5 230 6 190 7 180 Avg(Income)

195.71

shop income

slide-4
SLIDE 4

Why AQP?

Insights of Approximate Query Processing Systems PAGE 4

# of Day Income (CAD) 1 150 2 240 3 180 4 200 5 230 6 190 7 180 Avg(Income)

186.67

shop income

more efficient (50% rows) accuracy > 95%

195.71

slide-5
SLIDE 5

Why AQP?

Insights of Approximate Query Processing Systems PAGE 5

99.9% Identical 100x-200x Faster

slide-6
SLIDE 6

Sampling Based AQP

Insights of Approximate Query Processing Systems PAGE 6

slide-7
SLIDE 7

Sampling Based AQP

Insights of Approximate Query Processing Systems PAGE 7

Query Column Set (QCS)

slide-8
SLIDE 8

Why SnappyData & VerdictDB ?

Insights of Approximate Query Processing Systems PAGE 8

Name Online/ Offline Distributed/ Standalone Platform Algorithm Skewed BlinkDB Offline Distributed Hive/Hadoop (Shark) Stratified sampling Yes Sapprox Online Distributed Hadoop Distribution-aware Online sampling No Approxhadoop Online Distributed Hadoop Approximation-enabled MapReduce No Quickr Online Distributed N/A ASALQA algorithm No SnappyData Online Distributed Spark and GemFire Spark as a computational engine; GemFire as transactional store No FluoDB Online Distributed Spark Mini-batch execution OLA Model No XDB Online Standalone PostgreSQL Wander join No VerdictDB Online Standalone Spark SQL Database learning No IDEA Online Standalone N/A Reuse answers of past

  • verlapping queries for new

query No BEAS Online Standalone Commercial DBMS Approximability theorem No ABS Online Standalone N/A Bootstrap No

  • Spark
  • Open-source*
slide-9
SLIDE 9

SnappyData

Insights of Approximate Query Processing Systems PAGE 9

SDE is NOT

  • pen

source

slide-10
SLIDE 10

SnappyData

Insights of Approximate Query Processing Systems PAGE 10

+ WITH ERROR QCS FRACTION

slide-11
SLIDE 11

VerdictDB

Insights of Approximate Query Processing Systems PAGE 11

slide-12
SLIDE 12

VerdictDB

Insights of Approximate Query Processing Systems PAGE 12

slide-13
SLIDE 13

Experiment Setup

Insights of Approximate Query Processing Systems PAGE 13

§ Cluster Setup

§ SnappyData: 1 locator, 1 lead, and 2 servers

slide-14
SLIDE 14

Experiment Setup

Insights of Approximate Query Processing Systems PAGE 14

§ Cluster Setup

§ SnappyData: 1 locator, 1 lead, and 2 servers § VerdictDB on Spark: 1 master and 2 executors

§ Each Node

§ 24/32 GB memory used § 500 GB HDD

slide-15
SLIDE 15

Experiment Setup

Insights of Approximate Query Processing Systems PAGE 15

§ TPC-H Benchmark

§ OLAP § 22 queries includes Aggregation, Join, etc. § Well known and standard § Customizable

§ Data

§ 1GB and 10GB § Uniformly distributed

slide-16
SLIDE 16

Evaluation

Insights of Approximate Query Processing Systems PAGE 16

SnappyData

  • Stratified Sampling
  • In-memory

VerdictDB

  • Uniform Sampling
  • Not in-memory (bug?)
slide-17
SLIDE 17

SnappyData - Latency

Insights of Approximate Query Processing Systems PAGE 17

29439 1832 6629

8092 1399 3870 5,000 10,000 15,000 20,000 25,000 30,000 35,000

Q1 Q6 Q14

Execution time (ms) using TPC-H (SF=10, fraction 0.1)

SnappyData SnappyData_AQP (>95% accuracy)

Q1: Up to 3.6x speedup ~0.0001 Error

slide-18
SLIDE 18

fraction 0.1

SnappyData - Accuracy

Insights of Approximate Query Processing Systems PAGE 18

Base Table

fraction 0.01

Sample Tables ...

0.00 0.00 0.01 0.01 0.02 0.02 fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3

Actual Error for TPC-H Q14 result (SF=10) given different sample tables (fraction)

1,000 2,000 3,000 4,000 5,000 6,000 7,000 fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3 Snappy

Time (ms) for TPC-H Q14 result (SF=10) given different sample tables (fraction)

slide-19
SLIDE 19

SnappyData- Creating Sample Tables

Insights of Approximate Query Processing Systems PAGE 19

50,000 100,000 150,000 200,000 250,000 fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3

Time (ms) for creating SnappyData sample tables with different fractions

slide-20
SLIDE 20

VerdictDB - Latency

Insights of Approximate Query Processing Systems PAGE 20

206034 88912 99195 17598 15210 24355 50,000 100,000 150,000 200,000 250,000 Q1 Q6 Q14

Execution time (ms) using TPC-H (SF=10, fraction 0.1)

SparkSQL VerdictDB (> 95% accuracy)

Up to ~11x speedup!

slide-21
SLIDE 21

VerdictDB - Speedup

Insights of Approximate Query Processing Systems PAGE 21

2 4 6 8 10 12 14 Q1 Q6 Q14

Speedup for TPC-H (SF=10, fraction=0.1)

Speedup 2 4 6 8 10 12 Q1 Q6 Q14

Speedup for TPC-H (SF=1, fraction=0.1)

Speedup

slide-22
SLIDE 22

VerdictDB - Creating Sample Tables

Insights of Approximate Query Processing Systems PAGE 22

100,000 200,000 300,000 400,000 500,000 600,000 700,000 800,000 900,000 fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3

Time (ms) for creating VerdictDB sample tables with different fraction

slide-23
SLIDE 23

fraction 0.1

VerdictDB - Accuracy

Insights of Approximate Query Processing Systems PAGE 23

Base Table

fraction 0.01

Sample Tables ...

0.05 0.1 0.15 0.2 0.25 fraction 0.01 fraction 0.05 fraction 0.1 fraction 0.2 fraction 0.3

Actual Error for TPC-H Q14 result (SF=10) given different sample tables (fraction)

20,000 40,000 60,000 80,000 100,000 120,000 fraction 0.01 fraction 0.05 fraction 0.1 fraction 0.2fraction 0.3 SparkSQL

Time (ms) for TPC-H Q14 result (SF=10) given different sample tables (fraction)

converge!

slide-24
SLIDE 24

Other Queries?

Insights of Approximate Query Processing Systems PAGE 24

Q19 Error: ~ 80% Speedup: ~5.5X Q14 Error: ~ 1.7% Speedup: ~1.7X

slide-25
SLIDE 25

Other Queries?

Insights of Approximate Query Processing Systems PAGE 25

Key missing in sample tables! Careful design of sample table

  • r original table!

Q7 AQP not working!

slide-26
SLIDE 26

Insights

Insights of Approximate Query Processing Systems PAGE 26

§ AQP performs well:

§ For aggregate functions such as SUM, AVG and COUNT § When WHERE is simple

§ Users’ foreseen is important!

§ for both query and original table

slide-27
SLIDE 27

Future Work

Insights of Approximate Query Processing Systems PAGE 27

§ Test error estimation in sampling § Other sampling techniques

§ Biased Sampling

§ Database learning § Approximate hardware