Insights of Approximate Query Processing Systems
Presented by: Huanyi Chen Ruoxi Zhang
Insights of Approximate Query Processing Systems Presented by: - - PowerPoint PPT Presentation
Insights of Approximate Query Processing Systems Presented by: Huanyi Chen Ruoxi Zhang Agenda Introduction Background VerdictDB & SnappyData Experiment Setup Evaluation Insights Insights of Approximate Query Processing
Presented by: Huanyi Chen Ruoxi Zhang
§ Introduction § Background § VerdictDB & SnappyData § Experiment Setup § Evaluation § Insights
Insights of Approximate Query Processing Systems PAGE 2
Insights of Approximate Query Processing Systems PAGE 3
# of Day Income (CAD) 1 150 2 240 3 180 4 200 5 230 6 190 7 180 Avg(Income)
shop income
Insights of Approximate Query Processing Systems PAGE 4
# of Day Income (CAD) 1 150 2 240 3 180 4 200 5 230 6 190 7 180 Avg(Income)
shop income
more efficient (50% rows) accuracy > 95%
Insights of Approximate Query Processing Systems PAGE 5
Insights of Approximate Query Processing Systems PAGE 6
Insights of Approximate Query Processing Systems PAGE 7
Query Column Set (QCS)
Insights of Approximate Query Processing Systems PAGE 8
Name Online/ Offline Distributed/ Standalone Platform Algorithm Skewed BlinkDB Offline Distributed Hive/Hadoop (Shark) Stratified sampling Yes Sapprox Online Distributed Hadoop Distribution-aware Online sampling No Approxhadoop Online Distributed Hadoop Approximation-enabled MapReduce No Quickr Online Distributed N/A ASALQA algorithm No SnappyData Online Distributed Spark and GemFire Spark as a computational engine; GemFire as transactional store No FluoDB Online Distributed Spark Mini-batch execution OLA Model No XDB Online Standalone PostgreSQL Wander join No VerdictDB Online Standalone Spark SQL Database learning No IDEA Online Standalone N/A Reuse answers of past
query No BEAS Online Standalone Commercial DBMS Approximability theorem No ABS Online Standalone N/A Bootstrap No
Insights of Approximate Query Processing Systems PAGE 9
SDE is NOT
source
Insights of Approximate Query Processing Systems PAGE 10
+ WITH ERROR QCS FRACTION
Insights of Approximate Query Processing Systems PAGE 11
Insights of Approximate Query Processing Systems PAGE 12
Insights of Approximate Query Processing Systems PAGE 13
§ Cluster Setup
§ SnappyData: 1 locator, 1 lead, and 2 servers
Insights of Approximate Query Processing Systems PAGE 14
§ Cluster Setup
§ SnappyData: 1 locator, 1 lead, and 2 servers § VerdictDB on Spark: 1 master and 2 executors
§ Each Node
§ 24/32 GB memory used § 500 GB HDD
Insights of Approximate Query Processing Systems PAGE 15
§ TPC-H Benchmark
§ OLAP § 22 queries includes Aggregation, Join, etc. § Well known and standard § Customizable
§ Data
§ 1GB and 10GB § Uniformly distributed
Insights of Approximate Query Processing Systems PAGE 16
Insights of Approximate Query Processing Systems PAGE 17
29439 1832 6629
8092 1399 3870 5,000 10,000 15,000 20,000 25,000 30,000 35,000
Q1 Q6 Q14
Execution time (ms) using TPC-H (SF=10, fraction 0.1)
SnappyData SnappyData_AQP (>95% accuracy)
Q1: Up to 3.6x speedup ~0.0001 Error
fraction 0.1
Insights of Approximate Query Processing Systems PAGE 18
Base Table
fraction 0.01
Sample Tables ...
0.00 0.00 0.01 0.01 0.02 0.02 fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3
Actual Error for TPC-H Q14 result (SF=10) given different sample tables (fraction)
1,000 2,000 3,000 4,000 5,000 6,000 7,000 fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3 Snappy
Time (ms) for TPC-H Q14 result (SF=10) given different sample tables (fraction)
Insights of Approximate Query Processing Systems PAGE 19
50,000 100,000 150,000 200,000 250,000 fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3
Time (ms) for creating SnappyData sample tables with different fractions
Insights of Approximate Query Processing Systems PAGE 20
206034 88912 99195 17598 15210 24355 50,000 100,000 150,000 200,000 250,000 Q1 Q6 Q14
Execution time (ms) using TPC-H (SF=10, fraction 0.1)
SparkSQL VerdictDB (> 95% accuracy)
Up to ~11x speedup!
Insights of Approximate Query Processing Systems PAGE 21
2 4 6 8 10 12 14 Q1 Q6 Q14
Speedup for TPC-H (SF=10, fraction=0.1)
Speedup 2 4 6 8 10 12 Q1 Q6 Q14
Speedup for TPC-H (SF=1, fraction=0.1)
Speedup
Insights of Approximate Query Processing Systems PAGE 22
100,000 200,000 300,000 400,000 500,000 600,000 700,000 800,000 900,000 fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3
Time (ms) for creating VerdictDB sample tables with different fraction
fraction 0.1
Insights of Approximate Query Processing Systems PAGE 23
Base Table
fraction 0.01
Sample Tables ...
0.05 0.1 0.15 0.2 0.25 fraction 0.01 fraction 0.05 fraction 0.1 fraction 0.2 fraction 0.3
Actual Error for TPC-H Q14 result (SF=10) given different sample tables (fraction)
20,000 40,000 60,000 80,000 100,000 120,000 fraction 0.01 fraction 0.05 fraction 0.1 fraction 0.2fraction 0.3 SparkSQL
Time (ms) for TPC-H Q14 result (SF=10) given different sample tables (fraction)
converge!
Insights of Approximate Query Processing Systems PAGE 24
Q19 Error: ~ 80% Speedup: ~5.5X Q14 Error: ~ 1.7% Speedup: ~1.7X
Insights of Approximate Query Processing Systems PAGE 25
Key missing in sample tables! Careful design of sample table
Q7 AQP not working!
Insights of Approximate Query Processing Systems PAGE 26
§ AQP performs well:
§ For aggregate functions such as SUM, AVG and COUNT § When WHERE is simple
§ Users’ foreseen is important!
§ for both query and original table
Insights of Approximate Query Processing Systems PAGE 27
§ Test error estimation in sampling § Other sampling techniques
§ Biased Sampling
§ Database learning § Approximate hardware