Research Collection https://doi.org/10.3929/ethz-b-000294819 - - PDF document

research collection
SMART_READER_LITE
LIVE PREVIEW

Research Collection https://doi.org/10.3929/ethz-b-000294819 - - PDF document

Research Collection https://doi.org/10.3929/ethz-b-000294819 information please consult the Terms of use. This page was generated automatically upon download from the ETH Zurich Research Collection. For more Creative Commons Attribution 4.0


slide-1
SLIDE 1

Research Collection

Conference Paper

Pay One, Get Hundreds for Free: Reducing Cloud Costs through Shared Query Execution

Author(s): Marroquín, Renato; Müller, Ingo; Makreshanski, Darko; Alonso, Gustavo Publication Date: 2018-10 Permanent Link: https://doi.org/10.3929/ethz-b-000294819 Originally published in: http://doi.org/10.1145/3267809.3267822 Rights / License: Creative Commons Attribution 4.0 International This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.

ETH Library

slide-2
SLIDE 2

| |

Renato Marroquín, Ingo Müller, Darko Makreshanski, Gustavo Alonso

Marroquín, Müller, Makreshanski, Alonso 1

Pay One, Get Hundreds for Free:

Reducing Cloud Costs through Shared Query Execution

13.10.2018

slide-3
SLIDE 3

| |

  • Pros
  • No need to move data outside

the cloud

  • No infrastructure deployment
  • No database maintenance
  • Pricing model
  • Pay-per-byte-processed

Marroquín, Müller, Makreshanski, Alonso 2

Motivation: Query-as-a-service

13.10.2018

  • Disadvantages
  • Expensive with frequent usage
  • No intuitive way to optimize
  • Cost
  • Throughput
  • Popular Systems
slide-4
SLIDE 4

| | Marroquín, Müller, Makreshanski, Alonso 3

Motivation: Use cases

  • Observations:
  • High overlap among concurrent queries
  • Query burst from single user
  • Similar workload from multiple users

Work sharing: carry out redundant work only once.

Parameter exploration BI reporting Ad-hoc analytics

13.10.2018

slide-5
SLIDE 5

| | Marroquín, Müller, Makreshanski, Alonso 4

Our Goal

13.10.2018

Reduce query-as-a-service costs using work sharing techniques through query rewriting

slide-6
SLIDE 6

| |

  • Related work
  • Multi-query execution
  • Evaluation
  • Summary
  • Conclusions

Marroquín, Müller, Makreshanski, Alonso 5

Outline

13.10.2018

slide-7
SLIDE 7

| | Marroquín, Müller, Makreshanski, Alonso 6

Multi Query Execution: General sharing workflow

Collect Co-plan Co-execute Separate

13.10.2018

Our Focus

slide-8
SLIDE 8

| | Marroquín, Müller, Makreshanski, Alonso 7

Shared execution techniques

1988 – 2005 2005 – 2010 2010 – 2017 1st generation: Crescando 2nd generation: SharedDB, MQJoin, BatchDB

  • Annotate tuples with

query_id attribute

  • Supports all relational
  • perators

Name Age Sex query_id Renato 30 m Q1,Q3 Darko 30 m Q1,Q2,Q3 Ingo 32 m Q1

MQO Scan Sharing Work Sharing

13.10.2018

slide-9
SLIDE 9

| |

MQO Scan Sharing Work Sharing

Marroquín, Müller, Makreshanski, Alonso 8

Shared execution techniques

1988 – 2005 2005 – 2010 2010 – 2017

  • Previous approaches need dedicated engine
  • High implementation effort
  • Needs vendor support

This work: rewrite shared query plans as SQL.

13.10.2018

slide-10
SLIDE 10

| |

  • Related work
  • Multi-query execution

Marroquín, Müller, Makreshanski, Alonso 9

Outline

13.10.2018

slide-11
SLIDE 11

| | Marroquín, Müller, Makreshanski, Alonso 10

Multi Query Execution: Data-query model

Name Age Sex Renato 30 m Darko 30 m Ingo 32 m Gustavo 50 m SELECT * FROM Emp WHERE … Q1: … age < 40 Q2: … name LIKE ‘%k%’ Q3: … age <= 30 AND age > 20 Name Age Sex query_id Renato 30 m { 1, 3 } Darko 30 m { 1, 2, 3 } Ingo 32 m { 1 } Gustavo 50 m { }

13.10.2018

slide-12
SLIDE 12

| | Marroquín, Müller, Makreshanski, Alonso 11

Multi Query Execution: The query_id attribute

Name: VARCHAR Age: INT Sex: ENUM query_id: INTEGER ARRAY Renato 30 m { 1, 3 } Darko 30 m { 1, 2, 3 } Ingo 32 m { 1 }

  • Result of Qx is given by

SELECT * FROM R WHERE x = ANY(query_id)

  • This is standard SQL!

13.10.2018

slide-13
SLIDE 13

| | Marroquín, Müller, Makreshanski, Alonso 12

Shared operators: Shared scan

SELECT *, ARRAY_REMOVE( ARRAY[ CASE WHEN age < 40 THEN 1 ELSE 0 END, CASE WHEN name LIKE ‘%k%’ THEN 2 ELSE 0 END, CASE WHEN age <= 30 AND age > 20 THEN 3 ELSE 0 END ] , 0) AS query_id FROM Emp WHERE (age < 40) OR (name LIKE ‘%k%’) OR (age <= 30 AND age > 20) Name Age Sex Renato 30 m Darko 30 m Ingo 32 m Gustavo 50 m SELECT * FROM Emp WHERE … Q1: … age < 40 Q2: … name LIKE ‘%k%’ Q3: … age <= 30 AND age > 20

13.10.2018

slide-14
SLIDE 14

| | Marroquín, Müller, Makreshanski, Alonso 13

Shared operators: Shared join

SELECT * FROM Emp e JOIN Dep d ON e.did = d.did WHERE … Name Age Sex D_ID query_id Renato 30 m 1 { 1, 3 } Darko 30 m 1 { 1, 2, 3 } Ingo 32 m 1 { 1 } D_ID D_Name query_id 1 Systems { 2, 3 } 2 Algorithms { 2 } 3 ML { 1, 2, 3 } Name Age Sex D_ID D_Name query_id Renato 30 m 1 Systems { 3 } Darko 30 m 1 Systems { 2, 3 }

Employees Departments

13.10.2018

slide-15
SLIDE 15

| | Marroquín, Müller, Makreshanski, Alonso 14

Shared operators: Shared join

WITH shared_emp AS (…), -- shared scan on emp shared_dep AS (…), -- shared scan on dep shared_join_helper AS ( SELECT R.name, R.age, R.sex, S.d_id, S.d_name, ARRAY_INTERSECT(R.query_id, S.query_id) AS query_id FROM shared_emp e JOIN shared_dep d ON e.d_id = d.d_id) SELECT * FROM shared_join_helper WHERE CARDINALITY(query_id) > 0 Name Age Sex D_ID query_id Renato 30 m 1 { 1, 3 } Darko 30 m 1 { 1, 2, 3 } Ingo 32 m 1 { 1 } D_ID D_Name query_id 1 Systems { 2, 3 } 2 Algorithms { 2 } 3 ML { 1, 2, 3 }

Employees Departments

13.10.2018

slide-16
SLIDE 16

| |

  • GROUP BY
  • Use UNNEST to duplicate tuples for each query
  • Then group by query_id
  • LIMIT / TOP K
  • PARTITION BY query_id
  • Then filter by RANK()

Marroquín, Müller, Makreshanski, Alonso 15

Shared operators: Other shared operators

Enough to express all TPC-H queries!

13.10.2018

slide-17
SLIDE 17

| |

  • Related work
  • Multi-query execution
  • Evaluation
  • Summary
  • Conclusions

Marroquín, Müller, Makreshanski, Alonso 16

Outline

13.10.2018

slide-18
SLIDE 18

| | Marroquín, Müller, Makreshanski, Alonso 17

Evaluation: Query-as-a-Service systems

  • Systems-under-test
  • Amazon Athena
  • Google BigQuery
  • Run SQL queries against files in cloud storage
  • Apache Parquet
  • Google internal columnar storage
  • Micro-benchmarks
  • End-to-end query execution

13.10.2018

slide-19
SLIDE 19

| |

  • Data
  • Lineitem from TPC-H Scale Factor 100
  • Workload
  • Parameterized queries
  • Queries in a group/batch have equal selectivity

Marroquín, Müller, Makreshanski, Alonso 18

Evaluation: Shared scan

13.10.2018

slide-20
SLIDE 20

| | Marroquín, Müller, Makreshanski, Alonso 19

Evaluation: Shared scan

Google BigQuery

84GB table, native columnar format

Execution time increases sublinearly with query count.

13.10.2018

Queries Selectivity:

slide-21
SLIDE 21

| |

Amazon Athena

  • Pay per processed byte

Google Big Query: constant 0.011$

Marroquín, Müller, Makreshanski, Alonso 20

Evaluation: Shared scan

27GB, Parquet format

Monetary cost (almost) independent of batch size.

13.10.2018

Queries Selectivity:

slide-22
SLIDE 22

| |

  • Data
  • TPC-H Scale Factor 100
  • Simulate multiple apps interacting with QaaS
  • Each emitting different TPC-H queries
  • Workload
  • 128 instantiations of each query type

Marroquín, Müller, Makreshanski, Alonso 21

Evaluation: TPC-H

13.10.2018

slide-23
SLIDE 23

| | Marroquín, Müller, Makreshanski, Alonso 22

Evaluation: TPC-H

Considerable speedup: up to ~128x.

1 10 100 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22

Throughput Improvement TPC-H Query

Throughput improvement over Query-at-a-Time (batch size = 128)

Athena BigQuery

13.10.2018

slide-24
SLIDE 24

| | Marroquín, Müller, Makreshanski, Alonso 23

Evaluation: TPC-H

Again: cost (almost) independent of batch size.

13.10.2018

$80 <$1 $240 ~$14 $227 ~$1.7

0.10 1.00 10.00 100.00 1000.00

Athena BigQuery (all) BigQuery(without Q10)

Cost [$]

Query-at-a-time SharedExecution

Cost compared to Query-at-a-Time (batch size = 128)

128x 16x 107x

slide-25
SLIDE 25

| |

  • Related work
  • Multi-query execution
  • Evaluation
  • Summary
  • Conclusions

Marroquín, Müller, Makreshanski, Alonso 24

Outline

13.10.2018

slide-26
SLIDE 26

| | Marroquín, Müller, Makreshanski, Alonso 25

  • Shared query execution through query rewriting
  • Rewriting can be done with standard SQL
  • Evaluate our approach by executing queries end-to-end
  • Show improvements in execution cost and throughput

Summary Conclusion

  • Sharing can be implemented in:
  • Client
  • Library
  • Middleware

13.10.2018

slide-27
SLIDE 27

| | Marroquín, Müller, Makreshanski, Alonso 26

Thanks!

13.10.2018

slide-28
SLIDE 28

| |

Backup Slides

Marroquín, Müller, Makreshanski, Alonso 27 13.10.2018

slide-29
SLIDE 29

| | Marroquín, Müller, Makreshanski, Alonso 28

Evaluation: TPC-H

0.1 1 10 100 Athena BigQuery (all) BigQuery (without Q10)

Cost Improvement

Relative cost compared to Query-at-a-Time (batch size = 128)

Query-at-a-time SharedExecution

Again: cost (almost) independent of batch size.

128x 16x 107x

13.10.2018

$80 <$1 $240 ~$14 $227 ~$1.7

slide-30
SLIDE 30

| |

MQO Scan Sharing Work Sharing

Marroquín, Müller, Makreshanski, Alonso 29

Shared execution techniques

1988 – 2005 2005 – 2010 2010 – 2017 First generation: Crescando

  • Index predicates to

reduce work

  • Lower complexity of work
  • Only scans

idx q1 q2 … q3 row value

13.10.2018

slide-31
SLIDE 31

| | 30.05.2018 Marroquín, Müller, Makreshanski, Alonso 30

Shared plans

e_id Sal Q3,Q4

𝜏

Emp Dep

d_id Q1,Q2

sex = “m”

𝜏

Emp Dep ⨝ d_id Q1 age > 30

𝜏

Emp Dep ⨝ d_id Q2 sex = “f”

𝜏

Emp Sal ⨝ e_id Q3 age < 50

𝜏

Emp Sal ⨝ e_id Q4

slide-32
SLIDE 32

| | 30.05.2018 Marroquín, Müller, Makreshanski, Alonso 31

Shared plans

  • SQL can only express trees
  • Materialize nodes with

multiple consumers

T_emp := SELECT … FROM Emp … T_34 := SELECT … FROM T_emp … T_12 := SELECT … FROM T_emp … R1 := SELECT … FROM T_12 … R2 := SELECT … FROM T_12 … R3 := SELECT … FROM T_34 … R4 := SELECT … FROM T_34 …

𝜏

Emp Dep

d_id

e_id Sal Q3,Q4 Q1,Q2

slide-33
SLIDE 33

| | 30.05.2018 Marroquín, Müller, Makreshanski, Alonso 32

Shared plan: Evaluation

500 1000 1500 2000 2500

Execution Time [secs]

customer ⨝orders ⨝ lineitem ⨝supplier lineitem ⨝supplier

  • rders ⨝customer

part ⨝lineitem part ⨝partsupp supplier ⨝nation QueryExec.

slide-34
SLIDE 34

| | 30.05.2018 Marroquín, Müller, Makreshanski, Alonso 33

Shared plan: Evaluation

0.1 1 10 100 1 32 64 128

ThrouputImprovement Batch Size

LogicalPlanPerType SingleLogicalPlan

slide-35
SLIDE 35

| | 30.05.2018 Marroquín, Müller, Makreshanski, Alonso 34

Evaluation: Shared scan – Amazon Athena

Amazon Athena

27GB table, Apache Parquet format

Execution time increases sublinearly with query count.

1 10 100 1 2 4 8 16 32 64 128

Execution Time [secs] Batch Size

0.01% 0.10% 1% 10% 99%

slide-36
SLIDE 36

| | Marroquín, Müller, Makreshanski, Alonso 35

Shared operators: Predicate Indexing

SELECT *, CASE WHEN age < 40 THEN CASE WHEN age <= 30 AND age > 20 THEN CASE WHEN name LIKE ‘%k%’ THEN ARRAY[1,2,3] ELSE ARRAY[1,3] END ELSE CASE WHEN name LIKE ‘%k%’ THEN ARRAY[1,2] ELSE ARRAY[1] END ELSE CASE WHEN name LIKE ‘%k%’ THEN ARRAY[2] ELSE ARRAY[] END, END

AS query_id

FROM Emp WHERE (age < 40) OR (name LIKE ‘%k%’) OR (age <= 30 AND age > 20)

Name Age Sex Renato 30 m Darko 30 m Ingo 32 m Gustavo 50 m SELECT * FROM Emp WHERE … Q1: … age < 40 Q2: … name LIKE ‘%k%’ Q3: … age <= 30 AND age > 20

13.10.2018

slide-37
SLIDE 37

| |

Google BigQuery

  • Pay per processed byte

Marroquín, Müller, Makreshanski, Alonso 36

Evaluation: Shared scan + Predicate Indexing

84GB, Parquet format

  • Exec. time does not increase linearly with # of queries

2.5 5 7.5 10 12.5 15 17.5 1 2 4 8 16 32 64 128

Execution Time [secs] Batch Size LinearEval. MultiAttr.Idx.

13.10.2018