[PPT] - Autonomous ETL With Materialized Views Abhishek Somani, Adesh Rao PowerPoint Presentation

SLIDE 1

Autonomous ETL With Materialized Views

Abhishek Somani, Adesh Rao

May 2018

SLIDE 2

2

Agenda

1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution

SLIDE 3

3

Agenda

1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution

SLIDE 4

4

Partitioning

Data structuring for SQL-on-Hadoop

SLIDE 5

5

Columnar File Formats

Data organization for SQL-on-Hadoop

ORC Parquet

SLIDE 6

6

Sorting
Bucketing

Data organization for SQL-on-Hadoop

SLIDE 7

7

Data organization for SQL-on-Hadoop

Speedup of Unsorted vs Sorted ORC data on TPCDS scale 1000

SLIDE 8

8

Agenda

1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution

SLIDE 9

9

Difficulties in Structuring Data

Evolving query patterns
Data pipeline dependencies
Large number of consumers
Data Admin Involvement
Downtime
Workload Aware identification
f optimal data structure
Flexibility of data structuring
Seamless restructuring
Continuous and automatic

maintenance NO DOWNTIME!

SLIDE 10

10

Agenda

1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution

SLIDE 11

11

A materialized view is a database object that contains the results of a query.
It is a view for which the data has been materialized.
Materialized Views can be consumed automatically by the query engine

Example:

CREATE MATERIALIZED VIEW mv AS SELECT seller_id, seller_name, num_item*cost AS value FROM sales;

Effect: Query rewrite

SELECT seller_id, num_item*cost AS value FROM sales; ~ SELECT seller_id, value FROM mv;

Basics: Materialized View

SLIDE 12

12

Interesting properties of Materialized Views in Hive:

A copy of the data(full, partial or transformed)
Used automatically by the engine based on cost analysis
Can be stored as ORC, Parquet etc
Multiple materialized views can co-exist, optimally chosen

Plus: Storage is cheap Idea: Create multiple materialized views of the full data with desired structures

Materialized Views in Hive for Data Restructuring

SLIDE 13

13

Query1: SELECT * from T1 where customer_id = 26988 and month

= “January”;

Rewritten: SELECT * from MV1 where customer_id = 26988 and

month = “January”;

Query2: SELECT * from T1 where seller_id = 121 and month =

“January”;

Rewritten: SELECT * from MV2 where seller_id = 121 and month =

“January”;

Materialized Views for Data Restructuring

Example:

Original Table T1:

Partitioned on Year, Month, Day
Stored as Text

Materialized View MV1:

Partitioned on Year, Month, Day
Sorted on Customer_Id
Stored as ORC

Materialized View MV2:

Partitioned on Year, Month, Day
Sorted on Seller_Id
Stored as ORC

SLIDE 14

14

Materialized Views in SQL-on-Hadoop engines

Basic implementation available in Apache Hive 2.3.0

○ Uses Apache Calcite for query optimization and query rewrite ○ Multi file format support. Uses ORC (by default) for optimized columnar storage of materialized queries

Not available in Presto
Not available in Spark

SLIDE 15

15

Agenda

1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution

SLIDE 16

16

Challenges with Materialized Views

Invalidation

○ Only a subset of use cases can work with stale data

Rebuilds and Refreshes

○ Prohibitively expensive for full data copies

Maintenance Isolation

○ Ongoing queries get affected

SLIDE 17

17

Agenda

1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution

SLIDE 18

18

FastCopy: A framework for Autonomous Materialized Views

Materialized Views for Sorting, Partitioning and Bucketing for structuring data
Synchronous invalidation on table updates
Asynchronous automatic refreshes
Maintenance isolation by refreshes in their own scheduler queues, or even

their own cluster

Recommendation Engine to suggest Materialized Views
Cross engine support for using Materialized Views

SLIDE 19

19

Qubole FastCopy Infrastructure

SLIDE 20

20

Qubole FastCopy Infrastructure

FastCopy Creation

SLIDE 21

21

Qubole FastCopy Infrastructure

FastCopy Creation

SLIDE 22

22

Qubole FastCopy Infrastructure

FastCopy Creation

SLIDE 23

23

Qubole FastCopy Infrastructure

FastCopy Creation FastCopy Creation

SLIDE 24

24

Qubole FastCopy Infrastructure

Incoming query for rewrite

SLIDE 25

25

Qubole FastCopy Infrastructure

Query Rewrite

SLIDE 26

26

Qubole FastCopy Infrastructure

Query Rewrite

SLIDE 27

27

Qubole FastCopy Infrastructure

Query Rewrite

SLIDE 28

28

Qubole FastCopy Infrastructure

Invalidation and Refresh

SLIDE 29

29

Qubole FastCopy Infrastructure

Invalidation and Refresh

SLIDE 30

30

Qubole FastCopy Infrastructure

Invalidation and Refresh

SLIDE 31

31

Qubole FastCopy Infrastructure

Invalidation and Refresh

SLIDE 32

32

Qubole FastCopy Infrastructure

Invalidation and Refresh

SLIDE 33

33

Qubole FastCopy Infrastructure

Invalidation and Refresh

SLIDE 34

34

Qubole FastCopy Infrastructure

Invalidation and Refresh

SLIDE 35

35

Fun Details

Auto detect added, dropped or updated partitions using partition level tokens
Multi Version Concurrency Control for FastCopy
Minion clusters for workload isolation

SLIDE 36

Top Tables

36

Recommendations

SLIDE 37

Top Tables

37

Recommendations

SLIDE 38

Column Usage as Filter predicates

38

Recommendations

SLIDE 39

Column Usage as Filter predicates

39

Recommendations

SLIDE 40

Column Usage as Filter predicates

40

Recommendations

SLIDE 41

Top Tables

41

Recommendations

SLIDE 42

Top Tables

42

Recommendations

SLIDE 43

Column Usage as Filter predicates

43

Recommendations

SLIDE 44

Column Usage as Filter predicates

44

Recommendations

SLIDE 45

Column Usage as Filter predicates

45

Recommendations

SLIDE 46

46

Revise

1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution

SLIDE 47

47

Status

FastCopy is at an internal Alpha
Will soon be released as a beta for customers in the next Quarter
Contribute to Open Source

Thank You

SLIDE 48

Thank You

Abhishek Somani, Adesh Rao

May 2018