Autonomous ETL With Materialized Views Abhishek Somani, Adesh Rao - - PowerPoint PPT Presentation

autonomous etl with materialized views
SMART_READER_LITE
LIVE PREVIEW

Autonomous ETL With Materialized Views Abhishek Somani, Adesh Rao - - PowerPoint PPT Presentation

Autonomous ETL With Materialized Views Abhishek Somani, Adesh Rao May 2018 Agenda 1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views


slide-1
SLIDE 1

Autonomous ETL With Materialized Views

Abhishek Somani, Adesh Rao

May 2018

slide-2
SLIDE 2

2

Agenda

1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution

slide-3
SLIDE 3

3

Agenda

1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution

slide-4
SLIDE 4

4

  • Partitioning

Data structuring for SQL-on-Hadoop

slide-5
SLIDE 5

5

  • Columnar File Formats

Data organization for SQL-on-Hadoop

ORC Parquet

slide-6
SLIDE 6

6

  • Sorting
  • Bucketing

Data organization for SQL-on-Hadoop

slide-7
SLIDE 7

7

Data organization for SQL-on-Hadoop

Speedup of Unsorted vs Sorted ORC data on TPCDS scale 1000

slide-8
SLIDE 8

8

Agenda

1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution

slide-9
SLIDE 9

9

Difficulties in Structuring Data

  • Evolving query patterns
  • Data pipeline dependencies
  • Large number of consumers
  • Data Admin Involvement
  • Downtime
  • Workload Aware identification
  • f optimal data structure
  • Flexibility of data structuring
  • Seamless restructuring
  • Continuous and automatic

maintenance NO DOWNTIME!

slide-10
SLIDE 10

10

Agenda

1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution

slide-11
SLIDE 11

11

  • A materialized view is a database object that contains the results of a query.
  • It is a view for which the data has been materialized.
  • Materialized Views can be consumed automatically by the query engine

Example:

CREATE MATERIALIZED VIEW mv AS SELECT seller_id, seller_name, num_item*cost AS value FROM sales;

Effect: Query rewrite

SELECT seller_id, num_item*cost AS value FROM sales; ~ SELECT seller_id, value FROM mv;

Basics: Materialized View

slide-12
SLIDE 12

12

Interesting properties of Materialized Views in Hive:

  • A copy of the data(full, partial or transformed)
  • Used automatically by the engine based on cost analysis
  • Can be stored as ORC, Parquet etc
  • Multiple materialized views can co-exist, optimally chosen

Plus: Storage is cheap Idea: Create multiple materialized views of the full data with desired structures

Materialized Views in Hive for Data Restructuring

slide-13
SLIDE 13

13

Query1: SELECT * from T1 where customer_id = 26988 and month

= “January”;

Rewritten: SELECT * from MV1 where customer_id = 26988 and

month = “January”;

Query2: SELECT * from T1 where seller_id = 121 and month =

“January”;

Rewritten: SELECT * from MV2 where seller_id = 121 and month =

“January”;

Materialized Views for Data Restructuring

Example:

Original Table T1:

  • Partitioned on Year, Month, Day
  • Stored as Text

Materialized View MV1:

  • Partitioned on Year, Month, Day
  • Sorted on Customer_Id
  • Stored as ORC

Materialized View MV2:

  • Partitioned on Year, Month, Day
  • Sorted on Seller_Id
  • Stored as ORC
slide-14
SLIDE 14

14

Materialized Views in SQL-on-Hadoop engines

  • Basic implementation available in Apache Hive 2.3.0

○ Uses Apache Calcite for query optimization and query rewrite ○ Multi file format support. Uses ORC (by default) for optimized columnar storage of materialized queries

  • Not available in Presto
  • Not available in Spark
slide-15
SLIDE 15

15

Agenda

1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution

slide-16
SLIDE 16

16

Challenges with Materialized Views

  • Invalidation

○ Only a subset of use cases can work with stale data

  • Rebuilds and Refreshes

○ Prohibitively expensive for full data copies

  • Maintenance Isolation

○ Ongoing queries get affected

slide-17
SLIDE 17

17

Agenda

1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution

slide-18
SLIDE 18

18

FastCopy: A framework for Autonomous Materialized Views

  • Materialized Views for Sorting, Partitioning and Bucketing for structuring data
  • Synchronous invalidation on table updates
  • Asynchronous automatic refreshes
  • Maintenance isolation by refreshes in their own scheduler queues, or even

their own cluster

  • Recommendation Engine to suggest Materialized Views
  • Cross engine support for using Materialized Views
slide-19
SLIDE 19

19

Qubole FastCopy Infrastructure

slide-20
SLIDE 20

20

Qubole FastCopy Infrastructure

FastCopy Creation

slide-21
SLIDE 21

21

Qubole FastCopy Infrastructure

FastCopy Creation

slide-22
SLIDE 22

22

Qubole FastCopy Infrastructure

FastCopy Creation

slide-23
SLIDE 23

23

Qubole FastCopy Infrastructure

FastCopy Creation FastCopy Creation

slide-24
SLIDE 24

24

Qubole FastCopy Infrastructure

Incoming query for rewrite

slide-25
SLIDE 25

25

Qubole FastCopy Infrastructure

Query Rewrite

slide-26
SLIDE 26

26

Qubole FastCopy Infrastructure

Query Rewrite

slide-27
SLIDE 27

27

Qubole FastCopy Infrastructure

Query Rewrite

slide-28
SLIDE 28

28

Qubole FastCopy Infrastructure

Invalidation and Refresh

slide-29
SLIDE 29

29

Qubole FastCopy Infrastructure

Invalidation and Refresh

slide-30
SLIDE 30

30

Qubole FastCopy Infrastructure

Invalidation and Refresh

slide-31
SLIDE 31

31

Qubole FastCopy Infrastructure

Invalidation and Refresh

slide-32
SLIDE 32

32

Qubole FastCopy Infrastructure

Invalidation and Refresh

slide-33
SLIDE 33

33

Qubole FastCopy Infrastructure

Invalidation and Refresh

slide-34
SLIDE 34

34

Qubole FastCopy Infrastructure

Invalidation and Refresh

slide-35
SLIDE 35

35

Fun Details

  • Auto detect added, dropped or updated partitions using partition level tokens
  • Multi Version Concurrency Control for FastCopy
  • Minion clusters for workload isolation
slide-36
SLIDE 36
  • Top Tables

36

Recommendations

slide-37
SLIDE 37
  • Top Tables

37

Recommendations

slide-38
SLIDE 38
  • Column Usage as Filter predicates

38

Recommendations

slide-39
SLIDE 39
  • Column Usage as Filter predicates

39

Recommendations

slide-40
SLIDE 40
  • Column Usage as Filter predicates

40

Recommendations

slide-41
SLIDE 41
  • Top Tables

41

Recommendations

slide-42
SLIDE 42
  • Top Tables

42

Recommendations

slide-43
SLIDE 43
  • Column Usage as Filter predicates

43

Recommendations

slide-44
SLIDE 44
  • Column Usage as Filter predicates

44

Recommendations

slide-45
SLIDE 45
  • Column Usage as Filter predicates

45

Recommendations

slide-46
SLIDE 46

46

Revise

1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution

slide-47
SLIDE 47

47

Status

  • FastCopy is at an internal Alpha
  • Will soon be released as a beta for customers in the next Quarter
  • Contribute to Open Source

Thank You

slide-48
SLIDE 48

Thank You

Abhishek Somani, Adesh Rao

May 2018