Tactical data engineering Julian Hyde April 1718, 2019 San - - PowerPoint PPT Presentation

tactical data engineering
SMART_READER_LITE
LIVE PREVIEW

Tactical data engineering Julian Hyde April 1718, 2019 San - - PowerPoint PPT Presentation

Tactical data engineering Julian Hyde April 1718, 2019 San Francisco @julianhyde DBMS Data pipeline & analytics DBMS tricks Evolving the data pipeline Tactical data Adaptive data engineering systems 1. DBMS query program


slide-1
SLIDE 1

Tactical data engineering

Julian Hyde

April 17–18, 2019 San Francisco

slide-2
SLIDE 2

@julianhyde

slide-3
SLIDE 3

DBMS Data pipeline & analytics DBMS tricks Tactical data engineering Evolving the data pipeline Adaptive data systems

slide-4
SLIDE 4
  • 1. DBMS
slide-5
SLIDE 5

File system vs. DBMS

file 1 file 2 program table 1 query table 2

slide-6
SLIDE 6

File system vs. DBMS

file 1 file 2 program file 1 file 2 program query

slide-7
SLIDE 7

Efficient join: reorganize the data and rewrite the program

sorted file 1 sorted file 2

program file 1 file 2 program (merge join) query

sorted file 2

slide-8
SLIDE 8
  • Abstraction
  • Declarative language
  • Planning
  • Easily reorganize data, add new algorithms
  • Governance
  • Metadata
  • Security

And, I propose:

  • Adaptability

DBMS adds value

slide-9
SLIDE 9
  • 2. Data pipeline
slide-10
SLIDE 10

The data pipeline: Extract - Load - Transform

Cloud DB

table table table table table table

source source

slide-11
SLIDE 11

The data pipeline: Extract - Load - Transform

Cloud DB

table table table table table table

source source SQL query business query interactive users

slide-12
SLIDE 12

File system vs. DBMS

file 1 file 2 program file 1 file 2 program query

slide-13
SLIDE 13

File system vs. DBMS vs. analytic data system

file 1 file 2 program file 1 file 2 program query Cloud DB

table table table table table table

SQL query business query

slide-14
SLIDE 14

File system vs. DBMS vs. analytic data system

file 1 file 2 program file 1 file 2 program query Cloud DB

table table table table table table

SQL query business query business users analysts programmers

slide-15
SLIDE 15
  • 3. DBMS tricks
slide-16
SLIDE 16

Re-organize data

a 1 c 3 c 4 b 2 a .. b c .. c

Index

a 1 b 2 c 3 c 4

Sort Raw data

a 1 c 4 c 3 b 2 a 1 c 3 c 4 b 2

Partition

a 1 c 3 c 4 b 2

Replicate

a 1 1 c 7 2 b 2 1

Summarize

slide-17
SLIDE 17

Caching

a 1 c 3 c 4 b 2

Raw data

a 1 c 3 c 4 b 2

Copy of data in memory

slide-18
SLIDE 18

Apache Calcite

Apache top-level project Query planning framework used in many projects and products Also works standalone: federated query engine with SQL / JDBC front end Apache community development model calcite.apache.org github.com/apache/calcite

slide-19
SLIDE 19

SELECT d.name, COUNT(*) AS c FROM Emps AS e JOIN Depts AS d USING (deptno) WHERE e.age < 40 GROUP BY d.deptno HAVING COUNT(*) > 5 ORDER BY c DESC

Relational algebra

Based on set theory, plus

  • perators: Project, Filter, Aggregate,

Union, Join, Sort Requires: declarative language (SQL), query planner Original goal: data independence Enables: query optimization, new algorithms and data structures

Scan [Emps] Scan [Depts] Join [e.deptno = d.deptno] Filter [e.age < 30] Aggregate [deptno, COUNT(*) AS c] Filter [c > 5] Project [name, c] Sort [c DESC]

slide-20
SLIDE 20

SELECT d.name, COUNT(*) AS c FROM (SELECT * FROM Emps WHERE e.age > 50) AS e JOIN Depts AS d USING (deptno) GROUP BY d.deptno HAVING COUNT(*) > 5 ORDER BY c DESC

Algebraic rewrite

Optimize by applying rewrite rules that preserve semantics Hopefully the result is less expensive; but it’s OK if it’s not (planner keeps “before” and “after”) Planner uses dynamic programming, seeking the lowest total cost

Scan [Emps] Scan [Depts] Join [e.deptno = d.deptno] Filter [e.age > 50] Aggregate [deptno, COUNT(*) AS c] Filter [c > 5] Project [name, c] Sort [c DESC]

slide-21
SLIDE 21

SELECT deptno, MIN(salary) FROM Managers WHERE age > 50 GROUP BY deptno

Views

Scan [Emps] Scan [Emps] Join [e.id = underling.manager] Project [id, deptno, salary, age] Aggregate [manager]

CREATE VIEW Managers AS SELECT * FROM Emps AS e WHERE EXISTS ( SELECT * FROM Emps AS underling WHERE underling.manager = e.id)

Filter [age > 50] Aggregate [deptno, MIN(salary)] Scan [Managers]

slide-22
SLIDE 22

SELECT deptno, MIN(salary) FROM Managers WHERE age > 50 GROUP BY deptno

View query (after expansion)

Scan [Emps] Scan [Emps] Join [e.id = underling.manager] Project [id, deptno, salary, age] Aggregate [manager]

CREATE VIEW Managers AS SELECT * FROM Emps AS e WHERE EXISTS ( SELECT * FROM Emps AS underling WHERE underling.manager = e.id)

Filter [age > 50] Aggregate [deptno, MIN(salary)]

slide-23
SLIDE 23

CREATE MATERIALIZED VIEW EmpSummary AS SELECT deptno, gender, COUNT(*) AS c, SUM(sal) AS s FROM Emps GROUP BY deptno, gender

Materialized view

Scan [Emps]

SELECT COUNT(*) AS c FROM Emps WHERE deptno = 10 AND gender = ‘M’

Filter [deptno = 10 AND gender = ‘M’] Aggregate [COUNT(*)] Scan [EmpSummary] = Scan [Emps] Aggregate [deptno, gender, COUNT(*), SUM(salary)]

slide-24
SLIDE 24

CREATE MATERIALIZED VIEW EmpSummary AS SELECT deptno, gender, COUNT(*) AS c, SUM(sal) AS s FROM Emps GROUP BY deptno, gender

Materialized view: rewrite query to match

Scan [Emps]

SELECT COUNT(*) AS c FROM Emps WHERE deptno = 10 AND gender = ‘M’

Filter [deptno = 10 AND gender = ‘M’] Scan [EmpSummary] = Scan [Emps] Aggregate [deptno, gender, COUNT(*), SUM(salary)] Aggregate [deptno, gender, COUNT(*), SUM(salary)] Project [c]

slide-25
SLIDE 25

CREATE MATERIALIZED VIEW EmpSummary AS SELECT deptno, gender, COUNT(*) AS c, SUM(sal) AS s FROM Emps GROUP BY deptno, gender

Materialized view: rewrite query to match

Scan [Emps]

SELECT COUNT(*) AS c FROM Emps WHERE deptno = 10 AND gender = ‘M’

Filter [deptno = 10 AND gender = ‘M’] Scan [EmpSummary] = Scan [Emps] Aggregate [deptno, gender, COUNT(*), SUM(salary)] Aggregate [deptno, gender, COUNT(*), SUM(salary)] Project [c]

slide-26
SLIDE 26

CREATE MATERIALIZED VIEW EmpSummary AS SELECT deptno, gender, COUNT(*) AS c, SUM(sal) AS s FROM Emps GROUP BY deptno, gender

Materialized view: substitute table scan

SELECT COUNT(*) AS c FROM Emps WHERE deptno = 10 AND gender = ‘M’

Filter [deptno = 10 AND gender = ‘M’] Scan [EmpSummary] = Scan [Emps] Aggregate [deptno, gender, COUNT(*), SUM(salary)] Project [c] Scan [EmpSummary]

slide-27
SLIDE 27

CREATE MATERIALIZED VIEW EmpSummary AS SELECT deptno, gender, COUNT(*) AS c, SUM(sal) AS s FROM Emps GROUP BY deptno, gender

Materialized view: substitute table scan

SELECT c FROM EmpSummary WHERE deptno = 10 AND gender = ‘M’

Filter [deptno = 10 AND gender = ‘M’] Scan [EmpSummary] = Scan [Emps] Aggregate [deptno, gender, COUNT(*), SUM(salary)] Project [c] Scan [EmpSummary]

slide-28
SLIDE 28
  • 4. Analytics
slide-29
SLIDE 29

“orders” view in LookML

view: orders { dimension: id { primary_key: yes type: number sql: ${TABLE}.id ;; } dimension: customer_id { # field: orders.customer_id sql: ${TABLE}.customer_id ;; } dimension: amount { # field: orders.amount type: number value_format: "0.00" sql: ${TABLE}.amount ;; } measure: count { # field: orders.count type: count # creates a sql COUNT(*) } measure: total_amount { type: sum sql: ${amount} ;; } }

slide-30
SLIDE 30
slide-31
SLIDE 31
slide-32
SLIDE 32
slide-33
SLIDE 33
  • 5. Evolving the data

pipeline

slide-34
SLIDE 34

Cloud DB

table table table table table table

source source SQL query business query interactive users

slide-35
SLIDE 35

Data engineering

table table table table table table

slide-36
SLIDE 36

Data engineering is not a static problem

table table table table table table table

file

In memory table table

slide-37
SLIDE 37

data engineer

Who is responsible for data engineering?

table table table table table table table

file

In memory table table

slide-38
SLIDE 38

system

(runtime adaptation) data scientist

analyst data engineer

Data engineering - empower users, reduce friction

table table table table table table table

file

In memory table table

slide-39
SLIDE 39

LookML - derived table (based

  • n SQL)

view: customer_order_facts { derived_table: { sql: SELECT customer_id, MIN(DATE(time)) AS first_order_date, SUM(amount) AS lifetime_amount FROM order GROUP BY customer_id ;; } dimension: customer_id { type: number primary_key: yes sql: ${TABLE}.customer_id ;; } dimension_group: first_order { type: time timeframes: [date, week, month] sql: ${TABLE}.first_order_date ;; } dimension: lifetime_amount { type: number value_format: "0.00" sql: ${TABLE}.lifetime_amount ;; } }

slide-40
SLIDE 40

LookML - derived table (based

  • n an Explore)

view: customer_order_facts { derived_table: { explore_source: orders { column: customer_id { field: order.customer_id } column: first_order { field: order.first_order } column: lifetime_amount { field: order.lifetime_amount } } } dimension: customer_id { type: number primary_key: yes sql: ${TABLE}.customer_id ;; } dimension_group: first_order { type: time timeframes: [date, week, month] sql: ${TABLE}.first_order_date ;; }

slide-41
SLIDE 41

Flavors of derived table

Derived table flavor Purpose SQL equivalent Ephemeral Query expansion CREATE VIEW Persistent Query is executed once, used by several queries until it expires CREATE TABLE AS SELECT Transparent Populated as persistent DT, but can be used even if the business query does not reference it by name CREATE MATERIALIZED VIEW Each flavor comes can be based on either an Explore or SQL

slide-42
SLIDE 42

Building materialized views

Challenges:

  • Design Which materializations to create?
  • Populate Load them with data
  • Maintain Incrementally populate when data changes
  • Rewrite Transparently rewrite queries to use materializations
  • Adapt Design and populate new materializations, drop unused ones
  • Express Need a rich algebra, to model how data is derived

Initial focus: summary tables (materialized views over star schemas)

slide-43
SLIDE 43

CREATE LATTICE Sales AS SELECT t.*, c.*, COUNT(*), SUM(s.units) FROM Sales AS s JOIN Time AS t USING (timeId) JOIN Customers AS c USING (customerId) JOIN Products AS p USING (productId);

Designing summary tables via lattices

CREATE MATERIALIZED VIEW SalesYearZipcode AS SELECT t.year, c.state, c.zipcode, COUNT(*), SUM(units) FROM Sales AS s JOIN Time AS t USING (timeId) JOIN Customers AS c USING (customerId) GROUP BY 1, 2, 3;

product product class sales

customers

time

slide-44
SLIDE 44

Many possible summary tables

Key z zipcode (43k) s state (50) g gender (2) y year (5) m month (12)

() 1

(z, s, g, y, m) 912k

(s, g, y, m) 6k (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 raw 1m (y, m) 60 (g, y) 10 (z, s) 43.4k (g, y, m) 120

Fewer than you would expect, because 5m combinations cannot

  • ccur in 1m row table

Fewer than you would expect, because state depends on zipcode

slide-45
SLIDE 45

Algorithm: Design summary tables

Given a database with 30 columns, 10M rows. Find X summary tables with under Y rows that improve query response time the most. AdaptiveMonteCarlo algorithm [1]:

  • Based on research [2]
  • Greedy algorithm that takes a combination of summary tables and tries to

find the table that yields the greatest cost/benefit improvement

  • Models “benefit” of the table as query time saved over simulated query load
  • The “cost” of a table is its size

[1] org.pentaho.aggdes.algorithm.impl.AdaptiveMonteCarloAlgorithm [2] Harinarayan, Rajaraman, Ullman (1996). “Implementing data cubes efficiently”

slide-46
SLIDE 46

Lattice (optimized)

() 1

(z, s, g, y, m) 912k

(s, g, y, m) 6k (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 (z, g, y, m) 909k (z, s, y, m) 831k raw 1m (z, s, g, m) 644k (z, s, g, y) 392k (y, m) 60 (z, s) 43.4k (z, s, g) 83.6k (g, y) 10 (g, y, m) 120 (g, m) 24

Key z zipcode (43k) s state (50) g gender (2) y year (5) m month (12)

slide-47
SLIDE 47

system

(runtime adaptation) data scientist

analyst data engineer

Data engineering - empower users, reduce friction

table table table table table table table

file

In memory table table

slide-48
SLIDE 48

data scientist system

(runtime adaptation)

analyst data engineer

Data engineering - productionize

table table table table table table table In memory table

file

table

slide-49
SLIDE 49

Adaptive data systems

queries DML statistics adaptations

recommender

Goals

  • Improve response time, throughput, storage cost
  • Predictable, adaptive (short and long term), allow human

intervention How?

  • Humans
  • Adaptive systems
  • Smart algorithms

Example adaptations

  • Cache disk blocks in memory
  • Cached query results
  • Data organization, e.g. partition on a different key
  • Secondary structures, e.g. b-tree and r-tree indexes
slide-50
SLIDE 50

Thank you! Any questions?

@julianhyde www.looker.com calcite.apache.org