Tactical data engineering Julian Hyde April 1718, 2019 San - PowerPoint PPT Presentation

Tactical data engineering Julian Hyde April 17–18, 2019 San Francisco

@julianhyde

DBMS Data pipeline & analytics DBMS tricks Evolving the data pipeline Tactical data Adaptive data engineering systems

1. DBMS

query program table 1 table 2 file 1 file 2 File system vs. DBMS

query program program file 1 file 2 file 1 file 2 File system vs. DBMS

query program program (merge join) sorted sorted file 1 file 2 file 1 file 2 sorted file 2 Efficient join: reorganize the data and rewrite the program

DBMS adds value Abstraction ● ● Declarative language ● Planning Easily reorganize data, add new algorithms ● Governance ● ● Metadata ● Security And, I propose: ● Adaptability

2. Data pipeline

Cloud DB source table table table table table table source The data pipeline: Extract - Load - Transform

interactive users business query SQL query Cloud DB source table table table table table table source The data pipeline: Extract - Load - Transform

query program program file 1 file 2 file 1 file 2 File system vs. DBMS

business query query SQL query program program Cloud DB table table table file 1 file 2 file 1 file 2 table table table File system vs. DBMS vs. analytic data system

business programmers analysts users business query query SQL query program program Cloud DB table table table file 1 file 2 file 1 file 2 table table table File system vs. DBMS vs. analytic data system

3. DBMS tricks

Re-organize data a 1 Raw data c 3 c 4 b 2 a 1 a 1 c 3 a 1 a 1 a 1 1 a .. b b 2 c 4 b 2 c 3 c 3 c 7 2 c .. c c 3 c 4 c 4 b 2 1 c 4 b 2 b 2 Index Sort Partition Replicate Summarize

Caching a 1 Raw data a 1 Copy of c 3 c 3 data in memory c 4 c 4 b 2 b 2

Apache Calcite Apache top-level project Query planning framework used in many projects and products Also works standalone: federated query engine with SQL / JDBC front end Apache community development model calcite.apache.org github.com/apache/calcite

SELECT d.name, COUNT (*) AS c FROM Emps AS e Relational algebra JOIN Depts AS d USING (deptno) WHERE e.age < 40 GROUP BY d.deptno HAVING COUNT (*) > 5 Sort [c DESC] Based on set theory, plus ORDER BY c DESC operators: Project, Filter, Aggregate, Project [name, c] Union, Join, Sort Filter [c > 5] Requires: declarative language (SQL), query planner Aggregate [deptno, COUNT(*) AS c] Original goal: data independence Filter [e.age < 30] Enables: query optimization, new Join [e.deptno = d.deptno] algorithms and data structures Scan [Emps] Scan [Depts]

SELECT d.name, COUNT (*) AS c FROM ( SELECT * FROM Emps Algebraic rewrite WHERE e.age > 50) AS e JOIN Depts AS d USING (deptno) GROUP BY d.deptno HAVING COUNT (*) > 5 Sort [c DESC] Optimize by applying rewrite rules that ORDER BY c DESC preserve semantics Project [name, c] Hopefully the result is less expensive; Filter [c > 5] but it’s OK if it’s not (planner keeps “before” and “after”) Aggregate [deptno, COUNT(*) AS c] Planner uses dynamic programming, Join [e.deptno = d.deptno] seeking the lowest total cost Filter [e.age > 50] Scan [Emps] Scan [Depts]

Views Aggregate [deptno, MIN(salary)] SELECT deptno, MIN (salary) FROM Managers Filter [age > 50] WHERE age > 50 GROUP BY deptno Scan [Managers] CREATE VIEW Managers AS Project [id, deptno, salary, age] SELECT * FROM Emps AS e Join [e.id = underling.manager] WHERE EXISTS ( SELECT * Aggregate [manager] FROM Emps AS underling WHERE underling.manager = e.id) Scan [Emps] Scan [Emps]

View query (after expansion) SELECT deptno, MIN (salary) FROM Managers WHERE age > 50 Aggregate [deptno, MIN(salary)] GROUP BY deptno Filter [age > 50] CREATE VIEW Managers AS Project [id, deptno, salary, age] SELECT * FROM Emps AS e Join [e.id = underling.manager] WHERE EXISTS ( SELECT * Aggregate [manager] FROM Emps AS underling WHERE underling.manager = e.id) Scan [Emps] Scan [Emps]

Materialized view Aggregate [deptno, gender, COUNT(*), SUM(salary)] [EmpSummary] = Scan CREATE MATERIALIZED VIEW EmpSummary AS Scan [Emps] SELECT deptno, gender, COUNT (*) AS c, SUM (sal) AS s FROM Emps GROUP BY deptno, gender Aggregate [COUNT(*)] SELECT COUNT (*) AS c FROM Emps Filter [deptno = 10 AND gender = ‘M’] WHERE deptno = 10 AND gender = ‘M’ Scan [Emps]

Materialized view: rewrite query to match Aggregate [deptno, gender, COUNT(*), SUM(salary)] [EmpSummary] = Scan CREATE MATERIALIZED VIEW EmpSummary AS Scan [Emps] SELECT deptno, gender, COUNT (*) AS c, SUM (sal) AS s FROM Emps GROUP BY deptno, gender Project [c] Filter [deptno = 10 AND gender = ‘M’] SELECT COUNT (*) AS c FROM Emps Aggregate [deptno, gender, COUNT(*), SUM(salary)] WHERE deptno = 10 AND gender = ‘M’ Scan [Emps]

Materialized view: substitute table scan Aggregate [deptno, gender, COUNT(*), SUM(salary)] [EmpSummary] = Scan CREATE MATERIALIZED VIEW EmpSummary AS Scan [Emps] SELECT deptno, gender, COUNT (*) AS c, SUM (sal) AS s FROM Emps GROUP BY deptno, gender Project [c] Filter [deptno = 10 AND gender = ‘M’] SELECT COUNT (*) AS c FROM Emps Scan [EmpSummary] WHERE deptno = 10 AND gender = ‘M’

Materialized view: substitute table scan Aggregate [deptno, gender, COUNT(*), SUM(salary)] [EmpSummary] = Scan CREATE MATERIALIZED VIEW EmpSummary AS Scan [Emps] SELECT deptno, gender, COUNT (*) AS c, SUM (sal) AS s FROM Emps GROUP BY deptno, gender Project [c] Filter [deptno = 10 AND gender = ‘M’] SELECT c FROM EmpSummary Scan [EmpSummary] WHERE deptno = 10 AND gender = ‘M’

4. Analytics

view: orders { dimension: id { primary_key: yes type: number sql: ${TABLE}.id ;; } dimension: customer_id { # field: orders.customer_id sql: ${TABLE}.customer_id ;; } dimension: amount { # field: orders.amount type: number value_format: "0.00" sql: ${TABLE}.amount ;; } measure: count { # field: orders.count type: count # creates a sql COUNT(*) } measure: total_amount { type: sum sql: ${amount} ;; “orders” view in LookML } }

5. Evolving the data pipeline

interactive users business query SQL query Cloud DB source table table table table table table source

table table table table table table Data engineering

table table In memory table table table table table table table file Data engineering is not a static problem

data engineer table table In memory table table table table table table table file Who is responsible for data engineering?

data data engineer analyst system scientist (runtime adaptation) table table In memory table table table table table table table file Data engineering - empower users, reduce friction

view: customer_order_facts { derived_table: { sql: SELECT customer_id, MIN(DATE(time)) AS first_order_date, SUM(amount) AS lifetime_amount FROM order GROUP BY customer_id ;; } dimension: customer_id { type: number primary_key: yes sql: ${TABLE}.customer_id ;; } dimension_group: first_order { type: time timeframes: [date, week, month] sql: ${TABLE}.first_order_date ;; } dimension: lifetime_amount { type: number value_format: "0.00" sql: ${TABLE}.lifetime_amount ;; LookML - derived table (based } } on SQL)

view: customer_order_facts { derived_table: { explore_source: orders { column: customer_id { field: order.customer_id } column: first_order { field: order.first_order } column: lifetime_amount { field: order.lifetime_amount } } } dimension: customer_id { type: number primary_key: yes sql: ${TABLE}.customer_id ;; } dimension_group: first_order { type: time timeframes: [date, week, month] LookML - derived table (based sql: ${TABLE}.first_order_date ;; } on an Explore)

Flavors of derived table Derived table flavor Purpose SQL equivalent Ephemeral Query expansion CREATE VIEW Persistent Query is executed once, used by CREATE TABLE AS several queries until it expires SELECT Transparent Populated as persistent DT, but CREATE MATERIALIZED can be used even if the business VIEW query does not reference it by name Each flavor comes can be based on either an Explore or SQL

Building materialized views Challenges: ● Design Which materializations to create? ● Populate Load them with data Maintain Incrementally populate when data changes ● Rewrite Transparently rewrite queries to use materializations ● ● Adapt Design and populate new materializations, drop unused ones ● Express Need a rich algebra, to model how data is derived Initial focus: summary tables (materialized views over star schemas)

Tactical data engineering Julian Hyde April 1718, 2019 San - PowerPoint PPT Presentation

Tactical data engineering Julian Hyde April 1718, 2019 San Francisco @julianhyde DBMS Data pipeline & analytics DBMS tricks Evolving the data pipeline Tactical data Adaptive data engineering systems 1. DBMS query program

Target or tactical June, 2020 spotting scopes TARGET OR TACTICAL SPOTTING SCOPES Target or

Tactical Exploitation Tactical Exploitation the other way to pen-test the other way to

Fullbacks Fullbacks Tactical and Positional analysis Tactical and Positional analysis Kieran

Delay assignment optimization strategies at pre-tactical and tactical levels A. Montlaur and L.

2012 National Service Conditions Championships Service/Tactical Pistol SERVICE/TACTICAL PISTOL

Tactical Analysis Board Games Marco Chiarandini Department of Mathematics & Computer Science

Tactical and Strategic AI Marco Chiarandini Department of Mathematics & Computer Science

Tactical Imperatives A quick summary of key tactical initiatives that will: Improve your

Enabling Enabling Data- -Intensive Science Intensive Science Data with Tactical Storage

Technical and Tactical Discussion October 2018 Sasha Rearick Skills of Successful WC skiers

Project Quicklook Final Presentation Tactical Satellite 3 System Design May 11, 2007 Team:

Update on the Finish SDR Program WInnComm Europe 2018 SDR Tactical Communications Workshop

CARMANAH SOLUTIONS FOR DEFENSE 1 Tactical Lighting Applications Over 40,000 Carmanah solar

Approach to Solve the AGC API Issue in the Tactical SDR Domain A Waveform Provider Perspective

Introduction to Tactical Generation with HPSG Woodley Packard University of Washington March 5,

Tactical and Strategic AI Marco Chiarandini Department of Mathematics & Computer Science

Information Retrieval Lecture 4 Recap of last time Postings pointer storage Dictionary

Announcements Coordinating with other presenters Presentation length: ~20 minutes HW1

SLWWW at the NTCIR-13 WWW Task Peng XIAO , Yimeng FAN , Lingtao Li, Tetsuya Sakai Waseda

RelevanceFeedback CISC489/689010,Lecture#15 Monday,April13 th

How to Make Beautiful Technical Documents with LaTeX PHYS 87 Benjam n Grinstein UCSD

IB SL PAPER 1 Document Based Question Paper 1 1 Hour in Length 30% of IB-SL History Grade (25

PS 405 Week 1 Section Intro to R and Summary Statistics D.J. Flynn January 14, 2014

Exam Preparation Advice (for ANLP and in general) Shay Cohen (based on slides by Sharon

Tactical data engineering Julian Hyde April 1718, 2019 San - PowerPoint PPT Presentation

Tactical data engineering Julian Hyde April 1718, 2019 San Francisco @julianhyde DBMS Data pipeline & analytics DBMS tricks Evolving the data pipeline Tactical data Adaptive data engineering systems 1. DBMS query program

Target or tactical June, 2020 spotting scopes TARGET OR TACTICAL SPOTTING SCOPES Target or

Tactical Exploitation Tactical Exploitation the other way to pen-test the other way to

Fullbacks Fullbacks Tactical and Positional analysis Tactical and Positional analysis Kieran

Delay assignment optimization strategies at pre-tactical and tactical levels A. Montlaur and L.

2012 National Service Conditions Championships Service/Tactical Pistol SERVICE/TACTICAL PISTOL

Tactical Analysis Board Games Marco Chiarandini Department of Mathematics &amp; Computer Science

Tactical and Strategic AI Marco Chiarandini Department of Mathematics &amp; Computer Science

Tactical Imperatives A quick summary of key tactical initiatives that will: Improve your

Enabling Enabling Data- -Intensive Science Intensive Science Data with Tactical Storage

Technical and Tactical Discussion October 2018 Sasha Rearick Skills of Successful WC skiers

Project Quicklook Final Presentation Tactical Satellite 3 System Design May 11, 2007 Team:

Update on the Finish SDR Program WInnComm Europe 2018 SDR Tactical Communications Workshop

CARMANAH SOLUTIONS FOR DEFENSE 1 Tactical Lighting Applications Over 40,000 Carmanah solar

Approach to Solve the AGC API Issue in the Tactical SDR Domain A Waveform Provider Perspective

Introduction to Tactical Generation with HPSG Woodley Packard University of Washington March 5,

Tactical and Strategic AI Marco Chiarandini Department of Mathematics &amp; Computer Science

Information Retrieval Lecture 4 Recap of last time Postings pointer storage Dictionary

Announcements Coordinating with other presenters Presentation length: ~20 minutes HW1

SLWWW at the NTCIR-13 WWW Task Peng XIAO , Yimeng FAN , Lingtao Li, Tetsuya Sakai Waseda

RelevanceFeedback CISC489/689010,Lecture#15 Monday,April13 th

How to Make Beautiful Technical Documents with LaTeX PHYS 87 Benjam n Grinstein UCSD

IB SL PAPER 1 Document Based Question Paper 1 1 Hour in Length 30% of IB-SL History Grade (25

PS 405 Week 1 Section Intro to R and Summary Statistics D.J. Flynn January 14, 2014

Exam Preparation Advice (for ANLP and in general) Shay Cohen (based on slides by Sharon

Tactical Analysis Board Games Marco Chiarandini Department of Mathematics & Computer Science

Tactical and Strategic AI Marco Chiarandini Department of Mathematics & Computer Science

Tactical and Strategic AI Marco Chiarandini Department of Mathematics & Computer Science