Joins Aggregates Optimization https://fdbresearch.github.io Dan - PowerPoint PPT Presentation

Joins → Aggregates → Optimization https://fdbresearch.github.io Dan Olteanu PhD Open School University of Warsaw November 24, 2018 1 / 1

Acknowledgements Some work reported in this course has been done in the context of the FDB project, LogicBlox, and RelationalAI by Zavodn´ y, Schleich, Kara, Nikolic, Zhang, Ciucanu, and Olteanu (Oxford) Abo Khamis and Ngo (RelationalAI), Nguyen (U. Michigan) Some of the following slides are derived from presentations by Aref (motivation) Abo Khamis (optimization diagrams) Kara (covers, IVM ǫ , and many graphics) Ngo (functional aggregate queries) Schleich (performance and quizzes) Lastly, Kara and Schleich proofread the slides. I would like to thank them for their support! 2 / 1

Goal of This Course Introduction to a principled approach to in-database computation This course starts where mainstream database courses finish. Part 1: Joins Part 2: Aggregates Part 3: Optimization ◮ Learning models inside vs outside the database ◮ From learning to factorized aggregate computation ◮ Learning under functional dependencies ◮ In-database linear algebra: Decompositions of matrices defined by joins 3 / 1

Outline of Part 3: Optimization 4 / 1

AI/ML: The Next Big Opportunity AI is emerging as general purpose technology ◮ Just as computing became general purpose 70 years ago A core ability of intelligence is the ability to predict ◮ Convert information you have into information you need The quality of the prediction is increasing as the cost per prediction is decreasing ◮ We use more of it to solve existing problems ◮ Consumer demand forecasting ◮ We use it for new problems where it was not used before ◮ From broadcast to personalized advertising ◮ From shop-then-ship to ship-then-shop 5 / 1

Most Enterprises Rely on Relational Data for AI Models Retail: 86% relational Insurance: 83% relational Marketing: 82% relational Financial: 77% relational Source: The State of Data Science & Machine Learning 2017, Kaggle, October 2017 (based on 2017 Kaggle survey of 16,000 ML practitioners) 6 / 1

Relational Model: The Jewel in the Database Crown Last 40 years have witnessed massive adoption of the Relational Model Many human hours invested in building relational models Relational databases are rich with knowledge of the underlying domains Availability of curated data made it possible to learn from the past and to predict the future for both humans (BI) and machines (AI) 7 / 1

Current State of Affairs in Building Predictive Models Design matrix Features → Current ML technology Samples THROWS AWAY the relational structure and domain knowledge that can help build BETTER MODELS 8 / 1

Learning over Relational Databases: Revisit from First Principles 9 / 1

In-database vs. Out-of-database Learning Feature Extraction DB ML tool θ Query materialized output = design matrix Model Out-of-database learning requires: [KBY17,PRWZ17] 1. Materializing the query result 2. DBMS data export and ML tool import 3. One/multi-hot encoding of categorical variables 10 / 1

In-database vs. Out-of-database Learning Feature Extraction DB ML tool θ Query materialized output = design matrix Model Out-of-database learning requires: [KBY17,PRWZ17] 1. Materializing the query result 2. DBMS data export and ML tool import 3. One/multi-hot encoding of categorical variables All these steps are very expensive and unnecessary! 10 / 1

In-database vs. Out-of-database Learning [ANNOS18a+b] Feature Extraction DB ML Tool θ Query materialized output = design matrix Model Optimized Model Reformulation Query+Aggregates Optimization Factorized Query Evaluation In-database learning exploits the query structure, the database schema, and the constraints. 11 / 1

Aggregation is the Aspiring to All Problems [SOANN19] Model # Features # Aggregates Supervised: Regression O ( n 2 ) Linear regression n O ( n d ) O ( n 2 d ) Polynomial regression degree d O ( n d ) O ( n 2 d ) Factorization machines degree d Supervised: Classification Decision tree ( k nodes) n O ( k · n · p · c ) ( c conditions/feature, p categories/label) Unsupervised k -means (const approx) n O ( k · n ) O ( k · n 2 ) PCA (rank k ) n O ( n 2 ) Chow-Liu tree n 12 / 1

Does This Matter in Practice? A Retailer Use Case Relation Cardinality Arity (Keys+Values) File Size (CSV) Inventory 84,055,817 3 + 1 2 GB Items 5,618 1 + 4 129 KB Stores 1,317 1 + 14 139 KB Demographics 1,302 1 + 15 161 KB Weather 1,159,457 2 + 6 33 MB 2.1 GB 13 / 1

Out-of-Database Solution: PostgreSQL+TensorFlow Train a linear regression model to predict inventory units Design matrix defined by the natural join of all relations, where the join keys are removed Join of Inventory, Items, Stores, Demographics, Weather Cardinality (# rows) 84,055,817 Arity (# columns) 44 (3 + 41) Size on disk 23GB Time to compute in PostgreSQL 217 secs Time to Export from PostgreSQL 373 secs Time to learn parameters with TensorFlow ∗ > 12,000 secs TensorFlow: 1 epoch; no shuffling; 100K tuple batch; FTRL gradient descent 14 / 1

In-Database versus Out-of-Database Learning PostgreSQL+TensorFlow In-Database (Sept’18) Time Size (CSV) Time Size (CSV) Input data – 2.1 GB – 2.1 GB Join 217 secs 23 GB – – Export 373 secs 23 GB – – Aggregates – – 18 secs 37 KB GD > 12K secs – 0.5 secs – Total time > 12.5K secs 18.5 secs 15 / 1

In-Database versus Out-of-Database Learning PostgreSQL+TensorFlow In-Database (Sept’18) Time Size (CSV) Time Size (CSV) Input data – 2.1 GB – 2.1 GB Join 217 secs 23 GB – – Export 373 secs 23 GB – – Aggregates – – 18 secs 37 KB GD > 12K secs – 0.5 secs – Total time > 12.5K secs 18.5 secs > 676 × faster while 600 × more accurate (RMSE on 2% test data) [SOANN19] TensorFlow trains one model. In-Database Learning takes 0.5 sec for any extra model over a subset of the given feature set. 15 / 1

Outline of Part 3: Optimization 16 / 1

Learning Regression Models with Least Square Loss We consider here ridge linear regression � f θ ( x ) = � θ , x � = � θ f , x f � f ∈ F Training dataset D = Q ( I ), where ◮ Q ( X F ) is a feature extraction query, I is the input database ◮ D consists of tuples ( x , y ) of feature vector x and response y Parameters θ obtained by minimizing the objective function: least square loss ℓ 2 − regularizer � �� 1 λ � ( � θ , x � − y ) 2 + 2 � θ � 2 J ( θ ) = 2 2 | D | ( x , y ) ∈ D 17 / 1

Side Note: One-hot Encoding of Categorical Variables Continuous variables are mapped to scalars ◮ x unitsSold , x sales ∈ R . Categorical variables are mapped to indicator vectors ◮ country has categories vietnam and england ◮ country is then mapped to an indicator vector x country = [ x vietnam , x england ] ⊤ ∈ ( { 0 , 1 } 2 ) ⊤ . ◮ x country = [0 , 1] ⊤ for a tuple with country = ‘‘england’’ This encoding leads to wide training datasets and many 0s 18 / 1

From Optimization to SumProduct Queries We can solve θ ∗ := arg min θ J ( θ ) by repeatedly updating θ in the direction of the gradient until convergence (in more detail, Algorithm 1 in [ANNOS18a]): θ := θ − α · ∇ J ( θ ) . Model reformulation idea : Decouple data-dependent ( x , y ) computation from data-independent ( θ ) computation in the formulations of the objective J ( θ ) and its gradient ∇ J ( θ ). 19 / 1

From Optimization to SumProduct FAQs 1 ( � θ , x � − y ) 2 + λ � 2 � θ � 2 J ( θ ) = 2 2 | D | ( x , y ) ∈ D = 1 2 + λ 2 θ ⊤ Σ θ − � θ , c � + s Y 2 � θ � 2 2 ∇ J ( θ ) = Σ θ − c + λ θ , 20 / 1

From Optimization to SumProduct FAQs 1 ( � θ , x � − y ) 2 + λ � 2 � θ � 2 J ( θ ) = 2 2 | D | ( x , y ) ∈ D = 1 2 + λ 2 θ ⊤ Σ θ − � θ , c � + s Y 2 � θ � 2 2 ∇ J ( θ ) = Σ θ − c + λ θ , where matrix Σ = ( σ ij ) i , j ∈ [ | F | ] , vector c = ( c i ) i ∈ [ | F | ] , and scalar s Y are: 1 1 1 � � � x i x ⊤ y 2 σ ij = c i = y · x i s Y = j | D | | D | | D | ( x , y ) ∈ D ( x , y ) ∈ D ( x , y ) ∈ D 20 / 1

Expressing Σ , c , s Y using SumProduct FAQs � 1 ( x , y ) ∈ D x i x ⊤ 1 FAQ queries for σ ij = (w/o factor | D | ): j | D | x i , x j continuous ⇒ no free variable � � � ψ ij = a i · a j · 1 R k ( a S ( Rk ) ) f ∈ F : a f ∈ Dom ( X f ) b ∈ B : a b ∈ Dom ( X b ) k ∈ [ m ] x i categorical, x j continuous ⇒ one free variable � � � ψ ij [ a i ] = a j · 1 R k ( a S ( Rk ) ) f ∈ F −{ i } : a f ∈ Dom ( X f ) b ∈ B : a b ∈ Dom ( X b ) k ∈ [ m ] x i , x j categorical ⇒ two free variables � � � ψ ij [ a i , a j ] = 1 R k ( a S ( Rk ) ) f ∈ F −{ i , j } : a f ∈ Dom ( X f ) b ∈ B : a b ∈ Dom ( X b ) k ∈ [ m ] { R k } k ∈ [ m ] is the set of relations in the query Q ; F and B are the sets of the indices of the free and, respectively, bound variables in Q ; S ( R k ) is the set of variables of R k ; a S ( R k )) is a tuple over S ( R k )); 1 E is the Kronecker delta that evaluates to 1 (0) whenever the event E (not) holds. 21 / 1

Expressing Σ , c , s Y using SQL Queries � ( x , y ) ∈ D x i x ⊤ 1 1 Queries for σ ij = (w/o factor | D | ): j | D | 22 / 1

Joins Aggregates Optimization https://fdbresearch.github.io Dan - PowerPoint PPT Presentation

Joins Aggregates Optimization https://fdbresearch.github.io Dan Olteanu PhD Open School University of Warsaw November 24, 2018 1 / 1 Acknowledgements Some work reported in this course has been done in the context of the FDB project,

Cement, Aggregates, Mining Presentation Cement, Aggregates and Mining Cement, Aggregates and

AGGREGATES AND POZZOLANIC MATERIALS OVERVIEW Presented by Tom Adams, P.E. April 10, 2018

SQL Workshop Joins Doug Shook Inner Joins Joins are used to combine data from multiple

SQL$Joins Max$Masnick August&7,&2015 What%are%joins?

Joins Aggregates Optimization https://fdbresearch.github.io Dan Olteanu PhD Open School

Joins Aggregates Optimization https://fdbresearch.github.io Dan Olteanu PhD Open School

Overview DW Performance Optimization Choosing aggregates Maintaining views

Breedon Aggregates Breedon Aggregates Full-year 2013 results Preliminary results 4 March 2014

An introduction to Breedon Aggregates October 2013 Peter Tom Simon Vivian Introduction Peter

Socially and Environmentally Responsible Aggregates (SERA) Andrea Bourrie Dufferin Aggregates

JOINS IN SQL By Rohit Dhanwani OBJECTIVES Define and use different types of joins INNER

S9557 EFFECTIVE, SCALABLE MULTI-GPU JOINS Tim Kaldewey, Nikolay Sakharnykh and Jiri Kraus, March

Joins, and more plotting Joins, and more plotting Abhijit Dasgupta Abhijit Dasgupta Fall, 2019

CS 61: Database Systems Joins Adapted from Silberschatz, Korth, and Sundarshan unless otherwise

Notes on exact meets and joins R. N. Ball, J. Picado and A. Pultr 1 Exact meets and joins.

15-780: Optimization J. Zico Kolter March 14-16, 2015 1 Outline Introduction to optimization

An Envestnet for the Long Run Prepared for the Wide Moat Investing Summit By Elliot Turner, CFA

Load Scheduling of Simple Temporal Networks Under Dynamic Resource Pricing T. K. Satish Kumar, Zhi

Lecture 10: Managing Lecture 10: Managing INSE 6300/4 INSE 6300/4- -UU UU Uncertainty in

Law & Works Introduction If we are to ever get law and works

UPDATING THE REGULATORY FRAMEWORKS FOR EMBEDDED NETWORKS STAKEHOLDER WORKSHOP PRIMUS HOTEL,

A WHOLESALERS WINNING FORMULA FOR YEAR-AFTER-YEAR E-COMMERCE GROWTH Private and Confidential.

A Layered Aggregate Engine for Analytics Workloads fdbresearch.github.io relational.ai

Who are we talking about? Can you list all the actors... involved in power system operation ? and

Sambuz

Useful Links

Newsletter

Mail Us

Joins Aggregates Optimization https://fdbresearch.github.io Dan - PowerPoint PPT Presentation

Joins Aggregates Optimization https://fdbresearch.github.io Dan Olteanu PhD Open School University of Warsaw November 24, 2018 1 / 1 Acknowledgements Some work reported in this course has been done in the context of the FDB project,

Cement, Aggregates, Mining Presentation Cement, Aggregates and Mining Cement, Aggregates and

AGGREGATES AND POZZOLANIC MATERIALS OVERVIEW Presented by Tom Adams, P.E. April 10, 2018

SQL Workshop Joins Doug Shook Inner Joins Joins are used to combine data from multiple

SQL$Joins Max$Masnick August&amp;7,&amp;2015 What%are%joins?

Joins Aggregates Optimization https://fdbresearch.github.io Dan Olteanu PhD Open School

Joins Aggregates Optimization https://fdbresearch.github.io Dan Olteanu PhD Open School

Overview DW Performance Optimization Choosing aggregates Maintaining views

Breedon Aggregates Breedon Aggregates Full-year 2013 results Preliminary results 4 March 2014

An introduction to Breedon Aggregates October 2013 Peter Tom Simon Vivian Introduction Peter

Socially and Environmentally Responsible Aggregates (SERA) Andrea Bourrie Dufferin Aggregates

JOINS IN SQL By Rohit Dhanwani OBJECTIVES Define and use different types of joins INNER

S9557 EFFECTIVE, SCALABLE MULTI-GPU JOINS Tim Kaldewey, Nikolay Sakharnykh and Jiri Kraus, March

Joins, and more plotting Joins, and more plotting Abhijit Dasgupta Abhijit Dasgupta Fall, 2019

CS 61: Database Systems Joins Adapted from Silberschatz, Korth, and Sundarshan unless otherwise

Notes on exact meets and joins R. N. Ball, J. Picado and A. Pultr 1 Exact meets and joins.

15-780: Optimization J. Zico Kolter March 14-16, 2015 1 Outline Introduction to optimization

An Envestnet for the Long Run Prepared for the Wide Moat Investing Summit By Elliot Turner, CFA

Load Scheduling of Simple Temporal Networks Under Dynamic Resource Pricing T. K. Satish Kumar, Zhi

Lecture 10: Managing Lecture 10: Managing INSE 6300/4 INSE 6300/4- -UU UU Uncertainty in

Law &amp; Works Introduction If we are to ever get law and works

UPDATING THE REGULATORY FRAMEWORKS FOR EMBEDDED NETWORKS STAKEHOLDER WORKSHOP PRIMUS HOTEL,

A WHOLESALERS WINNING FORMULA FOR YEAR-AFTER-YEAR E-COMMERCE GROWTH Private and Confidential.

A Layered Aggregate Engine for Analytics Workloads fdbresearch.github.io relational.ai

Who are we talking about? Can you list all the actors... involved in power system operation ? and

Sambuz

Useful Links

Newsletter

Mail Us

SQL$Joins Max$Masnick August&7,&2015 What%are%joins?

Law & Works Introduction If we are to ever get law and works