DATA ANALYTICS USING DEEP LEARNING
GT 8803 // FALL 2018 // JOY ARULRAJ
LECTURE #19: LEARNING STATE REPRESENTATIONS FOR QUERY OPTIMIZATION WITH DEEP REINFORCEMENT LEARNING
PAPER
• Learning State Representations for Query Optimization with Deep Reinforcement Learning
– Jennifer Ortiz, Magdalena Balazinska, Johannes Gehrke, S. Sathiya Keerthi
– University of Washington, Microsoft, Criteo Research
• Key Topics
– Deep reinforcement learning
– Query optimization
RELATED LINKS
• Paper - https://arxiv.org/abs/1803.08604
• J Ortiz - https://homes.cs.washington.edu/~jortiz16/
• M Balazinska - https://www.cs.washington.edu/people/faculty/magda
• J Gehrke - http://www.cs.cornell.edu/johannes/
• SS Keerthi - http://www.keerthis.com/
AGENDA
• Problem Overview
• Background
• Key Ideas
• Technical Details
• Experiments
• Discussion
TODAY'S PAPER
• Key terms: query optimization, reinforcement learning, cardinality, databases, deep learning
PROBLEM OVERVIEW
• Query optimization is still a difficult problem
• Deep reinforcement learning (DRL) is an evolving approach for solving complex problems
• Can DRL be used to improve query plan optimization?
PROBLEM OVERVIEW
• Contribution #1: Build a model that determines a subquery's cardinality
• Contribution #2: Model query planning as a Markov process and use reinforcement learning to propose a query plan
• Some challenges:
– The state isn't obvious the way it is in some contexts (e.g. games)
– Choosing the reward can be tricky
BACKGROUND: QUERY OPTIMIZATION
• An ongoing problem in database systems research
• Current systems still aren't great - why?
– Plans must be efficient in both time and resources, which forces tradeoffs
– Current DBMSs make simplifying assumptions
• They avoid multidimensional/complex methods
– Result -> estimation errors and poor query plans
BACKGROUND: QUERY OPTIMIZATION
– Join order
• When a join includes more than 2 relations, join time can vary widely depending on relation sizes
– Subquery optimization
• group by and exists operators can often be simplified, but…
• determining how can be computationally complex
– Cardinality estimation
• Hard to map predicates as new data comes in
• Requires statistics to be kept up to date
https://en.wikipedia.org/wiki/Query_optimization
EXPRESS LEARNING - DATABASE MANAGEMENT SYSTEMS
BACKGROUND: QUERY OPTIMIZATION
• Commonly used approaches
– Data sketches
– Sampling
– Histograms
– Heuristics
BACKGROUND: DEEP LEARNING
• What is it?
– Maps input x to output y through a series of hidden layers
– Transforms data into representations
• e.g. raw pixels of cat images become higher-level features
– Hidden layers apply a series of functions
– Errors decrease over time via backpropagation
BACKGROUND: DEEP LEARNING
• What is it good for?
– Machine translation
– Object detection
– Winning games
– Much more…
BACKGROUND: DEEP LEARNING
• Why?
– Performs well across multiple domains
– We have improved, cheaper hardware and large datasets for training
– It's good at finding patterns that aren't obvious to humans (even domain experts)
– Libraries
• PyTorch, TensorFlow, Keras
BACKGROUND: REINFORCEMENT LEARNING
• What is it?
– Agents – the learner in the model
– States – the condition of the environment
– Actions – inputs from the agent (based on previous learning or trial and error)
– Rewards – feedback telling the agent how good its action was
http://introtodeeplearning.com/materials/2018_6S191_Lecture5.pdf
BACKGROUND: REINFORCEMENT LEARNING
• What is it good for?
– Beating Atari games
– Training autonomous vehicles and robots
– Optimizing stocks, gambling, auction bids, etc.
BACKGROUND: REINFORCEMENT LEARNING
• Why?
– May perform better than brute-force deep learning models
– Agents can use trial and error or greedy approaches to optimize reward
– Can work well in complex state spaces because you don't have to provide fully labeled outputs for the model to train on; you can just provide simpler rewards
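The agent/state/action/reward components above can be sketched as a minimal agent-environment loop. The environment here is a made-up one-dimensional walk (not anything from the paper): the agent starts at 0 and is rewarded only for reaching position 5.

```python
import random

def step(state, action):
    """Toy environment: apply an action (+1 or -1), return next state and reward."""
    next_state = state + action
    reward = 1.0 if next_state == 5 else 0.0
    return next_state, reward

state = 0
total_reward = 0.0
for _ in range(20):
    action = random.choice([-1, 1])   # trial-and-error agent, no learning yet
    state, reward = step(state, action)
    total_reward += reward
```

A learning agent would additionally use the observed rewards to prefer actions that worked before; that is what Q-learning (later in these slides) adds.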
KEY IDEA
Can deep reinforcement learning be used to learn query representations?
SUBQUERY LEARNING VIA DRL
[Figure: a state transition function NN_ST maps the subquery representation h_t and the query operation a_t at time t to the next subquery representation h_{t+1}. State t starts from a representation of database properties; states t+1 and t+2 are subquery representations produced by applying successive query operations.]
EXAMPLE
select * from customers C, orders O where C.col1 = O.col1 and O.col1 <= 10
[Figure: the plan is built step by step - state t holds a representation of database properties, the query operation at time t (the selection) yields the subquery representation at state t+1, and the operation at time t+1 (the join ⋈) yields state t+2.]
APPROACH
• Map the query Q and database D to a feature vector
• Two options:
– Transform the values using deep networks and output a cardinality
– A recursive approach taking the subquery representation (h_t) and operation (a_t) as input
MORE ON APPROACH
• Two options:
– Transform the values using deep networks and output a cardinality
• Needs lots of data - the input space is very sparse
– A recursive approach taking the subquery representation (h_t) and operation (a_t) as input
• h_t is learned by the model
• This is the approach the paper selects
• Thus we have an NN_ST model that learns based on NN_Observed and NN_Init
HOW TO ENCODE DATA
TECHNICAL DETAIL
• Prior subquery representation
• Use cardinality as an observed variable
• Query operation
STEPS
• NN_Init = g(x_t, a_t)
– x = database properties (min/max values, # distinct values, 1D histogram)
– a = a single relational operator (=, ≠, <, >, ≤, ≥)
• NN_ST = g(h_t, a_t)
– h = latent representation learned by the model itself (a subquery)
– a = a single relational operation (e.g. the join ⋈)
• NN_Observed
– Mapping from the hidden state to observed variables at time t
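The three networks above can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the layer sizes (X_DIM, A_DIM, H_DIM), the tanh activation, and the random untrained weights are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(in_dim, out_dim):
    """One dense layer with tanh activation (weights are random, i.e. untrained)."""
    W, b = rng.normal(size=(in_dim, out_dim)), np.zeros(out_dim)
    return lambda v: np.tanh(v @ W + b)

# Assumed sizes, for illustration only.
X_DIM, A_DIM, H_DIM = 8, 4, 16

# NN_Init: maps database properties x and the first operation a to h_0.
nn_init = layer(X_DIM + A_DIM, H_DIM)
# NN_ST: maps the previous subquery representation h_t and operation a_t to h_{t+1}.
nn_st = layer(H_DIM + A_DIM, H_DIM)
# NN_Observed: maps a hidden state h_t to observed variables (here one scalar
# standing in for the cardinality estimate).
nn_observed = layer(H_DIM, 1)

x = rng.normal(size=X_DIM)                         # encoded database properties
ops = [rng.normal(size=A_DIM) for _ in range(3)]   # encoded operations a_0..a_2

h = nn_init(np.concatenate([x, ops[0]]))
for a in ops[1:]:                                  # recurse: h_{t+1} = NN_ST(h_t, a_t)
    h = nn_st(np.concatenate([h, a]))
card_estimate = nn_observed(h)
```

The key design point this shows is the recursion: NN_ST reuses the previous hidden state, so the representation of a subquery of any depth has the same fixed size H_DIM.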
EXPERIMENTS
• Uses the IMDB dataset
– 3 GB
– Real data (has skew and correlations between columns)
• TensorFlow (Python)
• Baseline: estimates compared against SQL Server
EXPERIMENT #1
• Train the init function with properties from IMDB
• 20K queries (15K train / 5K test)
• Model trained with stochastic gradient descent (SGD)
• Learning rate of 0.01
• 50 nodes in the hidden layer
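A training loop with the hyperparameters listed above (SGD, learning rate 0.01, one hidden layer of 50 nodes) can be sketched in plain NumPy. The data here is synthetic - a stand-in for the (query features, cardinality) pairs, since the IMDB workload is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for (features, target) pairs; the real experiment
# used 20K queries over IMDB (15K train / 5K test).
X = rng.normal(size=(200, 10))
y = (X.sum(axis=1, keepdims=True)) ** 2 / 10.0

HIDDEN, LR = 50, 0.01               # 50 hidden nodes, learning rate 0.01
W1 = rng.normal(size=(10, HIDDEN)) * 0.1
b1 = np.zeros(HIDDEN)
W2 = rng.normal(size=(HIDDEN, 1)) * 0.1
b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, h @ W2 + b2

_, pred0 = forward(X)
initial_loss = float(((pred0 - y) ** 2).mean())

for _ in range(100):                 # gradient descent on mean squared error
    h, pred = forward(X)
    g_pred = 2 * (pred - y) / len(X)
    g_W2 = h.T @ g_pred
    g_b2 = g_pred.sum(axis=0)
    g_h = g_pred @ W2.T
    g_pre = g_h * (1 - h ** 2)       # tanh derivative
    W1 -= LR * (X.T @ g_pre)
    b1 -= LR * g_pre.sum(axis=0)
    W2 -= LR * g_W2
    b2 -= LR * g_b2

_, pred = forward(X)
final_loss = float(((pred - y) ** 2).mean())
```

The paper's implementation used TensorFlow; the manual backpropagation here just makes the SGD update explicit.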
EXPERIMENT #1
• Fewer epochs -> greater errors
• With m = 3, accuracy at the 6th epoch is similar to SQL Server
• Beyond the 6th epoch, the model outperforms SQL Server
• Greater cardinality -> longer to converge (outperforms SQL Server by the 9th epoch)
EXPERIMENT #1
[Results figure omitted]
EXPERIMENT #2
• Combined models
• Select and join operation
– Where a is the join operation (⋈)
• The hidden state is able to store enough information to predict cardinality
EXPERIMENT #2
[Results figure omitted]
NEXT STEPS
Can subquery representations be used to build query plans?
GOAL
• Given a database D and a query Q, train a model that can learn to predict subquery cardinalities (and the best join)…
[Figure: from the subquery representation at state t+1, the model chooses among Action 1 (⋈ B), Action 2 (⋈ C), and Action 3 (⋈ D).]
ASSUMPTIONS
• A model-free environment where transition probabilities between states are unknown
• Each state encodes the operations that have already been applied
• The model needs a good reward to be successful
• Need to determine the optimal policy
EXAMPLE
[Figure omitted]
APPROACH
• Assume a fixed set of relations in the database, each with known attributes
• The vector at time t represents equi-join predicates and 1D selection predicates
– e.g. if a predicate exists, set its value to 1, otherwise 0
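The one-hot state encoding above can be sketched as follows: one slot per possible equi-join predicate and one per 1D selection attribute, set to 1 when the predicate appears in the (sub)query. The relations and attributes here are invented for the example.

```python
from itertools import combinations

# Hypothetical schema, for illustration only.
relations = {"customers": ["col1"], "orders": ["col1", "col2"]}

# Enumerate every attribute and every possible equi-join pair, in a fixed order,
# so that each predicate always maps to the same vector position.
attrs = [f"{r}.{a}" for r, cols in sorted(relations.items()) for a in cols]
join_slots = [tuple(sorted(p)) for p in combinations(attrs, 2)]

def encode(join_preds, selection_preds):
    """Return the 0/1 state vector: join slots first, then selection slots."""
    active = {tuple(sorted(p)) for p in join_preds}
    joins = [1 if j in active else 0 for j in join_slots]
    sels = [1 if a in selection_preds else 0 for a in attrs]
    return joins + sels

# State for: customers ⋈ orders on col1, with a selection on orders.col1.
state = encode(join_preds=[("customers.col1", "orders.col1")],
               selection_preds=["orders.col1"])
```

Because the slot order is fixed, the same vector position means the same predicate across all states, which is what lets the agent compare states.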
HOW TO REWARD?
• Rewards can be given at each state or at the end
• Option 1:
– Minimize cost based on existing query estimators
• Option 2:
– Use the cardinality from the learned model
– Experimental
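Option 1 can be sketched as a negative-cost reward, so that maximizing reward minimizes estimated cost. The cost model below is a deliberately crude stand-in (it charges the product of input cardinalities for a join), not a real optimizer's estimator.

```python
def estimated_join_cost(left_card, right_card):
    """Hypothetical stand-in for an existing optimizer's cost estimate."""
    return left_card * right_card

def reward(left_card, right_card):
    # Negate the cost: cheaper plans get higher reward.
    return -estimated_join_cost(left_card, right_card)
```

Option 2 would replace `estimated_join_cost` with a cost derived from the learned model's own cardinality predictions, which the paper notes is experimental.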
Q-LEARNING
• Initialize Q with random values
• For each state, the next value of Q comes from:
– The current value of Q
– The learning rate
– The reward
– The max value of Q for the next state under a greedy policy
– The discount factor
• Update: Q(s,a) <- Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
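The ingredients listed above combine into the standard tabular Q-learning update, which can be sketched as follows. The learning rate, discount factor, and the states/actions passed in are illustrative choices, not values from the paper.

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9                            # learning rate, discount factor
Q = defaultdict(lambda: random.uniform(0, 0.01))   # initialize with random values

def q_update(state, action, reward, next_state, actions):
    """One Q-learning step: move Q(s,a) toward reward + discounted greedy lookahead."""
    best_next = max(Q[(next_state, a)] for a in actions)   # greedy policy over next state
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    return Q[(state, action)]
```

In the query-planning setting, a state would be an encoded set of predicates already applied and an action a candidate join, with the reward drawn from one of the two options on the previous slide.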
Q-LEARNING
https://medium.freecodecamp.org/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc
OPEN PROBLEMS
• How to choose rewards?
• The state space is large and Q-learning can be impractical
– Need to approximate solutions